A Blog About Anime, Code, and Pr0 H4x

iOS - How to convert BGRA video streams to Grayscale SUPER fast.

February 26, 2012 at 12:00 PM

==========

For a practical example of this algorithm, you can check out my app See It - Video Magnifier.

==========

I thought I would share a little tidbit of iOS development trickery that I finally got worked out today.

I'm working on an app at the moment that takes in a live video feed from a device's back-facing camera and applies some filtering to each frame before displaying it on the device's screen. One of the filters my app offers converts the video feed from color to grayscale.

Before I go any further, I'm going to assume that you are familiar with AVCaptureSession, and with setting everything up to access a device camera and display a live preview feed. If not, take a look at Apple's RosyWriter example, and the AVCamDemo example from the WWDC 2010 example code.

The method I was using at first to accomplish this was to iterate through each pixel in a frame, calculate that pixel's weighted average of red, green, and blue values, and then write that weighted average back to the pixel's red, green, and blue channels.

Like so,

- (void)processPixelBuffer: (CVImageBufferRef)pixelBuffer
{
    const int BYTES_PER_PIXEL = 4;

    // Lock the pixel buffer into place in memory before touching its bytes
    CVPixelBufferLockBaseAddress( pixelBuffer, 0 );

    int bufferWidth = (int)CVPixelBufferGetWidth(pixelBuffer);
    int bufferHeight = (int)CVPixelBufferGetHeight(pixelBuffer);
    unsigned char *pixels = (unsigned char *)CVPixelBufferGetBaseAddress(pixelBuffer);

    for (int i = 0; i < (bufferWidth * bufferHeight); i++) {
        // Calculate the combined grayscale weight of the RGB channels
        int weight = (pixels[0] * 0.11) + (pixels[1] * 0.59) + (pixels[2] * 0.3);

        // Apply the grayscale weight to each of the color channels
        pixels[0] = weight; // Blue
        pixels[1] = weight; // Green
        pixels[2] = weight; // Red
        pixels += BYTES_PER_PIXEL;
    }

    // Unlock the pixel buffer, we're done processing it
    CVPixelBufferUnlockBaseAddress( pixelBuffer, 0 );
}

The above snippet of code will take a BGRA format CVImageBufferRef and convert it to grayscale. Unfortunately for us, it's not very fast. Using the AVCaptureSessionPresetMedium video input setting, I was getting ~7fps on my 4th generation iPod Touch. (Which isn't necessarily bad, but could be better.)

So whilst Googling about for a way to speed up my BGRA to Grayscale conversion algorithm, I came across two articles discussing RGB to Grayscale conversion using ARM NEON intrinsic functions.

"ARM NEON intrinsic functions" is not only a mouthful to say, but also some sort of serious low-level coding black magic, available on devices with ARM Cortex-A8 (and later) CPUs, that lets you perform multiple computations at once. I couldn't explain the theory of it all to you to save my life, so I won't even bother trying. Suffice it to say that using intrinsic functions will allow us to process eight pixels at a time, as opposed to just one.

Those interested in learning more about the nitty gritty of ARM NEON, should take a look at "Introduction to NEON on iPhone" over on Wandering Coder.

This article explains how to perform a BGRA to Grayscale conversion on an AVCaptureSession video feed, but it doesn't do it very well, so allow me to help fill in the gaps.

First, prepare your project to use NEON intrinsics by adding '-mfpu=neon' to your project's "Other C Flags", and setting your project's compiler to "LLVM GCC 4.2" (which you should be using already, since all the cool kids use Automatic Reference Counting these days).

Next, make sure you add this line to your video processing class' header file (it's a system header, hence the angle brackets), otherwise you're going to get all sorts of frustrating compile errors.

#import <arm_neon.h>

Finally, implement the intrinsic BGRA to grayscale method outlined in this article. (Shown below.)

void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int numPixels)
{
    // Grayscale weights scaled by 256: Red * 0.30, Green * 0.59, Blue * 0.11
    uint8x8_t rfac = vdup_n_u8 (77);
    uint8x8_t gfac = vdup_n_u8 (151);
    uint8x8_t bfac = vdup_n_u8 (28);
    int i;
    int n = numPixels / 8;

    // Convert eight pixels per iteration
    for (i = 0; i < n; ++i)
    {
        uint16x8_t  temp;
        // vld4 de-interleaves the BGRA data: val[0]=Blue, val[1]=Green, val[2]=Red, val[3]=Alpha
        uint8x8x4_t bgra = vld4_u8 (src);
        uint8x8_t result;

        temp = vmull_u8 (bgra.val[0],       bfac);
        temp = vmlal_u8 (temp, bgra.val[1], gfac);
        temp = vmlal_u8 (temp, bgra.val[2], rfac);

        // Divide by 256 and narrow back down to 8 bits per pixel
        result = vshrn_n_u16 (temp, 8);
        vst1_u8 (dest, result);
        src  += 8*4;
        dest += 8;
    }
}

This last step, however, is the trickiest of all! Unfortunately, that article doesn't really tell you how to use this function, and it also leaves out one very important tidbit of information you're going to need: what the BGRA to grayscale method actually produces.

You see, you pass the method a CVPixelBuffer's worth of image data (an array of pixels, wherein each pixel has four values representing levels of blue, green, red, and alpha), along with an empty memory buffer. The method then fills that buffer with the grayscale value for each pixel in the CVPixelBuffer, which by itself is totally useless.

So in order to actually make your video feed appear grayscale, you have to not only run each preview frame through your intrinsic method, but also apply the grayscale values it creates back to each pixel in your preview frame.

So enough talk, here's all the code you're going to need to make it all happen!

void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int numPixels)
{
    // Grayscale weights scaled by 256: Red * 0.30, Green * 0.59, Blue * 0.11
    uint8x8_t rfac = vdup_n_u8 (77);
    uint8x8_t gfac = vdup_n_u8 (151);
    uint8x8_t bfac = vdup_n_u8 (28);
    int i;
    int n = numPixels / 8;

    // Convert eight pixels per iteration
    for (i = 0; i < n; ++i)
    {
        uint16x8_t  temp;
        // vld4 de-interleaves the BGRA data: val[0]=Blue, val[1]=Green, val[2]=Red, val[3]=Alpha
        uint8x8x4_t bgra = vld4_u8 (src);
        uint8x8_t result;

        temp = vmull_u8 (bgra.val[0],       bfac);
        temp = vmlal_u8 (temp, bgra.val[1], gfac);
        temp = vmlal_u8 (temp, bgra.val[2], rfac);

        // Divide by 256 and narrow back down to 8 bits per pixel
        result = vshrn_n_u16 (temp, 8);
        vst1_u8 (dest, result);
        src  += 8*4;
        dest += 8;
    }
}

// Method that processes a CVPixelBuffer representation of a preview frame
- (void)processPixelBuffer: (CVImageBufferRef)pixelBuffer
{
    const int BYTES_PER_PIXEL = 4;

    // Lock the pixel buffer into place in memory
    CVPixelBufferLockBaseAddress( pixelBuffer, 0 );

    // Get the dimensions of the preview frame
    int bufferWidth = (int)CVPixelBufferGetWidth(pixelBuffer);
    int bufferHeight = (int)CVPixelBufferGetHeight(pixelBuffer);

    // Turn the CVPixelBuffer into something the intrinsic function can process
    uint8_t *pixel = (uint8_t *)CVPixelBufferGetBaseAddress(pixelBuffer);

    // Allocate some memory for the grayscale values that the intrinsic function will create
    uint8_t *baseAddressGray = (uint8_t *) malloc(bufferWidth * bufferHeight);

    // Convert BGRA values to grayscale values
    neon_convert(baseAddressGray, pixel, bufferWidth * bufferHeight);

    // Iterate through each pixel in the preview frame, and copy its computed
    // grayscale value into its blue, green, and red channels
    for (int i = 0; i < (bufferWidth * bufferHeight); i++) {
        pixel[0] = baseAddressGray[i];
        pixel[1] = baseAddressGray[i];
        pixel[2] = baseAddressGray[i];
        pixel += BYTES_PER_PIXEL;
    }

    // Release the grayscale values buffer
    free(baseAddressGray);

    // Unlock the pixel buffer, we're done processing it
    CVPixelBufferUnlockBaseAddress( pixelBuffer, 0 );
}

And that's it! Using this method, I'm able to get ~25fps on the same preview feed that was getting me ~7fps with my original method.

[ Kick it up a notch, with inline assembly! ]

While the performance boost you get from using intrinsic functions is pretty great, you can squeeze even more performance out of your app by using an inline ASM BGRA to Grayscale conversion method, which is exactly what this article talks about doing. But again, the author doesn't explain how to do it (and there's a bug in his code that makes it unusable).

Luckily for you, I (and a kind anon) have worked out the kinks, and everything works super smoothly now.

All you need to do is add this method to your existing code,

static void neon_asm_convert(uint8_t * __restrict dest, uint8_t * __restrict src, int numPixels)
{
    __asm__ volatile("lsr %2, %2, #3 \n" // Divide the pixel count by 8
                     "# build the three constants: \n"
                     "mov r4, #28 \n"  // Blue channel multiplier
                     "mov r5, #151 \n" // Green channel multiplier
                     "mov r6, #77 \n"  // Red channel multiplier
                     "vdup.8 d4, r4 \n"
                     "vdup.8 d5, r5 \n"
                     "vdup.8 d6, r6 \n"
                     "0: \n"
                     "# load 8 pixels: \n"
                     "vld4.8 {d0-d3}, [%1]! \n"
                     "# do the weight average: \n"
                     "vmull.u8 q7, d0, d4 \n"
                     "vmlal.u8 q7, d1, d5 \n"
                     "vmlal.u8 q7, d2, d6 \n"
                     "# shift and store: \n"
                     "vshrn.u16 d7, q7, #8 \n" // Divide q7 by 256 and store in d7
                     "vst1.8 {d7}, [%0]! \n"
                     "subs %2, %2, #1 \n" // Decrement iteration count
                     "bne 0b \n" // Loop until the iteration count reaches zero
                     // dest, src, and numPixels are all modified, so they go in the
                     // output list; the clobber list covers the registers we touch
                     : "+r"(dest), "+r"(src), "+r"(numPixels)
                     :
                     : "r4", "r5", "r6", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d14", "d15", "cc", "memory"
                     );
}

And replace

neon_convert(baseAddressGray, pixel, bufferWidth * bufferHeight);

with

neon_asm_convert(baseAddressGray, pixel, bufferWidth * bufferHeight);

and you're done!

[ Here's some benchmarks ]

