Homework 3: Image processing using CUDA

(Updated Oct 9) Due Sunday, 11th October at 11 pm EDT

Goal

The goal of this homework is to implement a 2D bilateral filter for image processing. You will implement brute-force (non-separable) image space 2D convolution using optimization techniques you learned in this class.

Bilateral Filter

A bilateral filter is an one-pass, nonlinear feature-preserving filter based on the convolution operator. In this homework, we will focus on 2D image filtering only, although the principle can be easily extended to three and more dimensions. A general 2D convolution with an m x n kernel (where the current pixel is centered) in the image domain can be defined as a weighted sum of neighboring pixels as follows:

I_{new}(x,y) = \sum_{j=y-n/2}^{y+n/2}\sum_{i=x-m/2}^{i+m/2}w(i,j,x,y)I(i,j)
Equation 1. General 2D convolution equation. (Added Oct 9)

The common low-pass filter, such as a Gaussian filter, has a weight w(i,j,x,y) based on the distance from the center of the kernel (x,y) to each pixel (i,j). For the bilateral filter, the weight is determined based on two distances: an image space distance and a color space distance. The main idea is that if a pixel is far from the center pixel of the kernel either in image space or in color space then it does not contribute much to the final result. The weight w(i,j,x,y) for the bilateral filter is defined as follows:

 w(i,j,x,y)=\frac{1}{\sum w}\exp^{-\frac{1}{2}\(\frac{d(i,j,x,y)}{\delta_{i}}\)^2}\exp^{-\frac{1}{2}\(\frac{d(I(i,j),I(x,y))}{\delta_{c}}\)^2}where

d(i,j,x,y)=\sqrt{(i-x)^2+(j-y)^2}     and   d(I_1,I_2) = \sqrt{ (R_1-R_2)^2 + (G_1-G_2)^2 + (B_1 - B_2)^2 } 

\delta_{i} and \delta_{c} are image distance and color distance parameters that control the amount of contribution for each distance, respectively. Large parameter values decrease the effect, and vice versa. The color space distance between two three-channel color points, i.e., I_1=(R_1,G_1,B_1) and I_2=(R_2,G_2,B_2) can be computed using L2 norm of the vector difference as shown above. The bilateral filter can produce piecewise-linear, cartoon-like images as follows:

              

Before (left) and after (right) bilateral filtering (kernel size 11x11, \delta_{i}=5, \delta_{c}=64).
.

Homework Instructions

The base code can be download from here. Once you download the zip file, unzip it in your directory and compile it as follows.

tar -xzvf homework3.tgz ./
cd homework3
mv HW3_bilateral2d_gpu_skeleton.cu HW3_bilateral2d_gpu.cu
make

Then you can run the code as follows:

cd .bin/
./bilateral ../flower.bmp out.bmp 5 10 64 10 1

The parameters for bilateral are 

bilateral input.bmp output.bmp halfkernelsize \delta_{i}  \delta_{c} numtest checkaccuracy

halfkernelsize : Distance from the center of the kernel to boundary. For example, halfkernelsize=5 defines 11x11 kernel (5+1+5 for x and y dimension).
numtest : Total number of tests to perform. Use 10 for this homework.
checkaccuracy : 0 - no check, 1 - compute the average of per-pixel color differences between CPU and GPU results.

HW3_main.cpp is the main function that opens the input image file and runs the filter. HW3_bilateral2d_cpu.cpp contains the CPU implementation of the bilateral filter. Since the provided skeleton GPU code does not contain anything, you should not get a correct result until you implement your bilateral filter. You should add your GPU bilateral filter to HW3_bilateral2d_gpu.cu following the instructions below:

  1. You will submit HW3_bilateral2d_gpu.cu only, and you should use the function interface "__global__ void kernel_BilateralFilter_texture(..)" "void kernel_2d_convolution_gpu(...)" (Updated Oct 6) as defined in the file.
  2. Because you will submit HW3_bilateral2d_gpu.cu only, it should contain everything to be compiled correctly. To check this, keep the original base code intact, and compile it with your .cu file before you finally submit it.
  3. To make the problem simpler, you can assume that the filter kernel is square, meaning that the x and y dimensions of your filter kernel are the same (m == n in the equation above).
  4. The inputs are 4 channel (RGBA, 8bits per channel) color images. However, when you applying the filter, process the RGB channel only (just copy the alpha channel from the input to the output image).
  5. The input image type is unsigned char (8 bits) that you need to convert to float when you compute the convolution to prevent precision errors. Simply speaking, for example, convert unsigned char 255 to float 255.0f when you compute a weighted sum. (Updated Oct 6)
  6. Performance measurement will be based on the average running time over 10 runs on the resonance cluster, similar to the previous homework. The timing code is provided in HW3_main.cpp, so you don't need to write it on your own.
  7. Boundary condition of the filter is "clamp to edge" (repeat the closest edge pixel value for outside of the image).
  8. In HW3_bilateral2d_gpu.cu.skeleton, kernel_2d_convolution_gpu(..) is defined as bool and returns a value. Change this to void. (Added Oct 6)

Questions

  1. Briefly explain your implementation and optimization strategy and why you choose that approach.
  2. How many memory operations and floating point operations are required per pixel for a general non-separable 2D convolution using an m x n kernel? ("general" means this convolution should handle any given kernel weights so that you cannot assume any simplification based on the kernel structure, as shown in Equation 1.) (Updated Oct 9)
  3. Let's assume we have asymptotic GPU operation costs as follows:
    • Global memory read/write is 500
    • Shared memory/cache/register read/write is 4 (Updated Oct 7 - ignore register read/write cost)
    • Multiply, Add, or Multiply and Add (MADD) is 4
    • Division is 32
    • Exponential is 32
    • Sqrt is 32 (Added Oct 6)
            Derive a rough performance analysis by comparing the total cost per pixel of your bilateral filter implementation and a naive implementation using only global memory for a kernel of size m x n.

Extra Credit

Implement the bilateral filter for arbitrary x and y filter dimensions (m != n, and the center pixel is not the center of the filter). You should submit a separate tar ball containing everything including a new main function. You can define the halfkernelsize as left x, right x, up y, down y, so that the center pixel is not actually center of the kernel. (Added Oct 9)

Grading

The breakdown is as follows:

  • Basic functionality (45 pts) - use flower.bmp
    • Code compiles without errors, runs a filter 10 times with the kernel size of 11x11 (\delta_{i}=5, \delta_{c}=64) with an error of less than 0.00005 (20 pts) 
      • ex) bilateral flower.bmp out.bmp  5 5 64 10 1
    • Runs the filter 10 times with the various kernel sizes (listed below) with an error of less than 0.00005 (25 pts) (Updated Oct 7)
      • 5x5 (half kernel size = 2, \delta_{i}=5, \delta_{c}=64)
      • 9x9 (half kernel size = 4, \delta_{i}=9, \delta_{c}=64)
      • 17x17 (half kernel size = 8, \delta_{i}=17, \delta_{c}=64)
      • 33x33 (half kernel size = 16, \delta_{i}=33, \delta_{c}=64)
      • 65x65 (half kernel size = 32, \delta_{i}=65, \delta_{c}=64)
      • 127x127 (half kernel size = 63, \delta_{i}=127, \delta_{c}=64)
    • For this test, use flower.bmp
  • Optimization (40 pts) - use status_of_liberty.bmp
    • Run 41x41 bilateral filter (\delta_{i}=5, \delta_{c}=64) on status_of_liberty.bmp 
      • Faster than 190 ms (20 pts)
      • Faster than 150 ms (20 pts) (Updated Oct 6 - Since 400 ms is too easy to achieve, we adjusted the goal a little higher)
      • ex) bilateral status_of_liberty.bmp out.bmp 20 5 64 10 1
    • Both run should produce the result with an error of less than 0.00005.
    • Note that if you run a large kernel as this, it takes a long time if you check the accuracy due to the CPU filter (not the GPU). Be patient.
    • Questions (15 pts, 5 points per question)

    Extra credit is worth 10 points

    Running time and error calculation functions are already provided in HW3_main.cpp, so you do not implement your own timer. 

    Submission

    To submit the files, create a folder named lastname_firstinitial_hw3 and place your files (HW3_Convolution2D_gpu.cu and a text file answering the questions) in this folder. If you have files for extra credit, create a separate zip file that contains them and include a text file explaining them. Compress the folder (please use .zip compression) and submit on the course iSite page in the HW 3 dropbox. Please contact the TFs if you miss the deadline on Fri, Oct 9th, 11 pm EDT.