Homework 4: CUDA+MPI


Due Sunday, 18 October at 11 pm EDT


Goal

The goal of this homework is to introduce you to multi-GPU progamming with CUDA and the Message Passing Interface (MPI) library. You will implement a simple image processing operation in CUDA (i.e., a 2d low-pass filter) and distribute the computation across many compute nodes (with many GPUs) using simple MPI primitives.



A little diversion...


As you have already seen in class, the current GPU programming model exposes two important hierarchies that one must absolutely master to achieve maximum performance.

Here is a rough picture (based on a NVIDIA GT200 configuration):

  • the memory / communication hierarchy:
    1. host memory (>4GB) and the PCI-Express 2.0 X16 bus (<8 GB/s)
    2. device memory (<4GB, offchip) and global memory access (<80GB/s)
    3. shared memory (16KB) and cached access (2 clock cycles)
    4. register file (8KB) and register access (1 clock cycle)

  • the threading / execution hierarchy:
    1. processes (CPU)
    2. grids of blocks (GPU)
    3. blocks of threads (GPU)
    4. multiple execution pipelines,
      e.g. madd (multiply-add) + mul (multiply) in 1 clock cycle

Basically, the closer you are to the Arithmetic Logic Units (ALUs), the faster the memory gets but you have less! Since computation is generally orders of magnitude faster than communication, you must pay special attention to how you fetch data. Interestingly, the new NVIDIA architecture (codename 'Fermi') exposes additional levels in the memory hierarchy by introducing L1 and L2 caches...

From this perspective, using MPI across multiple GPU-nodes is just one solution to extend the hierarchies but with severe penalties: multiple GPUs in a single system may have to share the same PCI-Express bus while multiple GPU systems may have to communicate through slow network interfaces.

We hope that this homework will provide some intuition on how to make best use of these hierarchies.

If you want to read more about all this, we suggest the following references:

Sequoia: Programming the memory hierarchy.

CUDASA: Compute Unified Device and Systems Architecture

Fermi Compute Architecture White Paper
http://www.nvidia.com/object/fermi_architecture.html



Introducing the Average "boxcar" Filter


This (crude) low-pass filter is probably one of the simplest. It just consists of averaging the pixel values in the neighborhood of the pixel under consideration:

I_{new}(x, y) = \frac{1}{n \times m} \sum_{\tiny{j=y-\frac{n}{2}}}^{\tiny{y+\frac{n}{2}}} \sum_{\tiny{y=x-\frac{m}{2}}}^{\tiny{i+\frac{m}{2}}} I(i, j)


Even though it is possible to greatly accelerate this filtering operation (e.g., with a separable convolution) we ask you to implement it using a discrete 2D convolution.

Here is an example:

Input



Output using a 13x13 boxcar filter



In this homework, we'll assume a square boxcar filter (i.e. m=n).



Skeleton Code


The skeleton code is available here.

Setup:
wget http://www.cs264.org/files/cs264-hw4_v2.tar.gz
tar xzvf cs264-hw4_v2.tar.gz
cd cs264-hw4


Compilation (Makefile with mpic++):
make -j2

Cleaning:
make clean
or
make clobber

Execution:
mpirun -np $NPROCS ./bin/mpi_boxcar $INPUT_FILENAME $OUTPUT_FILENAME \
$HALF_KERNEL_SIZE $NUM_TEST $CHECK_ACCURACY

Description of the parameters:

  • $NPROCS
    number of MPI processes to launch

  • $INPUT_FILENAME
    input image filename (bmp format)

  • $OUTPUT_FILENAME
    output image filename (bmp format)

  • $HALF_KERNEL_SIZE
    half the filter kernel size; the full kernel size will be $HALF_KERNEL_SIZE * 2 + 1

  • $NUM_TEST
    the total number of test runs to perform

  • $CHECK_ACCURACY
    check the accuracy of the MPI+CUDA code against a reference CPU implementation (boolean; 0 or 1)

Example:

mpirun -np 1 ./bin/mpi_boxcar ./data/furtv.bmp ./out.bmp 6 10 1


Description of the source files:

  • HW4_main.cpp
    contains the skeleton of the MPI code to distribute the load on multiple GPUs

  • HW4_boxcar_cpu.cpp
    contains a reference CPU implementation of the boxcar filter

  • HW4_boxcar_gpu.cu
    contains the skeleton of the boxcar filtering GPU code

Important: You have to submit HW4_main.cpp and HW4_boxcar_gpu.cu *only*.


Homework Instructions


You goal is to gradually complete the skeleton code until you get a multi-GPU / multi-node code running. The portions of the source code that you have to complete are identified by "TODO":

grep -n TODO *

A) single-GPU code

You must modify the skeleton code to run a 2D boxcar filter of any (square) size on a single GPU (*no* separable filter).

The following command:

mpirun -np 1 ./bin/mpi_boxcar ./data/furtv.bmp ./hw4.A.bmp 6 10 1

Should print something like:

...
[SUMMARY] nProcs: 1, Filter size: 13 x 13, ... Error : 0.000000
...


You can also use a loop to make sure your code works with different filter sizes:

(hacky one-liner)
for hks in $(seq 2 2 10); do mpirun -np 1 ./bin/mpi_boxcar ./data/furtv.bmp ./hw4.A.$hks.bmp $hks 10 1 2> /dev/null | grep SUMMARY; done;

(output)
[SUMMARY] nProcs: 1, Filter size: 5 x 5, ... Error : 0.000000
[SUMMARY] nProcs: 1, Filter size: 9 x 9, ... Error : 0.000000
[SUMMARY] nProcs: 1, Filter size: 13 x 13, ... Error : 0.000000
[SUMMARY] nProcs: 1, Filter size: 17 x 17, ... Error : 0.000000
[SUMMARY] nProcs: 1, Filter size: 21 x 21, ... Error : 0.000000



Notes:

  • We are not interested in raw speed for this part of the homework, only functionality.
  • For this part of the homework you can use the command alias gpu-login, you will get an interactive session with 2 GPUs.
  • Boundary condition of the filter is "clamp to edge" (repeat the closest edge pixel value for outside of the image).
  • The inputs are 4 channel (RGBA, 8bits per channel) color images. However, when you applying the filter, process the RGB channel only (just copy the alpha channel from the input to the output image).
  • The input image type is unsigned char (8 bits) that you need to convert to float when you compute the convolution to prevent precision errors. Simply speaking, for example, convert unsigned char 255 to float 255.0f.
  • Performance measurement will be based on the average running time over 10 runs on the resonance cluster, similar to the previous homework. The timing code is provided in HW4_main.cpp, so you don't need to write it on your own.

B) dual-GPU / single-node


You'll now implement a master / worker MPI code that uses up to 2 processes. You can use any MPI primitive that you like. It is however possible to come up with a solution that only uses the ones presented in class (i.e. Introduction to MPI, slide 34).

You have to:

  1. break up the image into tiles (you may need to experiment with multiple "tile strategies")

  2. distribute the tiles to the 2 processes (note that the master can also be a worker...)

  3. run the boxcar image filtering on each tile
    IMPORTANT: you need to take care of the necessary tile overlap!

  4. accumulate the results on the master process

Here is what we are expecting to see:

(hacky one-liner)
for hks in $(seq 2 2 10); do mpirun -np 2 ./bin/mpi_boxcar ./data/furtv.bmp ./hw4.B.$hks.bmp $hks 10 1 2> /dev/null | grep SUMMARY; done;

(output)
[SUMMARY] nProcs: 2, Filter size: 5 x 5, ... Error : 0.000000
[SUMMARY] nProcs: 2, Filter size: 9 x 9, ... Error : 0.000000
[SUMMARY] nProcs: 2, Filter size: 13 x 13, ... Error : 0.000000
[SUMMARY] nProcs: 2, Filter size: 17 x 17, ... Error : 0.000000
[SUMMARY] nProcs: 2, Filter size: 21 x 21, ... Error : 0.000000


Notes:

  • We are not interested in raw speed for this part of the homework, only functionality.
  • For this part of the homework you can use the command alias gpu-login, you will get an interactive session with 2 GPUs.
  • Be very careful with your pointer arithmetic! Precision could be a problem... A simple solution could be to use (ulong) and cast back to (uchar4*), for instance.

C) multi-GPU / multi-node


You'll now extend your code to use many processes. You'll use a different job queue for this (gpubatch.q), please refer to the following instructions:

To purge your job queue (note that you are limited to 5 jobs):
qdel -u $USER

To delete a specific job:
qdel <JOB_ID>

To submit a job for this part of the homework:
qsub -v NPROCS=$NPROCS -pe ortegpu $NPROCS job_accuracy.sh
(have a look at job_accuracy.sh, you can modify it but you'll be evaluated on the original one)

To check your job queue:
qstat

To get the output of your completed job(s):
grep SUMMARY hw4.stdouterr.*


Here is what we are expecting to see:

(hacky one-liner)
qdel -u $USER; rm -vf hw4.stdouterr.*; for NPROCS in 2 4 6 8; do qsub -v NPROCS=$NPROCS -pe ortegpu $NPROCS job_accuracy.sh; done; while true; do qstat; grep SUMMARY hw4.stdouterr.*; sleep 1; done;

(output)
hw4.stdouterr.5347:[SUMMARY] nProcs: 2, Filter size: 13 x 13, ... Error : 0.000000
hw4.stdouterr.5348:[SUMMARY] nProcs: 4, Filter size: 13 x 13, ... Error : 0.000000
hw4.stdouterr.5349:[SUMMARY] nProcs: 6, Filter size: 13 x 13, ... Error : 0.000000
hw4.stdouterr.5350:[SUMMARY] nProcs: 8, Filter size: 13 x 13, ... Error : 0.000000


Notes:

  • We are not interested in raw speed for this part of the homework, only functionality.
  • For some reason, $NPROCS must be a multiple of 2 on the gpubatch.q queue. Also, don't try to allocate more than 8 processes (GPUs) or your job may never be scheduled.
  • More information on how to use resonance with MPI on multiple nodes can be found here.
  • Be very careful with your pointer arithmetic! Precision could be a problem... A simple solution may be to use (ulong) and cast back to (uchar4*), for example.

D) Filtering performance on a gigapixel image

You'll now optimize your code to process a gigapixel image in a reasonable amount of time (a few seconds). Please refer to the grading section to know when you should stop optimizing.

Get the image:
(cd data && wget http://www.cs264.org/files/San-Francisco-in-Ruins-1906.bmp)

To submit a job for this part of the homework:
qsub -v NPROCS=$NPROCS -pe ortegpu $NPROCS job_speed.sh
(have a look at job_speed.sh, you can modify it but you'll be evaluated on the original one)

Here is what we are expecting to see:

(hacky one-liner)
qdel -u $USER; rm -vf hw4.stdouterr.*; for NPROCS in 2 4 6 8; do qsub -v NPROCS=$NPROCS -pe ortegpu $NPROCS job_speed.sh; done; while true; do qstat; grep SUMMARY hw4.stdouterr.*; sleep 1; done;

(example output)
hw4.stdouterr.5042:[SUMMARY] nProcs: 2, Filter size: 51 x 51, Total average computing time: 6528.226
hw4.stdouterr.5043:[SUMMARY] nProcs: 4, Filter size: 51 x 51, Total average computing time: 2077.393
hw4.stdouterr.5044:[SUMMARY] nProcs: 6, Filter size: 51 x 51, Total average computing time: 1594.457
hw4.stdouterr.5045:[SUMMARY] nProcs: 8, Filter size: 51 x 51, Total average computing time: 1880.812



Notes:

  • We are interested in raw speed for this part of the homework, so make sure to minimize data transfer between the multiple processes!
  • For some reason, $NPROCS must be a multiple of 2 on the gpubatch.q queue. Also, don't try to allocate more than 8 processes (GPUs) or your job may never be scheduled.
  • More information on how to use resonance with MPI on multiple nodes can be found here.
  • Be very careful with your pointer arithmetic! Precision could be a problem... A simple solution may be to use (ulong) and cast back to (uchar4*), for example.

Questions


  1. Briefly explain your implementation and optimization / problem decomposition strategy and why you choose that approach. Please expose both CUDA and MPI strategies.

  2. Do you see any performance boost with small filters (e.g. 5x5)? Why?

  3. How does your code scale with the number of processes on the gigapixel image? Why?

  4. What is the bottleneck of your implementation? Why?


Please relate your answers to the "little diversion" section.


Extra Credit (for the fearless)


Implement histogram equalization (wikipedia, example) using CUDA+MPI.


Grading


The breakdown is as follows:

  • Basic functionality (45 pts)
    • Part A: 10 pts
    • Part B: 15 pts
    • Part C: 20 pts

  • Optimization, Part D (35 pts)
    • $NPROCS=2 runs faster than 10000ms: 10 pts
    • $NPROCS=4 runs faster than 3000ms 4000ms (updated Oct, 18th): 10 pts
    • $NPROCS=6 runs faster than 1700ms 2500ms (updated Oct, 18th): 15 pts

  • Questions (20 pts, 5 per question)

  • Extra Credit (10 pts)


Submission


To submit your work:

Create a folder named lastname_firstinitial_hw4, copy the 2 source files (HW4_main.cpp, HW4_boxcar_gpu.cu) and a text file answering the questions. Create a tarball (i.e. tar czvf) and submit it to the course iSite page in the HW4 dropbox. If you have files for extra credit, submit them as lastname_firstinitial_hw4xtra.tar.gz

Address any questions to the forum or to tf@cs264.org

Good luck to everyone!