Homework 4: CUDA+MPI
Due Sunday, 18 October at 11 pm EDT
Goal
The goal of this homework is to introduce you to multi-GPU progamming with CUDA and the Message Passing Interface (MPI) library. You will implement a simple image processing operation in CUDA (i.e., a 2d low-pass filter) and distribute the computation across many compute nodes (with many GPUs) using simple MPI primitives.A little diversion...
As you have already seen in class, the current GPU programming model exposes two important hierarchies that one must absolutely master to achieve maximum performance.
Here is a rough picture (based on a NVIDIA GT200 configuration):
- the memory / communication hierarchy:
- host memory (>4GB) and the PCI-Express 2.0 X16 bus (<8 GB/s)
- device memory (<4GB, offchip) and global memory access (<80GB/s)
- shared memory (16KB) and cached access (2 clock cycles)
- register file (8KB) and register access (1 clock cycle)
- the threading / execution hierarchy:
- processes (CPU)
- grids of blocks (GPU)
- blocks of threads (GPU)
- multiple execution pipelines,
e.g. madd (multiply-add) + mul (multiply) in 1 clock cycle
Basically, the closer you are to the Arithmetic Logic Units (ALUs), the faster the memory gets but you have less! Since computation is generally orders of magnitude faster than communication, you must pay special attention to how you fetch data. Interestingly, the new NVIDIA architecture (codename 'Fermi') exposes additional levels in the memory hierarchy by introducing L1 and L2 caches...
From this perspective, using MPI across multiple GPU-nodes is just one solution to extend the hierarchies but with severe penalties: multiple GPUs in a single system may have to share the same PCI-Express bus while multiple GPU systems may have to communicate through slow network interfaces.
We hope that this homework will provide some intuition on how to make best use of these hierarchies.
If you want to read more about all this, we suggest the following references:
Sequoia: Programming the memory hierarchy.
CUDASA: Compute Unified Device and Systems Architecture
Fermi Compute Architecture White Paper
http://www.nvidia.com/object/fermi_architecture.html
Introducing the Average "boxcar" Filter
This (crude) low-pass filter is probably one of the simplest. It just consists of averaging the pixel values in the neighborhood of the pixel under consideration:
Even though it is possible to greatly accelerate this filtering operation (e.g., with a separable convolution) we ask you to implement it using a discrete 2D convolution.
Here is an example:
Input
Output using a 13x13 boxcar filter
In this homework, we'll assume a square boxcar filter (i.e.
Skeleton Code
The skeleton code is available here.
Setup:
wget http://www.cs264.org/files/cs264-hw4_v2.tar.gz
tar xzvf cs264-hw4_v2.tar.gz
cd cs264-hw4
Compilation (Makefile with mpic++):
make -j2
Cleaning:
make clean
or
make clobber
Execution:
mpirun -np $NPROCS ./bin/mpi_boxcar $INPUT_FILENAME $OUTPUT_FILENAME \
$HALF_KERNEL_SIZE $NUM_TEST $CHECK_ACCURACY
Description of the parameters:
- $NPROCS
number of MPI processes to launch - $INPUT_FILENAME
input image filename (bmp format) - $OUTPUT_FILENAME
output image filename (bmp format) - $HALF_KERNEL_SIZE
half the filter kernel size; the full kernel size will be $HALF_KERNEL_SIZE * 2 + 1 - $NUM_TEST
the total number of test runs to perform - $CHECK_ACCURACY
check the accuracy of the MPI+CUDA code against a reference CPU implementation (boolean; 0 or 1)
Example:
mpirun -np 1 ./bin/mpi_boxcar ./data/furtv.bmp ./out.bmp 6 10 1
Description of the source files:
- HW4_main.cpp
contains the skeleton of the MPI code to distribute the load on multiple GPUs - HW4_boxcar_cpu.cpp
contains a reference CPU implementation of the boxcar filter - HW4_boxcar_gpu.cu
contains the skeleton of the boxcar filtering GPU code
Important: You have to submit HW4_main.cpp and HW4_boxcar_gpu.cu *only*.
Homework Instructions
You goal is to gradually complete the skeleton code until you get a multi-GPU / multi-node code running. The portions of the source code that you have to complete are identified by "TODO":
grep -n TODO *
A) single-GPU code
You must modify the skeleton code to run a 2D boxcar filter of any (square) size on a single GPU (*no* separable filter).The following command:
mpirun -np 1 ./bin/mpi_boxcar ./data/furtv.bmp ./hw4.A.bmp 6 10 1
Should print something like:
...
[SUMMARY] nProcs: 1, Filter size: 13 x 13, ... Error : 0.000000
...
You can also use a loop to make sure your code works with different filter sizes:
(hacky one-liner)
for hks in $(seq 2 2 10); do mpirun -np 1 ./bin/mpi_boxcar ./data/furtv.bmp ./hw4.A.$hks.bmp $hks 10 1 2> /dev/null | grep SUMMARY; done;
(output)
[SUMMARY] nProcs: 1, Filter size: 5 x 5, ... Error : 0.000000
[SUMMARY] nProcs: 1, Filter size: 9 x 9, ... Error : 0.000000
[SUMMARY] nProcs: 1, Filter size: 13 x 13, ... Error : 0.000000
[SUMMARY] nProcs: 1, Filter size: 17 x 17, ... Error : 0.000000
[SUMMARY] nProcs: 1, Filter size: 21 x 21, ... Error : 0.000000
Notes:
- We are not interested in raw speed for this part of the homework, only functionality.
- For this part of the homework you can use the command alias gpu-login, you will get an interactive session with 2 GPUs.
- Boundary condition of the filter is "clamp to edge" (repeat the closest edge pixel value for outside of the image).
- The inputs are 4 channel (RGBA, 8bits per channel) color images. However, when you applying the filter, process the RGB channel only (just copy the alpha channel from the input to the output image).
- The input image type is unsigned char (8 bits) that you need to convert to float when you compute the convolution to prevent precision errors. Simply speaking, for example, convert unsigned char 255 to float 255.0f.
- Performance measurement will be based on the average running time over 10 runs on the resonance cluster, similar to the previous homework. The timing code is provided in HW4_main.cpp, so you don't need to write it on your own.
B) dual-GPU / single-node
You'll now implement a master / worker MPI code that uses up to 2 processes. You can use any MPI primitive that you like. It is however possible to come up with a solution that only uses the ones presented in class (i.e. Introduction to MPI, slide 34).
You have to:
- break up the image into tiles (you may need to experiment with multiple "tile strategies")
- distribute the tiles to the 2 processes (note that the master can also be a worker...)
- run the boxcar image filtering on each tile
IMPORTANT: you need to take care of the necessary tile overlap! - accumulate the results on the master process
Here is what we are expecting to see:
(hacky one-liner)
for hks in $(seq 2 2 10); do mpirun -np 2 ./bin/mpi_boxcar ./data/furtv.bmp ./hw4.B.$hks.bmp $hks 10 1 2> /dev/null | grep SUMMARY; done;
(output)
[SUMMARY] nProcs: 2, Filter size: 5 x 5, ... Error : 0.000000
[SUMMARY] nProcs: 2, Filter size: 9 x 9, ... Error : 0.000000
[SUMMARY] nProcs: 2, Filter size: 13 x 13, ... Error : 0.000000
[SUMMARY] nProcs: 2, Filter size: 17 x 17, ... Error : 0.000000
[SUMMARY] nProcs: 2, Filter size: 21 x 21, ... Error : 0.000000
Notes:
- We are not interested in raw speed for this part of the homework, only functionality.
- For this part of the homework you can use the command alias gpu-login, you will get an interactive session with 2 GPUs.
- Be very careful with your pointer arithmetic! Precision could be a problem... A simple solution could be to use (ulong) and cast back to (uchar4*), for instance.
C) multi-GPU / multi-node
You'll now extend your code to use many processes. You'll use a different job queue for this (gpubatch.q), please refer to the following instructions:
To purge your job queue (note that you are limited to 5 jobs):
qdel -u $USER
To delete a specific job:
qdel <JOB_ID>
To submit a job for this part of the homework:
qsub -v NPROCS=$NPROCS -pe ortegpu $NPROCS job_accuracy.sh
(have a look at job_accuracy.sh, you can modify it but you'll be evaluated on the original one)
To check your job queue:
qstat
To get the output of your completed job(s):
grep SUMMARY hw4.stdouterr.*
Here is what we are expecting to see:
(hacky one-liner)
qdel -u $USER; rm -vf hw4.stdouterr.*; for NPROCS in 2 4 6 8; do qsub -v NPROCS=$NPROCS -pe ortegpu $NPROCS job_accuracy.sh; done; while true; do qstat; grep SUMMARY hw4.stdouterr.*; sleep 1; done;
(output)
hw4.stdouterr.5347:[SUMMARY] nProcs: 2, Filter size: 13 x 13, ... Error : 0.000000
hw4.stdouterr.5348:[SUMMARY] nProcs: 4, Filter size: 13 x 13, ... Error : 0.000000
hw4.stdouterr.5349:[SUMMARY] nProcs: 6, Filter size: 13 x 13, ... Error : 0.000000
hw4.stdouterr.5350:[SUMMARY] nProcs: 8, Filter size: 13 x 13, ... Error : 0.000000
Notes:
- We are not interested in raw speed for this part of the homework, only functionality.
- For some reason, $NPROCS must be a multiple of 2 on the gpubatch.q queue. Also, don't try to allocate more than 8 processes (GPUs) or your job may never be scheduled.
- More information on how to use resonance with MPI on multiple nodes can be found here.
- Be very careful with your pointer arithmetic! Precision could be a problem... A simple solution may be to use (ulong) and cast back to (uchar4*), for example.
D) Filtering performance on a gigapixel image
You'll now optimize your code to process a gigapixel image in a reasonable amount of time (a few seconds). Please refer to the grading section to know when you should stop optimizing.Get the image:
(cd data && wget http://www.cs264.org/files/San-Francisco-in-Ruins-1906.bmp)
To submit a job for this part of the homework:
qsub -v NPROCS=$NPROCS -pe ortegpu $NPROCS job_speed.sh
(have a look at job_speed.sh, you can modify it but you'll be evaluated on the original one)
Here is what we are expecting to see:
(hacky one-liner)
qdel -u $USER; rm -vf hw4.stdouterr.*; for NPROCS in 2 4 6 8; do qsub -v NPROCS=$NPROCS -pe ortegpu $NPROCS job_speed.sh; done; while true; do qstat; grep SUMMARY hw4.stdouterr.*; sleep 1; done;
(example output)
hw4.stdouterr.5042:[SUMMARY] nProcs: 2, Filter size: 51 x 51, Total average computing time: 6528.226
hw4.stdouterr.5043:[SUMMARY] nProcs: 4, Filter size: 51 x 51, Total average computing time: 2077.393
hw4.stdouterr.5044:[SUMMARY] nProcs: 6, Filter size: 51 x 51, Total average computing time: 1594.457
hw4.stdouterr.5045:[SUMMARY] nProcs: 8, Filter size: 51 x 51, Total average computing time: 1880.812
Notes:
- We are interested in raw speed for this part of the homework, so make sure to minimize data transfer between the multiple processes!
- For some reason, $NPROCS must be a multiple of 2 on the gpubatch.q queue. Also, don't try to allocate more than 8 processes (GPUs) or your job may never be scheduled.
- More information on how to use resonance with MPI on multiple nodes can be found here.
- Be very careful with your pointer arithmetic! Precision could be a problem... A simple solution may be to use (ulong) and cast back to (uchar4*), for example.
Questions
- Briefly explain your implementation and optimization / problem decomposition strategy and why you choose that approach. Please expose both CUDA and MPI strategies.
- Do you see any performance boost with small filters (e.g. 5x5)? Why?
- How does your code scale with the number of processes on the gigapixel image? Why?
- What is the bottleneck of your implementation? Why?
Extra Credit (for the fearless)
Implement histogram equalization (wikipedia, example) using CUDA+MPI.
Grading
The breakdown is as follows:
- Basic functionality (45 pts)
- Part A: 10 pts
- Part B: 15 pts
- Part C: 20 pts
- Optimization, Part D (35 pts)
- $NPROCS=2 runs faster than 10000ms: 10 pts
- $NPROCS=4 runs faster than
3000ms4000ms (updated Oct, 18th): 10 pts - $NPROCS=6 runs faster than
1700ms2500ms (updated Oct, 18th): 15 pts - Questions (20 pts, 5 per question)
- Extra Credit (10 pts)
Submission
To submit your work:
Create a folder named lastname_firstinitial_hw4, copy the 2 source files (HW4_main.cpp, HW4_boxcar_gpu.cu) and a text file answering the questions. Create a tarball (i.e. tar czvf) and submit it to the course iSite page in the HW4 dropbox. If you have files for extra credit, submit them as lastname_firstinitial_hw4xtra.tar.gz
Address any questions to the forum or to tf@cs264.org
Good luck to everyone!