Homework 1: Basic CUDA Programming

Due Friday, Sept 25, 11pm EDT

The goal of this assignment is to implement some common linear algebra operators in CUDA.  After completing the homework, you should be familiar with memory allocation on the GPU, memory transfer between host and GPU, kernel invocation, and parallel reductions. 


Part 1: SAXPY

Here you'll implement the SAXPY (Scalar Alpha X Plus Y) operation, which is one of the level 1 BLAS functions. SAXPY is given by
\alpha x + y ,
for vectors x and y and scalar \alpha.

Setup.  Download the skeleton code for both part 1 and 2 (homework1.zip [updated 09.22.09]) into your working directory $WDIR.  In both parts, we're reusing the build system from the CUDA SDK.  To setup the code for this on resonance, from the command line do

    cd $WDIR
    unzip homwork1.zip 
    cd saxpy
    mkdir tmp
    ln -s $CUDASDK_HOME/lib .
    ln -s $CUDASDK_HOME/common .

Compiling.  You can compile the code using the provided Makefile.  From the $WDIR/saxpy directory, just do make, optionally providing flags dbg=1 for debug mode and dbg=1 emu=1 for emulation mode.

Binaries are written to $WDIR/saxpy/bin/linux/release/saxpy in release mode, and  $WDIR/saxpy/bin/linux/emudebug/saxpy in emulation debug mode.

Completing the skeleton code.  The code for part 1 is in the saxpy subdirectory.  The skeleton code is setup to apply the saxpy operation to two input vectors on both the CPU and GPU, time their execution, and compare the results for correctness.  Your job is to complete missing segments of code.

In function saxpy_gpu():

Code segment 1:
  • Allocate device memory
  • Copy input data from host memory to device memory
Code segment 2:
  • Specify kernel launch parameters and invoke the kernel
  • Measure kernel computation time
Code segment 3:
  • Copy the result matrix from the device memory to host memory
Code segment 4:
  • Free device memory
In the GPU kernel saxpy_gpu_kernel():

Code segment 5:
  • Determine thread's output index
  • Compute correct value
  • Check that output location is valid and write
Running the application.  To run the application, invoke from the command line as

        saxpy -inx="<fname_X>" -iny="<fname_Y>" -alpha=<alpha> \
        {-outy="<outfname_Y>"} {-device=<dev>},
 
 where <fname_X> and <fname_Y> are input files corresponding to vectors x and y, respectively, and <alpha> is the alpha parameter for saxpy().   You can optionally specify an output file <outfname_Y> to which the result vector will be written, and the device number <dev>.

Alternatively, you can generate test data by invoking as

        saxpy -outy="<outfname_Y>" -n=<n>,

 where <outfname_Y> specifies the output filename, and <n>, vector length,  to generate a random vector in [0,1]n and write it to file.


Part 2: Vector Reduction

Adapted from GLCPC material for the course "Many-Core Processors for Science and Engineering Applications", by Nady Obeid, Xiao-Long Wu, I-Jui Sung, and Wen-mei Hwu, UIUC

In this part, you'll compute the sum of a vector via a parallel reduction, where the result is obtained in O(lg N) total steps.  One reduction scheme is shown in Figure 1.



Figure 1: Reduction scheme, using O(lg N) steps and O(N) computations


This scheme has benefits over other reduction patterns, in terms of minimizing thread divergence and coalescing memory accesses. 


Setup. As we did in part 1, from the command line do

    cd $WDIR/vector_reduction
    mkdir tmp
    ln -s $CUDASDK_HOME/lib .
    ln -s $CUDASDK_HOME/common .

Compiling.  From the $WDIR/reduction directory, again just do make, optionally providing flags dbg=1 for debug mode and dbg=1 emu=1 for emulation mode.

Binaries are written to $WDIR/vector_reduction/bin/linux/release/vector_reduction in release mode, and  $WDIR/vector_reduction/bin/linux/emudebug/vector_reduction in emulation debug mode.

Completing the skeleton code.  The skeleton code for this part is in the vector_reduction subdirectory.  It's setup to compute the sum of a fixed-length (512) vector on the CPU and GPU and compare the results for correctness. 

To complete the code, first modify the function computeOnDevice() defined in vector_reduction.cu.

 

Code segment 1:

  • Allocate CUDA device memory
  • Copy input data from host memory to CUDA device memory
  • Copy results from CUDA device memory back to host memory 

In vector_reduction_kernel.cu, modify the kernel function to implement the reduction scheme shown in Figure 1.1. Your implementation should use shared memory to increase efficiency.

 

Code segment 2:

  • Load data from global memory to shared memory
  • Do sum reduction on the shared memory data
  • Store data back to shared memory  
Running the application.  There are two modes of operation for the application.
  • No arguments: The application will create a randomly initialized array to process. After the device kernel is invoked, it will compute the correct solution value using the CPU, and compare that solution with the device-computed solution. If it matches (within a certain tolerance), if will print out ”Test PASSED” to the screen before exiting. 
  • One argument: The application will initialize the input array with the values found in the file provided as an argument.
In either case, the program will print out the final result of the CPU and GPU computations, and whether or not the comparison passed.  While both part 1 and 2 use text input files, be aware that part 1 files are floating point, and Part 2, integers (and of fixed length), and not interchangeable.

A test input data.txt is provided.

Questions
  1. How many times does your thread block synchronize to reduce the array of 512 elements to a single value? 

  2. What is the minimum, maximum, and average number of ”real” operations that a thread will perform? ”Real” operations are those [updated: 09.21.09] floating point arithmetic operations that directly contribute to the final reduction value. 


Extra Credit

Extend the code in part 2 to support reductions on arbitrary length vectors larger than 512 entries.  If you choose to give this a try, just make a new directory $WDIR/vector_reduction1 and implement it there.

[updated: 09.21.09] Your solution to part 2 lets you reduce a vector that fits into a single block, since you're doing the reduction in shared memory.  To reduce a vector that doesn't fit into a single block, you can have each block reduce a portion of the vector and recursively apply the reduction kernel to the partial results.  

We don't provide skeleton code, but assume a command-line interface similar to part two, where we pass in the name of a data file to use as input.        


Submission and Grading

Create a zip (rar, tar.gz, etc.) archive of the contents of your working directory $WDIR containing all the changes and additions you’ve made to the source code.  Please omit binaries and any data files you've generated from the archive.  Create a text file, Word document, or PDF file with answers to questions 1–2 above. 

[updated 9.19.09] To submit your homework, create a folder named lastname_firstinitial_hw1 and place your files in this folder. Compress the folder (please use .zip compression) and submit on the course iSite page in the HW 1 dropbox. Please contact the TFs if you miss the deadline on Fri, Sept 25, 11 pm EDT.

[updated 9.23.09] The iSite page now supports DCE and FAS/cross-registered MIT students, so you shouldn't have any problems.  If you still have problems with iSite, or if you miss the deadline, send an email to dale@eecs.harvard.edu entitled "CS264: Homework 1 Submission" with your .zip file as an attachment.

You’ll be graded on the following parameters: 

  • Demo/knowledge: 35 points 
    • Builds and runs on resonance
    • Produces correct result output file for our test inputs [updated: 09.21.09
      • part 1: test inputs x,y\in [0,1]^n and \alpha\in [0,1] 
      • part 2 and extra credit: test vectors of 512 elements, each integer values between 0 and 1000, inclusive
  • Functionality: 40 points
    • Correct usage of CUDA library calls and C extensions
    • Correct usage of thread id’s in saxpy
    • Correct checks for valid output locations in saxpy kernel
    • Correct implementation of data-parallel reduction algorithm 
  • Report: 25 points
    • Correct answers to questions 1-2

Extra credit is worth up to a 5 points bonus.