Homework 6: PageRank on Hadoop Clusters using EC2


Due Date: Nov 8th (Sun) 2009, 11:00 PM EST

Goals

In this problem set, you will run your HW 5 PageRank code on Hadoop clusters using Amazon EC2.  The specific goals of this homework are:

  1. Learn how to set up and run a Hadoop job on Amazon EC2.
  2. Modify the HW 5 code so that it runs on a large dataset without errors.

We will make the example solution for HW 5 available on Thursday should you need it. In the meantime you can get started with the setup for this homework. 

Set Up

Use your Cloudera Hadoop training VM (0.3.1) as a client machine. First, you need to install some necessary files. Follow the setup guidelines here

Notes:
1. Install boto version 1.8 or above (do not use sudo apt-get python-boto because this will install version 1.3)
2. The AWS access key and secret access key can be found on the AWS site (Your Account->Security Credential). This is NOT your private key.
3. In your ec2-clusters.cfg,
  • keyname is the key pair name you created in AWS
  • private_key is the local path you saved your key file (***.pem)
4. Make your private key file not accessible to public (e.g., chmod 700 ***.pem)
5. Install FoxyProxy to use proxy in Firefox

Once you install and create all setup files, then you should be able to launch the Hadoop cluster using the following script:

>> hadoop-ec2 launch-cluster --user-packages 'python-simplejson' my-hadoop-cluster  10


The above command will launch one master node with 10 client nodes with python-simplejson installed.

To use a proxy, you should do the following:

>> hadoop-ec2 proxy my-hadoop-cluster

Run the Hadoop job on the cluster

Once you launch your Hadoop cluster, you need to log in to the master node.

>> hadoop-ec2 login my-hadoop-cluster

and use /mnt for your scratch space to work with. For example, you can create your own directory under /mnt

>> mkdir /mnt/HW6
>> cd /mnt/HW6

Then copy the dataset and your pagerank code to that directory. Download the large test datasets from the following links:

  • http://harvard-cs264.s3.amazonaws.com/enwiki-20090929-one-page-per-line-part1.gz
  • http://harvard-cs264.s3.amazonaws.com/enwiki-20090929-one-page-per-line-part2.gz
  • http://harvard-cs264.s3.amazonaws.com/enwiki-20090929-one-page-per-line-part3.gz

example:
>> wget http://harvard-cs264.s3.amazonaws.com/enwiki-20090929-one-page-per-line-part1.gz
>> gunzip enwiki-20090929-one-page-per-line-part1.gz

Unzip all three files, then copy those files to your hadoop dfs

>> hadoop dfs -copyFromLocal enwiki-20090929-one-page-per-line-part1 enwiki-20090929-one-page-per-line-part1

Finally, run your PageRank code using the datasets and find the ten pages with the highest PageRank.

NOTE:
1. Copying/unzipping large data files can take an extremely long time. For example, part1 can be about 10G when it is uncompressed. When you terminate your cluster, you lose all the files. So, plan carefully when you do this homework.
2. IMPORTANT!!! You will be billed based on your cluster up-time no matter if you are really using it or not. So, do not forget to terminate your clusters/instances when you finish your job. 
3. You can specify the number of mappers / reducers by using the -jobconf option
Example: -jobconf mapred.map.tasks=1 -jobconf mapred.reduce.tasks=1
4. Your HW5 pagerank code may not work on the large dataset. Modify your code to handle this dataset. 
 

Grading

The breakdown is as follows (total 100 pts):

  • Run your pagerank code on 5 hadoop cluster nodes using 20 reducers, find the top ten highest-PageRank pages correctly (50 pts)
  • Run your pagerank code on 19 hadoop cluster nodes using 100 reducers, find the top ten highest-PageRank pages correctly (50 pts)

Note: for the clean up stage, you may use one reducer. You can manually specify the number of mappers, too, but usually it can be automatically determined based on the size of the input.

In addition to the top ten pages, you also need to submit the following pdf print-outs for each test:
  • The main Hadoop Map/Reduce Job tracker page after running your pagerank code (example), and
  • Each individual job status page to prove your job was successfully processed on the cluster (example). 
You should have 12 pdfs in total for this, one for the link graph creation, 10 for the pagerank computations, one for clean up.

Use firefox's print to pdf function for each hadoop job you launch and name each pdf correctly (e.g., linkgraph.pdf, pagerank-1.pdf, etc).

Submission


To submit the files, create a folder named lastname_firstinitial_hw6 and place all your files in this folder. Compress the folder (please use .zip compression) and submit on the course iSite page in the HW 6 dropbox. Please contact the TFs if you miss the deadline on Sun, Nov 8, 11 pm EST.