Homework 5: PageRank on Wikipedia with Hadoop


Due Date: Nov 2nd (Mon) 2009, 11:00 PM EST

Goals

In this problem set, you will write several MapReduce programs and rum them on your local Cloudera hadoop virtual machine to compute the "importance" of various Wikipedia pages/articles as determined by the PageRank metric. Your input will be the English-language edition of Wikipedia. The specific goals of this homework are:

  1. Understand how to run a hadoop job on your local virtual machine.
  2. Understand the PageRank algorithm and how it works in MapReduce.
  3. Implement PageRank and execute it on a corpus of data.

PageRank Algorithm

Learn and understand the PageRank algorithm. http://infolab.stanford.edu/~backrub/google.html and Wikipedia both have excellent descriptions.

The Dataset

In this problem set, we're working with a handful of Wikipedia encyclopedias. If you were to crawl the Wikipedia page, we'd end up with over 3 million distinct pages.  Storing those on HDFS, one-per-file, is suboptimal: Hadoop would be overwhelmed with the overhead associated with opening and closing files.  Instead, we start with an XML form of Wikipedia, available at http://download.wikimedia.org/enwiki/.  This corpus (with only the latest revision, not all historical data  ) is 5.6GB compressed.  We've pre-processed this corpus a bit further, so that every page appears on its own line, in a very simple XML schema that drops unnecessary.  This makes it easy to use the default InputFormat, which performs one map() call per line of each file it reads. The mapper will still perform a separate map() for each page of Wikipedia, but since it is sequentially scanning through a small number of very large files, performance is much better than in the separate-file case.

Each page of Wikipedia is represented as follows (in one line):
<page><title>The Title</title><text>The page body</text></page>

The body text of the page also has all newlines converted to spaces to ensure it stays on one line in this representation.

For this homework, you should use pre-processed datasets, which can be downloaded as follows:

  • http://harvard-cs264.s3.amazonaws.com/scowiki-20090929-one-page-per-line.gz (small)
  • http://harvard-cs264.s3.amazonaws.com/afwiki-20091002-one-page-per-line.gz (medium)

MapReduce Steps for PageRank

This presents the high-level requirements of what each phase of the program should do:
(While there are other, equivalent implementations of PageRank, this suggests one such method that can be implemented in this assignment.)  You'll want to implement a MapReduce program to do each of these steps.

Step 1: Create Link Graph

Process individual lines of the input corpus, one at a time. These lines contain the XML representations of individual pages of Wikipedia. Turn this into a link graph and initial page rank values (Use 1-d as your initial PageRank value, where d is the damping factor).

You'll need to look up and understand the Wikipedia Link format, which is descibed at http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style_(links) and http://en.wikipedia.org/wiki/Wikilink#Wikilinks.  You probably don't need 100% parsing accuracy to get the right result.

Think: What output key do you need to use? What data needs to be in the output value?

Step 2: Process PageRank

This is the component which will be run in your main loop. The output of this section should be directly readable as the input of this same step, so that you can run it multiple times. In this section, you should divide fragments of the input PageRank up among the links on a page, and then recombine all the fragments received by a page into the next iteration of PageRank.

Step 3: Cleanup and Sorting

One of the goals of this assignment is to understand which pages on Wikipedia have a high PageRank value. Therefore, we use one more "cleanup" pass to extract this data into a form we can inspect. Strip away any data that you were using during the repetitions of step 2 so that the output is just a mapping between page names and PageRank values. We would like our output data sorted by PageRank value.

Hint: Use only 1 reducer to sort everything. What should your key and value be to sort things by PageRank?

At this point, the data can be inspected and the most highly-ranked pages can be determined.

Homework Instructions

Implement the PageRank algorithm using Hadoop MapReduce (as described above). Implement three steps introduced in the previous section, and test them on small and medium datasets to find the top ten highest-PageRank pages in each dataset. Your hadoop job should run the link graph generator, calculate PageRank for 10 iterations, and then run the cleanup pass. You should write a bash shell script that launches a hadoop job that combines all three steps. Name your shell script "run_pagerank.sh".
 
When writing your MapReduce jobs, you have a choice. You can use "Hadoop streaming" to write the jobs in a language of your choice (typically you'd use a scripting language like python, perl, or Ruby, here), or you can use the native Java APIs and write the jobs in Java.

Some pagerank advice:
  • The data is going to be dirty; don't worry about executing the algorithm perfectly, and feel free to take reasonable shortcuts.  (For example, the solution ignores pages that don't link anywhere.)

Some general advice:
  • Work on the large dataset only after you've got everything working on the small dataset. The majority of this assignment is about implementing an algorithm in MapReduce terms, and the whole dataset is not required for that.
  • Test running a single pass of the PageRank mapper/reducer before putting it in a loop
  • Each pass will require its own input and output directory; one output directory is used as the input directory for the next pass of the algorithm. Set the input and output directory names in the JobConf to values that make sense for this flow. 
  • If the dataset is large, the PageRank for each page may be a very small floating-point number. You may want to multiply all PageRank values by a constant (10,000 or so) in the cleanup step to make these numbers more readable. 
  • The final cleanup step should have 1 reduce task, to get a single list of all pages.  You can use more mappers and reducers in the intermediate phases.
  • Don't forget, this is "real" data. We've done most of the dirty work for you in terms of formatting the input into a presentable manor, but there might be lines which don't conform to the layout you expect, blank lines, etc. Your code must be robust to these parsing errors. Just ignore any lines that are illegal -- but don't cause a crash!

Advice if you use the Streaming APIs:
  • You can use the templates at http://harvard-cs264.s3.amazonaws.com/python/common.py and http://harvard-cs264.s3.amazonaws.com/python/template.py, if you'd like, for python.
  • Use a shell script to tie together the jobs.

Advice if you use the Java APIs:
  • SequenceFiles are a special binary format used by Hadoop for fast intermediate I/O. The output of the link graph generator, the input and output of the PageRank cycle, and the input to the cleanup pass, can all be set to org.apache.hadoop.mapred.SequenceInputFormat and SequenceOutputFormat for faster processing, if you don't need the intermediate values for debugging.
  • Create a new JobClient and JobConf object for each MapReduce pass. main() should call a series of driver methods.
  • Remember that you need to remove your intermediate/output directories between executions of your program
  • The simplest thing is to have the input and output types for each of the passes be Text.  If you're using Text, you should design a textual representation for the data that must be passed through each phase, that you can serialize to and parse from efficiently. Alternatively, if you're feeling daring, implement your own subclass of Writable which includes the information you need.

Testing your code:
  • Before you run any MapReduce code, you should unit test individual functions, calling them on a single piece of example input data (e.g., a single line out of the data set) and seeing what they do.
  • After you have unit tested all your components, you should do some small integration tests which make sure all the working parts fit together, and work on a small amount of representative data. 

[added Oct 26] Because we have found a few bugs in the cloudera training VM used in the lab, use the older version (0.3.1) for this homework instead (can be downloaded at http://cloudera.com/vm/cloudera-training-0.3.1.tar.bz2). You should submit all the mapper/reducer codes as well as the shell script that launches your code as a single hadoop job. Therefore, your shell script should specify all the required parameters, such as input/output files/paths, correctly. We will use exactly same VM, so your code should run without any modification.

[added Nov 2] In the report, explain your implementation details, such as shortcuts you took and links you ignored. 

Questions

Problem 1: Explain how the PageRank formula relates to taking a random walk on the Internet.

Problem 2: Explain the "damping factor".  What happens when this factor is 0?

Problem 3: Given a damping factor of 0.65 and the following web graph,
A => [B, C, D],   
B => [A, E],   
C => [],   
D => [F]
E => [A, B, C]
F => [A]
G => [C]
calculate the PageRank of every page.  Normalize so that the sum of all the page ranks sums up to 1.  
 
Problem 4: Describe the data types you used for keys and values in the various MapReduce stages.  How did you serialize and de-serialize the data stream?

Problem 5: What scalability hazards, if any, do you see in this system? Do you have any ideas how these might be overcome or mitigated?

Grading

The breakdown is as follows:

  • Basic functionality - Run MapReduce on the provided datasets (80 pts)
    • Code runs on Cloudera hadoop VM with small dataset and finds top ten highest-PageRank pages correctly (40 pts)
    • Code runs on Cloudera hadoop VM with medium dataset top ten highest-PageRank pages correctly (40 pts)
  • Questions (20pts, 4 pts per question)

Submission

To submit the files, create a folder named lastname_firstinitial_hw5 and place all your files in this folder. Compress the folder (please use .zip compression) and submit on the course iSite page in the HW 5 dropbox. Please contact the TFs if you miss the deadline on Mon, Nov 2nd, 11 pm EST.




Contains materials Copyright © 2007 University of Washington, licensed under the Creative Commons Attribution 3.0 License -- http://creativecommons.org/licenses/by/3.0/