Assignment 3 - MapReduce algorithm design
Assignment 3: Computing Relative Frequencies
For this assignment you will explore a set of 100,000 Wikipedia documents:
- s3://cs9223/wikitext_100k.txt, or
Each line in this file consists of the plain text extracted from a Wikipedia document.
Compute the relative frequencies of each word that occurs in the documents in wikitext_100k.txt and output the top 100 word pairs sorted by decreasing order of relative frequency.
Recall that the relative frequency (RF) of word B given word A is defined as follows:
where count(A,B) is the number of times A and B co-occur in a document and count(A) the number of times A occurs with anything else. Intuitively, given a document collection, the relative frequency captures the proportion of time the word B appears in the same document as A. (See Section 3.3, in Data-Intensive Text Processing with MapReduce).
In the lecture on March 10th, you learned different approaches to do this, and in this assignment, you will implement them:
a. Write a mapreduce program which uses the Stripes approach and writes its output in a file named rfstripes.txt
b. Write a mapreduce program which uses the Pairs approach and writes its output in a file named rfpairs.txt
c. Compare the performance of the two approaches and output the relative performance to a file named rfcomp.txt. Compute the relative performance as follows: (running time for Pairs/ running time for Stripes). Also include an analysis comparing the communication costs for the two approaches.
You can write your program using Java or Python (with Hadoop streaming) and you should run it on AWS. It is a good idea to develop and test your program on a local machine before deploying in on the cloud. In addition to the source code, you will submit a shell script file that could be called as follows: ./run.sh input_directory input_filename output_directory.
1. run.sh must include all the necessary commands to run your code and produce the required results.
2. All files needed to execute run.sh must be stored in the same directory.
3. You should both read from and write to Amazon S3 directly. In other words, input_directory and output_directory should be in S3.
4. You must use Hadoop version 1.0.3.
When and What to submit
Your assignment is due on April 6th, 2014 at 11:59pm. All the required files should be compressed in one ZIP file and submitted to NYU Classes. The ZIP file should contain:
a. Source code and run.sh (For Java, in addition to the source, you must include the JAR file)
b. The result files: rfstripes.txt, rfpairs.txt, rfcomp.txt
c. A plain-text document named rf.txt that describes
- the input/output format in each Hadoop task, i.e., the keys for the mappers and reducers
- the Hadoop cluster setting you used, i.e., number of mappers and reducers
- the running time for run.sh
Note: all files must be in the same directory