CS9223 - Assignment 1 - MapReduce on AWS

Assignment Description

To get some hands on experience with the MapReduce programming model and a cloud infrastructure, you will implement a few variations of WordCount -- given a file, you will count how many words there are in the file. You can assume that words are separated by a space.

As input you will read a CSV file that contains Wikipedia articles. Each line in the file contains the text of a Wikipedia article. You can find the input file at: Location: s3://cs9223/wikitext-big-notitle.csv https://s3.amazonaws.com/cs9223/wikitext-big-notitle.csv

Here are the problems you will solve:

a) Write a MapReduce program to count all the words in the input file and outputs the 100 most frequent words.

b) Modify your program to use a combiner and compare performance (time in seconds) against your first implementation.

c) Add to your code a method that uses MapReduce to count the number of words with a given length. The output should contain the 100 most frequent words in decreasing order of frequency. For example: words with 2, 3, 4, … letters.

and  1000
not  800
...

d) Add to your code a method count the number of words that have a given prefix and and outputs the 100 most frequent words with the given prefix in decreasing order of frequency. For example: count how many words start with "dis".

disconnect 100
disappear 50
...

Requirements

1. You must write your code in Java. In addition to the source code, you will submit a JAR file that can be called using the following command:

hadoop -jar wordcount.jar WordCount -input <> -output <> [-combiner] [-word-length x] [-prefix yyy]

The inputs of your Java program should be the following:

WordCount: classname

-input: Input (directly from S3)
-output: Output (directly to S3)
-combiner: (Extras) Use or not use combiner
-word-length: (Extras) Consider word with x characters only
-prefix: (Extras) Consider word with yyy prefix only

2. You must use Hadoop version 1.0.3.

3. You should both read from and write to Amazon S3 directly.

When and what to submit

Your assignment is due on Sept 29th, 2013. The required file should be submitted to my.poly.edu.

You create and submit a ZIP file that contains:

1) Java source code and jar file: WordCount.java WordCount.jar

2) Output files should be written to S3. You will submit a file called s3_url containing the URL to the S3 directory where the outputs are located.

For each problem that involves counting (a,c,d), you will create a text file with names a_output, c_output and d_output respectively, where each line contains a word and the count of how often it occurred separated by tab, in decreasing order of frequency.

For item (b), create a text file, named b_output, containing the two performance numbers separated by a tab in one line.

Notes and Suggestions

You can find information on how to create an account and use AWS at: http://www.vistrails.org/index.php/AWS_Setup

Your token is worth $100 -- use it carefully, since there will be other assignments. Amazon charges by the hour, see http://aws.amazon.com/ec2/pricing/

Install Hadoop on your own computer. This will make it easier for you to debug your program and will also save you credit at AWS.

Warning

You must remember to terminate manually the AWS cluster (called Job Flows) when you are done: if you just close the browser, the job flows continue to run, and amazon will continue to charge you for days and weeks, exhausting your credit and charging you huge amount on your credit card.

Cs9223 Mapreduce Assignment

Contents