Course: Massive Data Analysis 2014/Hadoop Exercise

Before you start

You must have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.
Download the following package: http://bigdata.poly.edu/~tuananh/files/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.
What to submit for these exercises:
- Code: place your code for exercises 1, 2 and 3 in a public GitHub repository
- Results: put the results in your S3 bucket (don't forget to make it public) [instruction]
- Complete this form to submit the links to your GitHub repository and S3 bucket. Deadline: 11:59 PM on Oct 8, 2014
- Office Hours: Oct 7 (Tue) from 3pm to 5pm, at 2 MetroTech Center, 10th floor, 10.053A

Note: Input for the exercises: s3://mda2014/input/wikipedia.txt

Exercise 0: WordCount
- Run the basic WordCount example on your local machine and AWS
- Follow the instructions to create your Amazon Elastic MapReduce (EMR) cluster
- Instructions to run WordCount on your local machine and EMR cluster will be given in class
- Note: You don't have to submit code and results for this exercise.

Exercise 1: Fixed-Length WordCount
- For this exercise, count only the words that are exactly 5 characters long
- Output: Key is the word, and value is the number of times the word appears in the input.
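
The filtering step for this exercise can be sketched locally before porting it to Hadoop. The snippet below is plain Python, not Hadoop code, and only illustrates the map and reduce logic. It assumes that a "word" is a whitespace-separated token and that counting is case-sensitive; the exercise does not state either, so adjust to match the tokenizer in the provided WordCount example. The function names are hypothetical.

```python
from collections import defaultdict

WORD_LENGTH = 5  # from the exercise statement


def map_line(line):
    """Mapper sketch: emit (word, 1) for each token of exactly WORD_LENGTH characters.

    Assumption: tokens are separated by whitespace.
    """
    for token in line.split():
        if len(token) == WORD_LENGTH:
            yield token, 1


def reduce_counts(pairs):
    """Reducer sketch: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def fixed_length_word_count(lines):
    """Run the mapper over every line, then reduce all emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)
```

Only the mapper differs from a basic word count here: the length check decides whether a pair is emitted, and the reducer is an ordinary sum.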

Exercise 2: InitialCount
- Count the number of words per initial (the first character of the word)
- Letter case is ignored. For example, Apple and apple are both counted under the initial A
- Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).
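
As a local illustration of the key normalization described above, the following plain-Python sketch (not Hadoop code) emits the uppercase initial as the key. It assumes whitespace tokenization and skips tokens whose first character is not an ASCII letter, since the output keys are limited to A to Z; how such tokens should be handled is not stated in the exercise. The function names are hypothetical.

```python
import string
from collections import Counter

ASCII_LETTERS = set(string.ascii_letters)


def map_initials(line):
    """Mapper sketch: emit (INITIAL, 1) for each token that starts with a letter a-z or A-Z.

    Assumption: tokens starting with digits, punctuation, or non-ASCII
    characters are skipped.
    """
    for token in line.split():
        first = token[0]
        if first in ASCII_LETTERS:
            yield first.upper(), 1


def initial_count(lines):
    """Sum the emitted counts per uppercase initial."""
    totals = Counter()
    for line in lines:
        for initial, value in map_initials(line):
            totals[initial] += value
    return dict(totals)
```

Uppercasing in the mapper means that Apple and apple reach the same reducer key, so the reducer stays a plain sum.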

Exercise 3: Top-K WordCount
- Output the top 100 most frequent 7-character words, in descending order of frequency
- Output: Key is the word, and value is the number of times the word appears in the input.
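
The selection and ordering step can be sketched locally as follows. This is plain Python, not Hadoop code. It assumes whitespace tokenization and case-sensitive counting, and it breaks ties between equally frequent words alphabetically; none of these choices is specified by the exercise. The function name and its parameters are hypothetical.

```python
import heapq
from collections import Counter


def top_k_words(lines, k=100, length=7):
    """Return up to k (word, count) pairs for words of the given length,
    in descending order of count.

    Assumption: ties are broken alphabetically by word.
    """
    counts = Counter(
        token
        for line in lines
        for token in line.split()
        if len(token) == length
    )
    # Sorting on (-count, word) yields descending counts with alphabetical ties.
    return heapq.nsmallest(k, counts.items(), key=lambda item: (-item[1], item[0]))
```

For example, `top_k_words(["example example another"], k=1)` returns `[("example", 2)]`. On a cluster, the counts come from many reducers, so the final ranking typically needs one extra step that sees all counts together (for instance a second job or a single reducer); that part is not shown here.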