Difference between revisions of "Course: Massive Data Analysis 2014/Hadoop Exercise"

Latest revision as of 20:46, 8 October 2014

Before you start

You must have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.
Download the following package: http://bigdata.poly.edu/~tuananh/files/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.
What to submit for these exercises:
- Code: place your code for exercises 1, 2 and 3 in a public GitHub repository
- Results: put the results in your S3 bucket (don't forget to make it public; instructions: http://bigdata.poly.edu/~tuananh/files/S3MakePublicInstruction.pdf)
- Complete this form (http://bit.ly/1vAxovu) to submit the links to your GitHub repository and S3 bucket. Deadline: 11:59 PM on Oct 8, 2014
- Office Hours: Oct 7 (Tue) from 3pm to 5pm, at 2 MetroTech Center, 10th floor, 10.053A

Hands-on exercises

Note: The input for all exercises is s3://mda2014/input/wikipedia.txt
Exercise 0: WordCount
- Run the basic WordCount example on your local machine and on AWS
- Follow the instructions (http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf) to create your Amazon Elastic MapReduce (EMR) cluster
- Instructions to run WordCount on your local machine and EMR cluster will be given in class
- Note: You don't have to submit code and results for this exercise.

Exercise 1: Fixed-Length WordCount
- For this exercise, count only words that are exactly 5 characters long
- Output: Key is the word, and value is the number of times the word appears in the input.
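
The filtering logic for Exercise 1 can be prototyped locally before it is ported to a Hadoop job. The following is a minimal plain-Python sketch, not Hadoop code: the function names are hypothetical, and the definition of a "word" (a run of ASCII letters) is an assumption, since the exercise does not specify how to tokenize.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters. The exercise does not
# define tokenization, so adjust this to match your WordCount job.
WORD_RE = re.compile(r"[A-Za-z]+")


def map_fixed_length(line, length=5):
    """Mapper step: emit (word, 1) for each word of exactly `length` characters."""
    for word in WORD_RE.findall(line):
        if len(word) == length:
            yield word, 1


def reduce_counts(pairs):
    """Reducer step: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def fixed_length_word_count(lines, length=5):
    """Run the map and reduce steps over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_fixed_length(line, length))
    return reduce_counts(pairs)
```

Note that this sketch keeps letter case as-is, so `Hello` and `hello` are counted as different keys; the exercise only asks for case-insensitive handling in Exercise 2.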

Exercise 2: InitialCount
- Count the number of words per initial (first character)
- Letter case should not be taken into account. For example, Apple and apple are both counted under initial A
- Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).
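
The case-folding rule above can be sketched in plain Python as follows. This is an illustration of the logic only, not Hadoop code: the function names are hypothetical, and it assumes that a word is a run of ASCII letters, so tokens such as numbers contribute no initial.

```python
import re
from collections import Counter

# Assumption: only runs of ASCII letters count as words, so every
# initial falls in A to Z after upper-casing.
WORD_RE = re.compile(r"[A-Za-z]+")


def map_initial(line):
    """Mapper step: emit (UPPERCASE initial, 1) for each word in the line."""
    for word in WORD_RE.findall(line):
        yield word[0].upper(), 1


def initial_count(lines):
    """Reducer step: sum the counts per initial over all lines."""
    totals = Counter()
    for line in lines:
        for initial, one in map_initial(line):
            totals[initial] += one
    return dict(totals)
```

Upper-casing in the mapper means the reducer sees a single key per letter, so no extra merging of `a` and `A` is needed on the reduce side.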

Exercise 3: Top-K WordCount
- Output the top 100 most frequent 7-character words, in descending order of frequency
- Output: Key is the word, and value is the number of times the word appears in the input.
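
One way to check the expected output of Exercise 3 on a small sample is the plain-Python sketch below. It is not Hadoop code, and a real job would typically need a second pass or a single reducer to produce a global ranking. The function name is hypothetical; the tokenization (runs of ASCII letters) and the alphabetical tie-breaking are assumptions that the exercise does not specify.

```python
import heapq
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]+")


def top_k_words(lines, k=100, length=7):
    """Return up to k (word, count) pairs for words of exactly `length`
    characters, ordered by descending count."""
    counts = Counter(
        word
        for line in lines
        for word in WORD_RE.findall(line)
        if len(word) == length
    )
    # Negating the count makes nsmallest return the most frequent words first.
    # Assumption: ties are broken alphabetically to keep the output stable.
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, `top_k_words(lines)` with the defaults matches the exercise's parameters (top 100, 7-character words), while a smaller `k` is convenient for testing on a few lines of text.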

@@ Line 1: / Line 1: @@
 == Before you start ==
 * You '''must''' have Hadoop installed and working on your local machine. You also need to setup your Amazon AWS account. Refer to the instruction in the course page.
-* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.
+* Download the following package: http://bigdata.poly.edu/~tuananh/files/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.
-* What to submit
+* What to submit for these exercises:
-** Code: place your code in a public GitHub repository
+** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository
-** Results: put the results in your S3 bucket (don't forget to make it public)
+** Results: put the results in your S3 bucket (don't forget to make it public) [[http://bigdata.poly.edu/~tuananh/files/S3MakePublicInstruction.pdf instruction]]
-** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''
+** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on Oct 8, 2014'''
+** Office Hours: Oct 7 (Tue) from 3pm to 5pm, at 2 MetroTech Center, 10th floor, 10.053A
-== Exercise 0: WordCount ==
+== Hands-on exercises ==
-* Run the basic WordCount example on your local machine and AWS
+* '''Note''': Input for exercises: s3://mda2014/input/wikipedia.txt
-* Follow the instruction here to create your Amazon Elastic MapReduce (EMR): http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf
+* Exercise 0: WordCount
-* Instructions to run WordCount on your local machine and EMR cluster will be given in class
+** Run the basic WordCount example on your local machine and AWS
-* '''Note: You don't have to submit code and results for this exercise.'''
+** Follow the [http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf instructions] to create your Amazon Elastic MapReduce (EMR) cluster
+** Instructions to run WordCount on your local machine and EMR cluster will be given in class
+** '''Note: You don't have to submit code and results for this exercise.'''
-== Exercise 1: Fixed-Length WordCount ==
+* Exercise 1: Fixed-Length WordCount
+** For this exercise, you will only count words with 5 characters
+** Output: Key is the word, and value is the number of times the word appears in the input.
+* Exercise 2: InitialCount
+** Count the number of words based on their initial (first character), i.e., count the number of words per initial
+** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will be both counted for initial '''A'''
+** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).
-== Exercise 2: InitialCount ==
+* Exercise 3: Top-K WordCount
+** Output the top '''100''' most frequent '''7-character''' words, in descending order of frequency
-== Exercise 3 Top-K WordCount ==
+** Output: Key is the word, and value is the number of times the word appears in the input.
