Difference between revisions of "Course: Massive Data Analysis 2014/Hadoop Exercise"

From VistrailsWiki

Latest revision as of 20:46, 8 October 2014

Before you start

  • You must have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.
  • Download the following package: http://bigdata.poly.edu/~tuananh/files/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.
  • What to submit for these exercises:
    • Code: place your code for exercises 1, 2 and 3 in a public GitHub repository
    • Results: put the results in your S3 bucket (don't forget to make it public; instructions: http://bigdata.poly.edu/~tuananh/files/S3MakePublicInstruction.pdf)
    • Complete this form (http://bit.ly/1vAxovu) to submit the links to your GitHub repository and S3 bucket. Deadline: 11:59 PM on Oct 8, 2014
    • Office Hours: Oct 7 (Tue) from 3pm to 5pm, at 2 MetroTech Center, 10th floor, 10.053A

Hands-on exercises

  • Note: Input for exercises: s3://mda2014/input/wikipedia.txt
  • Exercise 0: WordCount
    • Run the basic WordCount example on your local machine and AWS
    • Follow the instructions (http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf) to create your Amazon Elastic MapReduce (EMR) cluster
    • Instructions to run WordCount on your local machine and EMR cluster will be given in class
    • Note: You don't have to submit code and results for this exercise.
  • Exercise 1: Fixed-Length WordCount
    • For this exercise, count only words that are exactly 5 characters long
    • Output: Key is the word, and value is the number of times the word appears in the input.
  • Exercise 2: InitialCount
    • Count the number of words based on their initial (first character), i.e., count the number of words per initial
    • Letter case should not be taken into account. For example, Apple and apple will both be counted for initial A
    • Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).
  • Exercise 3: Top-K WordCount
    • Output the top 100 most frequent 7-character words, in descending order of frequency
    • Output: Key is the word, and value is the number of times the word appears in the input.
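
The map and reduce logic behind Exercises 1 to 3 can be tried out locally before writing any Hadoop code. The sketch below is plain Python, not the Hadoop API: run_job is a hypothetical in-memory stand-in for a MapReduce job, and all function names and sample lines are illustrative. It assumes that a "word" is any whitespace-separated token (punctuation is not stripped) and that only ASCII letters count as initials; the exercise text does not specify either point, so adjust as needed.

```python
from collections import defaultdict

# Illustrative input lines; the real input is s3://mda2014/input/wikipedia.txt.
SAMPLE = ["Apple apple banana Avocado", "cherry banana apple grape mango"]
TOPK_SAMPLE = ["machine cluster machine reducer", "machine cluster mapper"]


def run_job(lines, mapper, reducer):
    """Hypothetical in-memory stand-in for one MapReduce job:
    map every line, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def sum_reducer(key, values):
    """Reduce step shared by all three exercises: add up the emitted 1s."""
    return sum(values)


def fixed_length_mapper(length):
    """Build a mapper that emits (word, 1) only for words of the given length."""
    def mapper(line):
        for word in line.split():  # assumption: words are whitespace-separated
            if len(word) == length:
                yield word, 1
    return mapper


def initial_mapper(line):
    """Emit (INITIAL, 1) per word, folding case so Apple and apple both map to A."""
    for word in line.split():
        first = word[0]
        # Assumption: only ASCII letters A to Z count as initials.
        if first.isascii() and first.isalpha():
            yield first.upper(), 1


def top_k(counts, k):
    """Return the k most frequent (word, count) pairs, most frequent first.
    Ties are broken alphabetically, which is a choice made for this sketch."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    print(run_job(SAMPLE, fixed_length_mapper(5), sum_reducer))   # Exercise 1
    print(run_job(SAMPLE, initial_mapper, sum_reducer))           # Exercise 2
    seven = run_job(TOPK_SAMPLE, fixed_length_mapper(7), sum_reducer)
    print(top_k(seven, 100))                                      # Exercise 3
```

Note that top_k here sorts a dictionary that fits in memory. On a cluster the final ranking has to be arranged differently, for example with a second job or a single reducer; which approach to use is left to the exercise.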