<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://www.vistrails.org//api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tuananh</id>
	<title>VistrailsWiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://www.vistrails.org//api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tuananh"/>
	<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php/Special:Contributions/Tuananh"/>
	<updated>2026-05-05T11:40:43Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.36.2</generator>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2015&amp;diff=9512</id>
		<title>Course: Big Data 2015</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2015&amp;diff=9512"/>
		<updated>2015-04-13T21:07:00Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004 - Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2015&lt;br /&gt;
&lt;br /&gt;
* Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver, room 208. &lt;br /&gt;
* Some classes will include a lab session, so please ''always bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 04/05/2015: New quizzes are available at http://www.newgradiance.com&lt;br /&gt;
* [[Big Data 2015: Final Project]]&lt;br /&gt;
* 2/26/2015: An Amazon AWS token was emailed to each student. Please create your Amazon AWS account. You can find instructions at: http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 2/26/2015: You should install the Cloudera VM on your laptop. We will need that for the lab on March 9th. Here are the instructions: [[Cloudera VM Setup]]&lt;br /&gt;
* There is a new version of the textbook Mining of Massive Datasets; we will use the latest version, 2.1.&lt;br /&gt;
&lt;br /&gt;
= Background (2 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Feb 2:  Course Overview; The evolution of Data Management and introduction to Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf&lt;br /&gt;
* Reading: Chapter 1 of Mining of Massive Datasets (version 1.1)&lt;br /&gt;
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL ==&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-intro.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-more.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab:&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Feb 16: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 23:  Introduction to Map Reduce ==&lt;br /&gt;
* Lab: (continuation)&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* Required Reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce. Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (v 2.1). Chapter 2, Sections 2.1, 2.2, and 2.3&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520&lt;br /&gt;
&lt;br /&gt;
* Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 4 - March 2: Algorithm Design for MapReduce: Relational Operations  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (local)&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (2nd Edition), Chapter 2.&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Map Reduce (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 5 - March 9: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce == &lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop on AWS&lt;br /&gt;
** Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip&lt;br /&gt;
** Install aws command-line interface: http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
&lt;br /&gt;
* Some links to AWS CLI documentation:&lt;br /&gt;
** http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
** http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html&lt;br /&gt;
** http://www.linux.com/learn/tutorials/761430-an-introduction-to-the-aws-command-line-tool&lt;br /&gt;
** EMR through the command line: https://www.safaribooksonline.com/library/view/programming-elastic-mapreduce/9781449364038/ch04.html&lt;br /&gt;
** Importing Key: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws&lt;br /&gt;
** EMR Job Flow: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMR_CreateJobFlow.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: check NYU Classes on March 10th&lt;br /&gt;
&lt;br /&gt;
== March 16th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - March 23: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on reproducibility. Before class, please&lt;br /&gt;
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads&lt;br /&gt;
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt&lt;br /&gt;
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt&lt;br /&gt;
** Download the hands-on instructions: http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf&lt;br /&gt;
** Questions? Email Fernando at fchirigati@nyu.edu&lt;br /&gt;
&lt;br /&gt;
* Programming assignment 4: Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 30th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 8 - April 6th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 10 - April 20th: Parallel Databases ==&lt;br /&gt;
&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/paralleldb-vs-hadoop-2014.pdf&lt;br /&gt;
&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 27th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 12 - May 4: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 13 - May 11: Project Presentations ==&lt;br /&gt;
&lt;br /&gt;
== Week 14 - May 18: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2015&amp;diff=9308</id>
		<title>Course: Big Data 2015</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2015&amp;diff=9308"/>
		<updated>2015-03-09T20:01:42Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Week 5: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004 - Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2015&lt;br /&gt;
&lt;br /&gt;
* Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver, room 208. &lt;br /&gt;
* Some classes will include a lab session, so please ''always bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 2/26/2015: An Amazon AWS token was emailed to each student. Please create your Amazon AWS account. You can find instructions at: http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 2/26/2015: You should install the Cloudera VM on your laptop. We will need that for the lab on March 9th. Here are the instructions: [[Cloudera VM Setup]]&lt;br /&gt;
* There is a new version of the textbook Mining of Massive Datasets; we will use the latest version, 2.1.&lt;br /&gt;
&lt;br /&gt;
= Background (2 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Feb 2:  Course Overview; The evolution of Data Management and introduction to Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf&lt;br /&gt;
* Reading: Chapter 1 of Mining of Massive Datasets (version 1.1)&lt;br /&gt;
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL ==&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-intro.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-more.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab:&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Feb 16: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 23:  Introduction to Map Reduce ==&lt;br /&gt;
* Lab: (continuation)&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* Required Reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce. Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (v 2.1). Chapter 2, Sections 2.1, 2.2, and 2.3&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520&lt;br /&gt;
&lt;br /&gt;
* Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 4: Algorithm Design for MapReduce: Relational Operations  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (2nd Edition), Chapter 2.&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Map Reduce&lt;br /&gt;
&lt;br /&gt;
== Week 5: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce == &lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/paralleldb-vs-hadoop-2014.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop&lt;br /&gt;
** Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- I have placed this version in http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/MapReduce-algorithms-Jan2013-draft.pdf)&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 6: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~fchirigati/mda-class/provenance-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: VisTrails&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Exploring urban data&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 8: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Huy Vo (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/vis_and_big_data_resized.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
* Assignment on frequent items and association rule mining. ''Due on Dec 7th.''  Check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 10:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** There are two new quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. They are ''due on May 5th''.&lt;br /&gt;
** Your final assignment is available at http://www.vistrails.org/index.php/Assignment_4_-_Querying_with_Pig_and_Mapreduce. This is an optional assignment and will count towards extra credit.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 12: TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 13: TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 14: Final Exam  ==&lt;br /&gt;
&lt;br /&gt;
== Week 15: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8360</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8360"/>
		<updated>2014-10-08T20:46:22Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Before you start */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account; refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://bigdata.poly.edu/~tuananh/files/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public) -- [http://bigdata.poly.edu/~tuananh/files/S3MakePublicInstruction.pdf instructions]&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on Oct 8, 2014'''&lt;br /&gt;
** Office Hours: Oct 7 (Tue) from 3pm to 5pm, at 2 MetroTech Center, 10th floor, 10.053A&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* '''Note''': Input for exercises: s3://mda2014/input/wikipedia.txt&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the [http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf instructions] to create your Amazon Elastic MapReduce (EMR) cluster&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top '''100''' most frequent '''7-character''' words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
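The exercises above can be sketched in plain Python as follows. This is a hedged, single-machine approximation for clarity only -- the assignment itself expects Hadoop map/reduce jobs run over s3://mda2014/input/wikipedia.txt, and the whitespace tokenization here is an assumption, not the graded specification.

```python
def word_counts(text):
    """Exercise 0 style word count: each word mapped to its frequency."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def fixed_length_counts(text, length=5):
    """Exercise 1: keep only words with exactly `length` characters."""
    return {w: c for w, c in word_counts(text).items() if len(w) == length}

def initial_counts(text):
    """Exercise 2: count words per uppercase initial, ignoring case."""
    counts = {}
    for word in text.split():
        initial = word[0].upper()
        if initial.isalpha():
            counts[initial] = counts.get(initial, 0) + 1
    return counts

def top_k(text, k=100, length=7):
    """Exercise 3: the k most frequent words of the given length,
    in descending order of frequency (ties broken alphabetically)."""
    freq = fixed_length_counts(text, length)
    return sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

In the Hadoop versions, each function body corresponds to a mapper-side filter plus the usual sum-per-key reducer; Exercise 3 additionally needs a final sorting pass over the reducer output.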
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8343</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8343"/>
		<updated>2014-10-06T15:31:24Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Hands-on exercises */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account; refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://bigdata.poly.edu/~tuananh/files/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* '''Note''': Input for exercises: s3://mda2014/input/wikipedia.txt&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the [http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf instructions] to create your Amazon Elastic MapReduce (EMR) cluster&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top '''100''' most frequent '''7-character''' words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=User:Tuananh&amp;diff=8333</id>
		<title>User:Tuananh</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=User:Tuananh&amp;diff=8333"/>
		<updated>2014-10-04T05:30:26Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: Created page with 'Homepage: http://bigdata.poly.edu/~tuananh'&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Homepage: http://bigdata.poly.edu/~tuananh&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8332</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8332"/>
		<updated>2014-10-04T05:28:38Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Before you start */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account; refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://bigdata.poly.edu/~tuananh/files/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the [http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf instructions] to create your Amazon Elastic MapReduce (EMR) cluster&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top '''100''' most frequent '''7-character''' words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8331</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8331"/>
		<updated>2014-10-03T20:10:04Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Hands-on exercises */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account; refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the [http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf instructions] to create your Amazon Elastic MapReduce (EMR) cluster&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top '''100''' most frequent '''7-character''' words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8330</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8330"/>
		<updated>2014-10-03T20:07:22Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Before you start */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top 100 most frequent 7-character words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8329</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8329"/>
		<updated>2014-10-03T20:07:04Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Before you start */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top 100 most frequent 7-character words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8328</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8328"/>
		<updated>2014-10-03T20:05:47Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Hands-on exercises */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top 100 most frequent 7-character words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8327</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8327"/>
		<updated>2014-10-03T20:05:21Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Hands-on exercises */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top 100 most frequent 7-character words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8326</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8326"/>
		<updated>2014-10-03T20:05:00Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Hands-on exercises */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top 100 most frequent 7-character words, in descending order of frequency&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8325</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8325"/>
		<updated>2014-10-03T19:50:00Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8324</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8324"/>
		<updated>2014-10-03T19:48:49Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Exercise 1: Fixed-Length WordCount */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 0: WordCount ==&lt;br /&gt;
* Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
* Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
* Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
* '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 1: Fixed-Length WordCount ==&lt;br /&gt;
* For this exercise, you will only count words with 5 characters&lt;br /&gt;
&lt;br /&gt;
== Exercise 2: InitialCount ==&lt;br /&gt;
&lt;br /&gt;
== Exercise 3: Top-K WordCount ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8323</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8323"/>
		<updated>2014-10-03T19:36:41Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Exercise 0: WordCount */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 0: WordCount ==&lt;br /&gt;
* Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
* Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
* Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
* '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 1: Fixed-Length WordCount ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exercise 2: InitialCount ==&lt;br /&gt;
&lt;br /&gt;
== Exercise 3: Top-K WordCount ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8322</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8322"/>
		<updated>2014-10-03T19:32:47Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Exercise 0: WordCount */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 0: WordCount ==&lt;br /&gt;
* Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
* Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://example.com&lt;br /&gt;
* Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
* '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 1: Fixed-Length WordCount ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exercise 2: InitialCount ==&lt;br /&gt;
&lt;br /&gt;
== Exercise 3: Top-K WordCount ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8321</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8321"/>
		<updated>2014-10-03T19:32:11Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Exercise 0: WordCount */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 0: WordCount ==&lt;br /&gt;
* Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
* Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://example.com&lt;br /&gt;
* Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
&lt;br /&gt;
== Exercise 1: Fixed-Length WordCount ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exercise 2: InitialCount ==&lt;br /&gt;
&lt;br /&gt;
== Exercise 3: Top-K WordCount ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8320</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8320"/>
		<updated>2014-10-03T19:29:23Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Before you start */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 0: WordCount ==&lt;br /&gt;
* Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exercise 1: Fixed-Length WordCount ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exercise 2: InitialCount ==&lt;br /&gt;
&lt;br /&gt;
== Exercise 3: Top-K WordCount ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
</feed>