<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://www.vistrails.org//api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tuananh</id>
	<title>VistrailsWiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://www.vistrails.org//api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tuananh"/>
	<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php/Special:Contributions/Tuananh"/>
	<updated>2026-05-05T11:40:43Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.36.2</generator>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2015&amp;diff=9512</id>
		<title>Course: Big Data 2015</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2015&amp;diff=9512"/>
		<updated>2015-04-13T21:07:00Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004 - Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2015&lt;br /&gt;
&lt;br /&gt;
* Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver, room 208. &lt;br /&gt;
* Some classes will include a lab session, so please ''always bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 04/05/2015: New quizzes are available at http://www.newgradiance.com&lt;br /&gt;
* [[Big Data 2015: Final Project]]&lt;br /&gt;
* 2/26/2015: An Amazon AWS token was emailed to each student. Please create your Amazon AWS account. You can find instructions at: http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 2/26/2015: You should install the Cloudera VM on your laptop. We will need that for the lab on March 9th. Here are the instructions: [[Cloudera VM Setup]]&lt;br /&gt;
* There is a new version of the textbook Mining of Massive Datasets; we will use the latest version, 2.1.&lt;br /&gt;
&lt;br /&gt;
= Background (2 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Feb 2:  Course Overview; The evolution of Data Management and introduction to Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf&lt;br /&gt;
* Reading: Chapter 1 of Mining of Massive Datasets (version 1.1)&lt;br /&gt;
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL ==&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-intro.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-more.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab:&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Feb 16: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 23:  Introduction to Map Reduce ==&lt;br /&gt;
* Lab: (continuation)&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* Required Reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce. Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (v 2.1). Chapter 2, Sections 2.1, 2.2, and 2.3&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520&lt;br /&gt;
&lt;br /&gt;
* Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 4 - March 2: Algorithm Design for MapReduce: Relational Operations  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (local)&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (2nd Edition), Chapter 2.&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Map Reduce (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 5 - March 9: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce == &lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop on AWS&lt;br /&gt;
** Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip&lt;br /&gt;
** Install aws command-line interface: http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
&lt;br /&gt;
* Some links to AWS CLI documentation:&lt;br /&gt;
** http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
** http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html&lt;br /&gt;
** http://www.linux.com/learn/tutorials/761430-an-introduction-to-the-aws-command-line-tool&lt;br /&gt;
** EMR through the command line: https://www.safaribooksonline.com/library/view/programming-elastic-mapreduce/9781449364038/ch04.html&lt;br /&gt;
** Importing Key: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws&lt;br /&gt;
** EMR Job Flow: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMR_CreateJobFlow.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: check NYU Classes on March 10th&lt;br /&gt;
&lt;br /&gt;
== March 16th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - March 23: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on reproducibility. Before class, please&lt;br /&gt;
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads&lt;br /&gt;
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt&lt;br /&gt;
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt&lt;br /&gt;
** Download the hands-on instructions: http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf&lt;br /&gt;
** Questions? Email Fernando at fchirigati@nyu.edu&lt;br /&gt;
&lt;br /&gt;
* Programming assignment 4: Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 30th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 8 - April 6th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 10 - April 20th: Parallel Databases ==&lt;br /&gt;
&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/paralleldb-vs-hadoop-2014.pdf&lt;br /&gt;
&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 27th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 12 - May 4: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 13 - May 11: Project Presentations ==&lt;br /&gt;
&lt;br /&gt;
== Week 14 - May 18: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2015&amp;diff=9308</id>
		<title>Course: Big Data 2015</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2015&amp;diff=9308"/>
		<updated>2015-03-09T20:01:42Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Week 5: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004 - Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2015&lt;br /&gt;
&lt;br /&gt;
* Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver, room 208. &lt;br /&gt;
* Some classes will include a lab session, so please ''always bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 2/26/2015: An Amazon AWS token was emailed to each student. Please create your Amazon AWS account. You can find instructions at: http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 2/26/2015: You should install the Cloudera VM on your laptop. We will need that for the lab on March 9th. Here are the instructions: [[Cloudera VM Setup]]&lt;br /&gt;
* There is a new version of the textbook Mining of Massive Datasets; we will use the latest version, 2.1.&lt;br /&gt;
&lt;br /&gt;
= Background (2 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Feb 2:  Course Overview; The evolution of Data Management and introduction to Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf&lt;br /&gt;
* Reading: Chapter 1 of Mining of Massive Datasets (version 1.1)&lt;br /&gt;
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL ==&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-intro.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-more.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab:&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Feb 16: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 23:  Introduction to Map Reduce ==&lt;br /&gt;
* Lab: (continuation)&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* Required Reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce. Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (v 2.1). Chapter 2, Sections 2.1, 2.2, and 2.3&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520&lt;br /&gt;
&lt;br /&gt;
* Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 4: Algorithm Design for MapReduce: Relational Operations  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (2nd Edition), Chapter 2.&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Map Reduce&lt;br /&gt;
&lt;br /&gt;
== Week 5: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce == &lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/paralleldb-vs-hadoop-2014.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop&lt;br /&gt;
** Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- I have placed this version in http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/MapReduce-algorithms-Jan2013-draft.pdf)&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 6: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~fchirigati/mda-class/provenance-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: VisTrails&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Exploring urban data&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 8: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Huy Vo (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/vis_and_big_data_resized.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
* Assignment on frequent items and association rule mining. ''Due on Dec 7th.''  Check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 10:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** There are two new quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. They are ''due on May 5th''.&lt;br /&gt;
** Your final assignment is available at http://www.vistrails.org/index.php/Assignment_4_-_Querying_with_Pig_and_Mapreduce. This is an optional assignment and will count towards extra credit.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 12: TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 13: TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 14: Final Exam  ==&lt;br /&gt;
&lt;br /&gt;
== Week 15: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8360</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8360"/>
		<updated>2014-10-08T20:46:22Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Before you start */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account; refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://bigdata.poly.edu/~tuananh/files/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public) -- [http://bigdata.poly.edu/~tuananh/files/S3MakePublicInstruction.pdf instructions]&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on Oct 8, 2014'''&lt;br /&gt;
** Office Hours: Oct 7 (Tue) from 3pm to 5pm, at 2 MetroTech Center, 10th floor, 10.053A&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* '''Note''': Input for exercises: s3://mda2014/input/wikipedia.txt&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the [http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf instructions] to create your Amazon Elastic MapReduce (EMR) cluster&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top '''100''' most frequent '''7-character''' words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
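The exercises above can be sketched in plain Python as follows. This is a hedged, single-machine approximation for clarity only -- the assignment itself expects Hadoop map/reduce jobs run over s3://mda2014/input/wikipedia.txt, and the whitespace tokenization here is an assumption, not the graded specification.

```python
def word_counts(text):
    """Exercise 0 style word count: each word mapped to its frequency."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def fixed_length_counts(text, length=5):
    """Exercise 1: keep only words with exactly `length` characters."""
    return {w: c for w, c in word_counts(text).items() if len(w) == length}

def initial_counts(text):
    """Exercise 2: count words per uppercase initial, ignoring case."""
    counts = {}
    for word in text.split():
        initial = word[0].upper()
        if initial.isalpha():
            counts[initial] = counts.get(initial, 0) + 1
    return counts

def top_k(text, k=100, length=7):
    """Exercise 3: the k most frequent words of the given length,
    in descending order of frequency (ties broken alphabetically)."""
    freq = fixed_length_counts(text, length)
    return sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

In the Hadoop versions, each function body corresponds to a mapper-side filter plus the usual sum-per-key reducer; Exercise 3 additionally needs a final sorting pass over the reducer output.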
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8343</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8343"/>
		<updated>2014-10-06T15:31:24Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Hands-on exercises */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account; refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://bigdata.poly.edu/~tuananh/files/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* '''Note''': Input for exercises: s3://mda2014/input/wikipedia.txt&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the [http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf instructions] to create your Amazon Elastic MapReduce (EMR) cluster&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top '''100''' most frequent '''7-character''' words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=User:Tuananh&amp;diff=8333</id>
		<title>User:Tuananh</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=User:Tuananh&amp;diff=8333"/>
		<updated>2014-10-04T05:30:26Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: Created page with 'Homepage: http://bigdata.poly.edu/~tuananh'&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Homepage: http://bigdata.poly.edu/~tuananh&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8332</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8332"/>
		<updated>2014-10-04T05:28:38Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Before you start */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account; refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://bigdata.poly.edu/~tuananh/files/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the [http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf instructions] to create your Amazon Elastic MapReduce (EMR) cluster&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top '''100''' most frequent '''7-character''' words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8331</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8331"/>
		<updated>2014-10-03T20:10:04Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Hands-on exercises */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account; refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the [http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf instructions] to create your Amazon Elastic MapReduce (EMR) cluster&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top '''100''' most frequent '''7-character''' words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8330</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8330"/>
		<updated>2014-10-03T20:07:22Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Before you start */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top 100 most frequent 7-character words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8329</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8329"/>
		<updated>2014-10-03T20:07:04Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Before you start */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit for these exercises:&lt;br /&gt;
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top 100 most frequent 7-character words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8328</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8328"/>
		<updated>2014-10-03T20:05:47Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Hands-on exercises */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top 100 most frequent 7-character words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8327</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8327"/>
		<updated>2014-10-03T20:05:21Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Hands-on exercises */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top 100 most frequent 7-character words, in descending order of frequency&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8326</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8326"/>
		<updated>2014-10-03T20:05:00Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Hands-on exercises */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
** Output: Key is the word, and value is the number of times the word appears in the input.&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
** Count the number of words based on their initial (first character), i.e., count the number of words per initial&lt;br /&gt;
** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will both be counted for initial '''A'''&lt;br /&gt;
** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;br /&gt;
** Output the top 100 most frequent 7-character words, in descending order of frequency&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8325</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8325"/>
		<updated>2014-10-03T19:50:00Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Hands-on exercises ==&lt;br /&gt;
* Exercise 0: WordCount&lt;br /&gt;
** Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
** Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
** Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
** '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
* Exercise 1: Fixed-Length WordCount&lt;br /&gt;
** For this exercise, you will only count words with 5 characters&lt;br /&gt;
&lt;br /&gt;
* Exercise 2: InitialCount&lt;br /&gt;
&lt;br /&gt;
* Exercise 3: Top-K WordCount&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8324</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8324"/>
		<updated>2014-10-03T19:48:49Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Exercise 1: Fixed-Length WordCount */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 0: WordCount ==&lt;br /&gt;
* Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
* Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
* Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
* '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 1: Fixed-Length WordCount ==&lt;br /&gt;
* For this exercise, you will only count words with 5 characters&lt;br /&gt;
&lt;br /&gt;
== Exercise 2: InitialCount ==&lt;br /&gt;
&lt;br /&gt;
== Exercise 3: Top-K WordCount ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8323</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8323"/>
		<updated>2014-10-03T19:36:41Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Exercise 0: WordCount */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 0: WordCount ==&lt;br /&gt;
* Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
* Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf&lt;br /&gt;
* Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
* '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 1: Fixed-Length WordCount ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exercise 2: InitialCount ==&lt;br /&gt;
&lt;br /&gt;
== Exercise 3: Top-K WordCount ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8322</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8322"/>
		<updated>2014-10-03T19:32:47Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Exercise 0: WordCount */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 0: WordCount ==&lt;br /&gt;
* Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
* Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://example.com&lt;br /&gt;
* Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
* '''Note: You don't have to submit code and results for this exercise.'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 1: Fixed-Length WordCount ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exercise 2: InitialCount ==&lt;br /&gt;
&lt;br /&gt;
== Exercise 3: Top-K WordCount ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8321</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8321"/>
		<updated>2014-10-03T19:32:11Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Exercise 0: WordCount */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 0: WordCount ==&lt;br /&gt;
* Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
* Follow the instructions here to create your Amazon Elastic MapReduce (EMR) cluster: http://example.com&lt;br /&gt;
* Instructions to run WordCount on your local machine and EMR cluster will be given in class&lt;br /&gt;
&lt;br /&gt;
== Exercise 1: Fixed-Length WordCount ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exercise 2: InitialCount ==&lt;br /&gt;
&lt;br /&gt;
== Exercise 3: Top-K WordCount ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8320</id>
		<title>Course: Massive Data Analysis 2014/Hadoop Exercise</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Massive_Data_Analysis_2014/Hadoop_Exercise&amp;diff=8320"/>
		<updated>2014-10-03T19:29:23Z</updated>

		<summary type="html">&lt;p&gt;Tuananh: /* Before you start */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Before you start ==&lt;br /&gt;
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.&lt;br /&gt;
* Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.&lt;br /&gt;
* What to submit&lt;br /&gt;
** Code: place your code in a public GitHub repository&lt;br /&gt;
** Results: put the results in your S3 bucket (don't forget to make it public)&lt;br /&gt;
** Complete this [http://bit.ly/1vAxovu form] to add the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on the same day of class (Oct 6, 2014)'''&lt;br /&gt;
&lt;br /&gt;
== Exercise 0: WordCount ==&lt;br /&gt;
* Run the basic WordCount example on your local machine and AWS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exercise 1: Fixed-Length WordCount ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exercise 2: InitialCount ==&lt;br /&gt;
&lt;br /&gt;
== Exercise 3: Top-K WordCount ==&lt;/div&gt;</summary>
		<author><name>Tuananh</name></author>
	</entry>
</feed>