Difference between revisions of "Course: Big Data 2015"

From VistrailsWiki
 
(39 intermediate revisions by 3 users not shown)
Line 8: Line 8:
* Some classes will include a lab session; please ''always bring your laptop.''


= Background (4 weeks) =
= News =
 
* 04/05/2015: New quizzes are available at http://www.newgradiance.com
* [[Big Data 2015: Final Project]]
* 2/26/2015: An Amazon AWS token was emailed to each student. Please create your Amazon AWS account. You can find instructions at: http://www.vistrails.org/index.php/AWS_Setup
* 2/26/2015: You should install the Cloudera VM on your laptop. We will need that for the lab on March 9th. Here are the instructions: [[Cloudera VM Setup]]
* There is a new version of the textbook Mining of Massive Datasets; we will use the latest version, 2.1.
 
= Background (2 weeks) =


== Week 1 - Feb 2:  Course Overview; The evolution of Data Management and introduction to Big Data ==
Line 16: Line 24:
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form


== Week 2: Introduction to Databases, Relational Model and SQL ==
== Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL ==
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf
* Lecture notes:   
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/relational-algebra.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-intro.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-more.pdf
* Lab:
** SQL hands on: [[Big Data 2015 - SQL Lab]]


* Other useful reading:  
Line 26: Line 38:
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]


== Week 3: Other Data Models and Query Optimization ==
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)
* Lecture notes: 
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/xml_schema_query.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/query-opt.pdf


* Lab: SQL
== Feb 16: Holiday ==


* Programming assignment: Using SQL for data analysis and cleaning
= Big Data Foundations and Infrastructure (3 weeks) =


== Week 4: Data Exploration and Reproducibility ==
== Week 3 - Feb 23: Introduction to Map Reduce ==
* Lab: (continuation)
** SQL hands on: [[Big Data 2015 - SQL Lab]]
* Lecture notes: 
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf
* Required Reading:
** Data-Intensive Text Processing with MapReduce. Chapters 1 and 2
** Mining of Massive Datasets (v 2.1).  Chapter 2 - 2.1, 2.2, and 2.3
* Other useful reading:
** Hadoop: The Definitive Guide. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520


* Lecture notes:  http://vgc.poly.edu/~fchirigati/mda-class/provenance-reproducibility.pdf
* Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services


* Lab: VisTrails


* Programming assignment: Exploring urban data
== Week 4 - March 2: Algorithm Design for MapReduce: Relational Operations  ==


= Big Data Foundations and Infrastructure (3 weeks) =
* Lecture notes: 
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-relations.pdf


== Week 5: Cloud computing, Map Reduce and  Hadoop ==
* Lab: Hands-on Hadoop (local)
* Lecture notes: 
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf


* Required reading:  
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
** Mining of Massive Datasets (2nd Edition), Chapter 2 - 2.1 and 2.2 (Large-Scale File Systems and Map-Reduce).
** Mining of Massive Datasets (2nd Edition), Chapter 2.


* Other useful reading:  
* Programming assignment: Map Reduce (check NYU Classes)
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
 
== Week 5 - March 9: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce ==


* Lab: Hands-on Hadoop
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-patterns.pdf


* Homework Assignment -- Your first quiz is available on [http://www.newgradiance.com Gradiance]. It is ''due on March 17th at 5pm.''
* Lab: Hands-on Hadoop on AWS
** Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip
** Install aws command-line interface: http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html


== Week 6: Algorithm Design for MapReduce  ==
* Some links to AWS CLI documentation:
** http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html
** http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html
** http://www.linux.com/learn/tutorials/761430-an-introduction-to-the-aws-command-line-tool
** EMR through the command line: https://www.safaribooksonline.com/library/view/programming-elastic-mapreduce/9781449364038/ch04.html
** Importing Key: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws
** EMR Job Flow: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMR_CreateJobFlow.html


* Lecture notes: 
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design.pdf


* Required reading:  
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
** Mining of Massive Datasets (2nd Edition), Chapter 2.
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)
 
* Programming assignment: check NYU Classes on March 10th
 
== March 16th: Spring Break ==
 
 
= Transparency and Reproducibility  (1 week) =


== Week 6 - March 23: Data Exploration and Reproducibility  ==


== Week 7: Parallel Databases vs MapReduce, Query Processing on Mapreduce and High-level Languages ==
* Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf


* Lecture notes:
* Lab: Hands-on reproducibility. Before class, please:
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/paralleldb-vs-hadoop-2014.pdf
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/data-analysis-mapreduce.pdf
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf
** Questions? Email Fernando at fchirigati@nyu.edu


* Programming assignment 4: Exploring urban data (see NYU Classes)


= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =


== Week 8: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Huy Vo (NYU CUSP) ==
== Week 7 - March 30th: Finding similar items  ==


* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/vis_and_big_data_resized.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf


* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]


== Week 9: Association Rules  ==
* Homework Assignment
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.
 
== Week 8 - April 6th: Association Rules  ==


* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/association-rules.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf


* Assignment on frequent items and association rule mining. ''Due on Dec 7th.''  Check http://www.newgradiance.com/services


* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]
Line 99: Line 138:
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html


* Homework Assignment
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.


== Week 10: Finding similar items  ==
== Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==


* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf
 
* Lab: Using Amazon AWS to analyze and visualize taxi data
** https://github.com/ViDA-NYU/aws_taxi
 
== Week 10 - April 20th: Parallel Databases ==
 
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf


* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]
* Required reading:
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext


* Homework Assignment
* Suggested reading:
** There are two new quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. They are ''due on May 5th.''
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609
** Your final assignment is available at http://www.vistrails.org/index.php/Assignment_4_-_Querying_with_Pig_and_Mapreduce. This is an optional assignment and will count towards extra credit
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf


== Week 11: Graph Analysis ==
== Week 11 - April 27th: Graph Analysis ==


* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf


== Week 12: TBD ==
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms


== Week 13: TBD ==
== Week 12 - May 4: Final Exam ==


== Week 14: Final Exam  ==
== Week 13 - May 11: Project Presentations ==


== Week 15: Project Presentations ==
== Week 14 - May 18: Project Presentations ==

Latest revision as of 01:13, 29 April 2015

DS-GA 1004 - Big Data: Tentative Schedule -- subject to change

  • Lecture: Mondays, 4:55pm-7:35pm at Silver, room 208.
  • Some classes will include a lab session; please always bring your laptop.

News

Background (2 weeks)

Week 1 - Feb 2: Course Overview; The evolution of Data Management and introduction to Big Data

Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL

  • Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)

Feb 16: Holiday

Big Data Foundations and Infrastructure (3 weeks)

Week 3 - Feb 23: Introduction to Map Reduce


Week 4 - March 2: Algorithm Design for MapReduce: Relational Operations

  • Lab: Hands-on Hadoop (local)
  • Required reading:
    • Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
    • Mining of Massive Datasets (2nd Edition), Chapter 2.
  • Programming assignment: Map Reduce (check NYU Classes)

Week 5 - March 9: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce


  • Programming assignment: check NYU Classes on March 10th

March 16th: Spring Break

Transparency and Reproducibility (1 week)

Week 6 - March 23: Data Exploration and Reproducibility

  • Programming assignment 4: Exploring urban data (see NYU Classes)

Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)

Week 7 - March 30th: Finding similar items

  • Homework Assignment
    • See quizzes on Gradiance -- Distance measures and document similarity.

Week 8 - April 6th: Association Rules


  • Suggested additional reading:
    • Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
    • Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
    • Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html
  • Homework Assignment
    • See quizzes on Gradiance -- Distance measures and document similarity.

Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP)

Week 10 - April 20th: Parallel Databases

Week 11 - April 27th: Graph Analysis

  • Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms

Week 12 - May 4: Final Exam

Week 13 - May 11: Project Presentations

Week 14 - May 18: Project Presentations