Difference between revisions of "Course: Big Data 2015"

From VistrailsWiki
Line 13: Line 13:
== There is a new version of the textbook Mining of Massive Datasets, we will use the latest version 2.1 ==
== There is a new version of the textbook Mining of Massive Datasets, we will use the latest version 2.1 ==


= Background (4 weeks) =
= Background (2 weeks) =


== Week 1 - Feb 2:  Course Overview; The evolution of Data Management and introduction to Big Data ==
== Week 1 - Feb 2:  Course Overview; The evolution of Data Management and introduction to Big Data ==
Line 35: Line 35:
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]


== Week 3:  Other Data Models and  Query Optimization ==
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)
* Lecture notes: 
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/xml_schema_query.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/query-opt.pdf
 
* Lab: SQL
 
* Programming assignment: Using SQL for data analysis and cleaning
 
== Week 4: Data Exploration and Reproducibility  ==
 
* Lecture notes:  http://vgc.poly.edu/~fchirigati/mda-class/provenance-reproducibility.pdf
 
* Lab: VisTrails


* Programming assignment: Exploring urban data
== Feb 16: Holiday ==


= Big Data Foundations and Infrastructure (3 weeks) =
= Big Data Foundations and Infrastructure (3 weeks) =


== Week 5: Cloud computing, Map Reduce and  Hadoop ==
== Week 3 - Feb 23: Introduction to Map Reduce ==
* Lab: (continuation)
** SQL hands on: [[Big Data 2015 - SQL Lab]]
* Lecture notes:   
* Lecture notes:   
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf
 
* Required Reading:  
* Required reading:  
** Data-Intensive Text Processing with MapReduce. Chapters 1 and 2
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
** Mining of Massive Datasets (v 2.1)Chapter 2 - 2.1, 2.2, and 2.3
** Mining of Massive Datasets (2nd Edition), Chapter 2 - 2.1 and 2.2 (Large-Scale File Systems and Map-Reduce).
 
* Other useful reading:  
* Other useful reading:  
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520


* Lab: Hands-on Hadoop
* Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services


* Homework Assignment -- Your first quiz is available on [http://www.newgradiance.com Gradiance]. It is ''due on March 17th at 5pm.''


== Week 6: Algorithm Design for MapReduce  ==
== Week 4: Algorithm Design for MapReduce  ==


* Lecture notes:   
* Lecture notes:   
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design.pdf
* Lab: Hands-on Hadoop


* Required reading:  
* Required reading:  
Line 78: Line 66:
** Mining of Massive Datasets (2nd Edition), Chapter 2.
** Mining of Massive Datasets (2nd Edition), Chapter 2.


* Programming assignment: Map Reduce


== Week 7: Parallel Databases vs MapReduce, Query Processing on Mapreduce and High-level Languages ==
== Week 5: Parallel Databases vs MapReduce, Query Processing on Mapreduce and High-level Languages ==


* Lecture notes:
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/paralleldb-vs-hadoop-2014.pdf
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/paralleldb-vs-hadoop-2014.pdf
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/data-analysis-mapreduce.pdf
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/data-analysis-mapreduce.pdf
= Transparency and Reproducibility  (1 week) =
== Week 6: Data Exploration and Reproducibility  ==
* Lecture notes:  http://vgc.poly.edu/~fchirigati/mda-class/provenance-reproducibility.pdf
* Lab: VisTrails
* Programming assignment: Exploring urban data





Revision as of 05:52, 23 February 2015

DS-GA 1004 - Big Data: Tentative Schedule -- subject to change

  • Lecture: Mondays, 4:55pm-7:35pm at Silver, room 208.
  • Some classes will include a lab session; please always bring your laptop.

News

02/10/2015: Programming assignment 1 posted. Check NYU Classes!

There is a new version of the textbook Mining of Massive Datasets; we will use the latest version, 2.1.

Background (2 weeks)

Week 1 - Feb 2: Course Overview; The evolution of Data Management and introduction to Big Data

Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL

  • Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)

Feb 16: Holiday

Big Data Foundations and Infrastructure (3 weeks)

Week 3 - Feb 23: Introduction to Map Reduce


Week 4: Algorithm Design for MapReduce

  • Lab: Hands-on Hadoop
  • Required reading:
    • Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
    • Mining of Massive Datasets (2nd Edition), Chapter 2.
  • Programming assignment: Map Reduce

Week 5: Parallel Databases vs MapReduce, Query Processing on MapReduce and High-level Languages


Transparency and Reproducibility (1 week)

Week 6: Data Exploration and Reproducibility

  • Lab: VisTrails
  • Programming assignment: Exploring urban data


Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)

Week 8: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Huy Vo (NYU CUSP)


Week 9: Association Rules

  • Suggested additional reading:
    • Fast Algorithms for Mining Association Rules. Agrawal and Srikant, VLDB 1994.
    • Data Mining: Concepts and Techniques. Jiawei Han and Micheline Kamber, Morgan Kaufmann.
    • Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html


Week 10: Finding similar items

Week 11: Graph Analysis

Week 12: TBD

Week 13: TBD

Week 14: Final Exam

Week 15: Project Presentations