Difference between revisions of "Course: Big Data 2016"

Line 13: Line 13:
* 1/25/2016: Access your NYU HPC account, which you will use for in-class exercises and homework assignments. See [[NYU HPC Access Instructions]]
* 1/25/2016: Access your NYU HPC account, which you will use for in-class exercises and homework assignments. See [[NYU HPC Access Instructions]]


= Background (2 weeks) =
== Week 1 - Jan 25:  Course Overview; The evolution of Data Management and introduction to Big Data ==


== Week 1 - Feb 2:  Course Overview; The evolution of Data Management and introduction to Big Data ==
 
== Week 2 - Feb 1:  Course Overview; The evolution of Data Management and introduction to Big Data ==


* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf
Line 21: Line 22:
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form


== Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL ==
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL ==
* Lecture notes:   
* Lecture notes:   
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf
Line 37: Line 38:
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)


== Feb 16: Holiday ==
== Week 4 - Feb 15: Holiday ==


= Big Data Foundations and Infrastructure (3 weeks) =
= Big Data Foundations and Infrastructure (3 weeks) =


== Week 3 - Feb 23:  Introduction to Map Reduce ==
== Week 5 - Feb 22:  Introduction to Map Reduce ==
* Lab: (continuation)
* Lab: (continuation)
** SQL hands on: [[Big Data 2015 - SQL Lab]]
** SQL hands on: [[Big Data 2015 - SQL Lab]]
Line 55: Line 56:




== Week 4 - March 2: Algorithm Design for MapReduce: Relational Operations  ==
== Week 6 - Feb 29: Algorithm Design for MapReduce: Relational Operations  ==


* Lecture notes:   
* Lecture notes:   
Line 68: Line 69:
* Programming assignment: Map Reduce (check NYU Classes)
* Programming assignment: Map Reduce (check NYU Classes)


== Week 5 - March 9: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce ==  
== Week 7 - March 7: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce ==  


* Lecture notes:
* Lecture notes:
Line 92: Line 93:
* Programming assignment: check NYU Classes on March 10th
* Programming assignment: check NYU Classes on March 10th


== March 16th: Spring Break ==
== Week 8 - March 14th: Spring Break ==




= Transparency and Reproducibility  (1 week) =
= Transparency and Reproducibility  (1 week) =


== Week 6 - March 23: Data Exploration and Reproducibility  ==
== Week 9 - March 21: Data Exploration and Reproducibility  ==


* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf
Line 112: Line 113:
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =


== Week 7 - March 30th:  Finding similar items  ==
== Week 10 - March 28th:  Finding similar items  ==


* Lecture notes:
* Lecture notes:
Line 122: Line 123:
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.  
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.  


== Week 8 - April 6th: Association Rules  ==
== Week 11 - April 4th: Association Rules  ==


* Lecture notes:
* Lecture notes:
Line 138: Line 139:
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.


== Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==


* Lecture notes:
* Lecture notes:
Line 146: Line 147:
** https://github.com/ViDA-NYU/aws_taxi
** https://github.com/ViDA-NYU/aws_taxi


== Week 10 - April 20th: Parallel Databases ==
== Week 13 - April 18th: Parallel Databases ==


* Lecture notes:
* Lecture notes:
Line 161: Line 162:
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf


== Week 11 - April 27th: Graph Analysis ==
== Week 14 - April 25th: Graph Analysis ==


* Lecture notes:
* Lecture notes:
Line 168: Line 169:
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms


== Week 12 - May 4: Final Exam ==
== Week 15 - May 2: Final Exam ==


== Week 13 - May 11: Project Presentations ==
== Week 16 - May 9: Project Presentations ==


== Week 14 - May 18: Project Presentations ==
== Week 17 - May 16: Project Presentations ==

Revision as of 23:02, 6 January 2016

DS-GA 1004 - Big Data: Tentative Schedule -- subject to change

  • Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102.
  • Some classes will include a lab session, so please always bring your laptop.

News

Week 1 - Jan 25: Course Overview; The evolution of Data Management and introduction to Big Data

Week 2 - Feb 1: Course Overview; The evolution of Data Management and introduction to Big Data

Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL

  • Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)

Week 4 - Feb 15: Holiday

Big Data Foundations and Infrastructure (3 weeks)

Week 5 - Feb 22: Introduction to Map Reduce


Week 6 - Feb 29: Algorithm Design for MapReduce: Relational Operations

  • Lab: Hands-on Hadoop (local)
  • Required reading:
    • Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
    • Mining of Massive Datasets (2nd Edition), Chapter 2.
  • Programming assignment: Map Reduce (check NYU Classes)

Week 7 - March 7: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce


  • Programming assignment: check NYU Classes on March 10th

Week 8 - March 14th: Spring Break

Transparency and Reproducibility (1 week)

Week 9 - March 21: Data Exploration and Reproducibility

  • Programming assignment 4: Exploring urban data (see NYU Classes)

Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)

Week 10 - March 28th: Finding similar items

  • Homework Assignment
    • See quizzes on Gradiance -- Distance measures and document similarity.

Week 11 - April 4th: Association Rules


  • Suggested additional reading:
    • Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
    • Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
    • Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html
  • Homework Assignment
    • See quizzes on Gradiance -- Distance measures and document similarity.

Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP)

Week 13 - April 18th: Parallel Databases

Week 14 - April 25th: Graph Analysis

  • Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms

Week 15 - May 2: Final Exam

Week 16 - May 9: Project Presentations

Week 17 - May 16: Project Presentations