Course: Big Data 2016
DS-GA 1004 - Big Data: Tentative Schedule -- subject to change
- Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016
- Instructors:
- Professor Juliana Freire (http://vgc.poly.edu/~juliana)
- Dr. Erin C Carson
- Dr. Nicholas Knight
- TAs:
- Yuan Feng
- Kevin Ye
- Lecture: Mondays, 4:55pm-7:35pm at Silver 207
- Some classes will include a lab session; please always bring your laptop.
News
- 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup
- 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See NYU HPC Access Instructions
Week 1 - Jan 25: Course Overview
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf
- Lab: Computing infrastructure for the course
- Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)
- Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form
Week 2 - Feb 1: The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model
- Lecture notes:
- Lab: getting started with MySQL
- Required Reading:
- Chapter 1 of Mining of Massive Datasets
- Suggested Reading:
Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.)
- Lecture notes:
- Lab: SQL
- Programming assignment: Using SQL for data analysis and cleaning (check NYU Classes)
Week 4 - Feb 15: Holiday
Transparency and Reproducibility (1 week)
Week 5 - Feb 22: Data Exploration and Reproducibility
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-science-reproducibility.pdf
- Lab: Hands-on reproducibility.
- Programming assignment: Exploring urban data (see NYU Classes)
Big Data Foundations and Infrastructure (3 weeks)
Week 6 - Feb 29: Introduction to Map Reduce
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf
- Lab: Hands-on Hadoop (local and AWS)
- Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services
Week 7 - March 7: MapReduce Algorithm Design Patterns
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf
- Lab: Hands-on Hadoop (HPC)
- Programming assignment: Map Reduce (check NYU Classes)
- Readings:
- Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
- Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)
Week 8 - March 14th: Spring Break
Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf
- Lab: NoSQL
- Programming assignment: Pig and Spark
- Readings:
- Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
- MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
- Additional Suggested reading:
- Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609
- Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726
- BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf
- Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf
Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)
Week 10 - March 28th: Finding similar items
- Reading: Chapter 3 of Mining of Massive Datasets
- Homework Assignment
- See quizzes on Gradiance -- Distance measures and document similarity.
Week 11 - April 4th: Association Rules
- Reading: Chapter 6 of Mining of Massive Datasets
- Suggested additional reading:
- Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
- Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
- Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html
- Homework Assignment
- See quizzes on Gradiance -- Distance measures and document similarity.
Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS)
- Lab: Using Amazon AWS to analyze and visualize taxi data
Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&T Research
Week 14 - April 25th: Graph Analysis
- Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms