Course: Big Data 2016
DS-GA 1004 - Big Data: Tentative Schedule -- subject to change
- Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016
- Instructors:
- Professor Juliana Freire (http://vgc.poly.edu/~juliana)
- Dr. Erin C Carson
- Dr. Nicholas Knight
- TAs:
- Yuan Feng
- Kevin Ye
- Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102.
- Some classes will include a lab session, so please always bring your laptop.
News
- 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup
- 1/25/2016: Access your NYU HPC account, which you will use for in-class exercises and homework assignments. See NYU HPC Access Instructions
Week 1 - Jan 25: Course Overview; Lab: Computing infrastructure for the course
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf
- Reading: Chapter 1 of Mining of Massive Datasets (version 1.1)
- Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form
Week 2 - Feb 1: The evolution of Data Management and introduction to Big Data; Introduction to Databases, Relational Model and SQL
- In-class assignment: relational algebra
Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.)
- Lab: SQL
- Programming assignment: Using SQL for data analysis and cleaning
Week 4 - Feb 15: Holiday
Big Data Foundations and Infrastructure (3 weeks)
Week 5 - Feb 22: Introduction to Map Reduce
- Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services
- Lab: Hands-on Hadoop (local and AWS)
Week 6 - Feb 29: MapReduce Algorithm Design Patterns
- Lab: Hands-on Hadoop (HPC)
- Programming assignment: Map Reduce (check NYU Classes)
Week 7 - March 7: Parallel Databases vs MapReduce; Introduction to SPARK
- Lab: Hands-on SPARK (HPC)
- Programming assignment: check NYU Classes on March 10th
Week 8 - March 14th: Spring Break
Transparency and Reproducibility (1 week)
Week 9 - March 21: Data Exploration and Reproducibility
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf
- Lab: Hands-on reproducibility. Before class, please:
- Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads
- Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt
- Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt
- http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf
- Questions? Email Fernando at fchirigati@nyu.edu
- Programming assignment 4: Exploring urban data (see NYU Classes)
Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)
Week 10 - March 28th: Finding similar items
- Reading: Chapter 3 of Mining of Massive Datasets
- Homework Assignment
- See quizzes on Gradiance -- Distance measures and document similarity.
Week 11 - April 4th: Association Rules
- Reading: Chapter 6 of Mining of Massive Datasets
- Suggested additional reading:
- Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
- Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
- Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html
- Homework Assignment
- See quizzes on Gradiance -- Distance measures and document similarity.
Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP)
- Lab: Using Amazon AWS to analyze and visualize taxi data
Week 13 - April 18th: Parallel Databases
- Lecture notes:
- Required reading:
- Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
- MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
- Suggested reading:
- Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609
- Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726
- BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf
- Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf
Week 14 - April 25th: Graph Analysis
- Required reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms