Course: Big Data 2016


DS-GA 1004 - Big Data: Tentative Schedule -- subject to change

  • TAs:
    • Yuan Feng
    • Kevin Ye
  • Lecture: Mondays, 4:55pm-7:35pm at Silver 207
  • Some classes will include a lab session, so please always bring your laptop.


Week 1 - Jan 25: Course Overview

Week 2 - Feb 1: The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model

Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.)

Week 4 - Feb 15: Holiday

Transparency and Reproducibility (1 week)

Week 5 - Feb 22: Data Exploration and Reproducibility

Big Data Foundations and Infrastructure (3 weeks)

Week 6 - Feb 29: Introduction to MapReduce

Week 7 - March 7: MapReduce Algorithm Design Patterns

Week 8 - March 14th: Spring Break

Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark

Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)

Week 10 - March 28th: Finding similar items & Spark

  • Homework Assignment
    • See quizzes on Gradiance -- Distance measures and document similarity.

Week 11 - April 4th: Large-Scale Visualization -- Invited lecture by Professor Claudio Silva

Week 12 - April 11th: Visualization: Using D3 -- Invited lecture by Bowen Yu

Week 13 - April 18th: Data quality: the other face of big data -- Invited lecture by Dr. Divesh Srivastava, AT&T Research

  • Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth “V” of big data, is increasingly being recognized. In this talk, we highlight the substantial challenges that the first three “V”s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data “speak for itself” in order to discover the semantics of the data. This talk presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs. efficiency.
  • Bio: Divesh Srivastava is the head of Database Research at AT&T Labs-Research. He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India. His research interests and publications span a variety of topics in data management.
  • Lecture notes:

Week 14 - April 25th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS)

Week 15 - May 2: Association Rules

  • Suggested additional reading:
    • Fast Algorithms for Mining Association Rules, Agrawal and Srikant, VLDB 1994.
    • Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann.
    • Dynamic Itemset Counting and Implication Rules for Market Basket Data, Brin et al., SIGMOD 1997.
  • Homework Assignment

Week 16 - May 9: Final Exam

Week 17 - May 16: Project Presentations