Difference between revisions of "Course: Big Data 2016"

From VistrailsWiki
 
(35 intermediate revisions by the same user not shown)
Line 1: Line 1:
= DS-GA 1004 - Big Data: Tentative Schedule -- ''subject to change'' =
= DS-GA 1004 - Big Data: Tentative Schedule -- ''subject to change'' =
[[Course: Big Data 2017]]


* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016
Line 12: Line 14:
** Kevin Ye
** Kevin Ye


* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102.
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207


* Some classes will include a lab session; please always ''bring your laptop''.
* Some classes will include a lab session; please always ''bring your laptop''.
Line 18: Line 20:
= News =
= News =


* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. To obtain your credit, please follow the instructions at http://www.vistrails.org/index.php/AWS_Setup
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup
* 1/25/2016: Access your NYU HPC account, which you will use for in-class exercises and homework assignments. See [[NYU HPC Access Instructions]]
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]


== Week 1 - Jan 25:  Course Overview; The evolution of Data Management and introduction to Big Data ==
== Week 1 - Jan 25:  Course Overview ==


* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf
* '''Lab:''' Computing infrastructure for the course
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form


== Week 2 - Feb 1:  Course Overview; The evolution of Data Management and introduction to Big Data ==
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==
 
* '''Lecture notes:'''
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf
* '''Lab:''' getting started with MySQL
* '''Required Reading:'''
** Chapter 1 of Mining of Massive Datasets
* '''Suggested Reading:'''
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013


* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==
* Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form


== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL ==
*''' Lecture notes:'''
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/relational-algebra.pdf
* '''Lab:''' SQL
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-intro.pdf
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-more.pdf


* Lab:
== Week 4 - Feb 15: Holiday ==
** SQL hands on: [[Big Data 2015 - SQL Lab]]


* Other useful reading:
= Transparency and Reproducibility  (1 week) =
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]


* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==


== Week 4 - Feb 15: Holiday ==
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!


= Big Data Foundations and Infrastructure (3 weeks) =
= Big Data Foundations and Infrastructure (3 weeks) =


== Week 5 - Feb 22:  Introduction to Map Reduce ==
== Week 6 - Feb 29:  Introduction to Map Reduce ==
* Lab: (continuation)
** SQL hands on: [[Big Data 2015 - SQL Lab]]
* Lecture notes: 
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf
* Required Reading:
** Data-Intensive Text Processing with MapReduce. Chapters 1 and 2
** Mining of Massive Datasets (v 2.1).  Chapter 2 - 2.1, 2.2, and 2.3
* Other useful reading:
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520


* Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf
* '''Lab:''' Hands-on Hadoop (local and AWS)
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)
** Quiz is due on 2016-03-14 12:00 PM EST


== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==
*''' Lecture notes:'''
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf
* '''Lab:''' Hands-on Hadoop (HPC)
* '''Programming assignment:''' Map Reduce (check NYU Classes)
* '''Readings''':
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)


== Week 6 - Feb 29: Algorithm Design for MapReduce: Relational Operations  ==
== Week 8 - March 14th: Spring Break ==


* Lecture notes: 
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-relations.pdf


* Lab: Hands-on Hadoop (local)
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==


* Required reading:  
*''' Lecture notes:'''
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf
** Mining of Massive Datasets (2nd Edition), Chapter 2.
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf
* '''Lab:''' Hands-on Pig
* Assignment: Hands-on Map-Reduce (see NYU Classes)
* '''Readings''':
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf
* '''Additional Suggested reading:'''
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf


* Programming assignment: Map Reduce (check NYU Classes)
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =


== Week 7 - March 7: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce ==  
== Week 10 - March 28th: Finding similar items & Spark ==


* Lecture notes:
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-patterns.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf


* Lab: Hands-on Hadoop on AWS
* Reading:  
** Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip
**Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
** Install aws command-line interface: http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf


* Some links to AWS CLI documentation:
* Homework Assignment
** http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.
** http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html
** http://www.linux.com/learn/tutorials/761430-an-introduction-to-the-aws-command-line-tool
**EMR Through Commandline: https://www.safaribooksonline.com/library/view/programming-elastic-mapreduce/9781449364038/ch04.html
** Importing Key: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws
** EMR Job Flow: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMR_CreateJobFlow.html


== Week 11 - April 4th: Large-Scale Visualization  -- Invited lecture by Professor Claudio Silva ==
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/intro-to-visualization.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting1.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting2.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/PlottingNotes.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Tufte.pdf


* Required reading:
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)


* Programming assignment: check NYU Classes on March 10th
* Videos:
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/biopathways.mov
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/VisTrailsForParaView_Small.mov
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/defog-1150.mov
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/SevereTstorm.mov


== Week 8 -- March 14th: Spring Break ==
== Week 12 - April 11th: Visualization: Using D3 -- Invited lecture by Bowen Yu ==


* Lecture notes and lab:
** UPDATED: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/vis-d3_v2.1.pdf


= Transparency and Reproducibility  (1 week) =
== Week 13 - April 18th: Data quality: the other face of big data - Invited lecture by Dr. Divesh Srivastava, AT&T Research ==


== Week 9 - March 21: Data Exploration and Reproducibility ==
* Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth “V” of big data, is increasingly being recognized. In this talk, we highlight the substantial challenges that the first three “V”s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data “speak for itself” in order to discover the semantics of the data. This talk presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs efficiency.


* Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf
* Bio: Divesh Srivastava is the head of Database Research at AT&T Labs-Research. He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India. His research interests and publications span a variety of topics in data management.


* Lab: Hands-on reproducibility. Before class, please
* Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/bdq.pdf
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf
** Questions? Email Fernando at fchirigati@nyu.edu


* Programming assignment 4: Exploring urban data (see NYU Classes)
== Week 14 - April 25th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==


= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP-2016.pdf


== Week 10 - March 28th: Finding similar items  ==
* Lab: see NYU Classes


* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf


* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]


* Homework Assignment
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.


== Week 11 - April 4th: Association Rules  ==
== Week 15 - May 2: Association Rules  ==


* Lecture notes:
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf




Line 145: Line 168:


* Homework Assignment
* Homework Assignment
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.
** See quiz on [http://www.newgradiance.com Gradiance] -- Association Rules.
 
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==
 
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf
 
* Lab: Using Amazon AWS to analyze and visualize taxi data
** https://github.com/ViDA-NYU/aws_taxi
 
== Week 13 - April 18th: Parallel Databases ==
 
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf
 
* Required reading:
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
 
* Suggested reading:
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf
 
== Week 14 - April 25th: Graph Analysis ==
 
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf


* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms


== Week 15 - May 2: Final Exam ==


== Week 16 - May 9: Project Presentations ==
== Week 16 - May 9: Final Exam ==


== Week 17 - May 16: Project Presentations ==
== Week 17 - May 16: Project Presentations ==

Latest revision as of 04:23, 30 January 2017

DS-GA 1004 - Big Data: Tentative Schedule -- subject to change

Course: Big Data 2017

  • TAs:
    • Yuan Feng
    • Kevin Ye
  • Lecture: Mondays, 4:55pm-7:35pm at Silver 207
  • Some classes will include a lab session; please always bring your laptop.

News

Week 1 - Jan 25: Course Overview

Week 2 - Feb 1: The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model

Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.)

Week 4 - Feb 15: Holiday

Transparency and Reproducibility (1 week)

Week 5 - Feb 22: Data Exploration and Reproducibility

Big Data Foundations and Infrastructure (3 weeks)

Week 6 - Feb 29: Introduction to Map Reduce

Week 7 - March 7: MapReduce Algorithm Design Patterns

Week 8 - March 14th: Spring Break

Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark

Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)

Week 10 - March 28th: Finding similar items & Spark

  • Homework Assignment
    • See quizzes on Gradiance -- Distance measures and document similarity.

Week 11 - April 4th: Large-Scale Visualization -- Invited lecture by Professor Claudio Silva


Week 12 - April 11th: Visualization: Using D3 -- Invited lecture by Bowen Yu

Week 13 - April 18th: Data quality: the other face of big data - Invited lecture by Dr. Divesh Srivastava, AT&T Research

  • Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth “V” of big data, is increasingly being recognized. In this talk, we highlight the substantial challenges that the first three “V”s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data “speak for itself” in order to discover the semantics of the data. This talk presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs efficiency.
  • Bio: Divesh Srivastava is the head of Database Research at AT&T Labs-Research. He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India. His research interests and publications span a variety of topics in data management.

Week 14 - April 25th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS)

  • Lab: see NYU Classes



Week 15 - May 2: Association Rules


  • Suggested additional reading:
    • Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
    • Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
    • Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html
  • Homework Assignment


Week 16 - May 9: Final Exam

Week 17 - May 16: Project Presentations