Difference between revisions of "Course: Big Data 2016"

Line 56: Line 56:
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==


*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design.pdf
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf
* '''Lab:''' Hands-on Hadoop (HPC)
* '''Lab:''' Hands-on Hadoop (HPC)
* '''Programming assignment:''' Map Reduce (check NYU Classes)
* '''Programming assignment:''' Map Reduce (check NYU Classes)
Line 85: Line 85:
== Week 9 - March 21: Data Exploration and Reproducibility  ==
== Week 9 - March 21: Data Exploration and Reproducibility  ==


* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-science-reproducibility.pdf
 
* '''Lab:''' Hands-on reproducibility.  
* Lab: Hands-on reproducibility. Before class, please
* '''Programming assignment:''' Exploring urban data (see NYU Classes)
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf
** Questions? Email Fernando at fchirigati@nyu.edu
 
* Programming assignment 4: Exploring urban data (see NYU Classes)


= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =
Line 101: Line 94:


* Lecture notes:
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf


* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]  
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]  


* Homework Assignment
* Homework Assignment
Line 111: Line 104:


* Lecture notes:
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf




Line 124: Line 117:
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.


== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==


* Lecture notes:
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf


* Lab: Using Amazon AWS to analyze and visualize taxi data
* Lab: Using Amazon AWS to analyze and visualize taxi data
** https://github.com/ViDA-NYU/aws_taxi
** https://github.com/ViDA-NYU/aws_taxi


== Week 13 - April 18th: Parallel Databases ==
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&T Research ==
 
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf
 
* Required reading:
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext


* Suggested reading:
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf


== Week 14 - April 25th: Graph Analysis ==
== Week 14 - April 25th: Graph Analysis ==


* Lecture notes:
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf


* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms


== Week 15 - May 2: Final Exam ==
== Week 15 - May 2: TBD ==


== Week 16 - May 9: Project Presentations ==
== Week 16 - May 9: Final Exam ==


== Week 17 - May 16: Project Presentations ==
== Week 17 - May 16: Project Presentations ==

Revision as of 22:05, 23 January 2016

DS-GA 1004 - Big Data: Tentative Schedule -- subject to change

  • TAs:
    • Yuan Feng
    • Kevin Ye
  • Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102.
  • Some classes will include a lab session; please always bring your laptop.

News

Week 1 - Jan 25: Course Overview

Week 2 - Feb 1: The evolution of Data Management and introduction to Big Data; Introduction to Databases, Relational Model and SQL

Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.)

Week 4 - Feb 15: Holiday

Big Data Foundations and Infrastructure (3 weeks)

Week 5 - Feb 22: Introduction to Map Reduce

Week 6 - Feb 29: MapReduce Algorithm Design Patterns

Week 7 - March 7: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK

Week 8 - March 14th: Spring Break

Transparency and Reproducibility (1 week)

Week 9 - March 21: Data Exploration and Reproducibility

Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)

Week 10 - March 28th: Finding similar items

  • Homework Assignment
    • See quizzes on Gradiance -- Distance measures and document similarity.

Week 11 - April 4th: Association Rules


  • Suggested additional reading:
    • Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
    • Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
    • Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html
  • Homework Assignment
    • See quizzes on Gradiance -- Distance measures and document similarity.

Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS)

Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&T Research

Week 14 - April 25th: Graph Analysis

  • Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms

Week 15 - May 2: TBD

Week 16 - May 9: Final Exam

Week 17 - May 16: Project Presentations