Difference between revisions of "Course: Big Data 2014"

From VistrailsWiki
 
(22 intermediate revisions by the same user not shown)
Line 12: Line 12:


= News =
* The final exam will take place on May 12th.
* We will have our last class on May 19th.
* 4/21/2014: There are two new quizzes on Gradiance. They are due on 2014-04-28 23:59 PST.
* Homework assignment 4 has been posted: [[Assignment 4 - Querying with Pig and Mapreduce]]
* Homework assignment 3 has been posted: [[Assignment 3 - MapReduce algorithm design]]
** You can find instructions on how to log into the NYU Hadoop cluster at: http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/MapReduceExample/readme-nyu-hadoop.txt
** I have created a list of frequently-asked questions which I hope will help you with your assignment: [[Assignment 3 - FAQ]]


* Your first assignment has been posted and it is due on Feb 17, 2014 5:00 pm. Here are the instructions: http://vistrails.org/index.php/Assignment_1_-_Data_Exploration
Line 53: Line 65:




== Week 3.1 -- Feb 17  ==
== Week 3.1 -- Feb 17: Holiday ==
* No class, holiday
* Feb 20 Lab: hands-on SQL
** [[Big Data Lab notes 02/19/14]]


== Week 4 -- Feb 24: Overview: Advanced SQL and Query Optimization  ==
Line 67: Line 78:
* Homework assignment: [[Assignment 2 - Data Exploration using SQL]]


= Big Data Foundations and Infrastructure (4 weeks) =
= Big Data Foundations and Infrastructure (2 weeks) =


== Week 5 -- Mar 3: Cloud computing, Map Reduce and  Hadoop ==
Line 92: Line 103:




== Week 7 -- Mar 24: Data Management for Big Data, No-SQL and NewSQL Systems ==
= Machine Learning and Big Data  (3 weeks) =
 
== Week 7 -- Mar 23: Hashing and AllReduce ==
* Invited lecture by John Langford
 
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/langford_hashing_2014.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/langford_parallel_learning_2014.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture08-hashing.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture04-allreduce.pdf
 
* Homework assignment: [[Assignment 3 - MapReduce algorithm design]]
 
== Week 8 -- Mar 30: Bandits ==
* Invited lecture by John Langford
 
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/langford_interact.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture10_using_exploration.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture10_doing_exploration.pdf
 
== Week 9 -- Apr 7: Large Scale Machine Learning in the Real World ==
* Invited lecture by Leon Bottou
 
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/bottou-ml-real-world.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture09-ads-bottou.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture11-ads-bottou.pdf
 
= Big Data Foundations and Infrastructure -- cont. (2 weeks) =
 
== Week 10 -- April 14: Parallel Databases vs MapReduce, Query Processing on MapReduce and High-level Languages ==
 
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/paralleldb-vs-hadoop-2014.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/hive-pig.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/data-analysis-mapreduce.pdf
 
* Required reading:
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- I have placed this version in http://vgc.poly.edu/~juliana/courses/BigData2014/Textbooks/MapReduce-algorithms-Jan2013-draft.pdf)
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
 
 
* Additional reading:
** Pig Latin: A Not-So-Foreign Language for Data Processing: http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf
** Hive - A Warehousing Solution Over a Map-Reduce Framework: http://www.vldb.org/pvldb/2/vldb09-938.pdf
 
= Big Data Algorithms and Techniques (3 weeks) =
 
== Week 11 -- April 21: Data Management for Big Data (cont) and Association Rules  ==
 
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/association-rules.pdf
 
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/BigData2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]
 
* Homework Assignment -- Your quiz is available on [http://www.newgradiance.com Gradiance]. It is ''due on April 28th.''
 
== Week 12 -- Apr 28: Finding similar items: Invited lecture by Dr. Harish Doraiswami  ==
 
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/similarity.pdf
 
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]
 
* Homework Assignment
** There are two new quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. They are ''due on May 5th.''
** Your final assignment is available at http://www.vistrails.org/index.php/Assignment_4_-_Querying_with_Pig_and_Mapreduce. This assignment is optional and will count towards extra credit.


== Week 8 -- Mar 31: Query Processing on Mapreduce and High-level Languages ==
== Week 13 -- May 5: Graph Analysis and Exam Review ==


= Big Data Algorithms and Techniques (6 weeks) =
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/graph-algos.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/exam-review.pdf


== Week 9 -- Apr 7: Map Reduce Algorithm Design ==
== Week 14 -- May 12: Final Exam  ==


== Week 10 -- Apr 14: Finding similar items and information integration ==


== Week 11 -- Apr 21: Graph Analysis ==
== Week 15 -- May 19: Large-Scale Visualization -- Invited lecture by Dr. Lauro Lins (AT&T Research) ==


== Week 12 -- Apr 28: Frequent Itemset Mining ==
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/intro-to-visualization.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/nanocubes.pdf


== Week 13 -- May 5: Interactive Analysis and Visualization of Big Data ==
* Reading:  


== Week 14 -- May 12: Machine Learning for Big Data ==
The Value of Visualization, Jarke Van Wijk
http://www.win.tue.nl/~vanwijk/vov.pdf


Tamara Munzner's Book draft 2 available online
http://www.cs.ubc.ca/~tmm/courses/533/book/


== Week 15 -- May 19: Final Exam ==
Nanocubes Paper
http://nanocubes.net
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf

Latest revision as of 02:59, 7 May 2014

DS-GA 1004/CSCI-GA 2568 Big Data: Tentative Schedule -- subject to change

  • Lecture: Mondays, 7:10pm-9:00pm at Cantor, room 101. Note new location!
    • Cantor Film Center (CANTR), 36 E 8th St, New York, NY 10003
  • Lab: Thursdays, 7:10pm-8:00pm at CIWW, room 109. Always bring your laptop.
    • Warren Weaver Hall (CIWW), 251 Mercer St, New York, NY 10012

News

  • The final exam will take place on May 12th.
  • We will have our last class on May 19th.
  • 4/21/2014: There are two new quizzes on Gradiance. They are due on 2014-04-28 23:59 PST.
  • Starting on Feb 10th, our class will meet at a new location: Cantor 101
  • We will have lab on Thu at CIWW, room 109. Bring your laptop!

Background (4 weeks)

Week 1 -- Jan 27: Course Overview; the evolution of Data Management


Week 2 -- Feb 3: Introduction to Databases

Week 3 -- Feb 10: Overview: Relational Model and SQL

  • Feb 13: Lab: Canceled -- University closed due to snow


Week 3.1 -- Feb 17: Holiday

Week 4 -- Feb 24: Overview: Advanced SQL and Query Optimization

Big Data Foundations and Infrastructure (2 weeks)

Week 5 -- Mar 3: Cloud computing, Map Reduce and Hadoop

  • Required reading:
    • Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
    • Mining of Massive Datasets (2nd Edition), Chapter 2 - 2.1 and 2.2 (Large-Scale File Systems and Map-Reduce).
  • Homework Assignment -- Your first quiz is available on Gradiance. It is due on March 17th at 5pm.

Week 6 -- Mar 10: Algorithm Design for MapReduce

  • Required reading:
    • Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
    • Mining of Massive Datasets (2nd Edition), Chapter 2.


Machine Learning and Big Data (3 weeks)

Week 7 -- Mar 23: Hashing and AllReduce

  • Invited lecture by John Langford

Week 8 -- Mar 30: Bandits

  • Invited lecture by John Langford

Week 9 -- Apr 7: Large Scale Machine Learning in the Real World

  • Invited lecture by Leon Bottou

Big Data Foundations and Infrastructure -- cont. (2 weeks)

Week 10 -- April 14: Parallel Databases vs MapReduce, Query Processing on MapReduce and High-level Languages


Big Data Algorithms and Techniques (3 weeks)

Week 11 -- April 21: Data Management for Big Data (cont) and Association Rules

  • Homework Assignment -- Your quiz is available on Gradiance. It is due on April 28th.

Week 12 -- Apr 28: Finding similar items: Invited lecture by Dr. Harish Doraiswami

Week 13 -- May 5: Graph Analysis and Exam Review

Week 14 -- May 12: Final Exam

Week 15 -- May 19: Large-Scale Visualization -- Invited lecture by Dr. Lauro Lins (AT&T Research)

  • Reading:
    • The Value of Visualization, Jarke Van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf
    • Tamara Munzner's book, draft 2, available online: http://www.cs.ubc.ca/~tmm/courses/533/book/
    • Nanocubes paper: http://nanocubes.net and http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf