# Difference between revisions of "Course: Big Data 2014"



** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]
* Feb 6: Lab: Data Exploration and Reproducibility
** [[Lab notes 02/06/14]]
* Homework assignment: [[Assignment 1 - Data Exploration]]

== Week 3 -- Feb 10: Overview: Relational Model and SQL ==

* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/relational-algebra.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/sql-intro.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/sql-more.pdf
* Other useful reading:
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]
* Feb 13: Lab: Canceled -- University closed due to snow

== Week 3.1 -- Feb 17: Holiday ==

* No class, holiday
* Feb 20 Lab: hands-on SQL
** [[Big Data Lab notes 02/19/14]]

== Week 4 -- Feb 24: Overview: Advanced SQL and Query Optimization ==

* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/xml_schema_query.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/query-opt.pdf
* Homework assignment: [[Assignment 2 - Data Exploration using SQL]]

= Big Data Foundations and Infrastructure (2 weeks) =

== Week 5 -- Mar 3: Cloud computing, Map Reduce and Hadoop ==

* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/mapreduce-intro.pdf
* Required reading:
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
** Mining of Massive Datasets (2nd Edition), Chapter 2 - 2.1 and 2.2 (Large-Scale File Systems and Map-Reduce).
* Other useful reading:
** Hadoop: The Definitive Guide. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
* Homework Assignment -- Your first quiz is available on [http://www.newgradiance.com Gradiance]. It is ''due on March 17th at 5pm.''

== Week 6 -- Mar 10: Algorithm Design for MapReduce ==

* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/mapreduce-algo-design.pdf
* Required reading:
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
** Mining of Massive Datasets (2nd Edition), Chapter 2.

= Machine Learning and Big Data (3 weeks) =

== Week 7 -- Mar 23: Hashing and AllReduce ==

* Invited lecture by John Langford
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/langford_hashing_2014.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/langford_parallel_learning_2014.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture08-hashing.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture04-allreduce.pdf
* Homework assignment: [[Assignment 3 - MapReduce algorithm design]]

== Week 8 -- Mar 30: Bandits ==

* Invited lecture by John Langford
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/langford_interact.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture10_using_exploration.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture10_doing_exploration.pdf

== Week 9 -- Apr 7: Large Scale Machine Learning in the Real World ==

* Invited lecture by Leon Bottou
* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/bottou-ml-real-world.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture09-ads-bottou.pdf
** http://cilvr.cs.nyu.edu/diglib/lsml/lecture11-ads-bottou.pdf

= Big Data Foundations and Infrastructure -- cont. (2 weeks) =

== Week 10 -- April 14: Parallel Databases vs MapReduce, Query Processing on Mapreduce and High-level Languages ==

* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/paralleldb-vs-hadoop-2014.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/hive-pig.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/data-analysis-mapreduce.pdf
* Required reading:
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- I have placed this version in http://vgc.poly.edu/~juliana/courses/BigData2014/Textbooks/MapReduce-algorithms-Jan2013-draft.pdf)
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
* Additional reading:
** Pig Latin: A Not-So-Foreign Language for Data Processing: http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf
** Hive - A Warehousing Solution Over a Map-Reduce Framework: http://www.vldb.org/pvldb/2/vldb09-938.pdf

= Big Data Algorithms and Techniques (3 weeks) =

== Week 11 -- April 21: Data Management for Big Data (cont) and Association Rules ==

* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/association-rules.pdf
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/BigData2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]
* Homework Assignment -- Your quiz is available on [http://www.newgradiance.com Gradiance]. It is ''due on April 28th.''

== Week 12 -- Apr 28: Finding similar items: Invited lecture by Dr. Harish Doraiswami ==

* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/similarity.pdf
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]
* Homework Assignment
** There are two new quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. They are ''due on May 5th.''
** Your final assignment is available at http://www.vistrails.org/index.php/Assignment_4_-_Querying_with_Pig_and_Mapreduce. This is an optional assignment and will count towards extra credit.

== Week 13 -- May 5: Graph Analysis and Exam Review ==

* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/graph-algos.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/exam-review.pdf

== Week 14 -- May 12: Final Exam ==

== Week 15 -- May 19: Large-Scale Visualization -- Invited lecture by Dr. Lauro Lins (AT&T Research) ==

* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/intro-to-visualization.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/nanocubes.pdf
* Reading:
** The Value of Visualization, Jarke Van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf
** Tamara Munzner's Book draft 2 available online: http://www.cs.ubc.ca/~tmm/courses/533/book/
** Nanocubes Paper: http://nanocubes.net and http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf

## Latest revision as of 02:59, 7 May 2014

# DS-GA 1004/CSCI-GA 2568 Big Data: Tentative Schedule -- *subject to change*

- Course Web page: http://cs.nyu.edu/courses/spring14/CSCI-GA.2568-001/index.html

- Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana/)

- Lecture: Mondays, 7:10pm-9:00pm at Cantor, room 101. *Note new location!*
  - Cantor Film Center (CANTR), 36 E 8th St, New York, NY 10003

- Lab: Thursdays, 7:10pm-8:00pm at CIWW, room 109. *Always bring your laptop.*
  - Warren Weaver Hall (CIWW), 251 Mercer St, New York, NY 10012

# News

- The final exam will take place on May 12th.

- We will have our last class on May 19th.

- 4/21/2014: There are two new quizzes on Gradiance. They are due on 2014-04-28 23:59 PST.

- Homework assignment 4 has been posted: Assignment 4 - Querying with Pig and Mapreduce

- Homework assignment 3 has been posted: Assignment 3 - MapReduce algorithm design
  - You can find instructions on how to log into the NYU Hadoop cluster at: http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/MapReduceExample/readme-nyu-hadoop.txt
  - I have created a list of frequently asked questions, which I hope will help you with your assignment: Assignment 3 - FAQ

- Your first assignment has been posted and it is due on Feb 17, 2014 5:00 pm. Here are the instructions: http://vistrails.org/index.php/Assignment_1_-_Data_Exploration

- I have sent a test email to the class list. If you have not received the message, make sure to sign up: http://www.cs.nyu.edu/mailman/listinfo/csci_ga_2568_001_sp14

- Starting on Feb 10th, our class will meet at a new location: Cantor 101

- We will have lab on Thu at CIWW, room 109. *Bring your laptop!*

# Background (4 weeks)

## Week 1 -- Jan 27: Course Overview; the evolution of Data Management

- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/course-overview.pdf
- Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)
- Course survey: https://docs.google.com/spreadsheet/embeddedform?formkey=dDRoTVcyMnRQUXhFUjl0cFFuTEVER1E6MA

## Week 2 -- Feb 3: Introduction to Databases

- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/intro-to-db.pdf
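- Other useful reading:
  - Greenspun's SQL for Web Nerds Intro: http://philip.greenspun.com/sql/introduction.html
  - SQL/Nerds Modeling (parts): http://philip.greenspun.com/sql/data-modeling.html

- Feb 6: Lab: Data Exploration and Reproducibility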

- Homework assignment: Assignment 1 - Data Exploration

## Week 3 -- Feb 10: Overview: Relational Model and SQL

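- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/relational-algebra.pdf
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/sql-intro.pdf
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/sql-more.pdf
- Other useful reading:
  - Greenspun's SQL for Web Nerds Intro: http://philip.greenspun.com/sql/introduction.html
  - SQL/Nerds Modeling (parts): http://philip.greenspun.com/sql/data-modeling.html

- Feb 13: Lab: Canceled -- University closed due to snow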

## Week 3.1 -- Feb 17: Holiday

- No class, holiday
- Feb 20 Lab: hands-on SQL

## Week 4 -- Feb 24: Overview: Advanced SQL and Query Optimization

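- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/xml_schema_query.pdf
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/query-opt.pdf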

- Homework assignment: Assignment 2 - Data Exploration using SQL

# Big Data Foundations and Infrastructure (2 weeks)

## Week 5 -- Mar 3: Cloud computing, Map Reduce and Hadoop

- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/mapreduce-intro.pdf

- Required reading:
  - Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
  - Mining of Massive Datasets (2nd Edition), Chapter 2 - 2.1 and 2.2 (Large-Scale File Systems and Map-Reduce).

- Other useful reading:
  - Hadoop: The Definitive Guide. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520

- Homework Assignment -- Your first quiz is available on Gradiance. It is *due on March 17th at 5pm.*

## Week 6 -- Mar 10: Algorithm Design for MapReduce

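- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/mapreduce-algo-design.pdf

- Required reading:
  - Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
  - Mining of Massive Datasets (2nd Edition), Chapter 2.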

# Machine Learning and Big Data (3 weeks)

## Week 7 -- Mar 23: Hashing and AllReduce

- Invited lecture by John Langford

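- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/langford_hashing_2014.pdf
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/langford_parallel_learning_2014.pdf
  - http://cilvr.cs.nyu.edu/diglib/lsml/lecture08-hashing.pdf
  - http://cilvr.cs.nyu.edu/diglib/lsml/lecture04-allreduce.pdf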

- Homework assignment: Assignment 3 - MapReduce algorithm design

## Week 8 -- Mar 30: Bandits

- Invited lecture by John Langford

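- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/langford_interact.pdf
  - http://cilvr.cs.nyu.edu/diglib/lsml/lecture10_using_exploration.pdf
  - http://cilvr.cs.nyu.edu/diglib/lsml/lecture10_doing_exploration.pdf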

## Week 9 -- Apr 7: Large Scale Machine Learning in the Real World

- Invited lecture by Leon Bottou

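- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/bottou-ml-real-world.pdf
  - http://cilvr.cs.nyu.edu/diglib/lsml/lecture09-ads-bottou.pdf
  - http://cilvr.cs.nyu.edu/diglib/lsml/lecture11-ads-bottou.pdf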

# Big Data Foundations and Infrastructure -- cont. (2 weeks)

## Week 10 -- April 14: Parallel Databases vs MapReduce, Query Processing on Mapreduce and High-level Languages

- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/paralleldb-vs-hadoop-2014.pdf
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/hive-pig.pdf
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/data-analysis-mapreduce.pdf

- Required reading:
  - Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- I have placed this version in http://vgc.poly.edu/~juliana/courses/BigData2014/Textbooks/MapReduce-algorithms-Jan2013-draft.pdf)
  - Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
  - MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext

- Additional reading:
  - Pig Latin: A Not-So-Foreign Language for Data Processing: http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf
  - Hive - A Warehousing Solution Over a Map-Reduce Framework: http://www.vldb.org/pvldb/2/vldb09-938.pdf

# Big Data Algorithms and Techniques (3 weeks)

## Week 11 -- April 21: Data Management for Big Data (cont) and Association Rules

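- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/association-rules.pdf

- Reading: Chapter 6 of Mining of Massive Datasets (http://vgc.poly.edu/~juliana/courses/BigData2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf)

- Homework Assignment -- Your quiz is available on Gradiance. It is *due on April 28th.*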

## Week 12 -- Apr 28: Finding similar items: Invited lecture by Dr. Harish Doraiswami

- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/similarity.pdf

- Reading: Chapter 3 of Mining of Massive Datasets (http://vgc.poly.edu/~juliana/courses/BigData2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf)

- Homework Assignment
  - There are two new quizzes on Gradiance -- Distance measures and document similarity. They are *due on May 5th.*
  - Your final assignment is available at http://www.vistrails.org/index.php/Assignment_4_-_Querying_with_Pig_and_Mapreduce. This is an optional assignment and will count towards extra credit.

## Week 13 -- May 5: Graph Analysis and Exam Review

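- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/graph-algos.pdf
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/exam-review.pdf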

## Week 14 -- May 12: Final Exam

## Week 15 -- May 19: Large-Scale Visualization -- Invited lecture by Dr. Lauro Lins (AT&T Research)

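- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/intro-to-visualization.pdf
  - http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/nanocubes.pdf

- Reading:
  - The Value of Visualization, Jarke Van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf
  - Tamara Munzner's Book draft 2 available online: http://www.cs.ubc.ca/~tmm/courses/533/book/
  - Nanocubes Paper: http://nanocubes.net and http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf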