CS-GY 6333 Massive Data Analysis: Tentative Schedule -- subject to change

Course Web page: http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/

Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana/)

Lecture: Mondays, 1:00pm-3:25pm at 2MTC, room 9.011.

News

On Sept 22nd, I distributed AWS tokens that will be needed for your assignments. If you have not received your token, let me know.
Your first assignment has been posted -- see details below and in NYU Classes.

Background (4 weeks)

Week 1 -- Sept 8: Course Overview; the evolution of Data Management

Lecture notes: http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/course-overview.pdf (http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/course-overview-6p.pdf)
Reading: Chapter 1 of Mining of Massive Datasets (version 1.1)
Course survey: https://docs.google.com/spreadsheet/embeddedform?formkey=dFpwTjROVzhLUWY2NVNXb0xvNTVLMnc6MA

Week 2 -- Sept 15: Provenance and Reproducibility

Lecture notes: http://vgc.poly.edu/~fchirigati/mda-class/provenance-reproducibility.pdf
The class will have a lab component. Please bring your laptops.
Before class, follow the instructions below to install and set up VisTrails and to set up GitHub.

VisTrails setup:
- Download VisTrails 2.1.4 from http://www.vistrails.org/index.php/Downloads and follow the installation instructions. Start the system and then quit.
- Download the following packages:
  - http://vgc.poly.edu/~fchirigati/mda-class/gmaps.zip
  - http://vgc.poly.edu/~fchirigati/mda-class/tabledata-backport.zip
- Extract the zip files and place their contents under $HOME/.vistrails/userpackages
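
The package step above can also be scripted. The following is a minimal Python sketch, not part of the official instructions. The function name install_packages is hypothetical, and the sketch assumes both zip files have already been downloaded. Whether each archive unpacks into its own top-level folder is not stated here, so check the resulting directory afterwards.

```python
import zipfile
from pathlib import Path


def install_packages(zip_paths, dest_dir):
    """Extract every archive in zip_paths into dest_dir and return dest_dir as a Path."""
    dest = Path(dest_dir).expanduser()
    # Create the target directory if VisTrails has not created it yet.
    dest.mkdir(parents=True, exist_ok=True)
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            # Archive contents land directly under dest, keeping their internal layout.
            archive.extractall(dest)
    return dest
```

For example, with both archives in the current directory, calling install_packages(["gmaps.zip", "tabledata-backport.zip"], "~/.vistrails/userpackages") would unpack them into the user package directory.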

GitHub setup:
- Create a GitHub account (https://github.com/join)
- Learn how to set up git and create a public repository.

During class, you will add the trail of your analysis to GitHub and submit the link to your public GitHub repository using this form: https://docs.google.com/forms/d/17OScN8Ea-El20AC4mHIb32S3e62mAbGEiU-BET0PyX8/viewform?usp=send_form

Week 3 -- Sept 22: Introduction to Databases; Relational Model and SQL

Assignment 1: Provenance and Data Exploration

Week 4 -- Sept 29: Overview: Advanced SQL and Query Optimization

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/xml_schema_query.pdf
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/query-opt.pdf

In-class exercise: http://vistrails.org/index.php/Big_Data_Lab_SQL

Big Data Foundations and Infrastructure (3 weeks)

Week 5 -- Oct 6: Cloud computing, MapReduce and Hadoop

Lecture notes:
- http://vgc.poly.edu/~fchirigati/mda-class/mapreduce-intro.pdf

Lab: After the lecture, you will work on an in-class exercise. For this you need to install Hadoop on your laptop and have your AWS account set up. See the instructions below.

You will use two different Hadoop configurations:
- Local (on your laptop)
- Amazon AWS: Each student should have received a token with $100 credit towards computing time at AWS. If you have not received the token yet, contact us immediately! When using AWS, always remember to terminate your instances! If you don't, you will keep being charged, and you are responsible for any charges beyond your credit.
- Instructions for installing Hadoop on your local machine and setting up your AWS account are in http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/HadoopExerciseInstructions.pdf
- Warning: Install Hadoop on your machine and set up your AWS account before class starts. There will be no time for installing software during our in-class exercise.

In-Class Exercise: Hadoop Exercise

Required reading:
- Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
- Mining of Massive Datasets (2nd Edition), Chapter 2, Sections 2.1 and 2.2 (Large-Scale File Systems and Map-Reduce).

Other useful reading:
- Hadoop: The Definitive Guide. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520

Week 6 -- Oct 13: Fall Break

Week 7 -- Oct 20: Algorithm Design for MapReduce

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/mapreduce-algo-design.pdf

Required reading:
- Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
- Mining of Massive Datasets (2nd Edition), Chapter 2.

Week 8 -- Oct 27: Parallel Databases vs MapReduce, Query Processing on MapReduce and High-level Languages

Lecture notes:

Required reading:
- Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook, which I have placed at http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/MapReduce-algorithms-Jan2013-draft.pdf)
- Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
- MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext

Additional reading:
- Pig Latin: A Not-So-Foreign Language for Data Processing: http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf
- Hive - A Warehousing Solution Over a Map-Reduce Framework: http://www.vldb.org/pvldb/2/vldb09-938.pdf

Big Data Algorithms and Techniques (3 weeks)

Week 9 -- Nov 3: Association Rules

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/association-rules.pdf

Reading: Chapter 6 of Mining of Massive Datasets

Week 10 -- Nov 10: Finding similar items

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/similarity.pdf

Reading: Chapter 3 of Mining of Massive Datasets

Week 11 -- Nov 17: Graph Analysis

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/graph-algos.pdf

Week 12 -- Nov 25: Large-Scale Visualization -- Invited lecture by Dr. Lauro Lins (AT&T Research)

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/intro-to-visualization.pdf
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/nanocubes.pdf

Reading:
- The Value of Visualization, Jarke van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf
- Tamara Munzner's book (draft 2, available online): http://www.cs.ubc.ca/~tmm/courses/533/book/
- Nanocubes paper: http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf (project site: http://nanocubes.net)

Week 13 -- Dec 1: Data Cleaning and Integration

Week 14 -- Dec 8: Project Presentations

Week 15 -- Dec 15: Project Presentations
