Course: Big Data 2015
DS-GA 1004 - Big Data: Tentative Schedule -- subject to change
- Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2015
- Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana)
- Lecture: Mondays, 4:55pm-7:35pm at Silver, room 208.
- Some classes will include a lab session, so please always bring your laptop.
News
- 04/05/2015: New quizzes are available at http://www.newgradiance.com
- Big Data 2015: Final Project
- 2/26/2015: An Amazon AWS token was emailed to each student. Please create your Amazon AWS account. You can find instructions at: http://www.vistrails.org/index.php/AWS_Setup
- 2/26/2015: You should install the Cloudera VM on your laptop. We will need that for the lab on March 9th. Here are the instructions: Cloudera VM Setup
- There is a new version of the textbook Mining of Massive Datasets; we will use the latest version, 2.1.
Background (2 weeks)
Week 1 - Feb 2: Course Overview; The evolution of Data Management and introduction to Big Data
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf
- Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)
- Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form
Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL
- Lecture notes:
- Lab:
- SQL hands on: Big Data 2015 - SQL Lab
- Other useful reading:
- Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)
Feb 16: Holiday
Big Data Foundations and Infrastructure (3 weeks)
Week 3 - Feb 23: Introduction to Map Reduce
- Lab: (continuation)
- SQL hands on: Big Data 2015 - SQL Lab
- Lecture notes:
- Required Reading:
- Data-Intensive Text Processing with MapReduce. Chapters 1 and 2
- Mining of Massive Datasets (v 2.1). Chapter 2 - 2.1, 2.2, and 2.3
- Other useful reading:
- Hadoop: The Definitive Guide. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
- Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services
Week 4 - March 2: Algorithm Design for MapReduce: Relational Operations
- Lecture notes:
- Lab: Hands-on Hadoop (local)
- Required reading:
- Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
- Mining of Massive Datasets (2nd Edition), Chapter 2.
- Programming assignment: Map Reduce (check NYU Classes)
Week 5 - March 9: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce
- Lecture notes:
- Lab: Hands-on Hadoop on AWS
- Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip
- Install aws command-line interface: http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html
- Some links to AWS CLI documentation:
- http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html
- http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html
- http://www.linux.com/learn/tutorials/761430-an-introduction-to-the-aws-command-line-tool
- EMR Through Commandline: https://www.safaribooksonline.com/library/view/programming-elastic-mapreduce/9781449364038/ch04.html
- Importing Key: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws
- EMR Job Flow: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMR_CreateJobFlow.html
- Required reading:
- Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
- Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)
- Programming assignment: check NYU Classes on March 10th
March 16th: Spring Break
Transparency and Reproducibility (1 week)
Week 6 - March 23: Data Exploration and Reproducibility
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf
- Lab: Hands-on reproducibility. Before class, please:
- Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads
- Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt
- Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt
- http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf
- Questions? Email Fernando at fchirigati@nyu.edu
- Programming assignment 4: Exploring urban data (see NYU Classes)
Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)
Week 7 - March 30th: Finding similar items
- Reading: Chapter 3 Mining of Massive Datasets
- Homework Assignment
- See quizzes on Gradiance -- Distance measures and document similarity.
Week 8 - April 6th: Association Rules
- Reading: Chapter 6 Mining of Massive Datasets
- Suggested additional reading:
- Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
- Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
- Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html
- Homework Assignment
- See quizzes on Gradiance -- Distance measures and document similarity.
Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP)
- Lab: Using Amazon AWS to analyze and visualize taxi data
Week 10 - April 20th: Parallel Databases
- Lecture notes:
- Required reading:
- Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
- MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
- Suggested reading:
- Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609
- Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726
- BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf
- Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf
Week 11 - April 27th: Graph Analysis
- Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms