Course: Big Data Analysis

Fall 2013

This schedule is tentative and subject to change

Make sure to check my.poly.edu for course announcements

News

On September 30th, our class will meet at a different place: 1 Metrotech Center, 19th floor. Bring your NYU Poly ID -- you will need to show it to the security guard.

For frequently asked questions about the course and homework assignments, please check our BigDataAnalysisFAQ.

Week 1: Monday Sept. 9th - Course Overview

Course overview and introduction to Big Data Analysis
Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf
Student survey -- to be filled out today!

Required Reading

Additional References

Week 2: Monday Sept. 16th - Map-Reduce/Hadoop

Introduction to Map-Reduce and high-level data processing languages
Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf
Hand out AWS tokens. Notes on using AWS.
Introduction to Hadoop
The Map-Reduce ecosystem: Pig, Hive, Jaql, Mahout, BigInsights

Assignment

cs9223 Mapreduce Assignment
This is an individual assignment. You may not collude with any other individual, or plagiarise their work.

For more details see http://cis.poly.edu/policies.

Your assignment is due on Sun Sept 29th. Make sure you can log in and access my.poly.edu!
If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018

Required Reading

Week 3: Monday Sept. 23rd - Data Management for Big Data

Databases and Big Data: Persistence, Querying, Indexing, Transactions
Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf

Required Reading

Additional References

Week 4: Monday Sept 30th - Invited lecture by Dr. C. Mohan (IBM)

Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor

Tutorial: An In-Depth Look at Modern Database Systems

Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades-old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile ground for new research work.

Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.

Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found on his home page at http://bit.ly/CMohan

Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages

Pig Latin and Query Processing:
- Relational processing over MapReduce
- Queries over MapReduce
In-class assignment

Required Reading

Pig Latin: A Not-So-Foreign Language for Data Processing

Additional References

Week 6: Mon Oct. 14th - Fall Break - No class

Week 7: Monday Oct. 21st - Invited Speaker: Torsten Suel

Big Data and Information Retrieval. Invited lecture by Torsten Suel.
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/search-data.pdf

Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha

Guest lecture by Dennis Shasha: Statistics is Easy

Required Reading

http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students
Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008

Week 9: Monday Nov. 4th - EM and Text Processing

TODO

Readings

Data-Intensive Text Processing with MapReduce, Chapter 6

Week 10: Monday Nov. 11th - Finding Similar Items and Information Integration

Similarity: Applications, Measures and Efficiency considerations
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf
Similarity application: Information integration on the Web:
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf
Homework presentation and demo

Required Reading

Mining of Massive Datasets, chapter 3; information integration; entity resolution

Homework Assignment

Due November 17th. Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.

Week 11: Monday Nov. 18th - Frequent Itemsets

- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf

Required Reading

Mining of Massive Datasets, Chapter 4

Homework Assignment

Due November 24th

Additional Reading

Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&acc=ACTIVE%20SERVICE&CFID=198467341&CFTOKEN=23537886&__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb
Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf
An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813

Week 12: Monday Nov. 25th - Clustering

Lecture notes:
- Graph algorithms: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-graph.pdf
- Clustering: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf

Homework Assignment

Due Dec 1st

Readings

Mining of Massive Datasets, Chapter 7
See readings for previous class
Web Mining, by Bing Liu. http://www.cs.uic.edu/~liub/WebMiningBook.html
Information Retrieval. http://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf

Week 13: Monday Dec. 2nd - Invited lecture by Enrico Bertini

Introduction to Visual Analytics
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro-to-visualization.pdf

Readings

The Value of Visualization. IEEE Visualization 2005. Jarke J. van Wijk. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.1138

Visualization Analysis and Design: Principles, Methods, and Practice. Tamara Munzner (Book Draft 2 from Sep. 2012). http://www.cs.ubc.ca/~tmm/courses/533-11/book/vispmp-draft.pdf

Week 14: Monday Dec. 9th - Graph Algorithms

TODO

Readings

1998 PageRank Paper
Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)
Mining of Massive Datasets, Chapter 5 (Link Analysis)
Pregel: A System for Large-Scale Graph Processing. Google.

Week 15 Monday Dec. 16th - Final Exam

Other topics

Provenance
