Difference between revisions of "Course: Big Data Analysis"

From VistrailsWiki
Line 45: Line 45:
* Databases and Big Data: Persistence, Querying, Indexing, Transactions
* Databases and Big Data: Persistence, Querying, Indexing, Transactions
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf
* In-class exercise on Map-Reduce (to be distributed in class)


=== Related Topics ===
=== Related Topics ===
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]
* HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do
* HBase book. Chapter 8 Architecture for information about transactional processing, WriteAhead Log notably, and how consistency is being maintained.
* Transactions in NoSQL stores. Google's percolator, [http://research.google.com/pubs/pub36726.html].
* Transactions in NoSQL stores. Google's percolator, [http://research.google.com/pubs/pub36726.html].
* "NewSQL" stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB],
* "NewSQL" stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB],
Line 68: Line 69:
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]


== Week 4:  Monday Sept 30th - Statistics is easy - Invited Speaker: Dennis Shasha ==
== Week 4:  Monday Sept 30th - Query Processing on MapReduce and High-level Languages ==


* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]
* Pig Latin and Query Processing:  
* Pig Latin and Query Processing:  
** [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/query_processing_relational  Relational query processing: Review]
** [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/query_processing_relational  Relational query processing: Review]
** [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/query_processing_pig_mapreduce.ppt.pdf  Query Processing in Pig]
** [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/query_processing_pig_mapreduce.ppt.pdf  Query Processing in Pig]
* In-class assignment


=== Required Reading ===
=== Required Reading ===
Line 79: Line 80:
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008


=== Homework Assignment ===
'''Due October 9th'''
[[BigDataHW1]]


== Week 5: Monday Oct. 8th - Finding Similar Items ==
== Week 5: Monday Oct. 7th Invited Speaker: Torsten Suel ==
* Similarity: Applications, Measures and Efficiency considerations
 
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf
* Big Data and Information Retrieval. Invited lecture by Torsten Suel.
* Similarity application: Information integration on the Web:
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/search-data.pdf
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf
 
* Homework presentation and demo
 
== Week 6: Mon Oct. 14th - Fall Break - No class ==


=== Required Reading ===
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]


=== Homework Assignment ===
'''Due October 15th at noon'''
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.


== Week 6: Wednesday Oct. 17th - Invited Speaker: Torsten Suel ==
== Week 7: Monday Oct. 22nd - Graph Algorithms ==
'''Note this class will be held on Wednesday!'''


* Big Data and Information Retrieval. Invited lecture by Torsten Suel.
TODO
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/search-data.pdf


=== Readings ===
=== Readings ===
Line 107: Line 99:
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]
== Week 8: Monday Oct 28th - Statistics is easy - Invited Speaker: Dennis Shasha ==
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]
== Week 9: Monday Nov 5th - EM and Text Processing ==
TODO


== Week 7:  Monday Oct. 22nd - Invited lecture by and Lauro Lins ==
* Introduction to Visualization
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro-to-visualization.pdf


=== Readings ===
=== Readings ===
The Value of Visualization. IEEE Visualization 2005. Jarke J. van Wijk. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.1138


Visualization Analysis and Design: Principles, Methods, and Practice. Tamara Munzner (Book Draft 2 from Sep. 2012). http://www.cs.ubc.ca/~tmm/courses/533-11/book/vispmp-draft.pdf
* Data-Intensive Text Processing with MapReduce, Chapter 6


== Week 8: Monday Oct 29th- Class canceled due to storm ==




== Week 9: Monday Nov 5th- Data infrastructure and information integration ==  
== Week 10: Monday Nov. 11th - Finding Similar Items and Information Integration ==
* Big Table, HadoopDB.
* Similarity: Applications, Measures and Efficiency considerations
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf
* Similarity application: Information integration on the Web:  
* Similarity application: Information integration on the Web:  
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf
* Homework presentation and demo
=== Required Reading ===
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]
=== Homework Assignment ===
'''Due November 17th'''
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.


=== Readings ===
* HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do
* HBase book. Chapter 8 Architecture for information about transactional processing, WriteAhead Log notably, and how consistency is being maintained.


== Week 10: Monday Nov. 12th  - Frequent Itemsets ==
== Week 11: Monday Nov 18th - Frequent Itemsets ==
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf


=== Readings ===
=== Required Reading ===
* Mining of Massive Datasets, Chapter 4
* Mining of Massive Datasets, Chapter 4
=== Homework Assignment ===
'''Due November 24th'''


=== Additional Reading ===
=== Additional Reading ===
Line 140: Line 148:
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813


== Week 11: Monday Nov 19th- Algorithms on MapReduce: text processing  ==


* Algorithms, link analysis, social networks
== Week 12: Monday Nov. 25th - Clustering ==
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf
* Discussion on the project
 
=== Readings ===
* Data-Intensive Text Processing with MapReduce, Chapter 4
 
== Week 12: Monday Nov. 26th - Graph Algorithms and Phase-I project presentations ==
 
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf
 
=== Readings ===
* Data-Intensive Text Processing with MapReduce, Chapter 4
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]
 
== Week 13: Monday Dec. 3rd - Clustering ==


* Lecture notes:  
* Lecture notes:  
** Graph algorithms: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-graph.pdf
** Graph algorithms: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-graph.pdf
**Clustering: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf
**Clustering: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf
=== Homework Assignment ===
'''Due Dec 1st'''


=== Readings ===
=== Readings ===
Line 169: Line 164:
* Information Retrieval. http://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf
* Information Retrieval. http://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf


== Week 14: Monday Dec. 10th - EM algorithms for text processing ==
== Further Readings ==
* [http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php Anomaly Detection: A Survey]
 
 
== Week 13: Monday Dec. 2nd - Invited lecture by Enrico Bertini ==
* Introduction to Visual Analytics
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro-to-visualization.pdf
 
=== Readings ===
The Value of Visualization. IEEE Visualization 2005. Jarke J. van Wijk. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.1138
 
Visualization Analysis and Design: Principles, Methods, and Practice. Tamara Munzner (Book Draft 2 from Sep. 2012). http://www.cs.ubc.ca/~tmm/courses/533-11/book/vispmp-draft.pdf
 
 
== Week 14: Monday Dec. 9th - Recommendation Systems ==


=== Readings ===
=== Readings ===
* Ullman chapter 9


* Data-Intensive Text Processing with MapReduce, Chapter 6




== Week 15  Monday Dec. 17 -  Phase-II Project presentation  ==


== Week 15  Monday Dec. 16th -  Final Exam ==


== Further Readings ==
* [http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php Anomaly Detection: A Survey]


== Other topics ==
== Other topics ==

Revision as of 19:51, 8 September 2013

Fall 2013

This schedule is tentative and subject to change

Make sure to check my.poly.edu for course announcements

Week 1: Monday Sept. 9th - Course Overview

Required Reading

Additional References

Week 2: Monday Sept. 16th - Map-Reduce/Hadoop

Required Reading

Additional References

Week 3: Monday Sept. 23rd - Data Management for Big Data

Related Topics

Required Reading

Additional References

Week 4: Monday Sept 30th - Query Processing on MapReduce and High-level Languages

Required Reading


Week 5: Monday Oct. 7th Invited Speaker: Torsten Suel


Week 6: Mon Oct. 14th - Fall Break - No class

Week 7: Monday Oct. 22nd - Graph Algorithms

TODO

Readings


Week 8: Monday Oct 28th - Statistics is easy - Invited Speaker: Dennis Shasha


Week 9: Monday Nov 5th - EM and Text Processing

TODO


Readings

  • Data-Intensive Text Processing with MapReduce, Chapter 6


Week 10: Monday Nov. 11th - Finding Similar Items and Information Integration

Required Reading

Homework Assignment

Due November 17th. Your assignment is in http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.


Week 11: Monday Nov 18th - Frequent Itemsets

Required Reading

  • Mining of Massive Datasets, Chapter 4

Homework Assignment

Due November 24th

Additional Reading


Week 12: Monday Nov. 25th - Clustering

Homework Assignment

Due Dec 1st

Readings

Further Readings


Week 13: Monday Dec. 2nd - Invited lecture by Enrico Bertini

Readings

The Value of Visualization. IEEE Visualization 2005. Jarke J. van Wijk. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.1138

Visualization Analysis and Design: Principles, Methods, and Practice. Tamara Munzner (Book Draft 2 from Sep. 2012). http://www.cs.ubc.ca/~tmm/courses/533-11/book/vispmp-draft.pdf


Week 14: Monday Dec. 9th - Recommendation Systems

Readings

  • Ullman chapter 9



Week 15 Monday Dec. 16th - Final Exam

Other topics

Provenance

Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.

Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science & Engineering, 2008.