<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://www.vistrails.org//api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Jsimeon</id>
	<title>VistrailsWiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://www.vistrails.org//api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Jsimeon"/>
	<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php/Special:Contributions/Jsimeon"/>
	<updated>2026-05-05T11:36:03Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.36.2</generator>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6920</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6920"/>
		<updated>2014-02-20T17:27:06Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
&lt;br /&gt;
February 10th, 2014:&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.] (Assigned to Group 1)&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8616&amp;amp;rep=rep1&amp;amp;type=pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Data Engineering 11 (1995): 190.] (Assigned to Group 6).&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.] (Assigned to Group 4).&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Optimizing Queries across Diverse Data Sources. Laura M. Haas, Donald Kossmann, Edward L. Wimmers and Jun Yang. Proceedings of the International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.] (Assigned to Group 3).&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf WebOQL: Restructuring documents, databases and webs. Gustavo O. Arocena, and Alberto O. Mendelzon. 14th International Conference on Data Engineering. IEEE, 1998.] (Assigned to Group 2).&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf A Data Transformation System for Biological Data Sources. Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95)] (Assigned to Group 7).&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf WebViews: accessing personalized web content and services. Juliana Freire, Bharat Kumar, and Daniel Lieuwen. Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.2620&amp;amp;rep=rep1&amp;amp;type=pdf Using schema matching to simplify heterogeneous data translation. Tova Milo and Sagit Zohar. VLDB. Vol. 98. 1998.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday February 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.] A classic database survey, and a must-read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday February 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* Query Compilation 1. Indexing and Storage&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Tuesday February 18th - Query Compilation 2 ==&lt;br /&gt;
&lt;br /&gt;
* Query Compilation and Rewriting&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6919</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6919"/>
		<updated>2014-02-20T17:17:10Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
&lt;br /&gt;
February 10th, 2014:&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.] (Assigned to Group 1)&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8616&amp;amp;rep=rep1&amp;amp;type=pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Data Engineering 11 (1995): 190.] (Assigned to Group 6).&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.] (Assigned to Group 4).&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Optimizing Queries across Diverse Data Sources. Laura M. Haas, Donald Kossmann, Edward L. Wimmers and Jun Yang. Proceedings of the International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.] (Assigned to Group 3).&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf WebOQL: Restructuring documents, databases and webs. Gustavo O. Arocena, and Alberto O. Mendelzon. 14th International Conference on Data Engineering. IEEE, 1998.] (Assigned to Group 2).&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf A Data Transformation System for Biological Data Sources. Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf WebViews: accessing personalized web content and services. Juliana Freire, Bharat Kumar, and Daniel Lieuwen. Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.2620&amp;amp;rep=rep1&amp;amp;type=pdf Using schema matching to simplify heterogeneous data translation. Tova Milo and Sagit Zohar. VLDB. Vol. 98. 1998.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday February 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.] A classic database survey, and a must-read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday February 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* Query Compilation 1. Indexing and Storage&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Tuesday February 18th - Query Compilation 2 ==&lt;br /&gt;
&lt;br /&gt;
* Query Compilation and Rewriting&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6834</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6834"/>
		<updated>2014-02-11T15:53:27Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
&lt;br /&gt;
February 10th, 2014:&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8616&amp;amp;rep=rep1&amp;amp;type=pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Data Engineering 11 (1995): 190.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Optimizing Queries across Diverse Data Sources. Laura M. Haas, Donald Kossmann, Edward L. Wimmers and Jun Yang. Proceedings of the International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf WebOQL: Restructuring documents, databases and webs. Gustavo O. Arocena, and Alberto O. Mendelzon. 14th International Conference on Data Engineering. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf A Data Transformation System for Biological Data Sources. Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf WebViews: accessing personalized web content and services. Juliana Freire, Bharat Kumar, and Daniel Lieuwen. Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.2620&amp;amp;rep=rep1&amp;amp;type=pdf Using schema matching to simplify heterogeneous data translation. Tova Milo and Sagit Zohar. VLDB. Vol. 98. 1998.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday February 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.] A classic database survey, and a must-read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday February 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* Query Compilation 1. Indexing and Storage&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Tuesday February 18th - Query Compilation 2 ==&lt;br /&gt;
&lt;br /&gt;
* Query Compilation and Rewriting&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6833</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6833"/>
		<updated>2014-02-11T15:53:00Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
&lt;br /&gt;
February 10th, 2014:&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8616&amp;amp;rep=rep1&amp;amp;type=pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Data Engineering 11 (1995): 190.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Optimizing Queries across Diverse Data Sources. Laura M. Haas, Donald Kossmann, Edward L. Wimmers and Jun Yang. Proceedings of the International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf WebOQL: Restructuring documents, databases and webs. Gustavo O. Arocena, and Alberto O. Mendelzon. 14th International Conference on Data Engineering. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf A Data Transformation System for Biological Data Sources. Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf WebViews: accessing personalized web content and services. Juliana Freire, Bharat Kumar, and Daniel Lieuwen. Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.2620&amp;amp;rep=rep1&amp;amp;type=pdf Using schema matching to simplify heterogeneous data translation. Tova Milo and Sagit Zohar. VLDB. Vol. 98. 1998.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday February 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.] A classic database survey, and a must-read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday February 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* Query Compilation 1. Indexing and Storage&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Tuesday February 18th - Query Compilation 2 ==&lt;br /&gt;
&lt;br /&gt;
* Query Compilation and Rewriting&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* Bigtable and NoSQL stores. Tuple stores vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book, ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture), for information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext Parallel DBMSs vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al., SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4: Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades-old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found on his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6823</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6823"/>
		<updated>2014-02-11T00:10:06Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 2:   Tuesday February. 11th - Query Compilation 1 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8616&amp;amp;rep=rep1&amp;amp;type=pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Data Engineering 11 (1995): 190.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th International Conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Optimizing Queries across Diverse Data Sources. Laura M. Haas, Donald Kossmann, Edward L. Wimmers and Jun Yang. Proceedings of the International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf WebOQL: Restructuring documents, databases and webs. Gustavo O. Arocena, and Alberto O. Mendelzon. 14th International Conference on Data Engineering. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf A Data Transformation System for Biological Data Sources. Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf WebViews: accessing personalized web content and services. Juliana Freire, Bharat Kumar, and Daniel Lieuwen. Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.2620&amp;amp;rep=rep1&amp;amp;type=pdf Using schema matching to simplify heterogeneous data translation. Tova Milo and Sagit Zohar. VLDB. Vol. 98. 1998.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.] A classic database survey and a must-read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Tuesday February. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* Query Compilation 1. Indexing and Storage&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* Bigtable and NoSQL stores. Tuple stores vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades-old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found on his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6822</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6822"/>
		<updated>2014-02-11T00:09:28Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8616&amp;amp;rep=rep1&amp;amp;type=pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Data Engineering 11 (1995): 190.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th International Conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Optimizing Queries across Diverse Data Sources. Laura M. Haas, Donald Kossmann, Edward L. Wimmers and Jun Yang. Proceedings of the International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf WebOQL: Restructuring documents, databases and webs. Gustavo O. Arocena, and Alberto O. Mendelzon. 14th International Conference on Data Engineering. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf A Data Transformation System for Biological Data Sources. Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf WebViews: accessing personalized web content and services. Juliana Freire, Bharat Kumar, and Daniel Lieuwen. Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.2620&amp;amp;rep=rep1&amp;amp;type=pdf Using schema matching to simplify heterogeneous data translation. Tova Milo and Sagit Zohar. VLDB. Vol. 98. 1998.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.] A classic database survey, and a must-read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture), for information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's book, draft 2, available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6821</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6821"/>
		<updated>2014-02-11T00:08:34Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8616&amp;amp;rep=rep1&amp;amp;type=pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Data Engineering 11 (1995): 190.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th International Conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Optimizing Queries across Diverse Data Sources. Laura M. Haas, Donald Kossmann, Edward L. Wimmers, and Jun Yang. Proceedings of the International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf WebOQL: Restructuring documents, databases and webs. Gustavo O. Arocena, and Alberto O. Mendelzon. 14th International Conference on Data Engineering. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf A Data Transformation System for Biological Data Sources. Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf &amp;quot;WebViews: accessing personalized web content and services.&amp;quot; Juliana Freire, Bharat Kumar, and Daniel Lieuwen. Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.2620&amp;amp;rep=rep1&amp;amp;type=pdf Using schema matching to simplify heterogeneous data translation. Tova Milo and Sagit Zohar. VLDB. Vol. 98. 1998.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.] A classic database survey, and a must-read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture), for information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6820</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6820"/>
		<updated>2014-02-11T00:07:41Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8616&amp;amp;rep=rep1&amp;amp;type=pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Data Engineering 11 (1995): 190.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th International Conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Optimizing Queries across Diverse Data Sources. Laura M. Haas, Donald Kossmann, Edward L. Wimmers, and Jun Yang. Proceedings of the International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf WebOQL: Restructuring documents, databases and webs. Gustavo O. Arocena, and Alberto O. Mendelzon. 14th International Conference on Data Engineering. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. A Data Transformation System for Biological Data Sources. In Proceedings of the 21th International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf Freire, Juliana, Bharat Kumar, and Daniel Lieuwen. &amp;quot;WebViews: accessing personalized web content and services.&amp;quot; Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.2620&amp;amp;rep=rep1&amp;amp;type=pdf Using schema matching to simplify heterogeneous data translation. Tova Milo and Sagit Zohar. VLDB. Vol. 98. 1998.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html original google map-reduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext Parallel DBMSs vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6819</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6819"/>
		<updated>2014-02-11T00:06:06Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8616&amp;amp;rep=rep1&amp;amp;type=pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Data Engineering 11 (1995): 190.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th International Conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Haas, Laura M., Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Proceedings of the 14th International Conference on Data Engineering. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. A Data Transformation System for Biological Data Sources. In Proceedings of the 21th International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf Freire, Juliana, Bharat Kumar, and Daniel Lieuwen. &amp;quot;WebViews: accessing personalized web content and services.&amp;quot; Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.2620&amp;amp;rep=rep1&amp;amp;type=pdf Using schema matching to simplify heterogeneous data translation. Tova Milo and Sagit Zohar. VLDB. Vol. 98. 1998.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html original google map-reduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext Parallel DBMSs vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6817</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6817"/>
		<updated>2014-02-11T00:03:23Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8616&amp;amp;rep=rep1&amp;amp;type=pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Data Engineering 11 (1995): 190.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Laura M. Haas, Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the... International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Data Engineering, 1998. Proceedings., 14th International Conference on. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. A Data Transformation System for Biological Data Sources. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf Freire, Juliana, Bharat Kumar, and Daniel Lieuwen. &amp;quot;WebViews: accessing personalized web content and services.&amp;quot; Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html original google map-reduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6816</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6816"/>
		<updated>2014-02-11T00:02:24Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8616&amp;amp;rep=rep1&amp;amp;type=pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Spyros Potamianos, Surajit Chaudhuri, Ravi Krishnamurthy, and Kyuseok Shim. Data Engineering 11 (1995): 190.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th International Conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Laura M. Haas, Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the... International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Data Engineering, 1998. Proceedings., 14th International Conference on. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. A Data Transformation System for Biological Data Sources. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf Freire, Juliana, Bharat Kumar, and Daniel Lieuwen. &amp;quot;WebViews: accessing personalized web content and services.&amp;quot; Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book (HBase: The Definitive Guide. Random Access to Your Planet-Size Data): http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6815</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6815"/>
		<updated>2014-02-11T00:00:26Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Optimizing Queries with Materialized Views. Spyros Potamianos, Surajit Chaudhuri, Ravi Krishnamurthy, and Kyuseok Shim. Data Engineering 11 (1995): 190.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th International Conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Laura M. Haas, Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the... International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Data Engineering, 1998. Proceedings., 14th International Conference on. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. A Data Transformation System for Biological Data Sources. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf Freire, Juliana, Bharat Kumar, and Daniel Lieuwen. &amp;quot;WebViews: accessing personalized web content and services.&amp;quot; Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book (HBase: The Definitive Guide. Random Access to Your Planet-Size Data): http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6814</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6814"/>
		<updated>2014-02-10T23:54:22Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf DeWitt, D. J., Naughton, J. F., &amp;amp; Burger, J. (1993, January). Nested loops revisited. In Parallel and Distributed Information Systems, 1993., Proceedings of the Second International Conference on (pp. 230-242). IEEE.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf G. N. Paulley and Per-Åke Larson. 1994. Exploiting Uniqueness in Query Optimization. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf A. Lerner and D. Shasha. AQuery: query language for ordered data, optimization techniques, and experiments. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf  Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1999&amp;amp;rep=rep1&amp;amp;type=pdf Chaudhuri, Surajit, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. &amp;quot;Optimizing Queries with Materialized Views.&amp;quot; Data Engineering 11 (1995): 190.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Haas, Laura M., Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the... International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Data Engineering, 1998. Proceedings., 14th International Conference on. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. A Data Transformation System for Biological Data Sources. In Proceedings of the 21th International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf Freire, Juliana, Bharat Kumar, and Daniel Lieuwen. &amp;quot;WebViews: accessing personalized web content and services.&amp;quot; Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html original google map-reduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6813</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6813"/>
		<updated>2014-02-10T23:50:20Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://gsl.azurewebsites.net/Portals/0/Users/dewitt/Papers/paralleldb/PDIS93.pdf DeWitt, D. J., Naughton, J. F., &amp;amp; Burger, J. (1993, January). Nested loops revisited. In Parallel and Distributed Information Systems, 1993., Proceedings of the Second International Conference on (pp. 230-242). IEEE.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf G. N. Paulley and Per-Åke Larson. 1994. Exploiting Uniqueness in Query Optimization. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf A. Lerner and D. Shasha. AQuery: query language for ordered data, optimization techniques, and experiments. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf  Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Haas, Laura M., Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the... International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Data Engineering, 1998. Proceedings., 14th International Conference on. IEEE, 1998.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. A Data Transformation System for Biological Data Sources. In Proceedings of the 21th International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf Freire, Juliana, Bharat Kumar, and Daniel Lieuwen. &amp;quot;WebViews: accessing personalized web content and services.&amp;quot; Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html original google map-reduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's book (draft 2), available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and Exam Review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6812</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6812"/>
		<updated>2014-02-10T23:45:21Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Haas, Laura M., Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the... International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf G. N. Paulley and Per-Åke Larson. 1994. Exploiting Uniqueness in Query Optimization. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Data Engineering, 1998. Proceedings., 14th International Conference on. IEEE, 1998.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf A. Lerner and D. Shasha. AQuery: query language for ordered data, optimization techniques, and experiments. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf  Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. A Data Transformation System for Biological Data Sources. In Proceedings of the 21th International Conference on Very Large Data Bases (VLDB '95)]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf Freire, Juliana, Bharat Kumar, and Daniel Lieuwen. &amp;quot;WebViews: accessing personalized web content and services.&amp;quot; Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom; see [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.] A classic database survey, and a must-read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can login and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'', http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's book (draft 2), available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and Exam Review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6811</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6811"/>
		<updated>2014-02-10T23:44:04Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Haas, Laura M., Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the... International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf G. N. Paulley and Per-Åke Larson. 1994. Exploiting Uniqueness in Query Optimization. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Data Engineering, 1998. Proceedings., 14th International Conference on. IEEE, 1998.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf A. Lerner and D. Shasha. AQuery: query language for ordered data, optimization techniques, and experiments. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. Algorithms for Deferred View Maintenance. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1232&amp;amp;rep=rep1&amp;amp;type=pdf Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. A Data Transformation System for Biological Data Sources. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95).]&lt;br /&gt;
# [http://www.ambuehler.ethz.ch/CDstore/www10/papers/pdf/p220.pdf Freire, Juliana, Bharat Kumar, and Daniel Lieuwen. &amp;quot;WebViews: accessing personalized web content and services.&amp;quot; Proceedings of the 10th international conference on World Wide Web. ACM, 2001.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* Book: HBase: The Definitive Guide (Random Access to Your Planet-Size Data): http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext Parallel DBMSs vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6809</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6809"/>
		<updated>2014-02-10T23:36:22Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Haas, Laura M., Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the... International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf G. N. Paulley and Per-Åke Larson. 1994. Exploiting Uniqueness in Query Optimization. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Data Engineering, 1998. Proceedings., 14th International Conference on. IEEE, 1998.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf A. Lerner and D. Shasha. AQuery: query language for ordered data, optimization techniques, and experiments. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. Algorithms for Deferred View Maintenance. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* Book: HBase: The Definitive Guide (Random Access to Your Planet-Size Data): http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext Parallel DBMSs vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* The Value of Visualization, Jarke van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
* Tamara Munzner's book (draft 2), available online: http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
* Nanocubes paper: http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
** Project site: http://nanocubes.net&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* imMens paper (to contrast with Nanocubes): http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6808</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6808"/>
		<updated>2014-02-10T23:30:32Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Haas, Laura M., Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the 23rd International Conference on Very Large Data Bases. Morgan Kaufmann, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf G. N. Paulley and Per-Åke Larson. 1994. Exploiting Uniqueness in Query Optimization. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Data Engineering, 1998. Proceedings., 14th International Conference on. IEEE, 1998.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf A. Lerner and D. Shasha. AQuery: query language for ordered data, optimization techniques, and experiments. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
# [http://homepages.inf.ed.ac.uk/libkin/papers/sigmod96b.pdf Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick, and Howard Trickey. Algorithms for Deferred View Maintenance. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual or plagiarise their work. For more details, see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's [http://research.google.com/pubs/pub36726.html Percolator].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext MapReduce and Parallel DBMSs: Friends or Foes?]&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext MapReduce: A Flexible Data Processing Tool]&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is available at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is available at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* The Value of Visualization, Jarke van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
* Tamara Munzner's book (draft 2), available online: http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
* Nanocubes paper: http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
** Project site: http://nanocubes.net&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* imMens paper (to contrast with Nanocubes): http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6807</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6807"/>
		<updated>2014-02-10T23:22:49Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Haas, Laura M., Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the 23rd International Conference on Very Large Data Bases. Morgan Kaufmann, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf G. N. Paulley and Per-Åke Larson. 1994. Exploiting Uniqueness in Query Optimization. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Data Engineering, 1998. Proceedings., 14th International Conference on. IEEE, 1998.]&lt;br /&gt;
# [http://www.vldb.org/conf/2003/papers/S11P03.pdf A. Lerner and D. Shasha. AQuery: query language for ordered data, optimization techniques, and experiments. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual or plagiarise their work. For more details, see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's [http://research.google.com/pubs/pub36726.html Percolator].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext MapReduce and Parallel DBMSs: Friends or Foes?]&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext MapReduce: A Flexible Data Processing Tool]&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6806</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6806"/>
		<updated>2014-02-10T23:11:24Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Haas, Laura M., Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the... International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf G. N. Paulley and Per-Åke Larson. 1994. Exploiting Uniqueness in Query Optimization. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9263&amp;amp;rep=rep1&amp;amp;type=pdf Arocena, Gustavo O., and Alberto O. Mendelzon. &amp;quot;WebOQL: Restructuring documents, databases and webs.&amp;quot; Data Engineering, 1998. Proceedings., 14th International Conference on. IEEE, 1998.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* Book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6805</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6805"/>
		<updated>2014-02-10T23:07:58Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Haas, Laura M., Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the... International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.6493&amp;amp;rep=rep1&amp;amp;type=pdf G. N. Paulley and Per-Åke Larson. 1994. Exploiting Uniqueness in Query Optimization. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* Book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6804</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6804"/>
		<updated>2014-02-10T22:53:41Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
# [http://ilpubs.stanford.edu:8090/262/1/1997-49.pdf Haas, Laura M., Donald Kossmann, Edward L. Wimmers, and Jun Yang. &amp;quot;Optimizing Queries across Diverse Data Sources.&amp;quot; Proceedings of the... International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book (HBase: The Definitive Guide. Random Access to Your Planet-Size Data): http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6802</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6802"/>
		<updated>2014-02-10T22:40:54Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Reading Assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
# [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
# [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* Bigtable and NoSQL stores. Tuple stores vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book (HBase: The Definitive Guide. Random Access to Your Planet-Size Data): http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6801</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6801"/>
		<updated>2014-02-10T22:40:11Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Reading Assignment ==&lt;br /&gt;
&lt;br /&gt;
Here is the list of selected papers for the reading assignment:&lt;br /&gt;
&lt;br /&gt;
* [http://www.vldb.org/conf/2002/S17P02.pdf Popa, L., Velegrakis, Y., Hernández, M. A., Miller, R. J., &amp;amp; Fagin, R. (2002, August). Translating web data. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 598-609). VLDB Endowment.]&lt;br /&gt;
* [http://www.researchgate.net/publication/2851695_Accelerating_XPath_Location_Steps/file/d912f50a27052d7d26.pdf Grust, Torsten. &amp;quot;Accelerating XPath location steps.&amp;quot; Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002.]&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* Bigtable and NoSQL stores. Tuple stores vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book (HBase: The Definitive Guide. Random Access to Your Planet-Size Data): http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6800</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6800"/>
		<updated>2014-02-10T22:31:51Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Additional References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf Graefe, Goetz. &amp;quot;Query evaluation techniques for large databases.&amp;quot; ACM Computing Surveys (CSUR) 25.2 (1993): 73-169.]. A classic database survey, and a must read for anyone serious about data processing.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al., SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6796</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6796"/>
		<updated>2014-02-10T15:12:26Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* News */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Wiki is now up-to-date&lt;br /&gt;
* Added research papers for reading assignment&lt;br /&gt;
* Added slides for lecture 1 &amp;amp; 2&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al., SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6752</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6752"/>
		<updated>2014-02-05T15:02:09Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 2:   Tuesday February. 11th - Query Compilation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation 1 ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html original google map-reduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* The Value of Visualization, Jarke van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
* Tamara Munzner's book (draft 2, available online): http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
* Nanocubes paper: http://nanocubes.net and http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* imMens paper (to contrast with nanocubes): http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6751</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6751"/>
		<updated>2014-02-05T15:01:56Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2: Tuesday Feb. 11th - Query Compilation ==&lt;br /&gt;
&lt;br /&gt;
* &lt;br /&gt;
* Lecture notes: &lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html original google map-reduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* The Value of Visualization, Jarke van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
* Tamara Munzner's book (draft 2, available online): http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
* Nanocubes paper: http://nanocubes.net and http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* imMens paper (to contrast with nanocubes): http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6750</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6750"/>
		<updated>2014-02-05T15:00:30Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 1: Tuesday Feb 4th - Course Overview */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks ===&lt;br /&gt;
&lt;br /&gt;
* Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke [http://pages.cs.wisc.edu/~dbbook/ Database Management Systems]&lt;br /&gt;
* Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, see the [http://infolab.stanford.edu/~ullman/dscb.html Database Systems: The Complete Book]&lt;br /&gt;
&lt;br /&gt;
* Guido Moerkotte's free book on query compilation and optimization: [http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf Query Compilers]&lt;br /&gt;
* Principles of Data Integration by AnHai Doan, Alon Halevy, and Zachary Ives. Reference at: [http://booksite.elsevier.com/9780124160446/?ISBN=9780124160446 Principles of Data Integration]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Data_integration Data Integration (Wikipedia)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Enterprise_Information_Integration Enterprise Information Integration (Wikipedia)]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html original google map-reduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's [http://research.google.com/pubs/pub36726.html Percolator].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* [http://kowshik.github.com/JPregel/pregel_paper.pdf Pregel: A System for Large-Scale Graph Processing] (Google)&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6749</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6749"/>
		<updated>2014-02-05T14:36:48Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Spring 2014 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== NYU School of Engineering. CS6093: Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks Reading ===&lt;br /&gt;
&lt;br /&gt;
* []&lt;br /&gt;
* []&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* []&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's [http://research.google.com/pubs/pub36726.html Percolator].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* [http://kowshik.github.com/JPregel/pregel_paper.pdf Pregel: A System for Large-Scale Graph Processing] (Google)&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6748</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6748"/>
		<updated>2014-02-05T14:36:19Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Advanced Database Systems (CS6093)&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf Syllabus (pdf)]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks Reading ===&lt;br /&gt;
&lt;br /&gt;
* []&lt;br /&gt;
* []&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* []&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's [http://research.google.com/pubs/pub36726.html Percolator].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6747</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6747"/>
		<updated>2014-02-05T14:35:00Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Spring 2014 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Syllabus for this semester: [http://www.vistrails.org/images/Syllabus.pdf]&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks Reading ===&lt;br /&gt;
&lt;br /&gt;
* []&lt;br /&gt;
* []&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* []&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=File:Syllabus.pdf&amp;diff=6746</id>
		<title>File:Syllabus.pdf</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=File:Syllabus.pdf&amp;diff=6746"/>
		<updated>2014-02-05T14:34:28Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6745</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6745"/>
		<updated>2014-02-05T14:34:15Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Spring 2014 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
Syllabus:&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks Reading ===&lt;br /&gt;
&lt;br /&gt;
* []&lt;br /&gt;
* []&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* []&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found on his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* The Value of Visualization, Jarke Van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
* Tamara Munzner's book, draft 2, available online: http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
* Nanocubes paper&lt;br /&gt;
** Project page: http://nanocubes.net&lt;br /&gt;
** Preprint: http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* imMens paper (to contrast with nanocubes): http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6744</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6744"/>
		<updated>2014-02-05T14:33:47Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 1: Tuesday Feb 4th - Course Overview */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf&lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks Reading ===&lt;br /&gt;
&lt;br /&gt;
* []&lt;br /&gt;
* []&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* []&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual or plagiarise their work. For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* Bigtable and NoSQL stores. Tuple stores vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found on his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* The Value of Visualization, Jarke Van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
* Tamara Munzner's book, draft 2, available online: http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
* Nanocubes paper&lt;br /&gt;
** Project page: http://nanocubes.net&lt;br /&gt;
** Preprint: http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* imMens paper (to contrast with nanocubes): http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6742</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6742"/>
		<updated>2014-02-04T20:40:08Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 1: Tuesday Jan. 28th - Course Overview */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Feb 4th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~jsimeon/courses/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks Reading ===&lt;br /&gt;
&lt;br /&gt;
* []&lt;br /&gt;
* []&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr New York Times' &amp;quot;How Big Data Became So Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ Big Data Analytics Use Cases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual or plagiarise their work. For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* Bigtable and NoSQL stores. Tuple stores vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=File:ADB-Intro-Class1.pdf&amp;diff=6741</id>
		<title>File:ADB-Intro-Class1.pdf</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=File:ADB-Intro-Class1.pdf&amp;diff=6741"/>
		<updated>2014-02-04T20:31:27Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: uploaded a new version of &amp;quot;File:ADB-Intro-Class1.pdf&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;First Class Advanced Databases Spring 2014&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=File:ADB-Intro-Class1.pdf&amp;diff=6740</id>
		<title>File:ADB-Intro-Class1.pdf</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=File:ADB-Intro-Class1.pdf&amp;diff=6740"/>
		<updated>2014-02-04T20:28:33Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: First Class Advanced Databases Spring 2014&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;First Class Advanced Databases Spring 2014&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6689</id>
		<title>Course: Advanced Databases</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Advanced_Databases&amp;diff=6689"/>
		<updated>2014-01-28T18:52:59Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: Created page with '== Spring 2014 ==  '''''This schedule is tentative and subject to change'''''  '''''Make sure to check my.poly.edu for course announcements'''''  == News == *   For frequently as…'&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Spring 2014 ==&lt;br /&gt;
&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Tuesday Jan. 28th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~jsimeon/courses/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/forms/d/1Jb2OTeZ0CmF4Tg0IQ-3ITgYy4WEpVcsm99ZnyIh_Ofk/viewform Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Textbooks Reading ===&lt;br /&gt;
&lt;br /&gt;
* [, Chapter 1]&lt;br /&gt;
* [ Data-Intensive Text Processing with MapReduce, Chapter 1]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr New York Times' &amp;quot;How BigData Became so Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ BigData Analytics Usecases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture), for information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al., SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Optional Quiz ===&lt;br /&gt;
'''Due Dec 9th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - EM and exam review ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf &lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6445</id>
		<title>Course: Big Data Analysis</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6445"/>
		<updated>2013-11-29T01:29:53Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Additional Reading */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Fall 2013 ==&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Assignment on MapReduce and Pig is due on Dec 1st. Please see http://my.poly.edu&lt;br /&gt;
&lt;br /&gt;
* Nov 7th: New quizzes have been assigned. Please see http://www.newgradiance.com/services/servlet/COTC&lt;br /&gt;
The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.&lt;br /&gt;
&lt;br /&gt;
* Dr. C. Mohan's presentation is now available at http://bit.ly/CMnMDS &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Monday Sept. 9th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction to Big Data Analysis&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&amp;amp;formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://i.stanford.edu/~ullman/mmds/book.pdf Mining of Massive Datasets, Chapter 1]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 1]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr New York Times' &amp;quot;How BigData Became so Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ BigData Analytics Usecases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture), for information about transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al., SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's book (draft 2), available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 8th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - - Clustering ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* Chapter 7 (Clustering), Mining of Massive Datasets&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 15th'''&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Other topics ==&lt;br /&gt;
===Provenance ===&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/vistrails-reproducibility2012.pdf Making Computations and Publications Reproducible with VisTrails]&lt;br /&gt;
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.&lt;br /&gt;
&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/freire-cise2008.pdf Provenance for Computational Tasks: A Survey]&lt;br /&gt;
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science &amp;amp; Engineering, 2008.&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6444</id>
		<title>Course: Big Data Analysis</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6444"/>
		<updated>2013-11-29T01:26:23Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 14: Monday Dec. 9th - - Clustering */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Fall 2013 ==&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Assignment on MapReduce and Pig, due on Dec 1st. Please see http://my.poly.edu&lt;br /&gt;
&lt;br /&gt;
* Nov 7th: New quizzes have been assigned. Please see http://www.newgradiance.com/services/servlet/COTC&lt;br /&gt;
The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.&lt;br /&gt;
&lt;br /&gt;
* Dr. C. Mohan's presentation is now available at http://bit.ly/CMnMDS &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Monday Sept. 9th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction to Big Data Analysis&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&amp;amp;formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://i.stanford.edu/~ullman/mmds/book.pdf Mining of Massive Datasets, Chapter 1]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 1]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr New York Times: &amp;quot;How Big Data Became So Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ Big Data Analytics Use Cases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* Bigtable and NoSQL stores. Tuple stores vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext Parallel DBMSs vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available on Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's book (draft 2), available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=198467341&amp;amp;CFTOKEN=23537886&amp;amp;__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 8th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - - Clustering ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* Chapter 7 (Clustering), Mining of Massive Datasets&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 15th'''&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Other topics ==&lt;br /&gt;
===Provenance ===&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/vistrails-reproducibility2012.pdf Making Computations and Publications Reproducible with VisTrails]&lt;br /&gt;
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.&lt;br /&gt;
&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/freire-cise2008.pdf Provenance for Computational Tasks: A Survey]&lt;br /&gt;
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science &amp;amp; Engineering, 2008.&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6443</id>
		<title>Course: Big Data Analysis</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6443"/>
		<updated>2013-11-29T01:24:51Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 13: Monday Dec. 2nd - Frequent Itemsets */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Fall 2013 ==&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Assignment on MapReduce and Pig, due on Dec 1st. Please see http://my.poly.edu&lt;br /&gt;
&lt;br /&gt;
* Nov 7th: New quizzes have been assigned. Please see http://www.newgradiance.com/services/servlet/COTC&lt;br /&gt;
The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.&lt;br /&gt;
&lt;br /&gt;
* Dr. C. Mohan's presentation is now available at http://bit.ly/CMnMDS &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Monday Sept. 9th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction to Big Data Analysis&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&amp;amp;formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://i.stanford.edu/~ullman/mmds/book.pdf Mining of Massive Datasets, Chapter 1]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 1]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr New York Times: &amp;quot;How Big Data Became So Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ Big Data Analytics Use Cases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext Parallel DBMSs vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found on his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th- Statistics is easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov 18th- MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=198467341&amp;amp;CFTOKEN=23537886&amp;amp;__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 8th'''&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - Clustering ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* Chapter 7 (Clustering), Mining of Massive Datasets&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 15th'''&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Other topics ==&lt;br /&gt;
===Provenance ===&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/vistrails-reproducibility2012.pdf Making Computations and Publications Reproducible with VisTrails]&lt;br /&gt;
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.&lt;br /&gt;
&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/freire-cise2008.pdf Provenance for Computational Tasks: A Survey]&lt;br /&gt;
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science &amp;amp; Engineering, 2008.&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6442</id>
		<title>Course: Big Data Analysis</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6442"/>
		<updated>2013-11-29T01:19:06Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 11: Monday Nov 18th- MapReduce Algorithm Design and Graph Processing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Fall 2013 ==&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Assignment on Mapreduce and Pig, due on Dec 1st. Please see http://my.poly.edu&lt;br /&gt;
&lt;br /&gt;
* Nov 7th: New quizzes have been assigned. Please see http://www.newgradiance.com/services/servlet/COTC&lt;br /&gt;
The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.&lt;br /&gt;
&lt;br /&gt;
* Dr. C. Mohan's presentation is now available at http://bit.ly/CMnMDS &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Monday Sept. 9th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction to Big Data Analysis&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&amp;amp;formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://i.stanford.edu/~ullman/mmds/book.pdf Mining of Massive Datasets, Chapter 1]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 1]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr New York Times' &amp;quot;How Big Data Became So Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ Big Data Analytics Use Cases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext Parallel DBMSs vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found on his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th- Statistics is easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov 18th- MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=198467341&amp;amp;CFTOKEN=23537886&amp;amp;__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 8th'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - Clustering ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* Chapter 7 (Clustering), Mining of Massive Datasets&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 15th'''&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Other topics ==&lt;br /&gt;
===Provenance ===&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/vistrails-reproducibility2012.pdf Making Computations and Publications Reproducible with VisTrails]&lt;br /&gt;
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.&lt;br /&gt;
&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/freire-cise2008.pdf Provenance for Computational Tasks: A Survey]&lt;br /&gt;
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science &amp;amp; Engineering, 2008.&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6441</id>
		<title>Course: Big Data Analysis</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6441"/>
		<updated>2013-11-29T01:18:10Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 11: Monday Nov 18th- MapReduce Algorithm Design and Graph Processing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Fall 2013 ==&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Assignment on Mapreduce and Pig, due on Dec 1st. Please see http://my.poly.edu&lt;br /&gt;
&lt;br /&gt;
* Nov 7th: New quizzes have been assigned. Please see http://www.newgradiance.com/services/servlet/COTC&lt;br /&gt;
The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.&lt;br /&gt;
&lt;br /&gt;
* Dr. C. Mohan's presentation is now available at http://bit.ly/CMnMDS &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Monday Sept. 9th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction to Big Data Analysis&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&amp;amp;formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://i.stanford.edu/~ullman/mmds/book.pdf Mining of Massive Datasets, Chapter 1]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 1]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr New York Times' &amp;quot;How Big Data Became So Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ BigData Analytics Usecases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* Bigtable and NoSQL stores. Tuple stores vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th- Statistics is easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov 18th- MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* Chapter 5, Data-Intensive Text Processing with MapReduce&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=198467341&amp;amp;CFTOKEN=23537886&amp;amp;__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 8th'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - - Clustering ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* Chapter 7 (Clustering), Mining of Massive Datasets&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 15th'''&lt;br /&gt;
&lt;br /&gt;
== Week 15  Monday Dec. 16th -  Final Exam ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Other topics ==&lt;br /&gt;
===Provenance ===&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/vistrails-reproducibility2012.pdf Making Computations and Publications Reproducible with VisTrails]&lt;br /&gt;
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.&lt;br /&gt;
&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/freire-cise2008.pdf Provenance for Computational Tasks: A Survey]&lt;br /&gt;
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science &amp;amp; Engineering, 2008.&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6440</id>
		<title>Course: Big Data Analysis</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6440"/>
		<updated>2013-11-29T01:17:57Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 10: Monday Nov. 11th  - MapReduce Algorithm Design */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Fall 2013 ==&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Assignment on MapReduce and Pig, due on Dec 1st. Please see http://my.poly.edu&lt;br /&gt;
&lt;br /&gt;
* Nov 7th: New quizzes have been assigned. Please see http://www.newgradiance.com/services/servlet/COTC&lt;br /&gt;
The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.&lt;br /&gt;
&lt;br /&gt;
* Dr. C. Mohan's presentation is now available at http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Monday Sept. 9th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction to Big Data Analysis&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&amp;amp;formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://i.stanford.edu/~ullman/mmds/book.pdf Mining of Massive Datasets, Chapter 1]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 1]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr New York Times' &amp;quot;How Big Data Became So Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ BigData Analytics Usecases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* Bigtable and NoSQL stores. Tuple stores vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: ''HBase: The Definitive Guide. Random Access to Your Planet-Size Data'': http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores: Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th- Statistics is easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov 18th- MapReduce Algorithm Design and Graph Processing == &lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* Chapter 5, Data-Intensive Text Processing with MapReduce&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=198467341&amp;amp;CFTOKEN=23537886&amp;amp;__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 8th'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - Clustering ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* Chapter 7 (Clustering), Mining of Massive Data Sets&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 15th'''&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Other topics ==&lt;br /&gt;
=== Provenance ===&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/vistrails-reproducibility2012.pdf Making Computations and Publications Reproducible with VisTrails]&lt;br /&gt;
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.&lt;br /&gt;
&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/freire-cise2008.pdf Provenance for Computational Tasks: A Survey]&lt;br /&gt;
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science &amp;amp; Engineering, 2008.&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6439</id>
		<title>Course: Big Data Analysis</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6439"/>
		<updated>2013-11-29T01:01:53Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Fall 2013 ==&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Assignment on Mapreduce and Pig, due on Dec 1st. Please see http://my.poly.edu&lt;br /&gt;
&lt;br /&gt;
* Nov 7th: New quizzes have been assigned. Please see http://www.newgradiance.com/services/servlet/COTC&lt;br /&gt;
The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.&lt;br /&gt;
&lt;br /&gt;
* Dr. C Mohan's presentation is now available at http://bit.ly/CMnMDS &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Monday Sept. 9th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction to Big Data Analysis&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&amp;amp;formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://i.stanford.edu/~ullman/mmds/book.pdf Mining of Massive Datasets, Chapter 1]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 1]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr The New York Times: &amp;quot;How Big Data Became So Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ BigData Analytics Usecases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book. Chapter 8 Architecture for information about transactional processing, WriteAhead Log notably, and how consistency is being maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* Chapter 5, Data-Intensive Text Processing with MapReduce&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=198467341&amp;amp;CFTOKEN=23537886&amp;amp;__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 8th'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - Clustering ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* Chapter 7 (Clustering), Mining of Massive Data Sets&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 15th'''&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Other topics ==&lt;br /&gt;
=== Provenance ===&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/vistrails-reproducibility2012.pdf Making Computations and Publications Reproducible with VisTrails]&lt;br /&gt;
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.&lt;br /&gt;
&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/freire-cise2008.pdf Provenance for Computational Tasks: A Survey]&lt;br /&gt;
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science &amp;amp; Engineering, 2008.&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6438</id>
		<title>Course: Big Data Analysis</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6438"/>
		<updated>2013-11-29T00:24:13Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Fall 2013 ==&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
* Assignment on Mapreduce and Pig, due on Dec 1st. Please see http://my.poly.edu&lt;br /&gt;
&lt;br /&gt;
* Nov 7th: New quizzes have been assigned. Please see http://www.newgradiance.com/services/servlet/COTC&lt;br /&gt;
The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.&lt;br /&gt;
&lt;br /&gt;
* Dr. C Mohan's presentation is now available at http://bit.ly/CMnMDS &lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Monday Sept. 9th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction to Big Data Analysis&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&amp;amp;formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://i.stanford.edu/~ullman/mmds/book.pdf Mining of Massive Datasets, Chapter 1]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 1]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr The New York Times: &amp;quot;How Big Data Became So Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ BigData Analytics Usecases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Apache [http://hadoop.apache.org/ Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://code.google.com/p/jaql/ Jaql], [http://mahout.apache.org/ Mahout], BigInsights&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book. Chapter 8 Architecture for information about transactional processing, WriteAhead Log notably, and how consistency is being maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==&lt;br /&gt;
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf &lt;br /&gt;
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==&lt;br /&gt;
&lt;br /&gt;
* Inside MongoDB&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
* Introduction to Provenance&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* We will cover the material planned for &amp;quot;Week 10: Monday Nov. 11th&amp;quot;: Finding Similar Items&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Nov 15th, 2013'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov 18th - MapReduce Algorithm Design and Graph Processing ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* Chapter 5, Data-Intensive Text Processing with MapReduce&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==&lt;br /&gt;
&lt;br /&gt;
* Invited lectures by:&lt;br /&gt;
** Dr. Lauro Lins (AT&amp;amp;T Research)&lt;br /&gt;
** Dr. Huy Vo (NYU Center for Urban Science and Progress)&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf&lt;br /&gt;
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
The Value of Visualization, Jarke Van Wijk&lt;br /&gt;
http://www.win.tue.nl/~vanwijk/vov.pdf&lt;br /&gt;
&lt;br /&gt;
Tamara Munzner's Book draft 2 available online&lt;br /&gt;
http://www.cs.ubc.ca/~tmm/courses/533/book/&lt;br /&gt;
&lt;br /&gt;
Nanocubes Paper&lt;br /&gt;
http://nanocubes.net&lt;br /&gt;
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
imMens Paper (to contrast with nanocubes)&lt;br /&gt;
http://vis.stanford.edu/papers/immens&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=198467341&amp;amp;CFTOKEN=23537886&amp;amp;__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 8th'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - Clustering ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* Chapter 7 (Clustering), Mining of Massive Data Sets&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 15th'''&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Other topics ==&lt;br /&gt;
===Provenance ===&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/vistrails-reproducibility2012.pdf Making Computations and Publications Reproducible with VisTrails]&lt;br /&gt;
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.&lt;br /&gt;
&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/freire-cise2008.pdf Provenance for Computational Tasks: A Survey]&lt;br /&gt;
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science &amp;amp; Engineering, 2008.&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6188</id>
		<title>Course: Big Data Analysis</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6188"/>
		<updated>2013-10-02T19:46:51Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: /* Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Fall 2013 ==&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
&lt;br /&gt;
On September 30th, our class will meet at a different place: 1 Metrotech Center, 19th floor.&lt;br /&gt;
Bring your NYU Poly id -- you will need to show it to the security guard.&lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Monday Sept. 9th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction to Big Data Analysis&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&amp;amp;formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://i.stanford.edu/~ullman/mmds/book.pdf Mining of Massive Datasets, Chapter 1]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 1]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr New York Times' &amp;quot;How BigData Became so Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ BigData Analytics Usecases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Introduction to [http://hadoop.apache.org/Hadoop Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://code.google.com/p/jaql/ Jaql], [http://mahout.apache.org/ Mahout], BigInsights&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found on his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Torsten Suel ==&lt;br /&gt;
&lt;br /&gt;
* Big Data and Information Retrieval. Invited lecture by Torsten Suel.&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/search-data.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov 4th - EM and Text Processing ==&lt;br /&gt;
&lt;br /&gt;
TODO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
* Data-Intensive Text Processing with MapReduce, Chapter 6&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - Finding Similar Items and Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due November 17th'''&lt;br /&gt;
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov 18th - Frequent Itemsets ==&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* Mining of Massive Datasets, Chapter 4&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due November 24th'''&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=198467341&amp;amp;CFTOKEN=23537886&amp;amp;__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Clustering ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** Graph algorithms: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-graph.pdf&lt;br /&gt;
** Clustering: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 1st'''&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* Mining of Massive Datasets, Chapter 7&lt;br /&gt;
* See readings for previous class&lt;br /&gt;
* Web Mining, by Bing Liu. http://www.cs.uic.edu/~liub/WebMiningBook.html&lt;br /&gt;
* Information Retrieval. http://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf&lt;br /&gt;
&lt;br /&gt;
== Further Readings ==&lt;br /&gt;
* [http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php Anomaly Detection: A Survey]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Invited lecture by Enrico Bertini ==&lt;br /&gt;
* Introduction to Visual Analytics&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro-to-visualization.pdf&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
The Value of Visualization. IEEE Visualization 2005. Jarke J. van Wijk. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.1138&lt;br /&gt;
&lt;br /&gt;
Visualization Analysis and Design: Principles, Methods, and Practice. Tamara Munzner (Book Draft 2 from Sep. 2012). http://www.cs.ubc.ca/~tmm/courses/533-11/book/vispmp-draft.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - Graph Algorithms ==&lt;br /&gt;
&lt;br /&gt;
TODO&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Other topics ==&lt;br /&gt;
===Provenance ===&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/vistrails-reproducibility2012.pdf Making Computations and Publications Reproducible with VisTrails]&lt;br /&gt;
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.&lt;br /&gt;
&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/freire-cise2008.pdf Provenance for Computational Tasks: A Survey]&lt;br /&gt;
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science &amp;amp; Engineering, 2008.&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6187</id>
		<title>Course: Big Data Analysis</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_Analysis&amp;diff=6187"/>
		<updated>2013-10-02T19:46:14Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Fall 2013 ==&lt;br /&gt;
'''''This schedule is tentative and subject to change'''''&lt;br /&gt;
&lt;br /&gt;
'''''Make sure to check my.poly.edu for course announcements'''''&lt;br /&gt;
&lt;br /&gt;
== News ==&lt;br /&gt;
&lt;br /&gt;
On September 30th, our class will meet at a different place: 1 Metrotech Center, 19th floor.&lt;br /&gt;
Bring your NYU Poly id -- you will need to show it to the security guard.&lt;br /&gt;
&lt;br /&gt;
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].&lt;br /&gt;
&lt;br /&gt;
== Week 1: Monday Sept. 9th - Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* Course overview and introduction to Big Data Analysis&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf &lt;br /&gt;
* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&amp;amp;formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
* [http://i.stanford.edu/~ullman/mmds/book.pdf Mining of Massive Datasets, Chapter 1]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 1]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]&lt;br /&gt;
* [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr New York Times' &amp;quot;How BigData Became so Big&amp;quot;]&lt;br /&gt;
* [http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf World Economic Forum: Big Data, Big Impact]&lt;br /&gt;
* [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]&lt;br /&gt;
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ BigData Analytics Usecases]&lt;br /&gt;
&lt;br /&gt;
== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==&lt;br /&gt;
&lt;br /&gt;
* Introduction to Map-Reduce and high-level data processing languages&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf &lt;br /&gt;
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].&lt;br /&gt;
* Introduction to [http://hadoop.apache.org/Hadoop Hadoop]&lt;br /&gt;
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://code.google.com/p/jaql/ Jaql], [http://mahout.apache.org/ Mahout], BigInsights&lt;br /&gt;
&lt;br /&gt;
=== Assignment ===&lt;br /&gt;
&lt;br /&gt;
* [[cs9223 Mapreduce Assignment]]&lt;br /&gt;
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.&lt;br /&gt;
For more details see http://cis.poly.edu/policies.&lt;br /&gt;
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''&lt;br /&gt;
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Original Google MapReduce paper]&lt;br /&gt;
&lt;br /&gt;
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Databases and Big Data: Persistence, Querying, Indexing, Transactions&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
&lt;br /&gt;
=== Related Topics ===&lt;br /&gt;
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]&lt;br /&gt;
* HBase book: HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do&lt;br /&gt;
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.&lt;br /&gt;
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].&lt;br /&gt;
* &amp;quot;NewSQL&amp;quot; stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB]&lt;br /&gt;
* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDBMS vs. MapReduce]&lt;br /&gt;
* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]&lt;br /&gt;
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_&lt;br /&gt;
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]&lt;br /&gt;
* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]&lt;br /&gt;
* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]&lt;br /&gt;
* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]&lt;br /&gt;
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]&lt;br /&gt;
&lt;br /&gt;
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==&lt;br /&gt;
&lt;br /&gt;
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''&lt;br /&gt;
&lt;br /&gt;
* Tutorial: An In-Depth Look at Modern Database Systems&lt;br /&gt;
&lt;br /&gt;
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.&lt;br /&gt;
&lt;br /&gt;
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.&lt;br /&gt;
&lt;br /&gt;
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found on his home page at http://bit.ly/CMohan&lt;br /&gt;
&lt;br /&gt;
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==&lt;br /&gt;
&lt;br /&gt;
* Pig Latin and Query Processing: &lt;br /&gt;
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]&lt;br /&gt;
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]&lt;br /&gt;
* In-class assignment&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
&lt;br /&gt;
=== Additional References ===&lt;br /&gt;
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]&lt;br /&gt;
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]&lt;br /&gt;
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]&lt;br /&gt;
&lt;br /&gt;
== Week 6:  Mon Oct. 14th - Fall Break - No class ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 7:  Monday Oct. 21st - Invited Speaker: Torsten Suel ==&lt;br /&gt;
&lt;br /&gt;
* Big Data and Information Retrieval. Invited lecture by Torsten Suel.&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/search-data.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 8: Monday Oct 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==&lt;br /&gt;
&lt;br /&gt;
* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students &lt;br /&gt;
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008&lt;br /&gt;
&lt;br /&gt;
== Week 9: Monday Nov 4th - EM and Text Processing ==&lt;br /&gt;
&lt;br /&gt;
TODO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
&lt;br /&gt;
* Data-Intensive Text Processing with MapReduce, Chapter 6&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 10: Monday Nov. 11th  - Finding Similar Items and Information Integration ==&lt;br /&gt;
* Similarity: Applications, Measures and Efficiency considerations&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf&lt;br /&gt;
* Similarity application: Information integration on the Web: &lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf&lt;br /&gt;
* Homework presentation and demo&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due November 17th'''&lt;br /&gt;
Your assignment is available at http://www.newgradiance.com/services. See http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.&lt;br /&gt;
&lt;br /&gt;
== Week 11: Monday Nov. 18th - Frequent Itemsets ==&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf&lt;br /&gt;
&lt;br /&gt;
=== Required Reading ===&lt;br /&gt;
* Mining of Massive Datasets, Chapter 4&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due November 24th'''&lt;br /&gt;
&lt;br /&gt;
=== Additional Reading ===&lt;br /&gt;
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=198467341&amp;amp;CFTOKEN=23537886&amp;amp;__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb&lt;br /&gt;
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf&lt;br /&gt;
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 12: Monday Nov. 25th - Clustering ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: &lt;br /&gt;
** Graph algorithms: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-graph.pdf&lt;br /&gt;
** Clustering: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf&lt;br /&gt;
&lt;br /&gt;
=== Homework Assignment ===&lt;br /&gt;
'''Due Dec 1st'''&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* Mining of Massive Datasets, Chapter 7&lt;br /&gt;
* See readings for previous class&lt;br /&gt;
* Web Mining, by Bing Liu. http://www.cs.uic.edu/~liub/WebMiningBook.html&lt;br /&gt;
* Information Retrieval. http://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf&lt;br /&gt;
&lt;br /&gt;
=== Further Readings ===&lt;br /&gt;
* [http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php Anomaly Detection: A Survey]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13: Monday Dec. 2nd - Invited lecture by Enrico Bertini ==&lt;br /&gt;
* Introduction to Visual Analytics&lt;br /&gt;
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro-to-visualization.pdf&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* The Value of Visualization. Jarke J. van Wijk. IEEE Visualization 2005. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.1138&lt;br /&gt;
* Visualization Analysis and Design: Principles, Methods, and Practice. Tamara Munzner (Book Draft 2, Sep. 2012). http://www.cs.ubc.ca/~tmm/courses/533-11/book/vispmp-draft.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14: Monday Dec. 9th - Graph Algorithms ==&lt;br /&gt;
&lt;br /&gt;
TODO&lt;br /&gt;
&lt;br /&gt;
=== Readings ===&lt;br /&gt;
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]&lt;br /&gt;
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapters 4 (Inverted Indexing for Text Retrieval) and 5 (Graph Algorithms)]&lt;br /&gt;
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]&lt;br /&gt;
* [http://kowshik.github.com/JPregel/pregel_paper.pdf Pregel: A System for Large-Scale Graph Processing. Google.]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 15: Monday Dec. 16th - Final Exam ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Other topics ==&lt;br /&gt;
=== Provenance ===&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/vistrails-reproducibility2012.pdf Making Computations and Publications Reproducible with VisTrails]&lt;br /&gt;
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.&lt;br /&gt;
&lt;br /&gt;
* [http://vgc.poly.edu/~juliana/pub/freire-cise2008.pdf Provenance for Computational Tasks: A Survey]&lt;br /&gt;
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science &amp;amp; Engineering, 2008.&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=File:2-PigOnMapReduce.pdf&amp;diff=6186</id>
		<title>File:2-PigOnMapReduce.pdf</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=File:2-PigOnMapReduce.pdf&amp;diff=6186"/>
		<updated>2013-10-02T19:45:55Z</updated>

		<summary type="html">&lt;p&gt;Jsimeon: Pig/lating queries over MapReduce&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Pig/lating queries over MapReduce&lt;/div&gt;</summary>
		<author><name>Jsimeon</name></author>
	</entry>
</feed>