<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://www.vistrails.org//api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Juliana</id>
	<title>VistrailsWiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://www.vistrails.org//api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Juliana"/>
	<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php/Special:Contributions/Juliana"/>
	<updated>2026-06-09T12:48:26Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.36.2</generator>
	<entry>
		<id>https://www.vistrails.org//index.php?title=TestNewPage&amp;diff=22655</id>
		<title>TestNewPage</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=TestNewPage&amp;diff=22655"/>
		<updated>2026-04-23T22:16:51Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= VisTrails =&lt;br /&gt;
&lt;br /&gt;
'''VisTrails''' is an open-source scientific workflow and provenance management system developed at the [https://vida.engineering.nyu.edu/ VIDA Center] at New York University. It supports computational science by capturing and managing the complete history of the exploratory process: the workflows, their executions, and the results they produce.&lt;br /&gt;
&lt;br /&gt;
VisTrails is actively developed again. The new version, '''[[VisTrailsJL]]''', is a complete reimplementation in [https://julialang.org/ Julia] that brings modern performance, notebook-based workflow authoring, and native compatibility with existing &amp;lt;code&amp;gt;.vt&amp;lt;/code&amp;gt; files. See the [https://github.com/VIDA-NYU/VisTrailsJL GitHub repository] to get started.&lt;br /&gt;
&lt;br /&gt;
== What's New ==&lt;br /&gt;
&lt;br /&gt;
After a hiatus that began in 2018, VisTrails is back. '''VisTrailsJL''' (v2.2) is a ground-up reimplementation in Julia that preserves what made the original system valuable (comprehensive provenance, visual workflow management, and support for real scientific use cases) while modernizing the foundation:&lt;br /&gt;
&lt;br /&gt;
* '''Julia reimplementation''' — Julia's JIT compilation brings performance suitable for demanding scientific workflows, and its rich ecosystem (DataFrames.jl, DifferentialEquations.jl, Plots.jl) is a natural fit.&lt;br /&gt;
* '''Notebook-based workflow authoring''' — Workflows can now be defined directly in Jupyter notebooks using simple &amp;lt;code&amp;gt;#|&amp;lt;/code&amp;gt; directives, with no GUI required.&lt;br /&gt;
* '''Full &amp;lt;code&amp;gt;.vt&amp;lt;/code&amp;gt; compatibility''' — Existing workflows created with the Python version can be loaded, replayed, and visualized without modification.&lt;br /&gt;
* '''Git-native version control''' — Standard git replaces the custom versioning infrastructure for workflow history.&lt;br /&gt;
* '''Python interoperability''' — Existing Python modules and libraries remain accessible via PyCall.jl.&lt;br /&gt;
&lt;br /&gt;
The original Python codebase (v2.2) is preserved in the repository for reference and compatibility testing.&lt;br /&gt;
&lt;br /&gt;
; Quick links&lt;br /&gt;
: [https://github.com/VIDA-NYU/VisTrailsJL GitHub (VisTrailsJL)] &amp;amp;nbsp;|&amp;amp;nbsp; [[Documentation]] &amp;amp;nbsp;|&amp;amp;nbsp; [[Publications, Tutorials and Presentations]] &amp;amp;nbsp;|&amp;amp;nbsp; [[MailingLists|Mailing Lists]]&lt;br /&gt;
&lt;br /&gt;
== Core Features ==&lt;br /&gt;
&lt;br /&gt;
=== Provenance and Workflow History ===&lt;br /&gt;
&lt;br /&gt;
A defining feature of VisTrails is its '''comprehensive provenance infrastructure'''. Unlike systems that track only the current state of a workflow, VisTrails maintains the full history of every step taken during an exploratory analysis — what was tried, what was changed, and what results each version produced. This enables users to:&lt;br /&gt;
&lt;br /&gt;
* Navigate and compare workflow versions in an intuitive tree interface&lt;br /&gt;
* Undo changes without losing intermediate results&lt;br /&gt;
* Visually diff two workflows and their outputs side by side&lt;br /&gt;
* Reproduce any prior result exactly, long after it was first computed&lt;br /&gt;
&lt;br /&gt;
The Python version stores provenance information as XML or in a relational database; the Julia version manages it with standard git.&lt;br /&gt;
&lt;br /&gt;
=== Building and Running Workflows ===&lt;br /&gt;
&lt;br /&gt;
VisTrails expresses workflows as '''dataflows''' and supports functional loops and conditional branching. Workflows can be run interactively through the GUI or in batch mode via a server. The system is designed to connect loosely coupled resources such as specialized libraries, web services, and grid computing infrastructure.&lt;br /&gt;
&lt;br /&gt;
In VisTrailsJL, workflows can also be defined declaratively in Jupyter notebooks:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#| workflow: my_analysis&lt;br /&gt;
&lt;br /&gt;
#| module-id: input&lt;br /&gt;
#| module-type: basic:Integer&lt;br /&gt;
#| params:&lt;br /&gt;
#|   - value: 42&lt;br /&gt;
&lt;br /&gt;
#| module-id: process&lt;br /&gt;
#| module-type: mypackage:Transform&lt;br /&gt;
#| inputs:&lt;br /&gt;
#|   - value: input.value&lt;br /&gt;
&lt;br /&gt;
#| execute&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
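As an illustration only, the directive layout shown above can be read with a few lines of Python. This is a hypothetical sketch that assumes each directive is a &amp;lt;code&amp;gt;key: value&amp;lt;/code&amp;gt; line prefixed with &amp;lt;code&amp;gt;#|&amp;lt;/code&amp;gt; and that list items start with a dash; &amp;lt;code&amp;gt;parse_directives&amp;lt;/code&amp;gt; is not part of VisTrailsJL, whose actual parser may differ.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch: NOT the VisTrailsJL parser. It only groups "#|" lines
# into one dict per module, following the layout of the example above.

def parse_directives(text):
    """Return (workflow_name, modules, execute_requested)."""
    workflow, modules, execute = None, [], False
    section = None  # name of the list being filled, e.g. "params" or "inputs"
    for raw in text.splitlines():
        if not raw.startswith("#|"):
            continue  # skip blank lines and ordinary notebook code
        body = raw[2:].strip()
        if body == "execute":
            execute = True
        elif body.startswith("- "):
            # List item under the most recent "params:" or "inputs:" line
            key, _, value = body[2:].partition(":")
            modules[-1][section][key.strip()] = value.strip()
        else:
            # Split on the first colon only, so "basic:Integer" stays intact
            key, _, value = body.partition(":")
            key, value = key.strip(), value.strip()
            if key == "workflow":
                workflow = value
            elif key == "module-id":
                modules.append({"id": value})
            elif value:
                modules[-1][key] = value  # e.g. module-type
            else:
                section = key  # a bare "params:" opens a new list
                modules[-1][section] = {}
    return workflow, modules, execute


example = """\
#| workflow: my_analysis

#| module-id: input
#| module-type: basic:Integer
#| params:
#|   - value: 42

#| module-id: process
#| module-type: mypackage:Transform
#| inputs:
#|   - value: input.value

#| execute
"""

workflow, modules, execute = parse_directives(example)
```

In this sketch every value is kept as a string, and a connection such as &amp;lt;code&amp;gt;input.value&amp;lt;/code&amp;gt; is left unresolved; turning it into a link between modules is outside its scope.&lt;br /&gt;
&lt;br /&gt;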
Packages and modules are easy to add. The &amp;lt;code&amp;gt;JuliaSource&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;PythonSource&amp;lt;/code&amp;gt; module types allow custom code to be embedded directly in a workflow without creating a full package.&lt;br /&gt;
&lt;br /&gt;
=== Publishing Reproducible Results ===&lt;br /&gt;
&lt;br /&gt;
VisTrails 2.0 introduced support for embedding reproducible results directly in LaTeX/PDF documents via a companion LaTeX package. A figure in a compiled PDF becomes active: clicking it invokes VisTrails and re-executes the workflow that produced it on any machine with the software installed.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
\usepackage{vistrails}&lt;br /&gt;
&lt;br /&gt;
\begin{figure}&lt;br /&gt;
\begin{center}&lt;br /&gt;
\subfigure[a=0.9]{\vistrail[filename=alps.vt, version=2, pdf]{width=8cm}}&lt;br /&gt;
\caption{Clicking this figure retrieves and re-runs the workflow that produced it.}&lt;br /&gt;
\end{center}&lt;br /&gt;
\end{figure}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Querying and Refining Workflows ===&lt;br /&gt;
&lt;br /&gt;
Users can construct expressive queries over a collection of workflows using the same interface used to build them. An '''analogy mechanism''' allows complex modifications to be applied to one workflow by example from another, without manually editing workflow specifications — useful when a family of related analyses needs to evolve together.&lt;br /&gt;
&lt;br /&gt;
=== Visualizing and Comparing Results ===&lt;br /&gt;
&lt;br /&gt;
VisTrails provides a '''spreadsheet view''' for comparing the results of multiple workflows or multiple parameterizations of the same workflow side by side. The visual diff interface highlights structural differences between two workflow versions. Workflows and their version trees can be rendered as SVG (VisTrailsJL) or displayed on large-format display walls.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== VisTrailsJL (Julia — current) ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Clone the repository&lt;br /&gt;
git clone https://github.com/VIDA-NYU/VisTrailsJL.git&lt;br /&gt;
cd VisTrailsJL/julia&lt;br /&gt;
&lt;br /&gt;
# Install dependencies&lt;br /&gt;
julia --project=. -e 'using Pkg; Pkg.instantiate()'&lt;br /&gt;
&lt;br /&gt;
# Load and render an existing workflow&lt;br /&gt;
julia --project=. -e '&lt;br /&gt;
using VisTrailsJL&lt;br /&gt;
vt = load_vistrail(&amp;quot;../examples/gcd.vt&amp;quot;)&lt;br /&gt;
workflow = get_pipeline(vt)&lt;br /&gt;
render_pipeline_svg(workflow, &amp;quot;workflow.svg&amp;quot;)&lt;br /&gt;
'&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
See the [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/QUICKSTART.md Quickstart Guide] for a full walkthrough.&lt;br /&gt;
&lt;br /&gt;
=== Python VisTrails (legacy reference) ===&lt;br /&gt;
&lt;br /&gt;
The original Python version (v2.2, requires Python 2 / PyQt4) is preserved in the repository for reference and for loading existing &amp;lt;code&amp;gt;.vt&amp;lt;/code&amp;gt; files in legacy environments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# GUI mode&lt;br /&gt;
python vistrails/run.py&lt;br /&gt;
&lt;br /&gt;
# Batch mode&lt;br /&gt;
python vistrails/run.py --batch [options]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Projects Using VisTrails ==&lt;br /&gt;
&lt;br /&gt;
VisTrails has supported real scientific workflows across a wide range of domains. The following projects reflect the breadth of communities that have relied on the system.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! USGS Habitat Modeling&lt;br /&gt;
! NASA Climate Data Analysis&lt;br /&gt;
! DOE CDAT&lt;br /&gt;
|-&lt;br /&gt;
| [[Image:usgs.png|200px|left]]&lt;br /&gt;
| [[Image:nasa.png|200px|left]]&lt;br /&gt;
| [[Image:cdat.png|200px|left]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! ALPS Simulations&lt;br /&gt;
! NSF STC CMOP&lt;br /&gt;
! NSF CDI Wildfire&lt;br /&gt;
|-&lt;br /&gt;
| [[Image:alps-shot.png|200px|left]]&lt;br /&gt;
| [[Image:cmop-ss.png|200px|left]]&lt;br /&gt;
| [[Image:wildfire.png|200px|center]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! NSF DataONE-EVA&lt;br /&gt;
|-&lt;br /&gt;
| [[Image:eva.png|200px|left]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[https://vistrails.org/index.php/Projects_using_VisTrails See all projects using VisTrails]&lt;br /&gt;
&lt;br /&gt;
== VisTrails in Teaching ==&lt;br /&gt;
&lt;br /&gt;
VisTrails has been used as a teaching tool in courses on Scientific Visualization and Digital Media. Its provenance infrastructure makes it particularly effective in educational settings, where capturing and comparing student workflows provides rich feedback for instructors and learners alike.&lt;br /&gt;
&lt;br /&gt;
Our [http://www.cs.utah.edu/~juliana/pub/vistrails-teaching-eurographics2010.pdf paper] describing a provenance-rich teaching methodology received the '''Best Paper Award''' at Eurographics 2010 Education.&lt;br /&gt;
&lt;br /&gt;
[[Vistrails and Teaching|More on VisTrails and Teaching]]&lt;br /&gt;
&lt;br /&gt;
== System Documentation ==&lt;br /&gt;
&lt;br /&gt;
* [[Documentation|Documentation overview]]&lt;br /&gt;
* [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/README.md VisTrailsJL README]&lt;br /&gt;
* [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/QUICKSTART.md Quickstart Guide]&lt;br /&gt;
* [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/docs/IMPLEMENTATION_STATUS.md Implementation Status]&lt;br /&gt;
* [[FAQ]]&lt;br /&gt;
* [[Users_Guide|Python User's Guide (legacy)]]&lt;br /&gt;
&lt;br /&gt;
To report bugs or request features, please use the [https://github.com/VIDA-NYU/VisTrailsJL/issues issue tracker].&lt;br /&gt;
&lt;br /&gt;
For questions not covered by the documentation, post to the [https://vistrails.org/index.php/MailingLists mailing list].&lt;br /&gt;
&lt;br /&gt;
== Citing VisTrails ==&lt;br /&gt;
&lt;br /&gt;
If you use VisTrails or VisTrailsJL in your research, please cite the relevant work:&lt;br /&gt;
&lt;br /&gt;
'''Original VisTrails system:'''&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
@inproceedings{vistrails2006,&lt;br /&gt;
  title     = {VisTrails: visualization meets data management},&lt;br /&gt;
  author    = {Callahan, Steven P and Freire, Juliana and Scheidegger,&lt;br /&gt;
               Carlos E and Silva, Cl{\'a}udio T and Vo, Huy T},&lt;br /&gt;
  booktitle = {Proceedings of the 2006 ACM SIGMOD International Conference&lt;br /&gt;
               on Management of Data},&lt;br /&gt;
  pages     = {745--747},&lt;br /&gt;
  year      = {2006},&lt;br /&gt;
  doi       = {10.1145/1142473.1142574}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''VisTrailsJL (Julia reimplementation):'''&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
@software{vistrailsjl2025,&lt;br /&gt;
  title  = {VisTrailsJL: A Julia Implementation of VisTrails},&lt;br /&gt;
  author = {Silva, Claudio T},&lt;br /&gt;
  year   = {2025},&lt;br /&gt;
  url    = {https://github.com/VIDA-NYU/VisTrailsJL}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Publications, Tutorials and Presentations|Full publication list]]&lt;br /&gt;
&lt;br /&gt;
== People ==&lt;br /&gt;
[[People]]&lt;br /&gt;
&lt;br /&gt;
== Sponsors ==&lt;br /&gt;
&lt;br /&gt;
This work has been supported in part by the National Science Foundation under grants&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0905385 IIS-0905385],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0844572 IIS-0844572],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0746500 IIS CAREER-0746500],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0751152 CNS-0751152],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0513692 IIS-0513692],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0401498 CCF-0401498],&lt;br /&gt;
and others; by the Department of Energy under the SciDAC program (SDM, VACET, and UV-CDAT);&lt;br /&gt;
and by IBM Faculty Awards (2005–2008) and a University of Utah Seed Grant.&lt;br /&gt;
&lt;br /&gt;
== Related ==&lt;br /&gt;
&lt;br /&gt;
[[BirdVis]] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[http://www.crowdlabs.org CrowdLabs] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[[RepeatabilityCentral]] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[[ProvenanceAnalytics]] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[[Provenance: potpourri]]&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Main_Page&amp;diff=22656</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Main_Page&amp;diff=22656"/>
		<updated>2026-04-23T22:15:52Z</updated>

		<summary type="html">&lt;p&gt;Juliana: Update to reflect the new VisTrailsJL&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= VisTrails =&lt;br /&gt;
&lt;br /&gt;
'''VisTrails''' is an open-source scientific workflow and provenance management system developed at the [https://vida.engineering.nyu.edu/ VIDA Center] at New York University. It supports computational science by capturing and managing the complete history of the exploratory process: the workflows, their executions, and the results they produce.&lt;br /&gt;
&lt;br /&gt;
VisTrails is actively developed again. The new version, '''[[VisTrailsJL]]''', is a complete reimplementation in [https://julialang.org/ Julia] that brings modern performance, notebook-based workflow authoring, and native compatibility with existing &amp;lt;code&amp;gt;.vt&amp;lt;/code&amp;gt; files. See the [https://github.com/VIDA-NYU/VisTrailsJL GitHub repository] to get started.&lt;br /&gt;
&lt;br /&gt;
== What's New ==&lt;br /&gt;
&lt;br /&gt;
After a hiatus that began in 2018, VisTrails is back. '''VisTrailsJL''' (v2.2) is a ground-up reimplementation in Julia that preserves what made the original system valuable (comprehensive provenance, visual workflow management, and support for real scientific use cases) while modernizing the foundation:&lt;br /&gt;
&lt;br /&gt;
* '''Julia reimplementation''' — Julia's JIT compilation brings performance suitable for demanding scientific workflows, and its rich ecosystem (DataFrames.jl, DifferentialEquations.jl, Plots.jl) is a natural fit.&lt;br /&gt;
* '''Notebook-based workflow authoring''' — Workflows can now be defined directly in Jupyter notebooks using simple &amp;lt;code&amp;gt;#|&amp;lt;/code&amp;gt; directives, with no GUI required.&lt;br /&gt;
* '''Full &amp;lt;code&amp;gt;.vt&amp;lt;/code&amp;gt; compatibility''' — Existing workflows created with the Python version can be loaded, replayed, and visualized without modification.&lt;br /&gt;
* '''Git-native version control''' — Standard git replaces the custom versioning infrastructure for workflow history.&lt;br /&gt;
* '''Python interoperability''' — Existing Python modules and libraries remain accessible via PyCall.jl.&lt;br /&gt;
&lt;br /&gt;
The original Python codebase (v2.2) is preserved in the repository for reference and compatibility testing.&lt;br /&gt;
&lt;br /&gt;
; Quick links&lt;br /&gt;
: [https://github.com/VIDA-NYU/VisTrailsJL GitHub (VisTrailsJL)] &amp;amp;nbsp;|&amp;amp;nbsp; [[Documentation]] &amp;amp;nbsp;|&amp;amp;nbsp; [[Publications, Tutorials and Presentations]] &amp;amp;nbsp;|&amp;amp;nbsp; [[MailingLists|Mailing Lists]]&lt;br /&gt;
&lt;br /&gt;
== Core Features ==&lt;br /&gt;
&lt;br /&gt;
=== Provenance and Workflow History ===&lt;br /&gt;
&lt;br /&gt;
A defining feature of VisTrails is its '''comprehensive provenance infrastructure'''. Unlike systems that track only the current state of a workflow, VisTrails maintains the full history of every step taken during an exploratory analysis — what was tried, what was changed, and what results each version produced. This enables users to:&lt;br /&gt;
&lt;br /&gt;
* Navigate and compare workflow versions in an intuitive tree interface&lt;br /&gt;
* Undo changes without losing intermediate results&lt;br /&gt;
* Visually diff two workflows and their outputs side by side&lt;br /&gt;
* Reproduce any prior result exactly, long after it was first computed&lt;br /&gt;
&lt;br /&gt;
The Python version stores provenance information as XML or in a relational database; the Julia version manages it with standard git.&lt;br /&gt;
&lt;br /&gt;
=== Building and Running Workflows ===&lt;br /&gt;
&lt;br /&gt;
VisTrails expresses workflows as '''dataflows''' and supports functional loops and conditional branching. Workflows can be run interactively through the GUI or in batch mode via a server. The system is designed to connect loosely coupled resources such as specialized libraries, web services, and grid computing infrastructure.&lt;br /&gt;
&lt;br /&gt;
In VisTrailsJL, workflows can also be defined declaratively in Jupyter notebooks:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#| workflow: my_analysis&lt;br /&gt;
&lt;br /&gt;
#| module-id: input&lt;br /&gt;
#| module-type: basic:Integer&lt;br /&gt;
#| params:&lt;br /&gt;
#|   - value: 42&lt;br /&gt;
&lt;br /&gt;
#| module-id: process&lt;br /&gt;
#| module-type: mypackage:Transform&lt;br /&gt;
#| inputs:&lt;br /&gt;
#|   - value: input.value&lt;br /&gt;
&lt;br /&gt;
#| execute&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Packages and modules are easy to add. The &amp;lt;code&amp;gt;JuliaSource&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;PythonSource&amp;lt;/code&amp;gt; module types allow custom code to be embedded directly in a workflow without creating a full package.&lt;br /&gt;
&lt;br /&gt;
=== Publishing Reproducible Results ===&lt;br /&gt;
&lt;br /&gt;
VisTrails 2.0 introduced support for embedding reproducible results directly in LaTeX/PDF documents via a companion LaTeX package. A figure in a compiled PDF becomes active: clicking it invokes VisTrails and re-executes the workflow that produced it on any machine with the software installed.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
\usepackage{vistrails}&lt;br /&gt;
&lt;br /&gt;
\begin{figure}&lt;br /&gt;
\begin{center}&lt;br /&gt;
\subfigure[a=0.9]{\vistrail[filename=alps.vt, version=2, pdf]{width=8cm}}&lt;br /&gt;
\caption{Clicking this figure retrieves and re-runs the workflow that produced it.}&lt;br /&gt;
\end{center}&lt;br /&gt;
\end{figure}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Querying and Refining Workflows ===&lt;br /&gt;
&lt;br /&gt;
Users can construct expressive queries over a collection of workflows using the same interface used to build them. An '''analogy mechanism''' allows complex modifications to be applied to one workflow by example from another, without manually editing workflow specifications — useful when a family of related analyses needs to evolve together.&lt;br /&gt;
&lt;br /&gt;
=== Visualizing and Comparing Results ===&lt;br /&gt;
&lt;br /&gt;
VisTrails provides a '''spreadsheet view''' for comparing the results of multiple workflows or multiple parameterizations of the same workflow side by side. The visual diff interface highlights structural differences between two workflow versions. Workflows and their version trees can be rendered as SVG (VisTrailsJL) or displayed on large-format display walls.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== VisTrailsJL (Julia — current) ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Clone the repository&lt;br /&gt;
git clone https://github.com/VIDA-NYU/VisTrailsJL.git&lt;br /&gt;
cd VisTrailsJL/julia&lt;br /&gt;
&lt;br /&gt;
# Install dependencies&lt;br /&gt;
julia --project=. -e 'using Pkg; Pkg.instantiate()'&lt;br /&gt;
&lt;br /&gt;
# Load and render an existing workflow&lt;br /&gt;
julia --project=. -e '&lt;br /&gt;
using VisTrailsJL&lt;br /&gt;
vt = load_vistrail(&amp;quot;../examples/gcd.vt&amp;quot;)&lt;br /&gt;
workflow = get_pipeline(vt)&lt;br /&gt;
render_pipeline_svg(workflow, &amp;quot;workflow.svg&amp;quot;)&lt;br /&gt;
'&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
See the [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/QUICKSTART.md Quickstart Guide] for a full walkthrough.&lt;br /&gt;
&lt;br /&gt;
=== Python VisTrails (legacy reference) ===&lt;br /&gt;
&lt;br /&gt;
The original Python version (v2.2, requires Python 2 / PyQt4) is preserved in the repository for reference and for loading existing &amp;lt;code&amp;gt;.vt&amp;lt;/code&amp;gt; files in legacy environments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# GUI mode&lt;br /&gt;
python vistrails/run.py&lt;br /&gt;
&lt;br /&gt;
# Batch mode&lt;br /&gt;
python vistrails/run.py --batch [options]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Projects Using VisTrails ==&lt;br /&gt;
&lt;br /&gt;
VisTrails has supported real scientific workflows across a wide range of domains. The following projects reflect the breadth of communities that have relied on the system.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! USGS Habitat Modeling&lt;br /&gt;
! NASA Climate Data Analysis&lt;br /&gt;
! DOE CDAT&lt;br /&gt;
|-&lt;br /&gt;
| [[Image:usgs.png|200px|left]]&lt;br /&gt;
| [[Image:nasa.png|200px|left]]&lt;br /&gt;
| [[Image:cdat.png|200px|left]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! ALPS Simulations&lt;br /&gt;
! NSF STC CMOP&lt;br /&gt;
! NSF CDI Wildfire&lt;br /&gt;
|-&lt;br /&gt;
| [[Image:alps-shot.png|200px|left]]&lt;br /&gt;
| [[Image:cmop-ss.png|200px|left]]&lt;br /&gt;
| [[Image:wildfire.png|200px|center]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! NSF DataONE-EVA&lt;br /&gt;
|-&lt;br /&gt;
| [[Image:eva.png|200px|left]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[https://vistrails.org/index.php/Projects_using_VisTrails See all projects using VisTrails]&lt;br /&gt;
&lt;br /&gt;
== VisTrails in Teaching ==&lt;br /&gt;
&lt;br /&gt;
VisTrails has been used as a teaching tool in courses on Scientific Visualization and Digital Media. Its provenance infrastructure makes it particularly effective in educational settings, where capturing and comparing student workflows provides rich feedback for instructors and learners alike.&lt;br /&gt;
&lt;br /&gt;
Our [http://www.cs.utah.edu/~juliana/pub/vistrails-teaching-eurographics2010.pdf paper] describing a provenance-rich teaching methodology received the '''Best Paper Award''' at Eurographics 2010 Education.&lt;br /&gt;
&lt;br /&gt;
[[Vistrails and Teaching|More on VisTrails and Teaching]]&lt;br /&gt;
&lt;br /&gt;
== System Documentation ==&lt;br /&gt;
&lt;br /&gt;
* [[Documentation|Documentation overview]]&lt;br /&gt;
* [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/README.md VisTrailsJL README]&lt;br /&gt;
* [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/QUICKSTART.md Quickstart Guide]&lt;br /&gt;
* [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/docs/IMPLEMENTATION_STATUS.md Implementation Status]&lt;br /&gt;
* [[FAQ]]&lt;br /&gt;
* [[Users_Guide|Python User's Guide (legacy)]]&lt;br /&gt;
&lt;br /&gt;
To report bugs or request features, please use the [https://github.com/VIDA-NYU/VisTrailsJL/issues issue tracker].&lt;br /&gt;
&lt;br /&gt;
For questions not covered by the documentation, post to the [https://vistrails.org/index.php/MailingLists mailing list].&lt;br /&gt;
&lt;br /&gt;
== Citing VisTrails ==&lt;br /&gt;
&lt;br /&gt;
If you use VisTrails or VisTrailsJL in your research, please cite the relevant work:&lt;br /&gt;
&lt;br /&gt;
'''Original VisTrails system:'''&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
@inproceedings{vistrails2006,&lt;br /&gt;
  title     = {VisTrails: visualization meets data management},&lt;br /&gt;
  author    = {Callahan, Steven P and Freire, Juliana and Scheidegger,&lt;br /&gt;
               Carlos E and Silva, Cl{\'a}udio T and Vo, Huy T},&lt;br /&gt;
  booktitle = {Proceedings of the 2006 ACM SIGMOD International Conference&lt;br /&gt;
               on Management of Data},&lt;br /&gt;
  pages     = {745--747},&lt;br /&gt;
  year      = {2006},&lt;br /&gt;
  doi       = {10.1145/1142473.1142574}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''VisTrailsJL (Julia reimplementation):'''&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
@software{vistrailsjl2025,&lt;br /&gt;
  title  = {VisTrailsJL: A Julia Implementation of VisTrails},&lt;br /&gt;
  author = {Silva, Claudio T},&lt;br /&gt;
  year   = {2025},&lt;br /&gt;
  url    = {https://github.com/VIDA-NYU/VisTrailsJL}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Publications, Tutorials and Presentations|Full publication list]]&lt;br /&gt;
&lt;br /&gt;
== People ==&lt;br /&gt;
[[People]]&lt;br /&gt;
&lt;br /&gt;
== Sponsors ==&lt;br /&gt;
&lt;br /&gt;
This work has been supported in part by the National Science Foundation under grants&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0905385 IIS-0905385],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0844572 IIS-0844572],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0746500 IIS CAREER-0746500],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0751152 CNS-0751152],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0513692 IIS-0513692],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0401498 CCF-0401498],&lt;br /&gt;
and others; by the Department of Energy under the SciDAC program (SDM, VACET, and UV-CDAT);&lt;br /&gt;
and by IBM Faculty Awards (2005–2008) and a University of Utah Seed Grant.&lt;br /&gt;
&lt;br /&gt;
== Related ==&lt;br /&gt;
&lt;br /&gt;
[[BirdVis]] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[http://www.crowdlabs.org CrowdLabs] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[[RepeatabilityCentral]] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[[ProvenanceAnalytics]] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[[Provenance: potpourri]]&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=TestNewPage&amp;diff=22654</id>
		<title>TestNewPage</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=TestNewPage&amp;diff=22654"/>
		<updated>2026-04-23T22:10:58Z</updated>

		<summary type="html">&lt;p&gt;Juliana: Created page with &amp;quot;= VisTrails =  '''VisTrails''' is an open-source scientific workflow and provenance management system developed at the [https://vida.engineering.nyu.edu/ VIDA lab] at New York...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= VisTrails =&lt;br /&gt;
&lt;br /&gt;
'''VisTrails''' is an open-source scientific workflow and provenance management system developed at the [https://vida.engineering.nyu.edu/ VIDA lab] at New York University (formerly at the University of Utah). It supports exploratory computational science by capturing and managing the complete history of data analysis: the workflows, their executions, and the results they produce.&lt;br /&gt;
&lt;br /&gt;
VisTrails is actively developed again. The new version, '''[[VisTrailsJL]]''', is a complete reimplementation in [https://julialang.org/ Julia] that brings modern performance, notebook-based workflow authoring, and native compatibility with existing &amp;lt;code&amp;gt;.vt&amp;lt;/code&amp;gt; files. See the [https://github.com/VIDA-NYU/VisTrailsJL GitHub repository] to get started.&lt;br /&gt;
&lt;br /&gt;
== What's New ==&lt;br /&gt;
&lt;br /&gt;
After a hiatus that began in 2018, VisTrails is back. '''VisTrailsJL''' (v2.2) is a ground-up reimplementation in Julia that preserves what made the original system valuable (comprehensive provenance, visual workflow management, and support for real scientific use cases) while modernizing the foundation:&lt;br /&gt;
&lt;br /&gt;
* '''Julia reimplementation''' — Julia's JIT compilation brings performance suitable for demanding scientific workflows, and its rich ecosystem (DataFrames.jl, DifferentialEquations.jl, Plots.jl) is a natural fit.&lt;br /&gt;
* '''Notebook-based workflow authoring''' — Workflows can now be defined directly in Jupyter notebooks using simple &amp;lt;code&amp;gt;#|&amp;lt;/code&amp;gt; directives, with no GUI required.&lt;br /&gt;
* '''Full &amp;lt;code&amp;gt;.vt&amp;lt;/code&amp;gt; compatibility''' — Existing workflows created with the Python version can be loaded, replayed, and visualized without modification.&lt;br /&gt;
* '''Git-native version control''' — Standard git replaces the custom versioning infrastructure for workflow history.&lt;br /&gt;
* '''Python interoperability''' — Existing Python modules and libraries remain accessible via PyCall.jl.&lt;br /&gt;
&lt;br /&gt;
The original Python codebase (v2.2) is preserved in the repository for reference and compatibility testing.&lt;br /&gt;
&lt;br /&gt;
; Quick links&lt;br /&gt;
: [https://github.com/VIDA-NYU/VisTrailsJL GitHub (VisTrailsJL)] &amp;amp;nbsp;|&amp;amp;nbsp; [[Documentation]] &amp;amp;nbsp;|&amp;amp;nbsp; [[Publications, Tutorials and Presentations]] &amp;amp;nbsp;|&amp;amp;nbsp; [[MailingLists|Mailing Lists]]&lt;br /&gt;
&lt;br /&gt;
== Core Features ==&lt;br /&gt;
&lt;br /&gt;
=== Provenance and Workflow History ===&lt;br /&gt;
&lt;br /&gt;
A defining feature of VisTrails is its '''comprehensive provenance infrastructure'''. Unlike systems that track only the current state of a workflow, VisTrails maintains the full history of every step taken during an exploratory analysis — what was tried, what was changed, and what results each version produced. This enables users to:&lt;br /&gt;
&lt;br /&gt;
* Navigate and compare workflow versions in an intuitive tree interface&lt;br /&gt;
* Undo changes without losing intermediate results&lt;br /&gt;
* Visually diff two workflows and their outputs side by side&lt;br /&gt;
* Reproduce any prior result exactly, long after it was first computed&lt;br /&gt;
&lt;br /&gt;
The Python version stores provenance information as XML or in a relational database; the Julia version manages it with standard git.&lt;br /&gt;
&lt;br /&gt;
=== Building and Running Workflows ===&lt;br /&gt;
&lt;br /&gt;
VisTrails supports workflows expressed as '''dataflows''', with support for functional loops and conditional branching. Workflows can be run interactively through the GUI or in batch mode via a server. The system is designed to connect loosely coupled resources — specialized libraries, web services, and grid computing infrastructure.&lt;br /&gt;
&lt;br /&gt;
In VisTrailsJL, workflows can also be defined declaratively in Jupyter notebooks:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#| workflow: my_analysis&lt;br /&gt;
&lt;br /&gt;
#| module-id: input&lt;br /&gt;
#| module-type: basic:Integer&lt;br /&gt;
#| params:&lt;br /&gt;
#|   - value: 42&lt;br /&gt;
&lt;br /&gt;
#| module-id: process&lt;br /&gt;
#| module-type: mypackage:Transform&lt;br /&gt;
#| inputs:&lt;br /&gt;
#|   - value: input.value&lt;br /&gt;
&lt;br /&gt;
#| execute&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Packages and modules are easy to add. The &amp;lt;code&amp;gt;JuliaSource&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;PythonSource&amp;lt;/code&amp;gt; module types allow custom code to be embedded directly in a workflow without creating a full package.&lt;br /&gt;
&lt;br /&gt;
=== Publishing Reproducible Results ===&lt;br /&gt;
&lt;br /&gt;
VisTrails 2.0 introduced support for embedding reproducible results directly in LaTeX/PDF documents via a companion LaTeX package. A figure in a compiled PDF becomes active: clicking it invokes VisTrails and re-executes the workflow that produced it on any machine with the software installed.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
\usepackage{subfigure} % provides \subfigure, used below&lt;br /&gt;
\usepackage{vistrails}&lt;br /&gt;
&lt;br /&gt;
\begin{figure}&lt;br /&gt;
\begin{center}&lt;br /&gt;
\subfigure[a=0.9]{\vistrail[filename=alps.vt, version=2, pdf]{width=8cm}}&lt;br /&gt;
\caption{Clicking this figure retrieves and re-runs the workflow that produced it.}&lt;br /&gt;
\end{center}&lt;br /&gt;
\end{figure}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Querying and Refining Workflows ===&lt;br /&gt;
&lt;br /&gt;
Users can construct expressive queries over a collection of workflows using the same interface used to build them. An '''analogy mechanism''' allows complex modifications to be applied to one workflow by example from another, without manually editing workflow specifications — useful when a family of related analyses needs to evolve together.&lt;br /&gt;
&lt;br /&gt;
=== Visualizing and Comparing Results ===&lt;br /&gt;
&lt;br /&gt;
VisTrails provides a '''spreadsheet view''' for comparing the results of multiple workflows or multiple parameterizations of the same workflow side by side. The visual diff interface highlights structural differences between two workflow versions. Workflows and their version trees can be rendered as SVG (VisTrailsJL) or displayed on large-format display walls.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== VisTrailsJL (Julia — current) ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Clone the repository&lt;br /&gt;
git clone https://github.com/VIDA-NYU/VisTrailsJL.git&lt;br /&gt;
cd VisTrailsJL/julia&lt;br /&gt;
&lt;br /&gt;
# Install dependencies&lt;br /&gt;
julia --project=. -e 'using Pkg; Pkg.instantiate()'&lt;br /&gt;
&lt;br /&gt;
# Load and render an existing workflow&lt;br /&gt;
julia --project=. -e '&lt;br /&gt;
using VisTrailsJL&lt;br /&gt;
vt = load_vistrail(&amp;quot;../examples/gcd.vt&amp;quot;)&lt;br /&gt;
workflow = get_pipeline(vt)&lt;br /&gt;
render_pipeline_svg(workflow, &amp;quot;workflow.svg&amp;quot;)&lt;br /&gt;
'&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
See the [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/QUICKSTART.md Quickstart Guide] for a full walkthrough.&lt;br /&gt;
&lt;br /&gt;
=== Python VisTrails (legacy reference) ===&lt;br /&gt;
&lt;br /&gt;
The original Python version (v2.2, requires Python 2 / PyQt4) is preserved in the repository for reference and for loading existing &amp;lt;code&amp;gt;.vt&amp;lt;/code&amp;gt; files in legacy environments.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# GUI mode&lt;br /&gt;
python vistrails/run.py&lt;br /&gt;
&lt;br /&gt;
# Batch mode&lt;br /&gt;
python vistrails/run.py --batch [options]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Projects Using VisTrails ==&lt;br /&gt;
&lt;br /&gt;
VisTrails has supported real scientific workflows across a wide range of domains. The following projects reflect the breadth of communities that have relied on the system.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! USGS Habitat Modeling&lt;br /&gt;
! NASA Climate Data Analysis&lt;br /&gt;
! DOE CDAT&lt;br /&gt;
|-&lt;br /&gt;
| [[Image:usgs.png|200px|left]]&lt;br /&gt;
| [[Image:nasa.png|200px|left]]&lt;br /&gt;
| [[Image:cdat.png|200px|left]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! ALPS Simulations&lt;br /&gt;
! NSF STC CMOP&lt;br /&gt;
! NSF CDI Wildfire&lt;br /&gt;
|-&lt;br /&gt;
| [[Image:alps-shot.png|200px|left]]&lt;br /&gt;
| [[Image:cmop-ss.png|200px|left]]&lt;br /&gt;
| [[Image:wildfire.png|200px|center]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! NSF DataONE-EVA&lt;br /&gt;
|-&lt;br /&gt;
| [[Image:eva.png|200px|left]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[https://vistrails.org/index.php/Projects_using_VisTrails See all projects using VisTrails]&lt;br /&gt;
&lt;br /&gt;
== VisTrails in Teaching ==&lt;br /&gt;
&lt;br /&gt;
VisTrails has been used as a teaching tool in courses on Scientific Visualization and Digital Media. Its provenance infrastructure makes it particularly effective in educational settings, where capturing and comparing student workflows provides rich feedback for instructors and learners alike.&lt;br /&gt;
&lt;br /&gt;
Our [http://www.cs.utah.edu/~juliana/pub/vistrails-teaching-eurographics2010.pdf paper] describing a provenance-rich teaching methodology received the '''Best Paper Award''' at Eurographics 2010 Education.&lt;br /&gt;
&lt;br /&gt;
[[Vistrails and Teaching|More on VisTrails and Teaching]]&lt;br /&gt;
&lt;br /&gt;
== System Documentation ==&lt;br /&gt;
&lt;br /&gt;
* [[Documentation|Documentation overview]]&lt;br /&gt;
* [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/README.md VisTrailsJL README]&lt;br /&gt;
* [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/QUICKSTART.md Quickstart Guide]&lt;br /&gt;
* [https://github.com/VIDA-NYU/VisTrailsJL/blob/v2.2/julia/docs/IMPLEMENTATION_STATUS.md Implementation Status]&lt;br /&gt;
* [[FAQ]]&lt;br /&gt;
* [[Users_Guide|Python User's Guide (legacy)]]&lt;br /&gt;
&lt;br /&gt;
To report bugs or request features, please use the [https://github.com/VIDA-NYU/VisTrailsJL/issues issue tracker].&lt;br /&gt;
&lt;br /&gt;
For questions not covered by the documentation, post to the [https://vistrails.org/index.php/MailingLists mailing list].&lt;br /&gt;
&lt;br /&gt;
== Citing VisTrails ==&lt;br /&gt;
&lt;br /&gt;
If you use VisTrails or VisTrailsJL in your research, please cite the relevant work:&lt;br /&gt;
&lt;br /&gt;
'''Original VisTrails system:'''&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
@inproceedings{vistrails2006,&lt;br /&gt;
  title     = {VisTrails: visualization meets data management},&lt;br /&gt;
  author    = {Callahan, Steven P and Freire, Juliana and Scheidegger,&lt;br /&gt;
               Carlos E and Silva, Cl{\'a}udio T and Vo, Huy T},&lt;br /&gt;
  booktitle = {Proceedings of the 2006 ACM SIGMOD International Conference&lt;br /&gt;
               on Management of Data},&lt;br /&gt;
  pages     = {745--747},&lt;br /&gt;
  year      = {2006},&lt;br /&gt;
  doi       = {10.1145/1142473.1142574}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''VisTrailsJL (Julia reimplementation):'''&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
@software{vistrailsjl2025,&lt;br /&gt;
  title  = {VisTrailsJL: A Julia Implementation of VisTrails},&lt;br /&gt;
  author = {Silva, Claudio T},&lt;br /&gt;
  year   = {2025},&lt;br /&gt;
  url    = {https://github.com/VIDA-NYU/VisTrailsJL}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Publications, Tutorials and Presentations|Full publication list]]&lt;br /&gt;
&lt;br /&gt;
== People ==&lt;br /&gt;
[[People]]&lt;br /&gt;
&lt;br /&gt;
== Sponsors ==&lt;br /&gt;
&lt;br /&gt;
This work has been supported in part by the National Science Foundation under grants&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0905385 IIS-0905385],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0844572 IIS-0844572],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0746500 IIS CAREER-0746500],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0751152 CNS-0751152],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0513692 IIS-0513692],&lt;br /&gt;
[http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0401498 CCF-0401498],&lt;br /&gt;
and others; by the Department of Energy under the SciDAC program (SDM, VACET, and UV-CDAT);&lt;br /&gt;
and by IBM Faculty Awards (2005–2008) and a University of Utah Seed Grant.&lt;br /&gt;
&lt;br /&gt;
== Related ==&lt;br /&gt;
&lt;br /&gt;
[[BirdVis]] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[http://www.crowdlabs.org CrowdLabs] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[[RepeatabilityCentral]] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[[ProvenanceAnalytics]] &amp;amp;nbsp;|&amp;amp;nbsp;&lt;br /&gt;
[[Provenance: potpourri]]&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=User:Juliana&amp;diff=22653</id>
		<title>User:Juliana</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=User:Juliana&amp;diff=22653"/>
		<updated>2026-04-23T22:10:49Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Web Page ==&lt;br /&gt;
&lt;br /&gt;
For more information, see my web page at http://vgc.poly.edu/~juliana/&lt;br /&gt;
&lt;br /&gt;
== [[CS6093 Advanced Databases]] ==&lt;br /&gt;
&lt;br /&gt;
== [[Course: Big Data Analysis]] ==&lt;br /&gt;
[[Course Project: Wikipedia Analysis]]&lt;br /&gt;
&lt;br /&gt;
== [[TestNewPage]] ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=13181</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=13181"/>
		<updated>2017-01-30T04:23:24Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004 - Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
[[Course: Big Data 2017]]&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
* '''Lab:''' Computing infrastructure for the course&lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Datasets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Git and GitHub (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
** Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
** Chapter 3 of [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Large-Scale Visualization  -- Invited lecture by Professor Claudio Silva ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/intro-to-visualization.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting1.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting2.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/PlottingNotes.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Tufte.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Videos:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/biopathways.mov&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/VisTrailsForParaView_Small.mov&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/defog-1150.mov&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/SevereTstorm.mov&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization: Using D3 --  Invited lecture by Bowen Yu ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes and lab: &lt;br /&gt;
** UPDATED: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/vis-d3_v2.1.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data quality: the other face of big data - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
* Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth “V” of big data, is increasingly being recognized. In this talk, we highlight the substantial challenges that the first three “V”s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the “data speak for itself” in order to discover the semantics of the data. This talk presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs. efficiency.&lt;br /&gt;
&lt;br /&gt;
* Bio: Divesh Srivastava is the head of Database Research at AT&amp;amp;T Labs-Research.  He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India.  His research interests and publications span a variety of topics in data management.&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/bdq.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP-2016.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: see NYU Classes&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast Algorithms for Mining Association Rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quiz on [http://www.newgradiance.com Gradiance] -- Association Rules.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11625</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11625"/>
		<updated>2016-04-25T19:41:53Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004 - Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
* '''Lab:''' Computing infrastructure for the course&lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Datasets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Git and GitHub (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
** Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
** Chapter 3 of [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Large-Scale Visualization  -- Invited lecture by Professor Claudio Silva ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/intro-to-visualization.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting1.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting2.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/PlottingNotes.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Tufte.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Videos:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/biopathways.mov&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/VisTrailsForParaView_Small.mov&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/defog-1150.mov&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/SevereTstorm.mov&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization: Using D3 --  Invited lecture by Bowen Yu ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes and lab: &lt;br /&gt;
** UPDATED: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/vis-d3_v2.1.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data quality: the other face of big data - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
* Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor-quality data is prevalent in large databases and on the Web. Since poor-quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth “V” of big data, is increasingly being recognized. In this talk, we highlight the substantial challenges that the first three “V”s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data “speak for itself” in order to discover the semantics of the data. This talk presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs. efficiency.&lt;br /&gt;
&lt;br /&gt;
* Bio: Divesh Srivastava is the head of Database Research at AT&amp;amp;T Labs-Research.  He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India.  His research interests and publications span a variety of topics in data management.&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/bdq.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP-2016.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: see NYU Classes&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quiz on [http://www.newgradiance.com Gradiance] -- Association Rules.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11624</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11624"/>
		<updated>2016-04-25T19:41:23Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Git and GitHub (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Large-Scale Visualization  -- Invited lecture by Professor Claudio Silva ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/intro-to-visualization.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting1.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting2.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/PlottingNotes.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Tufte.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Videos:&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/biopathways.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/VisTrailsForParaView_Small.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/defog-1150.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/SevereTstorm.mov&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization: Using D3 --  Invited lecture by Bowen Yu ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes and lab: &lt;br /&gt;
** UPDATED: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/vis-d3_v2.1.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data quality: the other face of big data - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
* Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor-quality data is prevalent in large databases and on the Web. Since poor-quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth “V” of big data, is increasingly being recognized. In this talk, we highlight the substantial challenges that the first three “V”s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data “speak for itself” in order to discover the semantics of the data. This talk presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs. efficiency.&lt;br /&gt;
&lt;br /&gt;
* Bio: Divesh Srivastava is the head of Database Research at AT&amp;amp;T Labs-Research.  He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India.  His research interests and publications span a variety of topics in data management.&lt;br /&gt;
&lt;br /&gt;
* Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/bdq.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP-2016.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quiz on [http://www.newgradiance.com Gradiance] -- Association Rules.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11623</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11623"/>
		<updated>2016-04-25T19:40:26Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 13 - April 18th: Data quality: the other face of big data - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Git and GitHub (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Large-Scale Visualization  -- Invited lecture by Professor Claudio Silva ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/intro-to-visualization.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting1.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting2.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/PlottingNotes.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Tufte.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Videos:&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/biopathways.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/VisTrailsForParaView_Small.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/defog-1150.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/SevereTstorm.mov&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization: Using D3 --  Invited lecture by Bowen Yu ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes and lab: &lt;br /&gt;
** UPDATED: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/vis-d3_v2.1.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data quality: the other face of big data - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
* Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor-quality data is prevalent in large databases and on the Web. Since poor-quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth “V” of big data, is increasingly being recognized. In this talk, we highlight the substantial challenges that the first three “V”s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data “speak for itself” in order to discover the semantics of the data. This talk presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs. efficiency.&lt;br /&gt;
&lt;br /&gt;
* Bio: Divesh Srivastava is the head of Database Research at AT&amp;amp;T Labs-Research.  He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India.  His research interests and publications span a variety of topics in data management.&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quiz on [http://www.newgradiance.com Gradiance] -- Association Rules.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11583</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11583"/>
		<updated>2016-04-18T13:28:48Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 12 - April 11th: Visualization: Using D3 --  Invited lecture by Bowen Yu */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Large-Scale Visualization  -- Invited lecture by Professor Claudio Silva ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/intro-to-visualization.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting1.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting2.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/PlottingNotes.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Tufte.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Videos:&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/biopathways.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/VisTrailsForParaView_Small.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/defog-1150.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/SevereTstorm.mov&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization: Using D3 --  Invited lecture by Bowen Yu ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes and lab: &lt;br /&gt;
** UPDATED: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/vis-d3_v2.1.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data quality: the other face of big data - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
* Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth “V” of big data, is increasingly being recognized. In this talk, we highlight the substantial challenges that the first three “V”s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the “data speak for itself” in order to discover the semantics of the data. This talk presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs efficiency.&lt;br /&gt;
&lt;br /&gt;
* Bio: Divesh Srivastava is the head of Database Research at AT&amp;amp;T Labs-Research.  He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India.  His research interests and publications span a variety of topics in data management.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quiz on [http://www.newgradiance.com Gradiance] -- Association Rules.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11582</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11582"/>
		<updated>2016-04-18T00:12:52Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004 - Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Large-Scale Visualization  -- Invited lecture by Professor Claudio Silva ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/intro-to-visualization.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting1.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting2.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/PlottingNotes.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Tufte.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Videos:&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/biopathways.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/VisTrailsForParaView_Small.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/defog-1150.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/SevereTstorm.mov&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization: Using D3 --  Invited lecture by Bowen Yu ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes and lab: &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/vis-d3.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data quality: the other face of big data - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
* Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth “V” of big data, is increasingly being recognized. In this talk, we highlight the substantial challenges that the first three “V”s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the “data speak for itself” in order to discover the semantics of the data. This talk presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs efficiency.&lt;br /&gt;
&lt;br /&gt;
* Bio: Divesh Srivastava is the head of Database Research at AT&amp;amp;T Labs-Research.  He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India.  His research interests and publications span a variety of topics in data management.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quiz on [http://www.newgradiance.com Gradiance] -- Association Rules.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11551</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11551"/>
		<updated>2016-04-11T20:36:51Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004 - Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Large-Scale Visualization  -- Invited lecture by Professor Claudio Silva ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/intro-to-visualization.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting1.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting2.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/PlottingNotes.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Tufte.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Videos:&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/biopathways.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/VisTrailsForParaView_Small.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/defog-1150.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/SevereTstorm.mov&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization: Using D3 --  Invited lecture by Bowen Yu ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes and lab: &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/vis-d3.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quiz on [http://www.newgradiance.com Gradiance] -- Association Rules.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11549</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11549"/>
		<updated>2016-04-11T20:32:08Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 11 - April 4th: Large-Scale Visualization -- -- Invited lecture by Professor Claudio Silva */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Git and GitHub (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Large-Scale Visualization -- -- Invited lecture by Professor Claudio Silva ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/intro-to-visualization.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting1.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Plotting2.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/PlottingNotes.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/Tufte.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Videos:&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/biopathways.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/VisTrailsForParaView_Small.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/defog-1150.mov&lt;br /&gt;
**http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/visualization/movies/SevereTstorm.mov&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quiz on [http://www.newgradiance.com Gradiance] -- Association Rules.&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11548</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11548"/>
		<updated>2016-04-11T20:27:27Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 11 - April 4th: Large-Scale Visualization -- -- Invited lecture by Professor Claudio Silva */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Git and GitHub (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Large-Scale Visualization -- -- Invited lecture by Professor Claudio Silva ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quiz on [http://www.newgradiance.com Gradiance] -- Association Rules.&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11508</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11508"/>
		<updated>2016-04-04T18:19:24Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 14 - April 25th: Association Rules */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Git and GitHub (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Large-Scale Visualization -- -- Invited lecture by Professor Claudio Silva ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quiz on [http://www.newgradiance.com Gradiance] -- Association Rules.&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11507</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11507"/>
		<updated>2016-04-04T18:19:02Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Git and GitHub (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Large-Scale Visualization -- Invited lecture by Professor Claudio Silva ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Exploring Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11452</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11452"/>
		<updated>2016-03-26T14:19:02Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 10 - March 28th:  Finding similar items &amp;amp; Spark */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Git and GitHub (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**Spark: Cluster Computing with Working Sets by Zaharia et al. https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11451</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11451"/>
		<updated>2016-03-26T13:57:24Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 10 - March 28th:  Finding similar items &amp;amp; Spark */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Git and GitHub (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
** On the resemblance and containment of documents by Andrei Broder. http://www.misserpirat.dk/main/docs/00000004.pdf&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11450</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11450"/>
		<updated>2016-03-26T13:53:48Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 10 - March 28th:  Finding similar items */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items &amp;amp; Spark ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: &lt;br /&gt;
**https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf&lt;br /&gt;
**Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11425</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11425"/>
		<updated>2016-03-21T14:03:33Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609; http://www.vldb.org/pvldb/2/vldb09-938.pdf&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726; http://infolab.stanford.edu/~olston/publications/sigmod08.pdf&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11424</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11424"/>
		<updated>2016-03-21T14:02:30Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 10 - March 28th:  Finding similar items */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11423</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11423"/>
		<updated>2016-03-21T14:02:14Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Pig&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11422</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11422"/>
		<updated>2016-03-21T14:01:49Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8-- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11419</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11419"/>
		<updated>2016-03-20T17:47:07Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8-- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* Assignment: Hands-on Map-Reduce (see NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11411</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11411"/>
		<updated>2016-03-19T19:27:25Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8-- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9- March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11365</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11365"/>
		<updated>2016-03-07T23:30:47Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 7 - March 7: MapReduce Algorithm Design Patterns */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-recap.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11320</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11320"/>
		<updated>2016-02-29T19:03:32Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 6 - Feb 29:  Introduction to Map Reduce */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)&lt;br /&gt;
** Quiz is due on 2016-03-14 12:00 PM EST&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11277</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11277"/>
		<updated>2016-02-22T18:42:14Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 5 - Feb 22: Data Exploration and Reproducibility */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''   http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11276</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11276"/>
		<updated>2016-02-22T18:41:56Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 5 - Feb 22: Data Exploration and Reproducibility */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  reproducibility-provenance.pdf&lt;br /&gt;
* '''Lab:''' Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11275</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11275"/>
		<updated>2016-02-22T18:40:02Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
* '''Lab:''' Hands-on reproducibility. &lt;br /&gt;
* '''Programming assignment:''' Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11273</id>
		<title>NYU HPC Access Instructions</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11273"/>
		<updated>2016-02-22T07:56:37Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Accessing the NYU HPC Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Accessing the NYU HPC Cluster == &lt;br /&gt;
&lt;br /&gt;
If you don't have an account, request one at https://wikis.nyu.edu/display/NYUHPC/Request+or+Renew&lt;br /&gt;
&lt;br /&gt;
1. Log into the main HPC node:&lt;br /&gt;
       ssh &amp;lt;netid&amp;gt;@hpc.nyu.edu    &lt;br /&gt;
&lt;br /&gt;
2. From the HPC node, log into the Hadoop cluster:&lt;br /&gt;
       ssh dumbo&lt;br /&gt;
&lt;br /&gt;
You will be using a set of commands, and it will save you some time to first create aliases for them. Once on &amp;quot;dumbo&amp;quot;, run the following commands on your terminal:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
bash&lt;br /&gt;
&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
&lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To be able to re-use these aliases every time you log in to dumbo, append the following lines to the end of your .bashrc file:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
&lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&lt;br /&gt;
# Note: there must be no spaces around &amp;quot;=&amp;quot;!&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If bash is your default shell, run&lt;br /&gt;
      source ~/.bashrc&lt;br /&gt;
This will create the aliases. If tcsh is your default shell, just invoke bash -- it will automatically read the .bashrc file and create the aliases.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Here are some common commands:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
hfs        # See available commands.&lt;br /&gt;
&lt;br /&gt;
hfs -help   # Show more details on each command.&lt;br /&gt;
&lt;br /&gt;
hfs -ls [&amp;lt;path&amp;gt;]  # List files.&lt;br /&gt;
&lt;br /&gt;
hfs -cp &amp;lt;src&amp;gt; &amp;lt;dst&amp;gt;  # Copy files within HDFS.&lt;br /&gt;
&lt;br /&gt;
hfs -mkdir &amp;lt;path&amp;gt; # Create a directory.&lt;br /&gt;
&lt;br /&gt;
hfs -rm &amp;lt;path&amp;gt; # Remove a file.&lt;br /&gt;
&lt;br /&gt;
hfs -chmod &amp;lt;mode&amp;gt; &amp;lt;path&amp;gt; # Modify permissions.&lt;br /&gt;
&lt;br /&gt;
hfs -chown &amp;lt;owner&amp;gt; &amp;lt;path&amp;gt; # Modify owner.&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Commands for reading HDFS files and moving data between the local file system and HDFS:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
hfs -cat &amp;lt;src&amp;gt;  # Print a file's contents to stdout.&lt;br /&gt;
&lt;br /&gt;
hfs -copyFromLocal &amp;lt;localsrc&amp;gt; &amp;lt;dst&amp;gt; # Copy a local file to HDFS.&lt;br /&gt;
&lt;br /&gt;
hfs -copyToLocal &amp;lt;src&amp;gt; &amp;lt;localdst&amp;gt; # Copy a file from HDFS to the local file system.&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Using Hadoop Streaming ===&lt;br /&gt;
&lt;br /&gt;
* Hadoop Streaming allows you to use programs written in any language for MapReduce operations.&lt;br /&gt;
* You can use the &amp;quot;hjs&amp;quot; alias you created to run Hadoop Streaming.&lt;br /&gt;
&lt;br /&gt;
To run the example I provided, do the following:&lt;br /&gt;
&lt;br /&gt;
1) Copy the directory containing the Python files and input data to dumbo. You will first need to &amp;quot;scp&amp;quot; from your machine to the hpc node, and then from the hpc node to dumbo.&lt;br /&gt;
Assuming the directory on your machine is called /Users/julianafreire/MRExample:&lt;br /&gt;
       scp -r /Users/julianafreire/MRExample  your_netid@hpc.nyu.edu: &lt;br /&gt;
Then, from the hpc node:&lt;br /&gt;
       scp -r MRExample  dumbo:&lt;br /&gt;
&lt;br /&gt;
** Remember to replace your_netid with your actual netid!&lt;br /&gt;
&lt;br /&gt;
2) From dumbo, copy the data file to HDFS:&lt;br /&gt;
       hfs -copyFromLocal /home/your_netid/MRExample/wikipedia.txt wikipedia.txt&lt;br /&gt;
&lt;br /&gt;
3) Check if the file is on HDFS&lt;br /&gt;
      hfs -ls&lt;br /&gt;
&lt;br /&gt;
4) Now, to run the job, make sure you are in the right directory:&lt;br /&gt;
     cd /home/your_netid/MRExample&lt;br /&gt;
     hjs -file pmap.py  -mapper pmap.py   -file pred.py -reducer pred.py   -input /user/your_netid/wikipedia.txt -output /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
5) The outputs of this job are now in HDFS, in the directory /user/your_netid/wikipedia.output. To list the output files:&lt;br /&gt;
     hfs -ls /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
You can also inspect the content of the files:&lt;br /&gt;
&lt;br /&gt;
    hfs -cat wikipedia.output/*&lt;br /&gt;
&lt;br /&gt;
If you'd like to copy the files over to your local directory:&lt;br /&gt;
    hfs -get /user/your_netid/wikipedia.output  output&lt;br /&gt;
&lt;br /&gt;
This will copy the outputs to the local directory &amp;quot;output&amp;quot; on dumbo.&lt;br /&gt;
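&lt;br /&gt;
The scripts pmap.py and pred.py from the example are not reproduced on this page. As a rough sketch of the shape a streaming mapper and reducer usually take, here is a word count in Python. The word-count task, the file name wordcount.py, and the map/reduce command-line argument are all assumptions made for illustration; the provided scripts may do something different. Hadoop Streaming feeds each program lines on stdin, reads tab-separated key/value lines from stdout, and sorts the mapper output by key before the reducer sees it.&lt;br /&gt;
&lt;br /&gt;
```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word TAB 1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each key.

    The input must already be sorted by key, which Hadoop Streaming
    takes care of between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: a single file whose role is picked by its argument.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```
&lt;br /&gt;
Assuming Python 3 is available, the sketch can be checked without Hadoop by piping a small file through it, with sort standing in for the shuffle phase: cat wikipedia.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce&lt;br /&gt;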
&lt;br /&gt;
=== Using Spark ===&lt;br /&gt;
&lt;br /&gt;
* Spark allows you to write and run applications quickly in Java, Scala, Python and R.&lt;br /&gt;
* You can use either the Spark interactive shell or the spark-submit tool.&lt;br /&gt;
&lt;br /&gt;
To run the Spark interactive shell (Scala or Python):&lt;br /&gt;
&lt;br /&gt;
1) Log in to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute one of the following:&lt;br /&gt;
        spark-shell (to run applications in Scala)&lt;br /&gt;
        pyspark (to run applications in Python)&lt;br /&gt;
&lt;br /&gt;
To access your files stored on HDFS, use a URL of the following form as the file name in Spark:&lt;br /&gt;
	hdfs://babar.es.its.nyu.edu:8020/user/&amp;lt;your_net_id&amp;gt;/&amp;lt;your_files&amp;gt;&lt;br /&gt;
(The hdfs:// URL must be exactly right; otherwise you won't be able to access files on HDFS.)&lt;br /&gt;
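&lt;br /&gt;
Because a single typo in the URL is enough to make the file unreachable, it can help to build the URL in one place rather than retyping it. The helper below is only a sketch: the function name hdfs_url is made up for illustration, and the namenode address is copied from the line above, so it is only valid for as long as that address is.&lt;br /&gt;
&lt;br /&gt;
```python
# Namenode host and port, copied from the instructions above.
NAMENODE = "babar.es.its.nyu.edu:8020"


def hdfs_url(netid, relative_path):
    """Return the absolute hdfs:// URL of a file under /user/NETID/ (hypothetical helper)."""
    if not netid:
        raise ValueError("netid must not be empty")
    # Strip leading slashes so the path is always relative to the home directory.
    return "hdfs://{}/user/{}/{}".format(NAMENODE, netid, relative_path.lstrip("/"))


print(hdfs_url("your_netid", "wikipedia.txt"))
```
&lt;br /&gt;
In the pyspark shell, where the SparkContext is available as sc, the result can be passed straight to sc.textFile, for example sc.textFile(hdfs_url(&amp;quot;your_netid&amp;quot;, &amp;quot;wikipedia.txt&amp;quot;)).&lt;br /&gt;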
&lt;br /&gt;
To submit a job to Spark:&lt;br /&gt;
&lt;br /&gt;
1) Log in to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute&lt;br /&gt;
	spark-submit --num-executors &amp;lt;10-100&amp;gt; &amp;lt;your_python_script&amp;gt; &amp;lt;your_script_arguments&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The dumbo cluster has 100 executors, and you may request any number of them for your submission. More executors generally means a faster job, but performance will degrade if many people submit Spark jobs at the same time.&lt;br /&gt;
&lt;br /&gt;
You can try some examples: &lt;br /&gt;
* Without streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py&lt;br /&gt;
* With streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py&lt;br /&gt;
&lt;br /&gt;
Spark Streaming is not the same as Hadoop Streaming. Hadoop Streaming is the mechanism that lets you write mappers and reducers in any language, whereas Spark applications can be written directly in Python, R, Java, or Scala. &lt;br /&gt;
Spark Streaming is something different: it supports the processing of live data streams.&lt;br /&gt;
&lt;br /&gt;
Some references:&lt;br /&gt;
&lt;br /&gt;
1) Submitting applications to Spark: http://spark.apache.org/docs/latest/submitting-applications.html&lt;br /&gt;
2) Data transformation: http://spark.apache.org/docs/latest/programming-guide.html#transformations&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11212</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11212"/>
		<updated>2016-02-08T16:23:13Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 8 - March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
* '''Lab:''' Hands-on reproducibility. &lt;br /&gt;
* '''Programming assignment:''' Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11188</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11188"/>
		<updated>2016-02-04T22:34:27Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at Silver 207 &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session, so please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
* '''Lab:''' Hands-on reproducibility. &lt;br /&gt;
* '''Programming assignment:''' Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11171</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11171"/>
		<updated>2016-02-01T23:48:15Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
* '''Lab:''' Hands-on reproducibility. &lt;br /&gt;
* '''Programming assignment:''' Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11167</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11167"/>
		<updated>2016-02-01T19:05:18Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
* '''Lab:''' Hands-on reproducibility. &lt;br /&gt;
* '''Programming assignment:''' Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11166</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11166"/>
		<updated>2016-02-01T19:02:14Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases, Relational Model and SQL */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-db.pdf&lt;br /&gt;
* '''Lab:''' getting started with MySQL&lt;br /&gt;
* '''Required Reading:''' &lt;br /&gt;
** Chapter 1 of Mining of Massive Datasets&lt;br /&gt;
* '''Suggested Reading:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
** [https://docs.google.com/file/d/0B7lNUaak0bK1NDBWZU5XTmItdGc/edit History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla], by C. Mohan, EDBT 2013&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-db.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
* '''Lab:''' Hands-on reproducibility. &lt;br /&gt;
* '''Programming assignment:''' Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11153</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11153"/>
		<updated>2016-01-23T22:05:43Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases, Relational Model and SQL==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/datamanagement.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-db.pdf&lt;br /&gt;
* '''Lab:''' in-class assignment on relational algebra&lt;br /&gt;
* '''Readings:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-db.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
* '''Lab:''' Hands-on reproducibility. &lt;br /&gt;
* '''Programming assignment:''' Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2016/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&amp;amp;T Research ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2:  TBD ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11152</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11152"/>
		<updated>2016-01-23T22:01:34Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases, Relational Model and SQL==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/datamanagement.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-db.pdf&lt;br /&gt;
* '''Lab:''' in-class assignment on relational algebra&lt;br /&gt;
* '''Readings:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-db.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf&lt;br /&gt;
* '''Lab:''' NoSQL&lt;br /&gt;
* '''Programming assignment:''' Pig and Spark&lt;br /&gt;
* '''Readings''': &lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* '''Additional Suggested reading:'''&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on reproducibility. Before class, please&lt;br /&gt;
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads&lt;br /&gt;
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt&lt;br /&gt;
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt&lt;br /&gt;
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf&lt;br /&gt;
** Questions? Email Fernando at fchirigati@nyu.edu&lt;br /&gt;
&lt;br /&gt;
* Programming assignment 4: Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Parallel Databases ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf&lt;br /&gt;
&lt;br /&gt;
* Required reading:&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* Suggested reading:&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Project Presentations ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11151</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11151"/>
		<updated>2016-01-23T21:55:36Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview ==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
*''' Lab:'''  Computing infrastructure for the course &lt;br /&gt;
* '''Reading:''' Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* '''Course survey:''' https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases, Relational Model and SQL==&lt;br /&gt;
&lt;br /&gt;
* '''Lecture notes:'''&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/datamanagement.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-db.pdf&lt;br /&gt;
* '''Lab:''' in-class assignment on relational algebra&lt;br /&gt;
* '''Readings:''' &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-db.pdf&lt;br /&gt;
* '''Lab:''' SQL &lt;br /&gt;
* '''Programming assignment:''' Using SQL for data analysis and cleaning (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (local and AWS)&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
*''' Lecture notes:''' http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design.pdf&lt;br /&gt;
* '''Lab:''' Hands-on Hadoop (HPC)&lt;br /&gt;
* '''Programming assignment:''' Map Reduce (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: Parallel Databases vs MapReduce; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on SPARK (HPC)&lt;br /&gt;
* Programming assignment: check NYU Classes on March 10th&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on reproducibility. Before class, please&lt;br /&gt;
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads&lt;br /&gt;
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt&lt;br /&gt;
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt&lt;br /&gt;
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf&lt;br /&gt;
** Questions? Email Fernando at fchirigati@nyu.edu&lt;br /&gt;
&lt;br /&gt;
* Programming assignment 4: Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
**Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
**Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Parallel Databases ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf&lt;br /&gt;
&lt;br /&gt;
* Required reading:&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* Suggested reading:&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Project Presentations ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11150</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11150"/>
		<updated>2016-01-23T21:47:14Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* News */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See  [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview; Lab: Computing infrastructure for the course ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
* Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases, Relational Model and SQL==&lt;br /&gt;
&lt;br /&gt;
* In-class assignment: relational algebra&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
* Lab: SQL&lt;br /&gt;
* Programming assignment: Using SQL for data analysis and cleaning &lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (local and AWS)&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (HPC)&lt;br /&gt;
* Programming assignment: Map Reduce (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: Parallel Databases vs MapReduce; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on SPARK (HPC)&lt;br /&gt;
* Programming assignment: check NYU Classes on March 10th&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on reproducibility. Before class, please&lt;br /&gt;
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads&lt;br /&gt;
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt&lt;br /&gt;
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt&lt;br /&gt;
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf&lt;br /&gt;
** Questions? Email Fernando at fchirigati@nyu.edu&lt;br /&gt;
&lt;br /&gt;
* Programming assignment 4: Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Parallel Databases ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf&lt;br /&gt;
&lt;br /&gt;
* Required reading:&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* Suggested reading:&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Project Presentations ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11149</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11149"/>
		<updated>2016-01-23T21:46:02Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: Access your NYU HPC account, which you will use for in-class exercises and homework assignments. See [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview; Lab: Computing infrastructure for the course ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
* Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases, Relational Model and SQL==&lt;br /&gt;
&lt;br /&gt;
* In-class assignment: relational algebra&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
* Lab: SQL&lt;br /&gt;
* Programming assignment: Using SQL for data analysis and cleaning &lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (local and AWS)&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (HPC)&lt;br /&gt;
* Programming assignment: Map Reduce (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: Parallel Databases vs MapReduce; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on SPARK (HPC)&lt;br /&gt;
* Programming assignment: check NYU Classes on March 10th&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on reproducibility. Before class, please&lt;br /&gt;
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads&lt;br /&gt;
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt&lt;br /&gt;
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt&lt;br /&gt;
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf&lt;br /&gt;
** Questions? Email Fernando at fchirigati@nyu.edu&lt;br /&gt;
&lt;br /&gt;
* Programming assignment 4: Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Parallel Databases ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf&lt;br /&gt;
&lt;br /&gt;
* Required reading:&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* Suggested reading:&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Project Presentations ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11148</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11148"/>
		<updated>2016-01-23T18:48:24Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. To obtain your credit, please follow the instructions at http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: Access your NYU HPC account, which you will use for in-class exercises and homework assignments. See [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview; Lab: Computing infrastructure for the course ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf&lt;br /&gt;
* Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  The evolution of Data Management and introduction to Big Data; Introduction to Databases, Relational Model and SQL==&lt;br /&gt;
&lt;br /&gt;
* In-class assignment: relational algebra&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.) ==&lt;br /&gt;
&lt;br /&gt;
* Lab: SQL&lt;br /&gt;
* Programming assignment: Using SQL for data analysis and cleaning &lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
&lt;br /&gt;
* Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (local and AWS)&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: MapReduce Algorithm Design Patterns  ==&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (HPC)&lt;br /&gt;
* Programming assignment: Map Reduce (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: Parallel Databases vs MapReduce; Introduction to SPARK== &lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on SPARK (HPC)&lt;br /&gt;
* Programming assignment: check NYU Classes on March 10th&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on reproducibility. Before class, please&lt;br /&gt;
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads&lt;br /&gt;
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt&lt;br /&gt;
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt&lt;br /&gt;
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf&lt;br /&gt;
** Questions? Email Fernando at fchirigati@nyu.edu&lt;br /&gt;
&lt;br /&gt;
* Programming assignment 4: Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Parallel Databases ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf&lt;br /&gt;
&lt;br /&gt;
* Required reading:&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* Suggested reading:&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Project Presentations ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11147</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11147"/>
		<updated>2016-01-23T18:21:04Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* DS-GA 1004- Big Data: Tentative Schedule -- subject to change */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructors: &lt;br /&gt;
** Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
** Dr. Erin C Carson &lt;br /&gt;
** Dr. Nicholas Knight &lt;br /&gt;
&lt;br /&gt;
* TAs:&lt;br /&gt;
** Yuan Feng&lt;br /&gt;
** Kevin Ye&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
&lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. To obtain your credit, please follow the instructions at http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: Access your NYU HPC account, which you will use for in-class exercises and homework assignments. See [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview; The evolution of Data Management and introduction to Big Data ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  Course Overview; The evolution of Data Management and introduction to Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf&lt;br /&gt;
* Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL ==&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-intro.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-more.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab:&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
* Lab: (continuation)&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* Required Reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce. Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (v 2.1).  Chapter 2 - 2.1, 2.2, and 2.3&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520&lt;br /&gt;
&lt;br /&gt;
* Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: Algorithm Design for MapReduce: Relational Operations  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (local)&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (2nd Edition), Chapter 2.&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Map Reduce (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce == &lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop on AWS&lt;br /&gt;
** Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip&lt;br /&gt;
** Install aws command-line interface: http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
&lt;br /&gt;
* Some links to AWS CLI documentation:&lt;br /&gt;
** http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
** http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html&lt;br /&gt;
** http://www.linux.com/learn/tutorials/761430-an-introduction-to-the-aws-command-line-tool&lt;br /&gt;
** EMR through the command line: https://www.safaribooksonline.com/library/view/programming-elastic-mapreduce/9781449364038/ch04.html&lt;br /&gt;
** Importing Key: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws&lt;br /&gt;
** EMR Job Flow: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMR_CreateJobFlow.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: check NYU Classes on March 10th&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on reproducibility. Before class, please&lt;br /&gt;
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads&lt;br /&gt;
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt&lt;br /&gt;
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt&lt;br /&gt;
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf&lt;br /&gt;
** Questions? Email Fernando at fchirigati@nyu.edu&lt;br /&gt;
&lt;br /&gt;
* Programming assignment 4: Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Parallel Databases ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf&lt;br /&gt;
&lt;br /&gt;
* Required reading:&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* Suggested reading:&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Project Presentations ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11103</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11103"/>
		<updated>2016-01-06T23:02:15Z</updated>

		<summary type="html">&lt;p&gt;Juliana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. To obtain your credit, please follow the instructions at http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: Access your NYU HPC account, which you will use for in-class exercises and homework assignments. See [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Jan 25:  Course Overview; The evolution of Data Management and introduction to Big Data ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 1:  Course Overview; The evolution of Data Management and introduction to Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf&lt;br /&gt;
* Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL ==&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-intro.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-more.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab:&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 4 - Feb 15: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 5 - Feb 22:  Introduction to Map Reduce ==&lt;br /&gt;
* Lab: (continuation)&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* Required Reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce. Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (v 2.1).  Chapter 2 - 2.1, 2.2, and 2.3&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520&lt;br /&gt;
&lt;br /&gt;
* Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 6 - Feb 29: Algorithm Design for MapReduce: Relational Operations  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (local)&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (2nd Edition), Chapter 2.&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Map Reduce (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 7: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce == &lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop on AWS&lt;br /&gt;
** Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip&lt;br /&gt;
** Install aws command-line interface: http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
&lt;br /&gt;
* Some links to AWS CLI documentation:&lt;br /&gt;
** http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
** http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html&lt;br /&gt;
** http://www.linux.com/learn/tutorials/761430-an-introduction-to-the-aws-command-line-tool&lt;br /&gt;
** EMR through the command line: https://www.safaribooksonline.com/library/view/programming-elastic-mapreduce/9781449364038/ch04.html&lt;br /&gt;
** Importing Key: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws&lt;br /&gt;
** EMR Job Flow: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMR_CreateJobFlow.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: check NYU Classes on March 10th&lt;br /&gt;
&lt;br /&gt;
== Week 8 -- March 14th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 9 - March 21: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on reproducibility. Before class, please&lt;br /&gt;
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads&lt;br /&gt;
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt&lt;br /&gt;
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt&lt;br /&gt;
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf&lt;br /&gt;
** Questions? Email Fernando at fchirigati@nyu.edu&lt;br /&gt;
&lt;br /&gt;
* Programming assignment 4: Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 10 - March 28th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 4th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 13 - April 18th: Parallel Databases ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf&lt;br /&gt;
&lt;br /&gt;
* Required reading:&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* Suggested reading:&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 14 - April 25th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 15 - May 2: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 16 - May 9: Project Presentations ==&lt;br /&gt;
&lt;br /&gt;
== Week 17 - May 16: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11102</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11102"/>
		<updated>2016-01-06T22:44:06Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* DS-GA 1004- Big Data: Tentative Schedule -- subject to change */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. To obtain your credit, please follow the instructions at http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: Access your NYU HPC account, which you will use for in-class exercises and homework assignments. See [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
= Background (2 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Feb 2:  Course Overview; The evolution of Data Management and introduction to Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf&lt;br /&gt;
* Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL ==&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-intro.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-more.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab:&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Feb 16: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 23:  Introduction to Map Reduce ==&lt;br /&gt;
* Lab: (continuation)&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* Required Reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce. Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (v 2.1).  Chapter 2 - 2.1, 2.2, and 2.3&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520&lt;br /&gt;
&lt;br /&gt;
* Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 4 - March 2: Algorithm Design for MapReduce: Relational Operations  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (local)&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (2nd Edition), Chapter 2.&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Map Reduce (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 5 - March 9: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce == &lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop on AWS&lt;br /&gt;
** Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip&lt;br /&gt;
** Install aws command-line interface: http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
&lt;br /&gt;
* Some links to AWS CLI documentation:&lt;br /&gt;
** http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
** http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html&lt;br /&gt;
** http://www.linux.com/learn/tutorials/761430-an-introduction-to-the-aws-command-line-tool&lt;br /&gt;
** EMR through the command line: https://www.safaribooksonline.com/library/view/programming-elastic-mapreduce/9781449364038/ch04.html&lt;br /&gt;
** Importing Key: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws&lt;br /&gt;
** EMR Job Flow: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMR_CreateJobFlow.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: check NYU Classes on March 10th&lt;br /&gt;
&lt;br /&gt;
== March 16th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - March 23: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on reproducibility. Before class, please&lt;br /&gt;
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads&lt;br /&gt;
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt&lt;br /&gt;
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt&lt;br /&gt;
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf&lt;br /&gt;
** Questions? Email Fernando at fchirigati@nyu.edu&lt;br /&gt;
&lt;br /&gt;
* Programming assignment 4: Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 30th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 8 - April 6th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 10 - April 20th: Parallel Databases ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf&lt;br /&gt;
&lt;br /&gt;
* Required reading:&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* Suggested reading:&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 27th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 12 - May 4: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 13 - May 11: Project Presentations ==&lt;br /&gt;
&lt;br /&gt;
== Week 14 - May 18: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11101</id>
		<title>Course: Big Data 2016</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=Course:_Big_Data_2016&amp;diff=11101"/>
		<updated>2016-01-06T22:43:46Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* DS-GA 1004- Big Data: Tentative Schedule -- subject to change */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= DS-GA 1004- Big Data: Tentative Schedule -- ''subject to change'' =&lt;br /&gt;
&lt;br /&gt;
* Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016&lt;br /&gt;
&lt;br /&gt;
* Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana)&lt;br /&gt;
&lt;br /&gt;
* Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102. &lt;br /&gt;
* Some classes will include a lab session; please always ''bring your laptop''.&lt;br /&gt;
&lt;br /&gt;
= News =&lt;br /&gt;
&lt;br /&gt;
* 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. To obtain your credit, please follow the instructions at http://www.vistrails.org/index.php/AWS_Setup&lt;br /&gt;
* 1/25/2016: Access your NYU HPC account, which you will use for in-class exercises and homework assignments. See [[NYU HPC Access Instructions]]&lt;br /&gt;
&lt;br /&gt;
= Background (2 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 1 - Feb 2:  Course Overview; The evolution of Data Management and introduction to Big Data ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf&lt;br /&gt;
* Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)&lt;br /&gt;
* Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form&lt;br /&gt;
&lt;br /&gt;
== Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL ==&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/intro-to-db.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/relational-algebra.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-intro.pdf&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/sql-more.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab:&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** [http://philip.greenspun.com/sql/introduction.html Greenspun's SQL for Web Nerds Intro]&lt;br /&gt;
** [http://philip.greenspun.com/sql/data-modeling.html SQL/Nerds Modeling (parts)]&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Feb 16: Holiday ==&lt;br /&gt;
&lt;br /&gt;
= Big Data Foundations and Infrastructure (3 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 3 - Feb 23:  Introduction to Map Reduce ==&lt;br /&gt;
* Lab: (continuation)&lt;br /&gt;
** SQL hands on: [[Big Data 2015 - SQL Lab]]&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-intro.pdf&lt;br /&gt;
* Required Reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce. Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (v 2.1).  Chapter 2 - 2.1, 2.2, and 2.3&lt;br /&gt;
* Other useful reading: &lt;br /&gt;
** Hadoop: The Definitive Guide.  http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520&lt;br /&gt;
&lt;br /&gt;
* Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Week 4 - March 2: Algorithm Design for MapReduce: Relational Operations  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  &lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-relations.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop (local)&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Mining of Massive Datasets (2nd Edition), Chapter 2.&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: Map Reduce (check NYU Classes)&lt;br /&gt;
&lt;br /&gt;
== Week 5 - March 9: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce == &lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-patterns.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on Hadoop on AWS&lt;br /&gt;
** Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip&lt;br /&gt;
** Install aws command-line interface: http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
&lt;br /&gt;
* Some links to AWS CLI documentation:&lt;br /&gt;
** http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html&lt;br /&gt;
** http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html&lt;br /&gt;
** http://www.linux.com/learn/tutorials/761430-an-introduction-to-the-aws-command-line-tool&lt;br /&gt;
** EMR through the command line: https://www.safaribooksonline.com/library/view/programming-elastic-mapreduce/9781449364038/ch04.html&lt;br /&gt;
** Importing Key: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws&lt;br /&gt;
** EMR Job Flow: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMR_CreateJobFlow.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Required reading: &lt;br /&gt;
** Data-Intensive Text Processing with MapReduce, Chapters 1 and 2&lt;br /&gt;
** Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)&lt;br /&gt;
&lt;br /&gt;
* Programming assignment: check NYU Classes on March 10th&lt;br /&gt;
&lt;br /&gt;
== March 16th: Spring Break ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Transparency and Reproducibility  (1 week) =&lt;br /&gt;
&lt;br /&gt;
== Week 6 - March 23: Data Exploration and Reproducibility  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:  http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Hands-on reproducibility. Before class, please&lt;br /&gt;
** Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads&lt;br /&gt;
** Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt&lt;br /&gt;
** Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt&lt;br /&gt;
** http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf&lt;br /&gt;
** Questions? Email Fernando at fchirigati@nyu.edu&lt;br /&gt;
&lt;br /&gt;
* Programming assignment 4: Exploring urban data (see NYU Classes)&lt;br /&gt;
&lt;br /&gt;
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =&lt;br /&gt;
&lt;br /&gt;
== Week 7 - March 30th:  Finding similar items  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets] &lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. &lt;br /&gt;
&lt;br /&gt;
== Week 8 - April 6th: Association Rules  ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]&lt;br /&gt;
&lt;br /&gt;
* Suggested additional reading: &lt;br /&gt;
** Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.&lt;br /&gt;
** Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann&lt;br /&gt;
** Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html&lt;br /&gt;
&lt;br /&gt;
* Homework Assignment&lt;br /&gt;
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.&lt;br /&gt;
&lt;br /&gt;
== Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP) ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf&lt;br /&gt;
&lt;br /&gt;
* Lab: Using Amazon AWS to analyze and visualize taxi data&lt;br /&gt;
** https://github.com/ViDA-NYU/aws_taxi&lt;br /&gt;
&lt;br /&gt;
== Week 10 - April 20th: Parallel Databases ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf&lt;br /&gt;
&lt;br /&gt;
* Required reading:&lt;br /&gt;
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf&lt;br /&gt;
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext&lt;br /&gt;
&lt;br /&gt;
* Suggested reading:&lt;br /&gt;
** Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609&lt;br /&gt;
** Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726&lt;br /&gt;
** BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf&lt;br /&gt;
** Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&lt;br /&gt;
&lt;br /&gt;
== Week 11 - April 27th: Graph Analysis ==&lt;br /&gt;
&lt;br /&gt;
* Lecture notes:&lt;br /&gt;
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf&lt;br /&gt;
&lt;br /&gt;
* Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms&lt;br /&gt;
&lt;br /&gt;
== Week 12 - May 4: Final Exam ==&lt;br /&gt;
&lt;br /&gt;
== Week 13 - May 11: Project Presentations ==&lt;br /&gt;
&lt;br /&gt;
== Week 14 - May 18: Project Presentations ==&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11100</id>
		<title>NYU HPC Access Instructions</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11100"/>
		<updated>2016-01-06T22:40:01Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Accessing the NYU HPC Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Accessing the NYU HPC Cluster == &lt;br /&gt;
&lt;br /&gt;
1. Log into the main HPC node:&lt;br /&gt;
       ssh &amp;lt;netid&amp;gt;@hpc.nyu.edu    &lt;br /&gt;
&lt;br /&gt;
2. From the HPC node, log into the Hadoop cluster:&lt;br /&gt;
       ssh dumbo&lt;br /&gt;
&lt;br /&gt;
You will be using a set of commands, and it will save you some time to first create aliases for them. Once on &amp;quot;dumbo&amp;quot;, run the following commands on your terminal:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
bash&lt;br /&gt;
&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
&lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To be able to reuse these aliases every time you log in to dumbo, append the following lines to the end of your .bashrc file:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
&lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&lt;br /&gt;
%% Note: you should not have any spaces around &amp;quot;=&amp;quot;!&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you have bash as your default shell, do&lt;br /&gt;
      source .bashrc&lt;br /&gt;
This will create the aliases. If you have tcsh as your default shell, just invoke bash -- it will automatically read the .bashrc file and create the aliases.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Here are some common commands:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
hfs        %% See available commands.&lt;br /&gt;
&lt;br /&gt;
hfs -help   %% more command details.&lt;br /&gt;
&lt;br /&gt;
hfs -ls [&amp;lt;path&amp;gt;]  %% List files&lt;br /&gt;
&lt;br /&gt;
hfs -cp &amp;lt;src&amp;gt; &amp;lt;dst&amp;gt;  %% Copy stuff&lt;br /&gt;
&lt;br /&gt;
hfs -mkdir &amp;lt;path&amp;gt; %% Create path&lt;br /&gt;
&lt;br /&gt;
hfs -rm &amp;lt;path&amp;gt; %% remove a file&lt;br /&gt;
&lt;br /&gt;
hfs -chmod &amp;lt;mode&amp;gt; &amp;lt;path&amp;gt; %% Modify permissions.&lt;br /&gt;
&lt;br /&gt;
hfs -chown &amp;lt;owner&amp;gt; &amp;lt;path&amp;gt; %% Modify owner.&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Some remote access commands:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
hfs -cat &amp;lt;src&amp;gt;  %% Cat contents to stdout.&lt;br /&gt;
&lt;br /&gt;
hfs -copyFromLocal &amp;lt;localsrc&amp;gt; &amp;lt;dst&amp;gt; %% Copy a local file into HDFS&lt;br /&gt;
&lt;br /&gt;
hfs -copyToLocal &amp;lt;src&amp;gt; &amp;lt;localdst&amp;gt; %% Copy a file from HDFS to the local filesystem&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Using Hadoop Streaming ===&lt;br /&gt;
&lt;br /&gt;
* Hadoop Streaming allows the use of any program, written in any language, for MapReduce operations.&lt;br /&gt;
* You can use the &amp;quot;hjs&amp;quot; alias you created to run Hadoop Streaming.&lt;br /&gt;
&lt;br /&gt;
To run the example I provided, do the following:&lt;br /&gt;
&lt;br /&gt;
1) Copy the directory containing the Python files and input data to dumbo. You will first need to &amp;quot;scp&amp;quot; from your machine to the hpc node, and then from the hpc node to dumbo.&lt;br /&gt;
Assuming the directory on your machine is called /Users/julianafreire/MRExample:&lt;br /&gt;
       scp -r /Users/julianafreire/MRExample  your_netid@hpc.nyu.edu: &lt;br /&gt;
Then, from the hpc node:&lt;br /&gt;
       scp -r MRExample  dumbo:&lt;br /&gt;
&lt;br /&gt;
** Remember to replace your_netid with your actual netid!&lt;br /&gt;
&lt;br /&gt;
2) From dumbo, you will now copy the data file to HDFS&lt;br /&gt;
       hfs -copyFromLocal /home/your_netid/MRExample/wikipedia.txt wikipedia.txt&lt;br /&gt;
&lt;br /&gt;
3) Check if the file is on HDFS&lt;br /&gt;
      hfs -ls&lt;br /&gt;
&lt;br /&gt;
4) Now, to run the job, make sure you are in the right directory:&lt;br /&gt;
     cd /home/your_netid/MRExample&lt;br /&gt;
     hjs -file pmap.py  -mapper pmap.py   -file pred.py -reducer pred.py   -input /user/your_netid/wikipedia.txt -output /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
5) The outputs of this job are now in HDFS, in the directory /user/your_netid/wikipedia.output.  To list the output files:&lt;br /&gt;
     hfs -ls /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
You can also inspect the content of the files:&lt;br /&gt;
&lt;br /&gt;
    hfs -cat wikipedia.output/*&lt;br /&gt;
&lt;br /&gt;
If you'd like to copy the files over to your local directory:&lt;br /&gt;
    hfs -get /user/your_netid/wikipedia.output  output&lt;br /&gt;
&lt;br /&gt;
This will copy the outputs to the local directory &amp;quot;output&amp;quot; on dumbo.&lt;br /&gt;
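&lt;br /&gt;
The contents of pmap.py and pred.py are not shown on this page. As a rough illustration only, the sketch below shows what a word-count mapper and reducer pair could look like; the function names map_line, reduce_sorted and run_locally are hypothetical, and the word-count task itself is an assumption. In real streaming scripts the mapper and the reducer each read lines from standard input and print tab-separated records, and Hadoop sorts the records by key in between.&lt;br /&gt;
&lt;br /&gt;
```python
import re
from itertools import groupby


def map_line(line):
    """Mapper step: emit one 'word TAB 1' record per word in the line."""
    return [word + "\t1" for word in re.findall(r"[a-z0-9']+", line.lower())]


def reduce_sorted(records):
    """Reducer step: sum the counts of consecutive records sharing a key."""
    pairs = (record.split("\t") for record in records)
    return [
        key + "\t" + str(sum(int(count) for _, count in group))
        for key, group in groupby(pairs, key=lambda pair: pair[0])
    ]


def run_locally(lines):
    """Imitate the streaming pipeline: map, sort by key, then reduce."""
    mapped = [record for line in lines for record in map_line(line)]
    return reduce_sorted(sorted(mapped))


if __name__ == "__main__":
    # Hypothetical two-line input, used only to exercise the functions.
    for record in run_locally(["the cat", "The dog saw the cat"]):
        print(record)
```
&lt;br /&gt;
Running the logic locally on a few lines like this is a cheap way to catch bugs before submitting the job to the cluster.&lt;br /&gt;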
&lt;br /&gt;
=== Using Spark ===&lt;br /&gt;
&lt;br /&gt;
* Spark allows you to write and run applications quickly in Java, Scala, Python and R&lt;br /&gt;
* You can use either the Spark interactive shell or the Spark submission tool&lt;br /&gt;
&lt;br /&gt;
To run the Spark interactive shell (Scala or Python):&lt;br /&gt;
&lt;br /&gt;
1) Log in to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute one of the following:&lt;br /&gt;
	spark-shell (to run applications in Scala)&lt;br /&gt;
        pyspark (to run applications in Python)&lt;br /&gt;
&lt;br /&gt;
If you want to access your files stored on HDFS, use the following URL as the filename in Spark:&lt;br /&gt;
	hdfs://babar.es.its.nyu.edu:8020/user/&amp;lt;your_net_id&amp;gt;/&amp;lt;your_files&amp;gt;&lt;br /&gt;
(the hdfs:// URL must be exactly right, otherwise you won't be able to access files on HDFS)&lt;br /&gt;
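&lt;br /&gt;
Since a typo in this URL is easy to make, it can help to build it in one place. The following is a minimal sketch; the helper name hdfs_url is hypothetical, and the host and port are simply the ones quoted above.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumption: host and port are the ones quoted above for the dumbo cluster.
HDFS_PREFIX = "hdfs://babar.es.its.nyu.edu:8020"


def hdfs_url(netid, relative_path):
    """Build the fully qualified HDFS URL for a file in a user's HDFS home."""
    return HDFS_PREFIX + "/user/" + netid + "/" + relative_path.lstrip("/")


if __name__ == "__main__":
    # Replace your_netid with your actual netid.
    print(hdfs_url("your_netid", "wikipedia.txt"))
```
&lt;br /&gt;
The resulting string can be passed wherever Spark expects a filename.&lt;br /&gt;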
&lt;br /&gt;
To submit a job to Spark:&lt;br /&gt;
&lt;br /&gt;
1) Log in to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute&lt;br /&gt;
	spark-submit --num-executors &amp;lt;10-100&amp;gt; &amp;lt;your_python_script&amp;gt; &amp;lt;your_script_arguments&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The dumbo cluster has 100 executors, so you may request any number up to that for your submission. More executors generally means a faster job, but if many people submit Spark jobs at the same time, performance will degrade for everyone.&lt;br /&gt;
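&lt;br /&gt;
If you launch jobs from a script, the submission command can be assembled programmatically. This is only a sketch: spark_submit_command is a hypothetical helper, not part of Spark, and the 10 to 100 range check simply mirrors the guidance above.&lt;br /&gt;
&lt;br /&gt;
```python
import shlex


def spark_submit_command(script, script_args=(), num_executors=10):
    """Return the spark-submit argument list for a Python script.

    Hypothetical helper: the 10-100 range mirrors the guidance for dumbo.
    """
    if num_executors not in range(10, 101):
        raise ValueError("num_executors must be between 10 and 100")
    return ["spark-submit", "--num-executors", str(num_executors), script, *script_args]


if __name__ == "__main__":
    command = spark_submit_command("wordcount.py", ["wikipedia.txt"], num_executors=20)
    # Quote each part so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(part) for part in command))
```
&lt;br /&gt;
Returning a list rather than a single string keeps each argument separate, which avoids quoting problems if the list is later handed to a process launcher.&lt;br /&gt;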
&lt;br /&gt;
You can try some examples: &lt;br /&gt;
* Without streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py&lt;br /&gt;
* With streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py&lt;br /&gt;
&lt;br /&gt;
Spark Streaming is not the same as Hadoop Streaming: Spark runs Python/R/Java/Scala programs natively, so it does not need a streaming layer to support them.&lt;br /&gt;
Instead, Spark Streaming supports the processing of live data streams.&lt;br /&gt;
&lt;br /&gt;
Some references:&lt;br /&gt;
&lt;br /&gt;
1) Submitting application to Spark: http://spark.apache.org/docs/latest/submitting-applications.html&lt;br /&gt;
2) Data transformation: http://spark.apache.org/docs/latest/programming-guide.html#transformations&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11099</id>
		<title>NYU HPC Access Instructions</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11099"/>
		<updated>2016-01-06T22:35:24Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Accessing the NYU HPC Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Accessing the NYU HPC Cluster == &lt;br /&gt;
&lt;br /&gt;
1. Log into the main HPC node:&lt;br /&gt;
       ssh &amp;lt;netid&amp;gt;@hpc.nyu.edu    &lt;br /&gt;
&lt;br /&gt;
2. From the HPC node, log into the Hadoop cluster:&lt;br /&gt;
       ssh dumbo&lt;br /&gt;
&lt;br /&gt;
You will be using a set of commands, and it will save you some time to first create aliases for them. Once on &amp;quot;dumbo&amp;quot;, run the following commands on your terminal:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
bash&lt;br /&gt;
&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
&lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To be able to reuse these aliases every time you log in to dumbo, append the following lines to the end of your .bashrc file:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&lt;br /&gt;
%% Note: you should not have any spaces around &amp;quot;=&amp;quot;!&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you have bash as your default shell, do&lt;br /&gt;
      source .bashrc&lt;br /&gt;
This will create the aliases. If you have tcsh as your default shell, just invoke bash -- it will automatically read the .bashrc file and create the aliases.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Here are some common commands:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
hfs        %% See available commands.&lt;br /&gt;
hfs -help   %% more command details.&lt;br /&gt;
hfs -ls [&amp;lt;path&amp;gt;]  %% List files&lt;br /&gt;
hfs -cp &amp;lt;src&amp;gt; &amp;lt;dst&amp;gt;  %% Copy stuff&lt;br /&gt;
hfs -mkdir &amp;lt;path&amp;gt; %% Create path&lt;br /&gt;
hfs -rm &amp;lt;path&amp;gt; %% remove a file&lt;br /&gt;
hfs -chmod &amp;lt;mode&amp;gt; &amp;lt;path&amp;gt; %% Modify permissions.&lt;br /&gt;
hfs -chown &amp;lt;owner&amp;gt; &amp;lt;path&amp;gt; %% Modify owner.&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Some remote access commands:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
hfs -cat &amp;lt;src&amp;gt;  %% Cat contents to stdout.&lt;br /&gt;
hfs -copyFromLocal &amp;lt;localsrc&amp;gt; &amp;lt;dst&amp;gt; %% Copy a local file into HDFS&lt;br /&gt;
hfs -copyToLocal &amp;lt;src&amp;gt; &amp;lt;localdst&amp;gt; %% Copy a file from HDFS to the local filesystem&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Using Hadoop Streaming ===&lt;br /&gt;
&lt;br /&gt;
* Hadoop Streaming allows the use of any program, written in any language, for MapReduce operations.&lt;br /&gt;
* You can use the &amp;quot;hjs&amp;quot; alias you created to run Hadoop Streaming.&lt;br /&gt;
&lt;br /&gt;
To run the example I provided, do the following:&lt;br /&gt;
&lt;br /&gt;
1) Copy the directory containing the Python files and input data to dumbo. You will first need to &amp;quot;scp&amp;quot; from your machine to the hpc node, and then from the hpc node to dumbo.&lt;br /&gt;
Assuming the directory on your machine is called /Users/julianafreire/MRExample:&lt;br /&gt;
       scp -r /Users/julianafreire/MRExample  your_netid@hpc.nyu.edu: &lt;br /&gt;
Then, from the hpc node:&lt;br /&gt;
       scp -r MRExample  dumbo:&lt;br /&gt;
&lt;br /&gt;
** Remember to replace your_netid with your actual netid!&lt;br /&gt;
&lt;br /&gt;
2) From dumbo, you will now copy the data file to HDFS&lt;br /&gt;
       hfs -copyFromLocal /home/your_netid/MRExample/wikipedia.txt wikipedia.txt&lt;br /&gt;
&lt;br /&gt;
3) Check if the file is on HDFS&lt;br /&gt;
      hfs -ls&lt;br /&gt;
&lt;br /&gt;
4) Now, to run the job, make sure you are in the right directory:&lt;br /&gt;
     cd /home/your_netid/MRExample&lt;br /&gt;
     hjs -file pmap.py  -mapper pmap.py   -file pred.py -reducer pred.py   -input /user/your_netid/wikipedia.txt -output /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
5) The outputs of this job are now in HDFS, in the directory /user/your_netid/wikipedia.output.  To list the output files:&lt;br /&gt;
     hfs -ls /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
You can also inspect the content of the files:&lt;br /&gt;
&lt;br /&gt;
    hfs -cat wikipedia.output/*&lt;br /&gt;
&lt;br /&gt;
If you'd like to copy the files over to your local directory:&lt;br /&gt;
    hfs -get /user/your_netid/wikipedia.output  output&lt;br /&gt;
&lt;br /&gt;
This will copy the outputs to the local directory &amp;quot;output&amp;quot; on dumbo.&lt;br /&gt;
&lt;br /&gt;
=== Using Spark ===&lt;br /&gt;
&lt;br /&gt;
* Spark allows you to write and run applications quickly in Java, Scala, Python and R&lt;br /&gt;
* You can use either the Spark interactive shell or the Spark submission tool&lt;br /&gt;
&lt;br /&gt;
To run the Spark interactive shell (Scala or Python):&lt;br /&gt;
&lt;br /&gt;
1) Log in to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute one of the following:&lt;br /&gt;
	spark-shell (to run applications in Scala)&lt;br /&gt;
        pyspark (to run applications in Python)&lt;br /&gt;
&lt;br /&gt;
If you want to access your files stored on HDFS, use the following URL as the filename in Spark:&lt;br /&gt;
	hdfs://babar.es.its.nyu.edu:8020/user/&amp;lt;your_net_id&amp;gt;/&amp;lt;your_files&amp;gt;&lt;br /&gt;
(the hdfs:// URL must be exactly right, otherwise you won't be able to access files on HDFS)&lt;br /&gt;
&lt;br /&gt;
To submit a job to Spark:&lt;br /&gt;
&lt;br /&gt;
1) Log in to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute&lt;br /&gt;
	spark-submit --num-executors &amp;lt;10-100&amp;gt; &amp;lt;your_python_script&amp;gt; &amp;lt;your_script_arguments&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The dumbo cluster has 100 executors, so you may request any number up to that for your submission. More executors generally means a faster job, but if many people submit Spark jobs at the same time, performance will degrade for everyone.&lt;br /&gt;
&lt;br /&gt;
Spark word count examples:&lt;br /&gt;
* Without streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py&lt;br /&gt;
* With streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py&lt;br /&gt;
&lt;br /&gt;
Spark Streaming is not the same as Hadoop Streaming: Spark runs Python/R/Java/Scala programs natively, so it does not need a streaming layer to support them.&lt;br /&gt;
Instead, Spark Streaming supports the processing of live data streams.&lt;br /&gt;
&lt;br /&gt;
Some references:&lt;br /&gt;
&lt;br /&gt;
1) Submitting application to Spark: http://spark.apache.org/docs/latest/submitting-applications.html&lt;br /&gt;
2) Data transformation: http://spark.apache.org/docs/latest/programming-guide.html#transformations&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11098</id>
		<title>NYU HPC Access Instructions</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11098"/>
		<updated>2016-01-06T22:34:58Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Accessing the NYU HPC Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Accessing the NYU HPC Cluster == &lt;br /&gt;
&lt;br /&gt;
1. Log into the main HPC node:&lt;br /&gt;
       ssh &amp;lt;netid&amp;gt;@hpc.nyu.edu    &lt;br /&gt;
&lt;br /&gt;
2. From the HPC node, log into the Hadoop cluster:&lt;br /&gt;
       ssh dumbo&lt;br /&gt;
&lt;br /&gt;
You will be using a set of commands, and it will save you some time to first create aliases for them. Once on &amp;quot;dumbo&amp;quot;, run the following commands on your terminal:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;&lt;br /&gt;
bash&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To be able to reuse these aliases every time you log in to dumbo, append the following lines to the end of your .bashrc file:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&lt;br /&gt;
%% Note: you should not have any spaces around &amp;quot;=&amp;quot;!&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you have bash as your default shell, do&lt;br /&gt;
      source .bashrc&lt;br /&gt;
This will create the aliases. If you have tcsh as your default shell, just invoke bash -- it will automatically read the .bashrc file and create the aliases.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Here are some common commands:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
hfs        %% See available commands.&lt;br /&gt;
hfs -help   %% more command details.&lt;br /&gt;
hfs -ls [&amp;lt;path&amp;gt;]  %% List files&lt;br /&gt;
hfs -cp &amp;lt;src&amp;gt; &amp;lt;dst&amp;gt;  %% Copy stuff&lt;br /&gt;
hfs -mkdir &amp;lt;path&amp;gt; %% Create path&lt;br /&gt;
hfs -rm &amp;lt;path&amp;gt; %% remove a file&lt;br /&gt;
hfs -chmod &amp;lt;mode&amp;gt; &amp;lt;path&amp;gt; %% Modify permissions.&lt;br /&gt;
hfs -chown &amp;lt;owner&amp;gt; &amp;lt;path&amp;gt; %% Modify owner.&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Some remote access commands:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
hfs -cat &amp;lt;src&amp;gt;  %% Cat contents to stdout.&lt;br /&gt;
hfs -copyFromLocal &amp;lt;localsrc&amp;gt; &amp;lt;dst&amp;gt; %% Copy a local file into HDFS&lt;br /&gt;
hfs -copyToLocal &amp;lt;src&amp;gt; &amp;lt;localdst&amp;gt; %% Copy a file from HDFS to the local filesystem&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Using Hadoop Streaming ===&lt;br /&gt;
&lt;br /&gt;
* Hadoop Streaming allows the use of any program, written in any language, for MapReduce operations.&lt;br /&gt;
* You can use the &amp;quot;hjs&amp;quot; alias you created to run Hadoop Streaming.&lt;br /&gt;
&lt;br /&gt;
To run the example I provided, do the following:&lt;br /&gt;
&lt;br /&gt;
1) Copy the directory containing the Python files and input data to dumbo. You will first need to &amp;quot;scp&amp;quot; from your machine to the hpc node, and then from the hpc node to dumbo.&lt;br /&gt;
Assuming the directory on your machine is called /Users/julianafreire/MRExample:&lt;br /&gt;
       scp -r /Users/julianafreire/MRExample  your_netid@hpc.nyu.edu: &lt;br /&gt;
Then, from the hpc node:&lt;br /&gt;
       scp -r MRExample  dumbo:&lt;br /&gt;
&lt;br /&gt;
** Remember to replace your_netid with your actual netid!&lt;br /&gt;
&lt;br /&gt;
2) From dumbo, you will now copy the data file to HDFS&lt;br /&gt;
       hfs -copyFromLocal /home/your_netid/MRExample/wikipedia.txt wikipedia.txt&lt;br /&gt;
&lt;br /&gt;
3) Check if the file is on HDFS&lt;br /&gt;
      hfs -ls&lt;br /&gt;
&lt;br /&gt;
4) Now, to run the job, make sure you are in the right directory:&lt;br /&gt;
     cd /home/your_netid/MRExample&lt;br /&gt;
     hjs -file pmap.py  -mapper pmap.py   -file pred.py -reducer pred.py   -input /user/your_netid/wikipedia.txt -output /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
5) The outputs of this job are now in HDFS, in the directory /user/your_netid/wikipedia.output.  To list the output files:&lt;br /&gt;
     hfs -ls /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
You can also inspect the content of the files:&lt;br /&gt;
&lt;br /&gt;
    hfs -cat wikipedia.output/*&lt;br /&gt;
&lt;br /&gt;
If you'd like to copy the files over to your local directory:&lt;br /&gt;
    hfs -get /user/your_netid/wikipedia.output  output&lt;br /&gt;
&lt;br /&gt;
This will copy the outputs to the local directory &amp;quot;output&amp;quot; on dumbo.&lt;br /&gt;
&lt;br /&gt;
=== Using Spark ===&lt;br /&gt;
&lt;br /&gt;
* Spark allows you to write and run applications quickly in Java, Scala, Python and R&lt;br /&gt;
* You can use either the Spark interactive shell or the Spark submission tool&lt;br /&gt;
&lt;br /&gt;
To run the Spark interactive shell (Scala or Python):&lt;br /&gt;
&lt;br /&gt;
1) Log in to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute one of the following:&lt;br /&gt;
	spark-shell (to run applications in Scala)&lt;br /&gt;
        pyspark (to run applications in Python)&lt;br /&gt;
&lt;br /&gt;
If you want to access your files stored on HDFS, use the following URL as the filename in Spark:&lt;br /&gt;
	hdfs://babar.es.its.nyu.edu:8020/user/&amp;lt;your_net_id&amp;gt;/&amp;lt;your_files&amp;gt;&lt;br /&gt;
(the hdfs:// URL must be exactly right, otherwise you won't be able to access files on HDFS)&lt;br /&gt;
&lt;br /&gt;
To submit a job to Spark:&lt;br /&gt;
&lt;br /&gt;
1) Log in to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute&lt;br /&gt;
	spark-submit --num-executors &amp;lt;10-100&amp;gt; &amp;lt;your_python_script&amp;gt; &amp;lt;your_script_arguments&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The dumbo cluster has 100 executors, so you may request any number up to that for your submission. More executors generally means a faster job, but if many people submit Spark jobs at the same time, performance will degrade for everyone.&lt;br /&gt;
&lt;br /&gt;
Spark word count examples:&lt;br /&gt;
* Without streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py&lt;br /&gt;
* With streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py&lt;br /&gt;
&lt;br /&gt;
Spark Streaming is not the same as Hadoop Streaming: Spark runs Python/R/Java/Scala programs natively, so it does not need a streaming layer to support them.&lt;br /&gt;
Instead, Spark Streaming supports the processing of live data streams.&lt;br /&gt;
&lt;br /&gt;
Some references:&lt;br /&gt;
&lt;br /&gt;
1) Submitting application to Spark: http://spark.apache.org/docs/latest/submitting-applications.html&lt;br /&gt;
2) Data transformation: http://spark.apache.org/docs/latest/programming-guide.html#transformations&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11096</id>
		<title>NYU HPC Access Instructions</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11096"/>
		<updated>2016-01-06T22:32:18Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Accessing the NYU HPC Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Accessing the NYU HPC Cluster == &lt;br /&gt;
&lt;br /&gt;
1. Log into the main HPC node:&lt;br /&gt;
       ssh &amp;lt;netid&amp;gt;@hpc.nyu.edu    &lt;br /&gt;
&lt;br /&gt;
2. From the HPC node, log into the Hadoop cluster:&lt;br /&gt;
       ssh dumbo&lt;br /&gt;
&lt;br /&gt;
You will be using a set of commands, and it will save you some time to first create aliases for them. Once on &amp;quot;dumbo&amp;quot;, run the following commands on your terminal:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
bash&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To be able to reuse these aliases every time you log in to dumbo, append the following lines to the end of your .bashrc file:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&lt;br /&gt;
%% Note: you should not have any spaces around &amp;quot;=&amp;quot;!&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you have bash as your default shell, do&lt;br /&gt;
      source .bashrc&lt;br /&gt;
This will create the aliases. If you have tcsh as your default shell, just invoke bash -- it will automatically read the .bashrc file and create the aliases.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Here are some common commands:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
hfs        %% See available commands.&lt;br /&gt;
hfs -help   %% more command details.&lt;br /&gt;
hfs -ls [&amp;lt;path&amp;gt;]  %% List files&lt;br /&gt;
hfs -cp &amp;lt;src&amp;gt; &amp;lt;dst&amp;gt;  %% Copy stuff&lt;br /&gt;
hfs -mkdir &amp;lt;path&amp;gt; %% Create path&lt;br /&gt;
hfs -rm &amp;lt;path&amp;gt; %% remove a file&lt;br /&gt;
hfs -chmod &amp;lt;mode&amp;gt; &amp;lt;path&amp;gt; %% Modify permissions.&lt;br /&gt;
hfs -chown &amp;lt;owner&amp;gt; &amp;lt;path&amp;gt; %% Modify owner.&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Some remote access commands:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
hfs -cat &amp;lt;src&amp;gt;  %% Cat contents to stdout.&lt;br /&gt;
hfs -copyFromLocal &amp;lt;localsrc&amp;gt; &amp;lt;dst&amp;gt; %% Copy a local file into HDFS&lt;br /&gt;
hfs -copyToLocal &amp;lt;src&amp;gt; &amp;lt;localdst&amp;gt; %% Copy a file from HDFS to the local filesystem&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Using Hadoop Streaming ===&lt;br /&gt;
&lt;br /&gt;
* Hadoop Streaming allows the use of any program, written in any language, for MapReduce operations.&lt;br /&gt;
* You can use the &amp;quot;hjs&amp;quot; alias you created to run Hadoop Streaming.&lt;br /&gt;
&lt;br /&gt;
To run the example I provided, do the following:&lt;br /&gt;
&lt;br /&gt;
1) Copy the directory containing the Python files and input data to dumbo. You will first need to &amp;quot;scp&amp;quot; from your machine to the hpc node, and then from the hpc node to dumbo.&lt;br /&gt;
Assuming the directory on your machine is called /Users/julianafreire/MRExample:&lt;br /&gt;
       scp -r /Users/julianafreire/MRExample  your_netid@hpc.nyu.edu: &lt;br /&gt;
Then, from the hpc node:&lt;br /&gt;
       scp -r MRExample  dumbo:&lt;br /&gt;
&lt;br /&gt;
** Remember to replace your_netid with your actual netid!&lt;br /&gt;
&lt;br /&gt;
2) From dumbo, you will now copy the data file to HDFS&lt;br /&gt;
       hfs -copyFromLocal /home/your_netid/MRExample/wikipedia.txt wikipedia.txt&lt;br /&gt;
&lt;br /&gt;
3) Check if the file is on HDFS&lt;br /&gt;
      hfs -ls&lt;br /&gt;
&lt;br /&gt;
4) Now, to run the job, make sure you are in the right directory:&lt;br /&gt;
     cd /home/your_netid/MRExample&lt;br /&gt;
     hjs -file pmap.py  -mapper pmap.py   -file pred.py -reducer pred.py   -input /user/your_netid/wikipedia.txt -output /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
5) The outputs of this job are now in HDFS, in the directory /user/your_netid/wikipedia.output.  To list the output files:&lt;br /&gt;
     hfs -ls /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
You can also inspect the content of the files:&lt;br /&gt;
&lt;br /&gt;
    hfs -cat wikipedia.output/*&lt;br /&gt;
&lt;br /&gt;
If you'd like to copy the files over to your local directory:&lt;br /&gt;
    hfs -get /user/your_netid/wikipedia.output  output&lt;br /&gt;
&lt;br /&gt;
This will copy the outputs to the local directory &amp;quot;output&amp;quot; on dumbo.&lt;br /&gt;
&lt;br /&gt;
=== Using Spark ===&lt;br /&gt;
&lt;br /&gt;
* Spark allows you to write and run applications quickly in Java, Scala, Python and R&lt;br /&gt;
* You can use either the Spark interactive shell or the Spark submission tool&lt;br /&gt;
&lt;br /&gt;
To run the Spark interactive shell (Scala or Python):&lt;br /&gt;
&lt;br /&gt;
1) Log in to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute one of the following:&lt;br /&gt;
	spark-shell (to run applications in Scala)&lt;br /&gt;
        pyspark (to run applications in Python)&lt;br /&gt;
&lt;br /&gt;
If you want to access your files stored on HDFS, use the following URL as the filename in Spark:&lt;br /&gt;
	hdfs://babar.es.its.nyu.edu:8020/user/&amp;lt;your_net_id&amp;gt;/&amp;lt;your_files&amp;gt;&lt;br /&gt;
(the hdfs:// URL must be exactly right, otherwise you won't be able to access files on HDFS)&lt;br /&gt;
&lt;br /&gt;
To submit a job to Spark:&lt;br /&gt;
&lt;br /&gt;
1) Log in to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute&lt;br /&gt;
	spark-submit --num-executors &amp;lt;10-100&amp;gt; &amp;lt;your_python_script&amp;gt; &amp;lt;your_script_arguments&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The dumbo cluster has 100 executors, so you may request any number up to that for your submission. More executors generally means a faster job, but if many people submit Spark jobs at the same time, performance will degrade for everyone.&lt;br /&gt;
&lt;br /&gt;
Spark word count examples:&lt;br /&gt;
* Without streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py&lt;br /&gt;
* With streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py&lt;br /&gt;
&lt;br /&gt;
Spark Streaming is not the same as Hadoop Streaming: Spark runs Python/R/Java/Scala programs natively, so it does not need a streaming layer to support them.&lt;br /&gt;
Instead, Spark Streaming supports the processing of live data streams.&lt;br /&gt;
&lt;br /&gt;
Some references:&lt;br /&gt;
&lt;br /&gt;
1) Submitting application to Spark: http://spark.apache.org/docs/latest/submitting-applications.html&lt;br /&gt;
2) Data transformation: http://spark.apache.org/docs/latest/programming-guide.html#transformations&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
	<entry>
		<id>https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11095</id>
		<title>NYU HPC Access Instructions</title>
		<link rel="alternate" type="text/html" href="https://www.vistrails.org//index.php?title=NYU_HPC_Access_Instructions&amp;diff=11095"/>
		<updated>2016-01-06T22:30:29Z</updated>

		<summary type="html">&lt;p&gt;Juliana: /* Accessing the NYU HPC Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Accessing the NYU HPC Cluster == &lt;br /&gt;
&lt;br /&gt;
1. Log into the main HPC node:&lt;br /&gt;
       ssh &amp;lt;netid&amp;gt;@hpc.nyu.edu    &lt;br /&gt;
&lt;br /&gt;
2. From the HPC node, log into the Hadoop cluster:&lt;br /&gt;
       ssh dumbo&lt;br /&gt;
&lt;br /&gt;
You will be using a set of commands, and it will save you some time to first create aliases for them. Once on &amp;quot;dumbo&amp;quot;, run the following commands on your terminal:&lt;br /&gt;
&lt;br /&gt;
bash&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To be able to reuse these aliases every time you log in to dumbo, append the following lines to the end of your .bashrc file:&lt;br /&gt;
alias hfs='/usr/bin/hadoop fs '&lt;br /&gt;
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars&lt;br /&gt;
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar &lt;br /&gt;
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'&lt;br /&gt;
&lt;br /&gt;
%% Note: you should not have any spaces around &amp;quot;=&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you have bash as your default shell, do&lt;br /&gt;
      source .bashrc&lt;br /&gt;
This will create the aliases. If you have tcsh as your default shell, just invoke bash -- it will automatically read the .bashrc file and create the aliases.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Here are some common commands:&lt;br /&gt;
hfs        %% See available commands.&lt;br /&gt;
hfs -help   %% more command details.&lt;br /&gt;
hfs -ls [&amp;lt;path&amp;gt;]  %% List files&lt;br /&gt;
hfs -cp &amp;lt;src&amp;gt; &amp;lt;dst&amp;gt;  %% Copy stuff&lt;br /&gt;
hfs -mkdir &amp;lt;path&amp;gt; %% Create path&lt;br /&gt;
hfs -rm &amp;lt;path&amp;gt; %% remove a file&lt;br /&gt;
hfs -chmod &amp;lt;mode&amp;gt; &amp;lt;path&amp;gt; %% Modify permissions.&lt;br /&gt;
hfs -chown &amp;lt;owner&amp;gt; &amp;lt;path&amp;gt; %% Modify owner.&lt;br /&gt;
&lt;br /&gt;
Some remote access commands:&lt;br /&gt;
hfs -cat &amp;lt;src&amp;gt;  %% Cat contents to stdout.&lt;br /&gt;
hfs -copyFromLocal &amp;lt;localsrc&amp;gt; &amp;lt;dst&amp;gt; %% Copy a local file into HDFS&lt;br /&gt;
hfs -copyToLocal &amp;lt;src&amp;gt; &amp;lt;localdst&amp;gt; %% Copy a file from HDFS to the local filesystem&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Using Hadoop Streaming ===&lt;br /&gt;
&lt;br /&gt;
* Hadoop Streaming allows the use of any program, written in any language, for MapReduce operations.&lt;br /&gt;
* You can use the &amp;quot;hjs&amp;quot; alias you created to run Hadoop Streaming.&lt;br /&gt;
&lt;br /&gt;
To run the example I provided, do the following:&lt;br /&gt;
&lt;br /&gt;
1) Copy the directory containing the Python files and input data to dumbo. You will first need to &amp;quot;scp&amp;quot; from your machine to the hpc node, and then from the hpc node to dumbo.&lt;br /&gt;
Assuming the directory is called /Users/julianafreire/MRExample&lt;br /&gt;
       scp -r /Users/julianafreire/MRExample  your_netid@hpc.nyu.edu: &lt;br /&gt;
Then, from the hpc node:&lt;br /&gt;
       scp -r MRExample  dumbo:&lt;br /&gt;
&lt;br /&gt;
** Remember to replace your_netid with your actual netid!&lt;br /&gt;
&lt;br /&gt;
2) From dumbo, copy the data file to HDFS:&lt;br /&gt;
       hfs -copyFromLocal /home/your_netid/MRExample/wikipedia.txt wikipedia.txt&lt;br /&gt;
&lt;br /&gt;
3) Check if the file is on HDFS&lt;br /&gt;
      hfs -ls&lt;br /&gt;
&lt;br /&gt;
4) Now, to run the job, make sure you are in the right directory:&lt;br /&gt;
     cd /home/your_netid/MRExample&lt;br /&gt;
     hjs -file pmap.py  -mapper pmap.py   -file pred.py -reducer pred.py   -input /user/your_netid/wikipedia.txt -output /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
5) The outputs of this job are now in HDFS, in the directory /user/your_netid/wikipedia.output.  To list the output files:&lt;br /&gt;
     hfs -ls /user/your_netid/wikipedia.output&lt;br /&gt;
&lt;br /&gt;
You can also inspect the content of the files:&lt;br /&gt;
&lt;br /&gt;
    hfs -cat wikipedia.output/*&lt;br /&gt;
&lt;br /&gt;
If you'd like to copy the files over to your local directory:&lt;br /&gt;
    hfs -get /user/your_netid/wikipedia.output  output&lt;br /&gt;
&lt;br /&gt;
This will copy the outputs to the local directory &amp;quot;output&amp;quot; on dumbo&lt;br /&gt;
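&lt;br /&gt;
The contents of pmap.py and pred.py are not shown here. As a rough illustration of what a streaming mapper and reducer pair can look like, the following is a minimal word-count sketch in Python. The function names map_lines and reduce_lines are hypothetical, and the actual example files may compute something different.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import groupby


def map_lines(lines):
    """Mapper (hypothetical): emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer (hypothetical): sum the counts of each word.

    Hadoop Streaming sorts mapper output by key before the reducer
    reads it, so records for the same word arrive on consecutive lines.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# In a real streaming script, the mapper file would end with
#     for record in map_lines(sys.stdin): print(record)
# and the reducer file with the same loop over reduce_lines(sys.stdin).
```
&lt;br /&gt;
Keeping the logic in plain functions makes it possible to test the pair locally on a few lines of text, sorting the mapper output by hand, before submitting the job to the cluster.&lt;br /&gt;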
&lt;br /&gt;
----------------------------------------------------------------------&lt;br /&gt;
Using Spark&lt;br /&gt;
&lt;br /&gt;
* Spark allows you to write and run applications quickly in Java, Scala, Python and R.&lt;br /&gt;
* You can use either the Spark interactive shell or the Spark submission tool.&lt;br /&gt;
&lt;br /&gt;
To run the Spark interactive shell (Scala or Python):&lt;br /&gt;
&lt;br /&gt;
1) Login to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute one of the following:&lt;br /&gt;
	spark-shell (to run applications in Scala)&lt;br /&gt;
        pyspark (to run applications in Python)&lt;br /&gt;
&lt;br /&gt;
If you want to access your files stored on HDFS, use the following URL as the filename in Spark:&lt;br /&gt;
	hdfs://babar.es.its.nyu.edu:8020/user/&amp;lt;your_net_id&amp;gt;/&amp;lt;your_files&amp;gt;&lt;br /&gt;
(the hdfs:// URL must be exactly right, otherwise you won't be able to access your files on HDFS)&lt;br /&gt;
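&lt;br /&gt;
Since a single typo in the URL makes the file unreachable, it can help to build the URL in one place. The following is a small Python sketch. The helper name hdfs_url is hypothetical, and the host and port are copied from the URL above, so they apply only to this cluster.&lt;br /&gt;
&lt;br /&gt;
```python
# Host and port copied from the URL in the text; specific to this cluster.
NAMENODE = "babar.es.its.nyu.edu:8020"


def hdfs_url(net_id, relative_path):
    """Return the hdfs:// URL of a file under a user's HDFS home directory."""
    return f"hdfs://{NAMENODE}/user/{net_id}/{relative_path.lstrip('/')}"


# Inside the pyspark shell, where the SparkContext is predefined as sc:
#     lines = sc.textFile(hdfs_url("your_netid", "wikipedia.txt"))
```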
&lt;br /&gt;
To submit a job to Spark:&lt;br /&gt;
&lt;br /&gt;
1) Login to dumbo&lt;br /&gt;
&lt;br /&gt;
2) Execute&lt;br /&gt;
	spark-submit --num-executors &amp;lt;10-100&amp;gt; &amp;lt;your_python_script&amp;gt; &amp;lt;your_script_arguments&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The DUMBO cluster has 100 executors. Feel free to choose any number of executors for your submission. &lt;br /&gt;
More executors generally make a job run faster. However, if many people submit Spark jobs at the same time, performance will&lt;br /&gt;
degrade.&lt;br /&gt;
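&lt;br /&gt;
If you launch submissions from a script, the command line can be assembled programmatically. The following Python sketch builds the argument list for the spark-submit invocation shown above. The function name spark_submit_command is hypothetical, and the check on the executor count simply mirrors the 10-100 range given in the text.&lt;br /&gt;
&lt;br /&gt;
```python
def spark_submit_command(script, script_args=(), num_executors=10):
    """Build the argument list for the spark-submit call described above."""
    # Mirror the 10-100 range stated in the text (an assumption about
    # what the cluster accepts, not something spark-submit enforces).
    if num_executors not in range(10, 101):
        raise ValueError("num_executors must be between 10 and 100")
    return [
        "spark-submit",
        "--num-executors", str(num_executors),
        script,
        *script_args,
    ]


# To actually run it on dumbo, pass the list to subprocess:
#     subprocess.run(spark_submit_command("wordcount.py", ["input.txt"], 20), check=True)
```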
&lt;br /&gt;
Spark word count example:&lt;br /&gt;
&lt;br /&gt;
Without streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py&lt;br /&gt;
With streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py&lt;br /&gt;
&lt;br /&gt;
Spark Streaming is not the same as Hadoop Streaming. Hadoop Streaming exists so that you can write MapReduce jobs in any language, whereas Spark runs Python, R, Java and Scala programs natively. &lt;br /&gt;
Spark Streaming instead provides stream processing of live data streams.&lt;br /&gt;
&lt;br /&gt;
Some references:&lt;br /&gt;
&lt;br /&gt;
1) Submitting applications to Spark: http://spark.apache.org/docs/latest/submitting-applications.html&lt;br /&gt;
2) Data transformation: http://spark.apache.org/docs/latest/programming-guide.html#transformations&lt;/div&gt;</summary>
		<author><name>Juliana</name></author>
	</entry>
</feed>