Big Data 2015: Final Project

Task

You will analyze NYC taxi data. Different groups will use the data to study different aspects of urban life that can be detected from the taxi data and other related data sets.

You can select one or more of the areas below to focus your analysis on. While I provide suggestions for what you can look for, I expect you to use your creativity and go beyond those suggestions. Besides exploring the taxi data, you must also use at least one additional data set in your analysis.

Taxi Factbook

Periodically, the Taxi & Limousine Commission releases a Fact Book (see http://www.nyc.gov/html/tlc/downloads/pdf/2014_taxicab_fact_book.pdf) where, for a given year, they list different statistics, e.g., the total number of medallions, the average number of miles traveled per year, number of passengers, etc. In this project, you will create the infrastructure to automatically generate Fact Books for different years. The output could be a Web site where users can explore and compare the statistics for different years.

Detecting Gentrification

Gentrification in NYC is a problem that has received substantial attention both in the media and in academia. See e.g.,

Taxis can serve as sensors for economic activity in NYC, in that a higher density of taxis in a region can indicate increased activity in that region. Can we combine taxi data with other data sets to better detect gentrification? For example:

ACRIS (sales data): https://data.cityofnewyork.us/City-Government/ACRIS-Real-Property-Master/bnx9-e6tj
Multi Agency Permits (including all applications for construction activity): https://data.cityofnewyork.us/City-Government/Multi-Agency-Permits/xfyi-uyt5

Understanding taxi usage

Which neighborhoods are better served by taxis? How does this correlate with the median household income per neighborhood?
How does taxi density vary over time? During the day? On weekdays vs. weekends and holidays?
Are there regions where the taxi density is always high (or low)?
What are the most popular destinations? Do these change over time? For example, summer vs. the other seasons?
What are the most popular trips (source, destination)?
Is the number of trips and taxis affected by weather? E.g., are there fewer cabs when it is raining/snowing?
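
As a starting point for the time-of-day and weekday/weekend questions above, here is a minimal sketch that buckets pickups by day type and hour. It assumes pickup timestamps are strings of the form "YYYY-MM-DD HH:MM:SS"; that format, and the function names, are assumptions for illustration, so check them against the actual taxi files. Holidays are not handled here and would need an extra calendar.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout; verify against the real data before relying on it.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def day_type_and_hour(pickup_timestamp):
    """Return ('weekday' or 'weekend', hour 0-23) for a timestamp string."""
    moment = datetime.strptime(pickup_timestamp, TIME_FORMAT)
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    day_type = "weekend" if moment.weekday() >= 5 else "weekday"
    return day_type, moment.hour


def count_trips_by_hour(pickup_timestamps):
    """Count trips per (day type, hour) bucket."""
    counts = Counter()
    for timestamp in pickup_timestamps:
        counts[day_type_and_hour(timestamp)] += 1
    return counts


# Tiny made-up sample, only to show the shape of the result.
sample_pickups = [
    "2013-01-05 23:15:00",  # a Saturday
    "2013-01-05 23:40:00",  # same Saturday
    "2013-01-07 08:05:00",  # a Monday
]
trip_counts = count_trips_by_hour(sample_pickups)
for bucket, total in sorted(trip_counts.items()):
    print(bucket, total)
```

On the full data set the same bucketing function could serve as the key computation of a mapper, with the counting done in the reducer.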

Understanding taxi economics

How does revenue vary across neighborhoods and how does it correlate with the median household income in the neighborhood?
How does revenue vary over time? Are there months or seasons when taxi companies make more (or less) money?
How long do cab drivers drive without passengers? How does this vary over time?
Are revenues affected during major events? E.g., parades, presidential visits, storms?
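
One possible way to approach the question of time spent without passengers is to measure the gap between one trip's dropoff and the same driver's next pickup. The sketch below assumes each trip record can be reduced to (driver identifier, pickup time, dropoff time); the field layout, the timestamp format, and the 240-minute cutoff used to separate idle time from off-duty breaks are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed layout


def idle_minutes_per_driver(trips, max_gap_minutes=240):
    """Sum the minutes between consecutive trips for each driver.

    trips: iterable of (driver_id, pickup_string, dropoff_string).
    Gaps longer than max_gap_minutes are treated as off-duty and ignored;
    negative gaps (overlapping records, likely dirty data) are ignored too.
    """
    intervals_by_driver = defaultdict(list)
    for driver_id, pickup, dropoff in trips:
        intervals_by_driver[driver_id].append(
            (datetime.strptime(pickup, TIME_FORMAT),
             datetime.strptime(dropoff, TIME_FORMAT))
        )

    idle = {}
    for driver_id, intervals in intervals_by_driver.items():
        intervals.sort()  # order trips by pickup time
        total = 0.0
        for (_, previous_dropoff), (next_pickup, _) in zip(intervals, intervals[1:]):
            gap = (next_pickup - previous_dropoff).total_seconds() / 60.0
            if 0 <= gap <= max_gap_minutes:
                total += gap
        idle[driver_id] = total
    return idle


# Made-up records; "A" and "B" are placeholder driver identifiers.
sample_trips = [
    ("A", "2013-01-07 08:00:00", "2013-01-07 08:20:00"),
    ("A", "2013-01-07 08:35:00", "2013-01-07 09:00:00"),
    ("A", "2013-01-07 18:00:00", "2013-01-07 18:10:00"),  # after a long break
    ("B", "2013-01-07 10:00:00", "2013-01-07 10:30:00"),
]
idle = idle_minutes_per_driver(sample_trips)
print(idle)
```

The cutoff matters: without it, overnight breaks would be counted as idle driving, so it is worth reporting how sensitive your results are to that choice.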

Understanding driver behavior

How do different drivers work? Do drivers (or groups of drivers) have a preferred neighborhood (or set of neighborhoods)? What does the pickup/dropoff distribution look like? Does this preference change over time?
Are there patterns shared among different drivers?
Do some drivers get higher tips on average than others?
Do some drivers take longer routes than others?
Can you identify patterns for drivers that have higher income?
How many hours do drivers typically work each day/week? Are there outliers?
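
For the question about tips, a simple first measure is each driver's total tips divided by total fares. The sketch below assumes trips are available as (driver identifier, fare amount, tip amount) tuples, which is a hypothetical layout. Keep in mind that tips paid in cash may not appear in the data, so comparing drivers this way could be biased by payment type.

```python
from collections import defaultdict


def tip_rate_per_driver(trips):
    """Return total tips / total fares for each driver.

    trips: iterable of (driver_id, fare_amount, tip_amount).
    Records with a non-positive fare are skipped as likely dirty data.
    """
    fare_total = defaultdict(float)
    tip_total = defaultdict(float)
    for driver_id, fare, tip in trips:
        if fare <= 0:
            continue
        fare_total[driver_id] += fare
        tip_total[driver_id] += tip
    return {driver: tip_total[driver] / fare_total[driver] for driver in fare_total}


# Made-up records; "A" and "B" are placeholder driver identifiers.
sample_fares = [
    ("A", 10.0, 2.0),
    ("A", 30.0, 6.0),
    ("B", 20.0, 1.0),
    ("B", 0.0, 0.0),  # skipped: zero fare
]
tip_rates = tip_rate_per_driver(sample_fares)
for driver, rate in sorted(tip_rates.items()):
    print(driver, round(rate, 3))
```

Weighting by total fare, rather than averaging per-trip percentages, keeps very short trips from dominating the result; either choice is defensible as long as the report states which one was used.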

Data sources

Taxi data 2013

Taxi data 2010-2013:

https://uofi.app.box.com/NYCtaxidata

Census data

Demographics: http://www.nyc.gov/html/dcp/html/census/demo_tables_2010.shtml
Income information: http://www.nyc.gov/html/dcp/html/census/socio_tables.shtml
Shape files for census tracts: http://www.nyc.gov/html/dcp/html/bytes/districts_download_metadata.shtml (search for "tract")
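
To relate trips to census data, pickup and dropoff coordinates have to be assigned to tracts. The sketch below shows one common way to do this, a ray-casting point-in-polygon test. It assumes each tract has already been read from the shape file into a plain list of (x, y) vertices; reading the shape files themselves is not shown, and the tract names here are placeholders. Real tracts can have holes or multiple parts, which this simple version does not handle.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses.

    polygon: list of (x, y) vertices in order; the last vertex connects to the first.
    An odd number of crossings means the point is inside.
    """
    inside = False
    vertex_count = len(polygon)
    for i in range(vertex_count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % vertex_count]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_tract(x, y, tracts):
    """Return the id of the first tract containing (x, y), or None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(x, y, polygon):
            return tract_id
    return None


# Two made-up square "tracts" on a toy grid, not real geography.
toy_tracts = {
    "T1": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "T2": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
print(assign_tract(1, 1, toy_tracts))
print(assign_tract(3, 1, toy_tracts))
print(assign_tract(5, 5, toy_tracts))
```

Scanning every tract for every point is slow on the full data set; a coarse grid or bounding-box filter in front of this test is one optimization worth considering.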

Weather data

http://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets
http://www7.ncdc.noaa.gov/CDO/dataproduct -- select "Surface Data, Hourly Global", and then, when asked to select the region, choose NY and the three main stations (Central Park, JFK and LaGuardia).

Property and Construction data

ACRIS (sales data): https://data.cityofnewyork.us/City-Government/ACRIS-Real-Property-Master/bnx9-e6tj
Multi Agency Permits (including all applications for construction activity): https://data.cityofnewyork.us/City-Government/Multi-Agency-Permits/xfyi-uyt5

Project Mechanics

You should form a group with *at most 3 people*. You have to use the Hadoop environment to carry out your analyses -- you can write MapReduce programs, use Pig, or any other tool that works on Hadoop. Your code and scripts should be made available on GitHub and should be reproducible -- include enough information so that others can reproduce what you did. You will also maintain a Google Doc, shared with me, that describes your project, the questions you are investigating, and what you have done so far. Please use this form to register your group and provide the information about your GitHub repo and Google Doc: https://docs.google.com/forms/d/1feAXUfUfgt2NgrHXf3xku3AdxPUWcfaxRv-h-cEfC1E/viewform?usp=send_form
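
To give a sense of what a small MapReduce program for this project might look like, here is a sketch in the style of Hadoop Streaming, where the mapper and reducer exchange tab-separated key/value lines. It totals revenue per month. The column positions (pickup time in column 0, total amount in column 1) are hypothetical, so adjust them to the real file layout. The run_locally helper only imitates the sort between the map and reduce phases so the logic can be tried on a laptop before going to the cluster.

```python
from itertools import groupby

# Hypothetical column positions; adjust to the real CSV layout.
PICKUP_COLUMN = 0
AMOUNT_COLUMN = 1


def map_lines(lines):
    """Emit 'YYYY-MM<TAB>amount' for every parseable input line."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        try:
            month = fields[PICKUP_COLUMN][:7]  # 'YYYY-MM' prefix of the timestamp
            amount = float(fields[AMOUNT_COLUMN])
        except (IndexError, ValueError):
            continue  # skip header rows and dirty records
        yield f"{month}\t{amount}"


def reduce_lines(sorted_lines):
    """Sum amounts per month; input must be sorted by key, as Hadoop guarantees."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for month, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(float(value) for _, value in group)
        yield f"{month}\t{total:.2f}"


def run_locally(lines):
    """Imitate map -> sort -> reduce on a small in-memory sample."""
    return list(reduce_lines(sorted(map_lines(lines))))


# Made-up sample, including a header row and one dirty record.
sample_lines = [
    "pickup_datetime,total_amount",
    "2013-01-05 23:15:00,12.50",
    "2013-01-07 08:05:00,7.50",
    "2013-02-01 09:00:00,20.00",
    "2013-02-01 10:00:00,not_a_number",
]
monthly_revenue = run_locally(sample_lines)
for row in monthly_revenue:
    print(row)
```

In an actual Hadoop Streaming job the mapper and reducer would typically be two separate scripts that read from standard input and write to standard output; keeping the logic in plain functions like these makes it easier to test and to document for reproducibility.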

Here are your milestones:

April 3: Submit the form with the information for your group. In the Google Doc, indicate your choice for the project, the data sets you will use, the tasks you will carry out, and a proposed timeline with weekly milestones.
April 10: Submit a status report describing any issues you encountered and updates to your initial plan.
April 17: Submit a status report with preliminary results.
May 11: Final project report due.
May 11 and 18: Project presentations. Each group will present their results to the class. Each student will grade all the presentations (except, of course, their own ;-)

Project Report

You can use your Google Doc as the starting point for your report.

You should describe your experience, issues you encountered (e.g., dirty data) and how you dealt with them.
You should report on and explain the findings of your analysis -- the use of insightful visualizations is highly encouraged.
All results in your report should be reproducible -- the code/scripts you used should be made available together with instructions on how to run them to derive the results you obtained.
You should describe the experimental setup (cluster configuration, number of mappers/reducers, tools you used) as well as report on the performance of your approach (e.g., report the running times of the scripts) and any optimizations you applied to speed up your code.
You should describe the individual contributions of each of the project's members.