Difference between revisions of "Big Data 2015: Final Project"

===Weather data===
* http://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets
* http://www7.ncdc.noaa.gov/CDO/dataproduct
** Select either "Surface Data, Global Summary of the Day" or, for a more detailed analysis, "Surface Data, Hourly Global".
** You can then choose NY state and select the "John F Kennedy International Airport" station (or all three main stations: Central Park, JFK and LaGuardia).

===Property and Construction data===

Latest revision as of 18:28, 20 April 2015


By now, you are already an expert on the NYC taxi data. For the final project, different groups will use the data to study various aspects of urban life that can be detected from the taxi data and other related data sets.

You can select one or more of the areas below to focus your analysis on. While I provide suggestions for what you can look for, I expect you to use your creativity and go beyond those suggestions. Besides exploring the taxi data, you must also use at least one additional data set in your analysis.

Taxi Factbook

Periodically, the Taxi & Limousine Commission releases a Fact Book (see http://www.nyc.gov/html/tlc/downloads/pdf/2014_taxicab_fact_book.pdf) where, for a given year, they list different statistics, e.g., the total number of medallions, the average number of miles traveled per year, number of passengers, etc. In this project, you will create the infrastructure to automatically generate Fact Books for different years. The output could be a Web site where users can explore and compare the statistics for different years.

Detecting Gentrification

Gentrification in NYC is a problem that has received substantial attention both in the media and in academia. See e.g.,

Taxis can serve as sensors for economic activity in NYC, in that a higher density of taxis in a region can indicate increased activity in that region. Can we combine taxi data with other data sets to better detect gentrification? For example:

Understanding taxi usage

  • Which neighborhoods are better served by taxis? How does this correlate with the median household income per neighborhood?
  • How does taxi density vary over time? During the day? On weekdays vs. weekends and holidays?
  • Are there regions where the taxi density is always high (or low)?
  • What are the most popular destinations? Do these change over time? For example, summer vs. the other seasons?
  • What are the most popular trips (source, destination)?
  • Is the number of trips and taxis affected by weather? E.g., are there fewer cabs when it is raining/snowing?
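
As a starting point for the "most popular trips" question above, one option is to snap pickup and dropoff coordinates to a coarse grid and count (source, destination) cell pairs. The sketch below is written in map/reduce style so that it could be adapted to a Hadoop Streaming job. The grid size, the function names, and the input tuple layout (pickup longitude, pickup latitude, dropoff longitude, dropoff latitude) are assumptions made for illustration, not part of the taxi data format.

```python
import math
from collections import Counter

CELL_DEG = 0.01  # assumed grid size in degrees; tune for your analysis


def to_cell(lon, lat, cell=CELL_DEG):
    """Snap a coordinate to the integer index of its grid cell."""
    return (math.floor(lon / cell), math.floor(lat / cell))


def map_trip(trip):
    """Map step: emit ((pickup_cell, dropoff_cell), 1) for one trip."""
    p_lon, p_lat, d_lon, d_lat = trip  # assumed field order
    return (to_cell(p_lon, p_lat), to_cell(d_lon, d_lat)), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each (source, destination) key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


def top_trips(trips, k=10):
    """Return the k most frequent (pickup_cell, dropoff_cell) pairs."""
    return reduce_counts(map_trip(t) for t in trips).most_common(k)
```

On the cluster, the map and reduce steps would run as separate scripts reading from standard input. Replacing grid cells with neighborhood polygons would make the results easier to interpret.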

Understanding taxi economics

  • How does revenue vary across neighborhoods and how does it correlate with the median household income in the neighborhood?
  • How does revenue vary over time? Are there months or seasons when taxi companies make more (or less) money?
  • How long do cab drivers drive without passengers? How does this vary over time?
  • Are revenues affected during major events, e.g., parades, presidential visits, storms?
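
The question about time spent without passengers can be approximated by the gap between one trip's dropoff and the same driver's next pickup. The following is a minimal sketch under stated assumptions: each record is a (driver id, pickup time, dropoff time) tuple, timestamps use the format shown in `FMT`, and gaps longer than `MAX_GAP` are treated as off-duty time rather than cruising. None of these choices come from the assignment itself.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"      # assumed timestamp format
MAX_GAP = timedelta(hours=2)   # assumed cutoff: longer gaps count as off-duty


def idle_time_per_driver(trips, max_gap=MAX_GAP):
    """Sum the gaps between consecutive trips for each driver.

    trips: iterable of (driver_id, pickup_str, dropoff_str) tuples.
    Returns a dict mapping driver_id to a timedelta of idle time.
    """
    by_driver = defaultdict(list)
    for driver, pickup, dropoff in trips:
        by_driver[driver].append(
            (datetime.strptime(pickup, FMT), datetime.strptime(dropoff, FMT))
        )

    idle = {}
    for driver, rides in by_driver.items():
        rides.sort()  # order each driver's rides by pickup time
        total = timedelta(0)
        for (_, prev_drop), (next_pick, _) in zip(rides, rides[1:]):
            gap = next_pick - prev_drop
            # Skip overlapping records and gaps that look like a break
            if timedelta(0) < gap <= max_gap:
                total += gap
        idle[driver] = total
    return idle
```

In a MapReduce setting, the driver id would be the key, so that all trips of one driver reach the same reducer.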

Understanding driver behavior

  • How do different drivers work? Do drivers (or groups of drivers) have a preferred neighborhood (or set of neighborhoods)? What does the pickup/dropoff distribution look like? Does this preference change over time?
  • Are there patterns shared among different drivers?
  • Do some drivers get higher tips on average than others?
  • Do some drivers take longer routes than others?
  • Can you identify patterns for drivers that have higher income?
  • How many hours do drivers typically work each day/week? Are there outliers?
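
To compare tips across drivers, one possible measure is the ratio of total tips to total fares per driver, restricted to card payments because cash tips are often not recorded. This is only a sketch: the record layout, the payment code `"CRD"`, and the `min_trips` threshold are assumptions for illustration and should be checked against the actual data.

```python
from collections import defaultdict

CARD = "CRD"  # assumed code for card payments; verify against the data


def tip_rate_by_driver(rows, min_trips=1):
    """Return total tips / total fares per driver, for card payments only.

    rows: iterable of (driver_id, fare_cents, tip_cents, payment_type).
    Drivers with fewer than min_trips qualifying trips are omitted.
    """
    fare_sum = defaultdict(int)
    tip_sum = defaultdict(int)
    n_trips = defaultdict(int)
    for driver, fare, tip, payment in rows:
        if payment != CARD or fare <= 0:
            continue  # skip cash trips and invalid fares
        fare_sum[driver] += fare
        tip_sum[driver] += tip
        n_trips[driver] += 1
    return {
        d: tip_sum[d] / fare_sum[d]
        for d in n_trips
        if n_trips[d] >= min_trips
    }
```

A higher `min_trips` value reduces noise from drivers with only a handful of trips.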

Data sources

Taxi data 2013

Taxi data 2010-2013

Census data

Weather data

Property and Construction data

Project Mechanics

You should form a group of *at most 3 people*. You have to use the Hadoop environment to carry out your analyses -- you can write MapReduce programs, use Pig, or use any other tool that works on Hadoop. Your code and scripts should be made available on GitHub and should be reproducible -- include enough information so that others can reproduce what you did. You will also maintain a Google Doc, shared with me, that describes your project, the questions you are investigating, and what you have done so far. Please use this form to register your group and provide the information about your GitHub repo and Google Doc: https://docs.google.com/forms/d/1feAXUfUfgt2NgrHXf3xku3AdxPUWcfaxRv-h-cEfC1E/viewform?usp=send_form

Here are your milestones:

  • April 3: Submit the form with the information for your group. In the Google Doc, indicate your choice for the project, the data sets you will use, the tasks you will carry out, and a proposed timeline with weekly milestones.
  • April 10: Submit a status report describing any issues you encountered and updates to your initial plan.
  • April 20: Submit a status report with preliminary results.
  • May 11: Final project report due.
  • May 11 and 18: Project presentations. Each group will present its results to the class. Each student will grade all the presentations (except, of course, their own ;-)

Project Report

You can use your Google Doc as the starting point for your report.

  • You should describe your experience, issues you encountered (e.g., dirty data) and how you dealt with them.
  • You should report on and explain the findings of your analysis -- the use of insightful visualizations is highly encouraged.
  • All results in your report should be reproducible -- the code/scripts you used should be made available together with instructions on how to run them to derive the results you obtained.
  • You should describe the experimental setup (cluster configuration, number of mappers/reducers, tools you used) as well as report on the performance of your approach (e.g., report the running times of the scripts) and any optimizations you applied to speed up your code.
  • You should describe the individual contributions of each of the project's members.

Some Notes

  • For your analyses, it may be useful to have the fare and trip files merged. You already know how to do this ;-)
  • If you want to restrict your analyses to trips that start or end in Manhattan, you need to obtain the shape files for Manhattan and check whether the Lat/Long for the trip start (or end) is inside the polygon defined by the shape files. Here are some links where you can find shape files:
  • To avoid float precision issues, convert money amounts to cents.
  • Note that datetime is local datetime (with daylight saving). You can compute the dropoff_datetime by adding trip_time_in_secs to pickup_datetime -- this will help deal with changes in time.
  • Payments in cash often do not have a tip. Go figure...
  • Distances reported are in miles.
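
Three of the notes above can be sketched as small helper functions: converting money amounts to integer cents, deriving the dropoff time from pickup_datetime and trip_time_in_secs, and testing whether a point lies inside a polygon. The function names, the timestamp format in `FMT`, and the representation of a polygon as a list of (lon, lat) vertices are assumptions for illustration. A real Manhattan boundary would come from the shape files, and a multi-part shape would need one test per part.

```python
from datetime import datetime, timedelta
from decimal import Decimal, ROUND_HALF_UP

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def to_cents(amount):
    """Convert a money string such as '12.50' to integer cents."""
    cents = (Decimal(amount) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(cents)


def dropoff_time(pickup, trip_time_in_secs):
    """Add the trip duration to the pickup time (naive local time)."""
    return datetime.strptime(pickup, FMT) + timedelta(seconds=int(trip_time_in_secs))


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) is inside the polygon.

    polygon: list of (lon, lat) vertices of a single ring, in order.
    Points exactly on an edge may be classified either way.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside
```

Parsing amounts through `Decimal` rather than `float` keeps values such as "0.10" exact before they are turned into cents.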