Provenance and Reproducibility
Data exploration is inherently a trial-and-error process: as we formulate and test hypotheses, we often need to follow many different lines of reasoning, use different tools, and explore multiple combinations of parameter values. It is not uncommon to arrive at an interesting result and not remember the exact path that led there. It is therefore important to maintain detailed provenance of the steps followed and of the data and parameter values used. This is particularly important for Big Data, where both the processes and the data are complex.
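To make the idea concrete, here is a minimal sketch of recording provenance by hand in Python. It is not how any particular system implements provenance capture. The helper names (fingerprint, run_step), the record fields, and the toy analysis step count_long_lines are all hypothetical and chosen only for illustration. Each run of a step is logged with its name, its parameter values, and a hash of the input data, so a result can later be traced back to what produced it.

```python
import hashlib
import json
import time


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest that identifies the exact input bytes."""
    return hashlib.sha256(data).hexdigest()


def run_step(log, name, func, data, **params):
    """Run one analysis step and append a provenance record to `log`.

    The record layout is an assumption made for this sketch.
    """
    result = func(data, **params)
    log.append({
        "step": name,
        "params": params,                    # parameter values used
        "input_sha256": fingerprint(data),   # which data was used
        "timestamp": time.time(),            # when the step ran
    })
    return result


def count_long_lines(data: bytes, min_length: int) -> int:
    """Hypothetical analysis step: count lines of at least `min_length` bytes."""
    return sum(1 for line in data.splitlines() if len(line) >= min_length)


provenance_log = []
raw = b"alpha\nbeta\ngamma delta\n"

# Explore two parameter values; both runs are recorded.
result_a = run_step(provenance_log, "count_long_lines", count_long_lines,
                    raw, min_length=5)
result_b = run_step(provenance_log, "count_long_lines", count_long_lines,
                    raw, min_length=10)

print(json.dumps(provenance_log, indent=2))
```

Even in this tiny case, the log answers the question "which parameter value and which data gave this result?" Keeping such a log by hand quickly becomes tedious and error-prone as the number of steps and tools grows, which is the motivation for capturing provenance systematically.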
Today, we will use VisTrails, an open-source data analysis and visualization system that systematically captures provenance as a user explores data using computational processes. We will discuss the benefits of provenance, in particular the ability to reproduce results and reuse knowledge.
The Problem: Analyzing MTA Fare Data
You will need to install VisTrails 2.1.1 to run this example. You can download the system from http://www.vistrails.org/index.php/Downloads.
Select the link that matches your operating system.