ProvenanceAnalytics

From VistrailsWiki
Revision as of 16:43, 13 July 2010 by Juliana

Site under construction

The provenance of computational processes and the data they manipulate is of fundamental importance in the scientific process. Provenance (also referred to as audit trail, lineage, or pedigree) captures information about the steps used to generate a given data product. Such information provides documentation that is key to preserving the data, determining its quality and authorship, and interpreting, reproducing, sharing, and publishing results.

Because workflows and workflow-based systems capture computational tasks at various levels of detail and systematically record provenance information, they have recently emerged as an alternative to the ad hoc approaches to assembling computational tasks that are widely used in the scientific community. Workflow systems have become an important component of cyberinfrastructure, and NSF has made substantial investments to both improve these systems and foster their adoption. Several large NSF-funded projects (e.g., CMOP, GEON, GriPhyN, LEPP, ROADNet, SCEC, SEEK) rely on workflow systems to automate computational tasks and to maintain detailed provenance of the derived results. As these systems are deployed, large volumes of provenance information are being collected.

But just capturing and storing provenance is not enough. To use this information effectively and to cope with potential information overload, we need novel tools and techniques that help users explore and benefit from provenance. The ability to explore the knowledge available in the provenance of computational tasks has the potential to foster large-scale collaborations, expedite scientific training in disciplinary and interdisciplinary settings, and reduce the lag between data acquisition and scientific insight.

Shared provenance repositories open up many new opportunities for knowledge sharing and re-use, and enable the creation of tools that support exploratory tasks common in the scientific process. These repositories can expose scientists to computational tasks that provide examples of sophisticated uses of tools they would not have access to otherwise. By analyzing a provenance repository, we can determine common workflow structures, identify common sequences of workflow changes, gain insight into how different people solve problems, and help users debug common errors and bottlenecks. Such analysis can also enable new techniques like auto-completion for workflows.
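As a rough illustration of the kind of analysis described above, the following sketch counts how often each module-to-module connection appears across a small repository of workflows and uses those counts to rank likely next modules, a very simple form of auto-completion. The repository contents, the module names, and the functions `count_edges` and `suggest_next` are all hypothetical and chosen for illustration only; this is not the method used in the project, and real workflow repositories would need a richer representation.

```python
from collections import Counter

# Hypothetical toy repository: each workflow is a list of
# (source_module, target_module) connections. Names are made up.
workflows = [
    [("FileReader", "Filter"), ("Filter", "Renderer")],
    [("FileReader", "Filter"), ("Filter", "Histogram")],
    [("FileReader", "Filter"), ("Filter", "Renderer")],
]


def count_edges(repo):
    """Count in how many workflows each connection appears."""
    counts = Counter()
    for workflow in repo:
        # set() so a connection repeated within one workflow counts once
        counts.update(set(workflow))
    return counts


def suggest_next(repo, module, k=3):
    """Rank the modules that most often follow `module` in the repository."""
    followers = Counter()
    for (src, dst), n in count_edges(repo).items():
        if src == module:
            followers[dst] += n
    return [name for name, _ in followers.most_common(k)]


if __name__ == "__main__":
    print(count_edges(workflows).most_common(2))
    print(suggest_next(workflows, "Filter"))
```

Counting connections per workflow rather than per occurrence keeps one large workflow from dominating the statistics; whether that is the right choice depends on the repository being analyzed.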

In this project, we are investigating new techniques for mining, visualizing, and re-using the provenance of computational tasks.


Sites and Tools

  • CrowdLabs
  • VisTrails

Publications