
The provenance of computational processes and of the data they manipulate is of fundamental importance in the scientific process. Provenance (also referred to as audit trail, lineage, and pedigree) captures information about the steps used to generate a given data product. Such information provides documentation that is key to preserving the data and determining its quality and authorship, as well as to interpreting, reproducing, sharing, and publishing results.

Because workflows and workflow-based systems capture computational tasks at various levels of detail and systematically record provenance information, they have recently emerged as an alternative to the ad-hoc approaches to assembling computational tasks that are widely used in the scientific community. Workflow systems have become an important component of cyberinfrastructure, and NSF has made substantial investments both to improve these systems and to foster their adoption. Several large NSF-funded projects (e.g., CMOP, GEON, GriPhyN, LEPP, ROADNet, SCEC, SEEK) rely on workflow systems to automate computational tasks and to maintain detailed provenance of the derived results. As these systems are deployed, large volumes of provenance information are being collected.

But just capturing and storing provenance is not enough. To use this information effectively and to deal with a potential information overload, we need novel tools and techniques that help users explore and benefit from provenance. The ability to explore the knowledge available in the provenance of computational tasks has the potential to foster large-scale collaborations, expedite scientific training in disciplinary and inter-disciplinary settings, and reduce the lag between data acquisition and scientific insight.

Shared provenance repositories open up many new opportunities for knowledge sharing and re-use, and enable the creation of tools that support exploratory tasks common in the scientific process. These repositories can expose scientists to computational tasks that provide examples of sophisticated uses of tools they would not have access to otherwise. By analyzing a provenance repository, we can determine common workflow structures, identify common sequences of workflow changes, gain insight into how different people solve problems, and help users debug common errors and bottlenecks. Such analysis can also enable new techniques like auto-completion for workflows.
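As a minimal sketch of the kind of analysis described above, the following Python example counts how often each module-to-module connection appears across a small collection of workflows and uses those counts to rank candidate next modules, a simple stand-in for workflow auto-completion. The repository contents, the module names, and the functions `edge_support` and `suggest_next` are all hypothetical and chosen only for illustration; they do not reflect the project's actual data model or algorithms.

```python
from collections import Counter

# Hypothetical toy repository: each workflow is a list of
# (upstream module, downstream module) connections.
workflows = [
    [("FileReader", "Contour"), ("Contour", "Renderer")],
    [("FileReader", "Contour"), ("Contour", "Smooth"), ("Smooth", "Renderer")],
    [("FileReader", "Histogram"), ("Histogram", "Plot")],
    [("FileReader", "Contour"), ("Contour", "Renderer")],
]


def edge_support(workflows):
    """Count the number of workflows in which each connection appears."""
    counts = Counter()
    for edges in workflows:
        # Use a set so a connection repeated inside one workflow counts once.
        counts.update(set(edges))
    return counts


def suggest_next(workflows, module, k=2):
    """Rank the downstream modules most often connected after `module`."""
    followers = Counter()
    for (src, dst), n in edge_support(workflows).items():
        if src == module:
            followers[dst] += n
    return [dst for dst, _ in followers.most_common(k)]


if __name__ == "__main__":
    print(edge_support(workflows).most_common(3))
    print(suggest_next(workflows, "Contour"))
```

Counting single connections is the simplest possible notion of a "common workflow structure"; mining larger recurring subgraphs or sequences of workflow changes would require richer techniques than this sketch shows.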

In this project, we are investigating new techniques for mining, visualizing, and re-using the provenance of computational tasks.

Sites and Tools

Relevant Publications

  • Visual Summaries for Graph Collections. David Koop, Juliana Freire, and Claudio Silva. In IEEE Pacific Vis, 2013.
  • ReproZip: Using Provenance to Support Computational Reproducibility. Fernando Chirigati, Dennis Shasha, and Juliana Freire. In USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2013.
  • Computational reproducibility: state-of-the-art, challenges, and database research opportunities. Juliana Freire, Philippe Bonnet and Dennis Shasha. In Proceedings of SIGMOD, 593-596 (2012).
  • Exploring the Coming Repositories of Reproducible Experiments: Challenges and Opportunities. Juliana Freire, Philippe Bonnet and Dennis Shasha. In PVLDB, vol. 4, no. 12, 2011.
  • DEFOG: A System for Data-Backed Visual Composition. L. Lins, D. Koop, J. Freire, C. Silva. SCI Technical Report, No. UUSCI-2011-003, SCI Institute, University of Utah, 2011.

  • crowdLabs: Social Analysis and Visualization for the Sciences. P. Mates, E. Santos, J. Freire, and C. Silva. In Proceedings of the 23rd International Conference on Scientific and Statistical Database Management (SSDBM), 2011.
  • A Provenance-Based Infrastructure to Support the Life Cycle of Executable Papers. D. Koop, E. Santos, P. Mates, H. Vo, P. Bonnet, B. Bauer, B. Surer, M. Troyer, D. Williams, J. Tohline, J. Freire and C. Silva. In Proceedings of the International Conference on Computational Science, 2011. To appear.
  • Using VisTrails and Provenance for Teaching Scientific Visualization, C. Silva, E. Anderson, E. Santos and J. Freire. In Proceedings of the Eurographics Education Program, 2010. Best paper award. Revised version appears in Computer Graphics Forum, 30(1), pp. 75-84, 2011.
  • The Provenance of Workflow Upgrades. D. Koop, C. E. Scheidegger, J. Freire, and C. Silva. In Proceedings of the International Provenance and Annotation Workshop (IPAW), 2010.
  • Bridging Workflow and Data Provenance using Strong Links. D. Koop, E. Santos, B. Bauer, M. Troyer, J. Freire, and C. Silva. In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), 2010.
  • VisMashup: Streamlining the Creation of Custom Visualization Applications, E. Santos, L. Lins, J. Ahrens, J. Freire, and C. Silva. IEEE Transactions on Visualization and Computer Graphics, 15(6), pp. 1539-1546, 2009.
  • Towards Supporting Collaborative Data Analysis and Visualization in a Coastal Margin Observatory, E. Santos, P. Mates, E. Anderson, B. Grimm, J. Freire, C. Silva. In Proceedings of the ACM CSCW Workshop on The Changing Dynamics of Scientific Collaborations, 2010.
  • Enabling Advanced Visualization Tools in a Simulation Monitoring System, E. Santos, J. Tierny, A. Khan, B. Grimm, L. Lins, J. Freire, V. Pascucci and C. Silva. In Proceedings of the IEEE International Conference on e-Science, pp. 358-365, 2009.
  • Using Mediation to Achieve Provenance Interoperability. T. Ellkvist, D. Koop, E. Santos, J. Freire, C. Silva, and L. Stromback. In Proceedings of the IEEE 2009 Third International Workshop on Scientific Workflows (SWF 2009).
  • A First Study on Clustering Collections of Workflow Graphs. E. Santos, L. Lins, J. P. Ahrens, J. Freire, C. Silva. In Proceedings of IPAW, pp. 160-173, 2008.

  • Examining Statistics of Workflow Evolution Provenance: A First Study. Lauro Lins, David Koop, Erik Anderson, Steven P. Callahan, Emanuele Santos, Carlos E. Scheidegger, Juliana Freire, and Claudio T. Silva. In Proceedings of International Conference on Scientific and Statistical Database Management (SSDBM), 2008.


This project is funded by the National Science Foundation under award IIS-0905385.