Difference between revisions of "Provenance challenge"

== Relations ==


The basic relations between the primitives are shown in the diagram above. However, few of the current provenance systems contain all of this information. Some may lack information about produced data items or omit the definition of a workflow. The goal must thus be to extract as much information as possible from each source.

<table>
<tr><th>Relation</th><th>Input</th><th>Output</th></tr>
<tr><td>exists</td><td>all</td><td>boolean</td></tr>
<tr><td>equals</td><td>all</td><td>boolean</td></tr>
<tr><td>annotations</td><td>all</td><td>dict of key/value pairs</td></tr>
</table>


The most common use of provenance data will be to compute the transitive closure of some connected executions or data items, i.e. to track data dependencies backward and forward in time. We call this <b>upstream</b> for tracking back in time and <b>downstream</b> for tracking forward in time.

getInputPortForData dataItem inputPort
getOutputPortForData dataItem outputPort
getDataFromInputPort inputPort dataItem
getDataFromOutputPort outputPort dataItem
 
hasInputPort moduleInstance inputPort
inputPortOf inputPort moduleInstance
hasOutputPort moduleInstance outputPort
outputPortOf outputPort moduleInstance
 
 
outputOf dataItem moduleExecution
inputOf dataItem moduleExecution
hasOutput moduleExecution dataItem
hasInput moduleExecution dataItem
 
startTime moduleExecution time
endTime moduleExecution time
startTime workflowExecution time
endTime workflowExecution time
 
executionOf    moduleExecution moduleInstance
executionOf workflowExecution workflowInstance
 
hasExecution    moduleInstance moduleExecution
hasExecution workflowInstance workflowExecution
 
executions      workflowExecution moduleExecution
executedIn moduleExecution workflowExecution
 
inWorkflow moduleInstance workflow
hasModule workflow moduleInstance
 
connectedTo inputPort outputPort
connectedTo outputPort inputPort
 
runsModule moduleInstance module
hasInstance module moduleInstance
 
 
derived relations: (might be native)
 
derivedFrom   dataItem dataItem
derivedData   dataItem dataItem
previousModuleExecution   moduleExecution moduleExecution


We have identified four primitives for which upstream/downstream tracking is relevant:


<table border="1">
<tr><th>primitive</th><th>description</th></tr>
<tr><td>dataitem</td><td>tracking data dependencies</td></tr>
<tr><td>moduleExecution</td><td>tracking execution dependencies</td></tr>
<tr><td>moduleInstance</td><td>tracking module dependencies within a workflow</td></tr>
<tr><td>workflow</td><td>tracking workflow design history e.g. different workflow versions in the VisTrails action tree</td></tr>
</table>


transitive relations:
datatype relation
--------------------------------




--[[User:Tommy|Tommy]] 03:40, 13 April 2007 (MDT)--[[User:Tommy|Tommy]] 09:05, 12 April 2007 (MDT)

Revision as of 09:40, 13 April 2007

Second provenance challenge design overview

This page describes how the queries of the Second Provenance Challenge are to be answered.

The goal of this project is to create an API capable of querying different kinds of databases containing provenance data. The main focus will be on provenance generated by scientific workflows.

data model overview

This is a description of the data model that I am trying to implement.

Module definition is a description of a processor that takes inputs and generates outputs.

Workflow definition is a description of a workflow that contains modules and connections between them through ports. In the case of VisTrails, it also contains the evolution of the workflow through a parent relation.

Execution log is the information about a workflow execution. It contains information about the processors that were executed and the data items that were created.

Pc model er.gif


primitives

The API will deal with the basic primitives describing workflow executions.


node types:

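<table border="1">
<tr><th>name</th><th>description</th></tr>
<tr><td>dataitem</td><td>a dataitem that is input/output to a module execution</td></tr>
<tr><td>module</td><td>the module/service that is to be executed</td></tr>
<tr><td>moduleInstance</td><td>the module as represented in a workflow</td></tr>
<tr><td>moduleExecution</td><td>the execution of a module</td></tr>
<tr><td>workflow</td><td>a description of a process containing modules and connections</td></tr>
<tr><td>workflowExecution</td><td>the representation of a workflow execution</td></tr>
<tr><td>inputPort</td><td>represents a specific port that can be assigned an input to a module execution</td></tr>
<tr><td>outputPort</td><td>represents a specific port that can contain a product of a module execution</td></tr>
<tr><td>connection</td><td>represents a connection between module instances</td></tr>
</table>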

Relations

The basic relations between the primitives are shown in the diagram above. However, few of the current provenance systems contain all of this information. Some may lack information about produced data items or omit the definition of a workflow. The goal must thus be to extract as much information as possible from each source.
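
One way to cope with sources that record only part of this information is sketched below in Python. This is only an illustration under stated assumptions: the class names `ProvenanceSource` and `LogOnlySource`, the method names, and the sample identifiers are hypothetical and do not come from any existing provenance system. Every relation defaults to "no information", and each concrete source overrides only the relations it can actually answer.

```python
class ProvenanceSource:
    """Hypothetical base adapter: every relation defaults to 'no information'."""

    def has_input(self, module_execution):
        return set()

    def has_output(self, module_execution):
        return set()

    def output_of(self, data_item):
        return set()

    def supports(self, relation):
        # A relation counts as supported only if the subclass overrides it.
        return getattr(type(self), relation) is not getattr(ProvenanceSource, relation)


class LogOnlySource(ProvenanceSource):
    """Hypothetical source that records the inputs of executions but not the outputs."""

    def __init__(self, inputs):
        self._inputs = inputs  # dict: execution id -> set of data item ids

    def has_input(self, module_execution):
        return set(self._inputs.get(module_execution, ()))


source = LogOnlySource({"exec1": {"d1", "d2"}})
```

A query layer could call `supports()` first and fall back to whatever relations are available, instead of failing when a source lacks, say, produced data items.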

The most common use of provenance data will be to compute the transitive closure of some connected executions or data items, i.e. to track data dependencies backward and forward in time. We call this upstream for tracking back in time and downstream for tracking forward in time.
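
The transitive closure can be sketched as a breadth-first traversal over a one-step relation. The following Python sketch assumes a small in-memory mapping; the names `closure`, `upstream`, `downstream` and the sample data items `d1` to `d3` are made up for illustration and are not part of the API described on this page.

```python
from collections import deque


def closure(start, step):
    """Return every node reachable from `start` by repeatedly applying `step`."""
    seen = set()
    queue = deque(step(start))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; this also guards against cycles
        seen.add(node)
        queue.extend(step(node))
    return seen


# Hypothetical one-step dependencies: data item -> items it was derived from.
derived_from = {"d3": {"d2"}, "d2": {"d1"}, "d1": set()}

# The downstream relation is the inverse of the upstream one.
derived_data = {}
for item, parents in derived_from.items():
    derived_data.setdefault(item, set())
    for parent in parents:
        derived_data.setdefault(parent, set()).add(item)


def upstream(item):
    """Track dependencies back in time."""
    return closure(item, lambda d: derived_from.get(d, set()))


def downstream(item):
    """Track dependencies forward in time."""
    return closure(item, lambda d: derived_data.get(d, set()))
```

Here `upstream("d3")` collects `d2` and `d1`, while `downstream("d1")` collects `d2` and `d3`. The same `closure` helper works for any primitive, since only the one-step function changes.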

We have identified four primitives for which upstream/downstream tracking is relevant:

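<table border="1">
<tr><th>primitive</th><th>description</th></tr>
<tr><td>dataitem</td><td>tracking data dependencies</td></tr>
<tr><td>moduleExecution</td><td>tracking execution dependencies</td></tr>
<tr><td>moduleInstance</td><td>tracking module dependencies within a workflow</td></tr>
<tr><td>workflow</td><td>tracking workflow design history e.g. different workflow versions in the VisTrails action tree</td></tr>
</table>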

transitive relations:

datatype          relation
--------------------------------

upstreams:

dataitem          derivedFrom           - .outputOf()[forall].hasInput()
moduleInstance    prevModuleInstance    - .hasInputPort()[forall].connectedTo().outputPortOf()
moduleExecution   prevModuleExecution   - .hasInput()[forall].outputOf()

downstreams:

dataitem          derivedData           - .inputOf()[forall].hasOutput()
moduleInstance    nextModuleInstance    - .hasOutputPort()[forall].connectedTo().inputPortOf()
moduleExecution   nextModuleExecution   - .hasOutput()[forall].inputOf()
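
Each derived relation above is a composition of two basic relations. The Python sketch below shows how the dataitem and moduleExecution rows could be assembled from a hypothetical execution log. The dictionaries `HAS_INPUT` and `HAS_OUTPUT`, the helpers `invert` and `compose`, and the identifiers `e1`, `e2`, `d1` to `d4` are assumptions made for this example only.

```python
# Hypothetical execution log: which data items each module execution read and wrote.
HAS_INPUT = {"e1": {"d1"}, "e2": {"d2", "d3"}}
HAS_OUTPUT = {"e1": {"d2", "d3"}, "e2": {"d4"}}


def invert(relation):
    """Build the inverse mapping, e.g. hasOutput -> outputOf."""
    inverse = {}
    for source, targets in relation.items():
        for target in targets:
            inverse.setdefault(target, set()).add(source)
    return inverse


OUTPUT_OF = invert(HAS_OUTPUT)  # data item -> executions that produced it
INPUT_OF = invert(HAS_INPUT)    # data item -> executions that consumed it


def compose(first, second):
    """Apply `first`, then apply `second` to every result ([forall]) and merge."""
    def step(node):
        result = set()
        for mid in first.get(node, ()):
            result |= second.get(mid, set())
        return result
    return step


derived_from = compose(OUTPUT_OF, HAS_INPUT)            # .outputOf()[forall].hasInput()
derived_data = compose(INPUT_OF, HAS_OUTPUT)            # .inputOf()[forall].hasOutput()
prev_module_execution = compose(HAS_INPUT, OUTPUT_OF)   # .hasInput()[forall].outputOf()
next_module_execution = compose(HAS_OUTPUT, INPUT_OF)   # .hasOutput()[forall].inputOf()
```

With this sample log, `derived_from("d4")` yields `d2` and `d3`, and `prev_module_execution("e2")` yields `e1`. The moduleInstance rows would follow the same pattern with a three-step composition over ports and connections.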


--Tommy 03:40, 13 April 2007 (MDT)--Tommy 09:05, 12 April 2007 (MDT)