Second provenance challenge

From VistrailsWiki
Revision as of 16:07, 19 June 2007 by Tommy (talk | contribs)
Jump to navigation Jump to search

Model Integration Results

We have successfully performed the queries using data from VisTrails, MyGrid and Southampton (TBD insert links). We have included our own system because our new query API is general and not native to VisTrails.

The VisTrails and MyGrid models were easy to use because of their simple data format, Southampton was hard to use because of their complex data format using many levels of nesting and abstractions. VisTrails required both the execution log and the workflow definition for the provenance queries whereas MyGrid and Southampton only needed the execution log.

The answers obtained varied depending which information you had access to. E.g. for VisTrails it was not possible to obtain intermediate data items because they are not recorded. In this case the answer was the module executions. The queries required the data to contain at least module executions, connections between them and required annotations. These were all present in the models except Southampton who lacked some annotations.

The concept of data item varies between systems. It can be represented as the data exchanged between modules, the inputs or outputs of a workflow or a file reference passed between modules. The concept of parameters, which are used in VisTrails to modify modules, does not exist in other models. MyGrid uses something similar to edit the parameters of modules (like setting file name to save to). This concept is not clearly defined. Southampton have the concept of assertion where every module/service records its own view of the process. This concept does not exist in the other systems and is not used in our provenance queries. But it might be important for validating results.

Other concepts like modules/connections/executions are the same although most of them have different names.

Our method consists of using wrappers to translates the queries between a common data model and the source data. We first define a high-level general model that captures the basic concepts of workflows and its executions. The model contains basic concepts making it possible to express queries over the different models. Secondly, we defined API functions for the wrappers that use this model. Thirdly, we implemented the wrappers and constructed the queries.

Pros/Cons VisTrails use a normalized data model and needs to use both execution log and workflow definition. MyGrids execution log can be used without using the workflow definition and contain derivation relationships between data items, this makes the data contain redundant information. Southampton is modeling some security features that may be useful but makes the data larger and more complex.

How to connect these systems? In this challenge the connections were done by hand because no common identifiers exist. There is a need for the data to support referencing other models. E.g. If a data item is stored externally and tracked through another provenance store. Common identifiers like LSID:s might be part of the solution. External data items should also be given a namespace to indicate where they came from.


Translation Details

A description of the system describing Model, API and Implementation?

any data which was absent from a downloaded model, and whether this affected the possibility of translation or successful provenance query, and any data which was excluded in translation from a downloaded model because it was extraneous_

Benchmarks

What should we put here? _Describe your proposed benchmark queries, how the comparable quantities are determined, and the results of applying the benchmark to your own system_

Further Comments

_Provide here further comments._

Conclusions

In the general case, tracking provenance through different systems is a data integration problem. But by defining a common model (SWPDM) on a restricted domain (Scientific Workflow) the difficulty is reduced to efficiency and entity resolution problems. We believe that it should be possible for the Scientific workflow community to support a model similar to the SWPDM to enable provenance to be tracked through their systems. We have showed that an API for querying this model can be built and its compatibility with three of the current systems.

Problems for discussion:

How to connect these systems? There is a need for the data to support referencing other models. E.g. If a data item is stored externally and tracked through another provenance store. Common identifiers like LSID:s might be part of the solution. External data items should also be given a namespace to indicate where they came from.

Data items, they are used in many layers and have different meanings.

How will a user be able to express these kind of queries.