Difference between revisions of "UsersGuideVisTrailsPackages"

From VistrailsWiki

Revision as of 15:24, 31 March 2011

Introduction

VisTrails provides a plugin infrastructure to integrate user-defined functions and libraries. Specifically, users can incorporate their own visualization and simulation codes into pipelines by defining custom modules (or wrappers). These modules are bundled in what we call packages. A VisTrails package is simply a collection of Python classes, created by the user and following a certain convention, where each class represents a new module. Here is a simplified example of a user-defined module:

class Divide(Module):
    def compute(self):
        arg1 = self.getInputFromPort("arg1")
        arg2 = self.getInputFromPort("arg2")
        if arg2 == 0.0:
            raise ModuleError(self, "Division by zero")
        self.setResult("result", arg1 / arg2)

registry.addModule(Divide)
registry.addInputPort(Divide, "arg1", (basic.Float, 'dividend'))
registry.addInputPort(Divide, "arg2", (basic.Float, 'divisor'))
registry.addOutputPort(Divide, "result", (basic.Float, 'quotient'))

New VisTrails modules must subclass Module, the base class that defines basic functionality. The only required override is the compute() method, which performs the actual module computation. Inputs and outputs are specified through ports, which currently have to be explicitly registered with VisTrails. This is straightforward, and is done through method calls to the module registry. A complete documented example of a (slightly) more complicated module is available here.
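To try out the logic of a compute() method without starting VisTrails, one option is to run it against small stand-in classes. The sketch below is only an illustration: the Module and ModuleError classes defined here are hypothetical stubs that mimic the three calls used by Divide, not the real VisTrails classes, and run_divide is a helper invented for this example.

```python
class ModuleError(Exception):
    """Stand-in for VisTrails' ModuleError (hypothetical, for local testing only)."""
    def __init__(self, module, msg):
        Exception.__init__(self, msg)
        self.module = module


class Module(object):
    """Stand-in for the VisTrails Module base class: port values live in dicts."""
    def __init__(self, **inputs):
        self.inputs = dict(inputs)
        self.outputs = {}

    def getInputFromPort(self, name):
        return self.inputs[name]

    def setResult(self, name, value):
        self.outputs[name] = value


class Divide(Module):
    # Same body as the Divide module shown above.
    def compute(self):
        arg1 = self.getInputFromPort("arg1")
        arg2 = self.getInputFromPort("arg2")
        if arg2 == 0.0:
            raise ModuleError(self, "Division by zero")
        self.setResult("result", arg1 / arg2)


def run_divide(arg1, arg2):
    """Run Divide.compute() once and return the value set on the 'result' port."""
    m = Divide(arg1=arg1, arg2=arg2)
    m.compute()
    return m.outputs["result"]
```

Calling run_divide(6.0, 3.0) exercises the normal path, while run_divide(1.0, 0.0) raises the stub ModuleError.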

Dealing with command line tools and side effects

In an ideal world a module's outputs should be completely determined by its inputs. This is important for provenance purposes - if modules have implicit dependencies, it is not possible to be certain that the same results will be generated when the process is reexecuted.

However, it is clear that certain modules are inherently side-effectful (reading/writing files, network, etc). For the common case of temporary files, VisTrails provides a convenience layer that removes part of the burden of managing temporary files. As an illustrative example, consider one of the packages we make available for image conversion, using the ImageMagick suite:

class Convert(ImageMagick):
    """Convert is the base Module for VisTrails Modules in the ImageMagick
package that deal with operations on images. Convert is a bit of a misnomer since
the 'convert' tool does more than simply file format conversion. Each subclass
has a descriptive name of the operation it implements."""

    def create_output_file(self):
        """Creates a File with the output format given by the
outputFormat port."""
        if self.hasInputFromPort('outputFormat'):
            s = '.' + self.getInputFromPort('outputFormat')
            return self.interpreter.filePool.create_file(suffix=s)

    def geometry_description(self):
        """returns a string with the description of the geometry as
indicated by the appropriate ports (geometry or width and height)"""
        # if complete geometry is available, ignore rest
        if self.hasInputFromPort("geometry"):
            return self.getInputFromPort("geometry")
        elif self.hasInputFromPort("width"):
            w = self.getInputFromPort("width")
            h = self.getInputFromPort("height")
            return "'%sx%s'" % (w, h)
        else:
            raise ModuleError(self, "Needs geometry or width/height")

    def run(self, *args):
        """run(*args), runs ImageMagick's 'convert' on a shell, passing all
arguments to the program."""        
        cmdline = ("convert" + (" %s" * len(args))) % args
        if not self.__quiet:
            print cmdline
        r = os.system(cmdline)
        if r != 0:
            raise ModuleError(self, "system call failed: '%s'" % cmdline)

    def compute(self):
        o = self.create_output_file()
        i = self.input_file_description()
        self.run(i, o.name)
        self.setResult("output", o)

(...)

    reg.addModule(Convert)
    reg.addInputPort(Convert, "geometry", (basic.String, 'ImageMagick geometry'))
    reg.addInputPort(Convert, "width", (basic.String, 'width of the geometry for operation'))
    reg.addInputPort(Convert, "height", (basic.String, 'height of the geometry for operation'))
    reg.addOutputPort(Convert, "output", (basic.File, 'the output file'))

This example introduces several new VisTrails features. The last line of the snippet registers an output port that provides a file. A file output immediately presents several problems when a pipeline is to be shared among users in heterogeneous environments. For example, where should a file be written to? For temporary files, VisTrails provides a file pool class that manages temporary files and their lifetimes automatically, so that users don't have to worry about deleting them post-execution. To create a temporary file, a user calls, for example

fileObj = self.interpreter.filePool.create_file(suffix=".png")

fileObj will then contain a module that represents a file. The file pool simply creates a temporary file with write permissions, whose local name is available, in this case, as fileObj.name. The package developer is then free to use this file for any purpose.
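The idea behind a file pool can be sketched in a few lines of plain Python. This is not the VisTrails implementation: TempFilePool and its cleanup method are hypothetical names chosen for illustration, and the sketch returns a path string rather than a file module.

```python
import os
import tempfile


class TempFilePool(object):
    """Hypothetical file pool: hands out temporary files and deletes them on cleanup()."""
    def __init__(self):
        self._paths = []

    def create_file(self, suffix=""):
        # mkstemp creates the file on disk and returns an open descriptor plus its path.
        fd, path = tempfile.mkstemp(suffix=suffix)
        os.close(fd)
        self._paths.append(path)
        return path

    def cleanup(self):
        # Remove every file this pool created that still exists.
        for path in self._paths:
            if os.path.exists(path):
                os.remove(path)
        self._paths = []
```

Because the pool remembers every path it created, the code that uses the files never has to delete them itself.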

Another feature of this example is the use of command line tools. Notice that Python provides a very convenient way to execute commands through a shell. In this case, we use os.system on a command-line that executes the appropriate program.
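Since os.system passes a single string through the shell, file names containing spaces or shell metacharacters can break the command. A possible alternative, shown here only as a sketch and not as what the ImageMagick package does, is to pass the arguments as a list to subprocess.call. The helper name run_tool is an assumption made for this example.

```python
import subprocess
import sys


def run_tool(program, *args):
    """Run program with args (no shell involved); raise if the exit status is non-zero."""
    cmd = [program] + [str(a) for a in args]
    status = subprocess.call(cmd)
    if status != 0:
        raise RuntimeError("command failed with status %d: %r" % (status, cmd))
    return status


# Example: use the running Python interpreter as a harmless stand-in for 'convert'.
run_tool(sys.executable, "-c", "pass")
```

Inside a module, the RuntimeError would be replaced by a ModuleError so that VisTrails can report the failure on the right module.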

Interaction with Caching

VisTrails provides a caching mechanism, in which portions of pipelines that are common across different executions are automatically shared. However, some modules are intrinsically side-effectful (writing a report to stdout, or a file to disk, or creating a user interface widget), and should not be shared. Caching control is therefore up to the package developer. Caching is enabled by default, so a developer who doesn't want caching to apply must make small changes to the module. A convenient way to disable caching entirely is to use multiple inheritance and derive from a mixin class provided by VisTrails. For example, look at the StandardOutput module:

from core.modules.vistrails_module import Module, newModule, \
    NotCacheable, ModuleError
(...)
class StandardOutput(NotCacheable, Module):
    """StandardOutput is a VisTrails Module that simply prints the
    value connected on its port to standard output. It is intended
    mostly as a debugging device."""
    
    def compute(self):
        v = self.getInputFromPort("value")
        print v

By subclassing from NotCacheable as well as from Module (or one of its subclasses), VisTrails automatically will not cache this module, or anything downstream from it.

VisTrails also allows a more sophisticated decision on whether to use caching or not. To do that, a user simply overrides the method is_cacheable to return the correct value. This allows context-dependent decisions. For example, in the teem package, there's a module that generates a scalar field with random numbers. This is non-deterministic, so it shouldn't be cached. However, this module only generates non-deterministic values on special occasions, depending on its input port values. To stay efficient when caching is possible, while still maintaining correctness, that module implements the following override:

class Unu1op(Unu):
(...)
    def is_cacheable(self):
        return self.getInputFromPort('op') not in ['rand', 'nrand']
(...)

Notice that the module explicitly uses inputs to decide whether it should be cached. This allows reasonably fine-grained control over the process.
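The rule that a non-cacheable module also prevents caching of everything downstream from it can be pictured as a reachability computation on the pipeline graph. The following sketch only illustrates that rule: the function name, the dictionary representation of the pipeline, and the module names are all hypothetical, and the VisTrails interpreter may well compute this differently.

```python
def non_cacheable_closure(edges, not_cacheable):
    """Return the non-cacheable modules plus everything reachable downstream of them.

    edges: dict mapping a module name to the list of modules it feeds.
    not_cacheable: iterable of module names that opt out of caching.
    """
    result = set()
    stack = list(not_cacheable)
    while stack:
        name = stack.pop()
        if name in result:
            continue
        result.add(name)
        # Everything fed by a non-cacheable module is non-cacheable too.
        stack.extend(edges.get(name, []))
    return result


# Hypothetical pipeline: Reader -> Filter -> Output, and Reader -> Stats
pipeline = {"Reader": ["Filter", "Stats"], "Filter": ["Output"]}
tainted = non_cacheable_closure(pipeline, ["Filter"])
```

Here Filter and Output must be recomputed on every execution, while Reader and Stats can still be served from the cache.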

Interaction with Other Packages

When developing more complicated packages, it becomes natural to split code among different VisTrails packages, and have one logically depend on the other. For example, in one package (say, named 'package_base'), a user might define

class PackageBaseModule(Module):
 ...
def initialize():
 registry.addModule(PackageBaseModule)
 ...

And then, in another package (say, 'package_derived'),

class DerivedModule(PackageBaseModule):
 ...
def initialize():
 registry.addModule(DerivedModule)
 ...

Because of the way packages are loaded, package_derived cannot be initialized before package_base. VisTrails provides a mechanism for specifying interpackage dependencies. Every VisTrails package can provide a list of necessary installed packages. This is done by providing a callable in the package under the name package_dependencies. For example, here's how the VTK VisTrails package declares dependencies:

def package_dependencies():
    import core.packagemanager
    manager = core.packagemanager.get_package_manager()
    if manager.has_package('spreadsheet'):
        return ['spreadsheet']
    else:
        return []

The callable must return a list of strings, representing the names of the packages it depends on. We also use this example to introduce the package manager API, which is useful here for inspecting packages present in the system. Notice that the dependencies are not static: vtk depends on spreadsheet if and only if spreadsheet is present in the system. Otherwise, it has no dependencies.

Note: Circular dependencies are not allowed. They will be detected by VisTrails and an error will be signalled.

Note: Currently, package names are reasonably brittle, in the sense that conflicts in package naming might become an issue. We are in the process of designing an API that will allow more robust naming schemes.
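To see why dependencies determine the initialization order, and why circular dependencies cannot be satisfied, consider this small depth-first ordering sketch. It is not the VisTrails package manager: load_order and the dictionary format are assumptions made for illustration.

```python
def load_order(dependencies):
    """Return package names ordered so that each comes after its dependencies.

    dependencies: dict mapping a package name to the list of names it needs.
    Raises ValueError if the dependencies contain a cycle.
    """
    order = []
    state = {}  # name -> "visiting" or "done"

    def visit(name, path):
        if state.get(name) == "done":
            return
        if state.get(name) == "visiting":
            # We came back to a package that is still being processed: a cycle.
            raise ValueError("circular dependency: " + " -> ".join(path + [name]))
        state[name] = "visiting"
        for dep in dependencies.get(name, []):
            visit(dep, path + [name])
        state[name] = "done"
        order.append(name)

    for name in sorted(dependencies):
        visit(name, [])
    return order
```

With {"package_derived": ["package_base"], "package_base": []} as input, package_base is placed before package_derived; two packages that list each other raise the ValueError.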

User-defined module shapes and colors

VisTrails allows users to define custom colors and shapes for modules. This must be done at module registration time, by passing special parameters to addModule. For example:

reg.addModule(Afront,
              moduleColor=(1.0,0.0,0.0),
              moduleFringe=[(0.0, 0.0),
                            (0.2, 0.0),
                            (0.2, 0.4),
                            (0.0, 0.4),
                            (0.0, 1.0)])

gives this result:

PackageCustomColorShape1.png

This piece of code

reg = core.modules.module_registry
reg.addModule(Afront,
              moduleColor=(0.4,0.6,0.8),
              moduleFringe=[(0.0, 0.0),
                            (0.2, 0.0),
                            (0.0, 0.2),
                            (0.2, 0.4),
                            (0.0, 0.6),
                            (0.2, 0.8),
                            (0.0, 1.0)])

gives this result:

PackageCustomColorShape2.png

The moduleColor parameter must be a tuple of three floats between 0 and 1 that specify RGB colors for the module background, while moduleFringe is a list of pairs of floats that specify points as they go around a side of the module. The same list is used to go from the top-right corner to the bottom-right corner, and from the bottom-left corner to the top-left one. If this is not enough, let the developers know!

Alternatively, you can use different fringes for the left and right borders:

   reg.addModule(Afront,
                 moduleColor=(1.0,0.8,0.6),
                 moduleLeftFringe=[(0.0, 0.0),
                                   (-0.2, 0.0),
                                   (0.0, 1.0)],
                 moduleRightFringe=[(0.0, 0.0),
                                    (0.2, 1.0),
                                    (0.0, 1.0)])

which gives this:

PackageCustomColorShape3.png
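Before registering a module, it can help to check the color and fringe values against the constraints described in this section. The helper below is a hypothetical convenience, not part of VisTrails. It checks only what the text states (three color components between 0 and 1, fringe points given as pairs) and deliberately places no range limit on fringe coordinates, since the left-fringe example uses a negative value.

```python
def check_module_appearance(color, fringe):
    """Validate a color tuple and a fringe list against the constraints in the text."""
    if len(color) != 3:
        raise ValueError("color must have exactly three components (R, G, B)")
    for c in color:
        if not 0.0 <= c <= 1.0:
            raise ValueError("color component out of range [0, 1]: %r" % (c,))
    for point in fringe:
        if len(point) != 2:
            raise ValueError("fringe points must be (x, y) pairs: %r" % (point,))
    return True
```

The same function can be called once for moduleLeftFringe and once for moduleRightFringe when separate fringes are used.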

How to make your package reloadable

You need to move almost everything in the __init__.py file to a new file "init.py", but keep the identifier, name, version, configuration, and package_dependencies fields/methods in the __init__.py file. Specifically, make sure that imports (excluding things like core.configuration) and the initialize method are in the init.py file. For example, take a look at the __init__.py file of the pylab package included in VisTrails:

identifier = 'edu.utah.sci.vistrails.matplotlib'
name = 'matplotlib'
version = '0.9.0'
def package_dependencies():
    import core.packagemanager
    manager = core.packagemanager.get_package_manager()
    if manager.has_package('edu.utah.sci.vistrails.spreadsheet'):
        return ['edu.utah.sci.vistrails.spreadsheet']
    else:
        return []

def package_requirements():
    import core.requirements
    if not core.requirements.python_module_exists('matplotlib'):
        raise core.requirements.MissingRequirement('matplotlib')
    if not core.requirements.python_module_exists('pylab'):
        raise core.requirements.MissingRequirement('pylab')

And the init.py contains the other imports, class definitions and the initialize method.

Adding default values and/or labels for parameters

In versions 1.4 and greater, package developers can add labels and default values for parameters. To add this functionality, you need to use the defaults and labels keyword arguments and pass the values as strings. For example,

 class TestDefaults(Module):
     _input_ports = [("f1", "(edu.utah.sci.vistrails.basic:Float)",
                      {"defaults": str([1.23]), "labels": str(["temp"])})]
 _modules = [TestDefaults]

or in the older syntax,

 def initialize():
     reg = core.modules.module_registry.get_module_registry()
     reg.add_module(TestDefaults2)
     reg.add_input_port(TestDefaults2, "f2", [Float, String], 
                        defaults=str([4.56, "abc"]), 
                        labels=str(["temp", "name"]))
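To make the "pass the values as strings" requirement concrete, the snippet below shows what str() produces for the lists used above, and how such a string can be turned back into a list with ast.literal_eval. Whether VisTrails itself parses the strings this way is an assumption; the round trip is only meant to show that no information is lost.

```python
import ast

defaults = str([4.56, "abc"])
labels = str(["temp", "name"])

# The values are now plain strings, as the keyword arguments expect.
assert isinstance(defaults, str)
assert isinstance(labels, str)

# ast.literal_eval safely rebuilds the original Python lists from the strings.
recovered_defaults = ast.literal_eval(defaults)
recovered_labels = ast.literal_eval(labels)
```

Note that each list has one entry per parameter of the port, so a port of type [Float, String] takes two defaults and two labels.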

Packages that generate modules dynamically

When wrapping existing libraries or trying to generate modules in a more procedural manner, it is useful to generate modules dynamically. In our work, we have created some shortcuts to make this easier. In addition, the list of modules can be based on the package configuration. Here is some example code:

__init__.py

 from core.configuration import ConfigurationObject
 
 identifier = "edu.utah.sci.dakoop.auto_example"
 version = "0.0.1"
 name = "AutoExample"
 
 configuration = ConfigurationObject(use_b=True)

init.py

The expand_ports and build_modules functions are helpers that make constructing the modules easier. The key parts are the new_module call and setting the _modules variable.

 from core.modules.vistrails_module import new_module, Module
 
 identifier = "edu.utah.sci.dakoop.auto_example"
 
 def expand_ports(port_list):
     new_port_list = []
     for port in port_list:
         port_spec = port[1]
         if type(port_spec) == str: # or unicode...
             if port_spec.startswith('('):
                 port_spec = port_spec[1:]
             if port_spec.endswith(')'):
                 port_spec = port_spec[:-1]
             new_spec_list = []
             for spec in port_spec.split(','):
                 spec = spec.strip()
                 parts = spec.split(':', 1)
                 print 'parts:', parts
                 namespace = None
                 if len(parts) > 1:
                     mod_parts = parts[1].rsplit('|', 1)
                     if len(mod_parts) > 1:
                         namespace, module_name = mod_parts
                     else:
                         module_name = parts[1]
                     if len(parts[0].split('.')) == 1:
                         id_str = 'edu.utah.sci.vistrails.' + parts[0]
                     else:
                         id_str = parts[0]
                 else:
                     mod_parts = spec.rsplit('|', 1)
                     if len(mod_parts) > 1:
                         namespace, module_name = mod_parts
                     else:
                         module_name = spec
                     id_str = identifier
                 if namespace:
                     new_spec_list.append(id_str + ':' + module_name + ':' + \
                                              namespace)
                 else:
                     new_spec_list.append(id_str + ':' + module_name)
             port_spec = '(' + ','.join(new_spec_list) + ')'
         new_port_list.append((port[0], port_spec) + port[2:])
     print new_port_list
     return new_port_list
 
 def build_modules(module_descs):
     new_classes = {}
     for m_name, m_dict in module_descs:
         m_doc = m_dict.get("_doc", None)
         m_inputs = m_dict.get("_inputs", [])
         m_outputs = m_dict.get("_outputs", [])
         if "_inputs" in m_dict:
             del m_dict["_inputs"]
         if "_outputs" in m_dict:
             del m_dict["_outputs"]
         if "_doc" in m_dict:
             del m_dict["_doc"]
         klass_dict = {}
         if "_compute" in m_dict:
             klass_dict["compute"] = m_dict["_compute"]
             del m_dict["_compute"]
         m_class = new_module(Module, m_name, klass_dict, m_doc)
         m_class._input_ports = expand_ports(m_inputs)
         m_class._output_ports = expand_ports(m_outputs)
         new_classes[m_name] = (m_class, m_dict)
     return new_classes.values()
 
 def initialize():
     global _modules
     def a_compute(self):
         a = self.getInputFromPort("a")
         i = 0
         if self.hasInputFromPort("i"):
             i = self.getInputFromPort("i")
         if a == "abc":
             i += 100
         self.setResult("b", i)
 
     module_descs = [("ModuleA", {"_inputs": [("a", "basic:String")],
                                  "_outputs": [("b", "basic:Integer")],
                                  "_doc": "ModuleA documentation",
                                  "_compute": a_compute,
                                  "namespace": "Test"}),
                     ("ModuleB", {"_inputs": [("a", "Test|ModuleA")],
                                  "_outputs": [("b", "Test|ModuleA")],
                                  "_doc": "ModuleB documentation"})
                     ]
 
     if configuration.use_b:
         _modules = build_modules(module_descs)
     else:
         _modules = build_modules(module_descs[:1])
 
 _modules = []
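The internals of new_module are not shown on this page. As a rough illustration of the general Python technique for creating classes at run time, the sketch below uses the built-in type(). The names make_class and Base are hypothetical, and Base merely stands in for the VisTrails Module class.

```python
def make_class(name, base, compute=None, doc=None):
    """Build a subclass of base at run time using the built-in type()."""
    attrs = {"__doc__": doc}
    if compute is not None:
        # Attach the function as the class's compute method.
        attrs["compute"] = compute
    return type(name, (base,), attrs)


class Base(object):
    """Hypothetical stand-in for the VisTrails Module base class."""
    def compute(self):
        return None


def a_compute(self):
    return 100


ModuleA = make_class("ModuleA", Base, compute=a_compute, doc="ModuleA documentation")
ModuleB = make_class("ModuleB", Base, doc="ModuleB documentation")
```

ModuleA gets its own compute method, while ModuleB inherits the one from Base, which mirrors how build_modules only adds compute when a "_compute" entry is present.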

Help! This documentation wasn't good enough!

Sorry, it's our fault! If you need help, join the vistrails-users list and post your question there.