When dealing with large datasets, leveraging multiple cores or machines can reduce workflow execution time. VisTrails supports parallelizing tasks through IPython.
The Parallel Flow package uses the standard IPython.parallel machinery to execute jobs. You will need to create and configure the IPython profile that you want VisTrails to use.
VisTrails can run the ipcontroller and ipengine commands for you to start a controller or a set of engines locally. For more complex setups, you can run ipcluster yourself from a terminal with the necessary options.
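As an illustration of starting a cluster yourself, the following minimal sketch builds an ipcluster command line from Python. The profile name "vistrails-demo", the engine count, and the helper names are hypothetical; the sketch assumes the profile already exists and that ipcluster is on your PATH.

```python
import subprocess


def ipcluster_start_command(profile, n_engines):
    """Build the argument list for 'ipcluster start' (hypothetical helper)."""
    return ["ipcluster", "start", "--profile=" + profile, "-n", str(n_engines)]


def start_cluster(profile, n_engines):
    """Launch a controller and engines in the background.

    Assumes ipcluster is installed; returns the Popen handle so the
    caller can terminate the cluster later.
    """
    return subprocess.Popen(ipcluster_start_command(profile, n_engines))


if __name__ == "__main__":
    # Only print the command here; call start_cluster() to actually run it.
    print(" ".join(ipcluster_start_command("vistrails-demo", 4)))
```

Running the same command directly in a terminal works equally well; the wrapper is only convenient when the cluster has to be started from a script.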
In the Packages/Parallel Flow menu, you will find the following options:
Note that when VisTrails exits, it will shut down the engines that it started. If it started the controller, the controller will also be shut down, along with every engine that might have connected to it from other machines. To prevent that, use the ‘cleanup’ button and choose not to stop them; they will detach from VisTrails and won’t be killed automatically. You will still be able to use the ‘cluster shutdown’ button explicitly.
Map allows you to execute a single module or a Group in parallel using input values taken from a list. It works in exactly the same way as the regular Map module from the Control Flow package.
Unlike the standard Map module, the elements of the list are submitted to the IPython controller, which executes them in a load-balanced manner on the engines currently connected.
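The idea of a load-balanced map can be sketched in plain Python. The example below is not the VisTrails implementation: it swaps in a standard-library thread pool as a stand-in for the IPython engines so that it runs without a cluster, and the function names are hypothetical. With a running cluster, the roughly analogous IPython.parallel calls would be Client(profile=...).load_balanced_view().map_sync(func, items).

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Stand-in for the module or Group that Map runs on each element."""
    return x * x


def load_balanced_map(func, items, n_workers=4):
    """Apply func to every item with a pool of workers.

    Each element is submitted as its own task and an idle worker picks
    up the next pending one, which approximates load balancing.
    Results are returned in the order of the input list.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(func, items))


if __name__ == "__main__":
    print(load_balanced_map(square, [1, 2, 3, 4, 5]))  # prints [1, 4, 9, 16, 25]
```

As with the regular Map module, the output list lines up element by element with the input list, regardless of which worker handled which element.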