JobSubmission

Introduction

This page describes and discusses the "job submission" effort in VisTrails, i.e. running jobs from VisTrails on remote servers and retrieving a job's results asynchronously in a later session.

Long-running jobs

VisTrails supports long-running jobs through the ModuleSuspended mechanism. A Module can suspend itself once a job is running (after submitting it on the first run, or after checking that it is not done yet on subsequent runs) by raising a ModuleSuspended exception, which carries the information that allows the JobMonitor to automatically check the status of the job in the background.
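
As a rough sketch of that pattern (the handle object, its finished()/get_results() methods and the ModuleSuspended keyword argument shown here are assumptions, not necessarily the exact VisTrails API), a module's compute() could look like this:

  # Conceptual sketch only; the helper methods and the handle API are hypothetical.
  from vistrails.core.modules.vistrails_module import Module, ModuleSuspended

  class SubmitRemoteJob(Module):
      """Hypothetical module that submits a remote job and suspends itself."""

      def find_existing_handle(self):
          # Placeholder: look up the handle of a previously submitted job,
          # or return None if this is the first run.
          return None

      def submit_job(self):
          # Placeholder: actually submit the job (e.g. over SSH) and return
          # a handle object that can report its status later.
          raise NotImplementedError

      def compute(self):
          handle = self.find_existing_handle()
          if handle is None:
              # First run: submit the job, then suspend this module.
              handle = self.submit_job()
              raise ModuleSuspended(self, "job submitted", handle=handle)
          if not handle.finished():
              # Later run: the job is still in progress; suspend again so
              # the JobMonitor keeps polling it in the background.
              raise ModuleSuspended(self, "job still running", handle=handle)
          # The job is done: set outputs so downstream modules can execute.
          self.set_output("result", handle.get_results())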

The JobMonitor then notifies the user when a known running job has completed, so that the user can re-run the workflow, which should now be able to get past the previously suspended module.

In addition, the JobMonitor serializes this information so that VisTrails knows to check for these jobs if you restart it later on (or on a different machine, if this is written to the vistrail).
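
Conceptually, the serialized information is enough to rebuild a handle and poll it in a later session, along these lines (all names here are hypothetical, not the actual JobMonitor code):

  import json

  def check_saved_jobs(jobs_file, rebuild_handle):
      # Hypothetical sketch: reload serialized job records and poll each one.
      # In VisTrails the records live with the vistrail, not in a separate file.
      with open(jobs_file) as f:
          records = json.load(f)
      # Return the records whose jobs have finished; the user can now re-run
      # the corresponding workflows to collect the results.
      return [r for r in records if rebuild_handle(r).finished()]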

This is a very abstract, high-level interface for background jobs, on which any job-submission mechanism can be built.

Remote job packages

Running jobs on a remote machine can be done with ad-hoc packages that use the ModuleSuspended mechanism.

RemoteQ

This is the only such package right now.

The major issue is that the underlying library used by the VisTrails package, RemoteQ, was put together quickly by Tommy from BatchQ. It is unsupported, untested, completely specific to us (there has to be a more widely-used tool for this...), and very broken (it can't even be installed). The rest of this section is about the design of the package, not the code issues; those still need addressing if we are to keep this.

The package allows a user to run commands on a server over SSH, via modules such as Machine, CopyFile, RunCommand, and RunPBSScript.

  • [RR] The problem here is that filenames need to be explicit in the workflow. There is no job isolation. This means that there are side-effects, which should NOT be permitted in data flows, and ARE going to break things (especially if the vistrail gets shared). Files and jobs should be associated with the job "signature" so that running one version doesn't corrupt the results of another (NECESSARY for sharing job-submitting vistrails!) and so that running a pipeline gets the output of the correct job; see the sketch below.
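
One way to get that isolation (purely a sketch; "signature" stands for whatever hash VisTrails computes for the upstream subpipeline) would be to derive remote paths from the job signature instead of taking them verbatim from workflow parameters:

  import posixpath

  def job_scoped_path(base_dir, signature, filename):
      # Sketch: keep every remote file under a directory named after the
      # subpipeline signature, so two versions of a workflow never collide.
      return posixpath.join(base_dir, signature, filename)

  # job_scoped_path("/scratch/vistrails", "3f2a9c", "output.dat")
  #   -> "/scratch/vistrails/3f2a9c/output.dat"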

Running jobs are associated with a subpipeline signature (previously stored in JSON, now in an annotation; FIXME: still in JSON in the XML!)

  • [RR] JSON in XML is gross
  • [RR] Should probably have one annotation per job
  • [RR] What are these 'workflow' and 'id' UUIDs?

Jobs have to contain the subpipeline signature so they can be matched with subsequent invocations, but also the workflow/version so that the job can be checked or resumed from the JobMonitor. A job also has to serialize whatever information it needs to resume, e.g. the output filename if it gets that from a parameter. The objective is to resume without running the upstream (since re-running the upstream might yield different results and thus require a different job).
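
For concreteness, a per-job record of the kind described above might look roughly like this (field names are illustrative, not the actual serialization format):

  {
    "signature": "subpipeline signature hash",
    "workflow": "UUID identifying the vistrail/workflow",
    "version": 42,
    "resume_info": {"output_filename": "/remote/path/output.dat"}
  }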