Hadoop Package

From VistrailsWiki

This page describes how to use the hadoop package in VisTrails. This package works on Mac and Linux.

Installation

Mac

This binary version of vistrails has the hadoop package preinstalled:

http://vgc.poly.edu/files/tommy/vistrails-mac-10.6-master-2014-02-25.dmg

Linux

Install vistrails from source:

# get the vistrails source
git clone http://vistrails.org/git/vistrails.git
# fetch the remoteq package and copy it into the vistrails source tree
git clone https://github.com/rexissimus/BatchQ-PBS -b remoteq
cp -r BatchQ-PBS/remoteq vistrails/
# start vistrails
cd vistrails
python vistrails/run.py

The first time vistrails is started it will download and install all the dependencies.

Windows

The BatchQ library used by the RemoteQ package does not support Windows, but you should be able to run Linux in a virtual machine (for example VirtualBox, http://www.virtualbox.org) and install vistrails there.

Using the hadoop package

Machine

Represents a remote machine running SSH.

  • server - the server url
  • username - the remote server username, default is your local username
  • password - your password, connect a PasswordDialog to here
  • port - the remote ssh port, set to 0 if using an ssh tunnel

If you specify the machine information in the RemoteQ package configuration, you do not need to use a Machine module in your workflow:

  • server - the server url or alias
  • username - your username on the server; leave blank to use your local username
  • port - SSH port on the server; leave blank to use the default, or 0 if using an alias
  • password - whether you need to specify a password to log in (True or False)
  • defaultFS - the default filesystem path to use. Example: "s3n://<yourusername>/"
  • uris - a list of path#alias tuples separated by ";". Using <alias> in the workflow will replace it with <path> before the job is executed by hadoop (see the example below).
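
For example, a uris value that makes two hypothetical HDFS files available under short aliases (the paths are placeholders, not real files) could look like this:

hdfs:///user/<yourusername>/input.csv#input.csv;hdfs:///user/<yourusername>/lookup.txt#lookup.txt

In the workflow you would then refer to the files simply as input.csv and lookup.txt.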

HadoopStreaming

Runs a hadoop job on a remote cluster.

  • CacheArchive - Jar files to upload
  • CacheFiles - Other files to upload
  • Combiner - combiner file to use after mapper. Can be same as reducer.
  • Environment - Environment variables
  • Identifier - A unique string identifying each new job. The job files on the server will be called ~/.vistrails-hadoop/.batchq.%Identifier%.*
  • Input - The input file/directory to process
  • Mapper - The mapper program (required)
  • Output - The output directory name
  • Reducer - The reducer program (optional)
  • Workdir - The server workdir (Default is ~/.vistrails-hadoop)
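
With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value pairs to standard output. As a rough illustration (this script is not shipped with the package), a minimal word-count mapper in Python could look like this:

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch (illustration only):
# read text lines from stdin and emit one "word<TAB>1" pair per word on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Hadoop Streaming expects tab-separated key/value pairs
        sys.stdout.write("%s\t%d\n" % (word, 1))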

HDFSEnsureNew

Deletes a file/directory from remote HDFS storage.

HDFSGet

Retrieves a file/directory from remote HDFS storage. Used to get the results.

  • Local File - Destination file/directory
  • Remote Location - Source file/directory in HDFS storage

HDFSPut

Uploads a file/directory to remote HDFS storage. Used to upload mappers, reducers, and data files.

  • Local File - Source file/directory
  • Remote Location - Destination file/directory in HDFS storage

PythonSourceToFile

A PythonSource whose code is written to a file. Used to create mapper and reducer files.
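
As another illustrative sketch (not code from the package), a PythonSourceToFile used as the reducer for the word-count mapper above might contain:

#!/usr/bin/env python
# Minimal Hadoop Streaming reducer sketch (illustration only):
# input arrives sorted by key as "word<TAB>count" lines; sum the counts per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            sys.stdout.write("%s\t%d\n" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, current_count))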

URICreator

Creates links to locations in HDFS storage for input data and other files.

Deleting a job

To make sure a job can be re-executed from the beginning:

  • Clear the vistrails cache
  • Delete the job in the job monitor by selecting it and pressing "Del"

Using the cluster at gray02.poly.edu

If you are outside the Poly network you need to use an SSH tunnel to get through the firewall.

Add this to ~/.ssh/config:

# forward local port 8101 through vgchead to the SSH port on gray02
Host vgctunnel
HostName vgchead.poly.edu
LocalForward 8101 gray02.poly.edu:22

# "gray02" now connects to the local end of that tunnel
Host gray02
HostName localhost
Port 8101

Set up a tunnel to gray02 by running:

ssh vgctunnel

In vistrails, enter the machine info by going to Preferences->Module Packages, selecting RemoteQ and clicking "configure...". Enter this in the configuration:

  • server - gray02
  • username - <yourusername>
  • port - 0
  • password - True
  • defaultFS - hdfs://gray02.poly.edu:8020/user/<yourusername>/
  • uris - hdfs:///user/tommy/wikitext-big-notitle.csv#wikitext-big-notitle.csv

Using Amazon AWS

First do AWS_Setup.

AWS uses "*.pem" key files for access. Make sure you have one, then edit ~/.ssh/config and add

Host aws
HostName ec2-54-201-233-14.us-west-2.compute.amazonaws.com
IdentityFile ~/.ssh/<yourusername>.pem 

after replacing the host name and the path to your key file with your own. Enter the machine info by going to Preferences->Module Packages, selecting RemoteQ, clicking "configure...", and entering this in the configuration:

  • server - aws
  • username - hadoop
  • port - 0
  • password - False
  • defaultFS - s3n://<yourusername>/
  • uris - s3://cs9223/wikitext-big-notitle.csv#wikitext-big-notitle.csv

Change defaultFS to point to your own S3 bucket.

Example

Let's run a basic example with a mapper that returns info about the machine it was executed on.


Open hadoop-nodeinfo.vt and execute the workflow. The workflow will halt while waiting for the job to finish. Pressing cancel detaches the running job and adds it to the Job Monitor. The status of the execution can be checked by right-clicking HadoopStreaming and (for hadoop) selecting "View Standard error"; click "Update" to refresh the view. The job can be resumed by re-executing the workflow in vistrails. Once it completes, the spreadsheet will list info about the 20 lines processed by the mapper (usually all the same).
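
The mapper inside hadoop-nodeinfo.vt is not reproduced on this page, but a streaming mapper with the behavior described above (emitting, for each input line, information about the node it ran on) could look roughly like this:

#!/usr/bin/env python
# Sketch of a "node info" mapper (illustration only, not the actual mapper from hadoop-nodeinfo.vt):
# for every input line, emit the hostname and platform of the machine the mapper ran on.
import sys
import socket
import platform

host = socket.gethostname()
info = platform.platform()
for line in sys.stdin:
    sys.stdout.write("%s\t%s\n" % (host, info))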