NYU HPC Access Instructions

Revision as of 16:40, 6 January 2016

Accessing the NYU HPC Cluster

1. Log into the main HPC node:

      ssh <netid>@hpc.nyu.edu    

2. From the HPC node, log into the Hadoop cluster:

      ssh dumbo

You will be using a set of commands repeatedly, and creating aliases for them first will save you time. Once on "dumbo", run the following commands in your terminal:

bash

alias hfs='/usr/bin/hadoop fs '

export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars

export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar

alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'


To be able to reuse these aliases every time you log in to dumbo, append the following lines to the end of your .bashrc file:

alias hfs='/usr/bin/hadoop fs '

export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars

export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar

alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'

%% Note: you should not have any spaces around "="!


If bash is your default shell, run

     source ~/.bashrc

This will create the aliases. If tcsh is your default shell, just invoke bash; it automatically reads the .bashrc file and creates the aliases.


Here are some common commands:

hfs  %% See available commands.

hfs -help  %% More command details.

hfs -ls [<path>]  %% List files.

hfs -cp <src> <dst>  %% Copy files within HDFS.

hfs -mkdir <path> %% Create a directory.

hfs -rm <path> %% Remove a file.

hfs -chmod <mode> <path> %% Modify permissions.

hfs -chown <owner> <path> %% Modify owner.

Some remote access commands:

hfs -cat <src>  %% Print file contents to stdout.

hfs -copyFromLocal <localsrc> <dst> %% Copy a local file to HDFS.

hfs -copyToLocal <src> <localdst> %% Copy a file from HDFS to the local filesystem.


Using Hadoop Streaming

  • Hadoop Streaming allows the use of any program, written in any language, for MapReduce operations.
  • You can use the "hjs" alias you created to run Hadoop Streaming
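
As a rough illustration of what Hadoop Streaming expects from a program, the sketch below shows word-count mapper and reducer logic in Python. It is not the pmap.py/pred.py from the example; the function names (map_lines, reduce_pairs, run_local) are hypothetical. A streaming mapper reads lines from stdin and writes tab-separated key/value lines to stdout, and Hadoop sorts those lines by key before the reducer reads them. run_local imitates that sort so the logic can be tried without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reduce_pairs(sorted_records):
    """Reducer logic: sum the counts of consecutive records sharing a key.

    Relies on the input being sorted by key, which Hadoop guarantees
    between the map and reduce phases.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def run_local(lines):
    """Imitate map -> sort -> reduce on one machine, for testing only."""
    return list(reduce_pairs(sorted(map_lines(lines))))
```

In an actual streaming job, the mapper script would loop over map_lines(sys.stdin) and print each record, and the reducer script would do the same with reduce_pairs(sys.stdin).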

To run the example I provided, do the following:

1) Copy the directory containing the Python files and input data to dumbo. You will first need to "scp" from your machine to the hpc node, and then from the hpc node to dumbo. Assuming the directory on your machine is called /Users/julianafreire/MRExample:

      scp -r /Users/julianafreire/MRExample  your_netid@hpc.nyu.edu: 

Then, from the hpc node:

      scp -r MRExample  dumbo:
    • Remember to replace your_netid with your actual netid!

2) From dumbo, you will now copy the data file to HDFS

      hfs -copyFromLocal /home/your_netid/MRExample/wikipedia.txt wikipedia.txt

3) Check if the file is on HDFS

     hfs -ls

4) Now, to run the job, make sure you are in the right directory

    cd /home/your_netid/MRExample
    hjs -file pmap.py  -mapper pmap.py   -file pred.py -reducer pred.py   -input /user/your_netid/wikipedia.txt -output /user/your_netid/wikipedia.output

5) The outputs of this job are now in HDFS, in the directory /user/your_netid/wikipedia.output. To list the output files:

    hfs -ls /user/your_netid/wikipedia.output

You can also inspect the content of the files:

   hfs -cat wikipedia.output/*

If you'd like to copy the files over to your local directory:

   hfs -get /user/your_netid/wikipedia.output  output

This will copy the outputs to the local directory "output" on dumbo.

Using Spark

  • Spark allows you to write and run applications quickly in Java, Scala, Python and R
  • You can use either the Spark interactive shell or the Spark submission tool (spark-submit)

To run the Spark interactive shell (Scala or Python):

1) Log in to dumbo

2) Execute one of the following:

       spark-shell (to run applications in Scala)

       pyspark (to run applications in Python)

To access your files stored on HDFS from Spark, use a URL of the following form as the filename:

       hdfs://babar.es.its.nyu.edu:8020/user/<your_net_id>/<your_files>

The hdfs:// URL must be exactly right; otherwise you will not be able to access your files on HDFS.
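
Because a mistyped URL is an easy mistake to make, one option is to build it with a small helper. The sketch below assumes the namenode address given above; the names HDFS_NAMENODE, hdfs_url and count_lines are hypothetical and not part of Spark. count_lines expects the SparkContext `sc` that the pyspark shell creates for you.

```python
# Namenode address as given in the instructions above.
HDFS_NAMENODE = "hdfs://babar.es.its.nyu.edu:8020"


def hdfs_url(net_id, relative_path):
    """Build the full HDFS URL for a file under /user/<net_id>/."""
    return "%s/user/%s/%s" % (HDFS_NAMENODE, net_id, relative_path.lstrip("/"))


def count_lines(sc, url):
    """Count the lines of a text file, given an existing SparkContext."""
    return sc.textFile(url).count()
```

Inside the pyspark shell, this could be used as count_lines(sc, hdfs_url("your_netid", "wikipedia.txt")).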

To submit a job to Spark:

1) Log in to dumbo

2) Execute

       spark-submit --num-executors <10-100> <your_python_script> <your_script_arguments>

The dumbo cluster has 100 executors. You may choose any number of executors for your submission; more executors generally make a job run faster. However, if many people submit Spark jobs at the same time, performance will degrade.
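
For orientation, here is a minimal sketch of a Python script that could be passed to spark-submit. It is an illustrative word count, not the official Spark example; the file name wordcount_sketch.py, the application name and the tokenize helper are assumptions. It takes one argument, the input URL.

```python
import sys
from operator import add


def tokenize(line):
    """Split a line into lowercase words."""
    return line.lower().split()


def main(input_url):
    # Imported here so that tokenize() can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCountSketch")
    counts = (sc.textFile(input_url)
                .flatMap(tokenize)
                .map(lambda word: (word, 1))
                .reduceByKey(add))
    for word, count in counts.take(20):  # print a small sample only
        print("%s\t%d" % (word, count))
    sc.stop()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        sys.stderr.write("usage: wordcount_sketch.py <input_url>\n")
```

Assuming the script is saved as wordcount_sketch.py, it could be submitted with: spark-submit --num-executors 10 wordcount_sketch.py hdfs://babar.es.its.nyu.edu:8020/user/<your_net_id>/wikipedia.txt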

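You can try some examples:

  • Without streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
  • With streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py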

Spark Streaming is not the same as Hadoop Streaming. Hadoop Streaming lets you write MapReduce jobs in languages other than Java, whereas Spark runs Python, R, Java and Scala programs natively. Spark Streaming, in contrast, supports processing of live data streams.

Some references:

1) Submitting applications to Spark: http://spark.apache.org/docs/latest/submitting-applications.html

2) Data transformations: http://spark.apache.org/docs/latest/programming-guide.html#transformations
