Difference between revisions of "NYU HPC Access Instructions"

Revision as of 22:40, 6 January 2016

Accessing the NYU HPC Cluster

1. Log into the main HPC node:

      ssh <netid>@hpc.nyu.edu

2. From the HPC node, log into the Hadoop cluster:

      ssh dumbo

You will be using a set of commands, and it will save you some time to first create aliases for them. Once on "dumbo", run the following commands on your terminal:

bash

alias hfs='/usr/bin/hadoop fs '

export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars

export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar

alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'

To reuse these aliases every time you log in to dumbo, append the following lines to the end of your .bashrc file:

alias hfs='/usr/bin/hadoop fs '

export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars

export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar

alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'

%% Note: you should not have any spaces around "="!

If you have bash as your default shell, do

     source ~/.bashrc

This will create the aliases. If you have tcsh as your default shell, just invoke bash -- it will automatically read the .bashrc file and create the aliases.

Here are some common commands:

hfs %% See available commands.

hfs -help %% More command details.

hfs -ls [<path>] %% List files.

hfs -cp <src> <dst> %% Copy files within HDFS.

hfs -mkdir <path> %% Create a directory.

hfs -rm <path> %% Remove a file.

hfs -chmod <mode> <path> %% Modify permissions.

hfs -chown <owner> <path> %% Modify owner.

Some remote access commands:

hfs -cat <src> %% Print file contents to stdout.

hfs -copyFromLocal <localsrc> <dst> %% Copy from the local file system to HDFS.

hfs -copyToLocal <src> <localdst> %% Copy from HDFS to the local file system.

Using Hadoop Streaming

Hadoop Streaming lets you use programs written in any language for MapReduce operations.
You can use the "hjs" alias you created to run Hadoop Streaming.
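
As an illustration of what a streaming mapper and reducer do, here is a minimal word-count sketch in Python. It is not the pmap.py and pred.py from the example, whose contents are not shown on this page; the names map_line, reduce_sorted, and run_locally are hypothetical. The sketch assumes the usual streaming contract: the mapper receives input lines on stdin and writes tab-separated key/value records, Hadoop sorts those records by key, and the reducer receives the sorted records on stdin.

```python
from itertools import groupby


def map_line(line):
    """Mapper logic: emit one "word<TAB>1" record per word in the line."""
    return [f"{word.lower()}\t1" for word in line.split()]


def reduce_sorted(records):
    """Reducer logic: sum the counts of consecutive records sharing a key.

    Assumes the records arrive sorted by key, which is what the
    framework provides to a streaming reducer.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce on a small in-memory input."""
    mapped = [record for line in lines for record in map_line(line)]
    return list(reduce_sorted(sorted(mapped)))


if __name__ == "__main__":
    for result in run_locally(["the quick fox", "the lazy dog"]):
        print(result)
```

In a real job, the mapper script would loop over sys.stdin and print each record returned by map_line, and the reducer script would print each record produced by reduce_sorted(sys.stdin). Running the simulation locally first is a cheap way to catch bugs before submitting to the cluster.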

To run the example I provided, do the following:

1) Copy the directory containing the Python files and input data to dumbo. You will first need to "scp" from your machine to the hpc node, and then from the hpc node to dumbo. Assuming the directory on your machine is called /Users/julianafreire/MRExample:

      scp -r /Users/julianafreire/MRExample  your_netid@hpc.nyu.edu:

Then, from the hpc node:

      scp -r MRExample  dumbo:

- Remember to replace your_netid with your actual netid!

2) From dumbo, copy the data file to HDFS:

      hfs -copyFromLocal /home/your_netid/MRExample/wikipedia.txt wikipedia.txt

3) Check if the file is on HDFS

     hfs -ls

4) Now, to run the job, make sure you are in the right directory:

    cd /home/your_netid/MRExample
    hjs -file pmap.py  -mapper pmap.py   -file pred.py -reducer pred.py   -input /user/your_netid/wikipedia.txt -output /user/your_netid/wikipedia.output

5) The outputs of this job are now in HDFS, in the directory /user/your_netid/wikipedia.output. To list the output files:

    hfs -ls /user/your_netid/wikipedia.output

You can also inspect the content of the files:

   hfs -cat wikipedia.output/*

If you'd like to copy the files over to your local directory:

   hfs -get /user/your_netid/wikipedia.output  output

This will copy the outputs to the local directory "output" on dumbo.

Using Spark

Spark allows you to write and run applications quickly in Java, Scala, Python, and R.
You can use either the Spark interactive shell or the Spark submission tool.

To run the Spark interactive shell (Scala or Python):

1) Log in to dumbo

2) Execute one of the following:

       spark-shell (to run applications in Scala)

       pyspark (to run applications in Python)

If you want to access your files stored on HDFS, use the following URL as the filename in Spark:

       hdfs://babar.es.its.nyu.edu:8020/user/<your_net_id>/<your_files>

The hdfs:// URL must be exactly right; otherwise you will not be able to access the file on HDFS.
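
Because a single wrong character in the URL makes the file unreachable, it can help to build the URL in one place instead of retyping it. The following is a small sketch of that idea; the helper name hdfs_url is hypothetical, and the host and port are simply the ones given above.

```python
from posixpath import join

# Host and port as given in these instructions.
NAMENODE = "hdfs://babar.es.its.nyu.edu:8020"


def hdfs_url(netid, relative_path):
    """Return the full hdfs:// URL of a file under /user/<netid>/ (hypothetical helper)."""
    if not netid or "/" in netid:
        raise ValueError("netid must be a non-empty name without slashes")
    # Strip leading slashes so the path stays inside the user's HDFS directory.
    return NAMENODE + join("/user", netid, relative_path.lstrip("/"))


if __name__ == "__main__":
    print(hdfs_url("your_netid", "wikipedia.txt"))
```

In the pyspark shell, the resulting string can be passed to sc.textFile(...) to load the file.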

To submit a job to Spark:

1) Log in to dumbo

2) Execute:

       spark-submit --num-executors <10-100> <your_python_script> <your_script_arguments>

The dumbo cluster has 100 executors. Feel free to choose any number of executors for your submission. More executors generally make a job run faster. However, if many people submit Spark jobs at the same time, performance will degrade.

You can try some examples:

Without streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
With streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py

Spark Streaming is not the same as Hadoop Streaming. In contrast with Hadoop, Spark can run Python, R, Java, and Scala programs natively. Spark Streaming is a separate feature that supports processing of live data streams.

Some references:

1) Submitting applications to Spark: http://spark.apache.org/docs/latest/submitting-applications.html

2) Data transformations: http://spark.apache.org/docs/latest/programming-guide.html#transformations
