NYU HPC Access Instructions

From VistrailsWiki

Accessing the NYU HPC Cluster

1. Log into the main HPC node:

      ssh <netid>@hpc.nyu.edu    

2. From the HPC node, log into the Hadoop cluster:

      ssh dumbo

You will be using a set of commands, and it will save you some time to first create aliases for them. Once on "dumbo", run the following commands on your terminal:

      bash
      alias hfs='/usr/bin/hadoop fs '
      export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars
      export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar
      alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'

To reuse these aliases every time you log in to dumbo, append the following lines to the end of your .bashrc file:

      alias hfs='/usr/bin/hadoop fs '
      export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars
      export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar
      alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'

%% Note: you should not have any spaces around "="!

If you have bash as your default shell, do

     source .bashrc

This will create the aliases. If you have tcsh as your default shell, just invoke bash -- it will automatically read the .bashrc file and create the aliases.

Here are some common commands:

      hfs                          %% See available commands.
      hfs -help                    %% More command details.
      hfs -ls [<path>]             %% List files.
      hfs -cp <src> <dst>          %% Copy a file within HDFS.
      hfs -mkdir <path>            %% Create a directory.
      hfs -rm <path>               %% Remove a file.
      hfs -chmod <mode> <path>     %% Modify permissions.
      hfs -chown <owner> <path>    %% Modify owner.

Some remote access commands:

      hfs -cat <src>                        %% Print file contents to stdout.
      hfs -copyFromLocal <localsrc> <dst>   %% Copy a local file to HDFS.
      hfs -copyToLocal <src> <localdst>     %% Copy an HDFS file to the local file system.

Using Hadoop Streaming

  • Hadoop Streaming allows the use of any program, written in any language, for MapReduce operations.
  • You can use the "hjs" alias you created to run Hadoop Streaming.
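To make the streaming contract concrete, here is a minimal sketch of a word-count mapper and reducer in Python. It is not the pmap.py or pred.py used in the example below; the function names are hypothetical. It assumes the usual streaming convention of one tab-separated key/value record per line, with records sorted by key before they reach the reducer.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower(), 1)


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive records that share a key.

    The records must already be sorted by key (Hadoop Streaming does this
    between the map and reduce phases), so equal keys are adjacent.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield "%s\t%d" % (word, total)


def simulate(lines):
    """Run map -> sort -> reduce locally, like `cat input | mapper | sort | reducer`."""
    return list(reduce_lines(sorted(map_lines(lines))))
```

In an actual streaming job, each function would live in its own script and be driven by standard input, for example `for record in map_lines(sys.stdin): print(record)`. Calling `simulate` on a few lines is a quick way to check the logic before submitting a job.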

To run the example I provided, do the following:

1) Copy the directory containing the Python files and input data to dumbo. You will first need to "scp" from your machine to the hpc node, and then from the hpc node to dumbo. Assuming the directory is called /Users/julianafreire/MRExample:

      scp -r /Users/julianafreire/MRExample  your_netid@hpc.nyu.edu: 

Then, from the hpc node:

      scp -r MRExample  dumbo:
    • Remember to replace your_netid with your actual netid!

2) From dumbo, you will now copy the data file to HDFS

      hfs -copyFromLocal /home/your_netid/MRExample/wikipedia.txt wikipedia.txt

3) Check if the file is on HDFS

     hfs -ls

4) Now, to run the job, make sure you are in the right directory

    cd /home/your_netid/MRExample
    hjs -file pmap.py  -mapper pmap.py   -file pred.py -reducer pred.py   -input /user/your_netid/wikipedia.txt -output /user/your_netid/wikipedia.output

5) The outputs of this job are now in HDFS, in the directory /user/your_netid/wikipedia.output. To list the output files:

    hfs -ls /user/your_netid/wikipedia.output

You can also inspect the content of the files:

   hfs -cat wikipedia.output/*

If you'd like to copy the files over to your local directory:

   hfs -get /user/your_netid/wikipedia.output  output

This will copy the outputs to the local directory "output" on dumbo.
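Once the output directory is on the local file system, a short script can load the results for further processing. The sketch below assumes the reducer wrote tab-separated key/value lines and that the job produced the usual part-* files (for example part-00000). The helper name is hypothetical, and the actual record format depends on what pred.py prints.

```python
import glob
import os


def read_streaming_output(output_dir):
    """Collect (key, value) string pairs from every part-* file in output_dir."""
    records = []
    # Only part-* files hold data; markers such as _SUCCESS are skipped.
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue  # ignore blank lines
                key, _, value = line.partition("\t")
                records.append((key, value))
    return records
```

For example, `read_streaming_output("output")` would return the records copied by the `hfs -get` command above as a list of pairs.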

Using Spark

  • Spark allows you to write and run applications quickly in Java, Scala, Python and R.
  • You can use either the Spark interactive shell or the Spark submission tool (spark-submit).

To run the Spark interactive shell (Scala or Python):

1) Log in to dumbo

2) Execute one of the following:

       spark-shell (to run applications in Scala)
       pyspark (to run applications in Python)

If you want to access your files stored on HDFS, use the following URL as the file name in Spark:

       hdfs://babar.es.its.nyu.edu:8020/user/<your_net_id>/<your_files>

The hdfs:// URL must be exactly right; otherwise you won't be able to access the file from HDFS.
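Because a single wrong character in the URL makes the file unreachable, it can help to build the URL in one place. The following is a small sketch for use in pyspark. The helper name `hdfs_url` is hypothetical, and the name node address is copied from the URL above.

```python
# Name node address taken from the URL given in this section.
HDFS_NAMENODE = "hdfs://babar.es.its.nyu.edu:8020"


def hdfs_url(netid, relative_path):
    """Build the full HDFS URL for a file under /user/<netid>/."""
    return "%s/user/%s/%s" % (HDFS_NAMENODE, netid, relative_path.lstrip("/"))


# Inside the pyspark shell, `sc` (the SparkContext) is already defined, so
# the file can be loaded and its lines counted like this:
#   lines = sc.textFile(hdfs_url("your_netid", "wikipedia.txt"))
#   print(lines.count())
```

The commented lines are meant to be typed into the pyspark shell, with your_netid replaced by your actual netid.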

To submit a job to Spark:

1) Log in to dumbo

2) Execute

       spark-submit --num-executors <10-100> <your_python_script> <your_script_arguments>

The dumbo cluster has 100 executors. Feel free to choose any number of executors for your submission; more executors generally make a job run faster. However, if many people submit Spark jobs at the same time, performance will degrade.
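As an illustration of what <your_python_script> might look like, here is a minimal word-count sketch. It is not the official Spark example; the file name wordcount_sketch.py and the function names are hypothetical. It assumes the script receives an input path and an output path as its two arguments.

```python
import sys
from operator import add


def tokenize(line):
    """Split a line into lowercase words."""
    return line.lower().split()


def main(input_path, output_path):
    # Imported here so tokenize() can be tested on a machine without Spark.
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("WordCountSketch"))
    counts = (sc.textFile(input_path)          # one record per input line
                .flatMap(tokenize)             # one record per word
                .map(lambda word: (word, 1))   # (word, 1) pairs
                .reduceByKey(add))             # sum the counts per word
    counts.saveAsTextFile(output_path)         # output directory must not exist yet
    sc.stop()


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
    else:
        sys.stderr.write("usage: wordcount_sketch.py <input> <output>\n")
```

Under these assumptions, the script would be submitted with something like:

       spark-submit --num-executors 10 wordcount_sketch.py <input_hdfs_url> <output_hdfs_url>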

Spark word count example:

Without streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py

With streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py

Spark Streaming is not the same as Hadoop Streaming. In contrast with Hadoop, Spark can run Python, R, Java and Scala programs natively. Spark Streaming is instead a component for stream processing of live data streams.

Some references:

1) Submitting applications to Spark: http://spark.apache.org/docs/latest/submitting-applications.html

2) Data transformations: http://spark.apache.org/docs/latest/programming-guide.html#transformations