AWS Setup

Setting up your AWS account

Go to http://aws.amazon.com/ and sign up. You may sign in using your existing Amazon account, or you can create a new account by selecting "I am a new user."

Enter your contact information and confirm your acceptance of the AWS Customer Agreement. Once you have created an Amazon Web Services Account, check your email for your confirmation step. You need Access Identifiers to make valid web service requests.

Go to http://aws.amazon.com/ and sign in. At the top of the page, click on Sign in to the AWS Management Console. You need to sign up for three of their services: Simple Storage Service (S3), Elastic Compute Cloud (EC2), and Amazon Elastic MapReduce.

Get your AWS credit code from the code assignment, then go to http://aws.amazon.com/awscredits/. This gives you a $100 credit towards AWS. Be aware that if you exceed it, Amazon will charge your credit card without warning. This credit should be enough for this assignment (if you are interested in the details, see AWS charges; currently, AWS charges about 8 cents/node/hour for the default "small" node size). However, you must remember to manually terminate the AWS cluster (called a Job Flow) when you are done. If you just close the browser, the job flows continue to run, and Amazon will continue to charge you for days and weeks, exhausting your credit and then charging a large amount to your credit card. Remember to terminate the AWS cluster.
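
To get a feel for how quickly a forgotten cluster uses up the credit, here is a back-of-the-envelope sketch in shell arithmetic. It assumes the 8 cents/node/hour rate quoted above and a hypothetical 10-node cluster; the variable names are illustrative only, and actual AWS prices may differ.

```shell
#!/bin/sh
# Rough estimate of how long the $100 credit lasts if a cluster keeps running.
# Assumptions: 8 cents per node per hour (the rate quoted above) and a
# hypothetical cluster of 10 nodes. Integer arithmetic, so results round down.
credit_cents=10000      # $100 credit, in cents
rate_cents=8            # cost per node per hour, in cents
nodes=10                # hypothetical cluster size

hours=$(( credit_cents / (rate_cents * nodes) ))
days=$(( hours / 24 ))

echo "Credit lasts about $hours hours (roughly $days days) on $nodes nodes."
```

Under these assumptions the credit is gone in about five days, which is why terminating the job flow matters.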

Setting up an EC2 key pair

To connect to an Amazon EC2 node, such as the master nodes for the Hadoop clusters you will be creating, you need an SSH key pair. To create and install one, do the following:

After setting up your account, follow Amazon's instructions to create a key pair. Follow the instructions in section "Having AWS create the key pair for you," subsection "AWS Management Console." (Don't do this in Internet Explorer, or you might not be able to download the .pem private key file.)

Download and save the .pem private key file to disk. We will reference the .pem file as </path/to/saved/keypair/file.pem> in the following instructions.

Make sure only you can access the .pem file, just to be safe:

   $ chmod 600 </path/to/saved/keypair/file.pem>
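
If you want to confirm that the chmod took effect, the sketch below checks the permission string reported by ls. The helper name key_is_private is hypothetical (it is not part of ssh or AWS), and the demonstration uses a throwaway temporary file standing in for your real .pem file. This matters in practice because ssh typically refuses to use a private key that other users can read.

```shell
#!/bin/sh
# key_is_private is a hypothetical helper, not part of ssh or AWS.
# It succeeds only if the file's mode is exactly rw------- (600).
key_is_private() {
    perms=$(ls -l "$1" | cut -c1-10)
    [ "$perms" = "-rw-------" ]
}

# Demonstration on a throwaway file standing in for the real .pem file.
demo_key=$(mktemp)
chmod 644 "$demo_key"
key_is_private "$demo_key" || echo "too permissive before chmod 600"
chmod 600 "$demo_key"
key_is_private "$demo_key" && echo "private after chmod 600"
rm -f "$demo_key"
```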

Terminating an AWS cluster

After you are done, shut down the AWS cluster:

   Go to the Management Console.
   Select the job in the list.
   Click the Terminate button (it should be right below "Your Elastic MapReduce Job Flows").
   Wait for a while (may take minutes) and recheck until the job state becomes TERMINATED.

Pay attention to this step. If you fail to terminate your job and only close the browser or log off AWS, your job flow will continue to run, and AWS will continue to charge you for hours, days, or weeks. Once your credit is exhausted, it charges your credit card. Make sure you don't leave the console until you have confirmation that the job is terminated.

Killing a Hadoop Job

In the job tracker interface, find the Hadoop job_id, then type:

   $ hadoop job -kill job_id

Managing the results of your tasks

Copying files to or from the AWS master node

To copy one file from the master node back to your computer, run this command on the local computer:

   $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>:<file_path> .

where <file_path> can be absolute or relative to the AWS master node's home folder. The file should be copied onto your current directory ('.') on your local computer.

Better: copy an entire directory, recursively. Suppose your files are in the directory example-results. Then type the following on your local computer:

   $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -r hadoop@<master.public-dns-name.amazonaws.com>:example-results .
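
If you copy results often, the scp command above can be wrapped in a small shell function. This is only a sketch: fetch_results, KEY, MASTER, and DRY_RUN are hypothetical names introduced here, and the default values are the same placeholders used above, which you must replace with your own key path and master node name. With DRY_RUN set, the function prints the command instead of running it (the printed form drops the quotes around the -o option).

```shell
#!/bin/sh
# fetch_results is a hypothetical wrapper around the scp command shown above.
# KEY and MASTER are placeholders; set them to your own values.
KEY="${KEY:-/path/to/saved/keypair/file.pem}"
MASTER="${MASTER:-master.public-dns-name.amazonaws.com}"

fetch_results() {
    # $1: file or directory on the master node
    # $2: local destination (defaults to the current directory)
    remote_path=$1
    dest=${2:-.}
    if [ -n "$DRY_RUN" ]; then
        # Print the command instead of running it.
        echo scp -o "ServerAliveInterval 10" -i "$KEY" -r "hadoop@$MASTER:$remote_path" "$dest"
    else
        scp -o "ServerAliveInterval 10" -i "$KEY" -r "hadoop@$MASTER:$remote_path" "$dest"
    fi
}

# Example: show the command that would copy the example-results directory.
DRY_RUN=1 fetch_results example-results
```

The -r flag is always passed, since it works for single files as well as for directories.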

As an alternative, you may run the scp command on the AWS master node, and connect to your local machine. For that, you need to know your local machine's domain name, or IP address, and your local machine needs to accept ssh connections.

Storing Files in S3

This is usually easier than copying files with scp. Go to your AWS Management Console, click on Create Bucket, and create a new bucket (the S3 equivalent of a directory). The bucket name may be publicly visible, so pick a name you don't mind being public. Let's say you call it superman-hw6. Click on the Properties button, then the Permissions tab. Make sure you have all the permissions.

In your program, you can write the results to 's3n://superman-hw6/example-results'. When the program terminates, you should see the new directory example-results in your S3 console. Click on individual files to download them. The number of files depends on the number of reduce tasks, and may vary from one to a few dozen. The only disadvantage of using S3 is that you have to click on each file separately to download it.
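
Once the result files are on your computer (downloaded from S3 or copied with scp), it is often convenient to concatenate them into a single file. The sketch below assumes the reducers wrote files whose names start with "part-", which is Hadoop's usual naming; merge_parts is a hypothetical helper name, and the demonstration uses made-up files in a temporary directory rather than real results.

```shell
#!/bin/sh
# merge_parts is a hypothetical helper. It assumes the reducer output files
# are named part-* (Hadoop's usual naming) and sit in one local directory.
merge_parts() {
    # $1: local directory holding the downloaded files
    # $2: path of the merged output file
    cat "$1"/part-* > "$2"
}

# Demonstration with two made-up part files in a temporary directory.
demo_dir=$(mktemp -d)
printf 'a\t1\n' > "$demo_dir/part-00000"
printf 'b\t2\n' > "$demo_dir/part-00001"
merge_parts "$demo_dir" "$demo_dir/merged.txt"
cat "$demo_dir/merged.txt"
rm -rf "$demo_dir"
```

The files are concatenated in the shell's glob order, so the merged file is not globally sorted unless the job itself produced sorted, partitioned output.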

Note that S3 is permanent storage, and you are charged for it. You can safely store all your query answers for several weeks without exceeding your credit; at some point in the future remember to delete them.

Modified from http://www.cs.washington.edu/education/courses/csep544/11au/hw/hw6/hw6-awsusage.html