Difference between revisions of "Assignment 3 - FAQ"

From VistrailsWiki
Revision as of 03:29, 11 April 2014

Frequently Asked Questions

How do I specify which subset of the key is used by the partitioner?

  • Hadoop Streaming lets you change the partitioning strategy through the -partitioner option and a few -D properties.
  • Here's an example:

hadoop jar /usr/bin/hadoop/contrib/streaming/hadoop-streaming-1.0.3.16.jar -D mapred.reduce.tasks=2 -D stream.num.map.output.key.fields=2 -D num.key.fields.for.partition=2 -file wordMatrix_mapperPairs.py -mapper wordMatrix_mapperPairs.py -file wordMatrix_reducerPairs.py -reducer wordMatrix_reducerPairs.py -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner -input /user/juliana/input -output /user/juliana/output2

    • stream.num.map.output.key.fields=2 informs Hadoop that the first 2 fields of the mapper output form the key -- in this case (word1,word2), and the third field corresponds to the value.
    • num.key.fields.for.partition=2 specifies that both fields are to be used by the partitioner.
    • Note that we also need to specify the partitioner: -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
  • Here's another example, now using only the first field of the key for partitioning (the key itself still has two fields):

hadoop jar /usr/bin/hadoop/contrib/streaming/hadoop-streaming-1.0.3.16.jar -D mapred.reduce.tasks=2 -D stream.num.map.output.key.fields=2 -D num.key.fields.for.partition=1 -file wordMatrix_mapperPairs.py -mapper wordMatrix_mapperPairs.py -file wordMatrix_reducerPairs.py -reducer wordMatrix_reducerPairs.py -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner -input /user/juliana/input -output /user/juliana/output2

    • num.key.fields.for.partition=1 specifies that only the first field (word1) is used by the partitioner, so all pairs that share the same first word are sent to the same reducer.
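
To see what the two settings change, here is a minimal Python sketch of the idea behind key-field-based partitioning. It is not Hadoop's code: the function names are made up for illustration, the mapper output is assumed to be tab-separated (word1, word2, count), and zlib.crc32 stands in for Hadoop's own hash function, so the reducer numbers it prints will not match a real job. What it does show is which part of each line gets hashed.

```python
import zlib


def partition_prefix(line, num_partition_fields, sep="\t"):
    """Return the leading fields of a mapper output line that would be hashed.

    Hypothetical helper: mimics the effect of num.key.fields.for.partition.
    """
    fields = line.rstrip("\n").split(sep)
    return sep.join(fields[:num_partition_fields])


def assign_reducer(line, num_partition_fields, num_reducers, sep="\t"):
    """Pick a reducer index from the partition prefix only."""
    prefix = partition_prefix(line, num_partition_fields, sep)
    # Stand-in hash (assumption): Hadoop uses a different hash function.
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    # Assumed mapper output format: word1 <TAB> word2 <TAB> count
    lines = ["the\tcat\t1", "the\tdog\t1", "a\tcat\t1"]
    for n in (1, 2):
        print("num.key.fields.for.partition =", n)
        for line in lines:
            prefix = partition_prefix(line, n)
            print("  hashed part:", repr(prefix),
                  "-> reducer", assign_reducer(line, n, 2))
```

With one partition field, the lines for (the, cat) and (the, dog) hash the same prefix and therefore always land on the same reducer. With two partition fields, they hash different prefixes and may be split across reducers.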

How do I specify which subset of the key is used by the partitioner on AWS?

  • You can do this when you configure a step:

[Image: emr-partitioner.png]