Sunday, May 13, 2012


Hadoop Project 1: Join Algorithm (Feb - May 2012)

Team member: David Zheng
Programming language: Java
Hadoop version: 1.0.0
Description: Implement two join algorithms presented in the paper A Comparison of Join Algorithms for Log Processing
in MapReduce [1]. The two join algorithms are Semi-Join and Per-Split Semi-Join. We use a three-node cluster: one
namenode and two datanodes.
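
To make the idea concrete, here is a minimal in-memory sketch of the three semi-join phases, roughly as the paper describes them: collect the distinct join keys of the log table, use them to filter the reference table, then join the filtered reference rows back to the log. It is not the project's Hadoop code. The class name SemiJoinSketch, the column layout (join key in column 0) and the sample data are all assumptions made for illustration, and each phase that would be a separate job on the cluster is just a method here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory sketch of the three semi-join phases; no Hadoop classes are used.
class SemiJoinSketch {

    // Phase 1: collect the distinct join keys that occur in the log table L.
    static Set<String> uniqueKeys(List<String[]> logRecords) {
        Set<String> keys = new LinkedHashSet<String>();
        for (String[] rec : logRecords) {
            keys.add(rec[0]); // assumption: column 0 holds the join key
        }
        return keys;
    }

    // Phase 2: keep only the rows of the reference table R whose key occurs in L.
    static Map<String, String> filterReference(Map<String, String> reference, Set<String> keys) {
        Map<String, String> filtered = new HashMap<String, String>();
        for (String key : keys) {
            String value = reference.get(key);
            if (value != null) {
                filtered.put(key, value);
            }
        }
        return filtered;
    }

    // Phase 3: join every log record with the (much smaller) filtered reference table.
    static List<String> join(List<String[]> logRecords, Map<String, String> filteredReference) {
        List<String> out = new ArrayList<String>();
        for (String[] rec : logRecords) {
            String refValue = filteredReference.get(rec[0]);
            if (refValue != null) { // inner join: unmatched log records are dropped
                out.add(rec[0] + "," + rec[1] + "," + refValue);
            }
        }
        return out;
    }

    // Chain the three phases together.
    static List<String> semiJoin(List<String[]> logRecords, Map<String, String> reference) {
        Set<String> keys = uniqueKeys(logRecords);
        Map<String, String> filtered = filterReference(reference, keys);
        return join(logRecords, filtered);
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<String[]>();
        log.add(new String[] {"u1", "click"});
        log.add(new String[] {"u2", "view"});
        log.add(new String[] {"u1", "view"});
        log.add(new String[] {"u9", "click"}); // no matching user

        Map<String, String> users = new HashMap<String, String>();
        users.put("u1", "alice");
        users.put("u2", "bob");
        users.put("u3", "carol"); // never referenced by the log, filtered out in phase 2

        for (String row : semiJoin(log, users)) {
            System.out.println(row);
        }
        // prints:
        // u1,click,alice
        // u2,view,bob
        // u1,view,alice
    }
}
```

Per-Split Semi-Join follows the same outline, except that the key extraction and the filtering are done separately for each split of the log, so each split is joined only with the reference rows it actually needs.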

What you will learn from this project

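  • For reducers, the input key-value types must match the mapper's output key-value types. They must also match the reducer's own output types when the reducer doubles as a combiner, or when the map output types are not set separately from the job output types.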
  • If you're running a map-only job, you MUST EXPLICITLY set the number of reducers to 0.
  • The differences among mappers, datanodes, and the map function.
  • How to write a close function that runs after the map function has processed all of a mapper's records.
  • How the input file(s) are split among mappers when no partitioning method is specified.
  • How to specify the number of lines read by each mapper.
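
Several of these points are easier to see in a small simulation. The sketch below is plain Java, not Hadoop code: DistinctKeyMapper, splitByLines and runMapOnly are hypothetical names invented for illustration. It shows that a mapper is a task that handles one split, that the map function is called once per line inside that task, and that close runs once after the last map call, which is the natural place to emit anything buffered during map. In Hadoop 1.0.0 itself, the corresponding pieces are setNumReduceTasks(0) on the job for a map-only run, the close() method of the old org.apache.hadoop.mapred mapper API, and NLineInputFormat with the mapred.line.input.format.linespermap property for a fixed number of lines per mapper. Without such a setting, the default text input format cuts files into splits of roughly one HDFS block each.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation of a map task's lifecycle; these are not Hadoop classes.
class MapTaskLifecycleSketch {

    // Hypothetical stand-in for one mapper: map() runs per line, close() once at the end.
    static class DistinctKeyMapper {
        private final Set<String> seen = new LinkedHashSet<String>();
        private final List<String> output;

        DistinctKeyMapper(List<String> output) {
            this.output = output;
        }

        // Called once per input line: remember the key, emit nothing yet.
        void map(String line) {
            seen.add(line.split(",")[0]); // assumption: key is the first comma-separated field
        }

        // Called once after the last map() call of this task: emit the buffered keys.
        void close() {
            output.addAll(seen);
        }
    }

    // Cut the input into splits of at most linesPerMapper lines, one split per mapper.
    static List<List<String>> splitByLines(List<String> lines, int linesPerMapper) {
        List<List<String>> splits = new ArrayList<List<String>>();
        for (int i = 0; i < lines.size(); i += linesPerMapper) {
            int end = Math.min(i + linesPerMapper, lines.size());
            splits.add(new ArrayList<String>(lines.subList(i, end)));
        }
        return splits;
    }

    // Run one mapper per split; with zero reducers the mapper outputs are the job output.
    static List<List<String>> runMapOnly(List<String> lines, int linesPerMapper) {
        List<List<String>> outputs = new ArrayList<List<String>>();
        for (List<String> split : splitByLines(lines, linesPerMapper)) {
            List<String> out = new ArrayList<String>();
            DistinctKeyMapper mapper = new DistinctKeyMapper(out);
            for (String line : split) {
                mapper.map(line);
            }
            mapper.close();
            outputs.add(out);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "u1,click", "u1,view", "u2,view", "u3,click", "u3,view");
        System.out.println(runMapOnly(lines, 2));
        // prints: [[u1], [u2, u3], [u3]]
    }
}
```

Note that u3 appears in the output of two different mappers. Each mapper only removes duplicates within its own split, so a job that needs globally unique keys still has to merge the per-mapper outputs afterwards, for example in a reducer.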

[1] Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian. A Comparison of
Join Algorithms for Log Processing in MapReduce. SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International
Conference on Management of Data.
