Dave's Hadoop Blog

Hadoop Project 1: Join Algorithm (Feb - May 2012)

Team member: David Zheng

Programming language: Java

Hadoop version: 1.0.0

Description: Implement two join algorithms presented in the paper "A Comparison of Join Algorithms for Log
Processing in MapReduce" [1]: Semi-Join and Per-Split Semi-Join. We use a cluster of three nodes: one
namenode and two datanodes.
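
As a rough illustration of the Semi-Join idea from [1], here is a minimal in-memory sketch in plain Java. It is not the project's MapReduce code: the class name SemiJoinSketch, the method names, and the record layout ({joinKey, payload} for log records, key-to-value for the reference table) are all assumptions made for this example. It walks through the three phases: collect the distinct join keys in the log, filter the reference table down to those keys, then join the log against the filtered table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory illustration only; a real job would run each phase as MapReduce over HDFS files.
// All names and the record layout are hypothetical.
final class SemiJoinSketch {

    // Phase 1: collect the distinct join keys that occur in the log.
    // Each log record is {joinKey, payload}.
    static Set<String> distinctLogKeys(List<String[]> log) {
        Set<String> keys = new HashSet<>();
        for (String[] record : log) {
            keys.add(record[0]);
        }
        return keys;
    }

    // Phase 2: keep only the reference rows whose key appears in the log.
    static Map<String, String> filterReference(Map<String, String> reference, Set<String> logKeys) {
        Map<String, String> filtered = new HashMap<>();
        for (Map.Entry<String, String> row : reference.entrySet()) {
            if (logKeys.contains(row.getKey())) {
                filtered.put(row.getKey(), row.getValue());
            }
        }
        return filtered;
    }

    // Phase 3: join every log record against the filtered reference table.
    static List<String> join(List<String[]> log, Map<String, String> filtered) {
        List<String> joined = new ArrayList<>();
        for (String[] record : log) {
            String referenceValue = filtered.get(record[0]);
            if (referenceValue != null) {
                joined.add(record[0] + "," + record[1] + "," + referenceValue);
            }
        }
        return joined;
    }

    // Runs the three phases in order.
    static List<String> semiJoin(List<String[]> log, Map<String, String> reference) {
        Set<String> keys = distinctLogKeys(log);
        Map<String, String> filtered = filterReference(reference, keys);
        return join(log, filtered);
    }

    public static void main(String[] args) {
        List<String[]> log = Arrays.asList(
                new String[] {"u1", "click"},
                new String[] {"u3", "view"});
        Map<String, String> reference = new HashMap<>();
        reference.put("u1", "Alice");
        reference.put("u2", "Bob");   // never appears in the log, so phase 2 drops it
        reference.put("u3", "Carol");
        for (String line : semiJoin(log, reference)) {
            System.out.println(line);
        }
    }
}
```

The point of phase 2 is that only reference rows actually needed by the log are carried into the final join, which matters when the reference table is large and the log touches few of its keys.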

What you will learn from this project

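For reducers, the input key-value types must match the mapper's output key-value types. They must also match the reducer's own output types if the map output classes are not set explicitly (they default to the job's output classes) or if the reducer is also used as a combiner.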

If you're running a map-only job, you MUST EXPLICITLY set the number of reducers to 0 (for example, with setNumReduceTasks(0)); otherwise Hadoop runs its default of one reducer.

The differences among mappers, datanodes, and the map function.

How to write a close function that runs after the map function has processed its last record.

How mappers partition the input files when no partitioning method is specified.

How to specify the number of lines read by each mapper.
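
For the last item, one option in Hadoop 1.0.0's old API is org.apache.hadoop.mapred.lib.NLineInputFormat, which takes its lines-per-split value from the mapred.line.input.format.linespermap property (default 1). Whether the project used exactly this is an assumption. The sketch below does not call Hadoop at all; it is plain Java that only works through the arithmetic of fixed-size line splits, with the hypothetical names LinesPerMapperSketch, mapperCount, and splitByLines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of "N lines per mapper"; names are hypothetical and no Hadoop API is used.
final class LinesPerMapperSketch {

    // Number of map tasks needed when each split holds at most linesPerMap lines.
    static int mapperCount(int totalLines, int linesPerMap) {
        if (totalLines < 0 || linesPerMap <= 0) {
            throw new IllegalArgumentException("totalLines must be >= 0 and linesPerMap must be > 0");
        }
        return (totalLines + linesPerMap - 1) / linesPerMap; // ceiling division
    }

    // Groups the lines into consecutive chunks, one chunk per simulated mapper.
    static List<List<String>> splitByLines(List<String> lines, int linesPerMap) {
        if (linesPerMap <= 0) {
            throw new IllegalArgumentException("linesPerMap must be > 0");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerMap) {
            int end = Math.min(start + linesPerMap, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        List<List<String>> splits = splitByLines(lines, 2);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("mapper " + i + " reads " + splits.get(i));
        }
    }
}
```

The last split simply holds whatever lines are left over, so the number of mappers is the line count divided by the lines-per-mapper value, rounded up.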

[1] Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian. A Comparison of
Join Algorithms for Log Processing in MapReduce. SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International
Conference on Management of Data.

Sunday, May 13, 2012
