Semi-Join
Problem
Join a customer table R and a log table L on the join key CustID using the MapReduce
architecture.
Motivation
Often, when R is large, many records in R may not actually be referenced by any record in table
L. Consider Facebook as an example. Its user table has hundreds of millions of records. However, an hour's worth of log data likely contains the activities of only a few million unique users, and the majority of users are not present in this log at all. For a broadcast join, this means that a large portion of the records in R that are shipped across the network (via the DFS) and loaded into the hash table are never used by the join. We exploit a semi-join to avoid sending over the network the records of R that will not join with L [1].
Dataset
For the input dataset, I used the code GenerateDataset.java to generate the transaction data, Transaction.txt, and the customer data, Customer.txt. To simulate circumstances that benefit from a semi-join, we need to generate customer and log data of suitable sizes. Let's generate a Customer.txt file with 20,000 records and a Transaction.txt file with the same number of records, with Transaction.txt referencing only customers whose IDs range from 1 to 200. Thus only the first 200 customers (one hundredth of the table) have transaction records in the Transaction.txt file. This simulates the situation in which an hour's worth of log data contains the activities of only a small fraction of the unique users, and the majority of users are not present in the log at all [1]. The Customer.txt file represents table R from the paper, and the Transaction.txt file represents table L.
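For reference, here is a minimal sketch of what a generator along the lines of GenerateDataset.java could look like. This is not the attached code; the tab-separated field layouts (CustID/name for customers, CustID/amount for transactions) are assumptions.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Random;

    // Hypothetical sketch of GenerateDataset.java: writes 20,000 customers
    // but references only CustIDs 1..200 in the transaction log, so a
    // semi-join can filter out 99% of the customer table.
    public class GenerateDataset {
        public static void main(String[] args) throws IOException {
            int numCustomers = 20000;
            int numTransactions = 20000;
            int referencedCustomers = 200;  // only the first 200 IDs appear in the log
            Random rand = new Random();

            try (PrintWriter customers = new PrintWriter(new FileWriter("Customer.txt"))) {
                for (int id = 1; id <= numCustomers; id++) {
                    customers.println(id + "\tcustomer_" + id);  // CustID<TAB>name (assumed layout)
                }
            }
            try (PrintWriter transactions = new PrintWriter(new FileWriter("Transaction.txt"))) {
                for (int t = 1; t <= numTransactions; t++) {
                    int custId = rand.nextInt(referencedCustomers) + 1;  // 1..200
                    transactions.println(custId + "\t" + rand.nextInt(1000));  // CustID<TAB>amount (assumed)
                }
            }
        }
    }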
Workflow of MapReduce Jobs
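In outline, following [1], the semi-join runs as three jobs: 1) a full MapReduce job scans L (Transaction.txt) and writes the set of unique join keys to a single file, L.uk; 2) a map-only job filters R (Customer.txt) against L.uk, keeping only the customers that actually appear in the log; 3) the filtered R is broadcast-joined with L. Below is a minimal sketch of the first job, written against the old org.apache.hadoop.mapred API used throughout this post; the class names are illustrative, not the attached source.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Phase 1 (sketch): extract the unique CustIDs that actually occur in L.
    public class UniqueKeysJob {
        public static class UKMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable offset, Text line,
                            OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                String custId = line.toString().split("\t")[0];  // CustID is the first field
                output.collect(new Text(custId), new Text(""));  // empty value, not null (see bug 4)
            }
        }

        public static class UKReducer extends MapReduceBase
                implements Reducer<Text, Text, Text, Text> {
            public void reduce(Text custId, Iterator<Text> values,
                               OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                output.collect(custId, new Text(""));  // each unique key emitted exactly once
            }
        }
    }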
Bugs Encountered and Solutions
1) For reducers, the types of the input key-value pair must be the same as the types of the output key-value
pair (unless you set the map output types separately via setMapOutputKeyClass/setMapOutputValueClass). !!!
2) Do not create an output directory in advance for a MapReduce job if you will pass it as a parameter when
running the job; otherwise a "directory already exists" error will be raised.
3) The character between the key and value of a key-value pair in Hadoop output files is "\t", not
a space. Thus, when you split such a line, do not use the delimiter " "; use "\t" instead (see the sketch after this list). !!!
4) No matter what the types of the output key-value pairs are, you can't assign null to either of them, e.g.
output.collect(custID, null); otherwise a NullPointerException will be raised. !!!
5) You can't map a key to a null value in a Hashtable in MapReduce code, e.g. ht.put(custID, null);
otherwise a NullPointerException will be raised from java.util.Hashtable.put (the stack trace shows java.util.Hashtable.put(Unknown Source)). !!!
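Here is a small standalone sketch pulling together points 3), 4), and 5); the record values are made up.

    import java.util.HashMap;
    import java.util.Hashtable;

    public class BugNotes {
        public static void main(String[] args) {
            // 3) Hadoop writes each output record as key<TAB>value, so split on "\t".
            String line = "42\t371.25";
            String[] fields = line.split("\t");
            System.out.println("key=" + fields[0] + " value=" + fields[1]);

            // 4) In a mapper/reducer, emit an empty Text (or NullWritable.get())
            //    instead of null: output.collect(custID, null) throws NullPointerException.

            // 5) Hashtable rejects null keys and values; HashMap tolerates null values,
            //    or just store a dummy value when only key membership matters.
            HashMap<String, String> map = new HashMap<>();
            map.put(fields[0], null);              // fine
            Hashtable<String, String> ht = new Hashtable<>();
            try {
                ht.put(fields[0], null);           // throws NullPointerException
            } catch (NullPointerException e) {
                System.out.println("Hashtable.put with null value: " + e);
            }
        }
    }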
Tips & Experience
1) The difference among mappers, datanodes, and map functions.
Datanodes are the independent machines on which HDFS data is stored and mappers run. Mappers
are processes running on datanodes. A datanode can host one or more mappers, but a mapper resides on only one datanode. Map functions are just the snippets of code executed by each mapper.
2) If we don't specify the number of reducers, Hadoop uses the cluster's configured default (in our setup,
one per datanode). In the first phase of the semi-join we need exactly one output file, i.e. L.uk. Because the number of output files of a MapReduce job equals the number of reducers, we must set the number of reducers to 1 (see the driver sketch after this list).
3) If you are going to run a MapReduce program that consists of multiple jobs, you'd better get these jobs working
one by one for easier debugging.
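For reference, a minimal driver sketch (old JobConf API) that pins the first phase to a single reducer; it reuses the hypothetical UniqueKeysJob classes from the workflow section, and the input/output paths come from the command line.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch: drive phase 1 with exactly one reducer so the job writes a
    // single output file (L.uk). Paths and class names are placeholders.
    public class Phase1Driver {
        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(Phase1Driver.class);
            conf.setJobName("semi-join-phase1");
            conf.setMapperClass(UniqueKeysJob.UKMapper.class);
            conf.setReducerClass(UniqueKeysJob.UKReducer.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            conf.setNumReduceTasks(1);  // one reducer -> one output file, L.uk
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }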
How to run the program
1) Delete previously generated output folders
2) Compile
3) Make a jar
4) Run
Useful Resources
How to read all files in a directory in HDFS using the Hadoop FileSystem API
https://sites.google.com/site/hadoopandhive/home/how-to-read-all-files-in-a-directory-in-hdfs-using-hadoop-filesystem-api !!!
Hadoop File System Java Tutorial
Classes Used
JobConf !!!
FileSystem !!!
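To tie the two together, here is a hedged sketch of how the filtering phase might read L.uk back from HDFS with the FileSystem API, in the spirit of the directory-reading tutorial linked above; the helper name and path handling are assumptions.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch: load the unique-key file(s) under the L.uk directory from HDFS
    // into a set, e.g. inside a mapper's configure() for the filtering phase.
    public class LoadUniqueKeys {
        public static Set<String> load(JobConf conf, String dir) throws IOException {
            Set<String> keys = new HashSet<>();
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path(dir))) {
                if (status.isDir()) continue;  // skip subdirectories like _logs
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(status.getPath())))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        keys.add(line.split("\t")[0]);  // key is before the tab (see bug 3)
                    }
                }
            }
            return keys;
        }
    }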
Files Attached