Semi-Join
Problem
Join a customer table R and a log table L on the join key CustID using the MapReduce
architecture.
Motivation
Often, when R is large, many records in R may not actually be referenced by any record in table
L. Consider Facebook as an example. Its user table has hundreds of millions of records. However, an hour's worth of log data likely contains the activities of only a few million unique users, and the majority of users are not present in this log at all. For a broadcast join, this means that a large portion of the records in R that are shipped across the network (via the DFS) and loaded into the hash table are never used by the join. We exploit a semi-join to avoid sending over the network the records of R that will not join with L [1].
Dataset
For the input dataset, I used the code GenerateDataset.java to generate the transaction data, Transaction.txt, and the customer data, Customer.txt. To simulate circumstances that benefit from a semi-join, we need to generate customer and log data of suitable sizes. Let's generate a Customer.txt file with 20,000 records and a Transaction.txt file with the same number of records, with Transaction.txt referencing only customers whose IDs range from 1 to 200. Thus only the first 200 customers (one hundredth of the table) have transaction records in the Transaction.txt file. This simulates the situation in which an hour's worth of log data contains the activities of only a small fraction of the unique users, and the majority of users are not present in the log at all [1]. The Customer.txt file represents table R from the paper, and the Transaction.txt file represents table L.
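For reference, here is a minimal sketch of what a generator along the lines of GenerateDataset.java could look like. This is not the attached code; the tab-separated field layouts (CustID/name for customers, CustID/amount for transactions) are assumptions.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Random;

    // Hypothetical sketch of GenerateDataset.java: writes 20,000 customers
    // but references only CustIDs 1..200 in the transaction log, so a
    // semi-join can filter out 99% of the customer table.
    public class GenerateDataset {
        public static void main(String[] args) throws IOException {
            int numCustomers = 20000;
            int numTransactions = 20000;
            int referencedCustomers = 200;  // only the first 200 IDs appear in the log
            Random rand = new Random();

            try (PrintWriter customers = new PrintWriter(new FileWriter("Customer.txt"))) {
                for (int id = 1; id <= numCustomers; id++) {
                    customers.println(id + "\tcustomer_" + id);  // CustID<TAB>name (assumed layout)
                }
            }
            try (PrintWriter transactions = new PrintWriter(new FileWriter("Transaction.txt"))) {
                for (int t = 1; t <= numTransactions; t++) {
                    int custId = rand.nextInt(referencedCustomers) + 1;  // 1..200
                    transactions.println(custId + "\t" + rand.nextInt(1000));  // CustID<TAB>amount (assumed)
                }
            }
        }
    }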
Workflow of MapReduce Jobs
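In outline, following [1], the semi-join runs as three jobs: 1) a full MapReduce job scans L (Transaction.txt) and writes the set of unique join keys to a single file, L.uk; 2) a map-only job filters R (Customer.txt) against L.uk, keeping only the customers that actually appear in the log; 3) the filtered R is broadcast-joined with L. Below is a minimal sketch of the first job, written against the old org.apache.hadoop.mapred API used throughout this post; the class names are illustrative, not the attached source.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Phase 1 (sketch): extract the unique CustIDs that actually occur in L.
    public class UniqueKeysJob {
        public static class UKMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable offset, Text line,
                            OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                String custId = line.toString().split("\t")[0];  // CustID is the first field
                output.collect(new Text(custId), new Text(""));  // empty value, not null (see bug 4)
            }
        }

        public static class UKReducer extends MapReduceBase
                implements Reducer<Text, Text, Text, Text> {
            public void reduce(Text custId, Iterator<Text> values,
                               OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                output.collect(custId, new Text(""));  // each unique key emitted exactly once
            }
        }
    }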
Bugs Encountered and Solutions
1) For reducers, the types of the input key-value pair must be the same as the types of the output key-value
pair (unless you set the map output types separately via setMapOutputKeyClass/setMapOutputValueClass). !!!
2) Do not create an output directory in advance for a MapReduce job if you will pass it as a parameter when
running the job; otherwise a "directory already exists" error will be raised.
3) The character between the key and value of a key-value pair in Hadoop output files is "\t", not
a space. Thus, when you split such a line, do not use the delimiter " "; use "\t" instead (see the sketch after this list). !!!
4) No matter what the types of the output key-value pairs are, you can't assign null to either of them, e.g.
output.collect(custID, null); otherwise a NullPointerException will be raised. !!!
5) You can't map a key to a null value in a Hashtable in MapReduce code, e.g. ht.put(custID, null);
otherwise a NullPointerException will be raised from java.util.Hashtable.put (the stack trace shows java.util.Hashtable.put(Unknown Source)). !!!
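Here is a small standalone sketch pulling together points 3), 4), and 5); the record values are made up.

    import java.util.HashMap;
    import java.util.Hashtable;

    public class BugNotes {
        public static void main(String[] args) {
            // 3) Hadoop writes each output record as key<TAB>value, so split on "\t".
            String line = "42\t371.25";
            String[] fields = line.split("\t");
            System.out.println("key=" + fields[0] + " value=" + fields[1]);

            // 4) In a mapper/reducer, emit an empty Text (or NullWritable.get())
            //    instead of null: output.collect(custID, null) throws NullPointerException.

            // 5) Hashtable rejects null keys and values; HashMap tolerates null values,
            //    or just store a dummy value when only key membership matters.
            HashMap<String, String> map = new HashMap<>();
            map.put(fields[0], null);              // fine
            Hashtable<String, String> ht = new Hashtable<>();
            try {
                ht.put(fields[0], null);           // throws NullPointerException
            } catch (NullPointerException e) {
                System.out.println("Hashtable.put with null value: " + e);
            }
        }
    }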
Tips & Experience
1) The difference among mappers, datanodes, and map functions.
Datanodes are the independent machines on which HDFS data is stored and mappers run. Mappers
are processes running on datanodes. A datanode can host one or more mappers, but a mapper resides on only one datanode. Map functions are just the snippets of code executed by each mapper.
2) If we don't specify the number of reducers, Hadoop uses the cluster's configured default (in our setup,
one per datanode). In the first phase of the semi-join we need exactly one output file, i.e. L.uk. Because the number of output files of a MapReduce job equals the number of reducers, we must set the number of reducers to 1 (see the driver sketch after this list).
3) If you are going to run a MapReduce program that consists of multiple jobs, you'd better get these jobs working
one by one for easier debugging.
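For reference, a minimal driver sketch (old JobConf API) that pins the first phase to a single reducer; it reuses the hypothetical UniqueKeysJob classes from the workflow section, and the input/output paths come from the command line.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch: drive phase 1 with exactly one reducer so the job writes a
    // single output file (L.uk). Paths and class names are placeholders.
    public class Phase1Driver {
        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(Phase1Driver.class);
            conf.setJobName("semi-join-phase1");
            conf.setMapperClass(UniqueKeysJob.UKMapper.class);
            conf.setReducerClass(UniqueKeysJob.UKReducer.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            conf.setNumReduceTasks(1);  // one reducer -> one output file, L.uk
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }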
How to run the program
1) Delete previously generated output folders
2) Compile
3) Make a jar
4) Run
Useful Resources
How to read all files in a directory in HDFS using the Hadoop FileSystem API
https://sites.google.com/site/hadoopandhive/home/how-to-read-all-files-in-a-directory-in-hdfs-using-hadoop-filesystem-api !!!
Hadoop File System Java Tutorial
Classes Used
JobConf !!!
FileSystem !!!
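To tie the two together, here is a hedged sketch of how the filtering phase might read L.uk back from HDFS with the FileSystem API, in the spirit of the directory-reading tutorial linked above; the helper name and path handling are assumptions.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch: load the unique-key file(s) under the L.uk directory from HDFS
    // into a set, e.g. inside a mapper's configure() for the filtering phase.
    public class LoadUniqueKeys {
        public static Set<String> load(JobConf conf, String dir) throws IOException {
            Set<String> keys = new HashSet<>();
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path(dir))) {
                if (status.isDir()) continue;  // skip subdirectories like _logs
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(status.getPath())))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        keys.add(line.split("\t")[0]);  // key is before the tab (see bug 3)
                    }
                }
            }
            return keys;
        }
    }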
Files Attached