Generating Unique Numeric IDs in a Distributed Way with Hive, MapReduce, and Spark

Redis 技術中國大數據 2017-04-13

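In real business scenarios, you often need to generate unique numeric IDs in Hive, MapReduce, or Spark.

Common approaches are:

1. In MapReduce, use a single Reduce to generate the IDs;

2. In Hive, use the row_number analytic function, which in practice also runs in a single Reduce;

3. Use a counter provided by another framework such as HBase, Redis, or Zookeeper.

When the data volume is small, methods 1 and 2 work fine. When the data volume is huge, however, a single Reduce becomes very slow.

If the data volume is very large and all you need is a unique numeric ID (note: not a "consecutive unique numeric ID"), consider the approach described in this article. Otherwise, use method 3.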

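Generating such non-consecutive unique numeric IDs in Spark is very simple: just use zipWithUniqueId.

Following the way zipWithUniqueId works, the same thing can be implemented in MapReduce and Hive as follows: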

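In Spark, zipWithUniqueId uses each partition's index as the starting ID for that partition, and within a partition the ID grows in steps equal to the number of partitions of the RDD. The same idea carries over to MapReduce and Hive: the number of Spark partitions corresponds to the number of Maps in MapReduce, and the Spark partition index corresponds to the Map Task ID.

The number of Maps can be obtained via getNumMapTasks on the JobConf, and the Map Task ID can be read from the parameter mapred.task.id, whose value looks like attempt_1478926768563_0537_m_000004_0. Take the 4 from m_000004_0 and add 1 to get the starting ID for that Map Task. Note: both values are only available while the Job is running. Also, as the figure shows, the amount of data in each partition/Map Task is not exactly the same, so the generated IDs are not consecutive.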
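The numbering scheme described above can be sketched in plain Java, without Spark or Hadoop. This is only an illustration under stated assumptions: the class StridedIdSketch, the method idsForTask, and the row counts are hypothetical and do not come from the original article. It follows the article's MapReduce variant, where task k starts at k + 1 (Spark's own zipWithUniqueId starts partition k at k).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of Spark, Hadoop, or Hive.
final class StridedIdSketch {

    // IDs for one task: start at taskIndex + 1, then step by numTasks.
    static List<Long> idsForTask(int taskIndex, int numTasks, int rowCount) {
        List<Long> ids = new ArrayList<>(rowCount);
        long next = taskIndex + 1L;
        for (int row = 0; row < rowCount; row++) {
            ids.add(next);
            next += numTasks;
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] rowsPerTask = {4, 2, 3}; // made-up, deliberately uneven row counts
        Set<Long> seen = new HashSet<>();
        long max = 0;
        int total = 0;
        for (int task = 0; task < rowsPerTask.length; task++) {
            List<Long> ids = idsForTask(task, rowsPerTask.length, rowsPerTask[task]);
            System.out.println("task " + task + " -> " + ids);
            for (long id : ids) {
                // Each task owns one residue class modulo numTasks, so this never fires.
                if (!seen.add(id)) {
                    throw new IllegalStateException("duplicate ID " + id);
                }
                max = Math.max(max, id);
            }
            total += rowsPerTask[task];
        }
        System.out.println("rows=" + total + " distinct=" + seen.size() + " max=" + max);
    }
}
```

With these made-up row counts the three tasks produce [1, 4, 7, 10], [2, 5], and [3, 6, 9]: nine distinct IDs whose maximum is 10, so the IDs are unique but leave a gap at 8.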

The following UDF can be used directly in Hive:

package com.lxw1234.hive.udf;

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.LongWritable;

// Non-deterministic and stateful: each call returns the next ID for this task.
@UDFType(deterministic = false, stateful = true)
public class RowSeq2 extends GenericUDF {
    private static LongWritable result = new LongWritable();
    private static final char SEPARATOR = '_';
    private static final String ATTEMPT = "attempt";
    private long initID = 0L;   // next ID to hand out
    private int increment = 0;  // step size = number of Map Tasks

    // Called once per task at runtime, when the job configuration is available.
    @Override
    public void configure(MapredContext context) {
        increment = context.getJobConf().getNumMapTasks();
        if (increment == 0) {
            throw new IllegalArgumentException("mapred.map.tasks is zero");
        }
        initID = getInitId(context.getJobConf().get("mapred.task.id"), increment);
        if (initID == 0L) {
            throw new IllegalArgumentException("mapred.task.id");
        }
        System.out.println("initID : " + initID + " increment : " + increment);
    }

    // The function takes no arguments and returns a BIGINT.
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments)
            throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
    }

    // Return the current ID, then advance by the step size.
    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        result.set(getValue());
        increment(increment);
        return result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "RowSeq-func";
    }

    private synchronized void increment(int incr) {
        initID += incr;
    }

    private synchronized long getValue() {
        return initID;
    }

    // Example: attempt_1478926768563_0537_m_000004_0 returns 4 + 1 = 5.
    private long getInitId(String taskAttemptIDstr, int numTasks)
            throws IllegalArgumentException {
        try {
            String[] parts = taskAttemptIDstr.split(Character.toString(SEPARATOR));
            if (parts.length == 6) {
                if (parts[0].equals(ATTEMPT)) {
                    if (!parts[3].equals("m") && !parts[3].equals("r")) {
                        throw new Exception();
                    }
                    long taskIndex = Long.parseLong(parts[4]);
                    if (taskIndex >= numTasks) { // task index must be < numTasks
                        throw new Exception("TaskAttemptId string : " + taskAttemptIDstr
                                + " parse ID [" + taskIndex + "] >= numTasks[" + numTasks + "] ..");
                    }
                    return taskIndex + 1;
                }
            }
        } catch (Exception e) {
            // Fall through to the generic error below.
        }
        throw new IllegalArgumentException("TaskAttemptId string : " + taskAttemptIDstr
                + " is not properly formed");
    }
}

Suppose there is a table of deduplicated user ids (string type), and a unique numeric seq needs to be generated for each user id:

ADD jar file:///tmp/udf.jar;
CREATE temporary function seq2 as 'com.lxw1234.hive.udf.RowSeq2';
hive> desc lxw_all_ids;
OK
id string
Time taken: 0.074 seconds, Fetched: 1 row(s)
hive> select * from lxw_all_ids limit 5;
OK
01779E7A06ABF5565A4982_cookie
031E2D2408C29556420255_cookie
03371ADA0B6E405806FFCD_cookie
0517C4B701BC1256BFF6EC_cookie
05F12ADE0E880455931C1A_cookie
Time taken: 0.215 seconds, Fetched: 5 row(s)
hive> select count(1) from lxw_all_ids;
253402337
hive> create table lxw_all_ids2 as select id,seq2() as seq from lxw_all_ids;
…
Hadoop job information for Stage-1: number of mappers: 27; number of reducers: 0
…

The Job used 27 Map Tasks and no Reduce, so it produces 27 result files.

Now look at the data in the result table:

hive> select * from lxw_all_ids2 limit 10;
OK
766CA2770527B257D332AA_cookie 1
5A0492DB0000C557A81383_cookie 28
8C06A5770F176E58301EEF_cookie 55
6498F47B0BCAFE5842B83A_cookie 82
6DA33CB709A23758428A44_cookie 109
B766347B0D27925842AC2D_cookie 136
5794357B050C99584251AC_cookie 163
81D67A7B011BEA5842776C_cookie 190
9D2F8EB40AEA525792347D_cookie 217
BD21077B09F9E25844D2C1_cookie 244
hive> select count(1),count(distinct seq) from lxw_all_ids2;
253402337 253402337

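limit 10 only fetched 10 rows from the first result file, that is, the one written by the Map Task whose ID is 0. In that Map, start=1 and increment=27, which yields the IDs shown above.

count(1) and count(distinct seq) return the same value, which shows that seq contains no duplicates. You can also try max(seq): unless the rows happen to be spread perfectly evenly across the Map Tasks, the result will be greater than 253402337, which confirms that seq is a "non-consecutive unique numeric ID".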
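The arithmetic behind these observations can be written out as a short derivation. The notation is introduced here for illustration only and is not from the original article: n is the number of Map Tasks, k the 0-based Map Task index, j the 0-based row index within a task, c_k the number of rows handled by task k, and N the total row count.

```latex
% Notation (n, k, j, c_k, N) is an assumption made for this sketch.
\begin{align*}
\mathrm{seq}(k, j) &= (k + 1) + j\,n, \qquad 0 \le k < n,\; 0 \le j < c_k \\
(\mathrm{seq}(k, j) - 1) \bmod n &= k \quad\Rightarrow\quad \text{IDs from different tasks never collide} \\
\mathrm{seq}(0, j)\big|_{n = 27} &= 1,\ 28,\ 55,\ \dots,\ 244 \qquad (j = 0, \dots, 9) \\
\max \mathrm{seq} &= \max_{k \,:\, c_k > 0} \bigl[(k + 1) + (c_k - 1)\,n\bigr] \;\ge\; N
\end{align*}
```

The last inequality holds because N distinct positive integers cannot all be smaller than N. Equality would require the IDs to be exactly 1 through N, which only happens when the rows are balanced almost perfectly across the tasks.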
