# Bulk Data Generator Tool

This is a random data generator tool for HBase tables leveraging HBase bulk load. It can create a pre-split HBase table, and the generated data is uniformly distributed across all regions of the table.
## Usage

```
usage: hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool <OPTIONS> [-D<property=value>]*
 -d,--delete-if-exist        If set, the table will be deleted if it already exists.
 -h,--help                   Show the help message for the tool.
 -mc,--mapper-count <arg>    The number of mapper containers to be launched.
 -o,--table-options <arg>    Table options to be set while creating the table.
 -r,--rows-per-mapper <arg>  The number of rows to be generated PER mapper.
 -sc,--split-count <arg>     The number of regions/pre-splits to be created for the table.
 -t,--table <arg>            The name of the table for which data is to be generated.
```

Examples:

```
hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool -t TEST_TABLE -mc 10 -r 100 -sc 10
hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool -t TEST_TABLE -mc 10 -r 100 -sc 10 -d -o "BACKUP=false,NORMALIZATION_ENABLED=false"
hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool -t TEST_TABLE -mc 10 -r 100 -sc 10 -Dmapreduce.map.memory.mb=8192
```

## Overview
### Table Schema

The tool generates an HBase table with a single column family, `cf`, and 9 columns:

```
ORG_ID, TOOL_EVENT_ID, EVENT_ID, VEHICLE_ID, SPEED, LATITUDE, LONGITUDE, LOCATION, TIMESTAMP
```

with the row key as

```
<TOOL_EVENT_ID>:<ORGANIZATION_ID>
```

### Table Creation
The tool creates a pre-split HBase table with "split-count" splits (i.e. split-count + 1 regions), whose boundaries are sequential 6-digit prefixes. For example, if a table is generated with a "split-count" of 10, it will have (10 + 1) regions with the following start-end keys:

```
(-000001, 000001-000002, 000002-000003, ..., 000009-000010, 000010-)
```
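The boundary-key scheme above can be sketched in a few lines of Java (an illustration only, not the tool's actual code; `generateSplitKeys` is a hypothetical helper):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitKeySketch {
    // Returns the "split-count" boundary keys 000001 .. <splitCount>,
    // each zero-padded to 6 digits. HBase creates one more region than
    // there are boundary keys, which is where splitCount + 1 comes from.
    static List<String> generateSplitKeys(int splitCount) {
        List<String> keys = new ArrayList<>();
        for (int i = 1; i <= splitCount; i++) {
            keys.add(String.format("%06d", i));
        }
        return keys;
    }

    public static void main(String[] args) {
        // For split-count = 10: boundaries 000001 .. 000010, i.e. 11 regions
        // (-000001), (000001-000002), ..., (000010-).
        System.out.println(generateSplitKeys(10));
    }
}
```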
### Data Generation

The tool creates and runs an MR job to generate the HFiles, which are bulk loaded into the table regions via `org.apache.hadoop.hbase.tool.BulkLoadHFilesTool`.

The number of mappers is given by the "mapper-count" input. Each mapper generates "rows-per-mapper" rows. `org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorRecordReader` ensures that each record generated by a mapper is associated with an index (added to the key) ranging from 1 to "rows-per-mapper".

The TOOL_EVENT_ID column of each row has a 6-digit prefix computed as

```
(index) mod ("split-count" + 1)
```

For example, if each mapper is to generate 10 records and "split-count" is 4, the TOOL_EVENT_ID of each record will have the following prefix:
| Record Index | TOOL_EVENT_ID's first six characters |
|---|---|
| 1 | 000001 |
| 2 | 000002 |
| 3 | 000003 |
| 4 | 000004 |
| 5 | 000000 |
| 6 | 000001 |
| 7 | 000002 |
| 8 | 000003 |
| 9 | 000004 |
| 10 | 000000 |
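The prefix formula above can be sketched as follows (a minimal illustration; `toolEventIdPrefix` is a hypothetical name, not part of the tool's API):

```java
public class PrefixSketch {
    // 6-digit TOOL_EVENT_ID prefix for a record: (index) mod (splitCount + 1).
    static String toolEventIdPrefix(int index, int splitCount) {
        return String.format("%06d", index % (splitCount + 1));
    }

    public static void main(String[] args) {
        // Applies the formula for 10 records per mapper and split-count = 4.
        for (int index = 1; index <= 10; index++) {
            System.out.println(index + " -> " + toolEventIdPrefix(index, 4));
        }
    }
}
```

Because the prefixes cycle through 0 .. split-count, consecutive record indexes map round-robin onto the region boundaries.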
Since TOOL_EVENT_ID is the first attribute of the row key, and the region boundaries are likewise sequential 6-digit prefixes, each mapper generates (nearly) the same number of rows for every region, making the data uniformly distributed. The TOOL_EVENT_ID suffix and the other columns of each row are populated with random data.
The total number of rows generated is

```
rows-per-mapper * mapper-count
```

and the size of each row is approximately 850 B.

## Experiments
These results are from an 11-node cluster running HBase and Hadoop services in a self-managed test environment:
| Data Size | Time to Generate Data |
|---|---|
| 100 GB | 6 minutes |
| 340 GB | 13 minutes |
| 3.5 TB | 3 hours 10 minutes |
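As a rough sanity check, the ~850 B row size quoted earlier lets one back-calculate the approximate row counts behind these runs (an estimate only; the helper name is illustrative):

```java
public class SizeEstimate {
    // Approximate number of rows needed to reach a target data size,
    // assuming ~850 bytes per row as stated above.
    static long approxRows(long totalBytes) {
        return totalBytes / 850L;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024L * 1024L;
        // ~100 GB at ~850 B/row is roughly 126 million rows, e.g.
        // -mc 100 -r 1300000 (illustrative parameters, not from the source).
        System.out.println(approxRows(100L * gb));
    }
}
```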