# Bulk Data Generator Tool

This is a random data generator tool for HBase tables leveraging HBase bulk load. It can create a pre-split HBase table, and the generated data is uniformly distributed across all regions of the table.
## Usage

```
usage: hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool <OPTIONS> [-D<property=value>]*
 -d,--delete-if-exist        If set, the table will be deleted if it already exists.
 -h,--help                   Show the help message for the tool.
 -mc,--mapper-count <arg>    The number of mapper containers to be launched.
 -o,--table-options <arg>    Table options to be set while creating the table.
 -r,--rows-per-mapper <arg>  The number of rows to be generated PER mapper.
 -sc,--split-count <arg>     The number of regions/pre-splits to be created for the table.
 -t,--table <arg>            The name of the table for which data is to be generated.
```

Examples:

```
hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool -t TEST_TABLE -mc 10 -r 100 -sc 10
hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool -t TEST_TABLE -mc 10 -r 100 -sc 10 -d -o "BACKUP=false,NORMALIZATION_ENABLED=false"
hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool -t TEST_TABLE -mc 10 -r 100 -sc 10 -Dmapreduce.map.memory.mb=8192
```

## Overview
### Table Schema

The tool generates an HBase table with a single column family, `cf`, and 9 columns:

```
ORG_ID, TOOL_EVENT_ID, EVENT_ID, VEHICLE_ID, SPEED, LATITUDE, LONGITUDE, LOCATION, TIMESTAMP
```

with the row key as

```
<TOOL_EVENT_ID>:<ORGANIZATION_ID>
```

### Table Creation
The tool creates a pre-split HBase table with "split-count" splits (i.e. split-count + 1 regions), whose boundaries are sequential 6-digit prefixes. For example, if a table is generated with a "split-count" of 10, it will have (10 + 1) regions with the following start-end keys:

```
(-000001, 000001-000002, 000002-000003, ..., 000009-000010, 000010-)
```
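The boundary-key scheme above can be sketched in a few lines of Java (an illustration only, not the tool's actual code; `generateSplitKeys` is a hypothetical helper):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitKeySketch {
    // Returns the "split-count" boundary keys 000001 .. <splitCount>,
    // each zero-padded to 6 digits. HBase creates one more region than
    // there are boundary keys, which is where splitCount + 1 comes from.
    static List<String> generateSplitKeys(int splitCount) {
        List<String> keys = new ArrayList<>();
        for (int i = 1; i <= splitCount; i++) {
            keys.add(String.format("%06d", i));
        }
        return keys;
    }

    public static void main(String[] args) {
        // For split-count = 10: boundaries 000001 .. 000010, i.e. 11 regions
        // (-000001), (000001-000002), ..., (000010-).
        System.out.println(generateSplitKeys(10));
    }
}
```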
### Data Generation

The tool creates and runs an MR job to generate the HFiles, which are bulk loaded into the table regions via `org.apache.hadoop.hbase.tool.BulkLoadHFilesTool`.

The number of mappers is given by the "mapper-count" input. Each mapper generates "rows-per-mapper" rows. `org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorRecordReader` ensures that each record generated by a mapper is associated with an index (added to the key) ranging from 1 to "rows-per-mapper".

The TOOL_EVENT_ID column of each row has a 6-digit prefix computed as

```
(index) mod ("split-count" + 1)
```

For example, if each mapper is to generate 10 records and "split-count" is 4, the TOOL_EVENT_ID of each record will have the following prefix:
| Record Index | TOOL_EVENT_ID's first six characters |
|---|---|
| 1 | 000001 |
| 2 | 000002 |
| 3 | 000003 |
| 4 | 000004 |
| 5 | 000000 |
| 6 | 000001 |
| 7 | 000002 |
| 8 | 000003 |
| 9 | 000004 |
| 10 | 000000 |
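The prefix formula above can be sketched as follows (a minimal illustration; `toolEventIdPrefix` is a hypothetical name, not part of the tool's API):

```java
public class PrefixSketch {
    // 6-digit TOOL_EVENT_ID prefix for a record: (index) mod (splitCount + 1).
    static String toolEventIdPrefix(int index, int splitCount) {
        return String.format("%06d", index % (splitCount + 1));
    }

    public static void main(String[] args) {
        // Applies the formula for 10 records per mapper and split-count = 4.
        for (int index = 1; index <= 10; index++) {
            System.out.println(index + " -> " + toolEventIdPrefix(index, 4));
        }
    }
}
```

Because the prefixes cycle through 0 .. split-count, consecutive record indexes map round-robin onto the region boundaries.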
Since TOOL_EVENT_ID is the first attribute of the row key, and the region boundaries are likewise sequential 6-digit prefixes, each mapper generates (nearly) the same number of rows for every region, making the data uniformly distributed. The TOOL_EVENT_ID suffix and the other columns of each row are populated with random data.
The total number of rows generated is

```
rows-per-mapper * mapper-count
```

and the size of each row is approximately 850 B.

## Experiments
These results are from an 11-node cluster running HBase and Hadoop services in a self-managed test environment:
| Data Size | Time to Generate Data |
|---|---|
| 100 GB | 6 minutes |
| 340 GB | 13 minutes |
| 3.5 TB | 3 hours 10 minutes |
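As a rough sanity check, the ~850 B row size quoted earlier lets one back-calculate the approximate row counts behind these runs (an estimate only; the helper name is illustrative):

```java
public class SizeEstimate {
    // Approximate number of rows needed to reach a target data size,
    // assuming ~850 bytes per row as stated above.
    static long approxRows(long totalBytes) {
        return totalBytes / 850L;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024L * 1024L;
        // ~100 GB at ~850 B/row is roughly 126 million rows, e.g.
        // -mc 100 -r 1300000 (illustrative parameters, not from the source).
        System.out.println(approxRows(100L * gb));
    }
}
```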