Bulk Loading

Overview

HBase includes several methods of loading data into tables. The most straightforward method is to either use the TableOutputFormat class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.

The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly load the generated StoreFiles into a running cluster. Using bulk load will use less CPU and network resources than loading via the HBase API.

Bulk Load Architecture

The HBase bulk load process consists of two main steps.

Preparing data via a MapReduce job

The first step of a bulk load is to generate HBase data files (StoreFiles) from a MapReduce job using HFileOutputFormat2. This output format writes out data in HBase's internal storage format so that they can be later loaded efficiently into the cluster.

In order to function efficiently, HFileOutputFormat2 must be configured such that each output HFile fits within a single region. In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.

HFileOutputFormat2 includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the current region boundaries of a table.

Completing the data load

After a data import has been prepared, either by using the importtsv tool with the “importtsv.bulk.output” option or by some other MapReduce job using the HFileOutputFormat, the completebulkload tool is used to import the data into the running cluster. This command line tool iterates through the prepared data files, and for each one determines the region the file belongs to. It then contacts the appropriate RegionServer which adopts the HFile, moving it into its storage directory and making the data available to clients.

If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.

$ hadoop jar hbase-mapreduce-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable

The -c config-file option can be used to specify a file containing the appropriate hbase parameters (e.g., hbase-site.xml) if not supplied already on the CLASSPATH (In addition, the CLASSPATH must contain the directory that has the zookeeper configuration file if zookeeper is NOT managed by HBase).

Advanced Usage

Although the importtsv tool is useful in many cases, advanced users may want to generate data programmatically, or import data from other formats. To get started doing so, dig into ImportTsv.java and check the JavaDoc for HFileOutputFormat.

The import step of the bulk load can also be done programmatically. See the LoadIncrementalHFiles class for more information.

'Adopting' Stray Data

Should an HBase cluster lose account of regions or files during an outage or error, you can use the completebulkload tool to add back the dropped data. HBase operator tooling such as HBCK2 or the reporting added to the Master's UI under the HBCK Report (Since HBase 2.0.6/2.1.6/2.2.1) can identify such 'orphan' directories.

Before you begin the 'adoption', ensure the hbase:meta table is in a healthy state. Run the CatalogJanitor by executing the catalogjanitor_run command on the HBase shell. When finished, check the HBCK Report page on the Master UI. Work on fixing any inconsistencies, holes, or overlaps found before proceeding. The hbase:meta table is the authority on where all data is to be found and must be consistent for the completebulkload tool to work properly.

The completebulkload tool takes a directory and a tablename. The directory has subdirectories named for column families of the targeted tablename. In these subdirectories are hfiles to load. Given this structure, you can pass errant region directories (and the table name to which the region directory belongs) and the tool will bring the data files back into the fold by moving them under the approprate serving directory. If stray files, then you will need to mock up this structure before invoking the completebulkload tool; you may have to look at the file content using the HFile Tool to see what the column family to use is. When the tool completes its run, you will notice that the source errant directory has had its storefiles moved/removed. It is now desiccated since its data has been drained, and the pointed-to directory can be safely removed. It may still have .regioninfo files and other subdirectories but they are of no relevance now (There may be content still under the recovered_edits directory; a TODO is tooling to replay the content of recovered_edits if needed; see Add RecoveredEditsPlayer). If you pass completebulkload a directory without store files, it will run and note the directory is storefile-free. Just remove such 'empty' directories.

For example, presuming a directory at the top level in HDFS named eb3352fb5c9c9a05feeb2caba101e1cc has data we need to re-add to the HBase TestTable:

$ ${HBASE_HOME}/bin/hbase --config ~/hbase-conf completebulkload hdfs://server.example.org:9000/eb3352fb5c9c9a05feeb2caba101e1cc TestTable

After it successfully completes, any files that were in eb3352fb5c9c9a05feeb2caba101e1cc have been moved under hbase and the eb3352fb5c9c9a05feeb2caba101e1cc directory can be deleted (Check content before and after by running ls -r on the HDFS directory).

Bulk Loading Replication

HBASE-13153 adds replication support for bulk loaded HFiles, available since HBase 1.3/2.0. This feature is enabled by setting hbase.replication.bulkload.enabled to true (default is false). You also need to copy the source cluster configuration files to the destination cluster.

Additional configurations are required too:

hbase.replication.source.fs.conf.provider
This defines the class which loads the source cluster file system client configuration in the destination cluster. This should be configured for all the RS in the destination cluster. Default is org.apache.hadoop.hbase.replication.regionserver.DefaultSourceFSConfigurationProvider.
hbase.replication.conf.dir
This represents the base directory where the file system client configurations of the source cluster are copied to the destination cluster. This should be configured for all the RS in the destination cluster. Default is $HBASE_CONF_DIR.
hbase.replication.cluster.id
This configuration is required in the cluster where replication for bulk loaded data is enabled. A source cluster is uniquely identified by the destination cluster using this id. This should be configured for all the RS in the source cluster configuration file for all the RS.

For example: If source cluster FS client configurations are copied to the destination cluster under directory /home/user/dc1/, then hbase.replication.cluster.id should be configured as dc1 and hbase.replication.conf.dir as /home/user.

DefaultSourceFSConfigurationProvider supports only xml type files. It loads source cluster FS client configuration only once, so if source cluster FS client configuration files are updated, every peer(s) cluster RS must be restarted to reload the configuration.