Protobuf in HBase
Detailed guide on how HBase uses Protocol Buffers for serialization, RPC interfaces, and coprocessor endpoints, including shading and versioning considerations.
Protobuf
HBase uses Google's protobufs wherever
it persists metadata — in the tail of hfiles or Cells written by
HBase into the system hbase:meta table or when HBase writes znodes
to zookeeper, etc. — and when it passes objects over the wire making
RPCs. HBase uses protobufs to describe the RPC
Interfaces (Services) we expose to clients, for example the Admin and Client
Interfaces that the RegionServer fields,
or specifying the arbitrary extensions added by developers via our
Coprocessor Endpoint mechanism.
With protobuf, you describe serializations and services in a .proto file.
You then feed these descriptors to a protobuf tool, the protoc binary,
to generate classes that can marshal and unmarshal the described serializations
and field the specified Services.
See the README.txt in the HBase sub-modules for details on how
to run the class generation on a per-module basis;
e.g. see hbase-protocol/README.txt for how to generate protobuf classes
in the hbase-protocol module.
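To give a feel for what "marshal and unmarshal" means here, the following is a minimal hand-written sketch of the kind of work a protoc-generated class does for you. It is not generated code and not HBase's implementation. The class name ServerNameWireSketch and the message layout (host_name = 1 as a string, port = 2 and start_code = 3 as unsigned varints) are assumptions made for illustration. In real code you would always use the classes that protoc generates.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Illustration only: a hand-rolled stand-in for a protoc-generated class. */
public class ServerNameWireSketch {
  /** Assumed layout: host_name = 1 (string), port = 2 (uint32), start_code = 3 (uint64). */
  public static final class Decoded {
    public String hostName = "";
    public int port;
    public long startCode;
  }

  // Base-128 varint: 7 payload bits per byte, high bit set on every byte but the last.
  static void writeVarint(ByteArrayOutputStream out, long value) {
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80));
      value >>>= 7;
    }
    out.write((int) value);
  }

  static long readVarint(byte[] buf, int[] pos) {
    long result = 0;
    int shift = 0;
    while (true) {
      byte b = buf[pos[0]++];
      result |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return result;
      }
      shift += 7;
    }
  }

  /** Marshal: each field is written as a tag (fieldNumber << 3 | wireType) plus its value. */
  public static byte[] encode(String hostName, int port, long startCode) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] host = hostName.getBytes(StandardCharsets.UTF_8);
    writeVarint(out, (1 << 3) | 2); // field 1, wire type 2 (length-delimited)
    writeVarint(out, host.length);
    out.write(host, 0, host.length);
    writeVarint(out, (2 << 3) | 0); // field 2, wire type 0 (varint)
    writeVarint(out, port & 0xFFFFFFFFL);
    writeVarint(out, (3 << 3) | 0); // field 3, wire type 0 (varint)
    writeVarint(out, startCode);
    return out.toByteArray();
  }

  /** Unmarshal: read tags until the buffer is exhausted and dispatch on the field number. */
  public static Decoded decode(byte[] buf) {
    Decoded d = new Decoded();
    int[] pos = {0};
    while (pos[0] < buf.length) {
      long tag = readVarint(buf, pos);
      int field = (int) (tag >>> 3);
      int wireType = (int) (tag & 7);
      if (field == 1 && wireType == 2) {
        int len = (int) readVarint(buf, pos);
        d.hostName = new String(buf, pos[0], len, StandardCharsets.UTF_8);
        pos[0] += len;
      } else if (field == 2 && wireType == 0) {
        d.port = (int) readVarint(buf, pos);
      } else if (field == 3 && wireType == 0) {
        d.startCode = readVarint(buf, pos);
      } else {
        // A real parser skips unknown fields; this sketch simply rejects them.
        throw new IllegalArgumentException("unexpected field " + field);
      }
    }
    return d;
  }

  public static void main(String[] args) {
    Decoded d = decode(encode("rs1.example.org", 16020, 1700000000000L));
    System.out.println(d.hostName + "," + d.port + "," + d.startCode);
  }
}
```

The generated classes additionally handle unknown fields, required-field checks, builders, and the Service stubs, which is why hand-writing this is not something you would do in practice.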
In HBase, .proto files live in one of two places. The common proto files,
together with the protoc-generated classes that HBase uses internally for
serializing metadata, are in the hbase-protocol module, which is dedicated
to hosting them. Extensions to HBase such as REST or Coprocessor Endpoints
that need their own descriptors keep their protos inside the module that
hosts the function: e.g. hbase-rest is home to the REST proto files, and the
hbase-rsgroup table grouping Coprocessor Endpoint has all protos that have
to do with table grouping.
Protos are hosted by the module that makes use of them. This means generation of protobuf classes is distributed, done per module, but we do it this way so that each module encapsulates everything to do with the functionality it brings to HBase.
Extensions, whether REST or Coprocessor Endpoints, will make use of core HBase protos found back in the hbase-protocol module. They'll use these core protos when they want to serialize a Cell or a Put or refer to a particular node via ServerName, etc., as part of providing the CPEP Service. Going forward, after the release of hbase-2.0.0, this practice needs to wither. We'll explain why in the later hbase-2.0.0 section.
hbase-2.0.0 and the shading of protobufs (HBASE-15638)
As of hbase-2.0.0, our protobuf usage gets a little more involved. HBase core protobuf references are offset so as to refer to a private, bundled protobuf. Core stops referring to protobuf classes at com.google.protobuf.* and instead references protobuf at the HBase-specific offset org.apache.hadoop.hbase.shaded.com.google.protobuf.*. We do this indirection so hbase core can evolve its protobuf version independent of whatever our dependencies rely on. For instance, HDFS serializes using protobuf. HDFS is on our CLASSPATH. Without the above described indirection, our protobuf versions would have to align. HBase would be stuck on the HDFS protobuf version until HDFS decided to upgrade. HBase and HDFS versions would be tied.
We had to move on from protobuf-2.5.0 because we need facilities added in protobuf-3.1.0; in particular, the ability to save on copies and to avoid bringing protobufs on-heap for serialization/deserialization.
In hbase-2.0.0, we introduced a new module, hbase-protocol-shaded
inside which we contained all to do with protobuf and its subsequent
relocation/shading. This module is in essence a copy of much of the old
hbase-protocol but with an extra shading/relocation step.
Core was moved to depend on this new module.
That said, a complication arises around Coprocessor Endpoints (CPEPs).
CPEPs depend on public HBase APIs that reference protobuf classes at
com.google.protobuf.* explicitly. For example, in our Table Interface
we have the below as the means by which you obtain a CPEP Service
to make invocations against:
...
<T extends com.google.protobuf.Service,R> Map<byte[],R> coprocessorService(
Class<T> service, byte[] startKey, byte[] endKey,
org.apache.hadoop.hbase.client.coprocessor.Batch.Call<T,R> callable)
throws com.google.protobuf.ServiceException, Throwable

Existing CPEPs will have made reference to core HBase protobufs
specifying ServerNames or carrying Mutations.
So as to continue being able to service CPEPs and their references
to com.google.protobuf.* across the upgrade to hbase-2.0.0 and beyond,
HBase needs to be able to deal with both
com.google.protobuf.* references and its internal offset
org.apache.hadoop.hbase.shaded.com.google.protobuf.* protobufs.
The hbase-protocol-shaded module hosts all
protobufs used by HBase core.
But for the vestigial CPEP references to the (non-shaded) content of
hbase-protocol, we keep around most of this module going forward
just so it is available to CPEPs. Retaining most of hbase-protocol
makes for overlapping, 'duplicated' proto instances where some exist as
non-shaded/non-relocated here in their old module
location but also in the new location, shaded under
hbase-protocol-shaded. In other words, there is an instance
of the generated protobuf class
org.apache.hadoop.hbase.protobuf.generated.ServerName
in hbase-protocol and another generated instance that is the same in all
regards except its protobuf references are to the internal shaded
version at org.apache.hadoop.hbase.shaded.protobuf.generated.ServerName
(note the 'shaded' addition in the middle of the package name).
If you extend a proto in hbase-protocol-shaded for internal use,
consider extending it also in
hbase-protocol (and regenerating).
Going forward, we will provide a new module of common types for use by CPEPs that will have the same guarantees against change as does our public API. TODO.
protobuf changes for hbase-3.0.0 (HBASE-23797)
Since Hadoop (starting from 3.3.x) also shades protobuf and bumps its version to 3.x, there is no reason for us to stay on protobuf 2.5.0 any more.
In HBase 3.0.0, the hbase-protocol module has been purged. CPEP implementations should use the protos in the hbase-protocol-shaded module and the shaded protobuf in hbase-thirdparty. In general, we will keep the protobuf version compatible for a whole major release unless there are critical problems, for example, a critical CVE in protobuf.
Add this dependency to your pom:
<dependency>
<groupId>org.apache.hbase.thirdparty</groupId>
<artifactId>hbase-shaded-protobuf</artifactId>
<!-- use the version that your target hbase cluster uses -->
<version>${hbase-thirdparty.version}</version>
<scope>provided</scope>
</dependency>

And typically you also need to add this plugin to your pom to make your generated protobuf code also use the shaded and relocated protobuf version in hbase-thirdparty.
<plugin>
<groupId>com.google.code.maven-replacer-plugin</groupId>
<artifactId>replacer</artifactId>
<version>1.5.3</version>
<executions>
<execution>
<phase>process-sources</phase>
<goals>
<goal>replace</goal>
</goals>
</execution>
</executions>
<configuration>
<basedir>${basedir}/target/generated-sources/</basedir>
<includes>
<include>**/*.java</include>
</includes>
<!-- Ignore errors when missing files, because it means this build
was run with -Dprotoc.skip and there is no -Dreplacer.skip -->
<ignoreErrors>true</ignoreErrors>
<replacements>
<replacement>
<token>([^\.])com.google.protobuf</token>
<value>$1org.apache.hbase.thirdparty.com.google.protobuf</value>
</replacement>
<replacement>
<token>(public)(\W+static)?(\W+final)?(\W+class)</token>
<value>@javax.annotation.Generated("proto") $1$2$3$4</value>
</replacement>
<!-- replacer doesn't support anchoring or negative lookbehind -->
<replacement>
<token>(@javax.annotation.Generated\("proto"\) ){2}</token>
<value>$1</value>
</replacement>
</replacements>
</configuration>
</plugin>

In the hbase-examples module, we have some examples under the
org.apache.hadoop.hbase.coprocessor.example package. See
BulkDeleteEndpoint and BulkDelete.proto for more details, and check
the pom.xml of the hbase-examples module to see how to make use of the above
plugin.
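To see what the three replacements configured above do to generated sources, here is a small sketch that applies the same tokens and values with java.util.regex, in the order they are listed. It assumes the replacer plugin treats each token as a Java regular expression and applies the replacements in sequence. The class name ReplacerSketch and the sample source text are hypothetical.

```java
import java.util.regex.Pattern;

/** Illustration only: mimics the three replacer rules from the pom on an in-memory string. */
public class ReplacerSketch {
  // {token, value} pairs copied from the plugin configuration.
  private static final String[][] RULES = {
    // Relocate protobuf references; the leading [^\.] skips names that are already relocated.
    {"([^\\.])com.google.protobuf", "$1org.apache.hbase.thirdparty.com.google.protobuf"},
    // Mark every public class declaration as generated.
    {"(public)(\\W+static)?(\\W+final)?(\\W+class)",
        "@javax.annotation.Generated(\"proto\") $1$2$3$4"},
    // Collapse a doubled annotation, which appears if the rules run twice on the same file.
    {"(@javax.annotation.Generated\\(\"proto\"\\) ){2}", "$1"},
  };

  public static String rewrite(String source) {
    String result = source;
    for (String[] rule : RULES) {
      result = Pattern.compile(rule[0]).matcher(result).replaceAll(rule[1]);
    }
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical fragment standing in for protoc output.
    String generated =
        "package example;\n\n"
            + "public final class ExampleProtos {\n"
            + "  com.google.protobuf.ByteString payload;\n"
            + "}\n";
    System.out.println(rewrite(generated));
  }
}
```

Because the first rule ignores references preceded by a dot and the third rule collapses duplicate annotations, running the rewrite a second time over already-rewritten sources leaves them unchanged. This matters when the replacer runs without the sources being regenerated first.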