Protocol Buffers are amazingly useful, even if they aren’t the fastest serialization format on the block. Surprisingly though, they lack one important feature: deterministic serialization. Fortunately, there’s a simple hack to get around this.
Why would you want deterministic serialialization for Protocol Buffers? Well, one obvious reason would be that you might want to attach a digital signature to a Protocol Buffers message. In my case though, I wanted to rsync Protocol Buffers messages as part of my Mutagen project. These messages (representing potentially very large filesystem hierarchies) are just a little too big to send quickly (<1s) over low-bandwidth connections. Moreover, they grow linearly with the filesystem hierarchy size and generally only change in one or two places at most, so rsync was a logical choice of transfer mechanism. Unfortunately, deterministic serialization is essential in order for rsync to have any hope of speeding up the transfer.
Now, there are at least a few existing ways to get deterministic serialization…
First, the particular Protocol Buffers implementation might just support it as an implementation detail, though this is less reliable. This is definitely not the case in Go (which I’m using in Mutagen), where maps (used pervasively in the Protocol Buffers library and message types) intentionally randomize iteration order because reasons.
Second, the C++ Protocol Buffers implementation supports a deterministic serialization API, but this isn’t supported in other implementations at the moment.
Third, for Go users, there’s the
“gogo” Protocol Buffers fork, which offers a
stable_marshaler
extension, and it
works great, but it means you have to pull in two Protocol Buffers
implementations (adding to binary size), keep them in sync (which is harder than
it sounds due to Go’s horrible dependency management), hope the “gogo”
maintainers have their fork up-to-date (best of luck if you’re also trying to
use gRPC), and make sure the code generator you’re using is compatible with both
libraries (which means you have to keep what’s in your $GOPATH/bin
directory
synchronized with the library code in your project’s vendor
directory, which
is crazy).
Fortunately, there is an easier (albeit hackish) way…
The main source of non-deterministic serialization in Protocol Buffers is maps.
Fields (both non-repeated and repeated) are generally serialized in order. To
some extent this assumption is relying on implementation details, but it’s a
sensible one for now. Maps, however, vary vastly between languages. They might
be hash tables, B-trees, or they could even be linear searches on a key-value
array — Protocol Buffers makes no guarantees. In fact, Protocol Buffers might
not even use a language’s canonical map type, and for some languages it might
not be clear which canonical map type to use (e.g. std::map
vs
std::unordered_map
).
So how do we get around this? Well, the Protocol Buffers documentation has a
small but important note
that details how maps are serialized: they are simply repeated key-value fields.
Thus, given the following Protocol Buffers message definitions, Entry
and
StableEntry
are binary-compatible:
enum EntryKind {
Directory = 0;
File = 1;
}
message Entry {
EntryKind kind = 1;
bool executable = 2;
bytes digest = 3;
map<string, Entry> contents = 4;
}
message StableEntry {
EntryKind kind = 1;
bool executable = 2;
bytes digest = 3;
repeated StableEntryContent contents = 4;
}
message StableEntryContent {
string name = 1;
StableEntry entry = 2;
}
These definitions are exactly what Mutagen uses. It simply converts Entry
to
StableEntry
, sorts the contents
field on the name
field, marshals, rsyncs,
and then unmarshals directly into Entry
. This gives deterministic
serialization without an additional library, without a significant performance
penalty (the conversion time is minimal), and without relying too much on
implementation details. Of course, the Go Protocol Buffers implementation could
change to randomize field ordering as well, and maybe at that point a switch to
“gogo” Protocol Buffers would be warranted, but for now I’m quite happy with
this simple hack.