Deterministic Protocol Buffers Serialization

8 January 2017 in software

Protocol Buffers are amazingly useful, even if they aren’t the fastest serialization format on the block. Surprisingly though, they lack one important feature: deterministic serialization. Fortunately, there’s a simple hack to get around this.

Why would you want deterministic serialialization for Protocol Buffers? Well, one obvious reason would be that you might want to attach a digital signature to a Protocol Buffers message. In my case though, I wanted to rsync Protocol Buffers messages as part of my Mutagen project. These messages (representing potentially very large filesystem hierarchies) are just a little too big to send quickly (<1s) over low-bandwidth connections. Moreover, they grow linearly with the filesystem hierarchy size and generally only change in one or two places at most, so rsync was a logical choice of transfer mechanism. Unfortunately, deterministic serialization is essential in order for rsync to have any hope of speeding up the transfer.

Now, there are at least a few existing ways to get deterministic serialization…

First, the particular Protocol Buffers implementation might just support it as an implementation detail, though this is less reliable. This is definitely not the case in Go (which I’m using in Mutagen), where maps (used pervasively in the Protocol Buffers library and message types) intentionally randomize iteration order because reasons.

Second, the C++ Protocol Buffers implementation supports a deterministic serialization API, but this isn’t supported in other implementations at the moment.

Third, for Go users, there’s the “gogo” Protocol Buffers fork, which offers a stable_marshaler extension, and it works great, but it means you have to pull in two Protocol Buffers implementations (adding to binary size), keep them in sync (which is harder than it sounds due to Go’s horrible dependency management), hope the “gogo” maintainers have their fork up-to-date (best of luck if you’re also trying to use gRPC), and make sure the code generator you’re using is compatible with both libraries (which means you have to keep what’s in your $GOPATH/bin directory synchronized with the library code in your project’s vendor directory, which is crazy).

Fortunately, there is an easier (albeit hackish) way…

The main source of non-deterministic serialization in Protocol Buffers is maps. Fields (both non-repeated and repeated) are generally serialized in order. To some extent this assumption is relying on implementation details, but it’s a sensible one for now. Maps, however, vary vastly between languages. They might be hash tables, B-trees, or they could even be linear searches on a key-value array — Protocol Buffers makes no guarantees. In fact, Protocol Buffers might not even use a language’s canonical map type, and for some languages it might not be clear which canonical map type to use (e.g. std::map vs std::unordered_map).

So how do we get around this? Well, the Protocol Buffers documentation has a small but important note that details how maps are serialized: they are simply repeated key-value fields. Thus, given the following Protocol Buffers message definitions, Entry and StableEntry are binary-compatible:

enum EntryKind {
    Directory = 0;
    File = 1;
}

message Entry {
    EntryKind kind = 1;
    bool executable = 2;
    bytes digest = 3;
    map<string, Entry> contents = 4;
}

message StableEntry {
    EntryKind kind = 1;
    bool executable = 2;
    bytes digest = 3;
    repeated StableEntryContent contents = 4;
}

message StableEntryContent {
    string name = 1;
    StableEntry entry = 2;
}

These definitions are exactly what Mutagen uses. It simply converts Entry to StableEntry, sorts the contents field on the name field, marshals, rsyncs, and then unmarshals directly into Entry. This gives deterministic serialization without an additional library, without a significant performance penalty (the conversion time is minimal), and without relying too much on implementation details. Of course, the Go Protocol Buffers implementation could change to randomize field ordering as well, and maybe at that point a switch to “gogo” Protocol Buffers would be warranted, but for now I’m quite happy with this simple hack.