Inside Airgeek: Designing a Custom Binary Encoding and Compression
In the previous post of the Inside Airgeek series, we took a look at how we make sure that our indoor air quality monitor (which is coming soon) derives accurate timing for offline measurements in concert with the mobile app. Because the main focus was on the intricacies of accurate timekeeping, we glossed over many interesting details, such as how all that data gets encoded.
When you fetch data from your device via NFC, a number of objects are exchanged between your device and the smartphone app:
- A request to fetch data (FetchDataReq) describing the span of data that is being requested and a timing cue.
- A response (FetchDataResp) containing a timing cue and a range of data points overlapping the requested range.
Naturally, these objects have an in-memory representation. For example, there is a struct ag_fetch_data_resp:
```c
struct ag_fetch_data_resp {
	seq_t    start;   // sequence number of the first datapoint
	seq_t    end;     // sequence number one past the last datapoint (exclusive bound)
	uint32_t unix;    // Unix timestamp at `seconds`
	uint32_t seconds; // seconds at Unix timestamp
	struct datapoint data[]; // datapoints follow in memory
};
```
But in an NFC transfer, or any transfer for that matter, bytes are transferred, not the C data structures directly.1 Thus the sender (Airgeek) needs to encode this structure as a sequence of bytes, and the recipient (the smartphone app) needs to decode it into its own representation that is easy to work with, such as a Kotlin data class on Android.
For Airgeek, we went with a custom binary encoding. To see why, let’s start with a simple but inefficient encoding and slowly work our way towards the custom encoding we designed.
Step #0: JavaScript Object Notation (JSON)
Let’s start simply with JSON, a ubiquitous text encoding for structured data. I often like to quickly whip up a solution to a problem, even if it’s obviously suboptimal, and work from there. Not only is that a good way to get the problem solving juices flowing, it also helps me avoid optimizing what doesn’t need optimizing in the first place.
In JSON, a FetchDataResp might be2 encoded as:
```
{
  "startSeq": 212442,
  "endSeq": 212496,
  "unix": 1750454932,
  "seconds": 204230,
  "datapoints": [
    {
      "seconds": 1750447814,
      "t": 20.7,
      "rh": 45.3,
      "p": 998.1,
      "co2": 768,
      "no2": 15,
      "co": 0.1,
      "pm1": 1,
      "pm2p5": 3,
      "pm10": 7
    }
    /* ... */
  ]
}
```
This data representation has several advantages:
- It’s infinitely flexible. You can add, remove and reorder many of the members without having to modify the encoder or the decoder. This is probably why JSON became the representation of choice for web APIs—it makes it very easy to achieve forward and backward compatibility.
- It comes with standards, tools and libraries. JSON is described in RFC 8259. It comes with powerful tools such as jq and has support in the standard libraries of many popular languages such as Python or Go.
Unfortunately, it also has some very serious drawbacks:
- It isn’t space efficient. The example above uses almost 300 bytes to encode what is at most 56 bytes of “real” data: the 14×32-bit numbers. As we’ll see, this isn’t a problem with JSON itself, but rather with the chosen data structure.

  This is a dealbreaker, because the nominal NFC transfer speed of most smartphones is 53 kbit/s ≈ 6.6 kB/s, or only about 20 of these objects per second. In reality, the transfer speeds are even lower.
- It’s quite slow to encode and decode. As most of the data we tend to encode is just numbers, a lot of time is spent converting between the native binary representation of numbers and their text representation. This extra processing translates to power wasted, and in a battery-operated device, power comes at a premium.
- It is far from trivial to encode and decode JSON. A fast, reliable JSON decoder isn’t something you typically want to implement yourself. And while good JSON processing libraries certainly exist, we wanted to avoid any external dependencies for this project.3
At around 300 bytes, this is clearly way too much. So let’s get to work.
Step #1: Flattening the JSON
It’s easy to see from the example above that a lot of space is wasted on the “structural” parts, such as the dictionaries and their keys. A much more concise JSON encoding of that same data would be:
```
[212442,212496,1750454932,204230,[[1750447814,20.7,45.3,998.1,768,15,0.1,1,3,7], /* ... */]]
```
This gets us down to 81 bytes, which is significantly better, but still at the very least 46 % bigger than it needs to be, and we have lost a lot of the flexibility that made the original encoding appealing.
But did we really need all that flexibility in the first place? Not really. While we do need to account for future modifications to the protocol—and we’ll add a few well-defined extension points later on—the protocol is unlikely to undergo significant development.
In fact, we can get rid of even more of the remaining structure:
```
[212442,212496,1750454932,204230,1750447814,20.7,45.3,998.1,768,15,0.1,1,3,7, /* ... */]
```
That’s 78 bytes, still far from optimal.
Step #2: Integers only
Before we venture into the realm of binary encodings, let’s simplify things a bit. Real-world data such as temperature or humidity aren’t whole numbers; they are decimal numbers. A natural machine representation for decimal numbers on an Arm Cortex-M4F would be IEEE 754 single-precision (32-bit) floats.
But compared to the representation of integers, the IEEE 754 representation of floating-point numbers is much more complicated to work with. And since we expect most of the savings to come from how we represent the numeric values, we can make our lives a lot easier by keeping their representation as easy to work with as possible.
So let’s work with fixed-point values instead: to encode “20.7” with one decimal place, we can encode “207/10”. The divide-by-ten is pre-agreed upon by the protocol and implicit, only “207” needs to be transferred. Similarly, we can encode a relative humidity of “48 %” with 1/2 % precision as “96/2”, with only “96” appearing on the wire.
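As a quick sketch of the idea (the helper names here are hypothetical, for illustration only, not Airgeek’s actual code), the conversion between a sensor reading and its fixed-point wire value might look like this:

```c
#include <stdint.h>

// Fixed-point sketch: the scale factor (1/10 degC) is agreed upon by the
// protocol and implicit, so only the scaled integer goes on the wire.
static inline int16_t t_to_wire(float t_degc)
{
	// Round half away from zero to the nearest 0.1 degC.
	return (int16_t)(t_degc * 10.0f + (t_degc >= 0.0f ? 0.5f : -0.5f));
}

static inline float t_from_wire(int16_t v)
{
	return (float)v / 10.0f;
}
```

Note that the rounding happens once, at encode time; from then on, all arithmetic on the value is plain integer arithmetic.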
Two’s complement arithmetic: a small detour
To represent negative integers, we’ll be using two’s complement, a simple but clever representation based on modular arithmetic. In two’s complement, to represent a k-bit integer −a, we take the representation of a (that’s just a in binary), perform a bitwise negation, and add one. For example, to represent −42 as a 32-bit signed two’s complement value:
- (42)₁₀ = (00000000 00000000 00000000 00101010)₂
- Bitwise negation gives (11111111 11111111 11111111 11010101)₂
- Adding one gives (11111111 11111111 11111111 11010110)₂
A natural thing to ask is why should this particular representation of negative numbers be of any use? Consider what the number b looks like when written as a k-bit binary number:

b = b_{k−1}·2^{k−1} + ⋯ + b_1·2 + b_0

Consequently, the bitwise negation of b, ~b, is:

~b = (1 − b_{k−1})·2^{k−1} + ⋯ + (1 − b_1)·2 + (1 − b_0)

That’s because subtracting the value of a single bit from 1 effectively flips that bit4. Then, we can distribute and write:

~b = (2^{k−1} + ⋯ + 2 + 1) − (b_{k−1}·2^{k−1} + ⋯ + b_1·2 + b_0) = (2^k − 1) − b

Rearranging gives:

~b + 1 = 2^k − b

But since unsigned k-bit addition wraps around every 2^k, we get that:

a + (~b + 1) = a + 2^k − b ≡ a − b (mod 2^k)
In other words, to subtract b, the value ~b + 1 is precisely what you need to add. This means that CPUs don’t need dedicated machinery for subtraction, they can just add the two’s complement instead. Neat!5
Along the way, we have also derived why ones’ complement works.
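The derivation can be checked directly in C, where unsigned arithmetic is defined to wrap modulo 2^32 (a toy check, not part of the protocol; the function name is made up):

```c
#include <stdint.h>

// sub_via_complement computes a - b using only addition and bitwise
// negation, exactly as derived above: a + ~b + 1 = a - b (mod 2^32).
uint32_t sub_via_complement(uint32_t a, uint32_t b)
{
	return a + ~b + 1u;
}
```

Interpreting the wrapped result as a signed 32-bit value gives the expected difference even when it is negative.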
Step #3: Switching to binary
Representing numbers in binary is pretty much straightforward. We just need to know:
- the number of bytes used to represent a single number,
- the byte order (little- or big-endian), and
- the representation of negative numbers, if desired.
For example: instead of representing 1750454932 with the 10-byte string “1750454932”, we could encode it as the 32-bit value itself. In little-endian order, that’s the four bytes 0x94 0xD2 0x55 0x68.
For example, to encode a 32-bit little-endian integer in two's complement on an Arm Cortex-M4F, we can just copy the machine representation, because that’s what the machine is using:

```c
// enc_i32 encodes the 32-bit signed integer i32 into the dst buf
// in little-endian two's complement.
void enc_i32(byte_t *dst, int32_t i32)
{
	memcpy(dst, &i32, sizeof(i32));
}
```
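The memcpy trick only works because the Cortex-M4F happens to be little-endian. For the machine-independent encoding footnote 1 asks for, a portable sketch (a hypothetical variant, not necessarily what Airgeek ships) assembles the bytes explicitly and works on any host, big- or little-endian:

```c
#include <stdint.h>

// enc_i32_portable encodes i32 in little-endian two's complement
// regardless of the host byte order, by shifting out each byte.
void enc_i32_portable(uint8_t *dst, int32_t i32)
{
	uint32_t u = (uint32_t)i32; // well-defined: yields the two's complement bits
	dst[0] = (uint8_t)(u & 0xff);
	dst[1] = (uint8_t)((u >> 8) & 0xff);
	dst[2] = (uint8_t)((u >> 16) & 0xff);
	dst[3] = (uint8_t)((u >> 24) & 0xff);
}
```

On a little-endian target, a good compiler typically reduces this to the same single store as the memcpy version.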
That’s 14×4B = 56 bytes of data. So we’re done, right? Not quite!
Step #4: Variable-width encoding of integers
One thing which might not be immediately obvious: when we switched from the text representation of numbers to the binary representation, we lost a very useful property of the text encoding. While the binary representation always requires 4 bytes to represent a value, the width of the text representation is proportional to the base-10 logarithm of the value.
For example, to encode “42” as text, two bytes, ‘4’ and ‘2’ (0x3432)6, are required. But the (little-endian) binary representation is the four bytes 0x2a 0x00 0x00 0x00. On the other hand, the text representation requires a delimiter in between adjacent numbers—in our flat JSON example, the delimiter was usually the comma—to mark when one number ends and another one begins:

...,42,43,...
Therefore a text encoding of 42 requires closer to three bytes. Still, this is one byte fewer than the fixed-width binary representation. Perhaps there’s a binary representation which is proportional to the magnitude of the encoded value?
Instead of always using 4 bytes to encode a 32-bit integer, we could use fewer bytes to represent small numbers. There are multiple ways to achieve this, but to keep things simple, let’s take a look at what is probably the simplest such approach.7
For each byte of the encoding, the most significant bit (MSb) indicates whether this is the last byte (0) or whether another byte follows (1). The remaining bits then make up the representation of the number. For example:
- All 0 ≤ n ≤ 127 are encoded in a single byte 0xxxxxxx, since they only require 7 bits of magnitude.
- For 128 ≤ n ≤ 2^14−1, two bytes are needed: 1xxxxxxx 0xxxxxxx, since they require between 8 and 14 bits of magnitude.
- The maximum 32-bit unsigned integer is five bytes: 10001111 11111111 11111111 11111111 01111111.
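A sketch of an encoder for this scheme, emitting the most significant 7-bit group first as in the examples above (the function name is hypothetical, not part of the Airgeek codebase):

```c
#include <stdint.h>
#include <stddef.h>

// enc_uvarint encodes v as 7-bit groups, most significant group first,
// with the MSb of each byte set on all bytes except the last. Returns
// the number of bytes written (1 to 5 for a 32-bit value).
size_t enc_uvarint(uint8_t *dst, uint32_t v)
{
	uint8_t groups[5];
	size_t n = 0;

	// Split into 7-bit groups, least significant group first.
	do {
		groups[n++] = v & 0x7f;
		v >>= 7;
	} while (v != 0);

	// Emit most significant group first; set the continuation bit
	// on every byte except the last.
	for (size_t i = 0; i < n; i++)
		dst[i] = groups[n - 1 - i] | (i + 1 < n ? 0x80 : 0x00);

	return n;
}
```

Decoding mirrors this: accumulate 7 bits per byte until a byte with a clear MSb ends the number.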
This brings the size of the message down to 51 bytes.
Step #5: Using narrower integer types
But there’s a simpler way to achieve similar results: using narrower integer types to begin with. On a 32-bit processor such as ours, using 32-bit integers to represent numbers is a natural choice, but not necessarily an ideal choice for a compact encoding. For many values, a narrower integer type would suffice.
For example, the temperature sensor has a range of −40.0 to 125.0 °C with a resolution of 0.01 °C and an accuracy of 0.1 °C. For our air quality monitor, a resolution of 0.1 °C is more than sufficient8, so we only need to encode the values −40.0, -39.9, …, 124.9, 125.0. Using the fixed point encoding, we can encode the values as -400, …, 1250 instead, with the unit being 1/10 °C. We can easily fit this into the 16-bit 2’s complement signed integer range of −32 768 through +32 767 inclusive.
In so doing, we are exploiting a property inherent in our data, stemming from the real-world limitations of our sensors. A general-purpose encoding cannot make these choices, as it would be impractical to use. But since our encoding is custom, we can afford to fine-tune it to the problem.
By choosing the smallest integer type that fits each value, the message is 54 bytes. This is a little bit more than the 51 bytes needed for the variable-width encoding, but it’s simpler to implement and faster to execute. And we’re not done yet.
Step #6: Using k-bit integers
Let’s take this approach one step further: there’s no reason why the widths of the encoded fields must be a multiple of a byte, or be the width of one of the native integer types—it’s easier if they are, certainly, but it’s not required.
To represent temperature, we need the values −400…1250. A 12-bit 2’s complement signed integer has a range of -2048…2047 and is thus sufficient to represent temperature. Thankfully, 2’s complement works the same for all k-bit integers for all k ≥ 1.
By choosing the smallest k-bit integer to represent each member, we can bring the size of the message down to 397 bits, or roughly 50 B. This may seem like a bad return on investment at first, but will soon make a lot more sense.
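Packing fields whose widths aren’t byte multiples needs a small bit-level writer. Here is a minimal sketch of the idea (the names and layout are hypothetical, not Airgeek’s actual encoder), writing each field most significant bit first:

```c
#include <stdint.h>
#include <stddef.h>

// A minimal bit-writer: appends the low `bits` bits of `val` to the
// buffer, most significant bit first. The caller zeroes the buffer.
struct bitw {
	uint8_t *buf; // output buffer (zero-initialized by the caller)
	size_t pos;   // index of the next bit to write
};

static void putbits(struct bitw *w, uint32_t val, unsigned bits)
{
	for (unsigned i = bits; i-- > 0; ) {
		if ((val >> i) & 1u)
			w->buf[w->pos >> 3] |= (uint8_t)(1u << (7 - (w->pos & 7)));
		w->pos++;
	}
}

// A 12-bit two's complement temperature is just the low 12 bits of the
// native 16-bit value, since two's complement truncates cleanly.
static void put_temp(struct bitw *w, int16_t t_tenths)
{
	putbits(w, (uint16_t)t_tenths & 0x0fffu, 12);
}
```

The truncation in put_temp is exactly the “2’s complement works the same for all k-bit integers” property at work: dropping the high bits of a 16-bit −400 yields the 12-bit representation of −400.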
Step #7: Conditional encoding
Airgeek measures all parameters every five minutes, except particulate matter, which is only measured 6 times per day.9 But the PM1, PM2.5 and PM10 values still take up 30 bits—even when the datapoint did not include a PM measurement. With 288 measurements per day, but only 6 PMX measurements per day, we’re wasting almost 4 bytes almost 98 % of the time. Let’s fix that.
The Wirth Syntax Notation is a way to capture the syntax of a particular language. Compared to alternatives, I find it super easy to read. Consider the following productions:
```
Seq     = /* 32b unsigned */ .
Len     = /* 16b unsigned */ .
Seconds = /* 32b unsigned */ .
Unix    = /* 32b unsigned */ .

Data       = DataHeader { DataEntry } .
DataHeader = StartSeq Len Unix Seconds .
StartSeq   = Seq .
EndSeq     = Seq .
```
This describes a part of the FetchDataResp encoding, as I’m sure you have figured out.
The Wirth Syntax Notation is used extensively in the Go specification, which is why it’s so much easier to understand compared to, say, the C11 specification. That’s why we’re using it to describe our binary encoding scheme. The part concerning the PM cluster is described with the following productions:
```
DataEntry = "0" FullData Trailer .

FullData  = Seconds FullT FullRH FullP FullCO FullCO2 FullNO2 PM .
FullT     = /* 11 b signed, value = data [/10 degC] */ .
FullRH    = /* 8 b unsigned, value = data [/2 % RH] */ .
FullP     = /* 14 b unsigned, value = data [/10 hPa] */ .
FullCO    = /* 13 b unsigned, value = data [ppm] */ .
FullCO2   = /* 13 b unsigned, value = data [ppm] */ .
FullNO2   = /* 11 b unsigned, value = data [ppb] */ .
PM        = "0" | "1" FullPM2p5 FullPM1 FullPM10 .
FullPM2p5 = /* 10 b unsigned, value = data [ug/m^3] */ .
FullPM1   = /* 10 b unsigned, value = data [ug/m^3] */ .
FullPM10  = /* 10 b unsigned, value = data [ug/m^3] */ .
```
In other words, the cluster of PM values always contains at least one bit. When that bit is 0, the datapoint does not contain PM values. When it’s a 1, the 30 following bits contain the three values for PM2.5, PM1 and PM10.
This means the size of one datapoint is 103 bits ≈ 13 bytes without the PM cluster, and 133 bits ≈ 17 bytes with the PM cluster. Since the PM cluster is missing in about 98 % of all measurements, the average size of a datapoint is 103.62 bits, or just a smidge under 13 bytes.
Step #8: Delta compression
OK, so this was a lot of fun, but at 13 bytes, those 288 datapoints per day add up to some 3.7 kB, and that’s still just way too much. And while I think there is little we can do to encode an individual datapoint much more efficiently, there’s a lot we can do about encoding a sequence of successive datapoints.
The important thing to realize is that indoor air doesn’t really change all that much in 5 minutes—numbers go up and down, but the curves for most of the values are usually pretty smooth. And while there may be occasional “jumps” in the data, such as during intense cross-ventilation, these are the exception rather than the rule.
Let’s apply our conditional encoding once again, but this time to switch between two possible representations: the “full” encoding that was discussed up until this point, and the new “delta” encoding which describes a change from the previous measurement in sequence.
```
Data       = DataHeader { DataEntry } .
DataHeader = Unix Seconds StartSeq Len .
StartSeq   = Seq .
EndSeq     = Seq .

DataEntry = "0" FullData Trailer | "1" DeltaData .

FullData  = Seconds FullT FullRH FullP FullCO FullCO2 FullNO2 PM .
FullT     = /* 11 b signed, value = data [/10 degC] */ .
FullRH    = /* 8 b unsigned, value = data [/2 % RH] */ .
FullP     = /* 14 b unsigned, value = data [/10 hPa] */ .
FullCO    = /* 13 b unsigned, value = data [ppm] */ .
FullCO2   = /* 13 b unsigned, value = data [ppm] */ .
FullNO2   = /* 11 b unsigned, value = data [ppb] */ .
PM        = "0" | "1" FullPM2p5 FullPM1 FullPM10 .
FullPM2p5 = /* 10 b unsigned, value = data [ug/m^3] */ .
FullPM1   = /* 10 b unsigned, value = data [ug/m^3] */ .
FullPM10  = /* 10 b unsigned, value = data [ug/m^3] */ .

DeltaData    = DeltaFlags DeltaSeconds DeltaT DeltaRH DeltaP DeltaCO DeltaCO2 DeltaNO2 .
DeltaFlags   = /* 1 b unsigned, reserved for future use */ .
DeltaSeconds = /* 3 b unsigned, value = prev + 60 * (1 + data) [seconds since the Epoch] */ .
DeltaT       = /* 5 b signed, value = prev + data [/10 degC] */ .
DeltaRH      = /* 5 b signed, value = prev + data [/2 % RH] */ .
DeltaP       = /* 5 b signed, value = prev + data [/10 hPa] */ .
DeltaCO      = /* 3 b signed, value = prev + data [ppm] */ .
DeltaCO2     = /* 6 b signed, value = prev + data [ppm] */ .
DeltaNO2     = /* 3 b signed, value = prev + data [ppb] */ .
```
The first measurement in a sequence is always full, as there is no prior datapoint to base it upon. The following datapoints are encoded as deltas whenever possible: the full encoding is only used when there’s at least one component which doesn’t fit the delta range or when the PM cluster is present.
Amazingly, delta compression is chosen 90 % of the time. In a sequence of 288 measurements, the average size of a datapoint is thus less than 40 bits.
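On the decoding side, those k-bit signed deltas must be sign-extended back to a native integer type after being read out of the bitstream. A standard sketch of the trick (hypothetical naming; not Airgeek’s actual decoder):

```c
#include <stdint.h>

// sext sign-extends the k-bit two's complement value held in the low
// k bits of v to a full int32_t. XORing with the sign-bit mask and
// subtracting it restores the correct sign and magnitude.
static inline int32_t sext(uint32_t v, unsigned k)
{
	uint32_t m = 1u << (k - 1);
	return (int32_t)((v ^ m) - m);
}
```

For a 5-bit delta, for instance, the raw field 0x1f decodes to −1, which is then added to the previous value.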
Conclusion
40 bits! When I saw this play out in real time, I was absolutely delighted. I love it when I can get a lot of mileage from a simple idea, and I believe this qualifies.
There are, of course, established high-performance general-purpose binary encoding schemes out there, such as Google protocol buffers or Cap’n Proto. There’s also CBOR, an ingenious binary encoding—if you haven’t yet, check it out! (I wrote an encoder and a decoder for it many years ago.)
While they are great, we really wanted to keep things as simple as possible and to avoid any external dependencies, even the C stdlib. We certainly could’ve used one of those libraries and bolted some standard compression on top, but the result would almost certainly be unnecessarily complicated, slow and hard to maintain. And much less fun.
In the next post, I’d like to take a look at the encoder and decoder itself—or more specifically, how we managed to keep them simple with a clever approach to error handling. Stay tuned!
1. One requirement we have for the encoding is to be machine independent, so simply streaming the memory representation of the struct won’t cut it.
2. The example isn’t syntactically correct JSON, as JSON does not support the C-style /*...*/ comments. Or any comments for that matter.
3. For a number of very good reasons: code size, complexity, power consumption and security.
4. It’s one of those little hacks I love. For example, in a binary search tree, where s is the index of a son (0 for the left son, 1 for the right son), 1 − s is the other son.
5. Although, as I learned recently, there is a downside: while 2’s complement makes addition simpler to implement in hardware, it makes multiplication more complex. Thus most CPUs will first convert 2’s complement operands to sign-magnitude, multiply that, and then convert the result back to 2’s complement. But I love the math behind it anyway.
6. ASCII is clever: the numbers 0 through 9 are encoded as 0x30 through 0x39, so 0x3n is the digit n. This is useful when reading hex dumps, as any encoded numbers stand out immediately.
7. A modified version of this approach is used in UTF-8 to efficiently encode Unicode code points. As most letters in Latin text fall within the 7-bit ASCII range, a single byte is enough to represent most numbers, but up to four bytes7 are needed to encode high-numbered code points. Four bytes are enough as UTF-8 is limited to the range U+0000–U+10FFFF.
8. It’s very important to distinguish between resolution and accuracy. You can have a very high resolution, very low accuracy temperature sensor which spits out somewhat random data, and a very high resolution, very high accuracy temperature sensor. The SHT45 we’re using is the latter kind. By limiting its resolution to 0.1 °C, we’re not losing any accuracy.
9. In case you’re wondering why we only make six PMX measurements per day: it’s because we care about the long-term average. We could measure PMX more often, but the long-term average would come out about the same, at the expense of a significant reduction in battery life. The long-term background PMX level seems to pose the greatest danger to health.