Module skippable

Encoding of a skippable frame.

A skippable frame carries a 4-bit numeric identifier in the low bits of its magic number, followed by variable-sized arbitrary bytes. Unlike data frames, it does not decompose into any internal block format. It also has no decoder behavior specified by RFC 8878, which instead explicitly clarifies its intent to support “user-defined metadata”:

From a compliant decoder perspective, skippable frames simply need to be skipped, and their content ignored, resuming decoding after the skippable frame.

This frame type’s encoding is specified in section 3.1.2 of IETF RFC 8878¹:

                 +==============+============+===========+
                 | Magic_Number | Frame_Size | User_Data |
                 +==============+============+===========+
                 |   4 bytes    |  4 bytes   |  n bytes  |
                 +--------------+------------+-----------+
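The layout above can be sketched in a few lines of Rust. This is a minimal illustration, not this module's actual API: the names `encode_skippable` and `parse_skippable` are hypothetical. It relies on two facts from RFC 8878: the magic number ranges over 0x184D2A50 to 0x184D2A5F, and both 4-byte fields are little-endian.

```rust
/// Lowest skippable-frame magic number; the low 4 bits carry the identifier.
const SKIPPABLE_MAGIC_BASE: u32 = 0x184D_2A50;

/// Serialize a skippable frame (hypothetical helper, for illustration only).
/// Returns `None` if `id` exceeds 15 or `user_data` does not fit a 32-bit length.
pub fn encode_skippable(id: u8, user_data: &[u8]) -> Option<Vec<u8>> {
    if id > 0x0F || user_data.len() > u32::MAX as usize {
        return None;
    }
    let mut out = Vec::with_capacity(8 + user_data.len());
    // Magic_Number: 4 bytes, little-endian.
    out.extend_from_slice(&(SKIPPABLE_MAGIC_BASE | u32::from(id)).to_le_bytes());
    // Frame_Size: 4 bytes, little-endian, counting only User_Data.
    out.extend_from_slice(&(user_data.len() as u32).to_le_bytes());
    out.extend_from_slice(user_data);
    Some(out)
}

/// Parse one skippable frame from the front of `input`, returning the
/// identifier, the user data, and the bytes that follow the frame.
pub fn parse_skippable(input: &[u8]) -> Option<(u8, &[u8], &[u8])> {
    let header = input.get(..8)?;
    let magic = u32::from_le_bytes([header[0], header[1], header[2], header[3]]);
    if magic & 0xFFFF_FFF0 != SKIPPABLE_MAGIC_BASE {
        return None; // not a skippable frame
    }
    let frame_size =
        u32::from_le_bytes([header[4], header[5], header[6], header[7]]) as usize;
    let body = &input[8..];
    let user_data = body.get(..frame_size)?; // `None` if the frame is truncated
    Some(((magic & 0x0F) as u8, user_data, &body[frame_size..]))
}
```

Encoding identifier 3 with the payload `hello` yields the bytes `53 2A 4D 18 05 00 00 00` followed by the five payload bytes.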

§Privacy Risk: Watermarking

The RFC repeatedly notes that skippable frames make watermarking and other forms of tracking possible:

It should be noted that a skippable frame can be used to watermark a stream of concatenated frames embedding any kind of tracking information (even just a Universally Unique Identifier (UUID)). Users wary of such possibility should scan the stream of concatenated frames in an attempt to detect such frames for analysis or removal.

Because the specification does not define any decoder behavior for skippable frames, this risk can go undetected unless the decoder explicitly handles such frames. Removing them modifies the resulting stream (which may itself introduce a watermarking risk). However, it should allow two independent implementations (or two independent users of this library) to avoid being individually watermarked by skippable frames alone when they reproduce a zstd stream from an untrusted source.
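As a rough illustration of the "detect for analysis or removal" step, the sketch below peels skippable frames off the front of a stream and returns their payloads separately. The function name `take_leading_skippable` is hypothetical. It deliberately handles only frames that precede the first data frame: finding skippable frames later in the stream requires knowing where each data frame ends, which means walking its block headers.

```rust
/// True when `magic` falls in the skippable range 0x184D2A50..=0x184D2A5F.
fn is_skippable_magic(magic: u32) -> bool {
    magic & 0xFFFF_FFF0 == 0x184D_2A50
}

/// Split off every skippable frame at the front of `stream` (hypothetical
/// helper). Returns the user data of each frame found, plus the remaining
/// bytes starting at the first frame that is not skippable.
pub fn take_leading_skippable(mut stream: &[u8]) -> (Vec<&[u8]>, &[u8]) {
    let mut found = Vec::new();
    while stream.len() >= 8 {
        let magic = u32::from_le_bytes([stream[0], stream[1], stream[2], stream[3]]);
        if !is_skippable_magic(magic) {
            break;
        }
        let size =
            u32::from_le_bytes([stream[4], stream[5], stream[6], stream[7]]) as usize;
        let body = &stream[8..];
        if body.len() < size {
            break; // truncated frame: leave it in place for the caller to reject
        }
        found.push(&body[..size]);
        stream = &body[size..];
    }
    (found, stream)
}
```

A caller can log or inspect the returned payloads and pass only the remaining bytes on to the data-frame decoder.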

§Data Frames Contain Hidden States

However, the Zstandard stream format contains many further opportunities for individually watermarking a stream beyond skippable frames which are not mentioned in the spec, and which generally revolve around the immense flexibility of standard data frames.

These opportunities are almost too numerous to name, but fall into a few broad categories:

degenerate states: when decoded output is empty
- Like skippable frames, these also have no effect upon the decoded output, but can store arbitrary user data.
- examples include:
  - when Frame_Content_Size is 0 (this also limits all subsequent Window_Size and Block_Size).
  - when Block_Size is 0 (for Raw_Block or RLE_Block).
  - when Number_of_Sequences is 0.
  - TODO: probably when a literal or offset is zero-length in sequence execution?
synonymous/fungible states: when the same output data is representable with distinct byte strings
- This comes in three basic forms:
  1. using a more complex data structure than necessary, e.g.:
    - a Raw_Block for a single repeating byte.
    - a Compressed_Block for incompressible data.
  2. using a sequence of too simple data structures, e.g.:
    - two consecutive RLE_Blocks with Block_Size == 1 vs Raw_Block with 2 bytes.
    - a Raw_Block for highly compressible data.
  3. using Block_Type vs Literals_Block_Type:
    - Block_Type provides simpler forms of RLE_Block and Raw_Literals_Block, whereas the Sequences_Section² from Compressed_Block can describe a program to execute a sequence of run-length literals or directly-copied bytes.
- Note that “compressibility” is highly domain-specific, and decisions may be performed arbitrarily by the encoder.
  - This therefore exposes the encoder to watermarking.
dict/literal encoding: when decisions are made regarding prefix data or symbol distributions
- TODO: it is still unclear how this works and the directions seem to contradict each other.
- This technique can be supremely difficult to detect heuristically.
  - It may be possible through re-encoding to compare against a symbol distribution table built up by hand.
  - In general, the space of possible compression encodings is vast, and as compression is compared by both speed and size ratio, the decisions a compressor makes are hard to judge.
    - However, this latitude for individual choice also makes encoders susceptible to watermarking.
block index selection: when the encoder decides how to chunk up the stream
- As with dict encoding, this is generally considered an arbitrary decision by the encoder.
  - As a result, encoding is also watermarkable.
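The "synonymous states" category above can be made concrete with a small sketch. It builds block sequences by hand and decodes them with a toy decoder that understands only Raw_Block and RLE_Block. The names `block_header` and `decode_blocks` are hypothetical, and the byte strings are bare block sequences rather than complete frames (no magic number or frame header). The 3-byte little-endian Block_Header layout follows RFC 8878: bit 0 is Last_Block, bits 1-2 are Block_Type, and bits 3-23 are Block_Size.

```rust
/// The two simplest values of the 2-bit Block_Type field
/// (Compressed_Block is omitted from this sketch).
#[derive(Clone, Copy)]
pub enum BlockType {
    Raw = 0,
    Rle = 1,
}

/// Build a 3-byte little-endian Block_Header (hypothetical helper).
pub fn block_header(last: bool, block_type: BlockType, block_size: u32) -> [u8; 3] {
    assert!(block_size < (1 << 21), "Block_Size is a 21-bit field");
    let v = (block_size << 3) | ((block_type as u32) << 1) | u32::from(last);
    let b = v.to_le_bytes();
    [b[0], b[1], b[2]]
}

/// Decode a run of Raw_Block / RLE_Block entries up to and including the
/// block marked Last_Block. Returns `None` on truncated input or on any
/// other block type.
pub fn decode_blocks(mut input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    loop {
        let h = input.get(..3)?;
        let v = u32::from_le_bytes([h[0], h[1], h[2], 0]);
        let (last, block_type, size) = (v & 1 == 1, (v >> 1) & 3, (v >> 3) as usize);
        input = &input[3..];
        match block_type {
            0 => {
                // Raw_Block: Block_Size bytes are copied verbatim.
                out.extend_from_slice(input.get(..size)?);
                input = &input[size..];
            }
            1 => {
                // RLE_Block: a single byte repeated Block_Size times.
                let byte = *input.first()?;
                out.resize(out.len() + size, byte);
                input = &input[1..];
            }
            _ => return None,
        }
        if last {
            return Some(out);
        }
    }
}
```

Three distinct byte strings (one Raw_Block of `aa`, one RLE_Block of size 2, and two RLE_Blocks of size 1) all decode to the same two bytes, so the choice between them is free to carry information.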

§“Unused Bit” is a Skippable Frame

Also worth calling out in particular is the “Unused Bit” from section 3.1.1.1.1.3³:

A decoder compliant with this specification version shall not interpret this bit.

This is actually even stronger than a skippable frame, as it claims compliance requires not looking at the value of the bit, whereas skippable frames do not impose any interpretation (forbidding an interpretation is also an interpretation!). Luckily, as it states at the top:

This document is not an Internet Standards Track specification.

So for now we can do what it suggests:

An encoder compliant with this specification must set this bit to zero.
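A minimal sketch of inspecting and clearing this bit follows. It assumes the frame begins with the Zstandard magic number 0xFD2FB528 (stored little-endian), that the Frame_Header_Descriptor is the byte immediately after it, and that the Unused_Bit is bit 4 of that byte. The helper names are hypothetical and not part of this module.

```rust
/// Zstandard frame magic number 0xFD2FB528, as stored on the wire.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];
/// Bit 4 of Frame_Header_Descriptor is the Unused_Bit.
const UNUSED_BIT: u8 = 1 << 4;

/// Report whether the Unused_Bit is set (hypothetical helper).
/// Returns `None` if `frame` does not start with a Zstandard data frame header.
pub fn unused_bit_is_set(frame: &[u8]) -> Option<bool> {
    if !frame.starts_with(&ZSTD_MAGIC) {
        return None;
    }
    let descriptor = *frame.get(4)?;
    Some((descriptor & UNUSED_BIT) != 0)
}

/// Clear the Unused_Bit in place, as a compliant encoder would have emitted it.
/// Returns `true` if the frame was modified.
pub fn clear_unused_bit(frame: &mut [u8]) -> bool {
    if unused_bit_is_set(frame) == Some(true) {
        frame[4] &= !UNUSED_BIT;
        true
    } else {
        false
    }
}
```

Note that inspecting the bit at all goes beyond what a strictly compliant decoder "shall" do, which is exactly the tension described above.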

§Timing Attacks on Decoding

Decoders can be deanonymized in another way as well, merely by downloading a Zstandard data stream: through their choice of internal buffering.

The spec makes it clear that decoders are free to choose their own buffer limits, saying this two separate times! In Single_Segment_Flag⁴:

For broader compatibility, decoders are recommended to support memory sizes of at least 8 MB. This is only a recommendation; each decoder is free to support higher or lower limits, depending on local limitations.

And then in Window_Descriptor⁵:

For improved interoperability, it’s recommended for decoders to support values of Window_Size up to 8 MB and for encoders not to generate frames requiring a Window_Size larger than 8 MB. It’s merely a recommendation though, and decoders are free to support higher or lower limits, depending on local limitations.
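To make the limit concrete, the sketch below computes Window_Size from the Window_Descriptor byte using the formula in RFC 8878 (a 5-bit exponent in the high bits, a 3-bit mantissa in the low bits) and compares it against a decoder-chosen limit. The function names are hypothetical, and treating "8 MB" as 8 × 1024 × 1024 bytes is an assumption.

```rust
/// The RFC's recommended (not required) limit, assuming binary megabytes.
pub const RECOMMENDED_MAX_WINDOW: u64 = 8 * 1024 * 1024;

/// Compute Window_Size from a Window_Descriptor byte:
///   windowLog  = 10 + Exponent
///   windowBase = 1 << windowLog
///   windowAdd  = (windowBase / 8) * Mantissa
pub fn window_size(descriptor: u8) -> u64 {
    let exponent = u32::from(descriptor >> 3);
    let mantissa = u64::from(descriptor & 0x07);
    let window_base = 1u64 << (10 + exponent);
    let window_add = (window_base / 8) * mantissa;
    window_base + window_add
}

/// A decoder is free to pick any `limit`; that freedom is what makes the
/// resulting buffering behavior specific to an implementation.
pub fn accepts_window(descriptor: u8, limit: u64) -> bool {
    window_size(descriptor) <= limit
}
```

With this formula the smallest window is 1 KiB (descriptor `0x00`), and descriptor `0x68` corresponds to exactly 8 MiB.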

curl allows specifying the size of the buffer into which it receives output (including decompressed Zstandard stream data)⁶, and this can be used to validate the effect of buffer selection (via curl’s internal buffer reallocation heuristics) upon remote latencies.

§Window Size is Fingerprintable Entropy

Unfortunately, this freedom of choice in buffer size defines a fingerprintable time series, visible to the remote end through variable latency and packet size over the course of the download (the proof of this is left as an exercise to the reader).

To quote a Tor Browser developer⁷:

Window dimensions are a big source of fingerprintable entropy on the web.

Analogously, the variable latency between reads from the network socket introduced by a particular buffer size can likely be used to fingerprint a decoder. Tor hidden services have been fingerprinted through a time series analysis of packet sizes in this way⁸.

§How to Achieve Anonymity

Given all of this uncertainty, how can a decoder expect to avoid fingerprinting?

In general, merely scanning for and discarding skippable frames in the decoder (as the spec recommends) is not sufficient. The author of this library can identify three main strategies to mitigate the issues described above:

Fully read out Zstandard network streams to disk.
- Instead of imparting backpressure by stream processing, decouple the Zstandard decompression from network operations.
  - Note that there are other forms of fingerprinting unrelated to Zstandard that may remain despite this mitigation.
- It may be possible to select buffer sizes according to some degree of randomness to thwart fingerprinting, but that would require a much more thorough analysis to formalize and prove.
Fully decode each stream, then re-encode it.
- Note that this inverts the threat model: instead of fingerprinting by correlating a stream sent to a particular individual to de-anonymize them, this now risks fingerprinting an individual by their choice of encoder settings.
  - However, this effectively breaks the link from received Zstandard data stream to recipient, so Zstandard data streams can be received from untrusted sources.
  - Also note that if the resulting Zstandard data stream is never going to be used anywhere else (if it completely stays on the local node or internal network), this mitigation is unnecessary.
    - Note that encryption is not a sufficient protection here if the Zstandard data stream is attacker-controlled! See the CRIME exploit⁹ linked in the spec.
Re-encode using deterministic settings to avoid leaking machine-specific info.
- TODO: this needs to be fleshed out when the encoder is built!
- Especially consider how the translation of symbol frequency tables may incur rounding errors from machine precision boundaries and how this may induce deterministic differences.
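The first strategy above (fully reading the stream before decoding) can be sketched with the standard library alone. This is an illustration under stated assumptions rather than this library's API: `spool_fully` is a hypothetical name, and the spool is any seekable reader and writer, such as a `std::fs::File` opened for both reading and writing. The point is that the network socket is drained at a pace that does not depend on the decompressor's buffer sizes.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Drain `network` completely into `spool` before any decoding starts, then
/// rewind the spool so a decoder can read it from the beginning.
/// (Hypothetical helper; no Zstandard decoding happens here.)
pub fn spool_fully<R, S>(mut network: R, mut spool: S) -> io::Result<S>
where
    R: Read,
    S: Read + Write + Seek,
{
    // `io::copy` reads until end of stream using its own internal buffer,
    // so the read pattern is independent of the later decoder.
    io::copy(&mut network, &mut spool)?;
    spool.flush()?;
    spool.seek(SeekFrom::Start(0))?;
    Ok(spool)
}
```

The decoder then reads from the returned spool instead of the socket. As noted in the list above, this removes only the Zstandard-specific backpressure signal, and other forms of fingerprinting may remain.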