Encoding of a skippable frame.
A skippable frame carries a fixed-size numeric identifier in its magic number as well as
variable-sized arbitrary bytes. Unlike data frames, it does not decompose into any internal
block format. RFC 8878 also specifies no decoder behavior for it, and instead explicitly
clarifies its intent to support “user-defined metadata”:
From a compliant decoder perspective, skippable frames simply need to be skipped, and their content ignored, resuming decoding after the skippable frame.
This frame type’s encoding is specified in section 3.1.2 of IETF RFC 8878[1]:
+==============+============+===========+
| Magic_Number | Frame_Size | User_Data |
+==============+============+===========+
|   4 bytes    |  4 bytes   |  n bytes  |
+--------------+------------+-----------+

§Privacy Risk: Watermarking
The IETF RFC repeatedly notes the potential for watermarking and other forms of tracking possible via skippable frames in this standard:
It should be noted that a skippable frame can be used to watermark a stream of concatenated frames embedding any kind of tracking information (even just a Universally Unique Identifier (UUID)). Users wary of such possibility should scan the stream of concatenated frames in an attempt to detect such frames for analysis or removal.
Because the specification does not define any behavior for skippable frames, this risk can go undetected unless the decoder explicitly handles such frames. Removing them modifies the resulting stream (which may itself impose its own risk of watermarking), but should make it possible for two independent implementations (or two independent users of this library) to avoid being individually watermarked by skippable frames alone when they reproduce a zstd stream from an untrusted source.
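As a rough illustration of the scan the RFC suggests, the sketch below recognizes a skippable frame at a known frame boundary using the layout shown earlier. The names `SkippableFrame` and `parse_skippable_frame` are hypothetical and are not this library's API. The magic range 0x184D2A50 to 0x184D2A5F and the little-endian field encoding are taken from RFC 8878 section 3.1.2. Locating frame boundaries between data frames requires parsing those data frames, which this sketch does not attempt.

```rust
/// Lowest skippable-frame magic number; the low 4 bits are the user-chosen identifier.
const SKIPPABLE_MAGIC_BASE: u32 = 0x184D_2A50;

/// A parsed skippable frame (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct SkippableFrame<'a> {
    /// Low nibble of `Magic_Number` (0..=15).
    id: u8,
    /// The `User_Data` bytes, exactly `Frame_Size` long.
    user_data: &'a [u8],
}

/// Read a little-endian u32 from the first 4 bytes of `bytes`, if present.
fn read_u32_le(bytes: &[u8]) -> Option<u32> {
    let b = bytes.get(..4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Try to parse a skippable frame at the start of `input`.
/// Returns the frame and the remaining bytes, or `None` if `input` does not
/// begin with a complete skippable frame.
fn parse_skippable_frame(input: &[u8]) -> Option<(SkippableFrame<'_>, &[u8])> {
    let magic = read_u32_le(input)?;
    if (magic & 0xFFFF_FFF0) != SKIPPABLE_MAGIC_BASE {
        return None;
    }
    let frame_size = read_u32_le(input.get(4..)?)? as usize;
    let end = 8usize.checked_add(frame_size)?;
    let user_data = input.get(8..end)?;
    let frame = SkippableFrame { id: (magic & 0xF) as u8, user_data };
    Some((frame, &input[end..]))
}
```

A caller that wants to drop or inspect such frames can loop on `parse_skippable_frame` at each frame boundary and continue with the returned remainder.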
§Data Frames Contain Hidden States
However, the Zstandard stream format contains many further opportunities for individually
watermarking a stream beyond skippable frames. The spec does not mention them, and they
generally revolve around the immense flexibility of standard data frames.
These opportunities are almost too numerous to name, but fall into a few broad categories:
- degenerate states: when decoded output is empty
  - Like skippable frames, these also have no effect upon the decoded output, but can store arbitrary user data.
  - examples include:
    - when Frame_Content_Size is 0 (this also limits all subsequent Window_Size and Block_Size).
    - when Block_Size is 0 (for Raw_Block or RLE_Block).
    - when Number_of_Sequences is 0.
    - TODO: probably when a literal or offset is zero-length in sequence execution?
- synonymous/fungible states: when the same output data is representable with distinct byte
  strings
  - This comes in three basic forms:
    - using a more complex data structure than necessary, e.g.:
      - a Raw_Block for a single repeating byte.
      - a Compressed_Block for uncompressible data.
    - using a sequence of too simple data structures, e.g.:
      - two consecutive RLE_Blocks with Block_Size == 1 vs Raw_Block with 2 bytes.
      - a Raw_Block for highly compressible data.
    - using Block_Type vs Literals_Block_Type: Block_Type provides simpler forms of RLE_Block and Raw_Literals_Block, whereas the Sequences_Section[2] from Compressed_Block can describe a program to execute a sequence of run-length literals or directly-copied bytes.
  - Note that “compressibility” is highly domain-specific, and decisions may be performed
    arbitrarily by the encoder.
    - This therefore exposes the encoder to watermarking.
- dict/literal encoding: when decisions are made regarding prefix data or symbol
  distributions
  - TODO: it is still unclear how this works and the directions seem to contradict each other.
  - This technique can be supremely difficult to detect heuristically.
    - It may be possible through re-encoding to compare against a symbol distribution table built up by hand.
  - In general, the space of possible compression encodings is vast, and as compression is
    compared by both speed and size ratio, the decisions a compressor makes are hard to judge.
    - However, this individuality streak makes encoders susceptible to watermarking too.
- block index selection: when the encoder decides how to chunk up the stream
  - As with dict encoding, this is generally considered an arbitrary decision by the encoder.
    - As a result, encoding is also watermarkable.
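To make the “synonymous states” category concrete, the following sketch builds the two block sequences named in the list (two RLE_Blocks with Block_Size == 1 versus one Raw_Block with 2 bytes) and decodes both to the same output. It assumes the 3-byte little-endian Block_Header layout from RFC 8878 section 3.1.1.2 (bit 0 Last_Block, bits 1-2 Block_Type, bits 3-23 Block_Size). All function and type names are hypothetical, and the toy decoder handles only Raw_Block and RLE_Block, with no frame header or checksum.

```rust
/// The two simple block types (Compressed_Block and Reserved are omitted).
#[derive(Clone, Copy)]
enum BlockType {
    Raw = 0,
    Rle = 1,
}

/// Encode a 3-byte little-endian Block_Header.
fn block_header(last: bool, ty: BlockType, size: u32) -> [u8; 3] {
    assert!(size < (1 << 21), "Block_Size is a 21-bit field");
    let v = u32::from(last) | ((ty as u32) << 1) | (size << 3);
    [v as u8, (v >> 8) as u8, (v >> 16) as u8]
}

/// Decode a run of Raw/RLE blocks up to and including the last block.
fn decode_simple_blocks(mut input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    loop {
        let hdr = input.get(..3)?;
        let v = u32::from(hdr[0]) | (u32::from(hdr[1]) << 8) | (u32::from(hdr[2]) << 16);
        let last = (v & 1) == 1;
        let size = (v >> 3) as usize;
        let body = &input[3..];
        input = match (v >> 1) & 0b11 {
            // Raw_Block: `size` literal bytes follow the header.
            0 => {
                out.extend_from_slice(body.get(..size)?);
                &body[size..]
            }
            // RLE_Block: one byte follows, repeated `size` times.
            1 => {
                let byte = *body.first()?;
                out.resize(out.len() + size, byte);
                &body[1..]
            }
            // Compressed_Block and Reserved are out of scope for this sketch.
            _ => return None,
        };
        if last {
            return Some(out);
        }
    }
}

/// Two distinct byte strings that both decode to b"aa".
fn encodings_of_aa() -> (Vec<u8>, Vec<u8>) {
    let mut two_rle = Vec::new();
    two_rle.extend_from_slice(&block_header(false, BlockType::Rle, 1));
    two_rle.push(b'a');
    two_rle.extend_from_slice(&block_header(true, BlockType::Rle, 1));
    two_rle.push(b'a');

    let mut one_raw = Vec::new();
    one_raw.extend_from_slice(&block_header(true, BlockType::Raw, 2));
    one_raw.extend_from_slice(b"aa");
    (two_rle, one_raw)
}
```

Prepending a Raw_Block with Block_Size 0 to either encoding gives a third byte string with the same decoded output, which is an instance of the “degenerate states” category.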
§“Unused Bit” is a Skippable Frame
Also worth calling out in particular is the “Unused Bit” from section 3.1.1.1.1.3[3]:
A decoder compliant with this specification version shall not interpret this bit.
This is actually even stronger than a skippable frame, as it claims compliance requires not looking at the value of the bit, whereas skippable frames do not impose any interpretation (forbidding an interpretation is also an interpretation!). Luckily, as it states at the top:
This document is not an Internet Standards Track specification.
So for now we can do what it suggests:
An encoder compliant with this specification must set this bit to zero.
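A minimal sketch of how a re-encoder might detect and zero this bit, assuming the Frame_Header_Descriptor layout from RFC 8878 section 3.1.1.1.1, where the Unused_Bit is bit 4 of the descriptor byte. The function names are hypothetical. Note that inspecting the bit at all is the deliberate departure from the decoder wording discussed above.

```rust
/// Bit 4 of Frame_Header_Descriptor is the Unused_Bit.
const UNUSED_BIT_MASK: u8 = 1 << 4;

/// Report whether a frame's descriptor byte carries a non-zero Unused_Bit.
fn unused_bit_is_set(frame_header_descriptor: u8) -> bool {
    (frame_header_descriptor & UNUSED_BIT_MASK) != 0
}

/// Return the descriptor with the Unused_Bit forced to zero, as the
/// encoder requirement demands; all other bits are left untouched.
fn clear_unused_bit(frame_header_descriptor: u8) -> u8 {
    frame_header_descriptor & !UNUSED_BIT_MASK
}
```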
§Timing Attacks on Decoding
Decoders can be deanonymized in yet another way, even just by downloading a Zstandard data stream: by their choice of internal buffering.
The spec makes it clear that decoders are free to choose their own buffer limits, saying this
two separate times! In Single_Segment_Flag[4]:
For broader compatibility, decoders are recommended to support memory sizes of at least 8 MB. This is only a recommendation; each decoder is free to support higher or lower limits, depending on local limitations.
And then in Window_Descriptor[5]:
For improved interoperability, it’s recommended for decoders to support values of Window_Size up to 8 MB and for encoders not to generate frames requiring a Window_Size larger than 8 MB. It’s merely a recommendation though, and decoders are free to support higher or lower limits, depending on local limitations.
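For reference, the Window_Size a frame demands can be computed from the one-byte Window_Descriptor using the exponent/mantissa formula in RFC 8878, and then compared against whatever limit a decoder chooses. The sketch below uses hypothetical function names and assumes “8 MB” means 8 * 2^20 bytes.

```rust
/// The recommended minimum supported window, assuming 8 MB = 8 * 2^20 bytes.
const RECOMMENDED_MAX_WINDOW: u64 = 8 << 20;

/// Compute Window_Size from a Window_Descriptor byte:
/// the top 5 bits are the exponent, the low 3 bits the mantissa.
fn window_size(window_descriptor: u8) -> u64 {
    let exponent = u32::from(window_descriptor >> 3);
    let mantissa = u64::from(window_descriptor & 0b111);
    let window_base = 1u64 << (10 + exponent);
    let window_add = (window_base / 8) * mantissa;
    window_base + window_add
}

/// Whether a frame's requested window fits under a decoder-chosen limit.
fn within_limit(window_descriptor: u8, limit: u64) -> bool {
    window_size(window_descriptor) <= limit
}
```

Because the limit is a free parameter, two decoders given the same frame can behave differently, for example one rejecting a frame that the other accepts.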
curl allows specifying a buffer size to receive output (including decompressed Zstandard
stream data) into[6], and this can be used to validate the effect of buffer
selection (using curl’s internal buffer reallocation heuristics) upon remote latencies.
§Window Size is Fingerprintable Entropy
Unfortunately, this freedom of choice in buffer size defines a fingerprintable time series, visible to the remote end through variable latency and packet size over the course of the download (the proof of this is left as an exercise to the reader).
To quote a Tor Browser developer[7]:
Window dimensions are a big source of fingerprintable entropy on the web.
Analogously, the variable latency between reads from the network socket introduced by a particular buffer size can likely be used to fingerprint a decoder. Tor hidden services have been fingerprinted through a time series analysis of packet sizes in this way[8].
§How to Achieve Anonymity
Given all of this uncertainty, how can a decoder expect to avoid fingerprinting?
In general, this is simply not possible by merely scanning and discarding frames from the decoder alone (as the spec recommends). The author of this library can identify three main strategies to mitigate the issues described above:
- Fully read out Zstandard network streams to disk.
  - Instead of imparting backpressure by stream processing, decouple the Zstandard
    decompression from network operations.
    - Note that there are other forms of fingerprinting unrelated to Zstandard that may remain despite this mitigation.
    - It may be possible to select buffer sizes according to some degree of randomness to thwart fingerprinting, but that would require a much more thorough analysis to formalize and prove.
- Fully decode each stream, then re-encode it.
  - Note that this inverts the threat model: instead of fingerprinting by correlating a stream
    sent to a particular individual to de-anonymize them, this now risks fingerprinting an
    individual by their choice of encoder settings.
    - However, this effectively breaks the link from received Zstandard data stream to recipient, so Zstandard data streams can be received from untrusted sources.
    - Also note that if the resulting Zstandard data stream is never going to be used
      anywhere else (if it completely stays on the local node or internal network), this
      mitigation is unnecessary.
    - Note that encryption is not a sufficient protection here if the Zstandard data stream is attacker-controlled! See the CRIME exploit[9] linked in the spec.
- Re-encode using deterministic settings to avoid leaking machine-specific info.
  - TODO: this needs to be fleshed out when the encoder is built!
  - Especially consider how the translation of symbol frequency tables may incur rounding errors from machine precision boundaries and how this may induce deterministic differences.
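The first strategy can be sketched with the standard library alone: drain the network stream completely before any decompression runs, so the decoder's buffering never applies backpressure to the socket. The function names and the `decode` callback here are hypothetical, and an in-memory `Vec<u8>` stands in for the on-disk spool; a real implementation would more likely write to a temporary file for large streams.

```rust
use std::io::{self, Read};

/// Read the entire `network` stream into memory without decoding any of it.
fn receive_fully<R: Read>(mut network: R) -> io::Result<Vec<u8>> {
    let mut spooled = Vec::new();
    network.read_to_end(&mut spooled)?;
    Ok(spooled)
}

/// Run `decode` only after the whole stream has been received, so that the
/// pace of network reads does not depend on the decoder's buffer sizes.
fn receive_then_decode<R, F, T>(network: R, decode: F) -> io::Result<T>
where
    R: Read,
    F: FnOnce(&[u8]) -> T,
{
    let spooled = receive_fully(network)?;
    Ok(decode(&spooled))
}
```

This only addresses timing that is introduced by the decoder itself; as noted in the list, other forms of fingerprinting may remain.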
[1] https://datatracker.ietf.org/doc/html/rfc8878#section-3.1.2
[2] https://datatracker.ietf.org/doc/html/rfc8878#section-3.1.1.3.2
[3] https://datatracker.ietf.org/doc/html/rfc8878#section-3.1.1.1.1.3
[4] https://datatracker.ietf.org/doc/html/rfc8878#section-3.1.1.1.1.2
[5] https://datatracker.ietf.org/doc/html/rfc8878#name-window-descriptor
[6] https://docs.rs/curl/latest/curl/easy/struct.Easy2.html#method.buffer_size
[8] https://www.informatik.tu-cottbus.de/~andriy/papers/acmccs-wpes17-hidden-services-fp.pdf
[9] https://en.wikipedia.org/w/index.php?title=CRIME&oldid=844538656