From 1734207266752b9464f42c31f7728a7e3c692c50 Mon Sep 17 00:00:00 2001 From: Runxi Yu Date: Wed, 11 Mar 2026 16:13:40 +0800 Subject: research: Add packfile bloom filter RFC --- research/packfile_bloom.txt | 130 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 130 insertions(+) create mode 100644 research/packfile_bloom.txt (limited to 'research') diff --git a/research/packfile_bloom.txt b/research/packfile_bloom.txt new file mode 100644 index 00000000..7a3aae02 --- /dev/null +++ b/research/packfile_bloom.txt @@ -0,0 +1,130 @@ +Packfile bloom filter RFC +========================= + +Problem +------- + +Especially for server-side usages, repacking is extremely expensive, and +creating multi-pack-indexes is still rather expensive. Incremental MIDX +partially solves this, but would defeat the purpose of MIDX when there are too +many of them, as Git would still have to walk the MIDXes in order while +performing expensive indexing queries. + +Idea +---- + +Each MIDX layer, and each non-MIDX index, comes with a bloom filter. MIDXes and +ordinary .idx files are still traversed in their usual order, but the first +step when traversing them, is to check whether that index could possibly have +the desired object, through a bloom filter. + +We will want the filters to be mmaped, and we want the lookup cost to be +dominated by one cache-line read rather than using many scattered reads. +Therefore, a blocked bloom filter is likely the right direction here. The steps +are as follows: + +1. Split the filter into 64-octet buckets, since 64 octets is the most common + cache-line size. +2. Use some bits of the object ID to choose the bucket. +3. Use the rest of the key to choose several bit positions inside that bucket. +4. A lookup thus reads one 64-octet bucket and checks whether all required bits + are set. + +Note on Object IDs +------------------ + +Git object IDs are cryptographic hashes (e.g., currently either SHA-256 +or SHA-1), and are thus uniformly distributed in non-pathological scenarios. +See also the "Security considerations" section. + +Definitions +----------- + +Let: + + B := number of buckets + K := number of bits set and tested per object ID + +* All integers here are big endian. +* The OID is to be interpreted as a big-endian bitstring, where bit offset 0 + is the most significant bit of octet 0. +* log2(B) + 9K <= hash length in bits. + +File layout +----------- + +* 4-octet signature: {'I', 'D', 'B', 'L'} +* 4-octet version identifier (= 1) +* 4-octet object hash algorithm identifier (= 1 for SHA-1, 2 for SHA-256) +* 4-octet B (number of buckets) +* 2-octet K (number of bits set and tested per object ID) +* 6-octet padding (must be all zeros) +* B buckets of 64 octets each. + +Validation +---------- + +* Matching signature +* Supported version (the rest of the rules are for this version) +* Hash function identifier must be recognized +* B must be nonzero and a power of two +* K must be nonzero +* log2(B) + 9K <= hash length in bits +* Padding must be all zero +* File size must be 24 + 64 * B octets + +Lookup procedure +---------------- + +1. Let b be the unsigned integer encoded by the most significant log2(B) bits + of OID. B is a power of two, and 0 <= b < B. +2. Select and read bucket b. +3. For each 0 <= i < K: + 1. Start immediately after the most significant log2(B) bits of OID, let the + i-th 9-bit field be the bits at offset 9 * i through 9 * i + 8 within the + next 9 * K bits of the OID. + 2. Let pi be the unsigned integer encoded by that 9-bit field. + Then, 0 <= pi < 512. + 3. Compute wi := pi >> 6, and bi := pi & 63. + Thus, wi identifies one of the 8 64-bit words in bucket b, and bi + identifies one bit within that word. + 4. Test whether bi is set in the word wi of bucket b. (Within each 64-bit + word, bit index 0 denotes the most significant bit, and bit index 63 + denotes the least significant bit.) + +If any test fails, the OID is definitely not in the relevant idx. +If all tests succeed, the OID may be in the relevant idx. + +Note that two of the K 9-bit fields can decode to the same pi, which means an +insertion may set fewer than K distinct bits. + +Worked example +-------------- + +Let: + + B = 1 << 15 = 32768 + K = 8 + +Then, log2(B) = 15. Each lookup thus uses 15 bits to choose the bucket +and 8 * 9 = 72 bits to choose the in-bucket positions, for a total of +87 bits taken from the object ID. + +1. Read the first 15 bits of OID and interpret them as b, where + 0 <= b < 32768. +2. Read bucket b. +3. For each 0 <= i < 8: + 1. Read the i-th 9-bit field from the next 72 bits of OID and interpret it + as pi, where 0 <= pi < 512. + 2. Compute: wi = pi >> 6, bi = pi & 63. + 3. Test whether bit bi is set in the word wi of bucket b. + +Security considerations +------------------------ + +An adversarial packfile where objects are (computationally intensive, even for +SHA-1 as vulnerable as it is) constructed to have the same prefix for the +relevant object format hash algorithm could be used to fill up the bloom +filters, rendering some buckets useless. In the worst case, if they somehow +fill all filters, this proposal's optimizations become useless, but would not +be a significant DoS vector. -- cgit v1.3.1-10-gc9f91