From 1734207266752b9464f42c31f7728a7e3c692c50 Mon Sep 17 00:00:00 2001
From: Runxi Yu
Date: Wed, 11 Mar 2026 16:13:40 +0800
Subject: research: Add packfile bloom filter RFC

---
 research/packfile_bloom.txt | 130 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 130 insertions(+)
 create mode 100644 research/packfile_bloom.txt

(limited to 'research')

diff --git a/research/packfile_bloom.txt b/research/packfile_bloom.txt
new file mode 100644
index 00000000..7a3aae02
--- /dev/null
+++ b/research/packfile_bloom.txt
@@ -0,0 +1,130 @@
+Packfile bloom filter RFC
+=========================
+
+Problem
+-------
+
+Especially for server-side usages, repacking is extremely expensive, and
+creating multi-pack-indexes is still rather expensive. Incremental MIDX
+partially solves this, but would defeat the purpose of MIDX when there are too
+many of them, as Git would still have to walk the MIDXes in order while
+performing expensive indexing queries.
+
+Idea
+----
+
+Each MIDX layer, and each non-MIDX index, comes with a bloom filter. MIDXes and
+ordinary .idx files are still traversed in their usual order, but the first
+step when traversing them, is to check whether that index could possibly have
+the desired object, through a bloom filter.
+
+We will want the filters to be mmaped, and we want the lookup cost to be
+dominated by one cache-line read rather than using many scattered reads.
+Therefore, a blocked bloom filter is likely the right direction here. The steps
+are as follows:
+
+1. Split the filter into 64-octet buckets, since 64 octets is the most common
+   cache-line size.
+2. Use some bits of the object ID to choose the bucket.
+3. Use the rest of the key to choose several bit positions inside that bucket.
+4. A lookup thus reads one 64-octet bucket and checks whether all required bits
+   are set.
+
+Note on Object IDs
+------------------
+
+Git object IDs are cryptographic hashes (e.g., currently either SHA-256
+or SHA-1), and are thus uniformly distributed in non-pathological scenarios.
+See also the "Security considerations" section.
+
+Definitions
+-----------
+
+Let:
+
+     B := number of buckets
+     K := number of bits set and tested per object ID
+
+* All integers here are big endian.
+* The OID is to be interpreted as a big-endian bitstring, where bit offset 0
+  is the most significant bit of octet 0.
+* log2(B) + 9K <= hash length in bits.
+
+File layout
+-----------
+
+* 4-octet signature: {'I', 'D', 'B', 'L'}
+* 4-octet version identifier (= 1)
+* 4-octet object hash algorithm identifier (= 1 for SHA-1, 2 for SHA-256)
+* 4-octet B (number of buckets)
+* 2-octet K (number of bits set and tested per object ID)
+* 6-octet padding (must be all zeros)
+* B buckets of 64 octets each.
+
+Validation
+----------
+
+* Matching signature
+* Supported version (the rest of the rules are for this version)
+* Hash function identifier must be recognized
+* B must be nonzero and a power of two
+* K must be nonzero
+* log2(B) + 9K <= hash length in bits
+* Padding must be all zero
+* File size must be 24 + 64 * B octets
+
+Lookup procedure
+----------------
+
+1. Let b be the unsigned integer encoded by the most significant log2(B) bits
+   of OID. B is a power of two, and 0 <= b < B.
+2. Select and read bucket b.
+3. For each 0 <= i < K:
+   1. Start immediately after the most significant log2(B) bits of OID, let the
+      i-th 9-bit field be the bits at offset 9 * i through 9 * i + 8 within the
+      next 9 * K bits of the OID.
+   2. Let pi be the unsigned integer encoded by that 9-bit field.
+      Then, 0 <= pi < 512.
+   3. Compute wi := pi >> 6, and bi := pi & 63.
+      Thus, wi identifies one of the 8 64-bit words in bucket b, and bi
+      identifies one bit within that word.
+   4. Test whether bi is set in the word wi of bucket b. (Within each 64-bit
+      word, bit index 0 denotes the most significant bit, and bit index 63
+      denotes the least significant bit.)
+
+If any test fails, the OID is definitely not in the relevant idx.
+If all tests succeed, the OID may be in the relevant idx.
+
+Note that two of the K 9-bit fields can decode to the same pi, which means an
+insertion may set fewer than K distinct bits.
+
+Worked example
+--------------
+
+Let:
+
+    B = 1 << 15 = 32768
+    K = 8
+
+Then, log2(B) = 15. Each lookup thus uses 15 bits to choose the bucket
+and 8 * 9 = 72 bits to choose the in-bucket positions, for a total of
+87 bits taken from the object ID.
+
+1. Read the first 15 bits of OID and interpret them as b, where
+   0 <= b < 32768.
+2. Read bucket b.
+3. For each 0 <= i < 8:
+   1. Read the i-th 9-bit field from the next 72 bits of OID and interpret it
+      as pi, where 0 <= pi < 512.
+   2. Compute: wi = pi >> 6, bi = pi & 63.
+   3. Test whether bit bi is set in the word wi of bucket b.
+
+Security considerations
+------------------------
+
+An adversarial packfile where objects are (computationally intensive, even for
+SHA-1 as vulnerable as it is) constructed to have the same prefix for the
+relevant object format hash algorithm could be used to fill up the bloom
+filters, rendering some buckets useless. In the worst case, if they somehow
+fill all filters, this proposal's optimizations become useless, but would not
+be a significant DoS vector.
-- 
cgit v1.3.1-10-gc9f91