diff options
| author | 2026-06-14 13:21:43 +0000 | |
|---|---|---|
| committer | 2026-06-14 13:21:43 +0000 | |
| commit | eb643bd6cc46db8cf00f68b2ddf4a5d6afd4d252 (patch) | |
| tree | 037dbd23281da795932bac0547f0d67347db9fd0 /internal/format/packidx/bloom/doc.go | |
| parent | research: Add back here (diff) | |
internal/format/packidx/bloom: Add
Diffstat (limited to 'internal/format/packidx/bloom/doc.go')
| -rw-r--r-- | internal/format/packidx/bloom/doc.go | 122 |
1 files changed, 122 insertions, 0 deletions
diff --git a/internal/format/packidx/bloom/doc.go b/internal/format/packidx/bloom/doc.go new file mode 100644 index 00000000..df7ded6c --- /dev/null +++ b/internal/format/packidx/bloom/doc.go @@ -0,0 +1,122 @@ +// Package bloom provides a blocked Bloom filter +// for pack indexes. +// +// A filter answers, from a single cache-line-sized read, +// whether an object ID is definitely absent from the index it covers. +// A lookup that must consult many packs +// can then skip the full binary search +// in every pack whose filter rejects the object, +// decreasing the cost of misses. +// +// # Rationale +// +// Especially for server-side usage, +// repacking is expensive, +// and creating multi-pack-indexes is still rather expensive. +// Incremental multi-pack-indexes partially solve this, +// but having too many of them defeats the purpose, +// since the indexes must still be walked in order +// while performing expensive lookups. +// +// Instead, each multi-pack-index layer, +// and each ordinary pack index, +// may carry its own filter. +// The indexes are still traversed in their usual order, +// but the first step when traversing one +// is to check whether it could possibly hold the wanted object. +// +// The filter is split into 64-octet buckets, +// matching the most common cache-line size. +// Some bits of the object ID choose the bucket, +// and the rest choose several bit positions inside it, +// so a lookup reads one 64-octet bucket +// and checks whether all required bits are set. +// +// # Parameters +// +// A filter is parameterized by +// the number of buckets B +// and the number of bits set and tested per object ID, K. +// All integers in the format are big endian. +// The object ID is interpreted as a big-endian bitstring, +// where bit offset 0 is the most significant bit of octet 0. +// B must be a nonzero power of two, +// K must be nonzero, +// and log2(B) + 9*K must not exceed the hash length in bits. +// +// # File format +// +// A filter file is a 64-octet header +// followed by B buckets of 64 octets each: +// +// - 4-octet signature: {'I', 'D', 'B', 'L'}. +// - 4-octet version identifier (= 1). +// - 4-octet object hash algorithm identifier +// (= 1 for SHA-1, 2 for SHA-256). +// - 4-octet B, the number of buckets. +// - 2-octet K, the number of bits set and tested per object ID. +// - 46-octet padding, which must be all zero. +// - B buckets of 64 octets each. +// +// A reader must validate that +// the signature matches, +// the version is supported, +// the hash function identifier is recognized, +// B is nonzero and a power of two, +// K is nonzero, +// log2(B) + 9*K does not exceed the hash length in bits, +// the padding is all zero, +// and the file size is exactly 64 + 64*B octets. +// +// # Lookup +// +// A lookup against one filter proceeds as follows: +// +// 1. Let b be the unsigned integer encoded +// by the most significant log2(B) bits of the object ID. +// B is a power of two, so 0 <= b < B. +// 2. Select and read bucket b. +// 3. For each 0 <= i < K, +// take the i-th 9-bit field +// from the 9*K bits that follow the bucket-selecting bits, +// and let pi be the unsigned integer it encodes, +// so 0 <= pi < 512. +// Compute wi = pi >> 6 and bi = pi & 63, +// so wi identifies one of the eight 64-bit words in bucket b +// and bi identifies one bit within that word. +// Within each 64-bit word, +// bit index 0 is the most significant bit +// and bit index 63 is the least significant bit. +// Test whether bit bi is set in word wi of bucket b. +// +// If any test fails, +// the object ID is definitely not in the covered index. +// If all tests succeed, +// the object ID may be in it. +// Two of the K 9-bit fields can decode to the same pi, +// so an insertion may set fewer than K distinct bits; +// this only raises the false positive rate +// and never causes a false negative. +// +// # Worked example +// +// Let B = 1 << 15 = 32768 and K = 8. +// Then log2(B) = 15, +// so each lookup uses 15 bits to choose the bucket +// and 8*9 = 72 bits to choose the in-bucket positions, +// for a total of 87 bits taken from the object ID. +// A SHA-1 has 160 bits and a SHA-256 has 256 bits, +// so both leave ample headroom. +// +// # Security considerations +// +// Object IDs are public unkeyed hashes, +// so an adversary can mine packs +// whose object IDs share a chosen prefix +// to crowd objects into one bucket +// and fill its bits. +// In the worst case this renders some buckets useless, +// making the filter degrade to "may contain" for those buckets, +// but it never produces a false negative +// and is not a significant denial-of-service vector. +package bloom |
