aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorGravatar Runxi Yu2026-06-14 12:26:19 +0000
committerGravatar Runxi Yu2026-06-14 12:26:19 +0000
commitcc89a9979bd470cd874e12561918f718af45c8c3 (patch)
tree71d0cdfbd83f58ede2683f1333df0138cf801d66
parentinternal/format/packidx: Lookup interpolation heuristic (diff)
research: Add back here
-rw-r--r--research/dynamic_packfiles.txt179
-rw-r--r--research/packfile_bloom.txt133
2 files changed, 312 insertions, 0 deletions
diff --git a/research/dynamic_packfiles.txt b/research/dynamic_packfiles.txt
new file mode 100644
index 00000000..e4fe7e54
--- /dev/null
+++ b/research/dynamic_packfiles.txt
@@ -0,0 +1,179 @@
+dynamic packfiles to append objects
+
+gc/refcount process punches page-sized holes in them for pages fully
+within the space of unwanted objects, after setting a tombstone mark
+
+holes are recorded in an index and re-used
+
+then, if desired, the repack process removes all the punched holes
+and anything surrounding from unwanted objects that are slightly out
+of the page boundary
+
+repack is not really git's repack algorithm, it's bascially just
+defragmentation.
+
+genreational bloom filters
+
+idx design
+==========
+
+so, let's first get our invariants and patterns clear.
+
+* fixed-length cryptographic object IDs
+* essentially uniform key distribution
+* exact lookup only, no range scans, no ordered iteration requirements
+* reads are extremely important
+* writes are mostly append-like
+* deletes/tombstones may happen later but are secondary
+
+1st design
+----------
+
+* mutable front index
+* immutable base index
+* period merge/compaction into a new base generation
+
+
+
+upload-pack/send-pack/defrag
+============================
+
+take current pack, remove dead objects/holes, filter objects out, record
+offsets and adjust ofs_deltas since they always go backwards, write the pack
+back; then stream written pack to client. two-step necessary because pack
+header includes object count; could have a custom new protocol that doesn't do
+so.
+
+random chat log dump
+====================
+<~runxiyu> ori: actually. i think my hashtable-ish .idx scheme doesn't work really well with e.g. "user provided us a small part of the hash"
+<~runxiyu> and when using the Git CLI, abbreviated hashes are extremely common....
+<~runxiyu> not lik ei'd need them in a *forge*
+<~runxiyu> but ugh
+<~runxiyu> i guess i'm going with some sort of b-tree :((
+<~runxiyu> ~~maybe i should just port gefs to git~~
+<&ori> runxiyu: why not? you should be able to pick the pages based on the prefix and then scan, no?
+<~rx> ori: i need to somehow munge the has to prevent page directory explosions
+<~rx> the hash*
+<~rx> e.g. siphash(objectid, secret)
+<~rx> otherwise an attacker could give you 10M objects that start with 00000 and whatnot
+<&ori> what's the worst case that would happen there, and is it exponentially worse than giving you 10M objects that start with anything?
+<&ori> I'm thinking that you can't generate a case worse than 256/nobject extra table lookups, assuming one bit per fanout..
+<~runxiyu> ori: for extendible hashing, yes, definitely worse
+<~runxiyu> the directory will expand a lot for no good reason
+<&ori> yes, but you have 256 bits of hash
+<&ori> how much is a lot worse?
+<&ori> what's the worst an attacker can do, and how is the impact worse than uploading 10M giant objects?
+<&ori> also, spotted a bag of kuai kuai keeping the cash register working today at a tea shop
+<~runxiyu> waitt
+<~runxiyu> hmmm
+ * runxiyu looks agagin if it's O(N) or O(2^N)
+<~runxiyu> well
+<~runxiyu> i think it should be a O(2^n) directory size when the attacker can control n bits prefix
+<&ori> what's the 'n' here?
+<~runxiyu> > can control n bits prefix
+<&ori> yeah, you run out of prefix pretty quickly, though
+<&ori> I'm not seeing how you could get an exponential blowup if you share pages
+<&ori> may be missing something, though
+<~runxiyu> hm
+<&ori> oh, wait, I see
+<&ori> no, wait
+<~runxiyu> i think im confusing myself too to some extent but something doesn't feel right
+<~runxiyu> urgh
+<~runxiyu> okay, rethinking this
+<~runxiyu> d is the global depth
+<~runxiyu> diretory size is 2^d
+<~runxiyu> B records per bucket
+<~runxiyu> whatever happens inside the bucket idc, let's say it's a linked list
+<~runxiyu> whatever happens inside the bucket idc, let's say it's an array* (linked lists suck)
+<~runxiyu> l <= d
+<~runxiyu> (l being the local depth of a bucket)
+<~runxiyu> normal: d = log^2(N/B)
+<&ori> ahh, I see.
+<~runxiyu> N is the object count
+<&ori> yes, so what if you binary searched the page directory, or made it multi-level
+<~runxiyu> an attacker could grab a giant repo and find commonly-prefixed objects, they don't need to brute force their own
+<~runxiyu> ori: remember we're trying to do something easy to add new objects into
+<~runxiyu> how'd you do that with a binary search?
+<~runxiyu> not sure what you mean by multi-level yet here
+<~runxiyu> well, it could just turn into a b+tree...
+<~runxiyu> hm
+<&ori> multilevel -- you have pd[0] using bits 0..n
+<~runxiyu> maybe an lmdb object store isn't too bad after all
+<&ori> pd[0][1] using bits n...m
+<&ori> etc
+<&ori> and the reason I was a bit confused was that I had thought the directory was a trie
+<&ori> rather than just an expanding top level directory
+<~runxiyu> ah
+<&ori> so, yeah, I was thinking you could make the page directory an actual trie
+<~runxiyu> sigh
+<~runxiyu> i guess abbreviated object IDs is something i can't really skip.
+<~runxiyu> ori: ill look into radix trees and LSM trees too
+<~runxiyu> well, you're basically suggesting a radix tree i guess
+<~runxiyu> well actually
+<~runxiyu> radix might not necessarily be the best trie here
+<~runxiyu> idk
+<~runxiyu> hm
+<~runxiyu> firstly im really heavy on reads
+<~runxiyu> and random keys with no sequential access
+<~runxiyu> ok LSM makes no sense
+<&hax[xor]> > O(2^N)
+<~runxiyu> ori: thoughts on how to make tries reasonable to use on disks?
+<&hax[xor]> that sounds like something is already very broken
+<~runxiyu> hax[xor]: wdym
+<&hax[xor]> directory size should absolutely not scale like that
+<~runxiyu> hax[xor]: maybe read up on how extendible hsahing works again?
+<&hax[xor]> probably but if that's how it scales it still sounds verybroken
+<~runxiyu> n is not the amount of objects
+<~runxiyu> it's a pathlogic condition caused by chosne-prefix keys
+<~runxiyu> (your keys are usually supposed to be hashed into something the attacker can't predict)
+<&hax[xor]> if you mean the directory size scales linearly with the number of objects the attacker puts in it... that sounds perfectly normal?
+<&ori> runxiyu: same as extendible hashing, just after you extend to, say, 8 bits, you stop splitting the page directory, and have subdirectories
+<~runxiyu> ori: that could make senes
+<~runxiyu> haven't thought it through
+<~runxiyu> directory size is 2^d, d being the global depth
+<~runxiyu> urgh i need to review for exams
+<~runxiyu> okay
+<~runxiyu> write amplification issue
+<~runxiyu> im not sure how significant this is for realistic git workloads
+<~runxiyu> i haven't counted, but there should be many, many, many more reads than writes
+<~runxiyu> if write amplification is really an issue
+<&ori> I may go wander around a bit.
+<~runxiyu> then ill just port gefs
+<~runxiyu> ori: do you mean IRL, or over dynamic pack data structures-
+<&ori> irl.
+<~runxiyu> alright that makes more sense :P
+<&ori> tomorrow I think I check out Jiufen
+<~runxiyu> frick i want to be able to type epsilon with compose
+<&ori> is that not possible?
+<~runxiyu> i don't seem to be able to
+<~runxiyu> but idk the compose tables on my system
+<~runxiyu> ε
+<~runxiyu> well
+<~runxiyu> unicode hex input always works :/
+<~runxiyu> OKAY FUCK
+<~runxiyu> I keep getting distracted by interesting things
+<~runxiyu> I need to review for my fucking exams
+-- Mode #chat [-q runxiyu] by runxiyu
+-- Mode #chat [-a runxiyu] by runxiyu
+-- #chat: You must be a channel halfop or higher to set channel mode b (ban).
+-- Mode #chat [+b mute:account:runxiyu] by runxiyu
+-- #chat: You cannot send messages to this channel whilst a m: (mute) extban is set matching you.
+-- #chat: You cannot send messages to this channel whilst a m: (mute) extban is set matching you.
+<&f_> does that even work?
+<&ori> for 9front, <alt>*e gives ε
+<&ori> but, don't remember the compose map
+<&ori> thought that there was a similar thing for all greek letters
+
+
+See also:
+https://github.com/inkandswitch/darn
+https://www.youtube.com/watch?v=nk4nefmguZk
+https://crates.io/crates/iroh-blobs
+https://crates.io/crates/bao-tree
+
+
+Actually, who cares about abbreviated hashes?
+Clients. Clients only.
+It will be a separate interface satisfied by "normal repos"
+but not satisfied by our store. Good enough.
diff --git a/research/packfile_bloom.txt b/research/packfile_bloom.txt
new file mode 100644
index 00000000..63acafbe
--- /dev/null
+++ b/research/packfile_bloom.txt
@@ -0,0 +1,133 @@
+Packfile bloom filter RFC
+=========================
+
+Problem
+-------
+
+Especially for server-side usages, repacking is extremely expensive, and
+creating multi-pack-indexes is still rather expensive. Incremental MIDX
+partially solves this, but would defeat the purpose of MIDX when there are too
+many of them, as Git would still have to walk the MIDXes in order while
+performing expensive indexing queries.
+
+Idea
+----
+
+Each MIDX layer, and each non-MIDX index, comes with a bloom filter. MIDXes and
+ordinary .idx files are still traversed in their usual order, but the first
+step when traversing them, is to check whether that index could possibly have
+the desired object, through a bloom filter.
+
+We will want the filters to be mmaped, and we want the lookup cost to be
+dominated by one cache-line read rather than using many scattered reads.
+Therefore, a blocked bloom filter is likely the right direction here. The steps
+are as follows:
+
+1. Split the filter into 64-octet buckets, since 64 octets is the most common
+ cache-line size.
+2. Use some bits of the object ID to choose the bucket.
+3. Use the rest of the key to choose several bit positions inside that bucket.
+4. A lookup thus reads one 64-octet bucket and checks whether all required bits
+ are set.
+
+Definitions
+-----------
+
+Let:
+
+ B := number of buckets
+ K := number of bits set and tested per object ID
+
+* All integers here are big endian.
+* The OID is to be interpreted as a big-endian bitstring, where bit offset 0
+ is the most significant bit of octet 0.
+* log2(B) + 9K <= hash length in bits.
+
+File layout
+-----------
+
+* 4-octet signature: {'I', 'D', 'B', 'L'}
+* 4-octet version identifier (= 1)
+* 4-octet object hash algorithm identifier (= 1 for SHA-1, 2 for SHA-256)
+* 4-octet B (number of buckets)
+* 2-octet K (number of bits set and tested per object ID)
+* 46-octet padding (must be all zeros)
+* B buckets of 64 octets each.
+
+Validation
+----------
+
+* Matching signature
+* Supported version (the rest of the rules are for this version)
+* Hash function identifier must be recognized
+* B must be nonzero and a power of two
+* K must be nonzero
+* log2(B) + 9K <= hash length in bits
+* Padding must be all zero
+* File size must be 64 + 64 * B octets
+
+Lookup procedure
+----------------
+
+1. Let b be the unsigned integer encoded by the most significant log2(B) bits
+ of OID. B is a power of two, and 0 <= b < B.
+2. Select and read bucket b.
+3. For each 0 <= i < K:
+ 1. Start immediately after the most significant log2(B) bits of OID, let the
+ i-th 9-bit field be the bits at offset 9 * i through 9 * i + 8 within the
+ next 9 * K bits of the OID.
+ 2. Let pi be the unsigned integer encoded by that 9-bit field.
+ Then, 0 <= pi < 512.
+ 3. Compute wi := pi >> 6, and bi := pi & 63.
+ Thus, wi identifies one of the 8 64-bit words in bucket b, and bi
+ identifies one bit within that word.
+ 4. Test whether bi is set in the word wi of bucket b. (Within each 64-bit
+ word, bit index 0 denotes the most significant bit, and bit index 63
+ denotes the least significant bit.)
+
+If any test fails, the OID is definitely not in the relevant idx.
+If all tests succeed, the OID may be in the relevant idx.
+
+Note that two of the K 9-bit fields can decode to the same pi, which means an
+insertion may set fewer than K distinct bits.
+
+Worked example
+--------------
+
+Let:
+
+ B = 1 << 15 = 32768
+ K = 8
+
+Then, log2(B) = 15. Each lookup thus uses 15 bits to choose the bucket
+and 8 * 9 = 72 bits to choose the in-bucket positions, for a total of
+87 bits taken from the object ID.
+
+1. Read the first 15 bits of OID and interpret them as b, where
+ 0 <= b < 32768.
+2. Read bucket b.
+3. For each 0 <= i < 8:
+ 1. Read the i-th 9-bit field from the next 72 bits of OID and interpret it
+ as pi, where 0 <= pi < 512.
+ 2. Compute: wi = pi >> 6, bi = pi & 63.
+ 3. Test whether bit bi is set in the word wi of bucket b.
+
+Security considerations
+------------------------
+
+An adversarial packfile where objects are (computationally intensive, even for
+SHA-1 as vulnerable as it is) constructed to have the same prefix for the
+relevant object format hash algorithm could be used to fill up the bloom
+filters, rendering some buckets useless. In the worst case, if they somehow
+fill all filters, this proposal's optimizations become useless, but would not
+be a significant DoS vector.
+
+TODOs
+-----
+
+* Consider dropping mmap (page read vs cachline read)
+* How should B and K be chosen?
+* How does creation/insert work? Note that packfiles and `.idx`es are immutable.
+* What are the sizes?
+* What are the false positive rates?
+* How are benchmarks?