From cc89a9979bd470cd874e12561918f718af45c8c3 Mon Sep 17 00:00:00 2001 From: Runxi Yu Date: Sun, 14 Jun 2026 12:26:19 +0000 Subject: research: Add back here --- research/dynamic_packfiles.txt | 179 +++++++++++++++++++++++++++++++++++++++++ research/packfile_bloom.txt | 133 ++++++++++++++++++++++++++++++ 2 files changed, 312 insertions(+) create mode 100644 research/dynamic_packfiles.txt create mode 100644 research/packfile_bloom.txt diff --git a/research/dynamic_packfiles.txt b/research/dynamic_packfiles.txt new file mode 100644 index 00000000..e4fe7e54 --- /dev/null +++ b/research/dynamic_packfiles.txt @@ -0,0 +1,179 @@ +dynamic packfiles to append objects + +gc/refcount process punches page-sized holes in them for pages fully +within the space of unwanted objects, after setting a tombstone mark + +holes are recorded in an index and re-used + +then, if desired, the repack process removes all the punched holes +and anything surrounding from unwanted objects that are slightly out +of the page boundary + +repack is not really git's repack algorithm, it's bascially just +defragmentation. + +genreational bloom filters + +idx design +========== + +so, let's first get our invariants and patterns clear. + +* fixed-length cryptographic object IDs +* essentially uniform key distribution +* exact lookup only, no range scans, no ordered iteration requirements +* reads are extremely important +* writes are mostly append-like +* deletes/tombstones may happen later but are secondary + +1st design +---------- + +* mutable front index +* immutable base index +* period merge/compaction into a new base generation + + + +upload-pack/send-pack/defrag +============================ + +take current pack, remove dead objects/holes, filter objects out, record +offsets and adjust ofs_deltas since they always go backwards, write the pack +back; then stream written pack to client. two-step necessary because pack +header includes object count; could have a custom new protocol that doesn't do +so. + +random chat log dump +==================== +<~runxiyu> ori: actually. i think my hashtable-ish .idx scheme doesn't work really well with e.g. "user provided us a small part of the hash" +<~runxiyu> and when using the Git CLI, abbreviated hashes are extremely common.... +<~runxiyu> not lik ei'd need them in a *forge* +<~runxiyu> but ugh +<~runxiyu> i guess i'm going with some sort of b-tree :(( +<~runxiyu> ~~maybe i should just port gefs to git~~ +<&ori> runxiyu: why not? you should be able to pick the pages based on the prefix and then scan, no? +<~rx> ori: i need to somehow munge the has to prevent page directory explosions +<~rx> the hash* +<~rx> e.g. siphash(objectid, secret) +<~rx> otherwise an attacker could give you 10M objects that start with 00000 and whatnot +<&ori> what's the worst case that would happen there, and is it exponentially worse than giving you 10M objects that start with anything? +<&ori> I'm thinking that you can't generate a case worse than 256/nobject extra table lookups, assuming one bit per fanout.. +<~runxiyu> ori: for extendible hashing, yes, definitely worse +<~runxiyu> the directory will expand a lot for no good reason +<&ori> yes, but you have 256 bits of hash +<&ori> how much is a lot worse? +<&ori> what's the worst an attacker can do, and how is the impact worse than uploading 10M giant objects? +<&ori> also, spotted a bag of kuai kuai keeping the cash register working today at a tea shop +<~runxiyu> waitt +<~runxiyu> hmmm + * runxiyu looks agagin if it's O(N) or O(2^N) +<~runxiyu> well +<~runxiyu> i think it should be a O(2^n) directory size when the attacker can control n bits prefix +<&ori> what's the 'n' here? +<~runxiyu> > can control n bits prefix +<&ori> yeah, you run out of prefix pretty quickly, though +<&ori> I'm not seeing how you could get an exponential blowup if you share pages +<&ori> may be missing something, though +<~runxiyu> hm +<&ori> oh, wait, I see +<&ori> no, wait +<~runxiyu> i think im confusing myself too to some extent but something doesn't feel right +<~runxiyu> urgh +<~runxiyu> okay, rethinking this +<~runxiyu> d is the global depth +<~runxiyu> diretory size is 2^d +<~runxiyu> B records per bucket +<~runxiyu> whatever happens inside the bucket idc, let's say it's a linked list +<~runxiyu> whatever happens inside the bucket idc, let's say it's an array* (linked lists suck) +<~runxiyu> l <= d +<~runxiyu> (l being the local depth of a bucket) +<~runxiyu> normal: d = log^2(N/B) +<&ori> ahh, I see. +<~runxiyu> N is the object count +<&ori> yes, so what if you binary searched the page directory, or made it multi-level +<~runxiyu> an attacker could grab a giant repo and find commonly-prefixed objects, they don't need to brute force their own +<~runxiyu> ori: remember we're trying to do something easy to add new objects into +<~runxiyu> how'd you do that with a binary search? +<~runxiyu> not sure what you mean by multi-level yet here +<~runxiyu> well, it could just turn into a b+tree... +<~runxiyu> hm +<&ori> multilevel -- you have pd[0] using bits 0..n +<~runxiyu> maybe an lmdb object store isn't too bad after all +<&ori> pd[0][1] using bits n...m +<&ori> etc +<&ori> and the reason I was a bit confused was that I had thought the directory was a trie +<&ori> rather than just an expanding top level directory +<~runxiyu> ah +<&ori> so, yeah, I was thinking you could make the page directory an actual trie +<~runxiyu> sigh +<~runxiyu> i guess abbreviated object IDs is something i can't really skip. +<~runxiyu> ori: ill look into radix trees and LSM trees too +<~runxiyu> well, you're basically suggesting a radix tree i guess +<~runxiyu> well actually +<~runxiyu> radix might not necessarily be the best trie here +<~runxiyu> idk +<~runxiyu> hm +<~runxiyu> firstly im really heavy on reads +<~runxiyu> and random keys with no sequential access +<~runxiyu> ok LSM makes no sense +<&hax[xor]> > O(2^N) +<~runxiyu> ori: thoughts on how to make tries reasonable to use on disks? +<&hax[xor]> that sounds like something is already very broken +<~runxiyu> hax[xor]: wdym +<&hax[xor]> directory size should absolutely not scale like that +<~runxiyu> hax[xor]: maybe read up on how extendible hsahing works again? +<&hax[xor]> probably but if that's how it scales it still sounds verybroken +<~runxiyu> n is not the amount of objects +<~runxiyu> it's a pathlogic condition caused by chosne-prefix keys +<~runxiyu> (your keys are usually supposed to be hashed into something the attacker can't predict) +<&hax[xor]> if you mean the directory size scales linearly with the number of objects the attacker puts in it... that sounds perfectly normal? +<&ori> runxiyu: same as extendible hashing, just after you extend to, say, 8 bits, you stop splitting the page directory, and have subdirectories +<~runxiyu> ori: that could make senes +<~runxiyu> haven't thought it through +<~runxiyu> directory size is 2^d, d being the global depth +<~runxiyu> urgh i need to review for exams +<~runxiyu> okay +<~runxiyu> write amplification issue +<~runxiyu> im not sure how significant this is for realistic git workloads +<~runxiyu> i haven't counted, but there should be many, many, many more reads than writes +<~runxiyu> if write amplification is really an issue +<&ori> I may go wander around a bit. +<~runxiyu> then ill just port gefs +<~runxiyu> ori: do you mean IRL, or over dynamic pack data structures- +<&ori> irl. +<~runxiyu> alright that makes more sense :P +<&ori> tomorrow I think I check out Jiufen +<~runxiyu> frick i want to be able to type epsilon with compose +<&ori> is that not possible? +<~runxiyu> i don't seem to be able to +<~runxiyu> but idk the compose tables on my system +<~runxiyu> ε +<~runxiyu> well +<~runxiyu> unicode hex input always works :/ +<~runxiyu> OKAY FUCK +<~runxiyu> I keep getting distracted by interesting things +<~runxiyu> I need to review for my fucking exams +-- Mode #chat [-q runxiyu] by runxiyu +-- Mode #chat [-a runxiyu] by runxiyu +-- #chat: You must be a channel halfop or higher to set channel mode b (ban). +-- Mode #chat [+b mute:account:runxiyu] by runxiyu +-- #chat: You cannot send messages to this channel whilst a m: (mute) extban is set matching you. +-- #chat: You cannot send messages to this channel whilst a m: (mute) extban is set matching you. +<&f_> does that even work? +<&ori> for 9front, *e gives ε +<&ori> but, don't remember the compose map +<&ori> thought that there was a similar thing for all greek letters + + +See also: +https://github.com/inkandswitch/darn +https://www.youtube.com/watch?v=nk4nefmguZk +https://crates.io/crates/iroh-blobs +https://crates.io/crates/bao-tree + + +Actually, who cares about abbreviated hashes? +Clients. Clients only. +It will be a separate interface satisfied by "normal repos" +but not satisfied by our store. Good enough. diff --git a/research/packfile_bloom.txt b/research/packfile_bloom.txt new file mode 100644 index 00000000..63acafbe --- /dev/null +++ b/research/packfile_bloom.txt @@ -0,0 +1,133 @@ +Packfile bloom filter RFC +========================= + +Problem +------- + +Especially for server-side usages, repacking is extremely expensive, and +creating multi-pack-indexes is still rather expensive. Incremental MIDX +partially solves this, but would defeat the purpose of MIDX when there are too +many of them, as Git would still have to walk the MIDXes in order while +performing expensive indexing queries. + +Idea +---- + +Each MIDX layer, and each non-MIDX index, comes with a bloom filter. MIDXes and +ordinary .idx files are still traversed in their usual order, but the first +step when traversing them, is to check whether that index could possibly have +the desired object, through a bloom filter. + +We will want the filters to be mmaped, and we want the lookup cost to be +dominated by one cache-line read rather than using many scattered reads. +Therefore, a blocked bloom filter is likely the right direction here. The steps +are as follows: + +1. Split the filter into 64-octet buckets, since 64 octets is the most common + cache-line size. +2. Use some bits of the object ID to choose the bucket. +3. Use the rest of the key to choose several bit positions inside that bucket. +4. A lookup thus reads one 64-octet bucket and checks whether all required bits + are set. + +Definitions +----------- + +Let: + + B := number of buckets + K := number of bits set and tested per object ID + +* All integers here are big endian. +* The OID is to be interpreted as a big-endian bitstring, where bit offset 0 + is the most significant bit of octet 0. +* log2(B) + 9K <= hash length in bits. + +File layout +----------- + +* 4-octet signature: {'I', 'D', 'B', 'L'} +* 4-octet version identifier (= 1) +* 4-octet object hash algorithm identifier (= 1 for SHA-1, 2 for SHA-256) +* 4-octet B (number of buckets) +* 2-octet K (number of bits set and tested per object ID) +* 46-octet padding (must be all zeros) +* B buckets of 64 octets each. + +Validation +---------- + +* Matching signature +* Supported version (the rest of the rules are for this version) +* Hash function identifier must be recognized +* B must be nonzero and a power of two +* K must be nonzero +* log2(B) + 9K <= hash length in bits +* Padding must be all zero +* File size must be 64 + 64 * B octets + +Lookup procedure +---------------- + +1. Let b be the unsigned integer encoded by the most significant log2(B) bits + of OID. B is a power of two, and 0 <= b < B. +2. Select and read bucket b. +3. For each 0 <= i < K: + 1. Start immediately after the most significant log2(B) bits of OID, let the + i-th 9-bit field be the bits at offset 9 * i through 9 * i + 8 within the + next 9 * K bits of the OID. + 2. Let pi be the unsigned integer encoded by that 9-bit field. + Then, 0 <= pi < 512. + 3. Compute wi := pi >> 6, and bi := pi & 63. + Thus, wi identifies one of the 8 64-bit words in bucket b, and bi + identifies one bit within that word. + 4. Test whether bi is set in the word wi of bucket b. (Within each 64-bit + word, bit index 0 denotes the most significant bit, and bit index 63 + denotes the least significant bit.) + +If any test fails, the OID is definitely not in the relevant idx. +If all tests succeed, the OID may be in the relevant idx. + +Note that two of the K 9-bit fields can decode to the same pi, which means an +insertion may set fewer than K distinct bits. + +Worked example +-------------- + +Let: + + B = 1 << 15 = 32768 + K = 8 + +Then, log2(B) = 15. Each lookup thus uses 15 bits to choose the bucket +and 8 * 9 = 72 bits to choose the in-bucket positions, for a total of +87 bits taken from the object ID. + +1. Read the first 15 bits of OID and interpret them as b, where + 0 <= b < 32768. +2. Read bucket b. +3. For each 0 <= i < 8: + 1. Read the i-th 9-bit field from the next 72 bits of OID and interpret it + as pi, where 0 <= pi < 512. + 2. Compute: wi = pi >> 6, bi = pi & 63. + 3. Test whether bit bi is set in the word wi of bucket b. + +Security considerations +------------------------ + +An adversarial packfile where objects are (computationally intensive, even for +SHA-1 as vulnerable as it is) constructed to have the same prefix for the +relevant object format hash algorithm could be used to fill up the bloom +filters, rendering some buckets useless. In the worst case, if they somehow +fill all filters, this proposal's optimizations become useless, but would not +be a significant DoS vector. + +TODOs +----- + +* Consider dropping mmap (page read vs cachline read) +* How should B and K be chosen? +* How does creation/insert work? Note that packfiles and `.idx`es are immutable. +* What are the sizes? +* What are the false positive rates? +* How are benchmarks? -- cgit v1.3.1-10-gc9f91