QP‑trie in NSD
Tony Finch edited this page 2022-04-06 16:43:16 +01:00

In July 2020 and June 2021, before I joined ISC, I (Tony Finch) integrated a DNS-specific version of my qp-trie data structure into NSD. NSD is an attractive platform for this kind of work:

  • NSD is relatively simple for a real DNS server

  • it already has a choice between a red-black tree and a radix tree, so adding a third option is not too hard.

previous writing on this topic

summary of my fork of NSD

There are a few branches in my fork of NSD, corresponding to various phases of my experiments.

  • fanf-qp: the basic DNS-trie implementation from July 2020, which uses 16 byte nodes, does not support concurrent updates, and uses the system allocator

  • fanf-cow: first iteration of the new copy-on-write strategy, using a custom bump allocator with a pool of pages and 12 byte nodes.

    Relatively aggressive compaction, in that it traverses deep into the tree to find pages that are fragmented and should be compacted. This had the awkward effect of causing COW mutations for nodes in low-fragmentation pages in the middle tiers of the tree, pushing those pages past the fragmentation threshold and leading to a second compaction run almost immediately, so it ended up being a very complicated way of compacting the whole tree.

  • fanf-cjc: a simple two-space copying / compacting collector. (cjc = Chris Cheney, inventor of the stackless graph copy.)

    In reaction to the bad behaviour of the fanf-cow branch: how much simpler would it be to get rid of the page pool and use a single contiguous arena instead? Quite a lot simpler, but the GC is correspondingly time-consuming because it always compacts the whole tree.

  • fanf-gen: a refinement of fanf-cow that tries to apply the generational hypothesis more intelligently. (generational hypothesis = recently allocated nodes are short-lived, so GC is faster if it arranges objects into age-based tiers, and collects the young tiers more frequently.)

    In this version the tree is compacted from the top down (top = frequently mutated, therefore probably more fragmented); when it finds a node in an unfragmented page, it skips the node and all its children, to avoid causing more fragmentation.

I think fanf-gen is the best version so far, so it is the HEAD branch of my NSD fork.
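The fanf-gen rule — compact from the top down, and skip a whole subtree as soon as you hit a node in an unfragmented page — can be sketched roughly as below. All names and the fragmentation threshold are invented for illustration; this is not the actual NSD code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page metadata; these names are illustrative, not NSD's. */
struct page_meta {
    unsigned used;  /* live nodes in this page */
    unsigned freed; /* nodes freed since the last compaction */
};

/* A page is worth compacting when enough of it is garbage
 * (here, arbitrarily, more than 25%). */
static bool page_fragmented(const struct page_meta *pg) {
    return pg->freed * 4 > pg->used + pg->freed;
}

struct node {
    struct page_meta *page; /* the page this node lives in */
    struct node *child[2];  /* toy binary fanout, for the sketch */
};

/* Top-down walk: evacuate nodes that sit in fragmented pages; when we
 * reach a node in an unfragmented page, skip it and its whole subtree
 * so we do not create new garbage there. Returns nodes evacuated. */
static size_t compact_walk(struct node *n) {
    if (n == NULL || !page_fragmented(n->page))
        return 0; /* clean subtree: leave it alone */
    size_t moved = 1; /* pretend we copied this node out */
    moved += compact_walk(n->child[0]);
    moved += compact_walk(n->child[1]);
    return moved;
}
```

The point of the skip is exactly the failure mode of fanf-cow: touching nodes in clean pages would itself fragment them.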

other notes

There's a basic randomized tester in tpkg/cutest/cutest_qp.c. It can be built in a mode that uses mprotect() to enforce the copy-on-write read-only / read-write split.

parallel loading in NSD

I mentioned to Willem Toorop that I think it would not be too hard to merge two qp-trie structures, to support parallel zone loading. (NSD has a single unified tree containing all its zones.) And the result should be laid out very nicely in memory for good locality. I think it would make sense to merge pairwise, like a mergesort, parallelizable O(n log n) instead of sequential O(n*n). The merge can be stackless, by adding a third kind of node that is a placeholder pointing at the two subtrees that still need to be merged. Merging multiple trees in one shot could be O(n) but more complicated.
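The placeholder idea can be illustrated on a toy binary trie: a third node kind, MERGE, records two subtrees still to be merged, and each step resolves one placeholder, possibly pushing new placeholders one level down — no explicit stack needed. This is a sketch with invented names, handling only the branch+branch case; a real merge also has to deal with leaves and mismatched bitmaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum kind { LEAF, BRANCH, MERGE };

struct node {
    enum kind kind;
    struct node *a, *b; /* BRANCH children, or MERGE's pending pair */
    int key;            /* LEAF payload */
};

static struct node *mk(enum kind k, struct node *a, struct node *b, int key) {
    struct node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->kind = k; n->a = a; n->b = b; n->key = key;
    return n;
}

/* Resolve one MERGE placeholder in place: turn it into a BRANCH whose
 * children are new placeholders for the corresponding halves of the
 * two pending subtrees. (Sketch: assumes both are BRANCHes.) */
static void merge_step(struct node *m) {
    struct node *x = m->a, *y = m->b;
    m->kind = BRANCH;
    m->a = mk(MERGE, x->a, y->a, 0); /* defer the left halves */
    m->b = mk(MERGE, x->b, y->b, 0); /* defer the right halves */
}
```

A driver would keep resolving placeholders (in any order, which is what makes the merge stackless and parallelizable) until none remain.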

Or, if we know in advance that the parallel loader has split the zone into chunks that have disjoint names, it might be simpler to treat each chunk like an update transaction - but that might not be able to make the most of the available parallelism.

extensions for BIND

There are a couple of improvements that BIND will need in the qp-trie implementation:

  • Unlike NSD, which has a unified lookup tree, BIND has a tree per zone, so in setups with many small zones, each zone's tree needs to be as small as possible. A likely solution is that if the tree occupies only one page, realloc() it to minimum size.

  • BIND supports multiple read-only past versions of a zone database (to allow zone transfers to continue during updates), so the post-update compaction needs a bit more sophistication.
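The single-page idea from the first point could look roughly like this — after an update, if a small zone's tree fits in one page, shrink its arena to exactly the bytes in use. Field names and the page size are invented for illustration, not BIND's.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-tree arena for a small zone. */
struct arena {
    unsigned char *base;
    size_t used;     /* bytes occupied by live nodes */
    size_t capacity; /* bytes allocated */
};

#define PAGE_BYTES 4096

/* If the whole tree fits in one page, realloc() the arena down to
 * minimum size, so setups with many small zones stay cheap. */
static void arena_trim(struct arena *a) {
    if (a->used <= PAGE_BYTES && a->used < a->capacity) {
        unsigned char *p = realloc(a->base, a->used ? a->used : 1);
        if (p != NULL) { /* even a shrinking realloc may fail */
            a->base = p;
            a->capacity = a->used ? a->used : 1;
        }
    }
}
```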

discussion

Jeroen

Like Benno stated, I did a bit of work on in-memory databases too. Specifically, I did a tweaked version of the Adaptive Radix Tree (ART) and I tuned it for use within NSD. Essentially, they both try to solve the same problem but use different approaches to compressing inner nodes. I'll have to study the qp trie a bit more to grasp everything, but it's really interesting!

You may want to have a look at https://blog.nlnetlabs.nl/adapting-radix-trees/. I was planning to use read-copy-update instead of any sort of locking. To be able to do so, I cut the previous/next pointers by keeping a path instead. Not sure if your implementation can use it, but I wanted to mention it because it isn't directly tied to the inner structure and may be useful. Instead of having a leaf, a tagged pointer is used, and since every object stored is of the same type, that allows me to cut additional memory.

My version uses fixed-size nodes with 4, 16, 32, 38, 48 and 256 slots (ART uses 4, 16, 48 and 256). A new angle I have is to investigate if I can figure out a way to replace two levels of the sparse node4 type. That should save some memory and an extra indirection. Maybe I can apply some of your ideas to ART or vice versa.

Tony

I read your ART blog post with interest, and your treeperf benchmark was really helpful when I was getting my qp-trie working in NSD. (I couldn't get the nametree itself working.) Back when I started thinking about this many years ago, I started off trying to implement an ART-like structure, but I did not like having to implement 5 variants of similar code (but different in important and tricky ways) for the different node sizes. So I was pleased when I worked out how the qp-trie could use only one node layout :-)

I have done a couple of rounds of work on copy-on-write. I wasn't very satisfied with my first attempt because it involved mutating nodes that are being read by multiple threads. https://dotat.at/@/2018-05-22-beer-festival-week-hacking-notes-for-monday.html The second round in NSD is cleaner, and smaller and faster :-) And like you, I was aiming to use RCU for replacing the active tree, but (at least in BIND) I should first check if a simple read-write lock is good enough - flipping a pointer does not take long.

One of the disadvantages of the qp-trie is its nodes have no spare space, so it does a lot of realloc(). ART's quantization helps to reduce that; I think an interesting question is whether the qp-trie custom allocator and compaction is efficient enough at updates in comparison to ART. I also need to spend some time with some jemalloc memory usage statistics to see how its fragmentation metrics compare to mine. The older qp-trie code uses a lot of different allocation size classes, which is not great - again, ART is better in this respect - but the custom allocator eliminates it completely.

Jeroen

Happy to hear the post was of help. You're right, the nametree isn't fully integrated yet. NSD will have to be modified for that. Specifically, every server would need to keep a path when doing a lookup. That's still a WiP.

I forgot to mention that one reason removing the next and previous pointers is beneficial, at least for NSD, is that when we do a reload, we don't touch as many other zones and domains. Which (I figured, not benchmarked) should mean less memory is required for that short amount of time. We had some users that experienced problems on reloads (OOM killer?).

I'd be delighted to keep in touch and swap ideas!

I think I have a good enough understanding of how the qp trie works and I read through the papers mentioned in your articles. That is, without studying the allocator, because it doesn't affect the algorithm itself much(?)

One thing that I noticed about the qp trie is that you shift the nibble. So it sort-of works like path compression in ART. However, the node itself doesn't store part of the key. I think negative lookups might be impacted by that. The foo+bar+baz example you use in various places (e.g. https://fanf.livejournal.com/137283.html) shows a branch for "f and b", and then the branch for b shifts the nibble by 5. If I'd look up "boz", the tree would always need to take the optimistic path and can only verify it's not a match once it reaches the leaf, right(?)
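The behaviour described here — interior nodes hold only positions to test, never key bytes, so a lookup for an absent name still descends to some leaf and only the final comparison can reject it — can be sketched crit-bit-style (this is illustrative, not the actual qp-trie code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct tnode {
    bool is_leaf;
    /* interior: which key byte decides the direction */
    size_t index;
    unsigned char pivot; /* go right when key[index] > pivot */
    struct tnode *left, *right;
    /* leaf: the full key, needed for the final check */
    const char *key;
};

/* Optimistic descent: no key bytes are compared on the way down, so an
 * absent key like "boz" still lands on some leaf ("baz" here), and the
 * strcmp() at the end is the only place a mismatch is detected. */
static const char *lookup(const struct tnode *n, const char *key, size_t len) {
    while (!n->is_leaf) {
        unsigned char c = n->index < len ? (unsigned char)key[n->index] : 0;
        n = (c > n->pivot) ? n->right : n->left;
    }
    return strcmp(n->key, key) == 0 ? n->key : NULL;
}
```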

A second remark/question I have is that only 4 bits at a time can be compared. I'm thinking that, worst case (not sure how often a server would run into that in practice), paths can be twice as long as with ART.

With regards to the nametree, I like the popcount trick! A thought I had yesterday is that I can potentially use it to cut the node38, node48 and node256 (maybe node32 too) types and decrease the amount of memory required in those scenarios. At the time ART was conceived (2012-ish), AVX2 wasn't a thing, hence only the node48 and node256(?) However, AVX2 covers 256 bits, which fits DNS brilliantly! What if I converted a key byte into one of 256 bits and applied SIMD operations to do the popcount? That should at least shrink the header for every node type greater than 32. Based on how much one cares about the speed of insert operations, I could allocate the vector holding the pointers in smaller chunks to cut memory cost.
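For reference, the popcount indexing discussed here — bit b set in a 256-bit bitmap means "child for byte value b exists", and the child's slot in the packed pointer array is the count of set bits below b — can be sketched portably with four 64-bit words (an AVX2 version would vectorize this; names are illustrative):

```c
#include <stdint.h>

typedef struct { uint64_t w[4]; } bitmap256;

static void bm_set(bitmap256 *bm, uint8_t b) {
    bm->w[b >> 6] |= (uint64_t)1 << (b & 63);
}

static int bm_test(const bitmap256 *bm, uint8_t b) {
    return (bm->w[b >> 6] >> (b & 63)) & 1;
}

/* Index of byte b's child in the packed pointer array: the number of
 * set bits strictly below bit b. */
static unsigned bm_rank(const bitmap256 *bm, uint8_t b) {
    unsigned r = 0;
    for (int i = 0; i < b >> 6; i++)
        r += (unsigned)__builtin_popcountll(bm->w[i]);
    uint64_t below = ((uint64_t)1 << (b & 63)) - 1;
    r += (unsigned)__builtin_popcountll(bm->w[b >> 6] & below);
    return r;
}
```

With this layout a node stores only the bitmap plus a packed array of its actual children, so sparse and dense nodes share one representation.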

I think it'd be interesting to investigate a reimplementation of the nametree with node4, node16, maybe a node32, and then a popcount node. Then, based on how much one cares about insert speed, implement a node4by(2|3|4) to clear more levels in one go. That could shave off some loads, thus resulting in better performance(?)

Tony

> Specifically, every server would need to keep a path when doing a lookup. That's still a WiP.

Ah, that makes sense. I think I might need to re-use that idea in BIND for the same reasons that it is useful in NSD :-) I think Knot-DNS only uses a path for iterating over elements in the tree; I can't remember right now how it finds previous and next leaf nodes in a normal lookup.

> I think I have a good enough understanding of how the qp trie works and I read through the papers mentioned in your articles. That is, without studying the allocator, because it doesn't affect the algorithm itself much(?)

Yes, the allocator does not affect the algorithms; it just allows a smaller node layout, and supports copy-on-write and garbage collection. COW does affect the update algorithms a little bit: I was surprised when it turned out to be easier than I expected.

> One thing that I noticed about the qp trie is that you shift the nibble. So it sort-of works like path compression in ART. However, the node itself doesn't store part of the key. I think negative lookups might be impacted by that.

Correct, negative lookups require a second pass to find the predecessor of the qname for the covering NSEC. The second pass is usually cheap because it's mostly in cache, so it isn't as bad as it sounds. The benchmark numbers from the README show that "typo" names (similar to existing names) are slower than queries for known names (but still faster than NSD's radix tree), and "nxdomain" queries (for completely random names) are a little bit slower than the radix tree.

> A second remark/question I have is that only 4 bits at a time can be compared.

Yes, in the first qp-trie version, and in the version used in Knot-DNS. My fork of NSD uses a newer variant that I sometimes call a DNS-trie because it is specific to domain names. It's based on the observation that it is best to rearrange DNS names into lexicographic order so they can be used as lookup keys; we can change how domain names are prepared for lookup, and how tree nodes are laid out, so that they work really nicely together. In a DNS-trie, domain names traverse one node or less per wire-format byte. And the bitmap size (and node size) is nice and small: 26 case-insensitive letters, 10 digits, a few punctuation characters, and a few escape characters. Fewer than 50, so no need to support 256-wide nodes :-)
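A byte-to-bit mapping of that shape could be sketched like this. The exact bit assignments below are invented for illustration (the real table in my NSD fork differs); the point is that hostname bytes get their own small bit positions, case-folded, and everything else shares an escape position followed by extra nibbles.

```c
#include <stdint.h>

/* Illustrative bit positions, not NSD's actual table. */
enum {
    BIT_ESCAPE = 1,      /* rare bytes: escape, then further nibbles */
    BIT_HYPHEN = 2,
    BIT_DIGIT0 = 3,      /* '0'..'9' -> 3..12  */
    BIT_UNDERSCORE = 13,
    BIT_LETTER_A = 14    /* 'a'..'z' -> 14..39 */
};

/* Map one wire-format byte to its bit position in the branch bitmap;
 * upper and lower case letters map to the same bit, so lookups are
 * case-insensitive for free. Top position is 39, well under 50. */
static uint8_t byte_to_bit(uint8_t b) {
    if (b >= 'a' && b <= 'z') return BIT_LETTER_A + (b - 'a');
    if (b >= 'A' && b <= 'Z') return BIT_LETTER_A + (b - 'A');
    if (b >= '0' && b <= '9') return BIT_DIGIT0 + (b - '0');
    if (b == '-') return BIT_HYPHEN;
    if (b == '_') return BIT_UNDERSCORE;
    return BIT_ESCAPE; /* escaped bytes need extra key bytes (not shown) */
}
```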