See also some more recent DB notes
Back to agenda: https://gitlab.isc.org/isc-projects/bind9/-/wikis/BIND-9.19-Plan
Attendees: Ondrej, Michał K., Michal N., Petr, Aram, Artem, Evan, Matthijs, Mark, Vicky.
Goals
- Modernize design with better data structures (not those from the 90s).
- Reduce contention between threads; hopefully we can remove a lot of locking (possibly dependent on the dispatch uplift).
- Better management of aging cache data (cache cleaning)
- Better maintainability: make it easier for people to make changes in the RBTDB.
- Reduce memory usage (we are wasting a lot of memory now)
- Reduce the impact of applying zone transfers (IXFR/AXFR), maybe provide more control over them?
Refactor the RBTDB or replace it? We could write a new database backend for the lib/dns/db API.
How can we manage/plan the db replacement for 2022 better than we did the network refactoring? (see https://gitlab.isc.org/isc-projects/bind9/-/wikis/BIND-9.19-Planning:-Netmgr-evaluation)
Do we have any reason to think our existing db is a performance bottleneck?
- Michal K: ISTR some tests were performed which indicated that RBTDB is not a performance bottleneck, but I cannot find that write-up :(
- Evan: I remember it too - both jinmei and mukund benchmarked it and found that the RBTDB could handle 10x or more the transactions per second that BIND as a whole could, so other things were higher priority. But I don't have the data.
Notes
ADB
Ondrej - Look at other vendors (IP address, qname, and time as the key, instead of just the server name).
Evan - Code triggers a resolver fetch, which creates an ADB lookup, etc. We can do better.
RBTDB Thoughts
- Improve test coverage on db (ref: "W/w" bug)
- Separate databases for zone data vs. cache data would limit risk (so consider introducing a new db for just cache data first, and retain the current db for zone data).
- Artem - if we adopt an existing solution, we should consider the tooling around it.
- goal: reduce the impact of applying zone transfers (IXFR/AXFR), maybe provide more control over them?
- RBTDB is complicated INSIDE, but it has a well-defined external interface, so damage should be easier to contain.
- Assorted ideas: maybe we should store RRSIGs together with their "source" rdatasets. The loops that walk through all RRsets on a node searching for an RRSIG seem like an unnecessary complication to me (pspacek).
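A minimal sketch of the pattern pspacek describes, using hypothetical types rather than the real RBTDB structures: today, finding the RRSIG covering a given type means walking every rdataset header on the node, whereas pairing each rdataset with its signature would reduce that to one pointer dereference.

```c
#include <stddef.h>
#include <stdint.h>

#define TYPE_RRSIG 46 /* RR type number for RRSIG */

/* Hypothetical node layout, for illustration only. */
struct rdataset_hdr {
	uint16_t type;             /* RR type of this slab */
	uint16_t covers;           /* for RRSIG: the type it covers */
	struct rdataset_hdr *next; /* next rdataset on the same node */
	struct rdataset_hdr *sig;  /* idea: direct link to the covering RRSIG */
};

/* Current pattern: every lookup walks all rdatasets on the node. */
static struct rdataset_hdr *
find_covering_rrsig(struct rdataset_hdr *head, uint16_t qtype) {
	for (struct rdataset_hdr *h = head; h != NULL; h = h->next) {
		if (h->type == TYPE_RRSIG && h->covers == qtype) {
			return h;
		}
	}
	return NULL;
}

/* With paired storage the loop disappears: the covering RRSIG is
 * simply rdataset->sig. */
```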
The scope of this effort is to refactor the implementations behind lib/dns/db.c. It is not clear whether we can keep the lib/dns/db.c API.
The RBTDB is actually not a real RBT
Ondrej - We are designing a database that is a key-value store. It is likely a tree. It needs to support transactions, and it needs to support the DNS quirks.
Mark - the reason we are using the RBTDB (rather than a hash table) is that we need denial-of-existence lookups.
Matthijs - A tree makes sense for DNS. We want to have a traversable structure.
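A toy illustration of Mark's point, with the "tree" simplified to a sorted array: denial of existence (NSEC) needs a predecessor lookup, which a hash table cannot answer.

```c
#include <string.h>

/* Real code would use canonical DNS name ordering, not strcmp(). */
static int
find_predecessor(const char *names[], int n, const char *qname) {
	int lo = 0, hi = n - 1, best = -1;
	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;
		if (strcmp(names[mid], qname) < 0) {
			best = mid;   /* candidate predecessor */
			lo = mid + 1;
		} else {
			hi = mid - 1;
		}
	}
	return best; /* index of the covering NSEC owner name, or -1 */
}
```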
Mark - the BIND RBT is not actually a red-black tree, and the design of its locking was based on the old task manager.
Evan - every node is identified via a hash table and associated with a bucket.
Mark - there is an array of locks, each lock shared by multiple nodes; this was originally written when obtaining a lock was hugely expensive. Do we now want to put a lock on each node?
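A rough sketch of the lock-array (lock-striping) scheme Mark describes; all names and the pool size are made up:

```c
#include <pthread.h>
#include <stdint.h>

#define NLOCKS 64 /* pool size; a tuning knob, not a real BIND constant */

static pthread_mutex_t lockpool[NLOCKS];

static void
lockpool_init(void) {
	for (int i = 0; i < NLOCKS; i++) {
		pthread_mutex_init(&lockpool[i], NULL);
	}
}

/* Map a node to its (shared) lock by hash: many nodes per mutex,
 * trading contention for memory. */
static pthread_mutex_t *
node_lock(uint32_t node_hash) {
	return &lockpool[node_hash % NLOCKS];
}
```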
Evan - I suspect there are some things that are not adequately locked right now, because at the time we assumed all tasks were serialized.
Mark - inside the RBTDB there are also heaps, which are used to work out what a node needs next: they are used for LRU, and to quickly find out which RRset needs signing next.
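A minimal sketch (hypothetical fields) of the heap use Mark mentions: a min-heap ordered by resign time makes "which RRset needs signing next?" an O(1) peek at the root.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slab header, for illustration only. */
struct slab_hdr {
	uint32_t resign_time; /* when this RRset's RRSIG must be refreshed */
};

/* Min-heap comparator: the heap root is always the RRset that needs
 * signing next.  The same pattern, keyed by last-use time, can drive
 * cache LRU. */
static bool
resign_sooner(const struct slab_hdr *a, const struct slab_hdr *b) {
	return a->resign_time < b->resign_time;
}
```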
Mark - the cache could be an rbtree of hashes, with an auxiliary database for NSEC3.
Cache cleaning doesn't work
Petr - Main complaint: cache cleaning doesn't work very well.
Ondrej - The current implementation cannot clean empty non-leaf RBT nodes.
Ondrej - There are two kinds of cleaning: cleaning of the RBT structure itself and cleaning of the data on RBT nodes.
Matthijs - The new structure should also be optimized for updates (additions, deletions).
Mark - The RBTDB is designed to deal with that.
Versions
Ondrej - Versions. They don't work exactly like transactions.
Mark - One version for writing, multiple versions open for reading.
Mark - If a commit is reversed, every data entry is marked as invalid.
Mark - There is a linked list of every rdata slab in a commit, and if that is reversed, every slab is set to invalid.
Ondrej - Performance issues due to versions.
Mark - Revisions to the db: you write data into it, but it is not visible until you commit it. You can have multiple versions open for reading, but only one version can be open for writing at a time. If a commit is reversed, every data entry in that commit is marked as invalid: there is a linked list of every rdata slab in a commit, and if the commit is reversed, the header on every one of those slabs is changed so that the slab is no longer valid.
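A simplified sketch of the rollback scheme Mark describes, with made-up types: each version chains the slabs it touched, and reversing the commit flips each slab header to invalid.

```c
#include <stdbool.h>
#include <stddef.h>

struct slab {
	bool valid;                /* readers skip slabs with valid == false */
	struct slab *next_changed; /* next slab touched by the same version */
};

struct version {
	struct slab *changed; /* head of this version's changed-slab list */
};

/* Reversing a commit: walk the changed list and invalidate each slab's
 * header; nothing is freed immediately, readers simply skip the slabs. */
static void
version_rollback(struct version *ver) {
	for (struct slab *s = ver->changed; s != NULL; s = s->next_changed) {
		s->valid = false;
	}
	ver->changed = NULL;
}
```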
Ondrej - CIRA reported some performance issues related to RBTDB versioning.
What do we optimize for: searches or edits?
Matthijs - Lookups should be cheap, but especially for cache we should also have fast inserts and removals.
Matthijs - Trees usually don't do a good job of insertions and deletions.
DB API
Keep lib/dns/db.c API if possible?
Petr - The existing API is large and has many calls that make no sense for both the cache and zone DBs. Perhaps split the API.
Evan - The API is flexible, allowing for different features.
Evan - I don't see how splitting the API is useful; both are generic databases that use the same lookup mechanisms.
Matthijs - Splitting the API makes it more intuitive what each backend can do. There is little overlap in functionality except for find/add/delete RRset (and a few more).
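A hypothetical illustration of the split Matthijs suggests; none of these type names exist in BIND (the real vtable is dns_dbmethods_t in lib/dns/db.h), this is only a shape for discussion:

```c
typedef unsigned int result_t; /* stand-in for isc_result_t */

/* Shared core: the find/add/delete overlap both backends need. */
typedef struct {
	result_t (*findrdataset)(void *db /* , ... */);
	result_t (*addrdataset)(void *db /* , ... */);
	result_t (*deleterdataset)(void *db /* , ... */);
} db_common_methods_t;

typedef struct {
	db_common_methods_t common;
	result_t (*newversion)(void *db);   /* zone-only: transactions */
	result_t (*previousname)(void *db); /* zone-only: DNSSEC ordering */
} zonedb_methods_t;

typedef struct {
	db_common_methods_t common;
	result_t (*expire)(void *db);    /* cache-only: LRU cleaning */
	result_t (*markstale)(void *db); /* cache-only: serve-stale */
} cachedb_methods_t;
```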
Serve Stale
Mark - Can we just remove it?
Passing thought: what if stale data were moved into a separate tree entirely, rather than kept around but marked "stale"? This could work; what people want, I think, is a semi-permanent list of A records for the 'top' names. It would not even necessarily need to be a tree; it could be a hash table.
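A rough sketch of that passing thought, with all names and sizes hypothetical: a small separate hash table keyed by owner name, holding the stale answers for 'top' names.

```c
#include <stdint.h>
#include <string.h>

#define STALE_BUCKETS 1024 /* arbitrary, for the sketch */

struct stale_entry {
	char name[256];           /* owner name, presentation form */
	unsigned char rdata[512]; /* the stale A/AAAA rdataset, opaque here */
	size_t rdata_len;
	struct stale_entry *next; /* chaining within a bucket */
};

static struct stale_entry *stale_table[STALE_BUCKETS];

static uint32_t
name_hash(const char *name) {
	uint32_t h = 5381; /* djb2, just for the sketch */
	for (; *name != '\0'; name++) {
		h = h * 33 + (unsigned char)*name;
	}
	return h % STALE_BUCKETS;
}

static struct stale_entry *
stale_lookup(const char *name) {
	for (struct stale_entry *e = stale_table[name_hash(name)];
	     e != NULL; e = e->next) {
		if (strcmp(e->name, name) == 0) {
			return e;
		}
	}
	return NULL;
}
```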
Lessons learned from the netmgr refactor
We should work on this project with 2+ people.
Brainstorm ideas to reduce risk, including test strategies, milestones/checkpoints, maintain a log of 'what has been tried and rejected'.
Write more tests. Test the existing API and make sure the new database does the same thing.
- Testing correctness.
- Testing concurrency (this is going to be a beast).
- But we can live with a performance problem (although the expectation is that this new database will improve performance).
- What are the requirements? This drives what needs to be tested.
Simplify existing code:
- This has already been partly done (RPZ and other things have been ripped out).
- Split the rbtdb into a zone db and a cache db (Evan was already hacking on this during the meeting).
- Other than that there is not a lot of evil.
Write down the quirks first.
Main problems
Design problems:
- Applying big diffs (thousands of changed records) performs badly.
- Locking is suboptimal; granularity and serialization need to change after the dispatch work.
- RRsets and RRSIGs are stored separately, causing suboptimal RRSIG lookups.
- If we had two separate implementations we could optimize each differently: the cache db is constantly updated while being read, while the zone db is updated with larger, more episodic updates.
Known implementation problems:
- RBT rebalancing code on data removal is missing.
- The code is a hairball.
- LRU cleaning does not work very well.
- The overmem behavior is part of the memory allocator, which makes things slower even outside of the RBTDB.
Implementation pros: it is apparently not a performance bottleneck (except for big zone transfers ...)
Requirements
- ECS - maybe in more generic terms? Arbitrary binary keys instead of IP addresses?
- DNSSEC
- online signing - is this a job for the DNS db? No: can we do the same thing Cloudflare does, generating minimal enclosers and signing on the fly?
- Find next name, find previous name.
- Find parent name.
- Find closest encloser, source of synthesis (see the sketch after this list)
- Wildcards
- NSEC, NSEC3, ? NSEC5
- Bottom of zone cuts
- DNAME
- Glue
- glue cache made a big improvement in root zone performance when it was added
- hidden records (occluded data)
- zone metadata: store the original case as it is loaded from the zone and try to preserve it.
- views
- aggressive NSEC caching (do we even know if this is useful in practice or is it just something that sounds useful?)
- NSEC3PARAM auxiliary DB
- versioning - required for auth, rollback for failed zone transfers etc.
- is one new version adequate?
- cache cleaning - something which cleans cruft but at the same time does not cause excessive locking
- TODO literature research?
- Vicky - I wonder if we could learn anything from the Quad9 resolver; they might let us do some experiments (except I just realized they do zero caching)
- rdataset methods to be investigated
- trust levels
- prefetch
- owner case
- Performance:
- Optimized for lookups
- Scalable for insertions and removals
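As referenced in the find-closest-encloser item above, a tiny illustration of "source of synthesis" (RFC 4592), using presentation-format names instead of wire format for brevity: once the closest encloser of a qname is known, the wildcard candidate is "*." prepended to it.

```c
#include <stdio.h>

static void
source_of_synthesis(const char *closest_encloser, char *out, size_t len) {
	/* The wildcard candidate is "*." + closest encloser. */
	snprintf(out, len, "*.%s", closest_encloser);
}

int
main(void) {
	char wild[256];
	/* e.g. qname "a.b.example." whose closest encloser is "example." */
	source_of_synthesis("example.", wild, sizeof(wild));
	printf("%s\n", wild); /* prints "*.example." */
	return 0;
}
```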
A possible project outline - for discussion
- refactor current db interface to simplify it
- goal: simplifying the current implementation
- minimal auth, without dnssec, read-only
- goal: scales well for parallel access - can in practice serve lots of queries on many cores
- includes: bottom of zone cuts, DNAME, glue, occluded data, case preservation, wildcards
- auth read/write, with dynamic updates - still scales well for parallel reads & writes
- goal: scales well even after adding transactions, when queries and updates are done in parallel
- includes: transactions
- static dnssec - zones generated using dnssec-signzone
- goal: maintenance of auxiliary structures does not drag down performance
- includes: find next name, find previous name, find closest encloser, source of synthesis, NSEC, NSEC3, aggressive NSEC caching, NSEC3PARAM auxiliary DB
- does not yet require some sort of LRU for RRSIG maintenance
- dnssec with incremental resigning
- requires some sort of efficient LRU to order RRSIGs, which copes well with dynamic updates
- cache
- includes: find parent name, trust levels, prefetch, LRU with serve-stale support
- cache with ECS
- consider whether we can make "ECS" scopes more generic than IP addresses - it would enable new use cases like GeoIP on a per-RR basis, etc. (see the sketch below)
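A sketch of what generic scopes could look like (all types hypothetical): key entries by an arbitrary (bits, bitlen) pair and test coverage by bitwise prefix match, so an IPv4 ECS scope is just one instantiation and a GeoIP tag could be another.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct scope_key {
	uint8_t bits[16];    /* opaque key material (e.g. an address) */
	unsigned int bitlen; /* number of significant leading bits */
};

/* Does `entry` cover `query`?  True when entry's key is a bitwise
 * prefix of the query's key. */
static bool
scope_covers(const struct scope_key *entry, const struct scope_key *query) {
	if (entry->bitlen > query->bitlen) {
		return false;
	}
	unsigned int full = entry->bitlen / 8, rem = entry->bitlen % 8;
	if (memcmp(entry->bits, query->bits, full) != 0) {
		return false;
	}
	if (rem == 0) {
		return true;
	}
	uint8_t mask = (uint8_t)(0xff << (8 - rem)); /* leading-bits mask */
	return (entry->bits[full] & mask) == (query->bits[full] & mask);
}
```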