See also some more recent DB notes
Back to agenda: https://gitlab.isc.org/isc-projects/bind9/-/wikis/BIND-9.19-Plan
Attendees: Ondrej, Michał K., Michal N., Petr, Aram, Artem, Evan, Matthijs, Mark, Vicky.
Goals
- Modernize design with better data structures (not those from the 90s).
- Reduce contention between threads; hopefully we can remove a lot of locking (possibly dependent on the dispatch uplift).
- Better management of aging cache data (cache cleaning)
- Better maintainability: make it easier for people to make changes in the RBTDB.
- Reduce memory usage (we are wasting a lot of memory now)
- Reduce the impact of applying zone transfers (IXFR/AXFR), maybe provide more control over them?
Refactor the RBTDB or replace it? We could write a new database backend for the lib/dns/db API.
How can we manage/plan the db replacement for 2022 better than we did the network refactoring? (see https://gitlab.isc.org/isc-projects/bind9/-/wikis/BIND-9.19-Planning:-Netmgr-evaluation)
Do we have any reason to think our existing db is a performance bottleneck?
- Michal K: ISTR some tests were performed which indicated that RBTDB is not a performance bottleneck, but I cannot find that write-up :(
- Evan: I remember it too - both jinmei and mukund benchmarked it and found that the RBTDB could handle 10x or more the transactions per second that BIND as a whole could, so other things were higher priority. But I don't have the data.
Notes
ADB
Ondrej - Look at other vendors (IP address, qname, and time as the key, instead of just the server name).
Evan - Code triggers a resolver fetch, which creates an ADB lookup, etc. We can do better.
RBTDB Thoughts
- Improve test coverage on db (ref: "W/w" bug)
- Separate databases for zone data vs. cache data would limit risk (so consider introducing a new db for just cache data first, and retain the current db for zone data).
- Artem - if we adopt an existing solution, we should consider the tooling around it.
- goal: reduce the impact of applying zone transfers (IXFR/AXFR), maybe provide more control over them?
- RBTDB is complicated INSIDE, but it has a well-defined external interface, so damage should be easier to contain.
- Assorted ideas: maybe we should store RRSIGs together with their "source" rdatasets. The loops that walk through all RRsets on a node searching for an RRSIG seem like an unnecessary complication to me (pspacek).
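A minimal sketch of the pattern pspacek describes, using hypothetical types rather than the real RBTDB structures: today, finding the RRSIG covering a given type means walking every rdataset header on the node, whereas pairing each rdataset with its signature would reduce that to one pointer dereference.

```c
#include <stddef.h>
#include <stdint.h>

#define TYPE_RRSIG 46 /* RR type number for RRSIG */

/* Hypothetical node layout, for illustration only. */
struct rdataset_hdr {
	uint16_t type;             /* RR type of this slab */
	uint16_t covers;           /* for RRSIG: the type it covers */
	struct rdataset_hdr *next; /* next rdataset on the same node */
	struct rdataset_hdr *sig;  /* idea: direct link to the covering RRSIG */
};

/* Current pattern: every lookup walks all rdatasets on the node. */
static struct rdataset_hdr *
find_covering_rrsig(struct rdataset_hdr *head, uint16_t qtype) {
	for (struct rdataset_hdr *h = head; h != NULL; h = h->next) {
		if (h->type == TYPE_RRSIG && h->covers == qtype) {
			return h;
		}
	}
	return NULL;
}

/* With paired storage the loop disappears: the covering RRSIG is
 * simply rdataset->sig. */
```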
The scope of this effort is to refactor the implementations behind lib/dns/db.c. It is not clear whether we can keep the lib/dns/db.c API.
The RBTDB is actually not a real RBT
Ondrej - We are designing a database that is a key-value store. It is likely a tree. It needs to support transactions, and it needs to support the DNS quirks.
Mark - the reason we are using the RBTDB (rather than a hash table) is that we need denial-of-existence lookups.
Matthijs - A tree makes sense for DNS. We want to have a traversable structure.
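A toy illustration of Mark's point, with the "tree" simplified to a sorted array: denial of existence (NSEC) needs a predecessor lookup, which a hash table cannot answer.

```c
#include <string.h>

/* Real code would use canonical DNS name ordering, not strcmp(). */
static int
find_predecessor(const char *names[], int n, const char *qname) {
	int lo = 0, hi = n - 1, best = -1;
	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;
		if (strcmp(names[mid], qname) < 0) {
			best = mid;   /* candidate predecessor */
			lo = mid + 1;
		} else {
			hi = mid - 1;
		}
	}
	return best; /* index of the covering NSEC owner name, or -1 */
}
```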
Mark - the BIND RBT is not actually a red-black tree, and the design of its locking was based on the old task manager.
Evan - every node is identified via a hash table and associated with a bucket.
Mark - there is an array of locks, each lock shared by multiple nodes; this was originally written when obtaining a lock was hugely expensive. Do we now want to put a lock on each node?
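A rough sketch of the lock-array (lock-striping) scheme Mark describes; all names and the pool size are made up:

```c
#include <pthread.h>
#include <stdint.h>

#define NLOCKS 64 /* pool size; a tuning knob, not a real BIND constant */

static pthread_mutex_t lockpool[NLOCKS];

static void
lockpool_init(void) {
	for (int i = 0; i < NLOCKS; i++) {
		pthread_mutex_init(&lockpool[i], NULL);
	}
}

/* Map a node to its (shared) lock by hash: many nodes per mutex,
 * trading contention for memory. */
static pthread_mutex_t *
node_lock(uint32_t node_hash) {
	return &lockpool[node_hash % NLOCKS];
}
```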
Evan - I suspect there are some things that are not adequately locked right now, because at the time we assumed all tasks were serialized.
Mark - inside the RBTDB there are also heaps, which are used to work out what a node needs next: they are used for LRU, and to quickly find out which RRset needs signing next.
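A minimal sketch (hypothetical fields) of the heap use Mark mentions: a min-heap ordered by resign time makes "which RRset needs signing next?" an O(1) peek at the root.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slab header, for illustration only. */
struct slab_hdr {
	uint32_t resign_time; /* when this RRset's RRSIG must be refreshed */
};

/* Min-heap comparator: the heap root is always the RRset that needs
 * signing next.  The same pattern, keyed by last-use time, can drive
 * cache LRU. */
static bool
resign_sooner(const struct slab_hdr *a, const struct slab_hdr *b) {
	return a->resign_time < b->resign_time;
}
```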
Mark - the cache could be an rbtree of hashes, with an auxiliary database for NSEC3.
Cache cleaning doesn't work
Petr - Main complaint: cache cleaning doesn't work very well.
Ondrej - The current implementation cannot clean empty non-leaf RBT nodes.
Ondrej - There are two kinds of cleaning: cleaning of the RBT structure itself and cleaning of the data on RBT nodes.
Matthijs - The new structure should also be optimized for updates (additions, deletions).
Mark - The RBTDB is designed to deal with that.
Versions
Ondrej - Versions. They don't work exactly like transactions.
Mark - One version for writing, multiple versions open for reading.
Mark - If a commit is reversed, every data entry is marked as invalid.
Mark - There is a linked list of every rdata slab in a commit, and if that is reversed, every slab is set to invalid.
Ondrej - Performance issues due to versions.
Mark - Revisions to the db: you write data into it, but it is not visible until you commit it. You can have multiple versions open for reading, but only one version can be open for writing at a time. If a commit is reversed, every data entry in that commit is marked as invalid: there is a linked list of every rdata slab in a commit, and if the commit is reversed, the header on every one of those slabs is changed so that the slab is no longer valid.
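A simplified sketch of the rollback scheme Mark describes, with made-up types: each version chains the slabs it touched, and reversing the commit flips each slab header to invalid.

```c
#include <stdbool.h>
#include <stddef.h>

struct slab {
	bool valid;                /* readers skip slabs with valid == false */
	struct slab *next_changed; /* next slab touched by the same version */
};

struct version {
	struct slab *changed; /* head of this version's changed-slab list */
};

/* Reversing a commit: walk the changed list and invalidate each slab's
 * header; nothing is freed immediately, readers simply skip the slabs. */
static void
version_rollback(struct version *ver) {
	for (struct slab *s = ver->changed; s != NULL; s = s->next_changed) {
		s->valid = false;
	}
	ver->changed = NULL;
}
```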
Ondrej - CIRA reported some performance issues related to RBTDB versioning.
What do we optimize for: searches or edits?
Matthijs - Lookups should be cheap, but especially for cache we should also have fast inserts and removals.
Matthijs - Trees usually don't do a good job of insertions and deletions.
DB API
Keep lib/dns/db.c API if possible?
Petr - The existing API is large and has many calls that make no sense for both the cache and zone DBs. Perhaps split the API.
Evan - The API is flexible, allowing for different features.
Evan - I don't see how splitting the API is useful; both are generic databases that use the same lookup mechanisms.
Matthijs - Splitting the API makes it more intuitive what each backend can do. There is little overlap in functionality except for find/add/delete RRset (and a few more).
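A hypothetical illustration of the split Matthijs suggests; none of these type names exist in BIND (the real vtable is dns_dbmethods_t in lib/dns/db.h), this is only a shape for discussion:

```c
typedef unsigned int result_t; /* stand-in for isc_result_t */

/* Shared core: the find/add/delete overlap both backends need. */
typedef struct {
	result_t (*findrdataset)(void *db /* , ... */);
	result_t (*addrdataset)(void *db /* , ... */);
	result_t (*deleterdataset)(void *db /* , ... */);
} db_common_methods_t;

typedef struct {
	db_common_methods_t common;
	result_t (*newversion)(void *db);   /* zone-only: transactions */
	result_t (*previousname)(void *db); /* zone-only: DNSSEC ordering */
} zonedb_methods_t;

typedef struct {
	db_common_methods_t common;
	result_t (*expire)(void *db);    /* cache-only: LRU cleaning */
	result_t (*markstale)(void *db); /* cache-only: serve-stale */
} cachedb_methods_t;
```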
Serve Stale
Mark - Can we just remove it?
Passing thought: what if stale data were moved into a separate tree entirely, rather than kept around but marked "stale"? This could work; what people want, I think, is a semi-permanent list of A records for the 'top' names. It would not even necessarily need to be a tree; it could be a hash table.
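A rough sketch of that passing thought, with all names and sizes hypothetical: a small separate hash table keyed by owner name, holding the stale answers for 'top' names.

```c
#include <stdint.h>
#include <string.h>

#define STALE_BUCKETS 1024 /* arbitrary, for the sketch */

struct stale_entry {
	char name[256];           /* owner name, presentation form */
	unsigned char rdata[512]; /* the stale A/AAAA rdataset, opaque here */
	size_t rdata_len;
	struct stale_entry *next; /* chaining within a bucket */
};

static struct stale_entry *stale_table[STALE_BUCKETS];

static uint32_t
name_hash(const char *name) {
	uint32_t h = 5381; /* djb2, just for the sketch */
	for (; *name != '\0'; name++) {
		h = h * 33 + (unsigned char)*name;
	}
	return h % STALE_BUCKETS;
}

static struct stale_entry *
stale_lookup(const char *name) {
	for (struct stale_entry *e = stale_table[name_hash(name)];
	     e != NULL; e = e->next) {
		if (strcmp(e->name, name) == 0) {
			return e;
		}
	}
	return NULL;
}
```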
Lessons learned from the netmgr refactor
We should work on this project with 2+ people.
Brainstorm ideas to reduce risk, including test strategies, milestones/checkpoints, maintain a log of 'what has been tried and rejected'.
Write more tests. Test the existing API and make sure the new database does the same thing.
- Testing correctness.
- Testing concurrency (this is going to be a beast).
- But we can live with a performance problem (although the expectation is that this new database will improve performance).
- What are the requirements? This drives what needs to be tested.
Simplify existing code:
- This has already been partly done (RPZ and other things have been ripped out).
- Split the rbtdb into a zone db and a cache db (Evan was already hacking on this during the meeting).
- Other than that there is not a lot of evil.
Write down the quirks first.
Main problems
Design problems:
- Applying big diffs (thousands of changed records) performs badly.
- Locking is suboptimal; granularity and serialization need to change after the dispatch work.
- RRsets and RRSIGs are stored separately, causing suboptimal RRSIG lookups.
- If we had two separate implementations we could optimize each differently: the cache db is constantly updated while being read, while the zone db is updated with larger, more episodic updates.
Known implementation problems:
- RBT rebalancing code on data removal is missing.
- The code is a hairball.
- LRU cleaning does not work very well.
- The overmem behavior is part of the memory allocator, which makes things slower even outside of the RBTDB.
Implementation pros: it is apparently not a performance bottleneck (except for big zone transfers ...)
Requirements
- ECS - maybe in more generic terms? Arbitrary binary keys instead of IP addresses?
- DNSSEC
- online signing - is this a job for the DNS db? No: can we do the same thing Cloudflare does, generating minimal enclosers and signing on the fly?
- Find next name, find previous name.
- Find parent name.
- Find closest encloser, source of synthesis (see the sketch after this list)
- Wildcards
- NSEC, NSEC3, ? NSEC5
- Bottom of zone cuts
- DNAME
- Glue
- glue cache made a big improvement in root zone performance when it was added
- hidden records (occluded data)
- zone metadata: store the original case as it is loaded from the zone and try to preserve it.
- views
- aggressive NSEC caching (do we even know if this is useful in practice or is it just something that sounds useful?)
- NSEC3PARAM auxiliary DB
- versioning - required for auth, rollback for failed zone transfers etc.
- is one new version adequate?
- cache cleaning - something which cleans cruft but at the same time does not cause excessive locking
- TODO literature research?
- Vicky - I wonder if we could learn anything from the Quad9 resolver; they might let us do some experiments (except I just realized they do zero caching)
- rdataset methods to be investigated
- trust levels
- prefetch
- owner case
- Performance:
- Optimized for lookups
- Scalable for insertions and removals
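As referenced in the find-closest-encloser item above, a tiny illustration of "source of synthesis" (RFC 4592), using presentation-format names instead of wire format for brevity: once the closest encloser of a qname is known, the wildcard candidate is "*." prepended to it.

```c
#include <stdio.h>

static void
source_of_synthesis(const char *closest_encloser, char *out, size_t len) {
	/* The wildcard candidate is "*." + closest encloser. */
	snprintf(out, len, "*.%s", closest_encloser);
}

int
main(void) {
	char wild[256];
	/* e.g. qname "a.b.example." whose closest encloser is "example." */
	source_of_synthesis("example.", wild, sizeof(wild));
	printf("%s\n", wild); /* prints "*.example." */
	return 0;
}
```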
A possible project outline - for discussion
- refactor current db interface to simplify it
- goal: simplifying the current implementation
- minimal auth, without dnssec, read-only
- goal: scales well for parallel access - can in practice serve lots of queries on many cores
- includes: bottom of zone cuts, DNAME, glue, occluded data, case preservation, wildcards
- auth read/write, with dynamic updates - still scales well for parallel reads & writes
- goal: scales well even after adding transactions, when queries and updates are done in parallel
- includes: transactions
- static dnssec - zones generated using dnssec-signzone
- goal: maintenance of auxiliary structures does not drag down performance
- includes: find next name, find previous name, find closest encloser, source of synthesis, NSEC, NSEC3, aggressive NSEC caching, NSEC3PARAM auxiliary DB
- does not yet require some sort of LRU for RRSIG maintenance
- dnssec with incremental resigning
- requires some sort of efficient LRU to order RRSIGs, which copes well with dynamic updates
- cache
- includes: find parent name, trust levels, prefetch, LRU with serve-stale support
- cache with ECS
- consider whether we can make "ECS" scopes more generic than IP addresses - it would enable new use cases like GeoIP on a per-RR basis, etc. (see the sketch below)
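A sketch of what generic scopes could look like (all types hypothetical): key entries by an arbitrary (bits, bitlen) pair and test coverage by bitwise prefix match, so an IPv4 ECS scope is just one instantiation and a GeoIP tag could be another.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct scope_key {
	uint8_t bits[16];    /* opaque key material (e.g. an address) */
	unsigned int bitlen; /* number of significant leading bits */
};

/* Does `entry` cover `query`?  True when entry's key is a bitwise
 * prefix of the query's key. */
static bool
scope_covers(const struct scope_key *entry, const struct scope_key *query) {
	if (entry->bitlen > query->bitlen) {
		return false;
	}
	unsigned int full = entry->bitlen / 8, rem = entry->bitlen % 8;
	if (memcmp(entry->bits, query->bits, full) != 0) {
		return false;
	}
	if (rem == 0) {
		return true;
	}
	uint8_t mask = (uint8_t)(0xff << (8 - rem)); /* leading-bits mask */
	return (entry->bits[full] & mask) == (query->bits[full] & mask);
}
```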