BIND9 Databases
Ondřej Surý edited this page 2025-10-21 08:51:52 +02:00

Database Structures in BIND 9

This is a high-level overview of the database data structures used in BIND 9.

Most of the lookups below require support for "partial match": when the exact name is not in the tree, the lookup returns its closest ancestor that is.
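A minimal sketch of what partial matching means, assuming lowercase names in text form without a trailing dot. The flat array and the `closest_ancestor` function are hypothetical stand-ins for the tree; they are not BIND 9 APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the table entry that is the closest ancestor of `name`
 * (or `name` itself), or NULL if no ancestor is present.
 * Hypothetical helper: a real implementation walks a trie instead
 * of scanning an array.
 */
static const char *
closest_ancestor(const char *name, const char *const *table, size_t n)
{
    assert(name != NULL);
    for (const char *suffix = name; suffix != NULL;) {
        for (size_t i = 0; i < n; i++) {
            if (strcmp(suffix, table[i]) == 0) {
                return table[i];
            }
        }
        /* No match: strip the leftmost label and try the parent. */
        const char *dot = strchr(suffix, '.');
        suffix = (dot != NULL && dot[1] != '\0') ? dot + 1 : NULL;
    }
    return NULL;
}
```

With a table holding `example.com` and `sub.example.com`, a lookup for `a.b.sub.example.com` would return `sub.example.com`, and a lookup for `www.example.com` would return `example.com`.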

Currently, the database of choice is QPtrie, which is based on RCU concepts and supports multiple wait-free readers and a single writer (plus locked snapshot creation, i.e. the writer blocks snapshots, but snapshots don't block the writer).
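The reader/writer split can be illustrated with copy-on-write publication through an atomic pointer, which is the basic idea RCU builds on. This is only a sketch under loud assumptions: `snapshot_t`, `reader_get` and `writer_publish` are made-up names, there is exactly one writer, and safe reclamation of the old copy (the grace period that real RCU provides) is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical immutable view of the data; not a BIND 9 type. */
typedef struct snapshot {
    unsigned generation;
    int value;
} snapshot_t;

static _Atomic(snapshot_t *) current; /* NULL until first publish */

/* Reader: one atomic load, never waits for the writer. */
static const snapshot_t *
reader_get(void)
{
    return atomic_load_explicit(&current, memory_order_acquire);
}

/*
 * Single writer: build a new copy, then publish it with one store.
 * Returns the previous copy; freeing it is only safe once no reader
 * can still hold it (reclamation is NOT implemented here).
 */
static snapshot_t *
writer_publish(int value)
{
    snapshot_t *old = atomic_load_explicit(&current, memory_order_relaxed);
    snapshot_t *next = malloc(sizeof(*next));
    assert(next != NULL);
    next->generation = (old != NULL) ? old->generation + 1 : 1;
    next->value = value;
    atomic_store_explicit(&current, next, memory_order_release);
    return old;
}
```

A reader that loaded the pointer before a publish keeps seeing the old, unchanged copy, which is why the old copy cannot be freed immediately.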

Authoritative DNS

Zone Table (dns_zt)

Zone Table is a tree of trees: it maintains the tree of Zone origins (the domain name at the top of each Zone tree).

  • Key: domain name
  • Data: Zone object
  • Thread-Safety: safe (uses RCU)
  • Functions:
    • add
    • remove
    • find
    • foreach
  • Requirements:
    • Partial matching

Zone Database (dns_qpzonedb_t)

  • Key: namespace + domain name
  • Data: qpznode
  • Thread-Safety: safe - uses RCU for the domain tree and global bucketed rwlocks to lock data
  • Functions:
    • add
    • remove
    • find
    • foreach
  • Requirements:
    • Write transactions
    • Snapshots - read transactions not depending on RCU read critical section
    • Read transactions that are fast
    • Partial matching
    • Previous and next (sorted by DNSSEC-defined sorting) - this is required for NSEC/NSEC3 - we need to be able to find "enclosers"
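The "DNSSEC-defined sorting" is the canonical name order from RFC 4034, section 6.1: labels are compared from the rightmost one, case-insensitively, and a name sorts after its ancestors. A minimal sketch, assuming text-form names without escapes or a trailing dot; `canonical_cmp` and `predecessor` are hypothetical helpers, and the linear scan stands in for the tree's previous/next operation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int
ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Start of the label that ends at `end` (exclusive) inside `name`. */
static const char *
label_start(const char *name, const char *end)
{
    while (end > name && end[-1] != '.') {
        end--;
    }
    return end;
}

/* Compare two names in canonical order; returns <0, 0 or >0. */
static int
canonical_cmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    const char *ea = a + strlen(a), *eb = b + strlen(b);
    int a_left = (*a != '\0'), b_left = (*b != '\0');

    while (a_left && b_left) {
        const char *sa = label_start(a, ea), *sb = label_start(b, eb);
        size_t la = (size_t)(ea - sa), lb = (size_t)(eb - sb);
        size_t n = (la < lb) ? la : lb;
        for (size_t i = 0; i < n; i++) {
            int ca = ascii_lower((unsigned char)sa[i]);
            int cb = ascii_lower((unsigned char)sb[i]);
            if (ca != cb) {
                return (ca < cb) ? -1 : 1;
            }
        }
        if (la != lb) {
            return (la < lb) ? -1 : 1; /* shorter label sorts first */
        }
        /* Labels are equal: move one label to the left in each name. */
        if (sa > a) { ea = sa - 1; } else { a_left = 0; }
        if (sb > b) { eb = sb - 1; } else { b_left = 0; }
    }
    return a_left - b_left; /* the name with fewer labels sorts first */
}

/* Index of the last entry sorting at or before `name`, or -1. */
static int
predecessor(const char *name, const char *const *sorted, int n)
{
    int best = -1;
    for (int i = 0; i < n && canonical_cmp(sorted[i], name) <= 0; i++) {
        best = i;
    }
    return best;
}
```

For a name that does not exist in the zone, the entry returned by `predecessor` is the candidate owner of the NSEC record that covers it.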

Description:

Zone Database is the structure that holds the tree of domain names inside a single DNS Zone. For every configured "zone" there is a 1:1 mapping with the Zone object.

The current implementation uses two layers for Owner-Name + RRType lookups. The Zone Database is indexed by Namespace+DomainName and stores an object that holds the list of RRTypes at the node (slabheader). Our long-term goal is to replace the slabheaders with a single layer indexed by Namespace+Name+Type, but we are not there yet and it is not that simple. This is explained below in the section about slabheaders.

Zone Database has a concept of "versions", currently built on top of the slabheader structures, and the intent is to move the "versions" into the database itself (hence the requirement for transactional support).

The tree actually holds three trees in a single underlying QPtrie: the namespace (NORMAL, NSEC or NSEC3) is prepended to the key.
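A rough illustration of how one prefix byte splits a single ordered key space into separate trees. Everything here is an assumption made for the example: the `keyspace_t` values, the `make_key` helper and the printable key layout (namespace digit followed by the labels in reverse order) do not reflect the real QPtrie key encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical namespace tags; the values are chosen for the example. */
typedef enum { NS_NORMAL = 0, NS_NSEC = 1, NS_NSEC3 = 2 } keyspace_t;

/*
 * Build "<ns digit><labels in reverse order>" into `out`, e.g.
 * (NS_NORMAL, "www.example.com") -> "0com.example.www".
 * Returns the key length, or 0 if `out` is too small.
 */
static size_t
make_key(keyspace_t ns, const char *name, char *out, size_t outlen)
{
    assert(name != NULL && out != NULL);
    size_t len = strlen(name), pos = 0;
    if (outlen < len + 2) { /* namespace byte + name + NUL */
        return 0;
    }
    out[pos++] = (char)('0' + ns);

    const char *end = name + len;
    while (end > name) {
        /* Find the rightmost label not yet copied. */
        const char *start = end;
        while (start > name && start[-1] != '.') {
            start--;
        }
        memcpy(out + pos, start, (size_t)(end - start));
        pos += (size_t)(end - start);
        if (start > name) {
            out[pos++] = '.';
            end = start - 1; /* skip the dot, continue leftwards */
        } else {
            end = name;
        }
    }
    out[pos] = '\0';
    return pos;
}
```

Because the namespace comes first, every key of one namespace sorts before every key of the next, and reversing the labels keeps a name next to its ancestors, which is what partial matching relies on.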

Reads need to be really fast. This is basically the DNS Lookup for a domain name. Nowadays, there's a mechanism to keep the results consistent for a single remote client query. I am not sure if this is actually required as DNS is globally only eventually consistent. But if the snapshots (see below) are very lightweight, they can be used for this too.

Snapshots have two uses:

  1. Outgoing Zone Transfers (e.g. outgoing AXFR or IXFR) - as the zones can grow quite large, the zone transfer might take several TCP sends
  2. Zone Dumps - same - when dumping the zone into the zone file, it runs asynchronously on a separate (slow) thread

Writes are relatively rare (compared to reads) as they happen only when there's a new version of the zone.

Updates can happen in four basic cases:

  1. zone transfer (AXFR) - the whole underlying tree is replaced with a new one. Zone transfers are serialized.
  2. incremental zone transfer (IXFR) - the updates are chunked - tree nodes are created and deleted. Zone transfers are serialized.
  3. updates - client-initiated updates - can be batched or applied one-by-one. Should not require COW for the whole database.
  4. signing - DNSSEC signing happens as timed tasks that trigger when the expiration is near. Results in data being deleted and inserted.

QPZNode (qpznode_t):

The qpznode is a reference-counted object stored in the Zone database. It has no thread-safety of its own, and is locked from the Zone database using global bucketed rwlocks (1024 of them currently). These locks are shared among all the Zones.

It contains some flags and data. The data is slabheaders that hold the actual Resource Data (RData) indexed by Resource Type (RType). Slabheaders are shared among Zones and Cache, so more on them below.

Extra stuff:

There's a globally locked heap that is indexed by the DNSSEC resigning time. When the DNSSEC signer fires up, it picks a slabheader (<name,type>) from the heap, signs it, and inserts the signature into the zone; the database then inserts the slabheader into the heap again with a new resign time. (This is not really a source of any contention.)
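The pop, sign, re-insert cycle can be sketched with a plain binary min-heap keyed by resign time. The types and function names below are hypothetical, the capacity is fixed for brevity, and the global lock mentioned above is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_MAX 64

/* Hypothetical heap entry: a <name,type> pair with its resign time. */
typedef struct {
    uint32_t resign; /* when the signature must be refreshed */
    const char *name;
    uint16_t type;
} resign_item_t;

typedef struct {
    resign_item_t items[HEAP_MAX];
    size_t count;
} resign_heap_t;

static void
heap_swap(resign_item_t *a, resign_item_t *b)
{
    resign_item_t tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Insert an item; returns 0 on success, -1 if the heap is full. */
static int
heap_push(resign_heap_t *h, resign_item_t item)
{
    assert(h != NULL);
    if (h->count == HEAP_MAX) {
        return -1;
    }
    size_t i = h->count++;
    h->items[i] = item;
    /* Sift up until the parent is not later than the new item. */
    while (i > 0 && h->items[(i - 1) / 2].resign > h->items[i].resign) {
        heap_swap(&h->items[(i - 1) / 2], &h->items[i]);
        i = (i - 1) / 2;
    }
    return 0;
}

/* Remove the item with the earliest resign time; -1 if empty. */
static int
heap_pop(resign_heap_t *h, resign_item_t *out)
{
    assert(h != NULL && out != NULL);
    if (h->count == 0) {
        return -1;
    }
    *out = h->items[0];
    h->items[0] = h->items[--h->count];
    for (size_t i = 0;;) {
        size_t l = 2 * i + 1, r = l + 1, min = i;
        if (l < h->count && h->items[l].resign < h->items[min].resign) {
            min = l;
        }
        if (r < h->count && h->items[r].resign < h->items[min].resign) {
            min = r;
        }
        if (min == i) {
            break;
        }
        heap_swap(&h->items[i], &h->items[min]);
        i = min;
    }
    return 0;
}
```

The signer would call `heap_pop`, produce the new signature, update `resign` on the popped item and hand it back through `heap_push`.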

Cache Database (dns_qpcache_t)

  • Key: namespace + domain name
  • Data: QPCNode
  • Thread-Safety: safe - uses read-write lock
  • Functions:
    • add
    • remove
    • find
    • foreach
  • Requirements:
    • Updates
    • Reads
    • Partial matching
    • Previous and next (sorted by DNSSEC-defined sorting) - this is required for NSEC/NSEC3 - we need to be able to find "enclosers"

Cache Database is the basic structure that powers the DNS Resolver. Hence, its requirements are a little bit different from those of the Zone Database. It requires only eventual consistency, but while writes are less common than reads, they are more common than in the Zone Databases. The write-to-read ratio might even be 1:1 in the case of a cold cache, where most operations are cache misses.

The current implementation uses two layers for Owner-Name + RRType lookups. The Cache Database is indexed by Namespace+DomainName and stores an object that holds the list of RRTypes at the node (slabheader). Our long-term goal is to replace the slabheaders with a single layer indexed by Namespace+Name+Type, but we are not there yet and it is not that simple. This is explained below in the section about slabheaders.

The tree actually holds two trees in a single underlying QPtrie: the namespace (NORMAL or NSEC) is prepended to the key.

Reads need to be fast, but there is no requirement for the reads to be consistent (only RData + RRSIG(RData) should be).

Writes need to be reasonably fast and the changes should propagate to other threads, as the cache is also used as shared state for composing the answers to the clients. Remote DNS queries are coalesced, and when the originating thread writes the data to the cache, it uses callbacks to notify the other threads that the data in the cache is now available.

When designing a DNS cache, both cold and hot cache performance are important, because DNS is basically defenseless against a remote attacker asking for random names (the so-called pseudo-random subdomain attack), which makes the cache store more junk than useful data.

QPCNode (qpcnode_t):

The QPCNode is a reference-counted object stored in the Cache database. It has no thread-safety of its own, and is locked from the Cache database using per-cache rwlocks (per-thread rwlock) called NodeLock.

It contains some flags and data. The data is slabheaders that hold the actual Resource Data (RData) indexed by Resource Type (RType). Slabheaders are shared among Zones and Cache, so more on them below.

Cleaning:

There are two cleaning mechanisms. The first is opportunistic, per-thread, heap-based TTL cleaning. When the Cache acquires the write lock on the NodeLock, it checks whether there are TTL-expired slabheaders to be cleaned and marks them as "ancient"; when the reference count on the Node drops to 0 and we hold or can acquire the write lock, the node gets cleaned. This is required so that we don't pull the rug out from under existing clients reading data from the QPCNode. This could probably be replaced with an RCU-based mechanism, but that is going to require changes in other places of the code.
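The two-step "mark as ancient, remove once unreferenced" idea can be sketched as follows. The structures and function names are hypothetical simplifications (a fixed array instead of slabheader lists, no locking, no TTL heap), so this only shows the ordering of the steps, not how the cache implements them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HEADERS 4

/* Hypothetical, simplified stand-ins for a slabheader and a cache node. */
typedef struct {
    uint16_t type;
    uint32_t expire; /* absolute expiry time */
    bool in_use;
    bool ancient;
} header_t;

typedef struct {
    unsigned refs; /* readers currently holding the node */
    header_t headers[MAX_HEADERS];
} node_t;

/* Step 1 (write lock assumed held): mark TTL-expired headers. */
static size_t
mark_expired(node_t *node, uint32_t now)
{
    size_t marked = 0;
    for (size_t i = 0; i < MAX_HEADERS; i++) {
        header_t *h = &node->headers[i];
        if (h->in_use && !h->ancient && h->expire <= now) {
            h->ancient = true;
            marked++;
        }
    }
    return marked;
}

/* Step 2: drop ancient headers, but only when nobody holds the node. */
static size_t
clean_if_unreferenced(node_t *node)
{
    size_t removed = 0;
    if (node->refs != 0) {
        return 0; /* a reader may still be looking at the data */
    }
    for (size_t i = 0; i < MAX_HEADERS; i++) {
        if (node->headers[i].in_use && node->headers[i].ancient) {
            node->headers[i].in_use = false;
            removed++;
        }
    }
    return removed;
}

/* Release one reference and clean up if it was the last one. */
static size_t
node_release(node_t *node)
{
    assert(node->refs > 0);
    node->refs--;
    return clean_if_unreferenced(node);
}
```

Marking is cheap and can happen whenever the write lock is already held; the actual removal is deferred until the last reader lets go of the node.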

The second cleaning mechanism is overmem cleaning. When the Cache detects the overmem condition (usage >= 7/8 of the configured max cache size), it triggers more aggressive cleaning that is based on SIEVE LRU. As a note: SIEVE requires the write lock on the NodeLock only on a cache miss, and just uses an atomic set on a cache hit.
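A compact sketch of the SIEVE eviction policy itself, to show why a hit needs only an atomic flag while a miss has to modify the list. It assumes integer keys, a linear lookup and no locking; `sieve_t`, `sieve_access` and the other names are hypothetical and unrelated to the actual cache code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct entry {
    int key;
    atomic_bool visited;
    struct entry *prev; /* toward the head (newer entries) */
    struct entry *next; /* toward the tail (older entries) */
} entry_t;

typedef struct {
    entry_t *head, *tail, *hand;
    size_t count, cap;
} sieve_t;

static entry_t *
sieve_find(const sieve_t *c, int key)
{
    for (entry_t *e = c->head; e != NULL; e = e->next) {
        if (e->key == key) {
            return e;
        }
    }
    return NULL;
}

/* Evict one entry: the hand walks from the tail toward the head. */
static void
sieve_evict(sieve_t *c)
{
    entry_t *e = (c->hand != NULL) ? c->hand : c->tail;
    while (atomic_load(&e->visited)) {
        atomic_store(&e->visited, false); /* second chance */
        e = (e->prev != NULL) ? e->prev : c->tail;
    }
    c->hand = e->prev; /* resume here on the next eviction */
    if (e->prev != NULL) { e->prev->next = e->next; } else { c->head = e->next; }
    if (e->next != NULL) { e->next->prev = e->prev; } else { c->tail = e->prev; }
    free(e);
    c->count--;
}

/* Returns true on a hit, false on a miss (the key is then inserted). */
static bool
sieve_access(sieve_t *c, int key)
{
    assert(c->cap > 0);
    entry_t *e = sieve_find(c, key);
    if (e != NULL) {
        atomic_store(&e->visited, true); /* hit: flag only, no relinking */
        return true;
    }
    if (c->count == c->cap) {
        sieve_evict(c);
    }
    e = malloc(sizeof(*e));
    assert(e != NULL);
    e->key = key;
    atomic_init(&e->visited, false);
    e->prev = NULL;
    e->next = c->head;
    if (c->head != NULL) { c->head->prev = e; } else { c->tail = e; }
    c->head = e;
    c->count++;
    return false;
}
```

Unlike classic LRU, a hit never moves the entry within the list, so the list structure only changes on insertion and eviction.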

Some Notes:

We've tried to convert the Cache to an LRU-based QP (and not a simple one), but the usage patterns make this actually slower than using our read-write lock (it is not pthread_rwlock but a simplified C-RW-WP).

SlabHeaders (dns_slabheader_t)

  • Key: RType - but it is really a linked list now
  • Data: flags + links + RData (but really opaque)
  • Thread-Safety: unsafe - requires external locking
  • Functions:
    • add
    • remove
    • find
    • foreach
  • Requirements:
    • Add
    • Remove
    • Find
    • Multiple-Find (I'll explain)

If the ZoneDB and CacheDB are the bones, SlabHeaders are the muscles and meat. They are indexed by RType, so the search for a <name,type> tuple first looks up the node in the QPtrie, and then looks for the matching Type in the SlabHeader linked list (more below). This is where it can get really complicated, as we also store proofs of non-existence of the Type and proofs of non-existence of all types in the same slabheader. The link between RType and RRSIG(RType) is also weak, as it requires two slabheaders that have links between them. This is required for the Zone Database when loading the zone (one-by-one), and the zone can also (theoretically) contain RRSIGs for types that don't exist. This is subject to future refactoring, but it gets complicated very quickly (it is not impossible, just complex).

Multiple-Find: when looking for <name,type>, we also need to look for <name,CNAME> (for CNAMEs), <name,NS> (for delegations) and <name,NSEC> (for proofs of non-existence). Again, this is a limitation of the current implementation, but the refactoring needs to be incremental, so this is going to stay for a little bit longer.
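A minimal sketch of such a multiple-find: one pass over the per-node list that collects the requested type together with CNAME, NS and NSEC. The `hdr_t` and `multifind_t` structures and the `multi_find` function are hypothetical; only the RR type numbers are the standard ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard RR type numbers. */
enum { T_A = 1, T_NS = 2, T_CNAME = 5, T_AAAA = 28, T_NSEC = 47 };

/* Hypothetical, stripped-down header: only the type and the link. */
typedef struct hdr {
    uint16_t type;
    struct hdr *next;
} hdr_t;

typedef struct {
    const hdr_t *found; /* the type that was asked for */
    const hdr_t *cname; /* alias at this name */
    const hdr_t *ns;    /* delegation at this name */
    const hdr_t *nsec;  /* proof of non-existence */
} multifind_t;

/* Walk the list once and pick up everything the caller may need. */
static multifind_t
multi_find(const hdr_t *list, uint16_t type)
{
    multifind_t r = { NULL, NULL, NULL, NULL };
    for (const hdr_t *h = list; h != NULL; h = h->next) {
        assert(h->type != 0); /* type 0 is reserved and never stored */
        if (h->type == type) {
            r.found = h;
        }
        if (h->type == T_CNAME) {
            r.cname = h;
        } else if (h->type == T_NS) {
            r.ns = h;
        } else if (h->type == T_NSEC) {
            r.nsec = h;
        }
    }
    return r;
}
```

The caller can then decide in one place whether to answer directly, follow the CNAME, return a referral or fall back to the NSEC proof.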

The linked list is actually a linked list of linked lists. This is currently required for ZoneDB, as the second layer of the linked list holds the different versions of the RData (imagine there's an outgoing transfer for version 1, but meanwhile an incoming transfer updated the database to version 2). There's an ongoing effort to remove the versions from the slabheaders and move the versions to use the database transactions. The Cache does not use versions, so the inner linked list is only used to keep the old data around if someone is still reading from it (most probably this could be replaced with an RCU-based mechanism). The outer list is indexed by Resource Type; the inner list is indexed by "version" for Zones. For the Cache, only the top slabheader is read, but it can become outdated while still in use.
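The two-level lookup can be sketched like this: find the type in the outer list, then walk down the versions until one is visible to the reader. The `slab_t` layout and `find_version` are hypothetical, the inner list is assumed to be ordered newest first, and serials are compared as plain integers (no wrap-around handling).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { T_A = 1, T_NS = 2, T_AAAA = 28 };

/* Hypothetical two-level list element; not the real slabheader layout. */
typedef struct slab {
    uint16_t type;
    uint32_t serial;   /* version that introduced this data */
    const char *rdata; /* opaque payload, shown as text here */
    struct slab *next; /* outer list: next type */
    struct slab *down; /* inner list: older version of the same type */
} slab_t;

/*
 * Return the data of `type` as seen by a reader pinned to `serial`:
 * the newest version that is not newer than the reader's version.
 */
static const slab_t *
find_version(const slab_t *top, uint16_t type, uint32_t serial)
{
    for (const slab_t *t = top; t != NULL; t = t->next) {
        if (t->type != type) {
            continue;
        }
        for (const slab_t *v = t; v != NULL; v = v->down) {
            /* Assumed invariant: versions are ordered newest first. */
            assert(v->down == NULL || v->down->serial < v->serial);
            if (v->serial <= serial) {
                return v;
            }
        }
        return NULL; /* the type did not exist yet in that version */
    }
    return NULL; /* no such type at this node */
}
```

An outgoing transfer pinned to version 1 would keep reading the older element, while new queries at version 2 see the top one.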