--- /dev/null
+Tree format
+===========
+
+The Maple Tree squeezes various bits in at various points which aren't
+necessarily obvious. Usually, this is done by observing that pointers are
+N-byte aligned and thus the bottom log_2(N) bits are available for use.
+We don't use the high bits of pointers to store additional information
+because we don't know what bits are unused on any given architecture.
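+
+As a minimal sketch of this technique (the helper names are invented for
+illustration, not taken from the kernel), the low bits of a sufficiently
+aligned pointer can carry a tag:

```c
#include <assert.h>
#include <stdint.h>

/* A 128-byte-aligned maple node leaves 7 low bits free; this sketch
 * uses only the bottom 3 bits, which any 8-byte-aligned object has. */
#define TAG_MASK 7UL

static void *tag_pointer(void *p, unsigned long tag)
{
	assert(((uintptr_t)p & TAG_MASK) == 0);	/* alignment guarantees these bits are clear */
	return (void *)((uintptr_t)p | tag);
}

static void *untag_pointer(void *p)
{
	return (void *)((uintptr_t)p & ~TAG_MASK);
}

static unsigned long pointer_tag(void *p)
{
	return (uintptr_t)p & TAG_MASK;
}
```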
+
+Nodes
+-----
+
+Most nodes are 128 bytes in size and are also aligned to 128 bytes, giving
+us 7 low bits for our own purposes. Nodes are of five kinds:
+
+1. Non-leaf range nodes.
+2. Leaf range nodes.
+3. Leaf sparse nodes.
+4. Leaf dense nodes.
+5. Leaf page nodes.
+
+The root node may be of any type. The minimum value in the node pointed
+to by ma_root is always 0. The maximum value is implied by the maximum value
+representable by the node type. eg a leaf range_21 node would have a
+maximum of 2^21-1.
+
+Non-page nodes store the parent pointer in the first word of the node;
+see 'Node Parent' below. Page nodes store the parent pointer in the
+'struct page' corresponding to the address of the node.
+
+Tree Root
+---------
+
+If the tree contains a single entry at index 0, it is usually stored in
+tree->ma_root. To optimise for the page cache, an entry which ends in
+'00', '01' or '11' is stored in the root, but an entry which ends in '10'
+will be stored in a node. Bit 2 is set if there are any NULL entries
+in the tree. Bits 3-6 are used to store enum maple_type.
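+
+A sketch of decoding the root word under the layout just described
+(the function names are illustrative, not the kernel's):

```c
#include <assert.h>

/* tree->ma_root: if the bottom two bits are not '10', the word is a
 * value entry stored directly at index 0.  Otherwise bit 2 flags NULL
 * entries below, bits 3-6 hold the enum maple_type, and masking off
 * the 7 low bits recovers the 128-byte-aligned node pointer. */
static int root_is_value(unsigned long root)
{
	return (root & 3) != 2;
}

static unsigned long root_node_type(unsigned long root)
{
	return (root >> 3) & 15;	/* bits 3-6 */
}

static int root_has_nulls(unsigned long root)
{
	return (root >> 2) & 1;		/* bit 2 */
}

static void *root_node(unsigned long root)
{
	return (void *)(root & ~127UL);	/* clear the 7 low bits */
}
```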
+
+The flags are used both to store some immutable information about this tree
+(set at tree creation time) and dynamic information set under the spinlock.
+Needed flags: irq/bh safe locking, whether to support 'reserved', whether to
+track max-free-range, whether to reap nodes which only contain value entries,
+whether slot 0 is busy, how many mark bits to support (5 bits), 18 mark bits
+(total 29 bits).
+
+Node Slots
+----------
+
+Leaf nodes (dense, sparse_N and leaf_N) do not store pointers to nodes,
+they store user data. Users may store almost any bit pattern. As noted
+above, the optimisation of storing an entry at 0 in the root pointer
+cannot be done for data which have the bottom two bits set to '10'.
+We also reserve values with the bottom two bits set to '10' which are
+below 4096 (ie 2, 6, 10 .. 4094) for internal use. Some APIs return
+errnos as a negative errno shifted left by two bits with the bottom two
+bits set to '10', and while choosing to store these values in the array
+is not an error, it may lead to confusion if you're testing for an error
+with xa_is_err().
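+
+A sketch of that encoding, modelled on the XArray's internal entries
+(names invented here; the arithmetic right shift to recover the sign is
+an assumption that holds on the usual compilers):

```c
#include <assert.h>
#include <errno.h>

static void *mk_errno_entry(long err)	/* err is a negative errno */
{
	return (void *)((unsigned long)err << 2 | 2);
}

static int entry_is_internal(void *entry)
{
	return ((unsigned long)entry & 3) == 2;	/* bottom two bits are '10' */
}

static long entry_to_errno(void *entry)
{
	return (long)entry >> 2;	/* arithmetic shift restores the sign */
}
```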
+
+Non-leaf nodes store the type of the node pointed to (enum maple_type
+in bits 3-6), and whether there are any NULL entries anywhere below this
+node in bit 2. That leaves bits 0-1 unused for now.
+
+Node Parent
+-----------
+
+The node->parent of the root node has bit 0 set and the rest of the
+pointer is a pointer to the tree itself. No more bits are available in
+this pointer (on m68k, the data structure may only be 2-byte aligned).
+
+Non-root nodes can only have maple_range_* nodes as parents, so we only
+need to distinguish between range_16, range_32 and range_64. With 32-bit
+pointers, we can store up to 11 pointers in a range_64, 16 in a range_32
+and 21 in a range_16, so encode a range_16 as 00 (Bits 2-6 encode the
+slot number), encode range_32 as 010 (Bits 3-6 encode the slot number)
+and range_64 as 110 (Bits 3-6 encode the slot number).
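+
+A sketch of that parent-word encoding (illustrative helpers, not the
+kernel's; bit 0 stays clear so the word is never mistaken for a root
+parent pointer):

```c
#include <assert.h>

enum parent_kind { PARENT_RANGE_16, PARENT_RANGE_32, PARENT_RANGE_64 };

/* Bit 1 separates range_16 (slot in bits 2-6) from range_32/64
 * (slot in bits 3-6); bit 2 then separates range_32 from range_64. */
static unsigned long encode_parent(unsigned long node, enum parent_kind k,
				   unsigned int slot)
{
	switch (k) {
	case PARENT_RANGE_16:
		return node | slot << 2;	/* 00: slot in bits 2-6 */
	case PARENT_RANGE_32:
		return node | slot << 3 | 2;	/* 010: slot in bits 3-6 */
	default: /* PARENT_RANGE_64 */
		return node | slot << 3 | 6;	/* 110: slot in bits 3-6 */
	}
}

static enum parent_kind parent_kind(unsigned long parent)
{
	if (!(parent & 2))
		return PARENT_RANGE_16;
	return (parent & 4) ? PARENT_RANGE_64 : PARENT_RANGE_32;
}

static unsigned int parent_slot(unsigned long parent)
{
	if (!(parent & 2))
		return (parent >> 2) & 31;	/* bits 2-6 */
	return (parent >> 3) & 15;		/* bits 3-6 */
}
```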
+
+Auxiliary Data
+--------------
+
+At tree creation time, the user can specify that they're willing to
+trade off storing fewer entries in a tree in return for storing more
+information in each node.
+
+The maple tree supports marking entries and searching for entries which
+are marked. In order to do this efficiently, it has to store one bit per
+mark per slot in each node. In the absence of auxiliary data, there are
+no bits available in the range_64 node, so we have to sacrifice a slot, and
+so we may as well sacrifice a pivot. That gives us 128 bits for 7 slots,
+which lets us support up to 18 mark bits. For range_32 nodes, we normally
+have 10 slots and 4 bytes free. That lets us support 3 mark bits without
+giving up any slots, and once we've given up one slot plus one pivot, we
+can support up to 14 mark bits. For range_16 nodes, we normally have 12
+slots and 2 bytes free which lets us store 1 mark bit without giving up
+any slots. Giving up one slot frees 10 bytes, which lets us store 8 mark
+bits, and giving up two slots lets us store 17 mark bits. Dense nodes
+can store 4 bits by giving up one slot, 9 bits by giving up two slots,
+16 bits by giving up three slots and 23 bits by giving up four slots.
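+
+The arithmetic above can be checked mechanically. Assuming 8-byte slots
+and pivot sizes of 8, 4 and 2 bytes for range_64, range_32 and range_16,
+the freed bits divided among the remaining slots reproduce each figure:

```c
#include <assert.h>

/* How many mark bits fit: the freed bytes, as bits, divided evenly
 * among the slots that remain. */
static unsigned int mark_bits(unsigned int free_bytes, unsigned int slots)
{
	return free_bytes * 8 / slots;
}
```

+For instance, sacrificing one slot plus one pivot in a range_64 frees
+16 bytes for the 7 remaining slots, and mark_bits(16, 7) gives the 18
+quoted above.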
+
+Another use-case for auxiliary data is to record the largest range of NULL
+entries available in this node, also called gaps. This optimises the tree
+for allocating a range.
+
+We do not currently know of any users who want to use both marks and
+max-free-range auxiliary data; for ease of implementation, only one of the
+two is supported, but should a user show up, it would be straightforward
+to add support in the future.
+
+Maple State
+-----------
+
+If state->node has bit 0 set then it references a tree location which
+is not a node (eg the root). If bit 1 is set, the rest of the bits
+are a negative errno. Bit 2 (the 'unallocated slots' bit) is clear.
+Bits 3-6 indicate the node type. This encoding is chosen so that in the
+optimised iteration loop the inline function can simply test (unsigned
+long)state->node & 127 and bail to the out-of-line functions if it's
+not 0. The maple_dense enum must remain at 0 for this to work.
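+
+A sketch of that fast-path test (illustrative only): a word with any of
+its 7 low bits set cannot be a plain pointer to a 128-byte-aligned
+maple_dense node, so the inline iterator bails out:

```c
#include <assert.h>

static int needs_slow_path(unsigned long node)
{
	return (node & 127) != 0;	/* non-zero for errors, non-node
					 * locations and non-dense types */
}
```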
+
+state->alloc uses the low 7 bits for storing a number. If state->node
+indicates -ENOMEM, the number is the number of nodes which need to be
+allocated. If state->node is a pointer to a node, this space is reused to keep
+track of a slot number. If more than one node is allocated, then the nodes are
+placed in state->node slots. Once those slots are full, the slots in the next
+allocated node are filled. This pattern continues in sequence such that the
+location of a node can be defined as follows:
+
+state->node->slot[X / MAPLE_NODE_SLOTS]->slot[X % MAPLE_NODE_SLOTS]
+
+where X is the allocation count minus 1 (the first allocation being the
+initial state->node itself).
+
+The high bits of state->alloc are either NULL or a pointer to the first node
+allocated.
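+
+A sketch of that lookup, implementing the formula literally (the slot
+count and struct layout here are invented for illustration):

```c
#include <assert.h>

#define MAPLE_NODE_SLOTS 8	/* illustrative value only */

struct alloc_node {
	struct alloc_node *slot[MAPLE_NODE_SLOTS];
};

/* Find allocation number 'count', where the head node itself is
 * allocation 1, so X = count - 1. */
static struct alloc_node *nth_allocated(struct alloc_node *head,
					unsigned int count)
{
	unsigned int x = count - 1;

	return head->slot[x / MAPLE_NODE_SLOTS]->slot[x % MAPLE_NODE_SLOTS];
}
```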
+
+
+Tree Operations
+===============
+
+Inserting
+---------
+
+Inserting a new range inserts either 0, 1, or 2 pivots within the tree. If the
+insert fits exactly into an existing gap with a value of NULL, then the slot
+only needs to be written with the new value. If the range being inserted is
+adjacent to another range, then only a single pivot needs to be inserted (as
+well as writing the entry). If the new range is within a gap but does not
+touch any other ranges, then two pivots need to be inserted: the start - 1, and
+the end. As usual, the entry must be written. Most operations require a
+new node to be allocated to replace an existing node, to ensure RCU safety.
+The exception to requiring a newly allocated node is when inserting at the
+end of a node (appending).
+
+Storing
+-------
+
+Storing is the same operation as insert with the added caveat that it can
+overwrite entries. Although this seems simple enough, one may want to examine
+what happens if a single store operation were to overwrite multiple entries
+within a B-tree.
+
+Erasing
+-------
+
+Erasing stores a special entry stating that this slot was occupied but is now
+considered NULL (XA_DELETED_ENTRY). Storing this special value allows for a
+working tree in the event that an allocation cannot occur for any reason. It
+also allows for fine-tuning of the performance of the tree's coalescing by
+avoiding allocating new nodes unless a threshold of entries with the special
+value exists. The special entry is automatically removed if a new node is
+needed for any other operation or to avoid splitting.
+
+Splitting & coalescing
+----------------------
+
+To keep allocations to a minimum, splitting should only occur when absolutely
+necessary. This means that it may be necessary for the node to the left to
+acquire entries from the node which would otherwise need to be split. This
+would allow for two allocations (parent, and a single child) and an append
+instead of three allocations (parent, and two new children). The parent would
+need to be a new node to avoid implying the incorrect range for slot 0.
+
+Coalescing should also keep the allocation of new nodes to a minimum. To
+achieve the low allocation count goal, coalescing will attempt to combine
+the left and right nodes into a single node. If this is possible, then the
+result will be a parent which contains two duplicate pivots of the
+right-most entry value and the right child will be replaced by an
+XA_SKIP_ENTRY. Upon completion, the parent will need to be checked for
+coalescing as well. If the left and the right nodes cannot fit into a
+single node, then the data is split evenly between both nodes, the parent
+pivot is adjusted, and the vacated slots are filled with XA_RETRY_ENTRY
+values. This avoids the need to rebalance the right child, and the parent
+does not need to be rebalanced either.
+
+To avoid jitter between coalescing and splitting during a remove and insert
+operation, there is a need to keep the low watermark for coalescing below the
+high watermark of data resulting from a split operation. It is worth noting
+that deleting a single entry may result in a NULL, XA_DELETED_ENTRY, NULL
+sequence, which means a node size can change by 2 with a single delete.
+
+Rebalance & Replacing
+---------------------
+
+Rebalancing occurs if a non-leaf node does not have the minimum number of
+occupied slots. Rebalancing occurs by pulling data from the node to the
+right into this node. If the right node is left with no data, it is freed.
+If there is no right node, then rebalancing is done for the left node.
+If a single node remains, then the tree's height is reduced. When a node
+is freed, the parent is also checked for rebalancing.
+
+Replacing a tree is rarely necessary; however, in the case of a store
+operation causing a hugely unbalanced tree to be produced, a rebuild is
+currently used to restore a compliant B-tree. ***Note that this will be
+revisited and replaced.***
+
+Worked examples
+===============
+
+Inserting multiples of 5
+------------------------
+
+Insert p0 at 0: Tree contains p0 at root.
+
+Insert p1 at 5: We allocate n0:
+n0: (p0, 0, NULL, 4, p1, 5, NULL, 0)
+
+Insert p2 at 10: We append to n0:
+n0: (p0, 0, NULL, 4, p1, 5, NULL, 9, p2, 10, NULL, 0)
+
+Insert p3 at 15: We append to n0:
+n0: (p0, 0, NULL, 4, p1, 5, NULL, 9, p2, 10, NULL, 14, p3, 15, NULL)
+
+Insert p4 at 20: We allocate a replacement n0 as well as n1 and n2:
+n0: (n1, 10, n2, 0xff..ff)
+n1: (p0, 0, NULL, 4, p1, 5, NULL, 9, p2, 10)
+n2: (NULL, 14, p3, 15, NULL, 19, p4, 20, NULL, 0)
+
+Insert p5 at 25: We append to n2:
+n2: (NULL, 14, p3, 15, NULL, 19, p4, 20, NULL, 24, p5, 25, NULL, 0)
+
+Insert p6 at 30: We allocate a replacement n2 and n3 and add to n0:
+n0: (n1, 10, n2, 20, n3, 0xff..ff)
+n1: (p0, 0, NULL, 4, p1, 5, NULL, 9, p2, 10)
+n2: (NULL, 14, p3, 15, NULL, 19, p4, 20)
+n3: (NULL, 24, p5, 25, NULL, 29, p6, 30, NULL, 0)
+
+Insert p7 at 35: We add to n3:
+n3: (NULL, 24, p5, 25, NULL, 29, p6, 30, NULL, 34, p7, 35, NULL, 0)
+
+Insert p8 at 3. We allocate a replacement n1:
+n1: (p0, 0, NULL, 2, p8, 3, NULL, 4, p1, 5, NULL, 9, p2, 10)
+
+Insert p9 at 4. It already has a slot open for it:
+n1: (p0, 0, NULL, 2, p8, 3, p9, 4, p1, 5, NULL, 9, p2, 10)
+
+Insert p10 at 1. We allocate a replacement n1:
+n1: (p0, 0, p10, 1, NULL, 2, p8, 3, p9, 4, p1, 5, NULL, 9, p2)
+
+Insert p11 at 6. We allocate a replacement n1, a new n4 and a replacement n0:
+n0: (n1, 6, n4, 10, n2, 20, n3, 0xff..ff)
+n1: (p0, 0, p10, 1, NULL, 2, p8, 3, p9, 4, p1, 5, p11, 6, NULL)
+n4: (NULL, 9, p2, 10)
+(yes, n4 violates the minimum occupancy requirement of a B-tree, but that's
+no worse than violating the minimum span requirement of a Maple Tree in
+terms of number of nodes allocated, height of tree or number of cachelines
+accessed. We might choose to merge n4 with n2 in this specific instance).
+
+We considered this alternative:
+
+# Insert p6 at 30: We append to n2 again and change n0:
+# n0: (n1, 10, n2, 30, NULL, 0)
+# n2: (NULL, 14, p3, 15, NULL, 19, p4, 20, NULL, 24, p5, 25, NULL, 29, p6)
+
+# Insert p7 at 35: We allocate a replacement n2 and n3 and change n0 right-to-left
+# n0: (n1, 10, n2, 25, n3, 0xff..ff)
+# n1: (p0, 0, NULL, 4, p1, 5, NULL, 9, p2, 10)
+# n2: (NULL, 14, p3, 15, NULL, 19, p4, 20, NULL, 24, p5, 25)
+# n3: (NULL, 29, p6, 30, NULL, 34, p7, 35, NULL, 0)
+
+but decided against it because it makes range32s harder.
+
+Inserting in reverse order
+--------------------------
+
+for (n = 1000; n > 0; n--)
+ store(n, p_n);
+
+We start with a leaf sparse node:
+
+n0: (1000, p1000)
+n0: (1000, p1000, 999, p999)
+n0: (1000, p1000, 999, p999, 998, p998)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997, 996, p996)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997, 996, p996, 995, p995)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997, 996, p996, 995, p995, 994, p994)
+
+Now n0 is full. When we try to insert p993, it's time to allocate
+another node. The range of the indices is small enough to allocate a
+dense node, but the node doesn't start at 0, so we need a parent node.
+
+n0: (NULL, 992, n1, 1000, NULL, 0)
+n1: (p993..p1000, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+
+Note that we have NULL stored in a non-leaf node. This is allowed, but
+makes inserting into those slots tricky as we need to know what level we
+need to descend to in order to create leaf nodes.
+
+When we try to insert 992, we need to decide which type of node to create.
+Let's assume sparse. So we insert it into the root (no need to reallocate
+yet):
+
+n0: (n2, 992, n1, 1000, NULL, 0)
+n2: (992, p992)
+
+And we continue, as above:
+n2: (992, p992, 991, p991, 990, p990, 989, p989, 988, p988, 987, p987, 986, p986)
+
+When we come to insert 985, there's nowhere to store it. Again, the range
+of the indices in this node is small, so we want to create a dense node.
+If we were tracking the maximum-free-range, we might make a different
+splitting decision here. Create a new, dense n2, and create a
+replacement n0 for the "split" of n2:
+
+n0: (NULL, 977, n2, 992, n1, 1000, NULL, 0)
+n2: (NULL, NULL, NULL, NULL, NULL, NULL, NULL, p985, p986, ..., p992)
+
+We can now fill in n2 all the way to index 978.
+
+File descriptors
+----------------
+
+The kernel opens fds 0,1,2, then bash opens fd 255. File descriptors
+may well want one tag (for close-on-exec), so we have to sacrifice
+some space per node to store the tag information. Because this is
+an allocating XArray, we'll assume a dense node for the first three
+allocations. Then we have to decide what to do when allocating 255.
+The best approach is to replace it with a sparse_9 node which allows
+us to store 12 pointers (usually 13, but one slot is lost to the tag
+space). We'll then continue to use it for fds 3-10. Upon allocating
+fd 11, we'd want three nodes: a range_16 with two children, dense from
+0-13 and then sparse_9 from 255-255. Continuing to allocate fds from
+the bottom up will result in allocating dense nodes from 14-27, 28-41,
+42-55, 56-69, 70-83, 84-97, ... until the range_16 is full at 12 * 14 =
+168 pointers. From there, we'd allocate another range_16.
+
+The 4kB page node can accommodate 504 pointers with 504 bits used for
+the tag. From there, we'd want to go to a 64kB page (order 4) and
+then to a 1MB page (order 8). Ideally we'd go to a 16MB page next,
+but MAX_ORDER is capped at 11, so we'd start allocating 8MB pages.
+Each can accommodate around a million pointers.
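+
+The 4kB figure can be double-checked: n 8-byte pointers plus one tag
+bit per pointer (rounded up to whole bytes here, an assumption about
+the layout) must fit in the page, and 504 is the largest n that does:

```c
#include <assert.h>

static unsigned int page_node_bytes(unsigned int n)
{
	/* n pointers of 8 bytes plus n tag bits packed into bytes */
	return n * 8 + (n + 7) / 8;
}
```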
+
+Deep Thoughts
+=============
+
+There are data structures in the kernel which provide support for
+overlapping ranges. The XArray / Maple Tree have no support for this
+idiom, but maybe this data structure could be enhanced to support it.
+If we want to be rid of the rbtree, we'll need to do something like this.
+(Examples: DRBD, x86 PAT. file locks might also benefit from being
+converted from a linked list)