N-byte aligned and thus the bottom log_2(N) bits are available for use.
We don't use the high bits of pointers to store additional information
because we don't know what bits are unused on any given architecture.
-Nodes are 128 bytes in size and are also aligned to 128 bytes, giving
-us 7 bits for our own purposes.
+
+Nodes
+-----
+
+Most nodes are 128 bytes in size and are also aligned to 128 bytes, giving
+us 7 low bits for our own purposes. Nodes are of five kinds:
+
+1. Non-leaf range nodes.
+2. Leaf range nodes.
+3. Leaf sparse nodes.
+4. Leaf dense nodes.
+5. Leaf page nodes.
+
+The root node may be of any type. The minimum value in the node pointed
+to be root is always 0. The maximum value is implied by the maximum value
+representable by the node type. eg a leaf range_21 node would have a
+maximum of 2^21-1.
+
+Non-page nodes store the parent pointer in the first word of the node;
+see 'Node Parent' below. Page nodes store the parent pointer in the
+'struct page' corresponding to the address of the node.
Tree Root
---------
If the tree contains a single entry at index 0, it is usually stored in
-tree->ma_root. To optimise for the page cache, an entry which ends in '00',
-'01' or '11' is stored in the root, but an entry which ends in '10' will be
-stored in a node. Bit 2 is set if there are any NULL entries in the tree.
-Bits 3-6 are used to store enum maple_type.
+tree->ma_root. To optimise for the page cache, an entry which ends in
+'00', '01' or '11' is stored in the root, but an entry which ends in '10'
+will be stored in a node. Bit 2 is set if there are any NULL entries
+in the tree. Bits 3-6 are used to store enum maple_type.
The flags are used both to store some immutable information about this tree
(set at tree creation time) and dynamic information set under the spinlock.
slot number), encode range_32 as 010 (Bits 3-6 encode the slot number)
and range_64 as 110 (Bits 3-6 encode the slot number).
-Parent bits:
-bit 0: 1 = root, 0 otherwise
-bit 1: 0 = range 16, 1 otherwise
-bit 2: 0 = range 32, 1 = range 64
-
-
Auxiliary Data
--------------
but decided against it because it makes range32s harder.
+Inserting in reverse order
+--------------------------
+
+for (n = 1000; n > 0; n--)
+ store(n, p_n);
+
+start with a leaf sparse node:
+
+n0: (1000, p1000)
+n0: (1000, p1000, 999, p999)
+n0: (1000, p1000, 999, p999, 998, p998)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997, 996, p996)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997, 996, p996, 995, p995)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997, 996, p996, 995, p995, 994, p994)
+
+Now n0 is full. When we try to insert p993, it's time to allocate
+another node. The range of the indices is small enough to allocate a
+dense node, but the node doesn't start at 0, so we need a parent node.
+
+n0: (NULL, 992, n1, 1000, NULL, 0)
+n1: (p993..p1000, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+
+Note that we have NULL stored in a non-leaf node. This is allowed, but
+makes inserting into those slots tricky as we need to know what level we
+need to descend to in order to create leaf nodes.
+
+When we try to insert 992, we need to decide which type of node to create.
+Lets assume sparse. So we insert it into the root (no need to reallocate
+yet):
+
+n0: (n2, 992, n1, 1000, NULL, 0)
+n2: (992, p992)
+
+And we continue, as above:
+n2: (992, p992, 991, p991, 990, p990, 989, p989, 988, p988, 987, p987, 986, p986)
+
+When we come to insert 985, there's nowhere to store it. Again, the range
+of the indices in this node are small, so we want to create a dense node.
+If we were tracking the maximum-free-range, we would make a different
+splitting decision from this. Create a new, dense n2, and create a
+replacement n0 for the "split" of n2:
+
+n0: (NULL, 977, n2, 992, n1, 1000, NULL, 0)
+n2: (NULL, NULL, NULL, NULL, NULL, NULL, NULL, p985, p986, ..., p992)
+
+We can now fill in n2 all the way to index 978.
+
+File descriptors
+----------------
+
+The kernel opens fds 0,1,2, then bash opens fd 255. File descriptors
+may well want one tag (for close-on-exec), so we have to sacrifice
+some space per node to store the tag information. Because this is
+an allocating XArray, we'll assume a dense node for the first three
+allocations. Then we have to decide what to do when allocating 255.
+The best approach is to replace it with a sparse_9 node which allows
+us to store 12 pointers (usually 13, but the tag space). We'll then
+continue to use it for fds 3-10. Upon allocating fd 11, we'd want three
+nodes; a range_16 with two children; dense from 0-13 and then sparse_9
+from 255-255. Continuing to allocate fds from the bottom up will result
+in allocating dense nodes from 14-27, 28-41, 42-55, 56-69, 70-83, 84-97,
+... until the range_16 is full at 12 * 14 = 168 pointers. From there,
+we'd allocate another range_16
+
+The 4kB page node can accommodate 504 pointers with 504 bits used for
+the tag. From there. we'd want to go to a 64kB page (order 4) and
+then to a 1MB page (order 8). Ideally we'd go to a 16MB page next,
+but MAX_ORDER is capped at 11, so we'd start allocating 8MB pages.
+Each can accommodate around a million pointers.
Deep Thoughts
=============