From 373e4f3b340683050e359bde50a9ba19f9cb3d4c Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Thu, 15 Aug 2019 15:42:56 -0400
Subject: [PATCH] more thoughts

---
 Thoughts | 107 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 95 insertions(+), 12 deletions(-)

diff --git a/Thoughts b/Thoughts
index 3dae35297d1d..aa0f534677af 100644
--- a/Thoughts
+++ b/Thoughts
@@ -6,17 +6,36 @@
 necessarily obvious. Usually, this is done by observing that pointers
 are N-byte aligned and thus the bottom log_2(N) bits are available for
 use. We don't use the high bits of pointers to store additional
 information because we don't know what bits are unused on any given
 architecture.
-Nodes are 128 bytes in size and are also aligned to 128 bytes, giving
-us 7 bits for our own purposes.
+
+Nodes
+-----
+
+Most nodes are 128 bytes in size and are also aligned to 128 bytes,
+giving us 7 low bits for our own purposes. Nodes are of five kinds:
+
+1. Non-leaf range nodes.
+2. Leaf range nodes.
+3. Leaf sparse nodes.
+4. Leaf dense nodes.
+5. Leaf page nodes.
+
+The root node may be of any type. The minimum value in the node pointed
+to by root is always 0. The maximum value is implied by the maximum
+value representable by the node type, e.g. a leaf range_21 node would
+have a maximum of 2^21-1.
+
+Non-page nodes store the parent pointer in the first word of the node;
+see 'Node Parent' below. Page nodes store the parent pointer in the
+'struct page' corresponding to the address of the node.
 
 Tree Root
 ---------
 
 If the tree contains a single entry at index 0, it is usually stored in
-tree->ma_root. To optimise for the page cache, an entry which ends in '00',
-'01' or '11' is stored in the root, but an entry which ends in '10' will be
-stored in a node. Bit 2 is set if there are any NULL entries in the tree.
-Bits 3-6 are used to store enum maple_type.
+tree->ma_root.
+To optimise for the page cache, an entry which ends in
+'00', '01' or '11' is stored in the root, but an entry which ends in '10'
+will be stored in a node. Bit 2 is set if there are any NULL entries
+in the tree. Bits 3-6 are used to store enum maple_type.
 
 The flags are used both to store some immutable information about this
 tree (set at tree creation time) and dynamic information set under the
 spinlock.
@@ -57,12 +76,6 @@
 and 21 in a range_16, so encode a range_16 as 00 (Bits 2-6 encode the
 slot number), encode range_32 as 010 (Bits 3-6 encode the slot number)
 and range_64 as 110 (Bits 3-6 encode the slot number).
 
-Parent bits:
-bit 0: 1 = root, 0 otherwise
-bit 1: 0 = range 16, 1 otherwise
-bit 2: 0 = range 32, 1 = range 64
-
-
 Auxiliary Data
 --------------
 
@@ -255,6 +268,76 @@ We considered this alternative:
 
 but decided against it because it makes range32s harder.
 
+Inserting in reverse order
+--------------------------
+
+for (n = 1000; n > 0; n--)
+	store(n, p_n);
+
+Start with a leaf sparse node:
+
+n0: (1000, p1000)
+n0: (1000, p1000, 999, p999)
+n0: (1000, p1000, 999, p999, 998, p998)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997, 996, p996)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997, 996, p996, 995, p995)
+n0: (1000, p1000, 999, p999, 998, p998, 997, p997, 996, p996, 995, p995, 994, p994)
+
+Now n0 is full. When we try to insert p993, it's time to allocate
+another node. The range of the indices is small enough to allocate a
+dense node, but the node doesn't start at 0, so we need a parent node.
+
+n0: (NULL, 992, n1, 1000, NULL, 0)
+n1: (p993..p1000, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+
+Note that we have NULL stored in a non-leaf node. This is allowed, but
+makes inserting into those slots tricky as we need to know what level we
+need to descend to in order to create leaf nodes.
+
+When we try to insert 992, we need to decide which type of node to create.
+Let's assume sparse.
+So we insert it into the root (no need to reallocate yet):
+
+n0: (n2, 992, n1, 1000, NULL, 0)
+n2: (992, p992)
+
+And we continue, as above:
+n2: (992, p992, 991, p991, 990, p990, 989, p989, 988, p988, 987, p987, 986, p986)
+
+When we come to insert 985, there's nowhere to store it. Again, the range
+of the indices in this node is small, so we want to create a dense node.
+If we were tracking the maximum-free-range, we would make a different
+splitting decision here. Create a new, dense n2, and create a
+replacement n0 for the "split" of n2:
+
+n0: (NULL, 977, n2, 992, n1, 1000, NULL, 0)
+n2: (NULL, NULL, NULL, NULL, NULL, NULL, NULL, p985, p986, ..., p992)
+
+We can now fill in n2 all the way to index 978.
+
+File descriptors
+----------------
+
+The kernel opens fds 0, 1 and 2, then bash opens fd 255. File descriptors
+may well want one tag (for close-on-exec), so we have to sacrifice
+some space per node to store the tag information. Because this is
+an allocating XArray, we'll assume a dense node for the first three
+allocations. Then we have to decide what to do when allocating 255.
+The best approach is to replace it with a sparse_9 node, which allows
+us to store 12 pointers (usually 13, but the tag takes the space of
+one). We'll then continue to use it for fds 3-10. Upon allocating
+fd 11, we'd want three nodes: a range_16 with two children, dense from
+0-13 and then sparse_9 from 255-255. Continuing to allocate fds from
+the bottom up will result in allocating dense nodes covering 14-27,
+28-41, 42-55, 56-69, 70-83, 84-97, ... until the range_16 is full at
+12 * 14 = 168 pointers. From there, we'd allocate another range_16.
+
+The 4kB page node can accommodate 504 pointers with 504 bits used for
+the tag. From there, we'd want to go to a 64kB page (order 4) and
+then to a 1MB page (order 8). Ideally we'd go to a 16MB page next,
+but MAX_ORDER is capped at 11, so we'd start allocating 8MB pages.
+Each can accommodate around a million pointers.
 
 Deep Thoughts
 =============
-- 
2.50.1