Kent Overstreet [Tue, 2 Jun 2020 23:41:47 +0000 (19:41 -0400)]
bcachefs: Fix a deadlock in bch2_btree_node_get_sibling()
There was a bad interaction with bch2_btree_iter_set_pos_same_leaf(),
which can leave a btree node locked that is just outside iter->pos,
breaking the lock ordering checks in __bch2_btree_node_lock(). Ideally
we should get rid of this corner case, but for now fix it locally with
verbose comments.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 28 May 2020 20:06:13 +0000 (16:06 -0400)]
bcachefs: Fixes for going RO
Now that interior btree updates are fully transactional, we don't need
to write out alloc info in a loop. However, interior btree updates do
put more things in the journal, so we still need a loop in the RO
sequence.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 25 May 2020 18:57:06 +0000 (14:57 -0400)]
bcachefs: Interior btree updates are now fully transactional
We now update the alloc info (bucket sector counts) atomically with
journalling the update to the interior btree nodes, and we also set new
btree roots atomically with the journalled part of the btree update.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 24 May 2020 18:06:10 +0000 (14:06 -0400)]
bcachefs: Fix reading of alloc info after unclean shutdown
When updates to interior nodes started being journalled, that meant that
after an unclean shutdown, until journal replay is done we can't walk
the btree without overlaying the updates from the journal.
The initial btree gc was changed to walk the btree overlaying keys from
the journal - but bch2_alloc_read() and bch2_stripes_read() were missed.
Major whoops...
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Yuxuan Shui [Fri, 22 May 2020 14:50:05 +0000 (15:50 +0100)]
bcachefs: fix stack corruption
When a bkey_on_stack is passed to bch_read_indirect_extent, there is no
guarantee that it will be big enough to hold the bkey. And
bch_read_indirect_extent is not aware of bkey_on_stack to call realloc
on it. This cause a stack corruption.
This commit makes bch_read_indirect_extent aware of bkey_on_stack so it
can call realloc when appropriate.
Kent Overstreet [Tue, 12 May 2020 00:01:07 +0000 (20:01 -0400)]
bcachefs: Fixes for startup on very full filesystems
- Always pass BTREE_INSERT_USE_RESERVE when writing alloc btree keys
- Don't strand buckest on the copygc freelist until after recovery is
done and we're starting copygc.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 6 May 2020 19:37:04 +0000 (15:37 -0400)]
bcachefs: Some compression improvements
In __bio_map_or_bounce(), the check for if the bio is physically
contiguous is improved; it's now more readable and handles multi page
but contiguous bios.
Also when decompressing, we were doing a redundant memcpy in the case
where we were able to use vmap to map a bio contigiously.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 2 May 2020 20:21:35 +0000 (16:21 -0400)]
bcachefs: Fix two more deadlocks
Deadlock on shutdown:
btree_update_nodes_written() unblocks btree nodes from being written;
after doing so, it has to check if they were marked as needing to be
written and if so kick off those writes - if that doesn't happen, we'll
never release journal pins and shutdown will get stuck when flushing the
journal.
There was an error path where this didn't happen, because in the error
path we don't actually want those btree nodes write to happen; however,
we still have to kick off the write path so the journal pins get
released. The btree write path checks if we're in a journal error state
and doesn't do the actual write if we are.
Also - there was another deadlock because btree_update_nodes_written()
was taking the btree update off of the unwritten_list too soon - before
getting a journal reservation, which could fail and have to be retried.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
peek_slot() shouldn't return -EINTR when there's only a single live
iterator, but that's tricky to guarantee - we seem to be returning
-EINTR when we shouldn't, but it's easy enough to handle in the caller.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 4 Apr 2020 17:54:19 +0000 (13:54 -0400)]
bcachefs: Fix a null ptr deref during journal replay
We were calling bch2_extent_can_insert() incorrectly; it should only be
called when the extents-to-keys pass is running because that's when we
could be splitting a compressed extent. Calling bch2_extent_can_insert()
without passing in a disk reservation was causing a null ptr deref.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 7 Jan 2020 18:29:32 +0000 (13:29 -0500)]
bcachefs: Kill bkey_type_successor
Previously, BTREE_ID_INODES was special - inodes were indexed by the
inode field, which meant the offset field of struct bpos wasn't used,
which led to special cases in e.g. the btree iterator code.
Now, inodes in the inodes btree are indexed by the offset field.
Also: prevously min_key was special for extents btrees, min_key for
extents would equal max_key for the previous node. Now, min_key =
bkey_successor() of the previous node, same as non extent btrees.
This means we can completely get rid of
btree_type_sucessor/predecessor.
Also make some improvements to the metadata IO validate/compat code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Feb 2020 00:06:31 +0000 (19:06 -0500)]
bcachefs: Journal updates to interior nodes
Previously, the btree has always been self contained and internally
consistent on disk without anything from the journal - the journal just
contained pointers to the btree roots.
However, this meant that btree node split or compact operations - i.e.
anything that changes btree node topology and involves updates to
interior nodes - would require that interior btree node to be written
immediately, which means emitting a btree node write that's mostly empty
(using 4k of space on disk if the filesystemm blocksize is 4k to only
write perhaps ~100 bytes of new keys).
More importantly, this meant most btree node writes had to be FUA, and
consumer drives have a history of slow and/or buggy FUA support - other
filesystes have been bit by this.
This patch changes the interior btree update path to journal updates to
interior nodes, after the writes for the new btree nodes have completed.
Best of all, it turns out to simplify the interior node update path
somewhat.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 24 Mar 2020 21:00:48 +0000 (17:00 -0400)]
bcachefs: Disable extent merging
Extent merging is currently broken, and will be reimplemented
differently soon - right now it only happens when btree nodes are being
compacted, which makes it difficult to test.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 21 Mar 2020 18:47:00 +0000 (14:47 -0400)]
bcachefs: Fix a locking bug in fsck
This works around a btree locking issue - we can't be holding read locks
while taking write locks, which currently means we can't have live
iterators holding read locks at commit time.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 18 Mar 2020 15:40:07 +0000 (11:40 -0400)]
bcachefs: BCH_FEATURE_new_extent_overwrite is now required
The patch "bcachefs: Move extent overwrite handling out of core btree
code" should have been flipping on this feature bit; extent btree nodes
in the old format have to be rewritten before we can insert into them
with the new extent update path. Not turning on this feature bit was
causing us to go into an infinite loop where we keep rewriting btree
nodes over and over.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 16 Mar 2020 21:23:37 +0000 (17:23 -0400)]
bcachefs: Clear BCH_FEATURE_extents_above_btree_updates on clean shutdown
This is needed so that users can roll back to before "d9bb516b2d
bcachefs: Move extent overwrite handling out of core btree code", which
it appears may still be buggy.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 30 Dec 2019 19:37:25 +0000 (14:37 -0500)]
bcachefs: Move extent overwrite handling out of core btree code
Ever since the btree code was first written, handling of overwriting
existing extents - including partially overwriting and splittin existing
extents - was handled as part of the core btree insert path. The modern
transaction and iterator infrastructure didn't exist then, so that was
the only way for it to be done.
This patch moves that outside of the core btree code to a pass that runs
at transaction commit time.
This is a significant simplification to the btree code and overall
reduction in code size, but more importantly it gets us much closer to
the core btree code being completely independent of extents and is
important prep work for snapshots.
This introduces a new feature bit; the old and new extent update models
are incompatible when the filesystem needs journal replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 18 Feb 2020 21:17:55 +0000 (16:17 -0500)]
bcachefs: More btree iter invariants
Ensure that iter->pos always lies between the start and end of iter->k
(the last key returned). Also, bch2_btree_iter_set_pos() now invalidates
the key that peek() or next() returned.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 18 Feb 2020 21:17:55 +0000 (16:17 -0500)]
bcachefs: Iterator debug code improvements
More aggressively checking iterator invariants, and fixing the resulting
bugs. Also greatly simplifying iter_next() and iter_next_slot() - they
were hyper optimized before, but the optimizations were getting too
brittle.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 5 Mar 2020 23:43:31 +0000 (18:43 -0500)]
bcachefs: Skip 0 size deleted extents in journal replay
These are created by the new extent update path, but not used yet by the
recovery code and they break the existing recovery code, so we can just
skip them.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 9 Mar 2020 20:15:54 +0000 (16:15 -0400)]
bcachefs: Traverse iterator in journal replay
This fixes a bug where we end up spinning in journal replay - in theory
this shouldn't be necessary though, transaction reset should be
re-traversing all iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 2 Mar 2020 22:08:19 +0000 (17:08 -0500)]
bcachefs: Fix extent_sort_fix_overlapping()
Recently the extent update path started emmiting 0 size whiteouts on
extent overwrite, as part of transitioning to moving extent handling
out of the core btree code.
Unfortunately, this broke the old code path that handles overlapping
extents when reading in btree nodes - it relies on sorting incomming
extents by start position, but the 0 size whiteouts broke that ordering.
Skipping over them before the main algorithm sees them fixes this.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>