Kent Overstreet [Sun, 23 May 2021 01:43:20 +0000 (21:43 -0400)]
bcachefs: Fix an issue with inconsistent btree writes after unclean shutdown
After unclean shutdown, btree writes may have completed on one device
and not others - and this inconsistency could lead us to writing new
bsets with a gap in our btree node in one of our replicas.
Fortunately, this is only an issue with bsets that are newer than the
most recent journal flush, and we already have a mechanism for detecting
and blacklisting those. We just need to make sure to start new btree
writes after the most recent _non_ blacklisted bset.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Kent Overstreet [Sat, 22 May 2021 21:37:25 +0000 (17:37 -0400)]
bcachefs: Add a workqueue for btree io completions
Also, clean up workqueue usage - we shouldn't be using system
workqueues, pretty much everything we do needs to be on our own
WQ_MEM_RECLAIM workqueues.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Kent Overstreet [Sat, 22 May 2021 03:57:37 +0000 (23:57 -0400)]
bcachefs: Add a debug mode that always reads from every btree replica
There's a new module parameter, verify_all_btree_replicas, that enables
reading from every btree replica when reading in btree nodes and
comparing them against each other. We've been seeing some strange btree
corruption - this will hopefully aid in tracking it down and catching it
more often.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Kent Overstreet [Thu, 20 May 2021 19:49:23 +0000 (15:49 -0400)]
bcachefs: Fix for buffered writes getting -ENOSPC
Buffered writes may have to increase their disk reservation at btree
update time, due to compression and erasure coding being unpredictable:
O_DIRECT writes should be checking for -ENOSPC, but buffered writes have
already been accepted and should not.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Kent Overstreet [Wed, 19 May 2021 03:17:03 +0000 (23:17 -0400)]
bcachefs: Split extents if necessary in bch2_trans_update()
Currently, we handle multiple overlapping extents in the same
transaction commit by doing fixups in bch2_trans_update() - this patch
extents that to split updates when necessary. The next patch that
changes the reflink code to not fragment extents when making them
indirect will require this.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Kent Overstreet [Wed, 19 May 2021 03:53:43 +0000 (23:53 -0400)]
bcachefs: Ratelimiting for writeback IOs
Writeback throttling is a kernel config option and not always enabled.
When it's not enabled we need a fallback, to avoid unbounded memory
pinning and work item backlogs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Dan Robertson [Wed, 19 May 2021 00:36:20 +0000 (20:36 -0400)]
bcachefs: statfs resports incorrect avail blocks
The current implementation of bch_statfs does not scale the number of
available blocks provided in f_bavail by the reserve factor. This causes
an allocation of a file of this size to fail.
Signed-off-by: Dan Robertson <dan@dlrobertson.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brett Holman [Mon, 17 May 2021 03:53:55 +0000 (21:53 -0600)]
bcachefs: made changes to support clang, fixed a couple bugs
fs/bcachefs/bset.c edited prefetch macro to add clang support
fs/bcachefs/btree_iter.c bugfix: initialize iter->real_pos in bch2_btree_iter_init for later use
fs/bcachefs/io.c bugfix: eliminated undefined behavior (negative bitshift)
fs/bcachefs/buckets.c bugfix: invert sign to handle 64bit abs()
Signed-off-by: Brett Holman <bpholman5@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Dan Robertson [Sat, 15 May 2021 00:02:44 +0000 (20:02 -0400)]
bcachefs: properly initialize used values
- Ensure the second key value in bch_hash_info is initialized to zero
if the info type is of type BCH_STR_HASH_SIPHASH.
- Initialize the possibly returned value in bch2_inode_create. Assuming
bch2_btree_iter_peek returns bkey_s_c_null, the uninitialized value
of ret could be returned to the user as an error pointer.
- Fix compiler warning in initialization of bkey_s_c_stripe
fs/bcachefs/buckets.c:1646:35: warning: suggest braces around initialization
of subobject [-Wmissing-braces]
struct bkey_s_c_stripe new_s = { NULL };
^~~~
Signed-off-by: Dan Robertson <dan@dlrobertson.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Dan Robertson [Wed, 5 May 2021 11:09:43 +0000 (07:09 -0400)]
bcachefs: Fix out of bounds read in fs usage ioctl
Fix a possible read out of bounds if bch2_ioctl_fs_usage is called when
replica_entries_bytes is set to a value that is smaller than the size
of bch_replicas_usage.
Signed-off-by: Dan Robertson <dan@dlrobertson.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Dan Robertson [Thu, 13 May 2021 00:54:37 +0000 (20:54 -0400)]
bcachefs: Fix null deref in bch2_ioctl_read_super
Do not attempt to cleanup the returned value of bch2_device_lookup if
the returned value was an error pointer. We currently check to see if
the returned value is null and run the cleanup otherwise. As a result,
we attempt to run the cleanup on a error pointer.
Signed-off-by: Dan Robertson <dan@dlrobertson.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Dan Robertson [Wed, 12 May 2021 18:07:57 +0000 (14:07 -0400)]
bcachefs: Fix possible null deref on mount
Ensure that the block device pointer in a superblock handle is not
null before dereferencing it in bch2_dev_to_fs. The block device pointer
may be null when mounting a new bcachefs filesystem given another mounted
bcachefs filesystem exists that has at least one device that is offline.
Signed-off-by: Dan Robertson <dan@dlrobertson.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Dan Robertson [Sun, 9 May 2021 22:52:23 +0000 (18:52 -0400)]
bcachefs: Fix error in parsing of mount options
When parsing the mount options duplicate the given options. This is
required as the options are parsed twice and strsep is used in parsing.
The options will be modified into a possibly invalid options set for the
second round of parsing if the options are not duplicated before
parsing.
Signed-off-by: Dan Robertson <dan@dlrobertson.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Dan Robertson [Sat, 8 May 2021 02:29:02 +0000 (22:29 -0400)]
bcachefs: Fix oob write in __bch2_btree_node_write
Fix a possible out of bounds write in __bch2_btree_node_write when
the data buffer padding is cleared up to the block size. The out of
bounds write is possible if the data buffers size is not a multiple
of the block size.
Signed-off-by: Dan Robertson <dan@dlrobertson.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 29 Apr 2021 02:51:42 +0000 (22:51 -0400)]
bcachefs: Fix time handling
There were some overflows in the time conversion functions - fix this by
converting tv_sec and tv_nsec separately. Also, set sb->time_min and
sb->time_max.
Fixes xfstest generic/258.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Apr 2021 18:03:13 +0000 (14:03 -0400)]
bcachefs: Change copygc wait amount to be min of per device waits
We're seeing a filesystem get stuck when all devices but one have no
more reclaimable buckets - because the copygc wait amount is curretly
filesystem wide.
This patch should fix that, possibly at the expensive of running too
much when only one or a few devices is full and the rebalance thread
needs to move data around.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Apr 2021 18:02:00 +0000 (14:02 -0400)]
bcachefs: Change bch2_btree_key_cache_count() to exclude dirty keys
We're seeing livelocks that appear to be due to
bch2_btree_key_cache_scan repeatedly scanning and blocking other tasks
from using the key cache lock - we probably shouldn't be reporting
objects that can't actually be freed yet.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Apr 2021 20:24:03 +0000 (16:24 -0400)]
bcachefs: Evict btree nodes we're deleting
There was a bug that led to duplicate btree node pointers being inserted
at the wrong level. The new topology repair code can fix that, except
that the btree cache code gets confused when we read in a btree node
from the pointer that was at the wrong level. This patch evicts nodes
that we're deleting to, which nicely solves the problem.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 Apr 2021 01:08:49 +0000 (21:08 -0400)]
bcachefs: New check_nlinks algorithm for snapshots
With snapshots, using a radix tree for the table of link counts won't
work anymore because we also need to distinguish between inodes with
different snapshot IDs. Instead, this patch builds up a sorted array of
inodes that have hardlinks that we can binary search on - taking
advantage of the fact that with inode backpointers, the check_nlinks()
pass _only_ needs to concern itself with inodes that have hardlinks now.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 Apr 2021 20:32:35 +0000 (16:32 -0400)]
bcachefs: New and improved topology repair code
This splits out btree topology repair into a separate pass, and makes
some improvements:
- When we have to pick which of two overlapping nodes to drop keys
from, we use the btree node header sequence number to preserve the
newer node
- the gc code has been changed so that it doesn't bail out if we're
continuing/ignoring on fsck error - this way the dump tool can skip
running the repair pass but still walk all reachable metadata
- add a new superblock flag indicating when a filesystem is known to
have btree topology issues, and the topology repair pass should be
run
- changing the start/end of a node might mean keys in that node have to
be deleted: this patch handles that better by splitting it out into a
separate function and running it explicitly in the topology repair
code, previously those keys were only being dropped when the btree
node was read in.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 Apr 2021 20:18:43 +0000 (16:18 -0400)]
bcachefs: Fix repair leading to replicas not marked
bch2_check_fix_ptrs() was being called after checking if the replicas
set was marked - but repair could change which replicas set needed to be
marked. Oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 21 Apr 2021 22:08:39 +0000 (18:08 -0400)]
bcachefs: Don't BUG() in update_replicas
Apparently, we have a bug where in mark and sweep while accounting for a
key, a replicas entry isn't found. Change the code to print out the key
we couldn't mark and halt instead of a BUG_ON().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 20 Apr 2021 21:09:25 +0000 (17:09 -0400)]
bcachefs: Fix a deadlock on journal reclaim
Flushing the btree key cache needs to use allocation reserves - journal
reclaim depends on flushing the btree key cache for making forward
progress, and the allocator and copygc depend on journal reclaim making
forward progress.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 21 Apr 2021 00:21:12 +0000 (20:21 -0400)]
bcachefs: Update bch2_btree_verify()
bch2_btree_verify() verifies that the btree node on disk matches what we
have in memory. This patch changes it to verify every replica, and also
fixes it for interior btree nodes - there's a mem_ptr field which is
used as a scratch space and needs to be zeroed out for comparing with
what's on disk.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 19 Apr 2021 21:17:34 +0000 (17:17 -0400)]
bcachefs: Fix a use after free
Turns out, we weren't waiting on in flight btree writes when freeing
existing btree nodes. This lead to stray btree writes overwriting newly
allocated buckets, but only started showing itself with some of the
recent allocator work and another patch to move submitting of btree
writes to worqueues.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 19 Apr 2021 21:07:20 +0000 (17:07 -0400)]
bcachefs: Fix for btree_gc repairing interior btree ptrs
Using the normal transaction commit path to insert and journal updates
to interior nodes hadn't been done before this repair code was written,
not surprising that there was a bug.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 18 Apr 2021 21:44:35 +0000 (17:44 -0400)]
bcachefs: Always check for invalid bkeys in trans commit path
We check for this prior to metadata being written, but we're seeing some
strange bugs lately, and this will help catch those closer to where they
occur.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 18 Apr 2021 03:18:17 +0000 (23:18 -0400)]
bcachefs: Check that keys are in the correct btrees
We've started seeing bug reports of pointers to btree nodes being
detected in leaf nodes. This should catch that before it's happened, and
it's something we should've been checking anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 17 Apr 2021 01:53:23 +0000 (21:53 -0400)]
bcachefs: Allocator thread doesn't need gc_lock anymore
Even with runtime gc (which currently isn't supported), runtime gc no
longer clears/recalculates the main set of bucket marks - it allocates
and calculates another set, updating the primary at the end.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 17 Apr 2021 01:34:00 +0000 (21:34 -0400)]
bcachefs: gc shouldn't care about owned_by_allocator
The owned_by_allocator field is a purely in memory thing, even if/when
we bring back GC at runtime there's no need for it to be recalculating
this field. This is prep work for pulling it out of struct bucket, and
eventually getting rid of the bucket array.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 16 Apr 2021 18:29:26 +0000 (14:29 -0400)]
bcachefs: Fix transaction restarts due to upgrading of cloned iterators
This fixes a regression from 52d86202fd bcachefs: Improve bch2_btree_iter_traverse_all()
We want to avoid mucking with other iterators in the btree transaction
in operations that are only supposed to be touching individual iterators
- that patch was a cleanup to move lock ordering handling to
bch2_btree_iter_traverse_all(). But it broke upgrading of cloned
iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 16 Apr 2021 16:38:14 +0000 (12:38 -0400)]
bcachefs: Fix journal reclaim loop
When dirty key cache keys were separated from other journal pins, we
broke the loop conditional in __bch2_journal_reclaim() - it's supposed
to keep looping as long as there's work to do.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 14 Apr 2021 17:26:15 +0000 (13:26 -0400)]
bcachefs: Improve bch2_btree_iter_traverse_all()
By changing it to upgrade iterators to intent locks to avoid lock
restarts we can simplify __bch2_btree_node_lock() quite a bit - this
fixes a probable bug where it could potentially drop a lock on an
unrelated error but still succeed instead of causing a transaction
restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 Apr 2021 07:11:07 +0000 (03:11 -0400)]
bcachefs: Improved check_directory_structure()
Now that we have inode backpointers, we can simplify checking directory
structure: instead of doing a DFS from the filesystem root and then
checking if we found everything, we can iterate over every inode and see
if we can go up until we get to the root.
This patch also has a number of fixes and simplifications for the inode
backpointer checks. Also, it turns out we don't actually need the
BCH_INODE_BACKPTR_UNTRUSTED flag.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 9 Apr 2021 07:25:37 +0000 (03:25 -0400)]
bcachefs: Fix fsck to not use bch2_link_trans()
bch2_link_trans() uses the btree key cache for inode updates, and fsck
isn't supposed to - also, it's not really what we want for reattaching
unreachable inodes anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 12 Apr 2021 18:00:07 +0000 (14:00 -0400)]
bcachefs: Fix bch2_trans_relock()
The patch that changed bch2_trans_relock() to not look at iter->uptodate
also tried to add an optimization by only having it relock
btree_iter_key() iterators (iterators that are live or have been marked
as keep). But, this wasn't thought through - this pops internal iterator
assertions because on transaction restart, when we're traversing
iterators we traverse all iterators marked as linked, and having
bch2_trans_relock() skip some of those mean that it can skil the
iterator that bch2_btree_iter_traverse_one() is currently traversing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 Apr 2021 05:55:57 +0000 (01:55 -0400)]
bcachefs: Simplify hash table checks
Very early on there was a period where we were accidentally generating
dirents with trailing garbage; we've since dropped support for
filesystems that old and the fsck code can be dropped.
Also, this patch switches to a simpler algorithm for checking hash
tables. It's less efficient on hash collision - but with 64 bit keys,
those are very rare.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 Apr 2021 01:41:48 +0000 (21:41 -0400)]
bcachefs: Check inodes at start of fsck
This splits out checking inode nlinks from the rest of the inode checks
and moves most of the inode checks to the start of fsck, so that other
fsck passes can depend on it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>