Detect buckets with missing backpointers, and run repair on demand.
__bch2_move_data_phys() now calls
bch2_check_bucket_backpointer_mismatch() as it walks buckets, which
checks for missing backpointers by comparing backpointers against bucket
sector counts.
When missing backpointers are detected, we kick off
bch2_check_extents_to_backpointers() asynchronously - right away if
we're trying to evacuate, or with a threshold if we're just running
copygc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 10 May 2025 03:28:01 +0000 (23:28 -0400)]
bcachefs: Run recovery passes asynchronously
When we request a recovery pass to be run online, i.e. not during
recovery, if it's an online pass it'll now be run in the background,
instead of waiting for the next mount.
To avoid situations where recovery passes are running continuously, this
also includes ratelimiting: if the RUN_RECOVERY_PASS_ratelimit flag is
passed, the pass may be deferred until later - depending on the runtime
and last run stats in the recovery_passes superblock section.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 13 May 2025 21:36:55 +0000 (17:36 -0400)]
bcachefs: Reduce usage of recovery.curr_pass
We want recovery.curr_pass to be private to the recovery passes code,
for better showing recovery pass status; also, it may rewind and is
generally not the correct member to use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 10 May 2025 21:45:45 +0000 (17:45 -0400)]
bcachefs: __bch2_run_recovery_passes()
Consolidate bch2_run_recovery_passes() and
bch2_run_online_recovery_passes(), prep work for automatically
scheduling and running recovery passes in the background.
- Now takes a mask of which passes to run, automatic background repair
will pass in sb.recovery_passes_required.
- Skips passes that are failing: a pass that failed may be reattempted
after another pass succeeds (some passes depend on repair done by
other passes for successful completion).
- bch2_recovery_passes_match() helper to skip alloc passes on a
filesystem without alloc info.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 14 May 2025 21:58:00 +0000 (17:58 -0400)]
bcachefs: Fix opt hooks in sysfs for non sb option
We weren't checking if the option changed for non-superblock options -
this led to rebalance not waking up when enabling the
"rebalance_enabled" option.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 13 May 2025 17:49:51 +0000 (13:49 -0400)]
bcachefs: Add tracepoint, counter for io_move_created_rebalance
Internal moves shouldn't add new rebalance_work, but it's been reported
that this seems to be happening. Add a tracepoint and counter so we can
see what's going on.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 8 May 2025 21:01:49 +0000 (17:01 -0400)]
bcachefs: kill move_bucket_in_flight
Small cleanup/simplification, and prep work for the next patch, which
will add checking if buckets don't get evacuated because they're missing
backpointers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 13 May 2025 14:53:23 +0000 (10:53 -0400)]
bcachefs: bch2_fs_emergency_read_only2()
More error message cleanup: instead of multiple printk()s per error, we
want to be building up a single error message in a printbuf, so that it
can be printed with indenting that shows grouping and avoid errors
getting interspersed or lost in the log.
This gets rid of most calls to bch2_fs_emergency_read_only(). We still
have calls to
- bch2_fatal_error()
- bch2_fs_fatal_error()
- bch2_fs_fatal_err_on()
that need work.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 7 May 2025 18:26:18 +0000 (14:26 -0400)]
bcachefs: Knob for manual snapshot deletion
Add 'opts.snapshot_deletion_enabled', enabled by default.
This may be turned off so that the new sysfs knob,
'internal/trigger_delete_dead_snapshots', may be used instead - this
will allow snapshot deletion to be profiled more easily.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fast device removal, that uses backpointers to find pointers to the
device being removed instead of a full metadata scan.
This requires BCH_SB_MEMBER_DELETED_UUID, which is an incompatible
change - hence the version number bump. We don't fully trust
backpointers, so we don't want to reuse device indexes until after a
fsck has verified that there aren't any pointers to removed devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 2 May 2025 17:23:22 +0000 (13:23 -0400)]
bcachefs: delete_dead_snapshot_keys_v2()
Since extents, dirents and xattrs require an inode with the
corresponding snapshot ID to exists, we can avoid a lot of scanning by
only scanning those trees for keys to process if the correspending inode
exists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're going to be speeding up snapshot deletion, by only having it
process the extents/dirents/xattrs btrees if an inode of a given
snapshot ID was present.
This raises the possibility of 'bkey_in_missing_snapshot' errors popping
up, if we ever accidentally don't do the corresponding inode update, or
if the new algorithm has bugs.
So instead of deleting snapshot IDs, add a new deleted flag, so that
'key in missing snapshot' errors can more definitively tell what
happened and automatically repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're going to be speeding up snapshot deletion, by only having it
process the extents/dirents/xattrs btrees if an inode of a given
snapshot ID was present.
This raises the possibility of 'bkey_in_missing_snapshot' errors popping
up, if we ever accidentally don't do the corresponding inode update, or
if the new algorithm has bugs.
So we'll want to be able to differentiate more definitively between
'snapshot went missing' (and perhaps needs to be reconstructed), and
'key in snapshot that was deleted'.
So instead of deleting snapshot IDs, we'll be adding a new deleted flag
and leaving them permanently.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're going to be doing some snapshot deletion performance improvements,
and those will strictly require that if an extent/dirent/xattr is
present, an inode is present in that snapshot ID.
We already check for this, but we don't repair it on disk: this patch
adds that repair and turns it into a real fsck_err().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 3 May 2025 20:48:00 +0000 (16:48 -0400)]
bcachefs: get_inodes_all_snapshots() now includes whiteouts
The next patch is going to change lookup_inode_for_snapshot to
rigorously require that a extent/dirent/xattr keys have a corresponding
inode key present - whiteouts included, so this simplifies the checks
lookup_inode_for_snapshot() will have to do.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 1 May 2025 02:05:49 +0000 (22:05 -0400)]
bcachefs: Fix setting ca->name in device add
Device add doesn't get the devide index and attach to the filesystem
until after attaching the block device, and setting the device name from
the block device name - these needs some minor tweaks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Refactor a couple of structs that contain flexible arrays in the
middle by replacing them with unions.
So, with these changes, fix the following warnings:
fs/bcachefs/disk_accounting.c:429:51: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
fs/bcachefs/ec_types.h:8:41: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 26 Apr 2025 16:39:17 +0000 (12:39 -0400)]
bcachefs: Run most explicit recovery passes persistent
If we detect an error that requires running a recovery pass, and we're
not in recovery, we won't be able to fix it until the next mount - make
sure we're noting in the superblock that it needs to run.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 22 Apr 2025 13:14:19 +0000 (09:14 -0400)]
bcachefs: Single err message for btree node reads
Like we just did with the data read path, emit a single error message
per btree node reads, nicely formatted, with all the actions we took
grouped together.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 24 Apr 2025 13:27:10 +0000 (09:27 -0400)]
bcachefs: Plumb printbuf through bch2_btree_lost_data()
Part of the ongoing project to improve error messages by building them
up in printbufs and emitting them all at once, so that we can easily see
what events are related in the log.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 22 Apr 2025 09:49:20 +0000 (05:49 -0400)]
bcachefs: Emit a single log message on data read error
Instead of emitting a message immediately when we get an error in the
read path, and then another at the end if we successfully retry - emit
one single log message before returning from bch2_rbio_retry().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 21 Apr 2025 17:02:51 +0000 (13:02 -0400)]
bcachefs: Make various async objs visible in debugfs
Add async objs list for
- promote_op
- bch_read_bio
- btree_read_bio
- btree_write_bio
This gets us introspection on in-flight async ops, and because under the
hood it uses fast_lists (percpu slot buffer on top of a radix tree),
it'll be fast enough to enable in production.
This will be very helpful for debugging "something got stuck" issues,
which have been cropping up from time to time (in the CI, especially
with folio writeback).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 28 Sep 2024 20:22:38 +0000 (16:22 -0400)]
bcachefs: fast_list
A fast "list" data structure, which is actually a radix tree, with an
IDA for slot allocation and a percpu buffer on top of that.
Items cannot be added or moved to the head or tail, only added at some
(arbitrary) position and removed. The advantage is that adding, removing
and iteration is generally lockless, only hitting the lock in ida when
the percpu buffer is full or empty.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>