www.infradead.org Git - users/willy/pagecache.git/log
2 months ago bcachefs: Improve trace_move_extent_finish
Kent Overstreet [Sun, 26 Jan 2025 01:08:26 +0000 (20:08 -0500)]
bcachefs: Improve trace_move_extent_finish

We're currently debugging issues with rebalance, where it's not making
progress as quickly as it should be (or sometimes not at all).

Add the full data_update to the move_extent_finish tracepoint, so we can
check that the replicas we wrote match what we were supposed to do.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 months ago bcachefs: Fix trace_copygc
Kent Overstreet [Fri, 17 Jan 2025 17:51:51 +0000 (12:51 -0500)]
bcachefs: Fix trace_copygc

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 months ago bcachefs: Journal writes are now IOPRIO_CLASS_RT
Kent Overstreet [Sat, 25 Jan 2025 02:29:24 +0000 (21:29 -0500)]
bcachefs: Journal writes are now IOPRIO_CLASS_RT

System performance is particularly sensitive to journal write latency:
the number of outstanding journal writes is bounded, and we can't issue
journal flushes until other journal writes have completed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 months ago bcachefs: Improve journal pin flushing
Kent Overstreet [Sun, 26 Jan 2025 00:22:50 +0000 (19:22 -0500)]
bcachefs: Improve journal pin flushing

Running the preempt tiering tests with a lower than normal journal
reclaim delay turned up a shutdown hang - a lost wakeup, caused because
flushing a journal pin (e.g. key cache/write buffer) can generate a new
journal pin.

The "simple" fix of adding the correct wakeup didn't work because of
ordering issues; if we flush btree node pins too aggressively before
other pins have completed, we end up spinning where each flush iteration
generates new work.

So to fix this correctly:
- The list of flushed journal pins is now broken out by type, so that
  we can wait for key cache/write buffer pin flushing to complete
  before flushing dirty btree nodes

- A new closure_waitlist is added for bch2_journal_flush_pins; this one
  is only used under or when we're taking the journal lock, so it's
  pretty cheap to add rigorously correct wakeups to journal_pin_set()
  and journal_pin_drop().

Additionally, bch2_journal_seq_pins_to_text() is moved to
journal_reclaim.c, where it belongs, along with a bit of other small
renaming and refactoring.

Besides fixing the hang, the better ordering between key cache/write
buffer flushing and btree node flushing should help or fix the "unmount
taking excessively long" problem a few users have been noticing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 months ago bcachefs: fix bch2_btree_node_flags
Kent Overstreet [Sat, 25 Jan 2025 23:09:32 +0000 (18:09 -0500)]
bcachefs: fix bch2_btree_node_flags

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 months ago bcachefs: rebalance, copygc enabled are runtime opts
Kent Overstreet [Sat, 25 Jan 2025 22:19:38 +0000 (17:19 -0500)]
bcachefs: rebalance, copygc enabled are runtime opts

Fix a regression from when these were switched to normal opts.h options.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 months ago bcachefs: Improve decompression error messages
Kent Overstreet [Fri, 24 Jan 2025 14:23:02 +0000 (09:23 -0500)]
bcachefs: Improve decompression error messages

Ratelimit them, and use the new bch2_write_op_error() helper that prints
path and file offset.

Reported-by: https://github.com/koverstreet/bcachefs/issues/819
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bset_blacklisted_journal_seq is now AUTOFIX
Kent Overstreet [Wed, 22 Jan 2025 04:03:08 +0000 (23:03 -0500)]
bcachefs: bset_blacklisted_journal_seq is now AUTOFIX

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: "Journal stuck" timeout now takes into account device latency
Kent Overstreet [Tue, 21 Jan 2025 22:42:25 +0000 (17:42 -0500)]
bcachefs: "Journal stuck" timeout now takes into account device latency

If a block device (e.g. your typical consumer SSD) is taking multiple
seconds for IOs (typically flushes), we don't want to emit the "journal
stuck" message prematurely.

Also, make sure to drop the btree_trans srcu lock if we're blocking for
more than a second.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Reduce stack frame size of __bch2_str_hash_check_key()
Kent Overstreet [Tue, 21 Jan 2025 17:56:00 +0000 (12:56 -0500)]
bcachefs: Reduce stack frame size of __bch2_str_hash_check_key()

We don't need all the helpers inlined here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Fix btree_trans_peek_key_cache()
Kent Overstreet [Tue, 21 Jan 2025 07:26:13 +0000 (02:26 -0500)]
bcachefs: Fix btree_trans_peek_key_cache()

BTREE_ITER_cached_nofill has some tricky corner cases; it's used
internally for iterators that aren't walking the key cache, but need to
be coherent with the key cache.

It tells traverse to look up and lock the key cache entry if present,
but not to create one if it doesn't exist.

That means we have to have a BTREE_ITER_UPTODATE path (because after
traverse the path has to be UPTODATE, or we pop assertions) that doesn't
point to anything (which is the less bad option, taken by the previous
fix).

The previous fix for this path missed an issue that can happen in
bch2_trans_peek_key_cache(): we can't set should_be_locked on a path
that doesn't point to anything and doesn't hold locks.

Fixes: bd5b09727f3d ("bcachefs: Don't set btree_path to updtodate if we don't fill")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Fix check_inode_hash_info_matches_root()
Kent Overstreet [Wed, 15 Jan 2025 17:17:28 +0000 (12:17 -0500)]
bcachefs: Fix check_inode_hash_info_matches_root()

Can't use memcmp() when the struct contains padding.
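
A minimal sketch of the underlying problem, not the actual bcachefs code: the struct `hash_info_example` and its members are hypothetical. Padding bytes between members have unspecified values, so two structs whose members are all equal may still differ byte for byte. Comparing member by member avoids looking at the padding at all.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical struct: compilers typically insert padding after 'type'
 * so that 'seed' is suitably aligned. */
struct hash_info_example {
	uint8_t  type;
	uint64_t seed;
};

/* Compare only the named members; padding bytes are never inspected. */
static inline bool hash_info_example_eq(const struct hash_info_example *a,
					const struct hash_info_example *b)
{
	return a->type == b->type && a->seed == b->seed;
}
```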

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Document issue with bch_stripe layout
Kent Overstreet [Mon, 13 Jan 2025 20:41:50 +0000 (15:41 -0500)]
bcachefs: Document issue with bch_stripe layout

We've got a problem with bch_stripe that is going to take an on disk
format rev to fix - we can't access the block sector counts if the
checksum type is unknown.

Document it for now; there are a few other things to fix as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Fix self healing on read error
Kent Overstreet [Tue, 31 Dec 2024 23:42:48 +0000 (18:42 -0500)]
bcachefs: Fix self healing on read error

We were incorrectly checking if there'd been an io error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Pop all the transactions from the abort one
Alan Huang [Wed, 2 Oct 2024 19:06:33 +0000 (03:06 +0800)]
bcachefs: Pop all the transactions from the abort one

The transaction is going to abort, so there will be no cycle involving
this transaction anymore.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Only abort the transactions in the cycle
Alan Huang [Wed, 2 Oct 2024 19:06:32 +0000 (03:06 +0800)]
bcachefs: Only abort the transactions in the cycle

When the cycle doesn't involve the initiator of the cycle detection,
we might choose a transaction that is not involved in the cycle to abort.
Aborting such a transaction won't break the cycle, so this patch
chooses a transaction that is in the cycle to abort instead.
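
The idea can be sketched as follows. This is a simplified model, not the bcachefs cycle detector: the array representation, the function names, and the "largest id" victim policy are all assumptions made for illustration. The point is only that the victim is picked from the entries that form the cycle, never from the entries that merely lead up to it.

```c
#include <stddef.h>

/* Hypothetical model: path[] holds transaction ids in the order the
 * detector followed "waits for" edges, starting at the initiator. */

/* Return the index where the cycle starts if 'next' is already on the
 * path, or -1 if following 'next' does not close a cycle. */
static inline int cycle_start(const int *path, size_t nr, int next)
{
	for (size_t i = 0; i < nr; i++)
		if (path[i] == next)
			return (int) i;
	return -1;
}

/* Pick a victim among path[start..nr-1] only (here, arbitrarily, the
 * largest id). Entries before 'start' are not part of the cycle. */
static inline int pick_victim(const int *path, size_t nr, size_t start)
{
	int victim = path[start];

	for (size_t i = start + 1; i < nr; i++)
		if (path[i] > victim)
			victim = path[i];
	return victim;
}
```

With the path 10, 3, 7, 5 and a new edge back to 7, the cycle is 7, 5; the initiator 10 is outside it and is never chosen.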

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Introduce lock_graph_pop_from
Alan Huang [Wed, 2 Oct 2024 19:06:31 +0000 (03:06 +0800)]
bcachefs: Introduce lock_graph_pop_from

This patch introduces a helper function called lock_graph_pop_from,
which pops the graph from i.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Convert open-coded lock_graph_pop_all to helper
Alan Huang [Wed, 2 Oct 2024 19:06:30 +0000 (03:06 +0800)]
bcachefs: Convert open-coded lock_graph_pop_all to helper

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Do not allow no fail lock request to fail
Alan Huang [Thu, 10 Oct 2024 13:21:50 +0000 (21:21 +0800)]
bcachefs: Do not allow no fail lock request to fail

If the transaction chose itself as a victim before and restarted, it
might issue a no fail lock request this time. But it might be added to
others' lock graphs and be chosen as the victim again, which is no longer
safe without an additional check. We could also convert the cycle detector
to be fully RCU-based to solve that unsoundness, but the latency added to
trans_put and the additional memory required may not be worth it.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Merge the condition to avoid additional invocation
Alan Huang [Wed, 25 Sep 2024 16:45:00 +0000 (00:45 +0800)]
bcachefs: Merge the condition to avoid additional invocation

If the lock has been acquired and unlocked, we don't have to do the clear
and wakeup again, though doing so is harmless since we hold the intent
lock. Merging the conditions might be clearer.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago Revert "bcachefs: Fix bch2_btree_node_upgrade()"
Alan Huang [Wed, 25 Sep 2024 16:46:02 +0000 (00:46 +0800)]
Revert "bcachefs: Fix bch2_btree_node_upgrade()"

This reverts commit 62448afee714354a26db8a0f3c644f58628f0792.

six_lock_tryupgrade fails only if there is an intent lock held,
it won't fail no matter how many read locks are held.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bcachefs_metadata_version_directory_size
Hongbo Li [Tue, 7 Jan 2025 13:18:41 +0000 (13:18 +0000)]
bcachefs: bcachefs_metadata_version_directory_size

This adds another metadata version for accounting directory size.
On filesystems with the new version, when subdirectory items
are created or deleted, the parent directory's size will change
accordingly. For existing filesystems on the old version, running
fsck will automatically upgrade the metadata version, and it will
check and recalculate the directory sizes.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: make directory i_size meaningful
Hongbo Li [Tue, 7 Jan 2025 13:18:40 +0000 (13:18 +0000)]
bcachefs: make directory i_size meaningful

In bcachefs, the i_size of a directory is 0 if the directory is empty.
As more child dirents are created, its size ought to change. Many
other filesystems already do this (e.g. xfs and btrfs), and many of
them base the change on the size of the child dirent name. Although the
directory size may not seem to convey much, we can still give it some
meaning.

The formula for the size a dirent occupies is as follows:
    occupied_size = 40 + ALIGN(9 + namelen, 8)
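
The formula above can be sketched in plain C. The macro `ALIGN_UP` and the function name `dirent_occupied_size` are names made up for this illustration; `ALIGN_UP` is intended to round up the same way the kernel's ALIGN() does for a power-of-two alignment.

```c
#include <stddef.h>

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((size_t)(a) - 1))

/* occupied_size = 40 + ALIGN(9 + namelen, 8), as given in the commit message. */
static inline size_t dirent_occupied_size(size_t namelen)
{
	return 40 + ALIGN_UP(9 + namelen, 8);
}
```

For example, a one-character name gives 40 + 16 = 56, and an eight-character name gives 40 + 24 = 64.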

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: check_unreachable_inodes is not actually PASS_ONLINE yet
Kent Overstreet [Sat, 4 Jan 2025 17:10:25 +0000 (12:10 -0500)]
bcachefs: check_unreachable_inodes is not actually PASS_ONLINE yet

check_unreachable_inodes does work in online mode, with the one caveat
that it assumes check_dirents has also run - and check_dirents is not
PASS_ONLINE yet.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Don't use BTREE_ITER_cached when walking alloc btree during fsck
Kent Overstreet [Sat, 4 Jan 2025 17:09:52 +0000 (12:09 -0500)]
bcachefs: Don't use BTREE_ITER_cached when walking alloc btree during fsck

No need to pull the whole alloc btree into the btree key cache.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Check for dirents to overwritten inodes
Kent Overstreet [Tue, 31 Dec 2024 20:59:02 +0000 (15:59 -0500)]
bcachefs: Check for dirents to overwritten inodes

This fixes various "dirent to missing inode" errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bch2_btree_iter_peek_slot() handles navigating to nonexistent depth
Kent Overstreet [Sun, 29 Dec 2024 14:37:15 +0000 (09:37 -0500)]
bcachefs: bch2_btree_iter_peek_slot() handles navigating to nonexistent depth

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Don't set btree_path to updtodate if we don't fill
Kent Overstreet [Tue, 31 Dec 2024 17:58:23 +0000 (12:58 -0500)]
bcachefs: Don't set btree_path to updtodate if we don't fill

This fixes various locking asserts, and a null ptr deref in
bch2_btree_iter_peek_path().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: __bch2_btree_pos_to_text()
Kent Overstreet [Mon, 30 Dec 2024 21:22:59 +0000 (16:22 -0500)]
bcachefs: __bch2_btree_pos_to_text()

Factor out a version of bch2_btree_pos_to_text() that doesn't take a
pointer to an in-memory btree node, to be used for btree node scrub.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: printbuf_reset() handles tabstops
Kent Overstreet [Mon, 30 Dec 2024 20:31:14 +0000 (15:31 -0500)]
bcachefs: printbuf_reset() handles tabstops

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Silence read-only errors when deleting snapshots
Kent Overstreet [Tue, 31 Dec 2024 14:55:09 +0000 (09:55 -0500)]
bcachefs: Silence read-only errors when deleting snapshots

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Dropped superblock write is no longer a fatal error
Kent Overstreet [Sun, 29 Dec 2024 00:57:04 +0000 (19:57 -0500)]
bcachefs: Dropped superblock write is no longer a fatal error

Just emit a warning if errors=continue or fix_safe.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bch2_trans_node_drop()
Kent Overstreet [Wed, 25 Dec 2024 17:19:08 +0000 (12:19 -0500)]
bcachefs: bch2_trans_node_drop()

Factor out a small common helper.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bch2_trans_unlock_write()
Kent Overstreet [Tue, 24 Dec 2024 10:40:17 +0000 (05:40 -0500)]
bcachefs: bch2_trans_unlock_write()

New helper for dropping all write locks; which is distinct from the
helper the transaction commit path uses, which is faster and only
touches updates.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: btree_node_unlock() can now drop write locks
Kent Overstreet [Tue, 24 Dec 2024 10:57:30 +0000 (05:57 -0500)]
bcachefs: btree_node_unlock() can now drop write locks

Prep work for reworking btree node locking during interior btree
updates.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: six locks: write locks can now be held recursively
Kent Overstreet [Sat, 21 Dec 2024 07:33:53 +0000 (02:33 -0500)]
bcachefs: six locks: write locks can now be held recursively

This is needed for the interior update locking rework, where we'll be
holding node write locks for the duration of the update - which is
needed for synchronizing with online check_allocations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bch2_fs_btree_gc_init()
Kent Overstreet [Wed, 25 Dec 2024 11:32:41 +0000 (06:32 -0500)]
bcachefs: bch2_fs_btree_gc_init()

Now returns errors, prep work for check_allocations_done_lock

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Assert that btree write buffer only touches the right btrees
Kent Overstreet [Tue, 24 Dec 2024 21:57:24 +0000 (16:57 -0500)]
bcachefs: Assert that btree write buffer only touches the right btrees

More asserts, more better.

Also, clean up the per-btree flags a bit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bch2_inum_path() now crosses subvolumes correctly
Kent Overstreet [Tue, 24 Dec 2024 10:16:56 +0000 (05:16 -0500)]
bcachefs: bch2_inum_path() now crosses subvolumes correctly

The dirent that points to a subvolume root is in the parent subvolume.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bch2_inum_path() no longer returns an error for disconnected inums
Kent Overstreet [Tue, 24 Dec 2024 10:11:46 +0000 (05:11 -0500)]
bcachefs: bch2_inum_path() no longer returns an error for disconnected inums

bch2_inum_path() should work even if the filesystem is corrupted - we
don't want it to cause fsck to fail.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: btree_path_very_locks(): verify lock seq
Kent Overstreet [Sat, 21 Dec 2024 07:55:03 +0000 (02:55 -0500)]
bcachefs: btree_path_very_locks(): verify lock seq

If the btree_path's lock seq is wrong, the next bch2_trans_relock()
operation is guaranteed to fail and we take an unnecessary transaction
restart.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: fix bch2_btree_key_cache_drop()
Kent Overstreet [Sat, 21 Dec 2024 09:14:28 +0000 (04:14 -0500)]
bcachefs: fix bch2_btree_key_cache_drop()

When evicting, we shouldn't leave a pointer to the key cache entry lying
around - that screws up btree path asserts we're adding.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bch2_btree_node_write_trans()
Kent Overstreet [Sat, 21 Dec 2024 08:31:00 +0000 (03:31 -0500)]
bcachefs: bch2_btree_node_write_trans()

Avoid screwing up path->lock_seq.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Fixes for snapshot_tree.master_subvol
Kent Overstreet [Fri, 20 Dec 2024 09:46:00 +0000 (04:46 -0500)]
bcachefs: Fixes for snapshot_tree.master_subvol

Ensure that snapshot_tree.master_subvol is cleared when we delete the
master subvolume in a tree of snapshots, and allow for snapshot trees
that don't have a master subvolume in fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Don't rely on snapshot_tree.master_subvol for reattaching
Kent Overstreet [Sat, 21 Dec 2024 04:56:42 +0000 (23:56 -0500)]
bcachefs: Don't rely on snapshot_tree.master_subvol for reattaching

Previously, fsck used the snapshot tree's master subvol for finding the
root inode number - but the master subvol might have been deleted, and
setting a new one should be a user operation; meaning we can't rely on
it existing.

Fortunately, for finding the root inode number in a tree of snapshots,
finding any associated subvolume works.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bch2_kvmalloc()
Kent Overstreet [Fri, 20 Dec 2024 10:20:01 +0000 (05:20 -0500)]
bcachefs: bch2_kvmalloc()

Add a version of kvmalloc() that doesn't have the INT_MAX limit; large
filesystems do hit this.

We'll want to get rid of the in-memory bucket gens array, but we're not
there quite yet.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Fix assert for online fsck
Kent Overstreet [Mon, 16 Dec 2024 21:41:25 +0000 (16:41 -0500)]
bcachefs: Fix assert for online fsck

We can't check if we're racing with fsck ending until mark_lock is held.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Handle -BCH_ERR_need_mark_replicas in gc
Kent Overstreet [Mon, 16 Dec 2024 18:58:02 +0000 (13:58 -0500)]
bcachefs: Handle -BCH_ERR_need_mark_replicas in gc

Locking considerations (possibly no longer relevant?) mean that when an
accounting update needs a new superblock replicas entry to be created,
it's deferred to the transaction commit error path.

But accounting updates for gc/fsck aren't done from the transaction
commit path - so we need to handle
-BCH_ERR_btree_insert_need_mark_replicas locally.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Write lock btree node in key cache fills
Kent Overstreet [Sat, 8 Jun 2024 21:01:31 +0000 (17:01 -0400)]
bcachefs: Write lock btree node in key cache fills

This addresses a key cache coherency bug.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: kill __bch2_btree_iter_flags()
Kent Overstreet [Sun, 15 Dec 2024 07:24:30 +0000 (02:24 -0500)]
bcachefs: kill __bch2_btree_iter_flags()

bch2_btree_iter_flags() now takes a level parameter; this fixes a bug
where using a node iterator on a leaf wouldn't set
BTREE_ITER_with_key_cache, leading to fun cache coherency bugs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Drop redundant "read error" call from btree_gc
Kent Overstreet [Sun, 15 Dec 2024 07:03:11 +0000 (02:03 -0500)]
bcachefs: Drop redundant "read error" call from btree_gc

The btree node read error path already calls topology error, so this is
entirely redundant, and we're not specific enough about our error codes
- this was triggering for bucket_ref_update() errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Drop racy warning
Kent Overstreet [Sun, 15 Dec 2024 06:52:54 +0000 (01:52 -0500)]
bcachefs: Drop racy warning

Checking for writing past i_size after unlocking the folio and clearing
the dirty bit is racy, and we already check it at the start.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: better check_bp_exists() error message
Kent Overstreet [Thu, 12 Dec 2024 05:55:48 +0000 (00:55 -0500)]
bcachefs: better check_bp_exists() error message

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: add counter_flags for counters
Hongbo Li [Tue, 12 Nov 2024 08:15:47 +0000 (16:15 +0800)]
bcachefs: add counter_flags for counters

In bcachefs, the io_read and io_write counters record the amount
of data which has been read and written. They increase in
units of sectors, so to display correctly, they need to be
shifted to the left by the sector shift (that is, multiplied by the
size of a sector). Other counters like io_move and
move_extent_{read, write, finish} also have this problem.

In order to support different units, we add an extra column to
mark the counter type by using TYPE_COUNTER and TYPE_SECTORS
in BCH_PERSISTENT_COUNTERS().
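
A minimal sketch of what the type column enables, assuming 512-byte sectors (a shift of 9). Only the names TYPE_COUNTER and TYPE_SECTORS come from the commit message; the enum, the macro `SECTOR_SHIFT_EXAMPLE` and the function `counter_display_value` are hypothetical and do not reflect how BCH_PERSISTENT_COUNTERS() is actually laid out.

```c
#include <stdint.h>

/* Assumption: 512-byte sectors, i.e. a shift of 9. */
#define SECTOR_SHIFT_EXAMPLE	9

enum counter_type_example {
	TYPE_COUNTER,	/* plain event count, shown as-is */
	TYPE_SECTORS,	/* counted in sectors, shown in bytes */
};

/* Convert a raw counter value into the value to display. */
static inline uint64_t counter_display_value(enum counter_type_example type,
					     uint64_t raw)
{
	return type == TYPE_SECTORS ? raw << SECTOR_SHIFT_EXAMPLE : raw;
}
```

Under this assumption, a sector counter holding 8 is displayed as 4096 bytes, while an event counter holding 8 is displayed as 8.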

Fixes: 1c6fdbd8f246 ("bcachefs: Initial commit")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bcachefs_metadata_version_autofix_errors
Kent Overstreet [Tue, 10 Dec 2024 19:19:30 +0000 (14:19 -0500)]
bcachefs: bcachefs_metadata_version_autofix_errors

It's time to make self healing the default: change the error action for
old filesystems to fix_safe, matching the default for current
filesystems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bcachefs_metadata_version_persistent_inode_cursors
Kent Overstreet [Mon, 2 Dec 2024 02:44:38 +0000 (21:44 -0500)]
bcachefs: bcachefs_metadata_version_persistent_inode_cursors

Persistent cursors for inode allocation.

A free inodes btree would add substantial overhead to inode allocation
and freeing - a "next num to allocate" cursor is always going to be
faster.

We just need it to be persistent, to avoid scanning the inodes btree
from the start on startup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bcachefs_metadata_version_inode_depth
Kent Overstreet [Thu, 3 Aug 2023 00:27:38 +0000 (20:27 -0400)]
bcachefs: bcachefs_metadata_version_inode_depth

This adds a new inode field, bi_depth, for directory inodes: this allows
us to make the check_directory_structure pass much more efficient.

Currently, to ensure the filesystem is fully connected and has no loops,
for every directory we follow backpointers until we find the root. But
by adding a depth counter, it suffices to only check the parent of each
directory, and check that the parent's bi_depth is smaller.

(fsck doesn't require that bi_depth = parent->bi_depth + 1; if a rename
causes bi_depth to be off, but the chain to the root is still strictly
decreasing, then the algorithm still works and there's no need for fsck
to fixup the bi_depth fields).

We've already checked backpointers, so we know that every directory
(excluding the root) has a valid parent: if bi_depth is always
decreasing, every chain must terminate, and terminate at the root
directory.

bi_depth will not necessarily be correct when fsck runs, due to
directory renames - we can't change bi_depth on every child directory
when renaming a directory. That's ok; fsck will silently fix the
bi_depth field as needed, and future fsck runs will be much faster.
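
The check described above can be sketched on a toy in-memory model. Only the field name bi_depth comes from the commit message; the struct `dir_example`, the array layout with the root at index 0, and the function name are assumptions for illustration, not the fsck implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model of a directory inode. */
struct dir_example {
	size_t   parent;	/* index of the parent directory */
	uint32_t bi_depth;
};

/* dirs[0] is the root. Returns true if every other directory has a valid
 * parent with a strictly smaller bi_depth: one lookup per directory,
 * no walk to the root. */
static inline bool depths_strictly_decrease(const struct dir_example *dirs,
					    size_t nr)
{
	for (size_t i = 1; i < nr; i++) {
		size_t p = dirs[i].parent;

		if (p >= nr || dirs[p].bi_depth >= dirs[i].bi_depth)
			return false;
	}
	return true;
}
```

Two directories that name each other as parent cannot both pass, since each would need a smaller depth than the other; that is why a loop is always caught.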

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Option changes now get propagated to reflinked data
Kent Overstreet [Sun, 20 Oct 2024 06:12:21 +0000 (02:12 -0400)]
bcachefs: Option changes now get propagated to reflinked data

Now that bch2_move_get_io_opts() re-propagates changed inode io options
to bch_extent_rebalance, we can properly support changing IO path options
for reflinked data.

Changing a per-file IO path option, either via the xattr interface or
via the BCHFS_IOC_REINHERIT_ATTRS ioctl, will now trigger a scan (the
inode number is marked as needing a scan, via
bch2_set_rebalance_needs_scan()), and rebalance will use
bch2_move_data(), which will walk the inode number and pick up the new
options.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bcachefs_metadata_version_reflink_p_may_update_opts
Kent Overstreet [Thu, 7 Nov 2024 04:16:24 +0000 (23:16 -0500)]
bcachefs: bcachefs_metadata_version_reflink_p_may_update_opts

Previously, io path option changes on a file would be picked up
automatically and applied to existing data - but not for reflinked data,
as we had no way of doing this safely. A user may have had permission to
copy (and reflink) a given file, but not write to it, and if so they
shouldn't be allowed to change e.g. nr_replicas or other options.

This uses the incompat feature mechanism in the previous patch to add a
new incompatible flag to bch_reflink_p, indicating whether a given
reflink pointer may propagate io path option changes back to the
indirect extent.

In this initial patch we're only setting it for the source extents.

We'd like to set it for the destination in a reflink copy, when the user
has write access to the source, but that requires mnt_idmap which is not
currently plumbed up to remap_file_range.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: BCH_SB_VERSION_INCOMPAT
Kent Overstreet [Tue, 12 Nov 2024 02:50:29 +0000 (21:50 -0500)]
bcachefs: BCH_SB_VERSION_INCOMPAT

We've been getting away from feature bits: they don't have any kind of
ordering, and thus it's possible for people to enable weird combinations
of features that were never tested or intended to be run.

Much better to just give every new feature, compatible or incompatible,
a version number.

Additionally, we probably won't ever rev the major version number: major
version numbers represent incompatible versions, but that doesn't really
fit with how we actually roll out incompatible features - we need a
better way of rolling out incompatible features.

So, this patch adds two new superblock fields:
- BCH_SB_VERSION_INCOMPAT
- BCH_SB_VERSION_INCOMPAT_ALLOWED

BCH_SB_VERSION_INCOMPAT_ALLOWED indicates that incompatible features up
to version number x are allowed to be used without user prompting, but
it does not by itself deny old versions from mounting.

BCH_SB_VERSION_INCOMPAT does deny old versions from mounting, and must
be <= BCH_SB_VERSION_INCOMPAT_ALLOWED.

BCH_SB_VERSION_INCOMPAT will only be set when a codepath attempts to use
an incompatible feature, so as to not unnecessarily break compatibility
with old versions.

bch2_request_incompat_feature() is the new interface to check if an
incompatible feature may be used.
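
The rules above can be sketched as a small model. This is not the real bch2_request_incompat_feature(): the struct `sb_example`, both function names and their signatures are hypothetical, and versions are reduced to plain integers. It only illustrates how the two fields interact.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two new superblock fields. */
struct sb_example {
	unsigned version_incompat;		/* versions older than this can't mount */
	unsigned version_incompat_allowed;	/* upper bound for version_incompat */
};

/* Ask to use an incompatible feature introduced in 'version'. */
static inline bool request_incompat_feature(struct sb_example *sb,
					    unsigned version)
{
	if (version > sb->version_incompat_allowed)
		return false;	/* not allowed without user prompting */
	if (version > sb->version_incompat)
		sb->version_incompat = version;	/* raised only on first use */
	return true;
}

/* An implementation that understands versions up to 'supported' may mount
 * only if no newer incompatible feature is in use. */
static inline bool can_mount(const struct sb_example *sb, unsigned supported)
{
	return supported >= sb->version_incompat;
}
```

Raising the allowed version alone changes nothing for old implementations; they are only locked out once a code path actually requests the feature.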

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Only run check_backpointers_to_extents in debug mode
Kent Overstreet [Fri, 15 Nov 2024 01:47:32 +0000 (20:47 -0500)]
bcachefs: Only run check_backpointers_to_extents in debug mode

The backpointers passes, check_backpointers_to_extents() and
check_extents_to_backpointers() are the most expensive fsck passes.

Now that we're running the same check and repair code when using a
backpointer at runtime (via bch2_backpointer_get_key()) that fsck does,
there's no reason fsck needs to - except to verify that the filesystem
really has no errors in debug mode.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: better backpointer_target_not_found() error message
Kent Overstreet [Tue, 10 Dec 2024 19:04:39 +0000 (14:04 -0500)]
bcachefs: better backpointer_target_not_found() error message

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bch2_backpointer_get_key() now repairs dangling backpointers
Kent Overstreet [Tue, 12 Nov 2024 08:46:31 +0000 (03:46 -0500)]
bcachefs: bch2_backpointer_get_key() now repairs dangling backpointers

Continuing on with the self healing theme, we should be running any
check and repair code at runtime that we can - instead of declaring the
filesystem inconsistent.

This will also let us skip running the backpointers -> extents fsck pass
except in debug mode.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: check_extents_to_backpointers() now only checks buckets with mismatches
Kent Overstreet [Fri, 15 Nov 2024 21:31:54 +0000 (16:31 -0500)]
bcachefs: check_extents_to_backpointers() now only checks buckets with mismatches

Instead of walking every extent and every backpointer it points to,
first sum up backpointers in each bucket and check for mismatches, and
only look for missing backpointers if mismatches were detected, and only
check extents in those buckets.

This is a major fsck scalability improvement, since the two backpointers
passes (backpointers -> extents and extents -> backpointers) are the
most expensive fsck passes by far.

Additionally, to speed up the upgrade for backpointer bucket gens, or in
situations when we have to rebuild alloc info, add a special case for
when no backpointers are found in a bucket - don't check each individual
backpointer (in particular, avoiding the write buffer flushes), just
recreate them.
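
A minimal sketch of the first phase, under stated assumptions: the per-bucket sector counts and the per-bucket backpointer sums are given as two plain arrays, and the function name is made up. The real pass works on btrees, not arrays; this only shows why most buckets can be skipped.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inputs: for each bucket, the sector count recorded for the
 * bucket and the sum of sectors over the backpointers pointing into it.
 * Writes the indices of mismatching buckets to out[] (capacity nr) and
 * returns how many there were; only those need the expensive extent check. */
static inline size_t find_mismatched_buckets(const uint32_t *bucket_sectors,
					     const uint32_t *bp_sectors,
					     size_t nr, size_t *out)
{
	size_t nr_mismatched = 0;

	for (size_t i = 0; i < nr; i++)
		if (bucket_sectors[i] != bp_sectors[i])
			out[nr_mismatched++] = i;
	return nr_mismatched;
}
```

On a healthy filesystem the result is empty, so the per-extent walk is skipped entirely.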

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: Add write buffer flush param to backpointer_get_key()
Kent Overstreet [Fri, 15 Nov 2024 03:13:29 +0000 (22:13 -0500)]
bcachefs: Add write buffer flush param to backpointer_get_key()

In an upcoming patch bch2_backpointer_get_key() will be repairing when
it finds a dangling backpointer; it will need to flush the btree write
buffer before it can definitively say there's an error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: kill __bch2_extent_ptr_to_bp()
Kent Overstreet [Sun, 17 Nov 2024 23:37:41 +0000 (18:37 -0500)]
bcachefs: kill __bch2_extent_ptr_to_bp()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bch2_extent_ptr_to_bp() no longer depends on device
Kent Overstreet [Mon, 18 Nov 2024 04:58:21 +0000 (23:58 -0500)]
bcachefs: bch2_extent_ptr_to_bp() no longer depends on device

bch_backpointer no longer contains the bucket_offset field, it's just a
direct LBA mapping (with low bits to account for compressed extent
splitting), so we don't need to refer to the device to construct it
anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months ago bcachefs: bcachefs_metadata_version_disk_accounting_big_endian
Kent Overstreet [Fri, 29 Nov 2024 22:41:43 +0000 (17:41 -0500)]
bcachefs: bcachefs_metadata_version_disk_accounting_big_endian

Fix sort order for disk accounting keys, in order to fix a regression on
mount times.

The typetag is now the most significant byte of the key, meaning disk
accounting keys of the same type now sort together.

This lets us skip over disk accounting keys that aren't mirrored in
memory when reading accounting at startup, instead of having them
interleaved with other counter types.
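
As a rough illustration of why the byte order matters, the sketch below packs
a type tag into the most significant byte of a 64-bit key and sorts
numerically. The packing, the helper names (acct_key_pack() and friends) and
the type values are hypothetical and are not the bcachefs on-disk layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical packing: type tag in the top byte, 56-bit payload below it. */
static uint64_t acct_key_pack(uint8_t type, uint64_t payload)
{
	return ((uint64_t) type << 56) | (payload & ((UINT64_C(1) << 56) - 1));
}

static uint8_t acct_key_type(uint64_t key)
{
	return (uint8_t) (key >> 56);
}

static int acct_key_cmp(const void *l, const void *r)
{
	uint64_t a = *(const uint64_t *) l;
	uint64_t b = *(const uint64_t *) r;

	return (a > b) - (a < b);
}

/* Plain numeric sort; keys of the same type end up contiguous. */
static void acct_keys_sort(uint64_t *keys, size_t nr)
{
	qsort(keys, nr, sizeof(keys[0]), acct_key_cmp);
}

/* Number of runs of equal type in the array: one per type when grouped. */
static size_t acct_keys_nr_runs(const uint64_t *keys, size_t nr)
{
	size_t runs = nr ? 1 : 0;

	for (size_t i = 1; i < nr; i++)
		runs += acct_key_type(keys[i]) != acct_key_type(keys[i - 1]);
	return runs;
}
```

With the tag in the low byte instead, the same sort would interleave types,
and a reader interested in one type could not skip the others as a single
range.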

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months agobcachefs: bcachefs_metadata_version_backpointer_bucket_gen
Kent Overstreet [Sun, 17 Nov 2024 04:53:07 +0000 (23:53 -0500)]
bcachefs: bcachefs_metadata_version_backpointer_bucket_gen

New on disk format version: backpointers now include the generation
number of the bucket they refer to, and the obsolete bucket_offset field
(no longer needed because we no longer store backpointers in alloc keys)
is gone.

This is an expensive forced upgrade - hopefully the last; we have to run
the extents_to_backpointers recovery pass to regenerate backpointers.

It's a forced incompatible upgrade because the alternative would've been
permanently making backpointers bigger, and as one of the biggest btrees
(along with the extents btree) that's not an ideal option.

It's worth it though, because this allows us to make the
check_extents_to_backpointers pass drastically cheaper: an upcoming
patch changes it to sum up backpointers in a bucket and check the sum
against the sector counts for that bucket, only looking for missing
backpointers if they don't match (and then only for specific buckets).
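
The sum check described above could look roughly like the following. This is
a simplified sketch only: struct bp_sketch and both helpers are hypothetical
stand-ins, and the real pass works on btree keys rather than a flat array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record; not the bcachefs on-disk struct. */
struct bp_sketch {
	uint64_t bucket;	/* bucket the backpointer points into */
	uint32_t sectors;	/* sectors covered by the backpointer */
};

/* Sum the sectors of all backpointers that refer to @bucket. */
static uint64_t bucket_bp_sectors(const struct bp_sketch *bps, size_t nr,
				  uint64_t bucket)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < nr; i++)
		if (bps[i].bucket == bucket)
			sum += bps[i].sectors;
	return sum;
}

/*
 * Cheap check: only a bucket whose backpointer sum disagrees with its
 * recorded sector count needs the expensive search for missing backpointers.
 */
static bool bucket_needs_bp_check(const struct bp_sketch *bps, size_t nr,
				  uint64_t bucket, uint64_t alloc_sectors)
{
	return bucket_bp_sectors(bps, nr, bucket) != alloc_sectors;
}
```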

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months agobcachefs: bch2_btree_path_peek_slot() doesn't return errors
Kent Overstreet [Fri, 13 Dec 2024 10:58:34 +0000 (05:58 -0500)]
bcachefs: bch2_btree_path_peek_slot() doesn't return errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months agobcachefs: trace_key_cache_fill
Kent Overstreet [Fri, 13 Dec 2024 10:43:00 +0000 (05:43 -0500)]
bcachefs: trace_key_cache_fill

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months agobcachefs: Log message in journal for snapshot deletion
Kent Overstreet [Thu, 12 Dec 2024 09:00:40 +0000 (04:00 -0500)]
bcachefs: Log message in journal for snapshot deletion

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months agobcachefs: bch2_trans_log_msg()
Kent Overstreet [Thu, 12 Dec 2024 05:44:28 +0000 (00:44 -0500)]
bcachefs: bch2_trans_log_msg()

Export a helper for logging to the journal when we're already in a
transaction context.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3 months agobcachefs: Kill snapshot_t->equiv
Kent Overstreet [Thu, 12 Dec 2024 09:03:32 +0000 (04:03 -0500)]
bcachefs: Kill snapshot_t->equiv

Now entirely dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Snapshot deletion no longer uses snapshot_t->equiv
Kent Overstreet [Thu, 12 Dec 2024 08:03:58 +0000 (03:03 -0500)]
bcachefs: Snapshot deletion no longer uses snapshot_t->equiv

Switch to generating a private list of interior nodes to delete, instead
of using the equivalence class in the global data structure.

This eliminates possible races with snapshot creation, and is much
cleaner - it'll let us delete a lot of janky code for calculating and
maintaining the equivalence classes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Kill equiv_seen arg to delete_dead_snapshots_process_key()
Kent Overstreet [Thu, 12 Dec 2024 07:41:37 +0000 (02:41 -0500)]
bcachefs: Kill equiv_seen arg to delete_dead_snapshots_process_key()

When deleting dead snapshots, we move keys from redundant interior
snapshot nodes to child nodes - unless there's already a key, in which
case the ancestor key is deleted.

Previously, we tracked via equiv_seen whether the child snapshot had a
key, but this was tricky w.r.t. transaction restarts, and not
transactionally safe w.r.t. updates in the child snapshot.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Don't run overwrite triggers before insert
Kent Overstreet [Thu, 12 Dec 2024 07:27:52 +0000 (02:27 -0500)]
bcachefs: Don't run overwrite triggers before insert

This breaks when the trigger is inserting updates for the same btree, as
the inode trigger now does.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: alloc_data_type_set() happens in alloc trigger
Kent Overstreet [Thu, 12 Dec 2024 07:32:32 +0000 (02:32 -0500)]
bcachefs: alloc_data_type_set() happens in alloc trigger

Originally, we ran insert triggers before overwrite so that if an extent
was being moved (by fallocate insert/collapse range), the bucket sector
count wouldn't hit 0 partway through, and so we don't trigger state
changes caused by that too soon.

But this is better solved by just moving the data type change to the
alloc trigger itself, where it's already called.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Fix key cache + BTREE_ITER_all_snapshots
Kent Overstreet [Fri, 13 Dec 2024 10:29:27 +0000 (05:29 -0500)]
bcachefs: Fix key cache + BTREE_ITER_all_snapshots

Normally, whiteouts (KEY_TYPE_whiteout) are filtered from btree lookups,
since they exist only to represent deletions of keys in ancestor
snapshots - except, they should not be filtered in
BTREE_ITER_all_snapshots mode, so that e.g. snapshot deletion can clean
them up.

This means that the key cache has to store whiteouts, and key cache
fills cannot filter them.
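
A minimal sketch of the rule, assuming a cache that stores keys unfiltered
and applies the filter only at lookup time. All names here (the sketch_
types and helpers) are hypothetical and are not the bcachefs key cache API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for key types. */
enum sketch_key_type { SKETCH_KEY_whiteout, SKETCH_KEY_data };

struct sketch_cached_key {
	enum sketch_key_type	type;
	int			value;
};

/* Whiteouts are hidden from normal lookups, visible in all-snapshots mode. */
static bool sketch_key_visible(enum sketch_key_type type, bool all_snapshots)
{
	return type != SKETCH_KEY_whiteout || all_snapshots;
}

/*
 * The cache slot keeps the whiteout; filtering happens per lookup, so an
 * all-snapshots lookup through the cache can still see it.
 */
static const struct sketch_cached_key *
sketch_cache_lookup(const struct sketch_cached_key *slot, bool all_snapshots)
{
	return sketch_key_visible(slot->type, all_snapshots) ? slot : NULL;
}
```

If the fill path dropped whiteouts instead, the second lookup mode would have
nothing to return.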

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Fix btree_trans_peek_key_cache() BTREE_ITER_all_snapshots
Kent Overstreet [Thu, 12 Dec 2024 07:26:15 +0000 (02:26 -0500)]
bcachefs: Fix btree_trans_peek_key_cache() BTREE_ITER_all_snapshots

In BTREE_ITER_all_snapshots mode, we're required to only return keys
where the snapshot field matches the iterator position -
BTREE_ITER_filter_snapshots requires pulling keys into the key cache
from ancestor snapshots, so we have to check for that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: tidy btree_trans_peek_journal()
Kent Overstreet [Fri, 13 Dec 2024 11:02:24 +0000 (06:02 -0500)]
bcachefs: tidy btree_trans_peek_journal()

Change to match bch2_btree_trans_peek_updates() calling convention.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: tidy up __bch2_btree_iter_peek()
Kent Overstreet [Thu, 12 Dec 2024 08:38:14 +0000 (03:38 -0500)]
bcachefs: tidy up __bch2_btree_iter_peek()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: check_indirect_extents can run online
Kent Overstreet [Mon, 9 Dec 2024 02:10:27 +0000 (21:10 -0500)]
bcachefs: check_indirect_extents can run online

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Refactor c->opts.reconstruct_alloc
Kent Overstreet [Tue, 10 Dec 2024 18:23:47 +0000 (13:23 -0500)]
bcachefs: Refactor c->opts.reconstruct_alloc

Now handled in one place.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Add empty statement between label and declaration in check_inode_hash_info_...
Nathan Chancellor [Tue, 10 Dec 2024 18:12:07 +0000 (11:12 -0700)]
bcachefs: Add empty statement between label and declaration in check_inode_hash_info_matches_root()

Clang 18 and newer warns (or errors with CONFIG_WERROR=y):

  fs/bcachefs/str_hash.c:164:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
    164 |         struct bch_inode_unpacked inode;
        |         ^

In Clang 17 and prior, this is an unconditional hard error:

  fs/bcachefs/str_hash.c:164:2: error: expected expression
    164 |         struct bch_inode_unpacked inode;
        |         ^
  fs/bcachefs/str_hash.c:165:30: error: use of undeclared identifier 'inode'
    165 |         ret = bch2_inode_unpack(k, &inode);
        |                                     ^
  fs/bcachefs/str_hash.c:169:55: error: use of undeclared identifier 'inode'
    169 |         struct bch_hash_info hash2 = bch2_hash_info_init(c, &inode);
        |                                                              ^
  fs/bcachefs/str_hash.c:171:40: error: use of undeclared identifier 'inode'
    171 |                 ret = repair_inode_hash_info(trans, &inode);
        |                                                      ^

Add an empty statement between the label and the declaration to fix the
warning/error without disturbing the code too much.
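
For reference, the pattern looks like this in isolation. The function below
is a made-up example, not the code from str_hash.c.

```c
/*
 * Before C23, a label must be followed by a statement, and a declaration
 * is not a statement. A lone ';' after the label satisfies the grammar.
 */
static int value_or_default(int have_value, int value)
{
	if (!have_value)
		goto fallback;
	return value;
fallback:
	;			/* empty statement after the label */
	int def = -1;		/* the declaration no longer follows the label directly */
	return def;
}
```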

Fixes: 2519d3b0d656 ("bcachefs: bch2_str_hash_check_key() now checks inode hash info")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412092339.QB7hffGC-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: trace_write_buffer_maybe_flush
Kent Overstreet [Tue, 10 Dec 2024 15:29:12 +0000 (10:29 -0500)]
bcachefs: trace_write_buffer_maybe_flush

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: bch2_snapshot_exists()
Kent Overstreet [Mon, 9 Dec 2024 06:31:43 +0000 (01:31 -0500)]
bcachefs: bch2_snapshot_exists()

bch2_snapshot_equiv() is going away; convert users that just wanted to
know if the snapshot exists to something better.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: bch2_check_key_has_snapshot() prints btree id
Kent Overstreet [Mon, 9 Dec 2024 03:30:19 +0000 (22:30 -0500)]
bcachefs: bch2_check_key_has_snapshot() prints btree id

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: bch2_str_hash_check_key() now checks inode hash info
Kent Overstreet [Mon, 9 Dec 2024 02:47:34 +0000 (21:47 -0500)]
bcachefs: bch2_str_hash_check_key() now checks inode hash info

Versions of the same inode in different snapshots must have the same
hash info; this is critical for lookups to work correctly.

We're going to be running the str_hash checks online, at readdir or
xattr list time, so we now need str_hash_check_key() to check for inode
hash seed mismatches, since it won't be run right after check_inodes().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Don't BUG_ON() inode unpack error
Kent Overstreet [Mon, 9 Dec 2024 03:00:36 +0000 (22:00 -0500)]
bcachefs: Don't BUG_ON() inode unpack error

Bkey validation checks that inodes are well-formed and unpack
successfully, so an unpack error should always indicate memory
corruption or some other kind of hardware bug - but these are still
errors we can recover from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Use proper errcodes for inode unpack errors
Kent Overstreet [Mon, 9 Dec 2024 02:42:49 +0000 (21:42 -0500)]
bcachefs: Use proper errcodes for inode unpack errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: kill sysfs internal/accounting
Kent Overstreet [Mon, 9 Dec 2024 01:55:03 +0000 (20:55 -0500)]
bcachefs: kill sysfs internal/accounting

Since we added per-inode counters there's now far too many counters to
show in one shot - if we want this in the future, it'll have to be in
debugfs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Kill unnecessary mark_lock usage
Kent Overstreet [Sun, 8 Dec 2024 09:11:21 +0000 (04:11 -0500)]
bcachefs: Kill unnecessary mark_lock usage

We can't hold mark_lock while calling fsck_err() - that's a deadlock,
mark_lock is meant to be a leaf node lock.

It's also unnecessary for gc_bucket() and bucket_gen(); rcu suffices
since the bucket_gens array describes its size, and we can't race with
device removal or resize during gc/fsck since that takes state lock.

Reported-by: syzbot+38641fcbda1aaffefdd4@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Don't start rewriting btree nodes until after journal replay
Kent Overstreet [Mon, 9 Dec 2024 11:00:33 +0000 (06:00 -0500)]
bcachefs: Don't start rewriting btree nodes until after journal replay

This fixes a deadlock during journal replay when btree node read errors
kick off a ton of rewrites: we don't want them competing with journal
replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Fix reuse of bucket before journal flush on multiple empty -> nonempty...
Kent Overstreet [Sat, 7 Dec 2024 04:15:05 +0000 (23:15 -0500)]
bcachefs: Fix reuse of bucket before journal flush on multiple empty -> nonempty transition

For each bucket we track when the bucket became nonempty and when it
became empty again: if we can ensure that there will be no journal
flushes in the range [nonempty, empty) (possibly because they occurred at
the same journal sequence number), then it's safe to reuse the bucket
without waiting for a journal commit.
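
The half-open range test can be sketched as follows. This is an illustration
under assumptions: the flat array of flush sequence numbers and both helper
names are hypothetical, and the real journal tracks flushes differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical: @flush_seqs lists the journal sequence numbers that are
 * flushes. Returns true if any of them falls in [start, end).
 */
static bool range_has_flush(const uint64_t *flush_seqs, size_t nr,
			    uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < nr; i++)
		if (flush_seqs[i] >= start && flush_seqs[i] < end)
			return true;
	return false;
}

/*
 * A bucket can be reused without waiting for a journal commit if no flush
 * happened between it becoming nonempty and becoming empty again. When both
 * transitions share a sequence number the range is empty, so this is true.
 */
static bool bucket_reusable_without_commit(const uint64_t *flush_seqs, size_t nr,
					   uint64_t nonempty_seq, uint64_t empty_seq)
{
	return !range_has_flush(flush_seqs, nr, nonempty_seq, empty_seq);
}
```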

This is a major performance optimization for erasure coding, where
writes are initially replicated, but the extra replicas are quickly
dropped: if those buckets are reused and overwritten without issuing a
cache flush to the underlying device, then they only cost bus bandwidth.

But there's a tricky corner case when there's multiple empty -> nonempty
-> empty transitions in quick succession, i.e. when data is getting
overwritten immediately as it's being written.

If this happens and the previous empty transition hasn't been flushed,
we need to continue tracking the previous nonempty transition - not
start a new one.

Fixing this means we now need to track both the nonempty and empty
transitions in bch_alloc_v4.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: bch2_journal_noflush_seq() now takes [start, end)
Kent Overstreet [Sun, 8 Dec 2024 05:28:16 +0000 (00:28 -0500)]
bcachefs: bch2_journal_noflush_seq() now takes [start, end)

Harder to screw up if we're explicit about the range, and more correct
as journal reservations can be outstanding on multiple journal entries
simultaneously.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Set bucket needs discard, inc gen on empty -> nonempty transition
Kent Overstreet [Sun, 8 Dec 2024 01:43:07 +0000 (20:43 -0500)]
bcachefs: Set bucket needs discard, inc gen on empty -> nonempty transition

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Don't add unknown accounting types to eytzinger tree
Kent Overstreet [Thu, 5 Dec 2024 17:35:43 +0000 (12:35 -0500)]
bcachefs: Don't add unknown accounting types to eytzinger tree

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Plumb bkey_validate_context to journal_entry_validate
Kent Overstreet [Sun, 8 Dec 2024 02:36:15 +0000 (21:36 -0500)]
bcachefs: Plumb bkey_validate_context to journal_entry_validate

This lets us print the exact location in the journal if it was found in
the journal, or correctly print if it was found in the superblock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 months agobcachefs: Use a heap for handling overwrites in btree node scan
Kent Overstreet [Sat, 7 Dec 2024 00:23:22 +0000 (19:23 -0500)]
bcachefs: Use a heap for handling overwrites in btree node scan

Fix an O(n^2) issue when we find many overlapping (overwritten) btree
nodes - especially when one node overwrites many smaller nodes.

This was discovered to be an issue with the bcachefs
merge_torture_flakey test - if we had a large btree that was then
emptied, the number of difficult overwrites can be unbounded.

Cc: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>