www.infradead.org Git - users/jedix/linux-maple.git/log
4 weeks ago bcachefs: Fix btree iter flags in data move
Kent Overstreet [Mon, 17 Mar 2025 19:07:06 +0000 (15:07 -0400)]
bcachefs: Fix btree iter flags in data move

Rebalance requires a not_extents iterator.

This wasn't hit before because all_snapshots disables is_extents on
snapshots btrees - but has no effect on the reflink btree.

Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 weeks ago bcachefs: Validate bch_sb.offset field
Kent Overstreet [Mon, 17 Mar 2025 17:58:51 +0000 (13:58 -0400)]
bcachefs: Validate bch_sb.offset field

This was missed - but it needs to be correct for the superblock recovery
tool that scans the start and end of the device for backup superblocks:
we don't want to pick up superblocks that belong to a different
partition that starts at a different offset.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
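
As a rough illustration of the check described in the commit above: a
recovery scan can reject any candidate superblock whose recorded offset
does not match the sector it was actually found at. This is a minimal
sketch, not the bcachefs implementation; struct sb_sketch and
sb_offset_matches() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the on-disk superblock. */
struct sb_sketch {
	uint64_t offset;	/* sector at which this copy claims to live */
};

/*
 * Accept a candidate found during a scan only if the offset it records
 * matches the sector it was read from. A superblock belonging to a
 * partition that starts elsewhere records an offset relative to that
 * partition, so it fails this comparison.
 */
static bool sb_offset_matches(const struct sb_sketch *sb,
			      uint64_t found_at_sector)
{
	assert(sb != NULL);
	return sb->offset == found_at_sector;
}
```
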
4 weeks ago bcachefs: bch2_sb_validate() doesn't need bch_sb_handle
Kent Overstreet [Mon, 17 Mar 2025 14:54:21 +0000 (10:54 -0400)]
bcachefs: bch2_sb_validate() doesn't need bch_sb_handle

Minor refactoring, so that bch2_sb_validate() can be used in the new
userspace superblock recovery tool.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 weeks ago bcachefs: Add missing random.h includes
Kent Overstreet [Mon, 17 Mar 2025 15:28:26 +0000 (11:28 -0400)]
bcachefs: Add missing random.h includes

Fix build in userspace, and good hygiene.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 weeks ago bcachefs: Better incompat version/feature error messages
Kent Overstreet [Sat, 15 Mar 2025 23:57:20 +0000 (19:57 -0400)]
bcachefs: Better incompat version/feature error messages

If we can't mount because of an incompatibility, print what's supported
and unsupported - to help solve PEBKAC issues.

Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 weeks ago bcachefs: Fix offset_into_extent in data move path
Kent Overstreet [Sat, 15 Mar 2025 21:27:27 +0000 (17:27 -0400)]
bcachefs: Fix offset_into_extent in data move path

Fixes the following:

[   17.607394] kernel BUG at fs/bcachefs/reflink.c:261!
[   17.608316] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   17.608485] CPU: 0 UID: 0 PID: 564 Comm: bch-rebalance/3 Tainted: G           OE      6.14.0-rc6-arch1-gfcb0bd9609d2 #7 0efd7a8f4a00afeb2c5fb6e7ecb1aec8ddcbb1e1
[   17.608616] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   17.608736] Hardware name: Micro-Star International Co., Ltd. MS-7D75/MAG B650 TOMAHAWK WIFI (MS-7D75), BIOS 1.74 08/01/2023
[   17.608855] RIP: 0010:bch2_lookup_indirect_extent+0x252/0x290 [bcachefs]
[   17.609006] Code: 00 00 00 00 e8 7f 51 f5 ff 89 c3 85 c0 74 52 48 8b 7d b0 4c 89 ee e8 4d 4b f4 ff 48 63 d3 48 89 d0 31 d2 e9 2e ff ff ff 0f 0b <0f> 0b 48 8b 7d b0 4c 89 ee 48 89 55 a8 e8 2c 4b f4 ff 4c 8b 55 a8
[   17.609136] RSP: 0018:ffffa3714455f850 EFLAGS: 00010246
[   17.609261] RAX: 0000000000000080 RBX: ffff895891098790 RCX: 0000000000000000
[   17.609387] RDX: 0000000000000080 RSI: ffffa3714455fa90 RDI: ffff895889550000
[   17.609511] RBP: ffffa3714455f8c0 R08: ffff895891098790 R09: 0000000000000001
[   17.609637] R10: ffffa3714455f8d8 R11: ffffa3714455f950 R12: ffffa3714455fa58
[   17.609763] R13: ffff895891098790 R14: ffffa3714455fa58 R15: ffff895889550000
[   17.609888] FS:  0000000000000000(0000) GS:ffff896757c00000(0000) knlGS:0000000000000000
[   17.610015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.610143] CR2: 0000716b8cda2750 CR3: 0000000914e22000 CR4: 0000000000f50ef0
[   17.610272] PKRU: 55555554
[   17.610403] Call Trace:
[   17.610535]  <TASK>
[   17.610662]  ? __die_body.cold+0x19/0x27
[   17.610791]  ? die+0x2e/0x50
[   17.610918]  ? do_trap+0xca/0x110
[   17.611049]  ? do_error_trap+0x6a/0x90
[   17.611178]  ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.611331]  ? exc_invalid_op+0x50/0x70
[   17.611468]  ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.611620]  ? asm_exc_invalid_op+0x1a/0x20
[   17.611757]  ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.611911]  ? bch2_move_data_btree+0x58a/0x6c0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612084]  bch2_move_data_btree+0x58a/0x6c0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612256]  ? __pfx_rebalance_pred+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612431]  ? bch2_move_extent+0x3d7/0x6e0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612607]  ? __bch2_move_data+0xea/0x200 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612782]  __bch2_move_data+0xea/0x200 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612959]  ? __pfx_rebalance_pred+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.613149]  do_rebalance+0x517/0x8d0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.613342]  ? local_clock_noinstr+0xd/0xd0
[   17.613518]  ? local_clock+0x15/0x30
[   17.613693]  ? __bch2_trans_get+0x152/0x300 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.613890]  ? __pfx_bch2_rebalance_thread+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.614090]  bch2_rebalance_thread+0x66/0xb0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]

The offset_into_extent bit was copied from the read path, but it's
unnecessary here, where we always want to read and move the entire
indirect extent, and it causes the assertion to pop - because we're using a
non-extents iterator, which always points to the end of the reflink
pointer.

Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 weeks ago bcachefs: use sha256() instead of crypto_shash API
Eric Biggers [Sun, 16 Mar 2025 03:47:17 +0000 (20:47 -0700)]
bcachefs: use sha256() instead of crypto_shash API

Just use sha256() instead of the clunky crypto API.  This is much
simpler.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 weeks ago bcachefs: Remove unnecessary softdeps on crc32c and crc64
Eric Biggers [Sun, 16 Mar 2025 03:03:19 +0000 (20:03 -0700)]
bcachefs: Remove unnecessary softdeps on crc32c and crc64

Since bcachefs does not access crc32c and crc64 through the crypto API,
there is no need to use module softdeps to ensure they are loaded.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 weeks ago bcachefs: #if 0 out (enable|disable)_encryption()
Kent Overstreet [Sun, 16 Mar 2025 17:39:14 +0000 (13:39 -0400)]
bcachefs: #if 0 out (enable|disable)_encryption()

These weren't hooked up, but they probably should be - add some comments
for context.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 weeks ago bcachefs: Improve can_write_extent()
Kent Overstreet [Sun, 16 Mar 2025 01:32:33 +0000 (21:32 -0400)]
bcachefs: Improve can_write_extent()

This fixes another "rebalance spinning and doing no work" issue;
rebalance was reading extents it wanted to move, but then failing in
bch2_write() -> bch2_alloc_sectors_start() due to being unable to
allocate sufficient replicas.

This was triggered by a user playing with the durability settings: the
foreground device was an NVME device with durability=2, and originally
he'd set the background device to durability=2 as well, but changed it
back to 1 (the default) after seeing IO errors.

That meant that with replicas=2, we want to move data off the NVME
device which satisfies that constraint, but with a single durability=1
device on the background target there's no way to move the extent to
that target while satisfying the "required replicas" constraint.

The solution for now is for bch2_data_update_init() to check for this,
and return an error - before kicking off the read.

bch2_data_update_init() already had two different checks for "will we be
able to write this extent", with partially duplicated code, so this
patch combines and improves that logic.

Additionally, we now always bail out and return an error if there's
insufficient space on the destination target. Previously, we only did
this for BCH_WRITE_alloc_nowait moves, because it might be the case that
copygc just needs to free up space on the destination target.

But we really shouldn't kick off a move if the destination is full, we
can't currently distinguish between "really full" and "just need to wait
for copygc", and if we are going to wait on copygc it'd be better to do
that before kicking off the move.

This will additionally fix "rebalance spinning" issues caused by a
filesystem that has more data than can fit in background_target - which
is a valid scenario, since we don't exclude foreground/cache devices
when calculating filesystem capacity.

Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
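
The durability constraint described in the commit above can be sketched
as a simple sum: a move to a target is only worth starting if the
writable devices in that target can together provide the required
number of replicas. This is a simplified illustration under that
assumption; struct dev_sketch and can_satisfy_replicas() are
hypothetical and ignore free space and replicas that stay in place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a member device. */
struct dev_sketch {
	unsigned durability;	/* replicas one copy on this device counts as */
	bool	 in_target;	/* device belongs to the destination target */
	bool	 writable;	/* device is not read-only or failed */
};

/*
 * Return true if the writable devices in the destination target can
 * together satisfy the required replica count.
 */
static bool can_satisfy_replicas(const struct dev_sketch *devs, size_t nr,
				 unsigned required)
{
	unsigned available = 0;

	assert(devs != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++)
		if (devs[i].in_target && devs[i].writable)
			available += devs[i].durability;

	return available >= required;
}
```

With the setup from the report (replicas=2, one NVME device with
durability=2 outside the target, one durability=1 device in the
background target) the check fails, so the move is refused before any
read is issued.
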
4 weeks ago bcachefs: trace_io_move_write_fail
Kent Overstreet [Sat, 15 Mar 2025 23:24:44 +0000 (19:24 -0400)]
bcachefs: trace_io_move_write_fail

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 weeks ago bcachefs: Increase blacklist range
Alan Huang [Sat, 15 Mar 2025 07:39:42 +0000 (15:39 +0800)]
bcachefs: Increase blacklist range

Now that there are 16 journal buffers, a range of 8 is too small.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 weeks ago bcachefs: __bch2_read() now takes a btree_trans
Kent Overstreet [Mon, 10 Mar 2025 17:33:41 +0000 (13:33 -0400)]
bcachefs: __bch2_read() now takes a btree_trans

Next patch will be checking if the extent we're reading from matches the
IO failure we saw before marking the failure.

For this to work, __bch2_read() needs to take the same transaction
context that bch2_rbio_retry() uses to do that check.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
4 weeks ago bcachefs: BCH_READ_data_update -> bch_read_bio.data_update
Kent Overstreet [Wed, 12 Mar 2025 20:56:09 +0000 (16:56 -0400)]
bcachefs: BCH_READ_data_update -> bch_read_bio.data_update

Read flags are codepath dependent and change as they're passed around,
while the fields in rbio._state are mostly fixed properties of that
particular object.

Losing track of BCH_READ_data_update would be bad, and previously it was
not obvious if it was always correctly set in the rbio, so this is a
safety cleanup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
5 weeks ago bcachefs: Checksum errors get additional retries
Kent Overstreet [Sat, 8 Mar 2025 17:56:43 +0000 (12:56 -0500)]
bcachefs: Checksum errors get additional retries

It's possible for checksum errors to be transient - e.g. flakey
controller or cable, thus we need additional retries (besides retrying
from different replicas) before we can definitely return an error.

This is particularly important for the next patch, which will allow the
data move path to move extents with checksum errors - we don't want to
accidentally introduce bitrot due to a transient error!

- bch2_bkey_pick_read_device() is substantially reworked, and
  bch2_dev_io_failures is expanded to record more information about the
  type of failure (i.e. number of checksum errors).

  It now returns an error code that describes more precisely the reason
  for the failure - checksum error, io error, or offline device, instead
  of the previous generic "insufficient devices". This is important for
  the next patches that add poisoning, as we only want to poison extents
  when we've got real checksum errors (or perhaps IO errors?) - not
  because a device was offline.

- Add a new option and superblock field for the number of checksum
  retries.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
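
The retry policy described in the commit above can be sketched as an
outer loop around the usual "try each replica" loop: extra passes are
only made when at least one failure was a checksum error, since those
may be transient. This is an illustrative sketch, not the bcachefs read
path; every name here (read_with_csum_retries(), flaky_read(), the enum)
is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum read_result { READ_OK, READ_CSUM_ERR, READ_IO_ERR };

/* Caller-supplied function that reads one replica. */
typedef enum read_result (*read_fn)(unsigned replica, void *ctx);

/*
 * Try every replica; if the pass failed and at least one failure was a
 * checksum error, repeat the whole pass, up to csum_retries extra times.
 */
static enum read_result read_with_csum_retries(read_fn rd, void *ctx,
					       unsigned nr_replicas,
					       unsigned csum_retries)
{
	bool saw_csum_err = false;

	assert(rd != NULL);

	for (unsigned attempt = 0; attempt <= csum_retries; attempt++) {
		saw_csum_err = false;

		for (unsigned r = 0; r < nr_replicas; r++) {
			enum read_result ret = rd(r, ctx);

			if (ret == READ_OK)
				return READ_OK;
			if (ret == READ_CSUM_ERR)
				saw_csum_err = true;
		}

		/* Plain IO errors do not earn additional passes. */
		if (!saw_csum_err)
			break;
	}

	return saw_csum_err ? READ_CSUM_ERR : READ_IO_ERR;
}

/* Example reader: reports a checksum error until the counter hits zero. */
static enum read_result flaky_read(unsigned replica, void *ctx)
{
	unsigned *failures_left = ctx;

	(void)replica;
	if (*failures_left) {
		(*failures_left)--;
		return READ_CSUM_ERR;
	}
	return READ_OK;
}
```
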
5 weeks ago bcachefs: Print message on successful read retry
Kent Overstreet [Sat, 8 Mar 2025 23:42:34 +0000 (18:42 -0500)]
bcachefs: Print message on successful read retry

Users have been asking for this, and now that errors are returned to the
top level read retry path - we can.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
5 weeks ago bcachefs: Return errors to top level bch2_rbio_retry()
Kent Overstreet [Sun, 9 Mar 2025 00:37:10 +0000 (19:37 -0500)]
bcachefs: Return errors to top level bch2_rbio_retry()

Next patch will be adding an additional retry loop for checksum errors,
so that we can rule out transient errors before marking an extent as
poisoned.

Prerequisite to this is returning errors to bch2_rbio_retry(); this will
also let us add a "successful retry" message.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
5 weeks ago bcachefs: BCH_ERR_data_read_buffer_too_small
Kent Overstreet [Sat, 8 Mar 2025 16:37:51 +0000 (11:37 -0500)]
bcachefs: BCH_ERR_data_read_buffer_too_small

Now that the read path uses proper error codes, we can get rid of the
weird rbio->hole signalling to the move path that the read didn't
happen.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
5 weeks ago bcachefs: Read error message now indicates if it was for an internal move
Kent Overstreet [Sat, 8 Mar 2025 16:24:22 +0000 (11:24 -0500)]
bcachefs: Read error message now indicates if it was for an internal move

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
5 weeks ago bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path
Kent Overstreet [Tue, 11 Mar 2025 13:04:09 +0000 (09:04 -0400)]
bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path

When we do a read to a buffer that's mapped into userspace, it's
possible to get a spurious checksum error if userspace was modifying the
buffer at the same time.

When we retry those, they have to be bounced before we know definitively
whether we're reading corrupt data.

But the retry path propagates read flags differently, so needs special
handling.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
5 weeks ago bcachefs: Convert read path to standard error codes
Kent Overstreet [Fri, 7 Mar 2025 22:20:22 +0000 (17:20 -0500)]
bcachefs: Convert read path to standard error codes

Kill the READ_ERR/READ_RETRY/READ_RETRY_AVOID enums, and add standard
error codes that describe precisely which error occurred.

This is going to be used for the data move path, to move but poison
extents with checksum errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
5 weeks ago bcachefs: Debug params for data corruption injection
Kent Overstreet [Sat, 8 Mar 2025 23:42:56 +0000 (18:42 -0500)]
bcachefs: Debug params for data corruption injection

dm-flakey is busted, and this is simpler anyways - this lets us test the
checksum error retry paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
5 weeks ago bcachefs: Don't create bch_io_failures unless it's needed
Kent Overstreet [Mon, 10 Mar 2025 15:54:13 +0000 (11:54 -0400)]
bcachefs: Don't create bch_io_failures unless it's needed

Only needed in retry path, no point in wasting stack space.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
5 weeks ago bcachefs: bch2_bkey_ptrs_rebalance_opts()
Kent Overstreet [Thu, 13 Mar 2025 04:47:51 +0000 (00:47 -0400)]
bcachefs: bch2_bkey_ptrs_rebalance_opts()

Small optimization for bch2_bkey_sectors_need_rebalance()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
5 weeks ago bcachefs: Add a cond_resched() to btree cache teardown
Kent Overstreet [Fri, 14 Mar 2025 22:19:17 +0000 (18:19 -0400)]
bcachefs: Add a cond_resched() to btree cache teardown

[12308.606480] watchdog: BUG: soft lockup - CPU#18 stuck for 26s! [umount:48479]
[12308.606485] Modules linked in: bcachefs lz4hc_compress lz4_compress lz4_decompress sunrpc overlay nf_conntrack_netlink xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE bridge stp llc xfrm_user ip6table_nat ip6table_filter ip6_tables iptable_nat xt_addrtype iptable_filter ip_tables x_tables nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 psample ext4 mbcache jbd2 nls_iso8859_1 nls_cp850 vfat fat binfmt_misc skx_edac_common nfit edac_core libnvdimm cbc encrypted_keys intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common ipmi_ssif x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm drivetemp rapl intel_cstate coretemp mgag200 i2c_algo_bit ixgbe drm_shmem_helper drm_kms_helper mdio_devres xfrm_algo mdio drm ptp intel_uncore mei_me efi_pstore evdev uas pl2303 pps_core libphy usb_storage usbserial lpc_ich mei drm_panel_orientation_quirks acpi_power_meter tiny_power_button ipmi_si mfd_core intel_pch_thermal acpi_tad acpi_ipmi ioatdma
[12308.606541]  ipmi_devintf ipmi_msghandler dca wmi button efivarfs polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sha1_generic xhci_pci xhci_hcd aesni_intel ehci_pci ehci_hcd gf128mul crypto_simd cryptd usbcore hpwdt usb_common
[12308.606557] CPU: 18 UID: 0 PID: 48479 Comm: umount Tainted: G             L     6.14.0-rc6-x86_64-00159-ga09496a03e63 #1
[12308.606560] Tainted: [L]=SOFTLOCKUP
[12308.606561] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 07/20/2023
[12308.606563] RIP: 0010:clear_page_erms+0x7/0x10
[12308.606570] Code: 48 89 47 38 48 8d 7f 40 75 d9 90 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[12308.606572] RSP: 0018:ffff9ed5b622fba0 EFLAGS: 00010246
[12308.606574] RAX: 0000000000000000 RBX: ffff90347fffe6c0 RCX: 00000000000004c0
[12308.606575] RDX: ffffe34ea9bec1c0 RSI: 00000000000405f0 RDI: ffff902eafb07b40
[12308.606576] RBP: ffff9ed5b622fbf0 R08: 0000000000000001 R09: 0000000000000006
[12308.606577] R10: 0000000000040001 R11: 0000000000000000 R12: ffffe34ea9bec000
[12308.606578] R13: 0000000000000000 R14: 0000000000000006 R15: ffffe34ea9bed000
[12308.606580] FS:  00007fe704ecfb68(0000) GS:ffff9053fea00000(0000) knlGS:0000000000000000
[12308.606581] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12308.606582] CR2: 00007f18159068ae CR3: 00000001314d0005 CR4: 00000000007726f0
[12308.606583] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12308.606584] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[12308.606584] PKRU: 55555554
[12308.606585] Call Trace:
[12308.606587]  <IRQ>
[12308.606590]  ? show_regs.cold+0x19/0x28
[12308.606595]  ? watchdog_timer_fn.cold+0x3d/0x9d
[12308.606598]  ? __pfx_watchdog_timer_fn+0x10/0x10
[12308.606602]  ? __hrtimer_run_queues+0x12e/0x250
[12308.606607]  ? hrtimer_interrupt+0xfd/0x220
[12308.606609]  ? __sysvec_apic_timer_interrupt+0x53/0xe0
[12308.606614]  ? sysvec_apic_timer_interrupt+0x76/0xa0
[12308.606619]  </IRQ>
[12308.606620]  <TASK>
[12308.606620]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[12308.606626]  ? clear_page_erms+0x7/0x10
[12308.606628]  ? __free_pages_ok+0x374/0x640
[12308.606633]  free_frozen_pages+0x34/0x570
[12308.606636]  __folio_put+0x87/0xe0
[12308.606641]  free_large_kmalloc+0x70/0x80
[12308.606645]  kfree+0x2f6/0x390
[12308.606648]  kvfree+0x2d/0x40
[12308.606653]  __btree_node_data_free+0xaf/0xf0 [bcachefs]
[12308.606726]  btree_node_data_free+0x6a/0x80 [bcachefs]
[12308.606778]  bch2_fs_btree_cache_exit+0x262/0x440 [bcachefs]
[12308.606829]  bch2_fs_release+0xe8/0x340 [bcachefs]
[12308.606905]  kobject_put+0x60/0xc0
[12308.606908]  bch2_fs_free+0xdd/0x120 [bcachefs]
[12308.606981]  bch2_kill_sb+0x1e/0x30 [bcachefs]
[12308.607051]  deactivate_locked_super+0x32/0xb0
[12308.607055]  deactivate_super+0x40/0x50
[12308.607057]  cleanup_mnt+0xc3/0x160
[12308.607060]  __cleanup_mnt+0x12/0x20
[12308.607062]  task_work_run+0x5f/0xa0
[12308.607064]  syscall_exit_to_user_mode+0x194/0x1a0
[12308.607066]  do_syscall_64+0x67/0x170
[12308.607068]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[12308.607070] RIP: 0033:0x7fe704e66eed
[12308.607073] Code: 08 49 89 ca b8 a5 00 00 00 0f 05 48 89 c7 e8 8a e6 ff ff 48 83 c4

Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
5 weeks ago bcachefs: rebalance, copygc status also print stacktrace
Kent Overstreet [Thu, 13 Mar 2025 19:21:13 +0000 (15:21 -0400)]
bcachefs: rebalance, copygc status also print stacktrace

These are commonly needed when debugging, and this saves us from having
to ask users to dig.

Also, rebalance_status now includes pending rebalance work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: Kill bch2_remount()
Kent Overstreet [Thu, 13 Mar 2025 15:44:52 +0000 (11:44 -0400)]
bcachefs: Kill bch2_remount()

Single caller, so inline it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: Kill a bit of dead code
Kent Overstreet [Tue, 11 Mar 2025 13:31:03 +0000 (09:31 -0400)]
bcachefs: Kill a bit of dead code

Found with CC=clang W=1

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: Use max() to improve gen_after()
Thorsten Blum [Tue, 11 Mar 2025 11:13:11 +0000 (12:13 +0100)]
bcachefs: Use max() to improve gen_after()

Use max() to simplify gen_after() and improve its readability.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
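
For readers unfamiliar with the idiom in the commit above: a
wraparound-safe "how far is generation a ahead of b" helper can be
written by taking the 8-bit difference as a signed value and clamping
it at zero with max(). The following is a userspace sketch of that
idea, not a copy of the kernel function; max_int() stands in for the
kernel's max() macro, and the narrowing conversion to int8_t is
assumed to wrap, as it does with GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's max() macro. */
static inline int max_int(int a, int b)
{
	return a > b ? a : b;
}

/*
 * How far generation a is ahead of generation b, treating the 8-bit
 * counters as wrapping; 0 if a is not ahead of b.
 */
static inline uint8_t gen_after(uint8_t a, uint8_t b)
{
	return (uint8_t)max_int(0, (int8_t)(uint8_t)(a - b));
}

/* A few worked cases, including wraparound in both directions. */
static void gen_after_examples(void)
{
	assert(gen_after(5, 3) == 2);	/* plainly ahead */
	assert(gen_after(3, 5) == 0);	/* behind: clamped to 0 */
	assert(gen_after(1, 255) == 2);	/* ahead across the wrap */
	assert(gen_after(255, 1) == 0);	/* behind across the wrap */
}
```
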
6 weeks ago bcachefs: Remove unnecessary byte allocation
Thorsten Blum [Sat, 8 Mar 2025 19:53:53 +0000 (20:53 +0100)]
bcachefs: Remove unnecessary byte allocation

The extra byte is not used - remove it.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: We no longer read stripes into memory at startup
Kent Overstreet [Tue, 11 Feb 2025 01:15:40 +0000 (20:15 -0500)]
bcachefs: We no longer read stripes into memory at startup

And the stripes heap gets deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: trace_stripe_create
Kent Overstreet [Fri, 7 Mar 2025 19:30:29 +0000 (14:30 -0500)]
bcachefs: trace_stripe_create

Add a simple tracepoint for stripe creation; we'll want to expand this
later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: get_existing_stripe() uses new stripe lru
Kent Overstreet [Tue, 11 Feb 2025 01:34:47 +0000 (20:34 -0500)]
bcachefs: get_existing_stripe() uses new stripe lru

Convert to the new persistent stripe LRU.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: ec_stripe_delete() uses new stripe lru
Kent Overstreet [Tue, 11 Feb 2025 01:35:08 +0000 (20:35 -0500)]
bcachefs: ec_stripe_delete() uses new stripe lru

Convert to the new persistent stripe LRU.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: journal write path comment
Kent Overstreet [Fri, 7 Mar 2025 17:00:56 +0000 (12:00 -0500)]
bcachefs: journal write path comment

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: Kick devices out after too many write IO errors
Kent Overstreet [Wed, 26 Feb 2025 23:44:23 +0000 (18:44 -0500)]
bcachefs: Kick devices out after too many write IO errors

We're improving our handling of write errors - we shouldn't write
degraded data just because a write failed once, we should retry it (on
other devices, if possible).

But for this to work, we need to kick devices out when they're only
returning errors - otherwise those retries will loop infinitely.

This adds a configurable timeout - if writes are failing for too long,
we'll set that device read-only.

In the future we should also implement more tracking and another knob
for an "allowed error rate", so that we can kick out drives that are
acting "unhealthy".

Another thing we'll want is a mechanism (likely in userspace) for
bringing a device back in after a transient error - perhaps a cable was
jiggled, or there was a controller reset.

After transient errors we also need a mechanism to walk (from the
journal) recent btree updates that weren't flushed to that device and
treat them as "degraded", since unflushed data may well not have been
written. Out of scope for this patch, but becoming relevant.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
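
The timeout policy described in the commit above can be sketched as
per-device bookkeeping: remember when the current run of write failures
began, clear it on any successful write, and mark the device read-only
once failures have persisted for the configured time. This is a
simplified illustration; struct dev_write_health and account_write()
are hypothetical names, and the time units are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device write health state. */
struct dev_write_health {
	bool	 in_error_run;		/* writes are currently failing */
	uint64_t error_run_start;	/* time the current run began */
	bool	 read_only;		/* device has been kicked out */
};

/*
 * Account one write completion. A success ends the error run; a
 * failure starts one, or kicks the device out if the run has already
 * lasted at least `timeout`.
 */
static void account_write(struct dev_write_health *d, bool failed,
			  uint64_t now, uint64_t timeout)
{
	assert(d != NULL);

	if (!failed) {
		d->in_error_run = false;
		return;
	}

	if (!d->in_error_run) {
		d->in_error_run = true;
		d->error_run_start = now;
		return;
	}

	if (now - d->error_run_start >= timeout)
		d->read_only = true;
}
```

Keying the decision on elapsed time rather than an error count means a
single successful write resets the clock, so a device that is only
occasionally failing is not kicked out by this check.
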
6 weeks ago bcachefs: Change BCH_MEMBER_STATE_failed semantics
Kent Overstreet [Fri, 7 Mar 2025 15:50:49 +0000 (10:50 -0500)]
bcachefs: Change BCH_MEMBER_STATE_failed semantics

Previously, we wouldn't try to read at all from a failed device - that
doesn't make much sense, the device may be unhealthy (perhaps taking
longer than it should to service reads), but if it's our only option we
should still try to read from it.

Now, bch2_bkey_pick_read_device() will pick failed devices only if there
are no non-failed replicas to read from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
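
The selection rule described in the commit above amounts to "prefer any
non-failed replica, fall back to a failed one only if nothing else
exists". A minimal sketch of that rule follows; it is not
bch2_bkey_pick_read_device() itself, and the types and the function
name pick_read_replica() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum member_state { MEMBER_RW, MEMBER_RO, MEMBER_FAILED };

/* Hypothetical, simplified description of one replica of an extent. */
struct replica_sketch {
	unsigned	  dev;
	enum member_state state;
};

/*
 * Return the index of the replica to read from, or -1 if there are
 * none. Replicas on failed devices are used only as a last resort.
 */
static int pick_read_replica(const struct replica_sketch *r, size_t nr)
{
	int fallback = -1;

	assert(r != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (r[i].state != MEMBER_FAILED)
			return (int)i;
		if (fallback < 0)
			fallback = (int)i;
	}

	return fallback;
}
```
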
6 weeks ago bcachefs: bch2_dev_get_ioref() may now sleep
Kent Overstreet [Sat, 1 Mar 2025 22:34:33 +0000 (17:34 -0500)]
bcachefs: bch2_dev_get_ioref() may now sleep

The next patch implementing freezing will change bch2_dev_get_ioref() to
sleep if a device is currently frozen.

Add an annotation and fix the journal code accordingly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: Fix btree_node_scan io_ref handling
Kent Overstreet [Sat, 1 Mar 2025 21:14:28 +0000 (16:14 -0500)]
bcachefs: Fix btree_node_scan io_ref handling

This was completely fubar; it's now simplified a bit as well.
Note that for_each_online_member() takes and releases io_refs as it
iterates, so we need to release that if we break.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
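
The pitfall noted in the commit above is generic to iterators that hold
a reference on the current element: leaving the loop early skips the
iterator's own release, so the caller has to drop the reference by hand.
The sketch below shows the pattern with a plain counter; it does not
use the real for_each_online_member() macro, and all names in it are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical member device with a simple reference counter. */
struct member_sketch {
	int  id;
	bool online;
	int  io_ref;
};

/*
 * Advance to the next online member: drop the ref held on prev (if
 * any), take a ref on the member returned. Returns NULL at the end.
 */
static struct member_sketch *get_next_online(struct member_sketch *devs,
					     size_t nr,
					     struct member_sketch *prev)
{
	size_t i = prev ? (size_t)(prev - devs) + 1 : 0;

	if (prev)
		prev->io_ref--;

	for (; i < nr; i++)
		if (devs[i].online) {
			devs[i].io_ref++;
			return &devs[i];
		}

	return NULL;
}

/* Find an online member by id; returns its index or -1. */
static int find_online_member(struct member_sketch *devs, size_t nr, int id)
{
	int found = -1;

	assert(devs != NULL || nr == 0);

	for (struct member_sketch *m = get_next_online(devs, nr, NULL);
	     m;
	     m = get_next_online(devs, nr, m)) {
		if (m->id == id) {
			found = (int)(m - devs);
			/* Leaving early: release the ref the iterator took. */
			m->io_ref--;
			break;
		}
	}

	return found;
}
```
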
6 weeks ago bcachefs: Implement blk_holder_ops
Kent Overstreet [Tue, 25 Feb 2025 23:50:38 +0000 (18:50 -0500)]
bcachefs: Implement blk_holder_ops

We can't use the standard fs_holder_ops because they're meant for single
device filesystems - fs_bdev_mark_dead() in particular - and they assume
that the blk_holder is the super_block, which also doesn't work for a
multi device filesystem.

These generally follow the standard fs_holder_ops; the
locking/refcounting is a bit simplified because c->ro_ref suffices, and
bch2_fs_bdev_mark_dead() is not necessarily shutting down the entire
filesystem.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: Make sure c->vfs_sb is set before starting fs
Kent Overstreet [Wed, 26 Feb 2025 03:14:06 +0000 (22:14 -0500)]
bcachefs: Make sure c->vfs_sb is set before starting fs

This is necessary for the new blk_holder_ops, which want the vfs
super_block available for synchronization.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: Stash a pointer to the filesystem for blk_holder_ops
Kent Overstreet [Tue, 25 Feb 2025 23:58:46 +0000 (18:58 -0500)]
bcachefs: Stash a pointer to the filesystem for blk_holder_ops

Note that we open block devices before we allocate bch_fs, but once
attached to a filesystem they will be closed before the bch_fs is torn
down - so stashing a pointer without a refcount looks incorrect but it's
not.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: Finish bch2_account_io_completion() conversions
Kent Overstreet [Fri, 28 Feb 2025 19:38:47 +0000 (14:38 -0500)]
bcachefs: Finish bch2_account_io_completion() conversions

More prep work for automatically kicking devices out after too many IO
errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: bch2_account_io_completion()
Kent Overstreet [Fri, 28 Feb 2025 19:07:22 +0000 (14:07 -0500)]
bcachefs: bch2_account_io_completion()

We need to start accounting successes for every IO, not just failures,
so introduce a unified hook for io completion accounting and convert
io_read.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: Fix read path io_ref handling
Kent Overstreet [Fri, 28 Feb 2025 18:59:15 +0000 (13:59 -0500)]
bcachefs: Fix read path io_ref handling

We were using our device pointer after we'd released our ref to it.

Unlikely to be a race that's practical to hit, since actually removing a
member device is a whole process besides just taking it offline, but -
needs to be fixed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: data_update now checks for extents that can't be moved
Kent Overstreet [Fri, 28 Feb 2025 16:37:36 +0000 (11:37 -0500)]
bcachefs: data_update now checks for extents that can't be moved

If a device is ro or failed, we might not have anywhere to move a
replica.

Check for this early, before doing the read and attempting to write.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: give bch2_write_super() a proper error code
Kent Overstreet [Sat, 1 Mar 2025 20:46:59 +0000 (15:46 -0500)]
bcachefs: give bch2_write_super() a proper error code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: bcachefs_metadata_version_extent_flags
Kent Overstreet [Tue, 25 Feb 2025 01:29:58 +0000 (20:29 -0500)]
bcachefs: bcachefs_metadata_version_extent_flags

This implements a new extent field: bitflags that apply to the whole
extent. There have been a couple of things we've wanted this for in the past,
but the immediate need is extent poisoning, to solve a rebalance issue.

Unknown extent fields can't be parsed (we won't know their size, so we
can't advance to the next field), so this is an incompat feature, and
using it prevents the filesystem from being mounted by old versions.

This also adds the BCH_EXTENT_poisoned flag; this indicates that the
data is known to be bad (i.e. there was a checksum error, and we had to
write a new checksum) and reads will return errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
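
To make the poisoned flag from the commit above concrete: whole-extent
flags can be modelled as a bitmask, with the read path returning an
error instead of data when the poisoned bit is set. This is only a
sketch; the bit position, struct extent_sketch and
read_extent_sketch() are hypothetical and do not reflect the on-disk
format.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit position, for illustration only. */
#define EXTENT_SKETCH_POISONED	(UINT64_C(1) << 0)

/* Hypothetical extent carrying flags that apply to the whole extent. */
struct extent_sketch {
	uint64_t flags;
};

/*
 * Return 0 if the extent may be read, or -EIO if it is marked as
 * holding data that is known to be bad.
 */
static int read_extent_sketch(const struct extent_sketch *e)
{
	assert(e != NULL);
	return (e->flags & EXTENT_SKETCH_POISONED) ? -EIO : 0;
}
```
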
6 weeks ago bcachefs: bch2_request_incompat_feature() now returns error code
Kent Overstreet [Fri, 28 Feb 2025 23:59:58 +0000 (18:59 -0500)]
bcachefs: bch2_request_incompat_feature() now returns error code

For future usage, we'll want a dedicated error code for better
debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: Fix error type in bch2_alloc_v3_validate()
Thorsten Blum [Mon, 10 Mar 2025 19:20:29 +0000 (20:20 +0100)]
bcachefs: Fix error type in bch2_alloc_v3_validate()

Use error type alloc_v3_unpack_error in bch2_alloc_v3_validate().

Fixes: b65db750e2bb ("bcachefs: Enumerate fsck errors")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago bcachefs: BCH_SB_FEATURES_ALL includes BCH_FEATURE_incompat_verison_field
Kent Overstreet [Mon, 10 Mar 2025 18:20:58 +0000 (14:20 -0400)]
bcachefs: BCH_SB_FEATURES_ALL includes BCH_FEATURE_incompat_verison_field

These features are set on format and incompat upgrade.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago Documentation: bcachefs: SubmittingPatches: Convert footnotes to reST syntax
Bagas Sanjaya [Mon, 24 Feb 2025 12:40:28 +0000 (19:40 +0700)]
Documentation: bcachefs: SubmittingPatches: Convert footnotes to reST syntax

The footnote list is rendered in htmldocs as one long-running
paragraph. Use reST numbered footnote syntax instead.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks ago Documentation: bcachefs: SubmittingPatches: Demote section headings
Bagas Sanjaya [Mon, 24 Feb 2025 12:40:27 +0000 (19:40 +0700)]
Documentation: bcachefs: SubmittingPatches: Demote section headings

SubmittingPatches.rst has 4 section headings, all at the same heading
level. In the absence of a title heading, these section headings all
end up as title headings in the docs output, which also affects
the index toctree (increasing titles to 6 from the original 2)
due to the :numbered: option.

Demote the second through last section headings, making "Submitting
patches to bcachefs" the title heading.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agoDocumentation: bcachefs: Split index toctree
Bagas Sanjaya [Mon, 24 Feb 2025 12:40:26 +0000 (19:40 +0700)]
Documentation: bcachefs: Split index toctree

The bcachefs subsystem currently has 4 docs: two are development notes
and the other two are actual filesystem docs. These two groups are
clearly distinct and can be organized separately.

Split the toctree into two, one for each docs group. While at it, also
reduce :maxdepth: so that only title headings are listed in the
toctrees.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agoDocumentation: bcachefs: Add casefolding toctree entry
Bagas Sanjaya [Sat, 22 Feb 2025 09:18:53 +0000 (16:18 +0700)]
Documentation: bcachefs: Add casefolding toctree entry

Sphinx reports htmldocs toctree warning:

Documentation/filesystems/bcachefs/casefolding.rst: WARNING: document isn't included in any toctree

Fix the warning by adding a casefolding documentation entry to the
bcachefs toctree.

Fixes: bc5cc09246c5 ("bcachefs: bcachefs_metadata_version_casefolding")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20250221161728.32739f85@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agoDocumentation: bcachefs: casefolding: Use bullet list for dirent structure
Bagas Sanjaya [Sat, 22 Feb 2025 09:18:52 +0000 (16:18 +0700)]
Documentation: bcachefs: casefolding: Use bullet list for dirent structure

The doc lists the dirent structure for both regular and casefolded names,
yet it is written (and rendered) as one long paragraph.

Write the structure list as a bullet list.

Fixes: bc5cc09246c5 ("bcachefs: bcachefs_metadata_version_casefolding")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agoDocumentation: bcachefs: casefolding: Fix dentry/dcache considerations section
Bagas Sanjaya [Sat, 22 Feb 2025 09:18:51 +0000 (16:18 +0700)]
Documentation: bcachefs: casefolding: Fix dentry/dcache considerations section

Sphinx reports htmldocs warnings on dentry/dcache section:

Documentation/filesystems/bcachefs/casefolding.rst:75: WARNING: Title underline too short.

dentry/dcache considerations
--------- [docutils]
Documentation/filesystems/bcachefs/casefolding.rst:84: WARNING: Definition list ends without a blank line; unexpected unindent. [docutils]

Fix the section by:

* Extending the section underline to match the section title length;
* Separating problem list from surrounding paragraphs.

Fixes: bc5cc09246c5 ("bcachefs: bcachefs_metadata_version_casefolding")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20250221161911.2d16138b@canb.auug.org.au/
Closes: https://lore.kernel.org/linux-next/20250221162135.79be0147@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agoDocumentation: bcachefs: casefolding: Do not italicize NUL
Bagas Sanjaya [Sat, 22 Feb 2025 09:18:50 +0000 (16:18 +0700)]
Documentation: bcachefs: casefolding: Do not italicize NUL

Sphinx reports htmldocs warning:

Documentation/filesystems/bcachefs/casefolding.rst:36: WARNING: Inline interpreted text or phrase reference start-string without end-string. [docutils]

That's because the word NUL is italicized but written in plural form
(`NUL`s). Sphinx, however, doesn't trip over an italicized word written
in this fashion when it is followed by punctuation instead.

Do not italicize the word to keep Sphinx happy.

Fixes: bc5cc09246c5 ("bcachefs: bcachefs_metadata_version_casefolding")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20250221162135.79be0147@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: sysfs internal/trigger_btree_updates
Kent Overstreet [Thu, 13 Feb 2025 17:46:15 +0000 (12:46 -0500)]
bcachefs: sysfs internal/trigger_btree_updates

Add a debug knob to manually trigger the btree updates worker.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: bcachefs_metadata_version_casefolding
Joshua Ashton [Sun, 13 Aug 2023 17:34:17 +0000 (18:34 +0100)]
bcachefs: bcachefs_metadata_version_casefolding

This patch implements support for case-insensitive file name lookups
in bcachefs.

The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs use.

More information is provided in Documentation/bcachefs/casefolding.rst

Compatibility notes:

This uses the new versioning scheme for incompatible features where an
incompatible feature is tied to a version number: the superblock says
"we may use incompat features up to x" and "incompat features up to x
are in use", disallowing mounting by previous versions.

Additionally, an old-style incompat feature bit is used, so that
kernels without utf8 casefolding support know if casefolding
specifically is in use and they're allowed to mount.
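
As a rough illustration of the version-based scheme described above, here
is a minimal userspace sketch. The struct, the field names and both
helpers are hypothetical and only model the two superblock statements
("may use up to x" and "up to x are in use"); they are not the bcachefs
on-disk format or its API, and the choice of -EOPNOTSUPP is an
assumption.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the two superblock fields described above. */
struct sketch_sb {
	unsigned incompat_allowed;	/* "we may use incompat features up to x" */
	unsigned incompat_in_use;	/* "incompat features up to x are in use" */
};

/*
 * A kernel whose newest known metadata version is kernel_version can
 * mount only if it understands every incompat feature already in use.
 */
static int sketch_can_mount(const struct sketch_sb *sb, unsigned kernel_version)
{
	assert(sb->incompat_in_use <= sb->incompat_allowed);
	return sb->incompat_in_use <= kernel_version;
}

/*
 * Start using the incompat feature tied to 'version'.
 * Returns 0 on success, or a negative error code if the superblock
 * does not allow incompat features up to that version.
 */
static int sketch_request_incompat(struct sketch_sb *sb, unsigned version)
{
	if (version <= sb->incompat_in_use)
		return 0;		/* already in use, nothing to do */
	if (version > sb->incompat_allowed)
		return -EOPNOTSUPP;	/* assumed error code */
	sb->incompat_in_use = version;	/* older kernels can no longer mount */
	return 0;
}
```

In this model, raising incompat_in_use is the only step that locks out
older kernels; incompat_allowed merely records how far the filesystem
is permitted to go.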

Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: Split out dirent alloc and name initialization
Joshua Ashton [Sun, 13 Aug 2023 16:49:12 +0000 (17:49 +0100)]
bcachefs: Split out dirent alloc and name initialization

Split out the code that allocates the dirent and initializes the name,
to make casefolding easier to implement in a future commit.

Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: Kill dirent_occupied_size() in create path
Kent Overstreet [Thu, 20 Feb 2025 18:15:50 +0000 (13:15 -0500)]
bcachefs: Kill dirent_occupied_size() in create path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: Kill dirent_occupied_size() in rename path
Kent Overstreet [Thu, 20 Feb 2025 17:58:21 +0000 (12:58 -0500)]
bcachefs: Kill dirent_occupied_size() in rename path

Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: bcachefs_metadata_version_stripe_lru
Kent Overstreet [Sat, 8 Feb 2025 02:31:03 +0000 (21:31 -0500)]
bcachefs: bcachefs_metadata_version_stripe_lru

Add a persistent LRU for stripes, ordered by "number of empty blocks",
i.e. order in which we wish to reuse them.

This will replace the in-memory stripes heap, so we can kill off reading
stripes into memory at startup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: bcachefs_metadata_version_stripe_backpointers
Kent Overstreet [Fri, 7 Feb 2025 06:34:00 +0000 (01:34 -0500)]
bcachefs: bcachefs_metadata_version_stripe_backpointers

Stripes now have backpointers.

This is needed for proper scrub - stripe checksums need to be verified,
separately from extents within the stripe, since a block may not be full
of live extents but it's still needed for reconstruct.

And this will be needed for (efficient) evacuate/repair paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: Advance bch_alloc.oldest_gen if no stale pointers
Kent Overstreet [Sat, 8 Feb 2025 00:56:11 +0000 (19:56 -0500)]
bcachefs: Advance bch_alloc.oldest_gen if no stale pointers

Now that we've got cached backpointers and aren't leaving around stale
pointers on bucket invalidation, we no longer need the periodic (rare)
gc_gens - which recalculates each bucket's oldest gen to avoid wraparound.

We can't delete that code because we've got to support existing
filesystems that will still have stale pointers, but this gets rid of
another scalability limit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: Invalidate cached data by backpointers
Kent Overstreet [Fri, 7 Feb 2025 23:12:57 +0000 (18:12 -0500)]
bcachefs: Invalidate cached data by backpointers

If we don't leave stale pointers around, we won't have to deal with
bucket gen wraparound.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: bcachefs_metadata_version_cached_backpointers
Kent Overstreet [Fri, 7 Feb 2025 06:33:35 +0000 (01:33 -0500)]
bcachefs: bcachefs_metadata_version_cached_backpointers

Cached pointers now have backpointers.

This means that we'll be able to kill cached pointers in the
bucket_invalidate path, when invalidating/reusing buckets containing
cached data, instead of leaving them around to be cleaned up by gc_gens
garbage collection - which requires a full metadata scan.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: rework bch2_trans_commit_run_triggers()
Kent Overstreet [Tue, 11 Feb 2025 18:45:46 +0000 (13:45 -0500)]
bcachefs: rework bch2_trans_commit_run_triggers()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: Better trigger ordering
Kent Overstreet [Tue, 11 Feb 2025 15:09:31 +0000 (10:09 -0500)]
bcachefs: Better trigger ordering

Transactional triggers need to run in a defined ordering, which is not
quite the same as btree ID integer comparison.

Previously this was handled in a hacky way in
bch2_trans_commit_run_triggers(), since it was only the alloc btree that
needed special handling, but upcoming stripe btree changes are going to
require more ordering changes - so, define that ordering.

Next patch will change the transaction commit path to use it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: bch2_trigger_stripe_ptr() no longer uses ec_stripes_heap_lock
Kent Overstreet [Tue, 11 Feb 2025 01:32:37 +0000 (20:32 -0500)]
bcachefs: bch2_trigger_stripe_ptr() no longer uses ec_stripes_heap_lock

Introduce per-entry locks, like with struct bucket - the stripes heap is
going away.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: Rework bch2_check_lru_key()
Kent Overstreet [Mon, 10 Feb 2025 23:48:12 +0000 (18:48 -0500)]
bcachefs: Rework bch2_check_lru_key()

It's now easier to add new LRU types.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: decouple bch2_lru_check_set() from alloc btree
Kent Overstreet [Mon, 10 Feb 2025 23:42:45 +0000 (18:42 -0500)]
bcachefs: decouple bch2_lru_check_set() from alloc btree

Pass in the backpointer explicitly, instead of assuming 'referring_k' is
an alloc key and calculating it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/
Kent Overstreet [Mon, 10 Feb 2025 23:39:50 +0000 (18:39 -0500)]
bcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/

FRAGMENTATION_START was incorrect: there's currently only one
fragmentation LRU (at the end of the reserved bits for LRU type), and
we're getting ready to add a stripe fragmentation lru - so give it a
better name.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: bch2_lru_change() checks for no-op
Kent Overstreet [Mon, 10 Feb 2025 23:37:50 +0000 (18:37 -0500)]
bcachefs: bch2_lru_change() checks for no-op

Minor cleanup, no reason for the caller to have to do this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: minor journal errcode cleanup
Kent Overstreet [Wed, 12 Feb 2025 14:47:39 +0000 (09:47 -0500)]
bcachefs: minor journal errcode cleanup

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: bch2_write_op_error() now prints info about data update
Kent Overstreet [Mon, 10 Feb 2025 22:04:08 +0000 (17:04 -0500)]
bcachefs: bch2_write_op_error() now prints info about data update

A user has been seeing the "error verifying existing checksum while
rewriting existing data (memory corruption?)" error.

This generally indicates a hardware issue (and that may be the case
here), but it might also indicate a bug, in which case we need more
information to look for patterns.

Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: metadata_target is not an inode option
Kent Overstreet [Mon, 10 Feb 2025 16:55:33 +0000 (11:55 -0500)]
bcachefs: metadata_target is not an inode option

This option only applies filesystem wide.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: eytzinger1_{next,prev} cleanup
Andreas Gruenbacher [Tue, 28 Jan 2025 17:24:15 +0000 (18:24 +0100)]
bcachefs: eytzinger1_{next,prev} cleanup

The eytzinger code was previously relying on the following wrap-around
properties and their "eytzinger0" equivalents:

  eytzinger1_prev(0, size) == eytzinger1_last(size)
  eytzinger1_next(0, size) == eytzinger1_first(size)

However, these properties are no longer relied upon and no longer
necessary, so remove the corresponding asserts and forbid the use of
eytzinger1_prev(0, size) and eytzinger1_next(0, size).

This allows us to further simplify the code in eytzinger1_next() and
eytzinger1_prev(): where the left shifting happens, eytzinger1_next() is
trying to move i to the lowest child on the left, which is equivalent to
doubling i until the next doubling would cause it to be greater than
size.  This is implemented by shifting i to the left so that the most
significant bits align and then shifting i to the right by one if the
result is greater than size.

Likewise, eytzinger1_prev() is trying to move to the lowest child on the
right; the same applies here.

The 1-offset in (size - 1) in eytzinger1_next() isn't needed at all, but
the equivalent offset in eytzinger1_prev() is surprisingly needed to
preserve the 'eytzinger1_prev(0, size) == eytzinger1_last(size)'
property.  However, since we no longer support that property, we can get
rid of these offsets as well.  This saves one addition in each function
and makes the code less confusing.
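
The simplified successor step can be sketched in userspace roughly as
follows. This is an illustration only, not the kernel code: the
sketch_* names are made up, the GCC/Clang builtins stand in for the
kernel's bit helpers, and small sizes are assumed so that the shifts
stay within the width of unsigned.

```c
#include <assert.h>

/* Index of the most significant set bit; x must be non-zero. */
static unsigned sketch_fls(unsigned x)
{
	return 31u - (unsigned)__builtin_clz(x);
}

/* Leftmost node of a 1-based eytzinger tree: largest power of two <= size. */
static unsigned sketch_eytz1_first(unsigned size)
{
	return size ? 1u << sketch_fls(size) : 0;
}

/*
 * In-order successor of node i (1 <= i <= size); returns 0 past the end.
 * Passing i == 0 is forbidden, as in the commit message above.
 */
static unsigned sketch_eytz1_next(unsigned i, unsigned size)
{
	assert(i >= 1 && i <= size);

	if (2 * i + 1 <= size) {
		/* Go to the right child, then to its lowest left descendant. */
		i = 2 * i + 1;
		/* Align the most significant bits of i and size ... */
		i <<= sketch_fls(size) - sketch_fls(i);
		/* ... and undo one doubling if that overshot. */
		i >>= (i > size);
	} else {
		/* No right child: strip trailing 1 bits plus one more bit. */
		i >>= __builtin_ctz(~i) + 1;
	}
	return i;
}

/* Reference recursive in-order walk over nodes 1..size. */
static unsigned sketch_inorder(unsigned i, unsigned size, unsigned *out, unsigned n)
{
	if (i > size)
		return n;
	n = sketch_inorder(2 * i, size, out, n);
	out[n++] = i;
	return sketch_inorder(2 * i + 1, size, out, n);
}

/* Returns 1 if first/next visit the nodes in in-order (size in 1..64). */
static int sketch_next_matches_inorder(unsigned size)
{
	unsigned expect[64], n, k = 0;

	if (size == 0 || size > 64)
		return 0;
	n = sketch_inorder(1, size, expect, 0);
	for (unsigned i = sketch_eytz1_first(size); i; i = sketch_eytz1_next(i, size))
		if (k >= n || expect[k++] != i)
			return 0;
	return k == n;
}
```

For size 7 the walk visits nodes 4, 2, 5, 1, 6, 3, 7 and then returns 0.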

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: convert eytzinger sort to be 1-based (2)
Andreas Gruenbacher [Mon, 27 Jan 2025 19:54:52 +0000 (20:54 +0100)]
bcachefs: convert eytzinger sort to be 1-based (2)

In this second step, transform the eytzinger indexes i, j, and k in
eytzinger1_sort_r() from 0-based to 1-based.  This step looks a bit
messy, but the resulting code is slightly better.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: convert eytzinger sort to be 1-based (1)
Andreas Gruenbacher [Wed, 27 Nov 2024 12:26:10 +0000 (13:26 +0100)]
bcachefs: convert eytzinger sort to be 1-based (1)

In this first step, convert the eytzinger sort functions to use 1-based
primitives.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: convert eytzinger0_find to be 1-based
Andreas Gruenbacher [Tue, 28 Jan 2025 09:56:37 +0000 (10:56 +0100)]
bcachefs: convert eytzinger0_find to be 1-based

Several of the algorithms on eytzinger trees are implemented in terms of
the eytzinger0 primitives.  However, those algorithms can just as easily
be expressed in terms of the eytzinger1 primitives, and that leads to
better and easier to understand code.  Start by converting
eytzinger0_find().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: Add eytzinger0_find self test
Andreas Gruenbacher [Sat, 1 Feb 2025 12:55:46 +0000 (13:55 +0100)]
bcachefs: Add eytzinger0_find self test

Function eytzinger0_find() isn't currently covered, so add a self test.

We can rely on eytzinger0_find_le() here because it is being
tested independently.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: add eytzinger0_find_ge self test
Andreas Gruenbacher [Mon, 27 Jan 2025 16:15:36 +0000 (17:15 +0100)]
bcachefs: add eytzinger0_find_ge self test

Add an eytzinger0_find_ge() self test similar to eytzinger0_find_gt().

Note that this test requires eytzinger0_find_ge() to return the first
matching element in the array in case of duplicates.  To prevent
bisection errors, we only add this test after strengthening the original
implementation (see the previous commit).

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: implement eytzinger0_find_ge directly
Andreas Gruenbacher [Mon, 27 Jan 2025 16:52:39 +0000 (17:52 +0100)]
bcachefs: implement eytzinger0_find_ge directly

Implement eytzinger0_find_ge() directly instead of implementing it in
terms of eytzinger0_find_le() and adjusting the result.

This turns eytzinger0_find_ge() into a minimum search, so when there are
duplicate elements, the result of eytzinger0_find_ge() will now always
point at the first matching element.
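
A minimal sketch of such a direct minimum search, specialized to an int
array for brevity. The function name and signature are illustrative
assumptions (the kernel helper is generic over element size and takes a
comparison callback), and __builtin_ctz is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

/*
 * base holds nr ints in 0-based eytzinger order.
 * Returns the 0-based index of the first element >= search in sorted
 * order, or -1 if every element is smaller.
 */
static int sketch_eytz0_find_ge(const int *base, unsigned nr, int search)
{
	unsigned n = 1;		/* 1-based node index; element is base[n - 1] */

	assert(base != NULL || nr == 0);

	/* Descend: go right (append a 1 bit) if element < search, else left. */
	while (n <= nr)
		n = 2 * n + (base[n - 1] < search);

	/*
	 * The answer is the node at which we last went left: strip the
	 * trailing 1 bits (right turns) and the 0 bit of that left turn.
	 * If we never went left, this yields 0, i.e. "not found".
	 */
	n >>= __builtin_ctz(~n) + 1;

	return (int)n - 1;
}
```

Because the search keeps going left on equal elements, duplicates
resolve to the first match: for the eytzinger array
{20, 20, 40, 10, 20, 30, 50} (sorted: 10 20 20 20 30 40 50), searching
for 20 returns index 1, which is the first 20 in sorted order.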

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: implement eytzinger0_find_gt directly
Andreas Gruenbacher [Mon, 27 Jan 2025 16:52:39 +0000 (17:52 +0100)]
bcachefs: implement eytzinger0_find_gt directly

Instead of implementing eytzinger0_find_gt() in terms of
eytzinger0_find_le() and adjusting the result, implement it directly.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: add eytzinger0_find_gt self test
Andreas Gruenbacher [Mon, 27 Jan 2025 16:05:21 +0000 (17:05 +0100)]
bcachefs: add eytzinger0_find_gt self test

Add an eytzinger0_find_gt() self test similar to eytzinger0_find_le().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: simplify eytzinger0_find_le
Andreas Gruenbacher [Mon, 27 Jan 2025 13:33:20 +0000 (14:33 +0100)]
bcachefs: simplify eytzinger0_find_le

Replace the over-complicated implementation of eytzinger0_find_le() with
an equivalent, simpler version.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: convert eytzinger0_find_le to be 1-based
Andreas Gruenbacher [Tue, 28 Jan 2025 09:56:04 +0000 (10:56 +0100)]
bcachefs: convert eytzinger0_find_le to be 1-based

eytzinger0_find_le() is also easy to convert to 1-based eytzinger (but
see the next commit).

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: improve eytzinger0_find_le self test
Andreas Gruenbacher [Sun, 26 Jan 2025 16:57:06 +0000 (17:57 +0100)]
bcachefs: improve eytzinger0_find_le self test

Rename eytzinger0_find_test_val() to eytzinger0_find_test_le() and add a
new eytzinger0_find_test_val() wrapper that calls it.

We have already established that the array is sorted in eytzinger order,
so we can use the eytzinger iterator functions and check the boundary
conditions to verify the result of eytzinger0_find_le().

Only scan the entire array if we get an incorrect result.  When we need
to scan, use eytzinger0_for_each_prev() so that we'll stop at the
highest matching element in the array in case there are duplicates;
going through the array linearly wouldn't give us that.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: add eytzinger0_for_each_prev
Andreas Gruenbacher [Mon, 27 Jan 2025 16:26:05 +0000 (17:26 +0100)]
bcachefs: add eytzinger0_for_each_prev

Add an eytzinger0_for_each_prev() macro for iterating through an
eytzinger array in reverse.
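
Reverse iteration can be sketched like this, using 1-based indexes (the
corresponding 0-based index is simply the 1-based index minus one). The
rev_* names are hypothetical, the builtins are GCC/Clang specific, and
small sizes are assumed so that the shifts stay in range; the actual
macro may well be shaped differently.

```c
#include <assert.h>

/* Index of the most significant set bit; x must be non-zero. */
static unsigned rev_fls(unsigned x)
{
	return 31u - (unsigned)__builtin_clz(x);
}

/* Rightmost node of a 1-based eytzinger tree with 'size' nodes. */
static unsigned rev_eytz1_last(unsigned size)
{
	return size ? (1u << rev_fls(size + 1)) - 1 : 0;
}

/* In-order predecessor of node i (1 <= i <= size); 0 before the start. */
static unsigned rev_eytz1_prev(unsigned i, unsigned size)
{
	assert(i >= 1 && i <= size);

	if (2 * i <= size) {
		/*
		 * Go to the left child 2i, then to its lowest right
		 * descendant, which is (2i + 1) * 2^d - 1 at depth d.
		 */
		i = 2 * i + 1;
		i <<= rev_fls(size) - rev_fls(i);
		i -= 1;
		/* Step back up one level if the deepest level overshot. */
		i >>= (i > size);
	} else {
		/* No left child: strip trailing 0 bits plus one more bit. */
		i >>= __builtin_ctz(i) + 1;
	}
	return i;
}

/* Visit all nodes from the largest element down to the smallest. */
#define rev_eytz1_for_each_prev(_i, _size)			\
	for (unsigned _i = rev_eytz1_last(_size);		\
	     _i != 0;						\
	     _i = rev_eytz1_prev(_i, _size))

/* Store the reverse visiting order in out[]; returns the node count. */
static unsigned rev_collect(unsigned size, unsigned *out)
{
	unsigned n = 0;

	rev_eytz1_for_each_prev(i, size)
		out[n++] = i;
	return n;
}
```

For size 5 this visits nodes 3, 1, 5, 2, 4, which is the reverse of
the in-order sequence 4, 2, 5, 1, 3.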

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: eytzinger0_find_test improvement
Andreas Gruenbacher [Sun, 26 Jan 2025 10:22:33 +0000 (11:22 +0100)]
bcachefs: eytzinger0_find_test improvement

In eytzinger0_find_test(), remember the smallest element seen so far
instead of comparing adjacent array elements.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: eytzinger[01]_test improvement
Andreas Gruenbacher [Sun, 26 Jan 2025 10:28:59 +0000 (11:28 +0100)]
bcachefs: eytzinger[01]_test improvement

In eytzinger[01]_test(), make sure that eytzinger[01]_for_each()
iterates over all array elements.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: eytzinger self tests: fix cmp_u16 typo
Andreas Gruenbacher [Tue, 26 Nov 2024 22:33:55 +0000 (23:33 +0100)]
bcachefs: eytzinger self tests: fix cmp_u16 typo

Fix an obvious typo in cmp_u16().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: eytzinger self tests: missing newline termination
Andreas Gruenbacher [Tue, 26 Nov 2024 20:55:49 +0000 (21:55 +0100)]
bcachefs: eytzinger self tests: missing newline termination

pr_info() format strings need to be newline terminated.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: eytzinger self tests: loop cleanups
Andreas Gruenbacher [Tue, 26 Nov 2024 11:12:36 +0000 (12:12 +0100)]
bcachefs: eytzinger self tests: loop cleanups

The iterator variable of eytzinger0_for_each() loops has been changed to
be locally scoped at some point, so remove variables defined outside the
loop that are now unused.  In addition and for clarity, use a different
variable inside those loops where an outside variable would be shadowed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: EYTZINGER_DEBUG fix
Andreas Gruenbacher [Tue, 28 Jan 2025 00:39:23 +0000 (01:39 +0100)]
bcachefs: EYTZINGER_DEBUG fix

When EYTZINGER_DEBUG is defined, <linux/bug.h> needs to be included.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: bch2_blacklist_entries_gc cleanup
Andreas Gruenbacher [Tue, 28 Jan 2025 09:32:47 +0000 (10:32 +0100)]
bcachefs: bch2_blacklist_entries_gc cleanup

Use an eytzinger0_for_each() loop here.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: bch2_bkey_ptr_data_type() now correctly returns cached for cached ptrs
Kent Overstreet [Fri, 7 Feb 2025 21:58:34 +0000 (16:58 -0500)]
bcachefs: bch2_bkey_ptr_data_type() now correctly returns cached for cached ptrs

Necessary for adding backpointers for cached pointers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 weeks agobcachefs: Add time_stat for btree writes
Kent Overstreet [Mon, 27 Jan 2025 06:22:42 +0000 (01:22 -0500)]
bcachefs: Add time_stat for btree writes

We have other metadata IO types covered; this was missing.

Note: this includes the time until completion, i.e. including parent
pointer update.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>