Kent Overstreet [Wed, 25 Sep 2024 20:46:06 +0000 (16:46 -0400)]
bcachefs: Fix accounting read + device removal
accounting read was checking if accounting replicas entries were marked
in the superblock prior to applying accounting from the journal,
which meant that a recently removed device could spuriously trigger a
"not marked in superblocked" error (when journal entries zero out the
offending counter).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 24 Sep 2024 02:32:58 +0000 (22:32 -0400)]
bcachefs: fix transaction restart handling in check_extents(), check_dirents()
Dealing with outside state within a btree transaction is always tricky.
check_extents() and check_dirents() have to accumulate counters for
i_sectors and i_nlink (for subdirectories). There were two bugs:
- transaction commit may return a restart; therefore we have to commit
before accumulating to those counters
- get_inode_all_snapshots() may return a transaction restart, before
updating w->last_pos; then, on the restart,
check_i_sectors()/check_subdir_count() would see inodes that were not
for w->last_pos
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Hongbo Li [Tue, 24 Sep 2024 01:41:46 +0000 (09:41 +0800)]
bcachefs: fix the memory leak in exception case
The pointer clean points the memory allocated by kmemdup, when the
return value of bch2_sb_clean_validate_late is not zero. The memory
pointed by clean is leaked. So we should free it in this case.
Fixes: a37ad1a3aba9 ("bcachefs: sb-clean.c") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Hongbo Li [Tue, 24 Sep 2024 01:42:24 +0000 (09:42 +0800)]
bcachefs: fast exit when darray_make_room failed
In downgrade_table_extra, the return value is needed. When it
return failed, we should exit immediately.
Fixes: 7773df19c35f ("bcachefs: metadata version bucket_stripe_sectors") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs: assign return error when iterating through layout
syzbot reported a null ptr deref in __copy_user [0]
In __bch2_read_super, when a corrupt backup superblock matches the
default opts offset, no error is assigned to ret and the freed superblock
gets through, possibly being assigned as the best sb in bch2_fs_open and
being later dereferenced, causing a fault. Assign EINVALID to ret when
iterating through layout.
Piotr Zalewski [Sun, 22 Sep 2024 15:18:01 +0000 (15:18 +0000)]
bcachefs: memset bounce buffer portion to 0 after key_sort_fix_overlapping
Zero-initialize part of allocated bounce buffer which wasn't touched by
subsequent bch2_key_sort_fix_overlapping to mitigate later uinit-value
use KMSAN bug[1].
After applying the patch reproducer still triggers stack overflow[2] but
it seems unrelated to the uninit-value use warning. After further
investigation it was found that stack overflow occurs because KMSAN adds
too many function calls[3]. Backtrace of where the stack magic number gets
smashed was added as a reply to syzkaller thread[3].
It was confirmed that task's stack magic number gets smashed after the code
path where KSMAN detects uninit-value use is executed, so it can be assumed
that it doesn't contribute in any way to uninit-value use detection.
Kent Overstreet [Mon, 23 Sep 2024 21:30:59 +0000 (17:30 -0400)]
bcachefs: Add extra padding in bkey_make_mut_noupdate()
This fixes a kasan splat in propagate_key_to_snapshot_leaves() -
varint_decode_fast() does reads (that it never uses) up to 7 bytes past
the end of the integer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 23 Sep 2024 20:39:49 +0000 (16:39 -0400)]
bcachefs: Fix infinite loop in propagate_key_to_snapshot_leaves()
As we iterate we need to mark that we no longer need iterators -
otherwise we'll infinite loop via the "too many iters" check when
there's many snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ahmed Ehab [Sat, 21 Sep 2024 21:00:36 +0000 (00:00 +0300)]
bcachefs: Hold read lock in bch2_snapshot_tree_oldest_subvol()
Syzbot reports a problem that a warning is triggered due to suspicious
use of rcu_dereference_check(). That is triggered by a call of
bch2_snapshot_tree_oldest_subvol().
The cause of the warning is that inside
bch2_snapshot_tree_oldest_subvol(), snapshot_t() is called which calls
rcu_dereference() that requires a read lock to be held. Also, the call
of bch2_snapshot_tree_next() eventually calls snapshot_t().
To fix this, call rcu_read_lock() before calling snapshot_t(). Then,
release the lock after the termination of the while loop.
Reported-by: <syzbot+f7c41a878676b72c16a6@syzkaller.appspotmail.com> Signed-off-by: Ahmed Ehab <bottaawesome633@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs: return err ptr instead of null in read sb clean
syzbot reported a null-ptr-deref in bch2_fs_start. [0]
When a sb is marked clear but doesn't have a clean section
bch2_read_superblock_clean returns NULL which PTR_ERR_OR_ZERO
lets through, eventually leading to a null ptr dereference down
the line. Adjust read sb clean to return an ERR_PTR indicating the
invalid clean section.
Yang Li [Mon, 9 Sep 2024 00:58:02 +0000 (08:58 +0800)]
bcachefs: Remove duplicated include in backpointers.c
The header files bbpos.h is included twice in backpointers.c,
so one inclusion of each can be removed.
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=10783 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 6 Sep 2024 23:14:36 +0000 (19:14 -0400)]
bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices
This factors out ec_strie_head_devs_update(), which initializes the
bitmap of devices we're allocating from, and runs it every time
c->rw_devs_change_count changes.
We also cancel pending, not allocated stripes, since they may refer to
devices that are no longer available.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 6 Sep 2024 23:12:53 +0000 (19:12 -0400)]
bcachefs: bch_fs.rw_devs_change_count
Add a counter that's incremented whenever rw devices change; this will
be used for erasure coding so that it can keep ec_stripe_head in sync
and not deadlock on a new stripe when a device it wants goes away.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 1 Sep 2024 18:54:42 +0000 (14:54 -0400)]
bcachefs: bch_stripe.disk_label
When reshaping existing stripes, we should keep them on the same target
that they were allocated on; to do this, we need to add a field to the
btree stripe type.
This is a tad awkward, because we only have 8 bits left, and targets are
16 bits - but we only need to store a label, not a full target.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 5 Sep 2024 00:49:37 +0000 (20:49 -0400)]
bcachefs: Rework btree node pinning
In backpointers fsck, we do a seqential scan of one btree, and check
references to another: extents <-> backpointers
Checking references generates random lookups, so we want to pin that
btree in memory (or only a range, if it doesn't fit in ram).
Previously, this was done with a simple check in the shrinker - "if
btree node is in range being pinned, don't free it" - but this generated
OOMs, as our shrinker wasn't well behaved if there was less memory
available than expected.
Instead, we now have two different shrinkers and lru lists; the second
shrinker being for pinned nodes, with seeks set much higher than normal
- so they can still be freed if necessary, but we'll prefer not to.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Hongbo Li [Wed, 4 Sep 2024 07:15:32 +0000 (15:15 +0800)]
bcachefs: Fix compilation error for bch2_sb_member_alloc
Fix the following compilation error:
```
fs/bcachefs/sb-members.c: In function ‘bch2_sb_member_alloc’:
fs/bcachefs/sb-members.c:508:2: error: a label can only be part of a statement and a declaration is not a statement
508 | unsigned nr_devices = max_t(unsigned, dev_idx + 1, c->sb.nr_devices);
```
Fixes: a7d364a133c7 ("bcachefs: bch2_sb_member_alloc()") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 2 Sep 2024 02:39:42 +0000 (22:39 -0400)]
bcachefs: Options for recovery_passes, recovery_passes_exclude
This adds mount options for specifying recovery passes to run, or
exclude; the immediate need for this is that backpointers fsck is having
trouble completing, so we need a way to skip it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 4 Sep 2024 19:30:48 +0000 (15:30 -0400)]
bcachefs: Use mm_account_reclaimed_pages() when freeing btree nodes
When freeing in a shrinker callback, we need to notify memory reclaim,
so it knows forward progress has been made.
Normally this is done in e.g. slab code, but we're not freeing through
slab - or rather we are, but these allocations are big, and use the
kmalloc_large() path.
This is really a bug in the slub code, but we're working around it here
for now.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 1 Sep 2024 21:32:22 +0000 (17:32 -0400)]
bcachefs: BCH_WRITE_ALLOC_NOWAIT no longer applies to open bucket allocation
rebalance writes must be BCH_WRITE_ALLOC_NOWAIT because they don't
allocate from the full filesystem - but we don't want spurious
allocation failures due to open buckets.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Thorsten Blum [Mon, 26 Aug 2024 10:11:36 +0000 (12:11 +0200)]
bcachefs: Annotate bch_replicas_entry_{v0,v1} with __counted_by()
Add the __counted_by compiler attribute to the flexible array members
devs to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Increment nr_devs before adding a new device to the devs array and
adjust the array indexes accordingly. Add a helper macro for adding a
new device.
In bch2_journal_read(), explicitly set nr_devs to 0.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
ll /mnt/bcachefs
total 2
drwx------. 2 root root 0 Jun 14 14:10 lost+found
-rw-r--r--. 1 root root 1945 Jun 14 14:12 profile
ll /mnt/idmapped1/
total 2
drwx------. 2 1000 1000 0 Jun 14 14:10 lost+found
-rw-r--r--. 1 1000 1000 1945 Jun 14 14:12 profile
Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Thorsten Blum [Sat, 24 Aug 2024 13:57:41 +0000 (15:57 +0200)]
bcachefs: Annotate struct bch_xattr with __counted_by()
Add the __counted_by compiler attribute to the flexible array member
x_name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Alan Huang [Thu, 15 Aug 2024 15:40:53 +0000 (23:40 +0800)]
bcachefs: Refactor bch2_bset_fix_lookup_table
bch2_bset_fix_lookup_table is too complicated to be easily understood,
the comment "l now > where" there is also incorrect when where ==
t->end_offset. This patch therefore refactor the function, the idea is
that when where >= rw_aux_tree(b, t)[t->size - 1].offset, we don't need
to adjust the rw aux tree.
Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 30 Jun 2024 13:25:56 +0000 (09:25 -0400)]
bcachefs: Assert that we don't lock nodes when !trans->locked
We rely on the trans->locked to know if a trans has nodes locked for
assertions about deadlocks; there can't be more than one trans in the
same process that is locked.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Matthew Wilcox (Oracle) [Tue, 20 Aug 2024 04:10:11 +0000 (05:10 +0100)]
bcachefs: Do not check folio_has_private()
folio_has_private() is an attractive nuisance; filesystem authors
generally don't realise that it actually checks two flags (one of which
is never set by bcachefs). There's no need to check the private flag at
all; for folios owned by bcachefs, we know that folio->private is NULL
when the private flag is clear and non-NULL when the private flag is set.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 19 Aug 2024 19:11:20 +0000 (15:11 -0400)]
bcachefs: Drop memalloc_nofs_save() in bch2_btree_node_mem_alloc()
It's really not needed: the only locks used here are the btree cache
lock, which we drop for GFP_WAIT allocations, and btree node locks - but
we also drop those for GFP_WAIT allocations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Youling Tang [Thu, 15 Aug 2024 08:57:43 +0000 (16:57 +0800)]
bcachefs: drop unused posix acl handlers
Remove struct nop_posix_acl_{access,default} for bcachefs filesystem
that don't depend on the xattr handler in their inode->i_op->listxattr()
method in any way. There's nothing more to do than to simply remove the
handler. It's been effectively unused ever since we introduced the new
posix acl api. See [1] for details.
Link [1]: https://patchwork.kernel.org/project/linux-fsdevel/cover/20230125-fs-acl-remove-generic-xattr-handlers-v3-0-f760cc58967d@kernel.org/
Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Alan Huang [Mon, 12 Aug 2024 08:06:09 +0000 (16:06 +0800)]
bcachefs: Minimize the search range used to calculate the mantissa
When the search key's mantissa is larger than the node i's, we know that
the search key is larger than the first key of the cacheline corresponding
to node i, so that when we are calculating the mantissa of right side
nodes of node i, the left side of the search range can be the first key
of node i. Once the search range is minimized, the mantissa we are
calculating can have more useful bits, thus reduce the slow path
comparison. Besides, we can now remove all the prev array stuff.
Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The macro allocate_dropping_locks accepts a parameter _trans,
but it was not used, rather the variable trans was directly used,
which may be a local variable inside a function that calls the macros.
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The macro allocate_dropping_locks_errocode accepts a parameter _trans,
but it was not used, rather the variable trans was directly used,
which may be a local variable inside a function that calls the macros.
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>