]> www.infradead.org Git - users/griffoul/linux.git/log
users/griffoul/linux.git
2 years agobcachefs: Dump journal state when we get stuck
Kent Overstreet [Wed, 24 Feb 2021 06:16:49 +0000 (01:16 -0500)]
bcachefs: Dump journal state when we get stuck

We had a bug reported where the journal is failing to allocate a journal
write - this should help figure out what's going on.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix a 64 bit divide on 32 bit
Kent Overstreet [Sat, 20 Feb 2021 10:05:18 +0000 (05:05 -0500)]
bcachefs: Fix a 64 bit divide on 32 bit

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Don't use inode btree key cache in fsck code
Kent Overstreet [Mon, 8 Mar 2021 02:43:21 +0000 (21:43 -0500)]
bcachefs: Don't use inode btree key cache in fsck code

We had a cache coherency bug with the btree key cache in the fsck code -
this fixes fsck to be consistent about not using it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Don't call into journal reclaim when we're not supposed to
Kent Overstreet [Mon, 8 Mar 2021 00:04:16 +0000 (19:04 -0500)]
bcachefs: Don't call into journal reclaim when we're not supposed to

This was causing a deadlock when btree_update_nodes_writtes() invokes
journal reclaim because of the btree cache being too dirty.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Create allocator threads when allocating filesystem
Kent Overstreet [Fri, 5 Mar 2021 23:00:55 +0000 (18:00 -0500)]
bcachefs: Create allocator threads when allocating filesystem

We're seeing failures to mount because of a failure to start the
allocator threads, which currently happens fairly late in the mount
process, after walking all metadata, and kthread_create() fails if
something has tried to kill the mount process, which is probably not
what we want.

This patch avoids this issue by creating, but not starting, the
allocator threads when we preallocate all of our other in memory data
structures.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix for bch2_btree_node_get_noiter() returning -ENOMEM
Kent Overstreet [Wed, 24 Feb 2021 02:41:25 +0000 (21:41 -0500)]
bcachefs: Fix for bch2_btree_node_get_noiter() returning -ENOMEM

bch2_btree_node_get_noiter() isn't used from the btree iterator code,
which retries with the btree node cache cannibalize lock held on
-ENOMEM, so we should do it ourself if necessary.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add error message for some allocation failures
Kent Overstreet [Tue, 23 Feb 2021 20:16:41 +0000 (15:16 -0500)]
bcachefs: Add error message for some allocation failures

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Extents may now cross btree node boundaries
Kent Overstreet [Wed, 10 Feb 2021 21:13:57 +0000 (16:13 -0500)]
bcachefs: Extents may now cross btree node boundaries

When snapshots arrive, we won't necessarily be able to arbitrarily split
existis - when we need to split an existing extent, we'll have to check
if the extent was overwritten in child snapshots and if so emit a
whiteout for the split in the child snapshot.

Because extents couldn't span btree nodes previously, journal replay
would sometimes have to split existing extents. That's no good anymore,
but fortunately since extent handling has already been lifted above most
of the btree code there's no real need for that rule anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: iter->real_pos
Kent Overstreet [Fri, 12 Feb 2021 02:57:32 +0000 (21:57 -0500)]
bcachefs: iter->real_pos

We need to differentiate between the search position of a btree
iterator, vs. what it actually points at (what we found). This matters
for extents, where iter->pos will typically be the start of the key we
found and iter->real_pos will be the end of the key we found (which soon
won't necessarily be in the same btree node!) and it will also matter
for snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Ensure btree iterators are traversed in bch2_trans_commit()
Kent Overstreet [Wed, 10 Mar 2021 00:37:40 +0000 (19:37 -0500)]
bcachefs: Ensure btree iterators are traversed in bch2_trans_commit()

The upcoming patch to allow extents to span btree nodes will require
this... and this assertion seems to be popping, and it's not a very good
assertion anyways.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Drop invalid stripe ptrs in fsck
Kent Overstreet [Wed, 17 Feb 2021 18:37:22 +0000 (13:37 -0500)]
bcachefs: Drop invalid stripe ptrs in fsck

More repair code, now that we can repair extents during initial gc.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix unnecessary read amplificaiton when allocating ec stripes
Robbie Litchfield [Wed, 10 Feb 2021 00:18:13 +0000 (13:18 +1300)]
bcachefs: Fix unnecessary read amplificaiton when allocating ec stripes

When allocating an erasure coding stripe, bcachefs will always reuse any
partial stripes before reserving a new stripe. This causes unnecessary
read amplification when preparing a stripe for writing. This patch changes
bcachefs to always reserve new stripes first, only relying on stripe reuse
when copygc needs more time to empty buckets from existing stripes.

Signed-off-by: Robbie Litchfield <blam.kiwi@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fsck fixes
Kent Overstreet [Sat, 13 Feb 2021 01:53:29 +0000 (20:53 -0500)]
bcachefs: Fsck fixes

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix a shift greater than type size
Kent Overstreet [Thu, 11 Feb 2021 19:49:36 +0000 (14:49 -0500)]
bcachefs: Fix a shift greater than type size

Found by UBSAN

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Assert that we're not trying to flush journal seq in the future
Kent Overstreet [Wed, 10 Feb 2021 18:39:48 +0000 (13:39 -0500)]
bcachefs: Assert that we're not trying to flush journal seq in the future

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix bch2_btree_iter_peek_prev()
Kent Overstreet [Mon, 8 Feb 2021 02:11:49 +0000 (21:11 -0500)]
bcachefs: Fix bch2_btree_iter_peek_prev()

This makes bch2_btree_iter_peek_prev() and bch2_btree_iter_prev()
consistent with peek() and next(), w.r.t. iter->pos.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: bch2_btree_iter_advance_pos()
Kent Overstreet [Mon, 8 Feb 2021 02:28:58 +0000 (21:28 -0500)]
bcachefs: bch2_btree_iter_advance_pos()

This adds a new common helper for advancing past the last key returned
by peek().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Kill bch2_btree_iter_set_pos_same_leaf()
Kent Overstreet [Mon, 8 Feb 2021 01:16:21 +0000 (20:16 -0500)]
bcachefs: Kill bch2_btree_iter_set_pos_same_leaf()

The only reason we were keeping this around was for
BTREE_INSERT_NOUNLOCK semantics - if bch2_btree_iter_set_pos() advances
to the next leaf node, it'll drop the lock on the node that we just
inserted to.

But we don't rely on BTREE_INSERT_NOUNLOCK semantics for the extents
btree, just the inodes btree, and if we do need it for the extents btree
in the future we can do it more cleanly by cloning the iterator - this
lets us delete some special cases in the btree iterator code, which is
complicated enough as it is.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Simplify btree_iter_(next|prev)_leaf()
Kent Overstreet [Sun, 7 Feb 2021 23:52:13 +0000 (18:52 -0500)]
bcachefs: Simplify btree_iter_(next|prev)_leaf()

There's no good reason for these functions to not be using
bch2_btree_iter_set_pos().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix for hash_redo_key() in fsck
Kent Overstreet [Wed, 10 Feb 2021 00:54:40 +0000 (19:54 -0500)]
bcachefs: Fix for hash_redo_key() in fsck

It's possible we're calling hash_redo_key() because of a duplicate key -
easiest fix for that is to just not use BCH_HASH_SET_MUST_CREATE.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add flushed_seq_ondisk to journal_debug_to_text()
Kent Overstreet [Wed, 10 Feb 2021 00:54:04 +0000 (19:54 -0500)]
bcachefs: Add flushed_seq_ondisk to journal_debug_to_text()

Also, make the wait in bch2_journal_flush_seq() interruptible, not just
killable.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Redo checks for sufficient devices
Kent Overstreet [Sun, 7 Feb 2021 04:17:26 +0000 (23:17 -0500)]
bcachefs: Redo checks for sufficient devices

When the replicas mechanism was added, for tracking data by which drives
it's replicated on, the check for whether we have sufficient devices was
never updated to make use of it. This patch finally does that.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Run fsck if BCH_FEATURE_alloc_v2 isn't set
Kent Overstreet [Wed, 3 Feb 2021 20:31:17 +0000 (15:31 -0500)]
bcachefs: Run fsck if BCH_FEATURE_alloc_v2 isn't set

We're using BCH_FEATURE_alloc_v2 to also gate journalling updates to dev
usage - we don't have the code for reconstructing this from buckets
anymore, so we need to run fsck if it's not set.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fixes/improvements for journal entry reservations
Kent Overstreet [Wed, 3 Feb 2021 18:10:55 +0000 (13:10 -0500)]
bcachefs: Fixes/improvements for journal entry reservations

This fixes some arithmetic bugs in "bcachefs: Journal updates to dev
usage" - additionally, it cleans things up by switching everything that
goes in every journal entry to the journal_entry_res mechanism.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Include device in btree IO error messages
Kent Overstreet [Tue, 2 Feb 2021 22:08:54 +0000 (17:08 -0500)]
bcachefs: Include device in btree IO error messages

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Journal updates to dev usage
Kent Overstreet [Fri, 22 Jan 2021 02:52:06 +0000 (21:52 -0500)]
bcachefs: Journal updates to dev usage

This eliminates the need to scan every bucket to regenerate dev_usage at
mount time.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Persist 64 bit io clocks
Kent Overstreet [Thu, 21 Jan 2021 20:28:59 +0000 (15:28 -0500)]
bcachefs: Persist 64 bit io clocks

Originally, bcachefs - going back to bcache - stored, for each bucket, a
16 bit counter corresponding to how long it had been since the bucket
was read from. But, this required periodically rescaling counters on
every bucket to avoid wraparound. That wasn't an issue in bcache, where
we'd perodically rewrite the per bucket metadata all at once, but in
bcachefs we're trying to avoid having to walk every single bucket.

This patch switches to persisting 64 bit io clocks, corresponding to the
64 bit bucket timestaps introduced in the previous patch with
KEY_TYPE_alloc_v2.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: KEY_TYPE_alloc_v2
Kent Overstreet [Fri, 22 Jan 2021 23:01:07 +0000 (18:01 -0500)]
bcachefs: KEY_TYPE_alloc_v2

This introduces a new version of KEY_TYPE_alloc, which uses the new
varint encoding introduced for inodes. This means we'll eventually be
able to support much larger bucket sizes (for SMR devices), and the
read/write time fields are expanded to 64 bits - which will be used in
the next patch to get rid of the periodic rescaling of those fields.

Also, for buckets that are members of erasure coded stripes, this adds
persistent fields for the index of the stripe they're members of and the
stripe redundancy. This is part of work to get rid of having to scan and
read into memory the alloc and stripes btrees at mount time.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add missing call to bch2_replicas_entry_sort()
Kent Overstreet [Tue, 2 Feb 2021 20:56:44 +0000 (15:56 -0500)]
bcachefs: Add missing call to bch2_replicas_entry_sort()

This fixes a bug introduced by "bcachefs: Improve diagnostics when
journal entries are missing" - devices in a replicas entry are supposed
to be sorted.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add an assertion to check for journal writes to same location
Kent Overstreet [Fri, 29 Jan 2021 18:58:10 +0000 (13:58 -0500)]
bcachefs: Add an assertion to check for journal writes to same location

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add an option for metadata_target
Kent Overstreet [Fri, 29 Jan 2021 20:37:28 +0000 (15:37 -0500)]
bcachefs: Add an option for metadata_target

Also, make journal writes obey foreground_target and metadata_target.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Repair bad data pointers
Kent Overstreet [Thu, 28 Jan 2021 00:08:54 +0000 (19:08 -0500)]
bcachefs: Repair bad data pointers

Now that we can repair metadata during GC, we can handle bad pointers
that would trigger errors being marked, when they need to just be
dropped.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add (partial) support for fixing btree topology
Kent Overstreet [Wed, 27 Jan 2021 01:59:00 +0000 (20:59 -0500)]
bcachefs: Add (partial) support for fixing btree topology

When we walk the btrees during recovery, part of that is checking that
btree topology is correct: for every interior btree node, its child
nodes should exactly span the range the parent node covers.

Previously, we had checks for this, but not repair code. Now that we
have the ability to do btree updates during initial GC, this patch adds
that repair code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add support for doing btree updates prior to journal replay
Kent Overstreet [Wed, 27 Jan 2021 01:15:46 +0000 (20:15 -0500)]
bcachefs: Add support for doing btree updates prior to journal replay

Some errors may need to be fixed in order for GC to successfully run -
walk and mark all metadata. But we can't start the allocators and do
normal btree updates until after GC has completed, and allocation
information is known to be consistent, so we need a different method of
doing btree updates.

Fortunately, we already have code for walking the btree while overlaying
keys from the journal to be replayed. This patch adds an update path
that adds keys to the list of keys to be replayed by journal replay, and
also fixes up iterators.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add BTREE_PTR_RANGE_UPDATED
Kent Overstreet [Wed, 27 Jan 2021 01:13:54 +0000 (20:13 -0500)]
bcachefs: Add BTREE_PTR_RANGE_UPDATED

This is so that when we discover btree topology issues, we can just
update the pointer to a btree node and signal btree read path that the
min/max keys in the node header should be updated from the node pointer.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Refactor checking of btree topology
Kent Overstreet [Tue, 26 Jan 2021 21:04:38 +0000 (16:04 -0500)]
bcachefs: Refactor checking of btree topology

Still a lot of work to be done here: we can't yet repair btree topology
issues, but this patch refactors things so that we have better access to
what we need in the topology checks. Next up will be figuring out a way
to do btree updates during gc, before journal replay is done.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Improve diagnostics when journal entries are missing
Kent Overstreet [Tue, 26 Jan 2021 21:04:12 +0000 (16:04 -0500)]
bcachefs: Improve diagnostics when journal entries are missing

There's an outstanding bug with journal entries being missing in journal
replay. This patch adds code to print out where the journal entries were
physically located that were around the entry(ies) being missing, which
should make debugging easier.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix BCH_REPLICAS_MAX check
Kent Overstreet [Wed, 27 Jan 2021 02:22:19 +0000 (21:22 -0500)]
bcachefs: Fix BCH_REPLICAS_MAX check

Ideally, this limit will be going away in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix build in userspace
Kent Overstreet [Thu, 28 Jan 2021 00:36:09 +0000 (19:36 -0500)]
bcachefs: Fix build in userspace

The userspace bch_err() macro doesn't use the filesystem argument. Could
also be fixed with a better macro.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix an assertion
Kent Overstreet [Mon, 25 Jan 2021 19:04:31 +0000 (14:04 -0500)]
bcachefs: Fix an assertion

If we're invalidating a bucket that has cached data in it, data_type
won't be 0 - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Mark superblocks transactionally
Kent Overstreet [Fri, 22 Jan 2021 22:56:34 +0000 (17:56 -0500)]
bcachefs: Mark superblocks transactionally

More work towards getting rid of the in memory struct bucket: this path
adds code for marking superblock and journal buckets via the btree, and
uses it in the device add and journal resize paths.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Kill bch2_invalidate_bucket()
Kent Overstreet [Fri, 22 Jan 2021 23:19:15 +0000 (18:19 -0500)]
bcachefs: Kill bch2_invalidate_bucket()

This patch is working towards eventually getting rid of the in memory
struct bucket, and relying only on the btree representation.

Since bch2_invalidate_bucket() was only used for incrementing gens, not
invalidating cached data, no other counters were being changed as a side
effect - meaning it's safe for the allocator code to increment the
bucket gen directly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Refactor dev usage
Kent Overstreet [Fri, 22 Jan 2021 01:51:51 +0000 (20:51 -0500)]
bcachefs: Refactor dev usage

This is to make it more amenable for serialization.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Kill metadata only gc
Kent Overstreet [Fri, 22 Jan 2021 02:51:42 +0000 (21:51 -0500)]
bcachefs: Kill metadata only gc

This was useful before we had transactional updates to interior btree
nodes - but now, it's just extra unneeded complexity.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Ensure __bch2_trans_commit() always calls bch2_trans_reset()
Kent Overstreet [Fri, 22 Jan 2021 00:30:35 +0000 (19:30 -0500)]
bcachefs: Ensure __bch2_trans_commit() always calls bch2_trans_reset()

This was leading to a very strange bug in bch2_bucket_io_time_reset(),
where we'd retry without clearing out the list of updates.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix a faulty assertion
Kent Overstreet [Fri, 22 Jan 2021 00:15:49 +0000 (19:15 -0500)]
bcachefs: Fix a faulty assertion

If journal replay hasn't finished, the journal can't be empty - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Switch replicas.c allocations to GFP_KERNEL
Kent Overstreet [Fri, 22 Jan 2021 00:14:37 +0000 (19:14 -0500)]
bcachefs: Switch replicas.c allocations to GFP_KERNEL

We're transitioning to memalloc_nofs_save/restore instead of GFP flags
with the rest of the kernel, and GFP_NOIO was excessively strict and
causing unnnecessary allocation failures - these allocations are done
with btree locks dropped.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix loopback in dio mode
Kent Overstreet [Thu, 21 Jan 2021 19:42:23 +0000 (14:42 -0500)]
bcachefs: Fix loopback in dio mode

We had a deadlock on page_lock, because buffered reads signal completion
by unlocking the page, but the dio read path normally dirties the pages
it's reading to with set_page_dirty_lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Clean up bch2_extent_can_insert
Kent Overstreet [Thu, 21 Jan 2021 00:42:09 +0000 (19:42 -0500)]
bcachefs: Clean up bch2_extent_can_insert

It was using an internal btree node iterator interface, when
bch2_btree_iter_peek_slot() sufficed. We were hitting a null ptr deref
that looked like it was from the iterator not being uptodate - this will
also fix that.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix an assertion pop
Kent Overstreet [Wed, 20 Jan 2021 22:31:31 +0000 (17:31 -0500)]
bcachefs: Fix an assertion pop

There was a race: btree node writes drop their reference on journal pins
before clearing the btree_node_write_in_flight flag.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Don't allocate stripes at POS_MIN
Kent Overstreet [Tue, 19 Jan 2021 01:20:24 +0000 (20:20 -0500)]
bcachefs: Don't allocate stripes at POS_MIN

In the future, stripe index 0 will be a sentinal value. This patch
doesn't disallow stripes at POS_MIN yet, leaving that for when we do the
on disk format changes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Rework allocating buckets for stripes
Kent Overstreet [Tue, 19 Jan 2021 04:26:42 +0000 (23:26 -0500)]
bcachefs: Rework allocating buckets for stripes

Allocating buckets for existing stripes was busted, in part because the
data structures were too contorted. This reworks new stripes so that we
have an array of open buckets that matches blocks in the stripe, and
it's sparse if we're reusing an existing stripe.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Verify transaction updates are sorted
Kent Overstreet [Tue, 19 Jan 2021 00:59:03 +0000 (19:59 -0500)]
bcachefs: Verify transaction updates are sorted

A user reported a bug that implies they might not be correctly sorted,
this should help track that down.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Preserve stripe blockcounts on existing stripes
Kent Overstreet [Sun, 17 Jan 2021 22:43:49 +0000 (17:43 -0500)]
bcachefs: Preserve stripe blockcounts on existing stripes

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Kill stripe->dirty
Kent Overstreet [Sun, 17 Jan 2021 21:45:19 +0000 (16:45 -0500)]
bcachefs: Kill stripe->dirty

This makes bch2_stripes_write() work more like bch2_alloc_write().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix gc updating stripes info
Kent Overstreet [Sun, 17 Jan 2021 21:16:37 +0000 (16:16 -0500)]
bcachefs: Fix gc updating stripes info

The primary stripes radix tree can be sparse, which was causing an
assertion to pop because the one use for gc isn't. Fix this by changing
the algorithm to copy between the two radix trees.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix double counting of stripe block counts by GC
Kent Overstreet [Sun, 17 Jan 2021 20:18:11 +0000 (15:18 -0500)]
bcachefs: Fix double counting of stripe block counts by GC

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix integer overflow in bch2_disk_reservation_get()
Kent Overstreet [Sun, 17 Jan 2021 18:19:16 +0000 (13:19 -0500)]
bcachefs: Fix integer overflow in bch2_disk_reservation_get()

The sectors argument shouldn't have been a u32 - it can be up to U32_MAX
(i.e. fallocate creating persistent reservations), and if replication is
enabled we'll overflow when we calculate the real number of sectors to
reserve. Oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Correctly order flushes and journal writes on multi device filesystems
Kent Overstreet [Sat, 16 Jan 2021 20:40:33 +0000 (15:40 -0500)]
bcachefs: Correctly order flushes and journal writes on multi device filesystems

All writes prior to a journal write need to be flushed before the
journal write itself happens. On single device filesystems, it suffices
to mark the write with REQ_PREFLUSH|REQ_FUA, but on multi device
filesystems we need to issue flushes to every device - and wait for them
to complete - before issuing the journal writes. Previously, we were
issuing flushes to every device, but we weren't waiting for them to
complete before issuing the journal writes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Run jset_validate in write path as well
Kent Overstreet [Thu, 14 Jan 2021 21:21:22 +0000 (16:21 -0500)]
bcachefs: Run jset_validate in write path as well

This is because we had a bug where we were writing out journal entries
with garbage last_seq, and not catching it.

Also, completely ignore jset->last_seq when JSET_NO_FLUSH is true,
because of aforementioned bug, but change the write path to set last_seq
to 0 when JSET_NO_FLUSH is true.

Minor other cleanups and comments.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Factor out bch2_ec_stripes_heap_start()
Kent Overstreet [Thu, 14 Jan 2021 21:19:23 +0000 (16:19 -0500)]
bcachefs: Factor out bch2_ec_stripes_heap_start()

This fixes a bug where mark and sweep gc incorrectly was clearing out
the stripes heap and causing assertions to fire later - simpler to just
create the stripes heap after gc has finished.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add btree node prefetching to bch2_btree_and_journal_walk()
Kent Overstreet [Mon, 11 Jan 2021 21:11:02 +0000 (16:11 -0500)]
bcachefs: Add btree node prefetching to bch2_btree_and_journal_walk()

bch2_btree_and_journal_walk() walks the btree overlaying keys from the
journal; it was introduced so that we could read in the alloc btree
prior to journal replay being done, when journalling of updates to
interior btree nodes was introduced.

But it didn't have btree node prefetching, which introduced a severe
regression with mount times, particularly on spinning rust. This patch
implements btree node prefetching for the btree + journal walk,
hopefully fixing that.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Erasure coding fixes & refactoring
Kent Overstreet [Mon, 11 Jan 2021 18:51:23 +0000 (13:51 -0500)]
bcachefs: Erasure coding fixes & refactoring

 - Originally bch_extent_stripe_ptr didn't contain the block index,
   instead we'd have to search through the stripe pointers to figure out
   which pointer matched. When the block field was added to
   bch_extent_stripe_ptr, not all of the code was updated to use it.
   This patch fixes that, and we also now verify that field where it
   makes sense.

 - The ec_stripe_buf_init/exit() functions have been improved, and are
   now used by the bch2_ec_read_extent() (recovery read) path.

 - get_stripe_key() is now used by bch2_ec_read_extent().

 - We now have a getter and setter for checksums within a stripe, like
   we had previously for block sector counts, and ec_generate_checksums
   and ec_validate_checksums are now quite a bit smaller and cleaner.

ec.c still needs a lot of work, but this patch is slowly moving things
in the right direction.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add cannibalize lock to btree_cache_to_text()
Kent Overstreet [Mon, 11 Jan 2021 18:37:35 +0000 (13:37 -0500)]
bcachefs: Add cannibalize lock to btree_cache_to_text()

More debugging info is always a good thing.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix .splice_write
Kent Overstreet [Tue, 27 Apr 2021 18:18:22 +0000 (14:18 -0400)]
bcachefs: Fix .splice_write

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix bch2_replicas_gc2
Kent Overstreet [Sun, 10 Jan 2021 18:38:09 +0000 (13:38 -0500)]
bcachefs: Fix bch2_replicas_gc2

This fixes a regression introduced by "bcachefs: Refactor filesystem
usage accounting". We have to include all the replicas entries that have
any of the entries for different journal entries nonzero, we can't skip
them if they sum to zero.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: bch2_alloc_write() should be writing for all devices
Kent Overstreet [Sat, 9 Jan 2021 02:20:58 +0000 (21:20 -0500)]
bcachefs: bch2_alloc_write() should be writing for all devices

Alloc info isn't stored on a particular device, it makes no sense to
only be writing it out for rw members - this was causing fsck to not fix
alloc info errors, oops.

Also, make sure we write out alloc info in other repair paths.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix btree node split after merge operations
Kent Overstreet [Fri, 8 Jan 2021 15:56:39 +0000 (10:56 -0500)]
bcachefs: Fix btree node split after merge operations

A btree node merge operation deletes a key in the parent node; if when
inserting into the parent node we split the parent node, we can end up
with a whiteout in the parent node that we don't want.

The existing code drops them before doing the split, because they can
screw up picking the pivot, but we forgot about the unwritten writeouts
area - that needs to be cleared out too.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Reserve some open buckets for btree allocations
Kent Overstreet [Thu, 7 Jan 2021 22:18:14 +0000 (17:18 -0500)]
bcachefs: Reserve some open buckets for btree allocations

This reverts part of the change from "bcachefs: Don't use
BTREE_INSERT_USE_RESERVE so much" - it turns out we still should be
reserving open buckets for btree node allocations, because otherwise
data bucket allocations (especially with erasure coding enabled) can use
up all our open buckets and we won't be able to do the metadata update
that lets us release those open bucket references. Oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Work around a zstd bug
Kent Overstreet [Thu, 7 Jan 2021 22:06:22 +0000 (17:06 -0500)]
bcachefs: Work around a zstd bug

The zstd compression code seems to have a bug where it will write just
past the end of the destination buffer - probably only when the
compressed output isn't going to fit in the destination buffer, which
will never happen if you're always allocating a bigger buffer than the
source buffer which would explain other users not hitting it. But, we
size the buffer according to how much contiguous space on disk we have,
so...

generally, bugs like this don't write more than a word past the end of
the buffer, so an easy workaround is to subtract a fudge factor from the
buffer size.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Don't error out of recovery process on journal read error
Kent Overstreet [Wed, 6 Jan 2021 23:49:35 +0000 (18:49 -0500)]
bcachefs: Don't error out of recovery process on journal read error

We don't want to fail the recovery/mount because of a single error
reading from the journal - the relevant journal entry may still be found
on other devices, and missing or no journal entries found is already
handled later in the recovery process.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix journal_buf_realloc()
Kent Overstreet [Mon, 4 Jan 2021 20:46:57 +0000 (15:46 -0500)]
bcachefs: Fix journal_buf_realloc()

It used to be safe to reallocate a buf that the write path owns without
holding the journal lock, but now this can trigger an assertion in
journal_seq_to_buf().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Reduce/kill BKEY_PADDED use
Kent Overstreet [Thu, 17 Dec 2020 20:08:58 +0000 (15:08 -0500)]
bcachefs: Reduce/kill BKEY_PADDED use

With various newer key types - stripe keys, inline data extents - the
old approach of calculating the maximum size of the value is becoming
more and more error prone. Better to switch to bkey_on_stack, which can
dynamically allocate if necessary to handle any size bkey.

In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Use separate new stripes for copygc and non-copygc
Kent Overstreet [Tue, 15 Dec 2020 17:53:30 +0000 (12:53 -0500)]
bcachefs: Use separate new stripes for copygc and non-copygc

Allocations for copygc have to be kept separate from everything else,
so that copygc doesn't get starved.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Change allocations for ec stripes to blocking
Kent Overstreet [Tue, 15 Dec 2020 17:38:17 +0000 (12:38 -0500)]
bcachefs: Change allocations for ec stripes to blocking

We don't want writes to not get erasure coded just because the allocator
temporarily wasn't keeping up.

However, it's not guaranteed that these allocations will ever succeed,
we can currently get stuck - especially if devices are different sizes -
we still have work to do in this area.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Don't read existing stripes synchronously in write path
Kent Overstreet [Tue, 15 Dec 2020 00:41:03 +0000 (19:41 -0500)]
bcachefs: Don't read existing stripes synchronously in write path

Previously, in the stripe creation path, when reusing an existing stripe
we'd read the existing stripe synchronously - ouch.

Now, we allocate two stripe bufs if we're using an existing stripe, so
that we can do the read asynchronously - and, we read the full stripe so
that we can run recovery, if necessary.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Change when we allow overwrites
Kent Overstreet [Tue, 15 Dec 2020 02:59:33 +0000 (21:59 -0500)]
bcachefs: Change when we allow overwrites

Originally, we'd check for -ENOSPC when getting a disk reservation
whenever the new extent took up more space on disk than the old extent.

Erasure coding screwed this up, because with erasure coding writes are
initially replicated, and then in the background the extra replicas are
dropped when the stripe is created. This means that with erasure coding
enabled, writes will always take up more space on disk than the data
they're overwriting - but, according to posix, overwrites aren't
supposed to return ENOSPC.

So, in this patch we fudge things: if the new extent has more replicas
than the _effective_ replicas of the old extent, or if the old extent is
compressed and the new one isn't, we check for ENOSPC when getting the
disk reservation - otherwise, we don't.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Don't use BTREE_INSERT_USE_RESERVE so much
Kent Overstreet [Mon, 21 Dec 2020 22:17:18 +0000 (17:17 -0500)]
bcachefs: Don't use BTREE_INSERT_USE_RESERVE so much

Previously, we were using BTREE_INSERT_RESERVE in a lot of places where
it no longer makes sense.

 - we now have more open_buckets than we used to, and the reserves work
   better, so we shouldn't need to use BTREE_INSERT_RESERVE just because
   we're holding open_buckets pinned anymore.

 - We have the btree key cache for updates to the alloc btree, meaning
   we no longer need the btree reserve to ensure the allocator can make
   forward progress.

This means that we should only need a reserve for btree updates to
ensure that copygc can make forward progress.

Since it's now just for copygc, we can also fold RESERVE_BTREE into
RESERVE_MOVINGGC (the allocator's freelist reserve).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix iterator overflow in move path
Kent Overstreet [Mon, 21 Dec 2020 02:42:19 +0000 (21:42 -0500)]
bcachefs: Fix iterator overflow in move path

The move path was calling bch2_bucket_io_time_reset() for cached
pointers (which it shouldn't have been), and then not calling
bch2_trans_reset() when it got -EINTR (indicating transaction restart).
Oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix btree lock being incorrectly dropped
Kent Overstreet [Sun, 20 Dec 2020 02:31:05 +0000 (21:31 -0500)]
bcachefs: Fix btree lock being incorrectly dropped

__btree_trans_get_iter() was using bch2_btree_iter_upgrade, but it
shouldn't have been because on failure bch2_btree_iter_upgrade may drop
locks in other iterators, expecting the transaction to be restarted. But
__btree_trans_get_iter can't return an error to indicate that we need to
restart thet transaction - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix for spinning in journal reclaim on startup
Kent Overstreet [Sat, 19 Dec 2020 20:39:10 +0000 (15:39 -0500)]
bcachefs: Fix for spinning in journal reclaim on startup

We normally avoid having too many dirty keys in the btree key cache, to
ensure that we can always shrink our caches to reclaim memory if needed.

But this check was causing us to go into an infinite loop on startup, in
the btree insert path before journal reclaim was started.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix race between journal_seq_copy() and journal_seq_drop()
Kent Overstreet [Wed, 16 Dec 2020 20:41:29 +0000 (15:41 -0500)]
bcachefs: Fix race between journal_seq_copy() and journal_seq_drop()

In bch2_btree_interior_update_will_free_node, we copy the journal pins
from outstanding writes on the btree node we're about to free. But, this
can race with the writes completing, and dropping their journal pins.

To guard against this, just use READ_ONCE() in bch2_journal_pin_copy().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Don't write bucket IO time lazily
Kent Overstreet [Sat, 17 Oct 2020 01:39:16 +0000 (21:39 -0400)]
bcachefs: Don't write bucket IO time lazily

With the btree key cache code, we don't need to update the alloc btree
lazily - and this will mean we can remove the bch2_alloc_write() call in
the shutdown path.

Future work: we really need to expend the bucket IO clocks from 16 to 64
bits, so that we don't have to rescale them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add BCH_BKEY_PTRS_MAX
Kent Overstreet [Wed, 16 Dec 2020 19:23:27 +0000 (14:23 -0500)]
bcachefs: Add BCH_BKEY_PTRS_MAX

This now means "the maximum number of pointers within a bkey" - and
bch_devs_list is updated to use it instead of BCH_REPLICAS_MAX, since
stripes can contain more than BCH_REPLICAS_MAX pointers.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Check for duplicate device ptrs in bch2_bkey_ptrs_invalid()
Kent Overstreet [Wed, 16 Dec 2020 19:18:33 +0000 (14:18 -0500)]
bcachefs: Check for duplicate device ptrs in bch2_bkey_ptrs_invalid()

This is something we clearly should be checking for, but weren't -
oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Add some cond_rescheds() in shutdown path
Kent Overstreet [Sun, 13 Dec 2020 21:12:04 +0000 (16:12 -0500)]
bcachefs: Add some cond_rescheds() in shutdown path

Particularly on emergency shutdown we can end up having to clean up a
lot of dirty cached btree keys here.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix btree node merge -> split operations
Kent Overstreet [Fri, 11 Dec 2020 17:02:48 +0000 (12:02 -0500)]
bcachefs: Fix btree node merge -> split operations

If a btree node merger is followed by a split or compact of the parent
node, we could end up with the parent btree node iterator pointing to
the whiteout inserted by the btree node merge operation - the fix is to
ensure that interior btree node iterators always point to the first non
whiteout.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Always check if we need disk res in extent update path
Kent Overstreet [Thu, 10 Dec 2020 18:38:54 +0000 (13:38 -0500)]
bcachefs: Always check if we need disk res in extent update path

With erasure coding, we now have processes in the background that
compact data, causing it to take up less space on disk than when it was
written, or potentially when it was read.

This means that we can't trust the page cache when it says "we have data
on disk taking up x amount of space here" - there's always the potential
to race with background compaction.

To fix this, just check if we need to add to our disk reservation in the
bch2_extent_update() path, in the transaction that will do the btree
update.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Update transactional triggers interface to pass old & new keys
Kent Overstreet [Thu, 10 Dec 2020 18:13:56 +0000 (13:13 -0500)]
bcachefs: Update transactional triggers interface to pass old & new keys

This is needed to fix a bug where we're overflowing iterators within a
btree transaction, because we're updating the stripes btree (to update
block counts) and the stripes btree trigger is unnecessarily updating
the alloc btree - it doesn't need to update the alloc btree when the
pointers within a stripe aren't changing.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Only try to get existing stripe once in stripe create path
Kent Overstreet [Wed, 9 Dec 2020 18:39:30 +0000 (13:39 -0500)]
bcachefs: Only try to get existing stripe once in stripe create path

The stripe creation path was too state-machiney: it would always run the
full state machine until it had succesfully created a new stripe.

But if we tried to get and reuse an existing stripe after we'd already
allocated some buckets, the buckets we'd allocated might have conflicted
with the blocks in the existing stripe we need to keep - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix __btree_iter_next() when all iters are in use_next() when all iters...
Kent Overstreet [Wed, 9 Dec 2020 18:34:42 +0000 (13:34 -0500)]
bcachefs: Fix __btree_iter_next() when all iters are in use_next() when all iters are in use

Also, print out more information on btree transaction iterator overflow.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix rand_delete() test
Kent Overstreet [Mon, 7 Dec 2020 16:44:12 +0000 (11:44 -0500)]
bcachefs: Fix rand_delete() test

When we didn't find a key to delete we were getting a null ptr deref.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Try to print full btree error message
Kent Overstreet [Sun, 6 Dec 2020 21:30:02 +0000 (16:30 -0500)]
bcachefs: Try to print full btree error message

Metadata corruption bugs are hard to debug if we can't see exactly what
went wrong - try to allocate a bigger buffer so we can print out
everything we have.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Prevent journal reclaim from spinning
Kent Overstreet [Sun, 6 Dec 2020 21:29:13 +0000 (16:29 -0500)]
bcachefs: Prevent journal reclaim from spinning

Without checking if we actually flushed anything, journal reclaim could
still go into an infinite loop while trying ot shut down.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Fix btree key cache dirty checks
Kent Overstreet [Sun, 6 Dec 2020 02:03:57 +0000 (21:03 -0500)]
bcachefs: Fix btree key cache dirty checks

Had a type that meant we were triggering journal reclaim _much_ more
aggressively than needed. Also, fix a potential integer overflow.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Be more conservation about journal pre-reservations
Kent Overstreet [Sat, 5 Dec 2020 21:25:05 +0000 (16:25 -0500)]
bcachefs: Be more conservation about journal pre-reservations

 - Try to always keep 1/8th of the journal free, on top of
   pre-reservations
 - Move the check for whether the journal is stuck to
   bch2_journal_space_available, and make it only fire when there aren't
   any journal writes in flight (that might free up space by updating
   last_seq)

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Don't require flush/fua on every journal write
Kent Overstreet [Sat, 14 Nov 2020 14:59:58 +0000 (09:59 -0500)]
bcachefs: Don't require flush/fua on every journal write

This patch adds a flag to journal entries which, if set, indicates that
they weren't done as flush/fua writes.

 - non flush/fua journal writes don't update last_seq (i.e. they don't
   free up space in the journal), thus the journal free space
   calculations now check whether nonflush journal writes are currently
   allowed (i.e. are we low on free space, or would doing a flush write
   free up a lot of space in the journal)

 - write_delay_ms, the user configurable option for when open journal
   entries are automatically written, is now interpreted as the max
   delay between flush journal writes (default 1 second).

 - bch2_journal_flush_seq_async is changed to ensure a flush write >=
   the requested sequence number has happened

 - journal read/replay must now ignore, and blacklist, any journal
   entries newer than the most recent flush entry in the journal. Also,
   the way the read_entire_journal option is handled has been improved;
   struct journal_replay now has an entry, 'ignore', for entries that
   were read but should not be used.

 - assorted refactoring and improvements related to journal read in
   journal_io.c and recovery.c

Previously, we'd have to issue a flush/fua write every time we
accumulated a full journal entry - typically the bucket size. Now we
need to issue them much less frequently: when an fsync is requested, or
it's been more than write_delay_ms since the last flush, or when we need
to free up space in the journal. This is a significant performance
improvement on many write heavy workloads.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Improve journal free space calculations
Kent Overstreet [Sat, 14 Nov 2020 17:29:21 +0000 (12:29 -0500)]
bcachefs: Improve journal free space calculations

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Increase journal pipelining
Kent Overstreet [Fri, 13 Nov 2020 23:36:33 +0000 (18:36 -0500)]
bcachefs: Increase journal pipelining

This patch increases the maximum journal buffers in flight from 2 to 4 -
this will be particularly helpful when in the future we stop requiring
flush+fua for every journal write.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2 years agobcachefs: Don't issue btree writes that weren't journalled
Kent Overstreet [Thu, 3 Dec 2020 21:20:18 +0000 (16:20 -0500)]
bcachefs: Don't issue btree writes that weren't journalled

If we have an error in the btree interior update path that prevents us
from journalling the update, we can't issue the corresponding btree node
write - we didn't get a journal sequence number that would cause it to
be ignored in recovery.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>