fstests: btrfs/125: do not use raid5 for metadata
[BUG]
There have been several recent bug reports of btrfs/125 failures,
either a balance failure (-EIO) or even a kernel crash.
The balance failure looks like this:
Mount normal and balance
+ERROR: error during balancing '/mnt/scratch': Input/output error
+There may be more info in syslog - try dmesg | tail
+md5sum: /mnt/scratch/tf2: Input/output error
The test case btrfs/125 has not been reliable in the past, and has
been discussed several times:
https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/
https://lore.kernel.org/linux-btrfs/53f7bace2ac75d88ace42dd811d48b7912647301.1654672140.git.wqu@suse.com/#t
[CAUSE]
There are several different factors involved.
1. RMW mixes old and new metadata, causing unrepairable corruption
E.g. with the following layout:
data 1 |<- Stale metadata ->| (from the out-of-date device)
data 2 | Unused |
parity |PPPPPPPPPPPPPPPPPPPP|
In the above case, although the metadata on data 1 is out-of-date, we
can still rebuild the correct data from parity and data 2.
But if new metadata is written into the data 2 stripe, an RMW will
screw up the whole situation:
data 1 |<- Stale metadata ->| (from the out-of-date device)
data 2 |<- New metadata ->|
parity |XXXXXXXXXXXXXXXXXXXX|
The RMW will use the stale metadata and the new metadata to calculate
new parity.
The resulting parity will no longer be able to recover the old
data 1.
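To make the corruption concrete, here is a minimal shell sketch of
the XOR arithmetic (RAID5 parity is the XOR of the data stripes; the
byte values below are made up purely for illustration):

  # Stale data 1 from the out-of-date device, new metadata in data 2:
  stale_d1=0xAA
  new_d2=0x55

  # The RMW reads the stale data 1 without verification and computes
  # fresh parity from it:
  new_parity=$(( stale_d1 ^ new_d2 ))

  # Rebuilding data 1 from parity and data 2 now returns the stale
  # bytes, so the correct old data 1 is gone for good:
  printf 'recovered d1 = 0x%02X (stale)\n' $(( new_parity ^ new_d2 ))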
This is a known bug, thus our documentation already recommends
avoiding RAID56 for metadata:
> Metadata
> Do not use raid5 nor raid6 for metadata. Use raid1 or raid1c3
> respectively.
And this is very hard to fix: unlike data, where we can fetch the
data csum and verify each sector during RMW, we cannot verify
metadata at that point.
At the time of RMW, we're holding the rbio lock for the full stripe.
If the extent tree search needed for verification requires a
read-recover, it will generate another rbio, which may cover the same
full stripe we're working on, leading to a deadlock.
Furthermore, the current RAID56 repair code is all based on vertical
sectors, but a metadata block can cross several horizontal sectors.
Repairing such a metadata block would require trying multiple
combinations.
2. Crash caused by double freeing a bio
If, by chance, the above RMW corrupted the csum tree, then during
btrfs_submit_chunk() we will hit an error path that double frees a
bio, resulting in a crash or a KASAN report.
Thankfully the patch fixing the use-after-free has already been sent
to the mailing list:
https://lore.kernel.org/linux-btrfs/f4f916352ddf3f80048567ec7d8cc64cb388dc09.1724493430.git.wqu@suse.com/
[WORKAROUND]
Since it's very hard to fix the RAID56 metadata problem without a
deadlock or a huge code rework, for now just use RAID1 for the metadata
of this particular test case.
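As a sketch, the change boils down to splitting the -d/-m profiles at
mkfs time (the exact helper invocation and other options in btrfs/125
may differ; this is illustrative only):

  # Before: both data and metadata use the raid5 profile
  # _scratch_pool_mkfs "-d raid5 -m raid5" >> $seqres.full 2>&1
  # After: keep raid5 for data (the profile under test), but use
  # raid1 for metadata to avoid the RMW corruption described above:
  _scratch_pool_mkfs "-d raid5 -m raid1" >> $seqres.full 2>&1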
There may be a chance to fix the situation by properly marking the
missing-then-reappeared device as out-of-date, so that no reads are
served directly from it.
But that would be a huge new feature, not something that can be done
immediately.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zorro Lang <zlang@kernel.org>