Darrick J. Wong [Wed, 3 Jul 2024 21:21:29 +0000 (14:21 -0700)]
man: document vectored scrub mode
Add a manpage to document XFS_IOC_SCRUBV_METADATA. From the kernel
patch:
Introduce a variant on XFS_SCRUB_METADATA that allows for a vectored
mode. The caller specifies the principal metadata object that they want
to scrub (allocation group, inode, etc.) once, followed by an array of
scrub types they want called on that object. The kernel runs the scrub
operations and writes the output flags and errno code to the
corresponding array element.
A new pseudo scrub type BARRIER is introduced to force the kernel to
return to userspace if any corruptions have been found when scrubbing
the previous scrub types in the array. This enables userspace to
schedule, for example, the sequence:
1. data fork
2. barrier
3. directory
If the data fork scrub is clean, then the kernel will perform the
directory scrub. If not, the barrier in 2 will exit back to userspace.
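The barrier semantics described above can be modeled with a small simulation. This is an illustrative sketch, not the kernel's actual structures: the real interface is the XFS_IOC_SCRUBV_METADATA ioctl operating on an array of scrub vectors, and the type names below are hypothetical.

```python
# Simulation of the vectored-scrub barrier semantics described above.
# Names here are illustrative, not the kernel's actual structures.
BARRIER = "barrier"
CORRUPT = 1  # stand-in for a "corruption found" output flag

def run_scrub_vector(vec_types, scrub_one):
    """Run scrub types in order; return early to userspace at a
    barrier if any prior element in the array reported corruption."""
    results = []
    corruption_seen = False
    for t in vec_types:
        if t == BARRIER:
            if corruption_seen:
                break          # exit back to userspace
            results.append((t, 0))
            continue
        flags = scrub_one(t)
        if flags & CORRUPT:
            corruption_seen = True
        results.append((t, flags))
    return results

# Example: data fork scrub, barrier, directory scrub.
clean = run_scrub_vector(["bmapbtd", BARRIER, "directory"],
                         lambda t: 0)
dirty = run_scrub_vector(["bmapbtd", BARRIER, "directory"],
                         lambda t: CORRUPT if t == "bmapbtd" else 0)
```

In the clean case all three vector elements run; in the dirty case the barrier stops processing after the data fork scrub reports corruption.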
The alternative would have been an interface where userspace passes a
pointer to an empty buffer, and the kernel formats that with
xfs_scrub_vecs that tell userspace what it scrubbed and what the outcome
was. With that design the kernel would have to communicate that the
buffer needs to be at least X size, even though for our cases
XFS_SCRUB_TYPE_NR + 2 would always be enough.
Compared to that, this design keeps all the dependency policy and
ordering logic in userspace where it already resides instead of
duplicating it in the kernel. The downside of that is that it needs the
barrier logic.
When running fstests in "rebuild all metadata after each test" mode, I
observed a 10% reduction in runtime due to fewer transitions across the
system call boundary.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:28 +0000 (14:21 -0700)]
xfs_scrub: detect and repair directory tree corruptions
Now that we have online fsck for directory tree structure problems, we
need to find a place to call it. The scanner requires that parent
pointers are enabled, that directory link counts are correct, and that
every directory entry has a corresponding parent pointer. Therefore, we
can only run it after phase 4 fixes every file, and phase 5 resets the
link counts.
In other words, we call it as part of the phase 5 file scan that we do
to warn about weird looking file names. This has the added benefit that
opening the directory by handle is less likely to fail if there are
loops in the directory structure. For now, only plumb in enough to try
to fix directory tree problems right away; the next patch will make
phase 5 retry the dirloop scanner until the problems are fixed or we
stop making forward progress.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:28 +0000 (14:21 -0700)]
xfs_scrub: fix erroring out of check_inode_names
The early exit logic in this function is a bit suboptimal -- we don't
need to close the @fd if we haven't even opened it, and since all errors
are fatal, we don't need to bump the progress counter. The logic in
this function is about to get more involved due to the addition of the
directory tree structure checker, so clean up these warts.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:28 +0000 (14:21 -0700)]
xfs_spaceman: report directory tree corruption in the health information
Report directories that are the source of corruption in the directory
tree. While we're at it, add the documentation updates for the new
reporting flags and scrub type.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:28 +0000 (14:21 -0700)]
libfrog: add directory tree structure scrubber to scrub library
Make it so that scrub clients can detect corruptions within the
directory tree structure itself. Update the documentation for the scrub
ioctl to mention this new functionality.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:27 +0000 (14:21 -0700)]
xfs_repair: check parent pointers
Use the parent pointer index that we constructed in the previous patch
to check that each file's parent pointer records exactly match the
directory entries that we recorded while walking directory entries.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:26 +0000 (14:21 -0700)]
xfs_repair: move the global dirent name store to a separate object
Abstract the main parent pointer dirent names xfblob object into a
separate data structure to hide implementation details.
The goals here are (a) reduce memory usage when we can by deduplicating
dirent names that exist in multiple directories; and (b) provide a
unique id for each name in the system so that sorting incore parent
pointer records can be done in a stable manner. Fast stable sorting of
records is required for the dirent <-> pptr matching algorithm.
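The idea can be sketched as follows (illustrative Python, not the xfblob-backed C implementation): interning a name deduplicates it and hands back a stable integer id that works as a sort key for parent pointer records.

```python
class NameStore:
    """Toy stand-in for the dirent name store: deduplicates names and
    assigns each unique name a stable id usable for sorting parent
    pointer records."""
    def __init__(self):
        self._ids = {}
        self._names = []

    def intern(self, name: bytes) -> int:
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def lookup(self, name_id: int) -> bytes:
        return self._names[name_id]

store = NameStore()
id_foo1 = store.intern(b"foo")   # stored once...
id_foo2 = store.intern(b"foo")   # ...even if seen in many directories
id_bar = store.intern(b"bar")
# Records can now be sorted stably by (parent_ino, name_id).
records = [(100, id_bar), (100, id_foo1), (50, id_foo2)]
records.sort(key=lambda r: (r[0], r[1]))
```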
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:26 +0000 (14:21 -0700)]
xfs_repair: junk duplicate hashtab entries when processing sf dirents
dir_hash_add() adds the passed-in dirent to the directory hashtab even
if there's already a duplicate. Therefore, if we detect a duplicate or
a garbage entry while processing a shortform directory's entries, we
need to junk the newly added entry, just like we do when processing
directory data blocks.
This will become particularly relevant in the next patch, where we
generate a master index of parent pointers from the non-junked hashtab
entries of each directory that phase6 scans.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:25 +0000 (14:21 -0700)]
xfs_db: remove some boilerplate from xfs_attr_set
In preparation for online/offline repair wanting to use xfs_attr_set,
move some of the boilerplate out of this function into the callers.
Repair can initialize the da_args completely, and the userspace flag
handling/twisting goes away once we move it to xfs_attr_change.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:25 +0000 (14:21 -0700)]
xfs: create a blob array data structure
Create a simple 'blob array' data structure for storage of arbitrarily
sized metadata objects that will be used to reconstruct metadata. For
the intended usage (temporarily storing extended attribute names and
values) we only have to support storing objects and retrieving them.
Use the xfile abstraction to store the attribute information in memory
that can be swapped out.
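The store/retrieve interface amounts to something like this toy in-memory model (the real blob array is C code backed by an xfile; the layout below is purely illustrative):

```python
class BlobArray:
    """Toy model of a blob array: append-only storage of arbitrarily
    sized objects, addressed by the cookie returned at store time."""
    def __init__(self):
        self._buf = bytearray()

    def store(self, data: bytes) -> int:
        cookie = len(self._buf)
        # record a 4-byte length header, then the payload
        self._buf += len(data).to_bytes(4, "little") + data
        return cookie

    def load(self, cookie: int) -> bytes:
        n = int.from_bytes(self._buf[cookie:cookie + 4], "little")
        start = cookie + 4
        return bytes(self._buf[start:start + n])

blobs = BlobArray()
k1 = blobs.store(b"user.color")  # e.g. an xattr name
k2 = blobs.store(b"red")         # ...and its value
```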
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Enable parent pointer support in mkfs via the '-n parent' parameter.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: move the no-V4 filesystem check to join the rest] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
mkfs: Add parent pointers during protofile creation
Inodes created from protofile parsing will also need to add the
appropriate parent pointers.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: use xfs_parent_add from libxfs instead of open-coding xfs_attr_set] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:24 +0000 (14:21 -0700)]
libxfs: create new files with attr forks if necessary
Create new files with attr forks if they're going to have parent
pointers. In the next patch we'll fix mkfs to use the same parent
creation functions as the kernel, so we're going to need this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:23 +0000 (14:21 -0700)]
xfs_db: obfuscate dirent and parent pointer names consistently
When someone wants to perform an obfuscated metadump of a filesystem
where parent pointers are enabled, we have to use the *exact* same
obfuscated name for both the directory entry and the parent pointer.
Create a name remapping table so that when we obfuscate a dirent name or
a parent pointer name, we can apply the same obfuscation when we find
the corresponding parent pointer or dirent.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:22 +0000 (14:21 -0700)]
xfs_db: report parent bit on xattrs
Display the parent bit on xattr keys
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:22 +0000 (14:21 -0700)]
xfs_scrub: use parent pointers to report lost file data
If parent pointers are enabled, compute the path to the file while we're
doing the fsmap scan and report that, instead of walking the entire
directory tree to print the paths of the (hopefully few) files that lost
data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_logprint: decode parent pointers in ATTRI items fully
This patch modifies the ATTRI print routines to look for the parent
pointer flag, and decode logged parent pointers fully when dumping log
contents. Between the existing ATTRI: printouts and the new ones
introduced here, we can figure out what was stored in each log iovec,
as well as the higher level parent pointer that was logged.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: adjust to new ondisk format] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
This patch adds the -i, -n, and -f flags to the parent command. These
flags add filtering options that are used by the new parent pointer tests
in xfstests, and help to improve the test run time. The flags are:
-i: Only show parent pointer records containing the given inode
-n: Only show parent pointer records containing the given filename
-f: Print records in short format: ino/gen/namelen/name
Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: adapt to new getparents ioctl] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:21 +0000 (14:21 -0700)]
xfs_io: adapt parent command to new parent pointer ioctls
For ages, xfs_io has had a totally useless 'parent' command that enabled
callers to walk the parents or print the directory tree path of an open
file. This code used the ioctl interface presented by SGI's version of
parent pointers that was never merged. Rework the code in here to use
the new ioctl interfaces that we've settled upon. Get rid of the old
parent pointer checking code since xfs_repair/xfs_scrub will take care
of that.
(This originally was in the "xfsprogs: implement the upper half of
parent pointers" megapatch.)
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:20 +0000 (14:21 -0700)]
libfrog: add parent pointer support code
Add some support code to libfrog so that client programs can walk file
descriptors and handles upwards through the directory tree; and obtain a
reasonable file path from a file descriptor/handle. This code will be
used in xfsprogs utilities.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:20 +0000 (14:21 -0700)]
xfs_logprint: dump new attr log item fields
Dump the new extended attribute log item fields. This was split out
from the previous patch to make libxfs resyncing easier. This code
needs more cleaning, which we'll do in the next few patches before
moving on to the parent pointer code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:19 +0000 (14:21 -0700)]
xfs_repair: check free space requirements before allowing upgrades
Currently, the V5 feature upgrades permitted by xfs_repair do not affect
filesystem space usage, so we haven't needed to verify the geometry.
However, this will change once we start to allow the sysadmin to add new
metadata indexes to existing filesystems. Add all the infrastructure we
need to ensure that there's enough space for metadata space reservations
and per-AG reservations the next time the filesystem will be mounted.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
[david: Recompute transaction reservation values; Exit with error if upgrade fails] Signed-off-by: Dave Chinner <david@fromorbit.com>
[djwong: Refuse to upgrade if any part of the fs has < 10% free] Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:18 +0000 (14:21 -0700)]
xfs_scrub: automatic downgrades to dry-run mode in service mode
When service mode is enabled, xfs_scrub is being run within the context
of a systemd service. The service description language doesn't have any
particularly good constructs for adding in a '-n' argument if the
filesystem is readonly, which means that xfs_scrub is passed a path, and
needs to switch to dry-run mode on its own if the fs is mounted
readonly or the kernel doesn't support repairs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 23 Jul 2024 23:27:45 +0000 (16:27 -0700)]
xfs_scrub_all: fail fast on masked units
If xfs_scrub_all tries to start a masked xfs_scrub@ unit, that's a sign
that the system administrator really didn't want us to scrub that
filesystem. Instead of retrying pointlessly, just make a note of the
failure and move on.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:18 +0000 (14:21 -0700)]
xfs_scrub_all: implement retry and backoff for dbus calls
Calls to systemd across dbus are remote procedure calls, which means
that they're subject to transitory connection failures (e.g. systemd
re-executing itself). We don't want to fail at the *first* sign of what
could be temporary trouble, so implement a limited retry with fibonacci
backoff before we resort to invoking xfs_scrub as a subprocess.
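The retry policy can be sketched like so (an illustrative model, not the actual xfs_scrub_all code; the function and parameter names are hypothetical):

```python
import time

def fib_delays(max_tries):
    """Yield fibonacci backoff delays in seconds: 1, 1, 2, 3, 5, ..."""
    a, b = 1, 1
    for _ in range(max_tries):
        yield a
        a, b = b, a + b

def call_with_backoff(rpc, max_tries=5, sleep=time.sleep):
    """Retry a flaky RPC with fibonacci backoff; re-raise the last
    error so the caller can fall back to invoking a subprocess."""
    last_exc = None
    for delay in fib_delays(max_tries):
        try:
            return rpc()
        except ConnectionError as e:
            last_exc = e
            sleep(delay)
    raise last_exc

# A call that fails twice before succeeding:
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("dbus hiccup")
    return "ok"

result = call_with_backoff(flaky, sleep=lambda s: None)
```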
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:18 +0000 (14:21 -0700)]
xfs_scrub_all: convert systemctl calls to dbus
Convert the systemctl invocations to direct dbus calls, which decouples
us from the CLI in favor of direct API calls. This spares us from some
of the insanity of divining service state from program outputs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:17 +0000 (14:21 -0700)]
xfs_scrub_all: encapsulate all the systemctl code in an object
Move all the systemd service handling code to an object so that we can
contain all the insanity^Wdetails in a single place. This also makes
the killfuncs handling similar to starting background processes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:17 +0000 (14:21 -0700)]
xfs_scrub_all: encapsulate all the subprocess code in an object
Move all the xfs_scrub subprocess handling code to an object so that we
can contain all the details in a single place. This also simplifies the
background state management.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:16 +0000 (14:21 -0700)]
xfs_scrub_all: enable periodic file data scrubs automatically
Enhance xfs_scrub_all with the ability to initiate a file data scrub
periodically. The user must specify the period, and they may optionally
specify the path to a file that will record the last time the file data
was scrubbed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:16 +0000 (14:21 -0700)]
xfs_scrub_all: remove journalctl background process
Now that we only start systemd services if we're running in service
mode, there's no need for the background journalctl process that only
ran if we had started systemd services in non-service mode.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:16 +0000 (14:21 -0700)]
xfs_scrub_all: only use the xfs_scrub@ systemd services in service mode
Since the per-mount xfs_scrub@.service definition includes a bunch of
resource usage constraints, we no longer want to use those services if
xfs_scrub_all is being run directly by the sysadmin (aka not in service
mode) on the presumption that sysadmins want answers as quickly as
possible.
Therefore, only try to call the systemd service from xfs_scrub_all if
SERVICE_MODE is set in the environment. If reaching out to systemd
fails and we're in service mode, we still want to run xfs_scrub
directly. Split the makefile variables as necessary so that we only
pass -b to xfs_scrub in service mode.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:15 +0000 (14:21 -0700)]
xfs_scrub_all: tighten up the security on the background systemd service
Currently, xfs_scrub_all has to run with enough privileges to find
mounted XFS filesystems and the device associated with that mount and to
start xfs_scrub@<mountpoint> sub-services. Minimize the risk of
xfs_scrub_all escaping its service container or contaminating the rest
of the system by using systemd's sandboxing controls to prohibit as much
access as possible.
The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub_all.service' in systemd 249.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:15 +0000 (14:21 -0700)]
xfs_scrub_fail: tighten up the security on the background systemd service
Currently, xfs_scrub_fail has to run with enough privileges to access
the journal contents for a given scrub run and to send a report via
email. Minimize the risk of xfs_scrub_fail escaping its service
container or contaminating the rest of the system by using systemd's
sandboxing controls to prohibit as much access as possible.
The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub_fail@.service' in systemd 249.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:15 +0000 (14:21 -0700)]
xfs_scrub: tighten up the security on the background systemd service
Currently, xfs_scrub has to run with some elevated privileges. Minimize
the risk of xfs_scrub escaping its service container or contaminating
the rest of the system by using systemd's sandboxing controls to
prohibit as much access as possible.
The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub@.service' in systemd 249.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:15 +0000 (14:21 -0700)]
xfs_scrub: use dynamic users when running as a systemd service
Five years ago, systemd introduced the DynamicUser directive that
allocates a new unique user/group id, runs a service with those ids, and
deletes them after the service exits. This is a good replacement for
User=nobody, since it eliminates the threat of nobody-services messing
with each other.
Make this transition ahead of all the other security tightenings that
will land in the next few patches, and add credits for the people who
suggested the change and reviewed it.
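For reference, the directive in a unit file looks like this (an illustrative fragment only, not the full unit that xfsprogs ships):

```ini
# Illustrative fragment; the shipped service files set many more
# sandboxing directives than this.
[Service]
DynamicUser=true
```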
Darrick J. Wong [Wed, 3 Jul 2024 21:21:14 +0000 (14:21 -0700)]
xfs_scrub.service: reduce background CPU usage to less than one core if possible
Currently, the xfs_scrub background service is configured to use -b,
which means that the program runs completely serially. However, even
using all of one CPU core with idle priority may be enough to cause
thermal throttling and unwanted fan noise on smaller systems (e.g.
laptops) with fast IO systems.
Let's try to avoid this (at least on systemd) by using cgroups to limit
the program's usage to slightly more than half of one CPU and lowering
the nice priority in the scheduler. What we /really/ want is to run
steadily on an efficiency core, but there doesn't seem to be a means to
ask the scheduler not to ramp up the CPU frequency for a particular
task.
While we're at it, group the resource limit directives together.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:14 +0000 (14:21 -0700)]
xfs_scrub: allow auxiliary pathnames for sandboxing
In the next patch, we'll tighten up the security on the xfs_scrub
service so that it can't escape. However, sandboxing the service
involves making the host filesystem as inaccessible as possible, with
the filesystem to scrub bind mounted onto a known location within the
sandbox. Hence we need one path for reporting and a new -M argument to
tell scrub what it should actually be trying to open.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:14 +0000 (14:21 -0700)]
xfs_scrub: tune fstrim minlen parameter based on free space histograms
Currently, phase 8 runs very slowly on filesystems with a lot of small
free space extents. To reduce the amount of time spent on fstrim
activities during phase 8, we want to balance estimated runtime against
completeness of the trim. In short, the goal is to reduce runtime by
avoiding small trim requests.
At the start of phase 8, a CDF is computed in decreasing order of extent
length from the histogram buckets created during the fsmap scan in phase
7. A point corresponding to the fstrim percentage target is chosen from
the CDF and mapped back to a histogram bucket, and free space extents
smaller than that amount are omitted from fstrim.
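The bucket selection can be sketched as follows. The histogram data and function name here are hypothetical; the real code builds its histogram from the phase 7 fsmap scan.

```python
def pick_fstrim_minlen(buckets, pct_target):
    """Pick the smallest extent length worth trimming.

    buckets: list of (min_extent_len_blocks, total_blocks_in_bucket).
    Walk a CDF in decreasing order of extent length and stop once we
    cover pct_target percent of the free space; extents shorter than
    the returned length are skipped by fstrim.
    """
    total = sum(blocks for _, blocks in buckets)
    covered = 0
    for length, blocks in sorted(buckets, reverse=True):
        covered += blocks
        if covered * 100 >= pct_target * total:
            return length
    return 0  # trim everything

# Hypothetical histogram: a few large extents hold most of the space,
# but small extents dominate the extent count.
hist = [(1, 50), (4, 150), (16, 800), (64, 9000)]
minlen = pick_fstrim_minlen(hist, 95)
```

With this data, covering 95% of the free space only requires trimming extents of 16 blocks or longer, which is the effect described above.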
On my aging /home filesystem, the free space histogram reported by
xfs_spaceman looks like this:
From this table, we see that free space extents that are 16 blocks or
longer constitute 99.3% of the free space in the filesystem but only
27.5% of the extents. If we set the fstrim minlen parameter to 16
blocks, that means that we can trim over 99% of the space in one third
of the time it would take to trim everything.
Add a new -o fstrim_pct= option to xfs_scrub just in case there are
users out there who want a different percentage. For example, accepting
a 95% trim would net us a speed increase of nearly two orders of
magnitude, ignoring system call overhead. Setting it to 100% will trim
everything, just like fstrim(8).
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:12 +0000 (14:21 -0700)]
xfs_scrub: improve responsiveness while trimming the filesystem
On a 10TB filesystem where the free space in each AG is heavily
fragmented, I noticed some very high runtimes on a FITRIM call for the
entire filesystem. xfs_scrub likes to report progress information on
each phase of the scrub, which means that a strace for the entire
filesystem:
shows that scrub is uncommunicative for the entire duration. We can't
report any progress for the duration of the call, and the program is not
responsive to signals. Reducing the size of the FITRIM requests to a
single AG at a time produces lower times for each individual call, but
even this isn't quite acceptable, because the time between progress
reports is still very high:
I then had the idea to limit the length parameter of each call to a
smallish amount (~11GB) so that we could report progress relatively
quickly, but much to my surprise, each FITRIM call still took ~68
seconds!
Unfortunately, the by-length fstrim implementation handles this poorly
because it walks the entire free space by length index (cntbt), which is
a very inefficient way to walk a subset of an AG when the free space is
fragmented.
To fix that, I created a second implementation in the kernel that will
walk the bnobt and perform the trims in block number order. This
algorithm constrains the amount of btree scanning to something
resembling the range passed in, which reduces the amount of time it
takes to respond to a signal.
Therefore, break up the FITRIM calls so they don't scan more than 11GB
of space at a time. Break the calls up by AG so that each call only has
to take one AGF per call, because each AG that we traverse causes a log
force.
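The request splitting can be modeled like this (a sketch with hypothetical geometry; the real code issues FITRIM ioctls against these ranges):

```python
def fitrim_requests(ag_starts, ag_size, max_len):
    """Split a filesystem into FITRIM (start, length) requests that
    never cross an AG boundary and never exceed max_len bytes, so
    each call takes only one AGF and returns relatively quickly."""
    reqs = []
    for ag_start in ag_starts:
        offset = ag_start
        end = ag_start + ag_size
        while offset < end:
            length = min(max_len, end - offset)
            reqs.append((offset, length))
            offset += length
    return reqs

GiB = 1024 ** 3
# Hypothetical geometry: 4 AGs of 32GiB, ~11GiB per trim request.
reqs = fitrim_requests([i * 32 * GiB for i in range(4)],
                       32 * GiB, 11 * GiB)
```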
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:12 +0000 (14:21 -0700)]
xfs_scrub: report FITRIM errors properly
Move the error reporting for the FITRIM ioctl out of vfs.c and into
phase8.c. This makes it so that IO errors encountered during trim are
counted as runtime errors instead of being dropped silently.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:12 +0000 (14:21 -0700)]
xfs_scrub: fix the work estimation for phase 8
If there are latent errors on the filesystem, we aren't going to do any
work during phase 8 and it makes no sense to add that into the work
estimate for the progress bar.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:11 +0000 (14:21 -0700)]
xfs_scrub: move FITRIM to phase 8
Issuing discards against the filesystem should be the *last* thing that
xfs_scrub does, after everything else has been checked, repaired, and
found to be clean. If we can't satisfy all those conditions, we have no
business telling the storage to discard itself.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:11 +0000 (14:21 -0700)]
xfs_scrub: report deceptive file extensions
Earlier this year, ESET revealed that Linux users had been tricked into
opening executables containing malware payloads. The trickery came in
the form of a malicious zip file containing a filename with the string
"job offer․pdf". Note that the filename does *not* denote a real pdf
file, since the last four codepoints in the file name are "ONE DOT
LEADER", p, d, and f. Not period (ok, FULL STOP), p, d, f like you'd
normally expect.
Teach xfs_scrub to look for codepoints that could be confused with a
period followed by alphanumerics.
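One way to catch this class of trickery can be sketched with Python's unicodedata (xfs_scrub's actual checks are built on libicu; the heuristic and function name here are illustrative): compatibility-normalize the name and see whether a period materializes that the raw name lacks.

```python
import unicodedata

def deceptive_extension(name: str) -> bool:
    """Flag names where NFKC normalization introduces a '.' that the
    raw name does not have, e.g. ONE DOT LEADER (U+2024) masquerading
    as the dot before a file extension."""
    nfkc = unicodedata.normalize("NFKC", name)
    return "." in nfkc and nfkc.count(".") != name.count(".")

safe = deceptive_extension("job offer.pdf")         # real FULL STOP
tricky = deceptive_extension("job offer\u2024pdf")  # ONE DOT LEADER
```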
Darrick J. Wong [Wed, 3 Jul 2024 21:21:10 +0000 (14:21 -0700)]
xfs_scrub: reduce size of struct name_entry
libicu doesn't support processing strings longer than 2GB, and
we never feed the unicrash code a name longer than about 300 bytes.
Rearrange the structure to reduce the head structure size from 56 bytes
to 44 bytes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:10 +0000 (14:21 -0700)]
xfs_scrub: store bad flags with the name entry
When scrub is checking unicode names, there are certain properties of
the directory/attribute/label name itself that it can complain about.
Store these in struct name_entry so that the confusable names detector
can pick this up later.
This restructuring enables a subsequent patch to detect suspicious
sequences in the NFC normalized form of the name without needing to hang
on to that NFC form until the end of processing. IOWs, it's a memory
usage optimization.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:09 +0000 (14:21 -0700)]
xfs_scrub: hoist non-rendering character predicate
Hoist this predicate code into its own function; we're going to use it
elsewhere later on. While we're at it, document how we generated this
list in the first place.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:09 +0000 (14:21 -0700)]
xfs_scrub: guard against libicu returning negative buffer lengths
The libicu functions u_strFromUTF8, unorm2_normalize, and
uspoof_getSkeleton return int32_t values. Guard against negative return
values, even though the library itself never does this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:09 +0000 (14:21 -0700)]
xfs_scrub: avoid potential UAF after freeing a duplicate name entry
Change the function declaration of unicrash_add to set the caller's
@new_entry to NULL if we detect an updated name entry and do not wish to
continue processing. This avoids a theoretical UAF if the unicrash_add
caller were to accidentally continue using the pointer.
This isn't an /actual/ UAF because the function formerly set @badflags
to zero, but let's be a little defensive.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:09 +0000 (14:21 -0700)]
xfs_scrub: add a couple of omitted invisible code points
I missed a few non-rendering code points in the "zero width"
classification code. Add them now, and sort the list. Finding them is
an annoyingly manual process because there are various code points that
are not supposed to affect the rendering of a string of text but are not
explicitly named as such. There are other code points that, when
surrounded by code points from the same chart, actually /do/ affect the
rendering.
IOWs, the only way to figure this out is to grep the likely code points
and then go figure out how each of them render by reading the Unicode
spec or trying it.
Darrick J. Wong [Wed, 3 Jul 2024 21:21:08 +0000 (14:21 -0700)]
xfs_scrub: hoist code that removes ignorable characters
Hoist the loop that removes "ignorable" code points from the skeleton
string into a separate function and give the UChar cursors names that
are easier to understand. Convert the code to use the safe versions of
the U16_ accessor functions.
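The removal loop amounts to something like the following sketch using Python's unicodedata (the real code walks UTF-16 UChar strings with libicu's safe U16_ accessors; approximating Unicode's ignorable set with the Cf "format" category is an assumption of this sketch):

```python
import unicodedata

def strip_format_chars(s: str) -> str:
    """Drop code points in general category Cf (format characters such
    as ZERO WIDTH SPACE and SOFT HYPHEN) from a skeleton string.  This
    only approximates the 'ignorable' set the scrub code removes."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")

cleaned = strip_format_chars("pass\u200bword\u00ad")  # ZWSP + SOFT HYPHEN
```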
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:08 +0000 (14:21 -0700)]
xfs_scrub: use proper UChar string iterators
For code that wants to examine a UChar string, use libicu's string
iterators to walk UChar strings, instead of the open-coded U16_NEXT*
macros that perform no typechecking.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:08 +0000 (14:21 -0700)]
xfs_scrub: try to repair space metadata before file metadata
Phase 4 (metadata repairs) of xfs_scrub has suffered a mild race
condition since the beginning of its existence. Repair functions for
higher level metadata such as directories build the new directory blocks
in an unlinked temporary file and use atomic extent swapping to commit
the corrected directory contents into the existing directory. Atomic
extent swapping requires consistent filesystem space metadata, but phase
4 has never enforced correctness dependencies between space and file
metadata repairs.
Before the previous patch eliminated the per-AG repair lists, this error
was not often hit in testing scenarios because the allocator generally
succeeds in placing file data blocks in the same AG as the inode. With
pool threads now able to pop file repairs from the repair list before
space repairs complete, this error became much more obvious.
Fortunately, the new phase 4 design makes it easy to try to enforce the
consistency requirements of higher level file metadata repairs. Split
the repair list into one for space metadata and another for file
metadata. Phase 4 will now try to fix the space metadata until it stops
making progress on that, and only then will it try to fix file metadata.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:08 +0000 (14:21 -0700)]
xfs_scrub: recheck entire metadata objects after corruption repairs
When we've finished making repairs to some domain of filesystem metadata
(file, AG, etc.) to correct an inconsistency, we should recheck all the
other metadata types within that domain to make sure that we neither
made things worse nor introduced more cross-referencing problems. If we
did, requeue the item to make the repairs. If the only changes we made
were optimizations, don't bother.
The XFS_SCRUB_TYPE_ values are getting close to filling a u32 bitmask, so
I chose u64 for sri_selected.
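The selection bitmap works like any per-type bitmask; a quick sketch (the type numbers are illustrative, not the kernel's actual XFS_SCRUB_TYPE_ values):

```python
# Sketch of a 64-bit "selected scrub types" mask like sri_selected.
# Type numbers here are illustrative, not the kernel's actual values.
SCRUB_TYPE_BMBTD = 12
SCRUB_TYPE_DIR = 18

def select_type(mask: int, scrub_type: int) -> int:
    assert scrub_type < 64  # a u64 holds at most 64 type bits
    return mask | (1 << scrub_type)

def is_selected(mask: int, scrub_type: int) -> bool:
    return bool(mask & (1 << scrub_type))

mask = 0
mask = select_type(mask, SCRUB_TYPE_BMBTD)
mask = select_type(mask, SCRUB_TYPE_DIR)
```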
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 3 Jul 2024 21:21:08 +0000 (14:21 -0700)]
xfs_scrub: improve thread scheduling repair items during phase 4
As it stands, xfs_scrub doesn't do a good job of scheduling repair items
during phase 4. The repair lists are sharded by AG, and one repair
worker is started for each per-AG repair list. Consequently, if one AG
requires considerably more work than the others (e.g. inodes are not
spread evenly among the AGs) then phase 4 can stall waiting for that one
worker thread when there's still plenty of CPU power available.
While our initial assumptions were that repairs would be vanishingly
scarce, the reality is that "repairs" can be triggered for optimizations
like gaps in the xattr structures, or clearing the inode reflink flag on
inodes that no longer share data. In real world testing scenarios, the
lack of balance leads to complaints about excessive runtime of
xfs_scrub.
To fix these balance problems, we replace the per-AG repair item lists
in the scrub context with a single repair item list. Phase 4 will be
redesigned as follows:
The repair worker will grab a repair item from the main list, try to
repair it, record whether the repair attempt made any progress, and
requeue the item if it was not fully fixed. A separate repair scheduler
function starts the repair workers, and waits for them all to complete.
Requeued repairs are merged back into the main repair list. If we made
any forward progress, we'll start another round of repairs with the
repair workers. Phase 4 retains the behavior that if the pool stops
making forward progress, it will try all the repairs one last time,
serially.
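The scheduler loop described above can be modeled roughly as follows (a serial sketch of the parallel design, with hypothetical repair items; the final no-progress return stands in for the last-ditch serial pass):

```python
def run_repairs(items, try_repair):
    """Model of the phase 4 loop: attempt every queued repair, requeue
    the ones that didn't complete, and keep going as long as some
    round made forward progress.  try_repair(item) returns True when
    the item is fully fixed."""
    queue = list(items)
    while queue:
        requeue = []
        progress = False
        for item in queue:          # workers pop these in parallel
            if try_repair(item):
                progress = True
            else:
                requeue.append(item)
        if not progress:
            return requeue          # hand off to the serial last try
        queue = requeue
    return []

# A cross-dependency: the directory repair only works once the space
# metadata repair has landed (item names are made up).
tries = {}
def needs_space_fix(item):
    tries[item] = tries.get(item, 0) + 1
    if item == "agf:0":
        return True
    return tries[item] >= 2  # dir repair works after space is fixed

left = run_repairs(["agf:0", "dir:42"], needs_space_fix)
```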
To facilitate this new design, phase 2 will queue repairs of space
metadata items directly to the main list. Phase 3's worker threads will
queue repair items to per-thread lists and splice those lists into the
main list at the end.
On a filesystem crafted to put all the inodes in a single AG, this
restores xfs_scrub's ability to parallelize repairs. There seems to be
a slight performance hit for the evenly-spread case, but avoiding a
performance cliff due to an unbalanced fs is more important here.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>