Carlos Maiolino [Tue, 6 Aug 2024 13:49:48 +0000 (15:49 +0200)]
Merge tag 'repair-pptrs-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfsprogs: offline repair for parent pointers [v13.8 20/28]
This series implements online checking and repair for directory parent
pointer metadata. The checking half is fairly straightforward -- for
each outgoing directory link (forward or backwards), grab the inode at
the other end, and confirm that there's a corresponding link. If we
can't grab an inode or lock it, we'll save that link for a slower loop
that cycles all the locks, confirms the continued existence of the link,
and rechecks the link if it's actually still there.
Repairs are a bit more involved -- for directories, we walk the entire
filesystem to rebuild the dirents from parent pointer information.
Parent pointer repairs do the same walk but rebuild the pptrs from the
dirent information, but with the added twist that it duplicates all the
xattrs so that it can use the atomic extent swapping code to commit the
repairs atomically.
This introduces an added twist to the xattr repair code -- we use dirent
hooks to detect a colliding update to the pptr data while we're not
holding the ILOCKs; if one is detected, we restart the xattr salvaging
process but this time hold all the ILOCKs until the end of the scan.
For offline repair, the phase6 directory connectivity scan generates an
index of all the expected parent pointers in the filesystem. Then it
walks each file and compares the parent pointers attached to that file
against the index generated, and resyncs the results as necessary.
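The offline comparison described above can be sketched roughly as follows. This is a minimal illustration, not the xfs_repair implementation: the tuple layout, the function names, and the use of in-memory sets are all assumptions made for the example.

```python
from collections import defaultdict


def build_pptr_index(dirents):
    """Build the expected parent pointers for every child inode.

    dirents is an iterable of (parent_ino, name, child_ino) tuples, as
    they might be recorded during a directory connectivity scan.
    """
    index = defaultdict(set)
    for parent_ino, name, child_ino in dirents:
        index[child_ino].add((parent_ino, name))
    return index


def resync_plan(expected, ondisk):
    """Return (to_add, to_remove) parent pointer sets for one file."""
    return expected - ondisk, ondisk - expected


# Hypothetical scan results: inode 131 is hard linked from two directories.
dirents = [(128, "a", 131), (128, "b", 132), (130, "hardlink", 131)]
index = build_pptr_index(dirents)

# Hypothetical parent pointers found attached to inode 131.
ondisk_131 = {(128, "a"), (129, "stale")}
to_add, to_remove = resync_plan(index[131], ondisk_131)
```

In this example the file is missing the pointer for (130, "hardlink") and carries a stale pointer to directory 129, so the plan adds one record and removes another.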
The last patch teaches xfs_scrub to report pathnames of files that are
being repaired, when possible.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:49:21 +0000 (15:49 +0200)]
Merge tag 'pptrs-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfsprogs: Parent Pointers [v13.8 18/28]
This is the latest parent pointer attributes for xfs. The goal of this
patch set is to add a parent pointer attribute to each inode. The
attribute name contains the parent inode, generation, and directory
offset, while the attribute value contains the file name. This feature
will enable future optimizations for online scrub, shrink, nfs handles,
verity, or any other feature that could make use of quickly deriving an
inode's path from the mount point.
Directory parent pointers are stored as namespaced extended attributes
of a file. Because parent pointers are an indivisible tuple of
(dirent_name, parent_ino, parent_gen) we cannot use the usual attr name
lookup functions to find a parent pointer. This is solvable by
introducing a new lookup mode that checks both the name and the value of
the xattr.
Therefore, introduce this new name-value lookup mode that's gated on the
XFS_ATTR_PARENT namespace. This requires the introduction of new
opcodes for the extended attribute update log intent items, which
actually means that the parent pointer feature (itself an INCOMPAT feature) does
not depend on the LOGGED_XATTRS log incompat feature bit.
To reduce collisions on the dirent names of parent pointers, introduce a
new attr hash mode that is the dir2 namehash of the dirent name xor'd
with the parent inode number.
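To make the hash mode concrete, here is a small sketch of "name hash xor parent inode number". The rotate-and-xor name hash below is written from memory of the XFS directory name hash and should be treated as an assumption, as should the way the 64-bit inode number is folded into 32 bits; the function names are made up for this example.

```python
MASK32 = 0xFFFFFFFF


def rol32(value, shift):
    """Rotate a 32-bit value left by shift bits."""
    shift %= 32
    return ((value << shift) | (value >> (32 - shift))) & MASK32


def name_hash(name: bytes) -> int:
    """Rotate-and-xor hash over the name, four bytes at a time (assumed form)."""
    h = 0
    i = 0
    while len(name) - i >= 4:
        h = ((name[i] << 21) ^ (name[i + 1] << 14) ^ (name[i + 2] << 7)
             ^ name[i + 3] ^ rol32(h, 28)) & MASK32
        i += 4
    rest = name[i:]
    if len(rest) == 3:
        return ((rest[0] << 14) ^ (rest[1] << 7) ^ rest[2] ^ rol32(h, 21)) & MASK32
    if len(rest) == 2:
        return ((rest[0] << 7) ^ rest[1] ^ rol32(h, 14)) & MASK32
    if len(rest) == 1:
        return (rest[0] ^ rol32(h, 7)) & MASK32
    return h


def parent_pointer_hash(name: bytes, parent_ino: int) -> int:
    """Mix the parent inode number into the name hash.

    Folding the upper and lower halves of the inode number together is an
    assumption of this sketch.
    """
    return name_hash(name) ^ ((parent_ino >> 32) & MASK32) ^ (parent_ino & MASK32)
```

The point of the xor is that the same dirent name under two different parent directories lands on two different hash values, which spreads hard links with identical names across the attr index.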
At this point, Allison has moved on to other things, so I've merged her
patchset into djwong-dev for merging.
Updates since v12 [djwong]:
Rebase on 6.9-rc and update the online fsck design document.
Redesign the ondisk format to use the name-value lookups to get us back
to the point where the attr is (dirent_name -> parent_ino/gen).
Updates since v11 [djwong]:
Rebase on 6.4-rc and make some tweaks and bugfixes to enable the repair
prototypes. Merge with djwong-dev and make online repair actually work.
Updates since v10 [djwong]:
Merge in the ondisk format changes to get rid of the diroffset conflicts
with the parent pointer repair code, rebase the entire series with the
attr vlookup changes first, and merge all the other random fixes.
Updates since v9:
Reordered patches 2 and 3 to be 6 and 7
xfs: Add xfs_verify_pptr
moved parent pointer validators to xfs_parent
xfs: Add parent pointer ioctl
Extra validation checks for fs id
added missing release for the inode
use GFP_KERNEL flags for malloc/realloc
reworked ioctl to use pptr listentry and flex array
NEW
xfs: don't remove the attr fork when parent pointers are enabled
NEW
directory lookups should return diroffsets too
NEW
xfs: move/add parent pointer validators to xfs_parent
Updates since v8:
xfs: parent pointer attribute creation
Fix xfs_parent_init to release log assist on alloc fail
Add slab cache for xfs_parent_defer
Fix xfs_create to release after unlock
Add xfs_parent_start and xfs_parent_finish wrappers
removed unused xfs_parent_name_irec and xfs_init_parent_name_irec
xfs: add parent attributes to link
Start/finish wrapper updates
Fix xfs_link to disallow reservationless quotas
xfs: add parent attributes to symlink
Fix xfs_symlink to release after unlock
Start/finish wrapper updates
xfs: Add parent pointers to rename
Start/finish wrapper updates
Fix rename to only grab logged xattr once
Fix xfs_rename to disallow reservationless quotas
Fix double unlock on dqattach fail
Move parent frees to out_release_wip
xfs: Add parent pointers to xfs_cross_rename
Hoist parent pointers into rename
Questions comments and feedback appreciated!
Thanks all!
Allison
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Prior to introducing parent pointer extended attributes, let's spend
some time cleaning up the attr code and strengthening the validation
that it performs on attrs coming in from the disk.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:48:39 +0000 (15:48 +0200)]
Merge tag 'scrub-media-scan-service-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfs_scrub_all: automatic media scan service [v30.9 15/28]
Now that we've completed the online fsck functionality, there are a few
things that could be improved in the automatic service. Specifically,
we would like to perform a more intensive metadata + media scan once per
month, to give the user confidence that the filesystem isn't losing data
silently. To accomplish this, enhance xfs_scrub_all to be able to
trigger media scans. Next, add a duplicate set of system services that
start the media scans automatically.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:48:23 +0000 (15:48 +0200)]
Merge tag 'scrub-service-security-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfs_scrub: tighten security of systemd services [v30.9 14/28]
To reduce the risk of the online fsck service suffering some sort of
catastrophic breach that results in attackers reconfiguring the running
system, I embarked on a security audit of the systemd service files.
The result should be that all elements of the background service
(individual scrub jobs, the scrub_all initiator, and the failure
reporting) run with as few privileges and within as strong of a sandbox
as possible.
Granted, this does nothing about the potential for the /kernel/ screwing
up, but at least we could prevent obvious container escapes.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:47:58 +0000 (15:47 +0200)]
Merge tag 'scrub-fstrim-minlen-freesp-histogram-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfs_scrub: use free space histograms to reduce fstrim runtime [v30.9 13/28]
This patchset dramatically reduces the runtime of the FITRIM calls made
during phase 8 of xfs_scrub. It turns out that phase 8 can really get
bogged down if the free space contains a large number of very small
extents. In these cases, the runtime can increase by an order of
magnitude to free less than 1% of the free space. This is not worth the
time, since we're spending a lot of time to do very little work. The
FITRIM ioctl allows us to specify a minimum extent length, so we can use
statistical methods to compute a minlen parameter.
It turns out xfs_db/spaceman already have the code needed to create
histograms of free space extent lengths. We add the ability to compute
a CDF of the extent lengths, which makes it easy to pick a minimum length
corresponding to 99% of the free space. In most cases, this results in
dramatic reductions in phase 8 runtime. Hence, move the histogram code
to libfrog, and wire up xfs_scrub, since phase 7 already walks the
fsmap.
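A rough sketch of picking a minlen from the distribution of free extent lengths follows. It works on an exact list of extent lengths rather than the bucketed histogram the patchset describes, and the function name and the percentage parameter are assumptions made for illustration.

```python
def pick_minlen(extent_lengths, target_pct=99):
    """Return the smallest extent length worth trimming.

    Walk the free extents from longest to shortest and stop as soon as the
    extents seen so far cover target_pct percent of all free space. Every
    extent shorter than the returned length can be skipped.
    """
    total = sum(extent_lengths)
    if total == 0:
        return 0
    covered = 0
    for length in sorted(extent_lengths, reverse=True):
        covered += length
        if covered * 100 >= target_pct * total:
            return length
    return 0


# Hypothetical free space: three large extents and 100 single-block ones.
free_extents = [1] * 100 + [10000, 20000, 70000]
minlen = pick_minlen(free_extents)
```

Here the three large extents already hold more than 99% of the free space, so a minlen of 10000 blocks lets the trim skip all 100 tiny extents.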
We also add a new -o suboption to xfs_scrub so that people who /do/ want
to examine every free extent can do so.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:47:46 +0000 (15:47 +0200)]
Merge tag 'scrub-fstrim-phase-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfs_scrub: move fstrim to a separate phase [v30.9 12/28]
Back when I originally designed xfs_scrub, all filesystem metadata
checks were complete by the end of phase 3, and phase 4 was where all
the metadata repairs occurred. On the grounds that the filesystem
should be fully consistent by then, I made a call to FITRIM at the end
of phase 4 to discard empty space in the filesystem.
Unfortunately, that's no longer the case -- summary counters, link
counts, and quota counters are not checked until phase 7. It's not safe
to instruct the storage to unmap "empty" areas if we don't know where
those empty areas are, so we need to create a phase 8 to trim the fs.
While we're at it, make it more obvious that fstrim only gets to run if
there are no unfixed corruptions and no other runtime errors have
occurred.
Finally, reduce the latency impacts on the rest of the system by
breaking up the fstrim work into a loop that targets only 16GB per call.
This enables better progress reporting for interactive runs and cgroup
based resource constraints for background runs.
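The chunking loop can be pictured like this. The generator below only computes the (start, length) ranges; issuing the actual trim request for each range is left out, and the function name is an assumption of this sketch.

```python
GIB = 1 << 30


def trim_ranges(fs_bytes, chunk_bytes=16 * GIB):
    """Yield (start, length) byte ranges covering the filesystem.

    Each range is at most chunk_bytes long, so the caller can report
    progress or check for cancellation between trim calls.
    """
    start = 0
    while start < fs_bytes:
        length = min(chunk_bytes, fs_bytes - start)
        yield start, length
        start += length
```

For a 40GB filesystem this produces two full 16GB ranges followed by one 8GB remainder.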
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In early 2023, malware researchers disclosed a phishing attack that was
targeted at people running Linux workstations. The attack vector
involved the use of filenames containing what looked like a file
extension but instead contained a lookalike for the full stop (".")
and a common extension ("pdf"). Enhance xfs_scrub phase 5 to detect
these types of attacks and warn the system administrator.
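One simple way to picture this kind of check is shown below. It is not how xfs_scrub detects lookalikes; it is a sketch that relies on Unicode NFKC normalization, which only catches characters with a compatibility mapping to "." (such as U+2024 ONE DOT LEADER). The extension list and the function name are assumptions for the example.

```python
import unicodedata

# Illustrative list only; a real checker would want a broader set.
COMMON_EXTENSIONS = {"pdf", "doc", "txt", "jpg"}


def looks_like_fake_extension(name):
    """Return True if a lookalike dot makes the name appear to end in .ext."""
    folded = unicodedata.normalize("NFKC", name)
    if folded == name:
        return False  # nothing in the name was a compatibility lookalike
    stem, dot, ext = folded.rpartition(".")
    if not dot or ext.lower() not in COMMON_EXTENSIONS:
        return False
    # If the raw name already ends in a genuine ".ext", the extension is real.
    return not name.lower().endswith("." + ext.lower())
```

A name such as "invoice\u2024pdf" renders much like "invoice.pdf" but contains no real full stop, so the function flags it, while a plain "invoice.pdf" passes.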
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:47:09 +0000 (15:47 +0200)]
Merge tag 'scrub-repair-scheduling-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfs_scrub: improve scheduling of repair items [v30.9 10/28]
Currently, phase 4 of xfs_scrub uses per-AG repair item lists to
schedule repair work across a thread pool. This scheme is suboptimal
when most of the repairs involve a single AG because all the work gets
dumped on a single pool thread.
Instead, we should create a thread pool with the same number of workers
as CPUs, and dispatch individual repair tickets as separate work items
to maximize parallelization.
However, we also need to ensure that repairs to space metadata and file
metadata are kept in separate queues because file repairs generally
depend on correctness of space metadata.
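The scheduling idea can be sketched with a thread pool sized to the CPU count and two separate queues. Draining the space metadata queue completely before starting any file repair is an assumption of this sketch, as are the function and parameter names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_repairs(space_tickets, file_tickets, repair_fn):
    """Dispatch each repair ticket as its own work item.

    Space metadata repairs all finish before any file metadata repair is
    submitted, because file repairs generally depend on space metadata.
    """
    workers = os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consuming the map iterator waits for every space repair to finish.
        results.extend(pool.map(repair_fn, space_tickets))
        results.extend(pool.map(repair_fn, file_tickets))
    return results
```

Because each ticket is a separate work item, a pile of repairs that all belong to one AG is still spread across every worker instead of landing on a single thread.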
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:46:57 +0000 (15:46 +0200)]
Merge tag 'scrub-object-tracking-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfs_scrub: use scrub_item to track check progress [v30.9 09/28]
Now that we've introduced tickets to track the status of repairs to a
specific principal XFS object (fs, ag, file), use them to track the
scrub state of those same objects. Ultimately, we want to make it easy
to introduce vectorized repair, where we send a batch of repair requests
to the kernel instead of making millions of ioctl calls. For now,
however, we'll settle for easier bookkeeping.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:46:15 +0000 (15:46 +0200)]
Merge tag 'scrub-repair-data-deps-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfs_scrub: track data dependencies for repairs [v30.9 08/28]
Certain kinds of XFS metadata depend on the correctness of lower level
metadata. For example, directory indexes depend on the directory data
fork, which in turn depends on the directory inode being correct. The
current scrub code does not strictly preserve these dependencies if it
has to defer a repair until phase 4, because phase 4 prioritizes repairs
by type (corruption, then cross referencing, and then preening) and
loses the ordering established in the previous phases. This leads to
absurd things like trying to repair a directory before repairing its
corrupted fork.
To solve this problem, introduce a repair ticket structure to track all
the repairs pending for a principal object (inode, AG, etc). This
reduces memory requirements if an object requires more than one type of
repair and makes it very easy to track the data dependencies between
sub-objects of a principal object. Repair dependencies between object
types (e.g. bnobt before inodes) must still be encoded statically into
phase 4.
A secondary benefit of this new ticket structure is that we can decide
to attempt a repair of an object A that was flagged for a cross
referencing error during the scan if a different object B depends on A
but only B showed definitive signs of corruption.
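A repair ticket of this kind might look something like the sketch below. The class, the sub-object names, and the fixed ordering list are all assumptions for illustration; the real structure in xfs_scrub is not shown in this log.

```python
from dataclasses import dataclass, field

# Sub-objects of a directory, lowest level first (illustrative names).
DIR_REPAIR_ORDER = ["inode", "data_fork", "dir_index"]


@dataclass
class RepairTicket:
    """All pending repairs for one principal object, e.g. one inode."""
    target: str
    pending: set = field(default_factory=set)

    def flag(self, sub_object):
        """Record that a sub-object needs repair; repeats cost nothing extra."""
        self.pending.add(sub_object)

    def schedule(self):
        """Return pending repairs ordered so dependencies are fixed first."""
        return [s for s in DIR_REPAIR_ORDER if s in self.pending]


ticket = RepairTicket("inode 131")
ticket.flag("dir_index")
ticket.flag("data_fork")
order = ticket.schedule()
```

Even though the directory index was flagged first, the ticket schedules the data fork ahead of it, which avoids repairing a directory before its corrupted fork.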
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:45:53 +0000 (15:45 +0200)]
Merge tag 'scrub-better-repair-warnings-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfs_scrub: improve warnings about difficult repairs [v30.9 07/28]
While I was poking through the QA results for xfs_scrub, I noticed that
it doesn't warn the user when the primary and secondary realtime
metadata are so out of whack that the chances of a successful repair are
not so high. I decided that it was worth refactoring the scrub code a
bit so that we could warn the user about these types of things, and
ended up refactoring unnecessary helpers out of existence and fixing
other reporting gaps.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:45:40 +0000 (15:45 +0200)]
Merge tag 'scrub-repair-fixes-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfs_scrub: fixes to the repair code [v30.9 06/28]
Now that we've landed the new kernel code, it's time to reorganize the
xfs_scrub code that handles repairs. Clean up various naming warts and
misleading error messages. Move the repair code to scrub/repair.c as
the first step. Then, fix various issues in the repair code before we
start reorganizing things.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
While doing QA of the online fsck code, I made a few observations:
First, nobody was checking that the di_onlink field is actually zero;
Second, that allocating a temporary file for repairs can fail (and
thus bring down the entire fs) if the inode cluster is corrupt; and
Third, that file link counts do not pin at ~0U to prevent integer
overflows.
This scattered patchset fixes those three problems.
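The third observation is about saturating arithmetic. A sketch of a link count that pins at ~0U follows; the helper names are made up, and the rule that a pinned count is never decremented again is an assumption of this example.

```python
NLINK_PINNED = 0xFFFFFFFF  # ~0U for a 32-bit link count


def bump_nlink(nlink):
    """Increment a link count, sticking at NLINK_PINNED instead of wrapping."""
    if nlink == NLINK_PINNED:
        return NLINK_PINNED
    return nlink + 1


def drop_nlink(nlink):
    """Decrement a link count; a pinned count stays pinned (assumed rule)."""
    if nlink == NLINK_PINNED:
        return NLINK_PINNED
    return nlink - 1
```

Without the pin, incrementing a 32-bit counter at its maximum would wrap to zero and make a heavily linked file look unlinked.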
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:45:05 +0000 (15:45 +0200)]
Merge tag 'dirattr-validate-owners-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfsprogs: set and validate dir/attr block owners [v30.9 04/28]
There are a couple of significant changes that need to be made to the
directory and xattr code before we can support online repairs of those
data structures.
The first change is because online repair is designed to use libxfs to
create a replacement dir/xattr structure in a temporary file, and use
atomic extent swapping to commit the corrected structure. To avoid the
performance hit of walking every block of the new structure to rewrite
the owner number, we instead change libxfs to allow callers of the dir
and xattr code the ability to set an explicit owner number to be written
into the header fields of any new blocks that are created.
The second change is to update the dir/xattr code to actually *check*
the owner number in each block that is read off the disk, since we don't
currently do that.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Carlos Maiolino [Tue, 6 Aug 2024 13:44:20 +0000 (15:44 +0200)]
Merge tag 'atomic-file-updates-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next
xfsprogs: atomic file updates [v30.9 03/28]
This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange
ranges of bytes between two files atomically.
This new functionality enables data storage programs to stage and commit
file updates such that reader programs will see either the old contents
or the new contents in their entirety, with no chance of torn writes. A
successful call completion guarantees that the new contents will be seen
even if the system fails.
The ability to exchange file fork mappings between files in this manner
is critical to supporting online filesystem repair, which is built upon
the strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically. The
ioctls exist to facilitate testing of the new functionality and to
enable future application program designs.
User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchanging the relevant ranges of the temp file
with the original file. If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas. Note that application software must quiesce writes to the file
while it stages an atomic update. This will be addressed by a
subsequent series.
This mechanism solves the clunkiness of two existing atomic file update
mechanisms: for O_TRUNC + rewrite, this eliminates the brief period
where other programs can see an empty file. For create tempfile +
rename, the need to copy file attributes and extended attributes for
each file update is eliminated.
However, this method introduces its own awkwardness -- any program
initiating an exchange now needs to have a way to signal to other
programs that the file contents have changed. For file access mediated
via read and write, fanotify or inotify are probably sufficient. For
mmaped files, that may not be fast enough.
The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down. Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.
Note that this function is /not/ the O_DIRECT atomic untorn file writes
concept that has also been floating around for years. It is also not
the RWF_ATOMIC patchset that has been shared. This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.
As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata. The atomic file content
exchange is implemented as an atomic exchange of file fork mappings,
which means that we can implement online reconstruction of extended
attributes and directories by building a new one in another inode and
exchanging the contents.
Subsequent patchsets adapt the online filesystem repair code to use
atomic file exchanges. This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode. If this
completes successfully, the new contents can be committed atomically
into the inode being repaired. This is essential to avoid making
corruption problems worse if the system goes down in the middle of
running repair.
For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Enable parent pointer support in mkfs via the '-n parent' parameter.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: move the no-V4 filesystem check to join the rest] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:25 +0000 (16:23 -0700)]
xfs: create a blob array data structure
Create a simple 'blob array' data structure for storage of arbitrarily
sized metadata objects that will be used to reconstruct metadata. For
the intended usage (temporarily storing extended attribute names and
values) we only have to support storing objects and retrieving them.
Use the xfile abstraction to store the attribute information in memory
that can be swapped out.
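A store-and-retrieve blob array can be sketched in a few lines. Here a bytearray stands in for the xfile, the returned cookie is simply the byte offset of the record, and the length-prefix layout is an assumption of this example rather than the actual format.

```python
import struct


class BlobArray:
    """Append-only storage for arbitrarily sized byte blobs."""

    def __init__(self):
        self._buf = bytearray()  # stand-in for swappable xfile memory

    def store(self, blob: bytes) -> int:
        """Append a blob and return a cookie for retrieving it later."""
        cookie = len(self._buf)
        self._buf += struct.pack("<I", len(blob)) + blob
        return cookie

    def load(self, cookie: int) -> bytes:
        """Return the blob that store() saved under this cookie."""
        (size,) = struct.unpack_from("<I", self._buf, cookie)
        start = cookie + 4
        return bytes(self._buf[start:start + size])
```

Callers keep only the small integer cookies in their own records and fetch the variable-length names or values on demand.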
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
mkfs: Add parent pointers during protofile creation
Inodes created from protofile parsing will also need to add the
appropriate parent pointers.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: use xfs_parent_add from libxfs instead of open-coding xfs_attr_set] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:27 +0000 (16:23 -0700)]
xfs_repair: check parent pointers
Use the parent pointer index that we constructed in the previous patch
to check that each file's parent pointer records exactly match the
directory entries that we recorded while walking directory entries.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:24 +0000 (16:23 -0700)]
libxfs: create new files with attr forks if necessary
Create new files with attr forks if they're going to have parent
pointers. In the next patch we'll fix mkfs to use the same parent
creation functions as the kernel, so we're going to need this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:26 +0000 (16:23 -0700)]
xfs_repair: move the global dirent name store to a separate object
Abstract the main parent pointer dirent names xfblob object into a
separate data structure to hide implementation details.
The goals here are (a) reduce memory usage when we can by deduplicating
dirent names that exist in multiple directories; and (b) provide a
unique id for each name in the system so that sorting incore parent
pointer records can be done in a stable manner. Fast stable sorting of
records is required for the dirent <-> pptr matching algorithm.
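Both goals can be illustrated with a small interning table. The class and the record layout below are assumptions for the example; the real name store sits on top of the xfblob object mentioned above.

```python
class NameStore:
    """Deduplicate dirent names and hand out a unique id per distinct name."""

    def __init__(self):
        self._ids = {}
        self._names = []

    def intern(self, name):
        """Return the id for name, storing the name only the first time."""
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def lookup(self, name_id):
        """Return the name that was assigned this id."""
        return self._names[name_id]


store = NameStore()
# Hypothetical incore records: (child_ino, parent_ino, name_id).
records = [
    (131, 130, store.intern("Makefile")),
    (131, 128, store.intern("README")),
    (131, 128, store.intern("Makefile")),
]
records.sort()
```

Sorting on fixed-size integer tuples is cheap and gives the same order on every run, and the name "Makefile" is stored only once even though it appears in two directories.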
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:26 +0000 (16:23 -0700)]
xfs_repair: junk duplicate hashtab entries when processing sf dirents
dir_hash_add() adds the passed-in dirent to the directory hashtab even
if there's already a duplicate. Therefore, if we detect a duplicate or
a garbage entry while processing a shortform directory's entries, we
need to junk the newly added entry, just like we do when processing
directory data blocks.
This will become particularly relevant in the next patch, where we
generate a master index of parent pointers from the non-junked hashtab
entries of each directory that phase6 scans.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:25 +0000 (16:23 -0700)]
xfs_db: remove some boilerplate from xfs_attr_set
In preparation for online/offline repair wanting to use xfs_attr_set,
move some of the boilerplate out of this function into the callers.
Repair can initialize the da_args completely, and the userspace flag
handling/twisting goes away once we move it to xfs_attr_change.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:22 +0000 (16:23 -0700)]
xfs_db: obfuscate dirent and parent pointer names consistently
When someone wants to perform an obfuscated metadump of a filesystem
where parent pointers are enabled, we have to use the *exact* same
obfuscated name for both the directory entry and the parent pointer.
Create a name remapping table so that when we obfuscate a dirent name or
a parent pointer name, we can apply the same obfuscation when we find
the corresponding parent pointer or dirent.
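A name remapping table is essentially a memoized obfuscation function. The sketch below uses random lowercase letters of the same length; the real obfuscation in xfs_db has further constraints that this ignores, and the class name and seeding are assumptions.

```python
import random
import string


class NameRemap:
    """Remember the obfuscated form chosen for each original name."""

    def __init__(self, seed=0):
        self._map = {}
        self._rng = random.Random(seed)

    def obfuscate(self, name):
        """Return the same obfuscated name every time name is seen."""
        if name not in self._map:
            self._map[name] = "".join(
                self._rng.choice(string.ascii_lowercase) for _ in name
            )
        return self._map[name]
```

Whichever side is encountered first, the dirent or the parent pointer, the table guarantees that the other side receives the identical replacement.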
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:22 +0000 (16:23 -0700)]
xfs_db: report parent bit on xattrs
Display the parent bit on xattr keys
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:22 +0000 (16:23 -0700)]
xfs_scrub: use parent pointers to report lost file data
If parent pointers are enabled, compute the path to the file while we're
doing the fsmap scan and report that, instead of walking the entire
directory tree to print the paths of the (hopefully few) files that lost
data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_logprint: decode parent pointers in ATTRI items fully
This patch modifies the ATTRI print routines to look for the parent
pointer flag, and decode logged parent pointers fully when dumping log
contents. Between the existing ATTRI: printouts and the new ones
introduced here, we can figure out what was stored in each log iovec,
as well as the higher level parent pointer that was logged.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: adjust to new ondisk format] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
This patch adds the flags i, n, and f to the parent command. These flags add
filtering options that are used by the new parent pointer tests in xfstests, and
help to improve the test run time. The flags are:
-i: Only show parent pointer records containing the given inode
-n: Only show parent pointer records containing the given filename
-f: Print records in short format: ino/gen/namelen/name
Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: adapt to new getparents ioctl] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:20 +0000 (16:23 -0700)]
xfs_io: adapt parent command to new parent pointer ioctls
For ages, xfs_io has had a totally useless 'parent' command that enabled
callers to walk the parents or print the directory tree path of an open
file. This code used the ioctl interface presented by SGI's version of
parent pointers that was never merged. Rework the code in here to use
the new ioctl interfaces that we've settled upon. Get rid of the old
parent pointer checking code since xfs_repair/xfs_scrub will take care
of that.
(This originally was in the "xfsprogs: implement the upper half of
parent pointers" megapatch.)
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:20 +0000 (16:23 -0700)]
libfrog: add parent pointer support code
Add some support code to libfrog so that client programs can walk file
descriptors and handles upwards through the directory tree; and obtain a
reasonable file path from a file descriptor/handle. This code will be
used in xfsprogs utilities.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:20 +0000 (16:23 -0700)]
xfs_logprint: dump new attr log item fields
Dump the new extended attribute log item fields. This was split out
from the previous patch to make libxfs resyncing easier. This code
needs more cleaning, which we'll do in the next few patches before
moving on to the parent pointer code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:18 +0000 (16:23 -0700)]
xfs_scrub_all: implement retry and backoff for dbus calls
Calls to systemd across dbus are remote procedure calls, which means
that they're subject to transitory connection failures (e.g. systemd
re-executing itself). We don't want to fail at the *first* sign of what
could be temporary trouble, so implement a limited retry with fibonacci
backoff before we resort to invoking xfs_scrub as a subprocess.
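A limited retry with Fibonacci backoff might be sketched as follows. The exception type, the retry count, and the function names are assumptions; the dbus call and the subprocess fallback are passed in as plain callables so the sketch stays self-contained.

```python
import time


def fibonacci_delays(count):
    """Yield the first count Fibonacci numbers: 1, 1, 2, 3, 5, ..."""
    a, b = 1, 1
    for _ in range(count):
        yield a
        a, b = b, a + b


def call_with_backoff(fn, fallback, retries=5, sleep=time.sleep):
    """Try fn up to retries times, sleeping longer after each failure.

    If every attempt raises ConnectionError, give up on fn and return
    whatever fallback() returns.
    """
    for delay in fibonacci_delays(retries):
        try:
            return fn()
        except ConnectionError:
            sleep(delay)
    return fallback()
```

Passing sleep as a parameter makes the delay schedule easy to observe in a test without actually waiting.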
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:15 +0000 (16:23 -0700)]
xfs_scrub_all: tighten up the security on the background systemd service
Currently, xfs_scrub_all has to run with enough privileges to find
mounted XFS filesystems and the device associated with that mount and to
start xfs_scrub@<mountpoint> sub-services. Minimize the risk of
xfs_scrub_all escaping its service container or contaminating the rest
of the system by using systemd's sandboxing controls to prohibit as much
access as possible.
The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub_all.service' in systemd 249.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:19 +0000 (16:23 -0700)]
xfs_repair: check free space requirements before allowing upgrades
Currently, the V5 feature upgrades permitted by xfs_repair do not affect
filesystem space usage, so we haven't needed to verify the geometry.
However, this will change once we start to allow the sysadmin to add new
metadata indexes to existing filesystems. Add all the infrastructure we
need to ensure that there's enough space for metadata space reservations
and per-AG reservations the next time the filesystem will be mounted.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
[david: Recompute transaction reservation values; Exit with error if upgrade fails]
Signed-off-by: Dave Chinner <david@fromorbit.com>
[djwong: Refuse to upgrade if any part of the fs has < 10% free]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:17 +0000 (16:23 -0700)]
xfs_scrub_all: convert systemctl calls to dbus
Convert the systemctl invocations to direct dbus calls, which decouples
us from the CLI in favor of direct API calls. This spares us from some
of the insanity of divining service state from program outputs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:15 +0000 (16:23 -0700)]
xfs_scrub_fail: tighten up the security on the background systemd service
Currently, xfs_scrub_fail has to run with enough privileges to access
the journal contents for a given scrub run and to send a report via
email. Minimize the risk of xfs_scrub_fail escaping its service
container or contaminating the rest of the system by using systemd's
sandboxing controls to prohibit as much access as possible.
The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub_fail@.service' in systemd 249.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:16 +0000 (16:23 -0700)]
xfs_scrub_all: enable periodic file data scrubs automatically
Enhance xfs_scrub_all with the ability to initiate a file data scrub
periodically. The user must specify the period, and they may optionally
specify the path to a file that will record the last time the file data
was scrubbed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
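One way the period check could work, as a sketch only. It assumes the
timestamp is kept as the mtime of the stamp file, which the message above
does not say; media_scan_due and record_media_scan are hypothetical names.

```python
import os
import time


def media_scan_due(stamp_path, period_seconds, now=None):
    """Return True if no scrub was recorded or the last one is too old."""
    now = time.time() if now is None else now
    try:
        last = os.stat(stamp_path).st_mtime
    except FileNotFoundError:
        # Never scrubbed (or the record was lost): scrub now.
        return True
    return now - last >= period_seconds


def record_media_scan(stamp_path, now=None):
    """Remember that a file data scrub was started at `now`."""
    now = time.time() if now is None else now
    # Create the stamp file if needed, then set its timestamps.
    with open(stamp_path, "a"):
        pass
    os.utime(stamp_path, (now, now))
```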
Darrick J. Wong [Mon, 29 Jul 2024 23:23:18 +0000 (16:23 -0700)]
xfs_scrub: automatic downgrades to dry-run mode in service mode
When service mode is enabled, xfs_scrub is being run within the context
of a systemd service. The service description language doesn't have any
particularly good constructs for adding in a '-n' argument if the
filesystem is readonly, which means that xfs_scrub is passed a path, and
needs to switch to dry-run mode on its own if the fs is mounted
readonly or the kernel doesn't support repairs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
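The decision described above can be sketched in Python, although xfs_scrub
itself is written in C and this is not its code. effective_dry_run is a
hypothetical name, and kernel_can_repair stands in for whatever probe the
program uses to learn whether the kernel supports repairs.

```python
import os


def effective_dry_run(path, dry_run_requested, kernel_can_repair,
                      statvfs=os.statvfs):
    """Decide whether the scrub of `path` must run in dry-run (-n) mode."""
    if dry_run_requested:
        return True
    # ST_RDONLY is set in f_flag when the filesystem is mounted readonly.
    readonly = bool(statvfs(path).f_flag & os.ST_RDONLY)
    return readonly or not kernel_can_repair
```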
Darrick J. Wong [Mon, 29 Jul 2024 23:23:17 +0000 (16:23 -0700)]
xfs_scrub_all: encapsulate all the systemctl code in an object
Move all the systemd service handling code to an object so that we can
contain all the insanity^Wdetails in a single place. This also makes
the killfuncs handling similar to starting background processes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:18 +0000 (16:23 -0700)]
xfs_scrub_all: fail fast on masked units
If xfs_scrub_all tries to start a masked xfs_scrub@ unit, that's a sign
that the system administrator really didn't want us to scrub that
filesystem. Instead of retrying pointlessly, just make a note of the
failure and move on.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:17 +0000 (16:23 -0700)]
xfs_scrub_all: encapsulate all the subprocess code in an object
Move all the xfs_scrub subprocess handling code to an object so that we
can contain all the details in a single place. This also simplifies the
background state management.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:16 +0000 (16:23 -0700)]
xfs_scrub_all: remove journalctl background process
Now that we only start systemd services if we're running in service
mode, there's no need for the background journalctl process that only
ran if we had started systemd services in non-service mode.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:16 +0000 (16:23 -0700)]
xfs_scrub_all: only use the xfs_scrub@ systemd services in service mode
Since the per-mount xfs_scrub@.service definition includes a bunch of
resource usage constraints, we no longer want to use those services if
xfs_scrub_all is being run directly by the sysadmin (aka not in service
mode) on the presumption that sysadmins want answers as quickly as
possible.
Therefore, only try to call the systemd service from xfs_scrub_all if
SERVICE_MODE is set in the environment. If reaching out to systemd
fails and we're in service mode, we still want to run xfs_scrub
directly. Split the makefile variables as necessary so that we only
pass -b to xfs_scrub in service mode.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
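A simplified sketch of the selection logic above. scrub_strategies is a
hypothetical helper, the command lines omit every option except -b, and
treating the mere presence of SERVICE_MODE as "set" is an assumption.

```python
import os


def scrub_strategies(mountpoint, environ=None):
    """Return the ordered list of ways to scrub `mountpoint`."""
    environ = os.environ if environ is None else environ
    if "SERVICE_MODE" not in environ:
        # Run by the sysadmin: skip the resource-limited service, no -b.
        return [("subprocess", ["xfs_scrub", mountpoint])]
    # Service mode: try the systemd service first, then fall back to
    # running xfs_scrub directly in background (-b) mode.
    return [
        ("systemd", mountpoint),
        ("subprocess", ["xfs_scrub", "-b", mountpoint]),
    ]
```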
Darrick J. Wong [Mon, 29 Jul 2024 23:23:14 +0000 (16:23 -0700)]
xfs_scrub: tune fstrim minlen parameter based on free space histograms
Currently, phase 8 runs very slowly on filesystems with a lot of small
free space extents. To reduce the amount of time spent on fstrim
activities during phase 8, we want to balance estimated runtime against
completeness of the trim. In short, the goal is to reduce runtime by
avoiding small trim requests.
At the start of phase 8, a CDF is computed in decreasing order of extent
length from the histogram buckets created during the fsmap scan in phase
7. A point corresponding to the fstrim percentage target is chosen from
the CDF and mapped back to a histogram bucket, and free space extents
smaller than that amount are omitted from fstrim.
On my aging /home filesystem, the free space histogram reported by
xfs_spaceman looks like this:
From this table, we see that free space extents that are 16 blocks or
longer constitute 99.3% of the free space in the filesystem but only
27.5% of the extents. If we set the fstrim minlen parameter to 16
blocks, that means that we can trim over 99% of the space in one third
of the time it would take to trim everything.
Add a new -o fstrim_pct= option to xfs_scrub just in case there are
users out there who want a different percentage. For example, accepting
a 95% trim would net us a speed increase of nearly two orders of
magnitude, ignoring system call overhead. Setting it to 100% will trim
everything, just like fstrim(8).
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
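The bucket selection can be illustrated with a short Python sketch; it is
not the C code in phase 8. pick_fstrim_minlen is a hypothetical name, each
bucket is assumed to be a (shortest extent length, total free blocks) pair,
and the histogram in the usage lines is invented for illustration, not the
table from this message.

```python
def pick_fstrim_minlen(buckets, target_pct):
    """Return the smallest extent length (in blocks) worth trimming.

    Walk the buckets from longest extents to shortest, accumulating free
    space, and stop at the first bucket where the running total reaches
    target_pct percent of all free space.
    """
    total = sum(blocks for _, blocks in buckets)
    if total == 0:
        return 0
    running = 0
    for low, blocks in sorted(buckets, reverse=True):
        running += blocks
        # Integer form of: running / total >= target_pct / 100
        if running * 100 >= total * target_pct:
            return low
    return 0


# Made-up histogram: (shortest extent in bucket, free blocks in bucket).
example = [(1, 10), (2, 20), (4, 40), (8, 130),
           (16, 800), (32, 9000), (64, 20000)]
print(pick_fstrim_minlen(example, 99))   # 16
print(pick_fstrim_minlen(example, 100))  # 1, i.e. trim everything
```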
Darrick J. Wong [Mon, 29 Jul 2024 23:23:13 +0000 (16:23 -0700)]
xfs_scrub: improve responsiveness while trimming the filesystem
On a 10TB filesystem where the free space in each AG is heavily
fragmented, I noticed some very high runtimes on a FITRIM call for the
entire filesystem. xfs_scrub likes to report progress information on
each phase of the scrub, which means that a strace for the entire
filesystem:
shows that scrub is uncommunicative for the entire duration. We can't
report any progress for the duration of the call, and the program is not
responsive to signals. Reducing the size of the FITRIM requests to a
single AG at a time produces lower times for each individual call, but
even this isn't quite acceptable, because the time between progress
reports is still very high:
I then had the idea to limit the length parameter of each call to a
smallish amount (~11GB) so that we could report progress relatively
quickly, but much to my surprise, each FITRIM call still took ~68
seconds!
Unfortunately, the by-length fstrim implementation handles this poorly
because it walks the entire free space by length index (cntbt), which is
a very inefficient way to walk a subset of an AG when the free space is
fragmented.
To fix that, I created a second implementation in the kernel that will
walk the bnobt and perform the trims in block number order. This
algorithm constrains the amount of btree scanning to something
resembling the range passed in, which reduces the amount of time it
takes to respond to a signal.
Therefore, break up the FITRIM calls so they don't scan more than 11GB
of space at a time. Break the calls up by AG so that each call only has
to take one AGF, because each AG that we traverse causes a log
force.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
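A sketch of how the request ranges could be generated, written in Python
for illustration rather than taken from xfs_scrub. fitrim_ranges is a
hypothetical name, all sizes are in bytes, and reading "11GB" as 11 GiB is
an assumption.

```python
def fitrim_ranges(ag_count, ag_bytes, fs_bytes, max_bytes=11 * 2**30):
    """Yield (start, length) ranges for successive FITRIM calls.

    No range crosses an AG boundary and none is longer than max_bytes,
    so progress can be reported between calls.
    """
    for agno in range(ag_count):
        ag_start = agno * ag_bytes
        # The last AG may be shorter than the others.
        ag_end = min(ag_start + ag_bytes, fs_bytes)
        pos = ag_start
        while pos < ag_end:
            length = min(max_bytes, ag_end - pos)
            yield pos, length
            pos += length
```

With two 25 GiB AGs on a 40 GiB filesystem this yields three requests for
the first AG (11, 11 and 3 GiB) and two for the second (11 and 4 GiB).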
Darrick J. Wong [Mon, 29 Jul 2024 23:23:15 +0000 (16:23 -0700)]
xfs_scrub: tighten up the security on the background systemd service
Currently, xfs_scrub has to run with some elevated privileges. Minimize
the risk of xfs_scrub escaping its service container or contaminating
the rest of the system by using systemd's sandboxing controls to
prohibit as much access as possible.
The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub@.service' in systemd 249.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:15 +0000 (16:23 -0700)]
xfs_scrub: use dynamic users when running as a systemd service
Five years ago, systemd introduced the DynamicUser directive that
allocates a new unique user/group id, runs a service with those ids, and
deletes them after the service exits. This is a good replacement for
User=nobody, since it eliminates the threat of nobody-services messing
with each other.
Make this transition ahead of all the other security tightenings that
will land in the next few patches, and add credits for the people who
suggested the change and reviewed it.
Darrick J. Wong [Mon, 29 Jul 2024 23:23:12 +0000 (16:23 -0700)]
xfs_scrub: report FITRIM errors properly
Move the error reporting for the FITRIM ioctl out of vfs.c and into
phase8.c. This makes it so that IO errors encountered during trim are
counted as runtime errors instead of being dropped silently.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:15 +0000 (16:23 -0700)]
xfs_scrub.service: reduce background CPU usage to less than one core if possible
Currently, the xfs_scrub background service is configured to use -b,
which means that the program runs completely serially. However, even
using all of one CPU core with idle priority may be enough to cause
thermal throttling and unwanted fan noise on smaller systems (e.g.
laptops) with fast IO systems.
Let's try to avoid this (at least on systemd) by using cgroups to limit
the program's usage to slightly more than half of one CPU and lowering
the nice priority in the scheduler. What we /really/ want is to run
steadily on an efficiency core, but there doesn't seem to be a means to
ask the scheduler not to ramp up the CPU frequency for a particular
task.
While we're at it, group the resource limit directives together.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:12 +0000 (16:23 -0700)]
xfs_scrub: fix the work estimation for phase 8
If there are latent errors on the filesystem, we aren't going to do any
work during phase 8 and it makes no sense to add that into the work
estimate for the progress bar.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:14 +0000 (16:23 -0700)]
xfs_scrub: allow auxiliary pathnames for sandboxing
In the next patch, we'll tighten up the security on the xfs_scrub
service so that it can't escape. However, sandboxing the service
involves making the host filesystem as inaccessible as possible, with
the filesystem to scrub bind mounted onto a known location within the
sandbox. Hence we need one path for reporting and a new -M argument to
tell scrub what it should actually be trying to open.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:11 +0000 (16:23 -0700)]
xfs_scrub: move FITRIM to phase 8
Issuing discards against the filesystem should be the *last* thing that
xfs_scrub does, after everything else has been checked, repaired, and
found to be clean. If we can't satisfy all those conditions, we have no
business telling the storage to discard itself.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:09 +0000 (16:23 -0700)]
xfs_scrub: try to repair space metadata before file metadata
Phase 4 (metadata repairs) of xfs_scrub has suffered a mild race
condition since the beginning of its existence. Repair functions for
higher level metadata such as directories build the new directory blocks
in an unlinked temporary file and use atomic extent swapping to commit
the corrected directory contents into the existing directory. Atomic
extent swapping requires consistent filesystem space metadata, but phase
4 has never enforced correctness dependencies between space and file
metadata repairs.
Before the previous patch eliminated the per-AG repair lists, this error
was not often hit in testing scenarios because the allocator generally
succeeds in placing file data blocks in the same AG as the inode. With
pool threads now able to pop file repairs from the repair list before
space repairs complete, this error became much more obvious.
Fortunately, the new phase 4 design makes it easy to try to enforce the
consistency requirements of higher level file metadata repairs. Split
the repair list into one for space metadata and another for file
metadata. Phase 4 will now try to fix the space metadata until it stops
making progress on that, and only then will it try to fix file metadata.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
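The ordering described above can be sketched as follows. This is a
single-threaded Python illustration with hypothetical names (run_repairs,
drain, repair), not the threaded C implementation in phase 4.

```python
def run_repairs(space_items, file_items, repair):
    """Repair space metadata first, then file metadata.

    repair(item) returns True once the item has been fixed. Returns the
    (space, file) items that could not be repaired.
    """
    def drain(items):
        pending = list(items)
        while pending:
            still_broken = [item for item in pending if not repair(item)]
            if len(still_broken) == len(pending):
                break  # a full pass made no progress; stop retrying
            pending = still_broken
        return pending

    # File repairs start only after space repairs have stopped progressing.
    unfixed_space = drain(space_items)
    unfixed_file = drain(file_items)
    return unfixed_space, unfixed_file
```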
Darrick J. Wong [Mon, 29 Jul 2024 23:23:08 +0000 (16:23 -0700)]
xfs_scrub: hoist scrub retry loop to scrub_item_check_file
For metadata check calls, use the ioctl retry and freeze permission
tracking in scrub_item that we created in the last patch. This enables
us to move the check retry loop out of xfs_scrub_metadata and into its
caller to remove a long backwards jump, and gets us closer to
vectorizing scrub calls.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 29 Jul 2024 23:23:11 +0000 (16:23 -0700)]
xfs_scrub: report deceptive file extensions
Earlier this year, ESET revealed that Linux users had been tricked into
opening executables containing malware payloads. The trickery came in
the form of a malicious zip file containing a filename with the string
"job offer․pdf". Note that the filename does *not* denote a real pdf
file, since the last four codepoints in the file name are "ONE DOT
LEADER", p, d, and f. Not period (ok, FULL STOP), p, d, f like you'd
normally expect.
Teach xfs_scrub to look for codepoints that could be confused with a
period followed by alphanumerics.
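A rough Python illustration of the check; xfs_scrub itself is written in C
and this is not its code. The list of look-alike codepoints here is a
small, deliberately incomplete assumption chosen for the example, and
has_deceptive_extension is a hypothetical name.

```python
import re

# Codepoints that can render like a full stop (illustrative subset):
# U+2024 ONE DOT LEADER, U+FF0E FULLWIDTH FULL STOP, U+FE52 SMALL FULL STOP
DOT_LOOKALIKES = "\u2024\uff0e\ufe52"

# A look-alike dot followed by ASCII alphanumerics at the end of the name.
_DECEPTIVE = re.compile("[" + DOT_LOOKALIKES + "][A-Za-z0-9]+$")


def has_deceptive_extension(name):
    """Return True if `name` ends in something that only looks like .ext"""
    return _DECEPTIVE.search(name) is not None


print(has_deceptive_extension("job offer\u2024pdf"))  # True
print(has_deceptive_extension("job offer.pdf"))       # False
```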
Darrick J. Wong [Mon, 29 Jul 2024 23:23:09 +0000 (16:23 -0700)]
xfs_scrub: recheck entire metadata objects after corruption repairs
When we've finished making repairs to some domain of filesystem metadata
(file, AG, etc.) to correct an inconsistency, we should recheck all the
other metadata types within that domain to make sure that we neither
made things worse nor introduced more cross-referencing problems. If we
did, requeue the item to make the repairs. If the only changes we made
were optimizations, don't bother.
The XFS_SCRUB_TYPE_ values are getting close to the max for a u32, so
I chose u64 for sri_selected.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>