fstests: check-parallel
Runs tests in parallel runner threads. Each runner thread has its
own set of tests to run, and runs a separate instance of check
to run those tests.
check-parallel sets up loop devices, mount points, results
directories, etc. for each instance and divides the tests up between
the runner threads.
It currently hardcodes the XFS and generic test lists, and then
gives each check invocation an explicit list of tests to run. It
also passes through exclusions so that test exclude filtering is
still done by check.
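So, purely as an illustration of the shape of the resulting command
line, each runner invokes check along the lines of:

./check -s xfs -x dump xfs/013 xfs/140 generic/017 generic/707

where the test list is whatever slice of the hardcoded lists that
runner was handed, and the exclusions came straight from the
check-parallel command line.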
This is far from ideal, but I didn't want to have to embark on a
major refactoring of check to be able to run stuff in parallel.
It was quite the challenge just to get all the tests and test
infrastructure up to the point where they can run reliably in
parallel.
Hence I've left the actual factoring of test selection and setup
out of the patchset for the moment. The plan is to factor both the
test setup and the test list runner loop out of check and share them
between check and check-parallel, so that check-parallel no longer
needs to run check directly. That is future work, however.
With the current test runner setup, it is not uncommon to see >5000%
CPU usage, 150-200k IOPS and 4-5GB/s of disk bandwidth being used
when running 64 runners. This is a serious stress load, as it is
constantly mounting and unmounting dozens of filesystems, creating
and destroying devices, dropping caches, running sync, running CPU
hotplug, running page cache migration, etc.
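For concreteness, the cache dropping and CPU hotplug that the tests
perform come down to poking standard kernel interfaces like these:

# drop clean page cache, dentries and inodes
echo 3 > /proc/sys/vm/drop_caches
# offline, then re-online, a CPU
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online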
The massive amount of IO that load generates causes qemu hosts to
abort (i.e. crash) because they run out of vm map segments. Hence
bumping up max_map_count on the host is necessary:

echo 1048576 > /proc/sys/vm/max_map_count
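The same thing in persistent sysctl form, for hosts that run this
regularly (the drop-in file name here is just an example):

sysctl -w vm.max_map_count=1048576
echo 'vm.max_map_count = 1048576' > /etc/sysctl.d/99-fstests.conf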
There is no significant memory pressure to speak of from running the
tests this way. I've seen a maximum of about 50GB of RAM used, so on
a 64p/64GB VM the additional concurrency doesn't really stress
memory capacity the way it does CPU and IO.
All the runners are executed in private mount namespaces. This is
to prevent ephemeral mount namespace clones from taking a reference
to every mounted filesystem in the machine and so causing random
"device busy after unmount" failures in the tests that are running
concurrently with the mount namespace setup and teardown.
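nsexec is the helper that sets the namespace up for each runner; the
effect is roughly what you would get from util-linux unshare, shown
here purely for illustration:

# run one check instance in a private mount namespace so that
# mount namespace clones created elsewhere on the system never
# take a reference to this runner's short-lived mounts
unshare --mount --propagation private -- ./check ...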
A typical `pstree -N mnt` looks like:
$ pstree -N mnt
[4026531841]
bash
bash───pstree
[0]
sudo───sudo───check-parallel─┬─check-parallel───nsexec───check───311─┬─cut
│ └─md5sum
├─check-parallel───nsexec───check───750─┬─750───sleep
│ └─750.fsstress───4*[750.fsstress───{750.fsstress}]
├─check-parallel───nsexec───check───013───013───sed
├─check-parallel───nsexec───check───251───cp
├─check-parallel───nsexec───check───467───open_by_handle
├─check-parallel───nsexec───check───650─┬─650───sleep
│ └─650.fsstress─┬─61*[650.fsstress───{650.fsstress}]
│ └─2*[650.fsstress]
├─check-parallel───nsexec───check───707
├─check-parallel───nsexec───check───705
├─check-parallel───nsexec───check───416
├─check-parallel───nsexec───check───477───2*[open_by_handle]
├─check-parallel───nsexec───check───140───140
├─check-parallel───nsexec───check───562
├─check-parallel───nsexec───check───415───xfs_io───{xfs_io}
├─check-parallel───nsexec───check───291
├─check-parallel───nsexec───check───017
├─check-parallel───nsexec───check───016
├─check-parallel───nsexec───check───168───2*[168───168]
├─check-parallel───nsexec───check───672───2*[672───672]
├─check-parallel───nsexec───check───170─┬─170───170───170
│ └─170───170
├─check-parallel───nsexec───check───531───122*[t_open_tmpfiles]
├─check-parallel───nsexec───check───387
├─check-parallel───nsexec───check───748
├─check-parallel───nsexec───check───388─┬─388.fsstress───4*[388.fsstress───{388.fsstress}]
│ └─sleep
├─check-parallel───nsexec───check───328───328
├─check-parallel───nsexec───check───352
├─check-parallel───nsexec───check───042
├─check-parallel───nsexec───check───426───open_by_handle
├─check-parallel───nsexec───check───756───2*[open_by_handle]
├─check-parallel───nsexec───check───227
├─check-parallel───nsexec───check───208───aio-dio-invalid───2*[aio-dio-invalid]
├─check-parallel───nsexec───check───746───cp
├─check-parallel───nsexec───check───187───187
├─check-parallel───nsexec───check───027───8*[027]
├─check-parallel───nsexec───check───045───xfs_io───{xfs_io}
├─check-parallel───nsexec───check───044
├─check-parallel───nsexec───check───204
├─check-parallel───nsexec───check───186───186
├─check-parallel───nsexec───check───449
├─check-parallel───nsexec───check───231───su───fsx
├─check-parallel───nsexec───check───509
├─check-parallel───nsexec───check───127───5*[127───fsx]
├─check-parallel───nsexec───check───047
├─check-parallel───nsexec───check───043
├─check-parallel───nsexec───check───475───pkill
├─check-parallel───nsexec───check───299─┬─fio─┬─4*[fio]
│ │ ├─2*[fio───4*[{fio}]]
│ │ └─{fio}
│ └─pgrep
├─check-parallel───nsexec───check───551───aio-dio-write-v
├─check-parallel───nsexec───check───323───aio-last-ref-he───100*[{aio-last-ref-he}]
├─check-parallel───nsexec───check───648───sleep
├─check-parallel───nsexec───check───046
├─check-parallel───nsexec───check───753─┬─753.fsstress───4*[753.fsstress]
│ └─pkill
├─check-parallel───nsexec───check───507───507
├─check-parallel───nsexec───check───629─┬─3*[629───xfs_io───{xfs_io}]
│ └─5*[629]
├─check-parallel───nsexec───check───073───umount
├─check-parallel───nsexec───check───615───615
├─check-parallel───nsexec───check───176───punch-alternati
├─check-parallel───nsexec───check───294
├─check-parallel───nsexec───check───236───236
├─check-parallel───nsexec───check───165─┬─165─┬─165─┬─cut
│ │ │ └─xfs_io───{xfs_io}
│ │ └─165───grep
│ └─165
├─check-parallel───nsexec───check───259───sync
├─check-parallel───nsexec───check───442───442.fsstress───4*[442.fsstress───{442.fsstress}]
├─check-parallel───nsexec───check───558───255*[558]
├─check-parallel───nsexec───check───358───358───358
├─check-parallel───nsexec───check───169───169
└─check-parallel───nsexec───check───297─┬─297.fsstress─┬─284*[297.fsstress───{297.fsstress}]
│ └─716*[297.fsstress]
└─sleep
A typical test run looks like:
$ time sudo ./check-parallel /mnt/xfs -s xfs -x dump
Runner 63 Failures: xfs/170
Runner 36 Failures: xfs/050
Runner 30 Failures: xfs/273
Runner 29 Failures: generic/135
Runner 25 Failures: generic/603
Tests run: 1140
Failure count: 5
Ten slowest tests - runtime in seconds:
xfs/013 454
generic/707 414
generic/017 398
generic/387 395
generic/748 390
xfs/140 351
generic/562 351
generic/705 347
generic/251 344
xfs/016 343
Cleanup on Aisle 5?

$ ls -l /dev/mapper
total 0
crw-------. 1 root root 10, 236 Nov 27 09:27 control
lrwxrwxrwx. 1 root root 7 Nov 27 09:27 fast -> ../dm-0
$ df -h | grep /mnt/xfs
/dev/mapper/fast 1.4T 192G 1.2T 14% /mnt/xfs
real 9m29.056s
user 0m0.005s
sys 0m0.022s
$
Yeah, that runtime is real - under 10 minutes for a full XFS auto
group test run. When running this normally (i.e. via check) on this
machine, it usually takes just under 4 hours to run the same set
of tests. i.e. I can run ./check-parallel roughly 25 times on this
machine in the same time it takes to run ./check once.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Zorro Lang <zlang@kernel.org>