Christian Brauner [Thu, 20 Mar 2025 13:24:08 +0000 (14:24 +0100)]
pidfs: improve multi-threaded exec and premature thread-group leader exit polling
This is another attempt trying to make pidfd polling for multi-threaded
exec and premature thread-group leader exit consistent.
A quick recap of these two cases:
(1) During a multi-threaded exec by a subthread, i.e., non-thread-group
leader thread, all other threads in the thread-group including the
thread-group leader are killed and the struct pid of the
thread-group leader will be taken over by the subthread that called
exec. IOW, two tasks change their TIDs.
(2) A premature thread-group leader exit means that the thread-group
leader exited before all of the other subthreads in the thread-group
have exited.
Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded no exit notification is generated for the
old thread-group leader.
The correct behavior would be to simply not generate an exit
notification on the struct pid of a subhthread exec because the struct
pid is taken over by the subthread and thus remains alive.
But this is difficult to handle because a thread-group may exit
prematurely as mentioned in (2). In that case an exit notification is
reliably generated but the subthreads may continue to run for an
indeterminate amount of time and thus also may exec at some point.
So far there was no way to distinguish between (1) and (2) internally.
This tiny series tries to address this problem by discarding
PIDFD_THREAD notification on premature thread-group leader exit.
If that works correctly then no exit notifications are generated for a
PIDFD_THREAD pidfd for a thread-group leader until all subthreads have
been reaped. If a subthread should exec aftewards no exit notification
will be generated until that task exits or it creates subthreads and
repeates the cycle.
Christian Brauner [Sun, 16 Mar 2025 12:49:09 +0000 (13:49 +0100)]
pidfs: ensure that PIDFS_INFO_EXIT is available
When we currently create a pidfd we check that the task hasn't been
reaped right before we create the pidfd. But it is of course possible
that by the time we return the pidfd to userspace the task has already
been reaped since we don't check again after having created a dentry for
it.
This was fine until now because that race was meaningless. But now that
we provide PIDFD_INFO_EXIT it is a problem because it is possible that
the kernel returns a reaped pidfd and it depends on the race whether
PIDFD_INFO_EXIT information is available. This depends on if the task
gets reaped before or after a dentry has been attached to struct pid.
Make this consistent and only returned pidfds for reaped tasks if
PIDFD_INFO_EXIT information is available. This is done by performing
another check whether the task has been reaped right after we attached a
dentry to struct pid.
Since pidfs_exit() is called before struct pid's task linkage is removed
the case where the task got reaped but a dentry was already attached to
struct pid and exit information was recorded and published can be
handled correctly. In that case we do return a pidfd for a reaped task
like we would've before.
Christian Brauner [Wed, 5 Mar 2025 12:07:51 +0000 (13:07 +0100)]
Merge patch series "pidfs: provide information after task has been reaped"
Christian Brauner <brauner@kernel.org> says:
Various tools need access to information about a process/task even after
it has already been reaped. For example, systemd's journal logs and uses
such information as the cgroup id and exit status to deal with processes
that have been sent via SCM_PIDFD or SCM_PEERPIDFD. By the time the
pidfd is received the process might have already been reaped.
This series aims to provide information by extending the PIDFD_GET_INFO
ioctl to retrieve the exit code and cgroup id. There might be other
stuff that we would want in the future.
Pidfd polling allows waiting on either task exit or for a task to have
been reaped. The contract for PIDFD_INFO_EXIT is simply that EPOLLHUP
must be observed before exit information can be retrieved, i.e., exit
information is only provided once the task has been reaped.
Note, that if a thread-group leader exits before other threads in the
thread-group then exit information will only be available once the
thread-group is empty. This aligns with wait() as well, where reaping of
a thread-group leader that exited before the thread-group was empty is
delayed until the thread-group is empty.
With PIDFD_INFO_EXIT autoreaping might actually become usable because it
means a parent can ignore SIGCHLD or set SA_NOCLDWAIT and simply use
pidfd polling and PIDFD_INFO_EXIT to get get status information for its
children. The kernel will autocleanup right away instead of delaying.
This includes expansive selftests including for thread-group behior and
multi-threaded exec by a non-thread-group leader thread.
* patches from https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-0-c8c3d8361705@kernel.org:
selftests/pidfd: add seventh PIDFD_INFO_EXIT selftest
selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest
selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest
selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest
selftests/pidfd: add third PIDFD_INFO_EXIT selftest
selftests/pidfd: add second PIDFD_INFO_EXIT selftest
selftests/pidfd: add first PIDFD_INFO_EXIT selftest
selftests/pidfd: expand common pidfd header
pidfs/selftests: ensure correct headers for ioctl handling
selftests/pidfd: fix header inclusion
pidfs: allow to retrieve exit information
pidfs: record exit code and cgroupid at exit
pidfs: use private inode slab cache
pidfs: move setting flags into pidfs_alloc_file()
pidfd: rely on automatic cleanup in __pidfd_prepare()
pidfs: switch to copy_struct_to_user()
Christian Brauner [Wed, 5 Mar 2025 10:08:16 +0000 (11:08 +0100)]
pidfs: allow to retrieve exit information
Some tools like systemd's jounral need to retrieve the exit and cgroup
information after a process has already been reaped. This can e.g.,
happen when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD.
Christian Brauner [Wed, 5 Mar 2025 10:08:14 +0000 (11:08 +0100)]
pidfs: use private inode slab cache
Introduce a private inode slab cache for pidfs. In follow-up patches
pidfs will gain the ability to provide exit information to userspace
after the task has been reaped. This means storing exit information even
after the task has already been released and struct pid's task linkage
is gone. Store that information alongside the inode.