www.infradead.org Git - nvme.git/log
5 weeks ago pidfs: improve multi-threaded exec and premature thread-group leader exit polling
Christian Brauner [Thu, 20 Mar 2025 13:24:08 +0000 (14:24 +0100)]
pidfs: improve multi-threaded exec and premature thread-group leader exit polling

This is another attempt to make pidfd polling for multi-threaded
exec and premature thread-group leader exit consistent.

A quick recap of these two cases:

(1) During a multi-threaded exec by a subthread, i.e., non-thread-group
    leader thread, all other threads in the thread-group including the
    thread-group leader are killed and the struct pid of the
    thread-group leader will be taken over by the subthread that called
    exec. IOW, two tasks change their TIDs.

(2) A premature thread-group leader exit means that the thread-group
    leader exited before all of the other subthreads in the thread-group
    have exited.

Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded, an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded, no exit notification is generated for the
old thread-group leader.

The correct behavior would be to simply not generate an exit
notification for the struct pid on a subthread exec because the struct
pid is taken over by the subthread and thus remains alive.

But this is difficult to handle because a thread-group leader may exit
prematurely as mentioned in (2). In that case an exit notification is
reliably generated, but the subthreads may continue to run for an
indeterminate amount of time and thus may also exec at some point.

Until now there was no way to distinguish between (1) and (2) internally.
This tiny series tries to address this problem by discarding
PIDFD_THREAD notifications on premature thread-group leader exit.

If that works correctly, then no exit notifications are generated for a
PIDFD_THREAD pidfd for a thread-group leader until all subthreads have
been reaped. If a subthread should exec afterwards, no exit notification
will be generated until that task exits or it creates subthreads and
repeats the cycle.
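
For illustration, the kind of polling discussed above can be exercised from
userspace roughly as follows. This is a minimal sketch and not part of the
patch: thread_exit_notified() is a hypothetical helper, the fallback values
for SYS_pidfd_open (434) and PIDFD_THREAD (O_EXCL) are assumed to match the
usual Linux definitions, and on kernels without PIDFD_THREAD support the
helper simply returns -1.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older headers; assumed to match the kernel uapi values. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef PIDFD_THREAD
#define PIDFD_THREAD O_EXCL
#endif

/*
 * Hypothetical helper: open a PIDFD_THREAD pidfd for @tid and check,
 * without blocking, whether an exit notification is pending on it.
 * Returns 1 if POLLIN is set, 0 if not, and -1 on error (for example
 * when the running kernel does not support PIDFD_THREAD).
 */
int thread_exit_notified(pid_t tid)
{
	struct pollfd pfd = { .events = POLLIN };
	int ret;

	pfd.fd = (int)syscall(SYS_pidfd_open, tid, PIDFD_THREAD);
	if (pfd.fd < 0)
		return -1;

	ret = poll(&pfd, 1, 0); /* timeout 0: only sample the current state */
	close(pfd.fd);
	if (ret < 0)
		return -1;

	return (pfd.revents & POLLIN) ? 1 : 0;
}
```

Calling thread_exit_notified(getpid()) from a running process samples the
thread-group leader itself, so no notification should be pending.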

Co-Developed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320-work-pidfs-thread_group-v4-1-da678ce805bf@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
5 weeks ago pidfs: ensure that PIDFS_INFO_EXIT is available
Christian Brauner [Sun, 16 Mar 2025 12:49:09 +0000 (13:49 +0100)]
pidfs: ensure that PIDFS_INFO_EXIT is available

Currently, when we create a pidfd, we check that the task hasn't been
reaped right before we create the pidfd. But it is of course possible
that by the time we return the pidfd to userspace the task has already
been reaped, since we don't check again after having created a dentry
for it.

This was fine until now because that race was meaningless. But now that
we provide PIDFD_INFO_EXIT it is a problem because it is possible that
the kernel returns a pidfd for a reaped task and it depends on the race
whether PIDFD_INFO_EXIT information is available. This depends on whether
the task gets reaped before or after a dentry has been attached to
struct pid.

Make this consistent and only return pidfds for reaped tasks if
PIDFD_INFO_EXIT information is available. This is done by performing
another check whether the task has been reaped right after we have
attached a dentry to struct pid.

Since pidfs_exit() is called before struct pid's task linkage is removed,
the case where the task got reaped but a dentry was already attached to
struct pid and exit information was recorded and published can be
handled correctly. In that case we do return a pidfd for a reaped task
like we would have before.
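
The check, attach, check-again ordering described above can be modeled in a
few lines of plain C. This is only a single-threaded toy model under stated
assumptions: struct model_pid, model_reap() and model_pidfd_open() are
hypothetical names that stand in for kernel state and are not the real pidfs
code. The reap_at parameter picks where the simulated reaping lands.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant bits of kernel state. */
struct model_pid {
	bool has_task;        /* task linkage still present (not reaped) */
	bool dentry_attached; /* a pidfs dentry is attached to the pid */
	bool exit_info;       /* exit information recorded and published */
};

/*
 * Model of reaping: exit information is recorded (only if a dentry is
 * already attached) before the task linkage is removed.
 */
void model_reap(struct model_pid *pid)
{
	if (pid->dentry_attached)
		pid->exit_info = true;
	pid->has_task = false;
}

/*
 * Model of pidfd creation. @reap_at simulates the race:
 *   0 = task is not reaped while the pidfd is created,
 *   1 = task is reaped between the first check and attaching the dentry,
 *   2 = task is reaped after the dentry has been attached.
 * Returns 0 on success or -ESRCH if no consistent pidfd can be returned.
 */
int model_pidfd_open(struct model_pid *pid, int reap_at)
{
	if (!pid->has_task)
		return -ESRCH; /* first check, before creating the pidfd */

	if (reap_at == 1)
		model_reap(pid);

	pid->dentry_attached = true;

	if (reap_at == 2)
		model_reap(pid);

	/* Second check: a reaped task is fine only if exit info exists. */
	if (!pid->has_task && !pid->exit_info)
		return -ESRCH;

	return 0;
}
```

In this model only the reap_at == 1 case fails, which is the case where a
pidfd without exit information would otherwise have been handed out.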

Link: https://lore.kernel.org/r/20250316-kabel-fehden-66bdb6a83436@brauner
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago Merge patch series "pidfs: provide information after task has been reaped"
Christian Brauner [Wed, 5 Mar 2025 12:07:51 +0000 (13:07 +0100)]
Merge patch series "pidfs: provide information after task has been reaped"

Christian Brauner <brauner@kernel.org> says:

Various tools need access to information about a process/task even after
it has already been reaped. For example, systemd's journal logs and uses
such information as the cgroup id and exit status to deal with processes
that have been sent via SCM_PIDFD or SCM_PEERPIDFD. By the time the
pidfd is received the process might have already been reaped.

This series aims to provide information by extending the PIDFD_GET_INFO
ioctl to retrieve the exit code and cgroup id. There might be other
stuff that we would want in the future.

Pidfd polling allows waiting on either task exit or for a task to have
been reaped. The contract for PIDFD_INFO_EXIT is simply that EPOLLHUP
must be observed before exit information can be retrieved, i.e., exit
information is only provided once the task has been reaped.

Note that if a thread-group leader exits before other threads in the
thread-group then exit information will only be available once the
thread-group is empty. This aligns with wait() as well, where reaping of
a thread-group leader that exited before the thread-group was empty is
delayed until the thread-group is empty.

With PIDFD_INFO_EXIT autoreaping might actually become usable because it
means a parent can ignore SIGCHLD or set SA_NOCLDWAIT and simply use
pidfd polling and PIDFD_INFO_EXIT to get status information for its
children. The kernel will clean up automatically right away instead of
delaying.
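
A rough userspace sketch of that pattern is shown below. It is an
illustration under assumptions, not code from the series:
autoreaped_child_status() is a hypothetical helper, the sketch_ prefixed
struct and ioctl number are a local mirror that is assumed to match the
kernel's pidfd uapi header, and the 2 second poll timeout is arbitrary. On
kernels without PIDFD_INFO_EXIT the helper returns -1.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/* Local mirror of the pidfd uapi; assumed to match <linux/pidfd.h>. */
#define SKETCH_PIDFD_INFO_EXIT (1ULL << 3)
struct sketch_pidfd_info {
	uint64_t mask;
	uint64_t cgroupid;
	uint32_t pid, tgid, ppid;
	uint32_t ruid, rgid, euid, egid, suid, sgid, fsuid, fsgid;
	int32_t exit_code;
};
#define SKETCH_PIDFD_GET_INFO _IOWR(0xFF, 11, struct sketch_pidfd_info)

/*
 * Fork a child that exits with @status while SIGCHLD uses SA_NOCLDWAIT,
 * then read the status back through the pidfd instead of wait().
 * Returns the exit status, or -1 if a step fails or is unsupported.
 */
int autoreaped_child_status(int status)
{
	struct sketch_pidfd_info info = { .mask = SKETCH_PIDFD_INFO_EXIT };
	/* events = 0: POLLHUP is reported regardless of requested events. */
	struct pollfd pfd = { .fd = -1, .events = 0 };
	struct sigaction sa, old;
	int gate[2], ret = -1;
	pid_t child;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_DFL;
	sa.sa_flags = SA_NOCLDWAIT; /* children are reaped automatically */
	if (sigaction(SIGCHLD, &sa, &old) < 0)
		return -1;
	if (pipe(gate) < 0)
		goto restore;

	child = fork();
	if (child == 0) {
		char c;

		close(gate[1]);
		/* Block until the parent holds a pidfd, then exit. */
		if (read(gate[0], &c, 1) < 0)
			_exit(127);
		_exit(status);
	}
	close(gate[0]);
	if (child > 0)
		pfd.fd = (int)syscall(SYS_pidfd_open, child, 0);
	close(gate[1]); /* EOF on the pipe lets the child proceed to _exit() */
	if (pfd.fd < 0)
		goto restore;

	/* Exit information is only promised once POLLHUP has been seen. */
	if (poll(&pfd, 1, 2000) == 1 && (pfd.revents & POLLHUP) &&
	    ioctl(pfd.fd, SKETCH_PIDFD_GET_INFO, &info) == 0 &&
	    (info.mask & SKETCH_PIDFD_INFO_EXIT) && WIFEXITED(info.exit_code))
		ret = WEXITSTATUS(info.exit_code);
	close(pfd.fd);
restore:
	sigaction(SIGCHLD, &old, NULL);
	return ret;
}
```

The pipe only ensures that the pidfd is opened while the child is still
alive. No wait() call is made at any point; the status comes from the pidfd.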

This includes extensive selftests, including for thread-group behavior and
multi-threaded exec by a non-thread-group leader thread.

* patches from https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-0-c8c3d8361705@kernel.org:
  selftests/pidfd: add seventh PIDFD_INFO_EXIT selftest
  selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest
  selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest
  selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest
  selftests/pidfd: add third PIDFD_INFO_EXIT selftest
  selftests/pidfd: add second PIDFD_INFO_EXIT selftest
  selftests/pidfd: add first PIDFD_INFO_EXIT selftest
  selftests/pidfd: expand common pidfd header
  pidfs/selftests: ensure correct headers for ioctl handling
  selftests/pidfd: fix header inclusion
  pidfs: allow to retrieve exit information
  pidfs: record exit code and cgroupid at exit
  pidfs: use private inode slab cache
  pidfs: move setting flags into pidfs_alloc_file()
  pidfd: rely on automatic cleanup in __pidfd_prepare()
  pidfs: switch to copy_struct_to_user()

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-0-c8c3d8361705@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago selftests/pidfd: add seventh PIDFD_INFO_EXIT selftest
Christian Brauner [Wed, 5 Mar 2025 10:08:26 +0000 (11:08 +0100)]
selftests/pidfd: add seventh PIDFD_INFO_EXIT selftest

Add a selftest for PIDFD_INFO_EXIT behavior.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-16-c8c3d8361705@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest
Christian Brauner [Wed, 5 Mar 2025 10:08:25 +0000 (11:08 +0100)]
selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest

Add a selftest for PIDFD_INFO_EXIT behavior.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-15-c8c3d8361705@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest
Christian Brauner [Wed, 5 Mar 2025 10:08:24 +0000 (11:08 +0100)]
selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest

Add a selftest for PIDFD_INFO_EXIT behavior.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-14-c8c3d8361705@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest
Christian Brauner [Wed, 5 Mar 2025 10:08:23 +0000 (11:08 +0100)]
selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest

Add a selftest for PIDFD_INFO_EXIT behavior.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-13-c8c3d8361705@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago selftests/pidfd: add third PIDFD_INFO_EXIT selftest
Christian Brauner [Wed, 5 Mar 2025 10:08:22 +0000 (11:08 +0100)]
selftests/pidfd: add third PIDFD_INFO_EXIT selftest

Add a selftest for PIDFD_INFO_EXIT behavior.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-12-c8c3d8361705@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago selftests/pidfd: add second PIDFD_INFO_EXIT selftest
Christian Brauner [Wed, 5 Mar 2025 10:08:21 +0000 (11:08 +0100)]
selftests/pidfd: add second PIDFD_INFO_EXIT selftest

Add a selftest for PIDFD_INFO_EXIT behavior.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-11-c8c3d8361705@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago selftests/pidfd: add first PIDFD_INFO_EXIT selftest
Christian Brauner [Wed, 5 Mar 2025 10:08:20 +0000 (11:08 +0100)]
selftests/pidfd: add first PIDFD_INFO_EXIT selftest

Add a selftest for PIDFD_INFO_EXIT behavior.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-10-c8c3d8361705@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago selftests/pidfd: expand common pidfd header
Christian Brauner [Wed, 5 Mar 2025 10:08:19 +0000 (11:08 +0100)]
selftests/pidfd: expand common pidfd header

Move more infrastructure to the pidfd header.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-9-c8c3d8361705@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago pidfs/selftests: ensure correct headers for ioctl handling
Christian Brauner [Wed, 5 Mar 2025 10:08:18 +0000 (11:08 +0100)]
pidfs/selftests: ensure correct headers for ioctl handling

Ensure that necessary ioctl infrastructure is available.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-8-c8c3d8361705@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago selftests/pidfd: fix header inclusion
Christian Brauner [Wed, 5 Mar 2025 10:08:17 +0000 (11:08 +0100)]
selftests/pidfd: fix header inclusion

Ensure that necessary defines are present.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-7-c8c3d8361705@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago pidfs: allow to retrieve exit information
Christian Brauner [Wed, 5 Mar 2025 10:08:16 +0000 (11:08 +0100)]
pidfs: allow to retrieve exit information

Some tools like systemd's journal need to retrieve the exit and cgroup
information after a process has already been reaped. This can happen,
e.g., when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-6-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago pidfs: record exit code and cgroupid at exit
Christian Brauner [Wed, 5 Mar 2025 10:08:15 +0000 (11:08 +0100)]
pidfs: record exit code and cgroupid at exit

Record the exit code and cgroupid in release_task() and stash them in
struct pidfs_exit_info so they can be retrieved even after the task has
been reaped.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago pidfs: use private inode slab cache
Christian Brauner [Wed, 5 Mar 2025 10:08:14 +0000 (11:08 +0100)]
pidfs: use private inode slab cache

Introduce a private inode slab cache for pidfs. In follow-up patches
pidfs will gain the ability to provide exit information to userspace
after the task has been reaped. This means storing exit information even
after the task has already been released and struct pid's task linkage
is gone. Store that information alongside the inode.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-4-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago pidfs: move setting flags into pidfs_alloc_file()
Christian Brauner [Wed, 5 Mar 2025 10:08:13 +0000 (11:08 +0100)]
pidfs: move setting flags into pidfs_alloc_file()

Instead of setting the flags in __pidfd_prepare(), set them where the
actual file allocation happens and update the outdated comment.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-3-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago pidfd: rely on automatic cleanup in __pidfd_prepare()
Christian Brauner [Wed, 5 Mar 2025 10:08:12 +0000 (11:08 +0100)]
pidfd: rely on automatic cleanup in __pidfd_prepare()

Rely on scope-based cleanup for the allocated file descriptor.

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-2-c8c3d8361705@kernel.org
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
7 weeks ago pidfs: switch to copy_struct_to_user()
Christian Brauner [Wed, 5 Mar 2025 10:08:11 +0000 (11:08 +0100)]
pidfs: switch to copy_struct_to_user()

We have a helper that deals with all the required logic.
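
To give an idea of the kind of logic such a helper takes care of, here is a
userspace model of copying a size-versioned struct to a caller whose struct
may be smaller or larger. It is a sketch under assumptions, not the kernel
helper itself: copy_struct_out() is a hypothetical name, plain memcpy() and
memset() stand in for the user-copy primitives, and the handling of
ignored_trailing follows my reading of the helper's documented intent.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Model: copy a @ksize byte source struct into a @usize byte
 * destination buffer.
 *  - Only the common prefix min(usize, ksize) is copied.
 *  - If the destination is larger, its trailing bytes are zeroed.
 *  - If the destination is smaller, *@ignored_trailing (optional)
 *    reports whether any non-zero source byte was cut off.
 */
int copy_struct_out(void *dst, size_t usize, const void *src, size_t ksize,
		    bool *ignored_trailing)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + size;
	bool nonzero = false;

	/* Newer caller, older source: clear what the source cannot fill. */
	if (usize > ksize)
		memset((unsigned char *)dst + size, 0, usize - ksize);

	/* Older caller, newer source: detect data the caller cannot see. */
	for (size_t i = 0; usize < ksize && i < ksize - usize; i++)
		nonzero |= tail[i] != 0;
	if (ignored_trailing)
		*ignored_trailing = nonzero;

	/* Copy the part both sides understand. */
	memcpy(dst, src, size);
	return 0;
}
```

With this shape a struct can grow new trailing fields over time while
callers built against an older, smaller layout keep working.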

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-1-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>