ipc semaphores: reduce ipc_lock contention in semtimedop
One feature of the ipc semaphores is that the full set of operations
done per syscall is defined to be atomic. So if a semtimedop syscall
changes 100 semaphores, the kernel needs to test all 100 changes and
apply them only when all 100 can succeed without putting the process
to sleep.
Today we use a single lock per semaphore array (the ipc lock). This lock is
held every time we try a set of operations requested by userland, and
is taken again when a process is woken up.
Whenever a given set of changes sent to semtimedop would block, that
set is queued up on one big list of pending changes for the entire
semaphore array.
Whenever a semtimedop call changes a single semaphore value, it
walks the entire list of pending operations to see if any of them
can now succeed. The ipc lock is held for this entire loop.
This patch makes two major changes, pushing both the list of pending
operations and a spinlock down to each individual semaphore. Now:
Whenever a given semaphore modification is going to block, the set of
operations semtimedop wants to do is saved onto that semaphore's list.
Whenever a given semtimedop call changes a single semaphore value, it
walks the list of pending operations on that single semaphore to see if
they can now succeed. If any of the operations will block on a
different semaphore, they are moved to that semaphore's list.
The locking is now done per-semaphore. To apply the changes
atomically, the lock of every semaphore being changed is taken while we
test the requested operations. We sort the operations by semaphore
number so the locks are always taken in a consistent order, which
prevents deadlock in the kernel.
I have a microbenchmark that tests how quickly we can post and wait in
bulk. With this change, semtimedop is able to do more than twice
as much work in the same run. On a large NUMA machine, it brings
the IPC lock system time (as reported by perf) down from 85% to 15%.
Signed-off-by: Chris Mason <chris.mason@oracle.com>