io_uring: support SQE grouping
An SQE group is a chain of SQEs that starts with the first SQE that has
IOSQE_GROUP_LINK set and ends with the first subsequent SQE that doesn't
have it set. It is similar to a chain of linked SQEs.
However, unlike linked SQEs, where each SQE is issued after the previous
one has completed, all SQEs in one group can be submitted in parallel.
To keep the initial implementation simple, all members are queued after
the leader has completed. This may change in the future, so that the
leader and members are issued concurrently.
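
As an illustration of the boundary rule above, here is a minimal
userspace sketch, not code from this patch. It assumes the group flag
occupies bit 7 of the 8-bit sqe->flags field; sqe_group_len() is a
hypothetical helper that only models how a flags array is split into
groups.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the group flag uses the last free bit of sqe->flags. */
#define IOSQE_GROUP_LINK	(1U << 7)

/*
 * Hypothetical helper: return how many SQEs belong to the unit that
 * starts at flags[start]. A plain SQE counts as 1. A group runs from
 * the first SQE with the flag set through the first later SQE without
 * it; if the array ends first, the group is cut off at the array end.
 */
static size_t sqe_group_len(const uint8_t *flags, size_t n, size_t start)
{
	size_t i = start;

	assert(start < n);

	/* Skip over every SQE that still carries the group flag. */
	while (i < n && (flags[i] & IOSQE_GROUP_LINK))
		i++;

	if (i == start)
		return 1;	/* flag not set: not a group */
	if (i < n)
		i++;		/* include the terminating SQE */
	return i - start;
}
```

With flags {GROUP, GROUP, 0, 0}, the first three SQEs form one group
(leader plus two members) and the fourth is a standalone SQE.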
The first SQE is the group leader, and the other SQEs are group members.
The whole group shares a single IOSQE_IO_LINK and IOSQE_IO_DRAIN from the
group leader, and the two flags can't be set for group members. For the
sake of simplicity, IORING_OP_LINK_TIMEOUT is disallowed for SQE groups
for now.
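
The flag restriction on members can be sketched as a small userspace
check. This is an illustration only, not the kernel's validation code:
group_flags_valid() is a hypothetical name, the bit position of
IOSQE_GROUP_LINK is an assumption, and the IORING_OP_LINK_TIMEOUT
restriction is left out of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IOSQE_IO_DRAIN		(1U << 1)	/* existing uapi bit */
#define IOSQE_IO_LINK		(1U << 2)	/* existing uapi bit */
#define IOSQE_GROUP_LINK	(1U << 7)	/* assumed bit position */

/*
 * Hypothetical check over one group: flags[0] belongs to the leader,
 * flags[1..n-1] to the members. Only the leader may carry
 * IOSQE_IO_LINK or IOSQE_IO_DRAIN.
 */
static bool group_flags_valid(const uint8_t *flags, size_t n)
{
	assert(n > 0 && (flags[0] & IOSQE_GROUP_LINK));

	for (size_t i = 1; i < n; i++) {
		if (flags[i] & (IOSQE_IO_LINK | IOSQE_IO_DRAIN))
			return false;
	}
	return true;
}
```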
When a group is part of a link chain, it isn't submitted until the
previous SQE or group has completed, and the following SQE or group
can't be started before this group has completed. Failure of any group
member fails the group leader, which in turn terminates the link
chain.
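
The failure rule can be modelled in a few lines. This is a simplified
sketch under assumptions, not the kernel logic: results are treated
like CQE res values (negative means failure), and both function names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: res[0] is the leader's result, res[1..n-1] are
 * the members' results. The group counts as failed if any of them is
 * negative, because a failing member also fails the leader.
 */
static bool group_failed(const int *res, size_t n)
{
	assert(n > 0);

	for (size_t i = 0; i < n; i++) {
		if (res[i] < 0)
			return true;
	}
	return false;
}

/* The link chain proceeds past the group only if the group succeeded. */
static bool link_chain_continues(const int *res, size_t n)
{
	return !group_failed(res, n);
}
```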
When IOSQE_IO_DRAIN is set for the group leader, all requests in this
group and all previously submitted requests are drained. Given that
IOSQE_IO_DRAIN can be set for the group leader only, we respect IO_DRAIN
by always completing the group leader last in the group. It is also
natural, from the application's viewpoint, to post the leader's CQE last.
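
The "leader completes last" ordering can be pictured with a small
counter-based model. This is only a sketch of the idea, not the data
structure used by the patch: struct group_model and its helpers are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one group's completion state. */
struct group_model {
	int pending_members;	/* members that have not completed yet */
	bool leader_done;	/* the leader's own I/O has finished */
	bool leader_cqe_posted;	/* the leader's CQE is visible */
};

/* Post the leader's CQE only once every member has completed. */
static void try_post_leader(struct group_model *g)
{
	if (g->leader_done && g->pending_members == 0)
		g->leader_cqe_posted = true;
}

static void leader_complete(struct group_model *g)
{
	g->leader_done = true;
	try_post_leader(g);
}

static void member_complete(struct group_model *g)
{
	assert(g->pending_members > 0);
	g->pending_members--;
	try_post_leader(g);
}
```

In this model the leader's CQE stays deferred after leader_complete()
until the last member_complete() call brings the counter to zero.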
Combined with IOSQE_IO_LINK, SQE grouping provides a flexible way to
support N:M dependencies, such as:
- group A is chained with group B
- group A has N SQEs
- group B has M SQEs
then the M SQEs in group B depend on the N SQEs in group A.
N:M dependencies support some interesting use cases in an efficient
way:
1) read from multiple files, then write the read data into a single file
2) read from a single file, then write the read data into multiple files
3) write the same data into multiple files, then read the data back from
the files and verify that the correct data was written
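
A possible flag layout for an N:M dependency is sketched below. It is
an illustration under assumptions, not a liburing API: fill_group()
and fill_n_to_m() are hypothetical helpers, the bit position of
IOSQE_GROUP_LINK is assumed, and each group needs at least two SQEs
because the last member terminates the group by leaving the flag clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOSQE_IO_LINK		(1U << 2)	/* existing uapi bit */
#define IOSQE_GROUP_LINK	(1U << 7)	/* assumed bit position */

/*
 * Hypothetical helper: mark n consecutive SQEs as one group. Every SQE
 * except the last carries the group flag; leader_extra is OR-ed into
 * the leader only, since link and drain flags belong to the leader.
 */
static void fill_group(uint8_t *flags, size_t n, uint8_t leader_extra)
{
	assert(n >= 2);

	for (size_t i = 0; i + 1 < n; i++)
		flags[i] = IOSQE_GROUP_LINK;
	flags[n - 1] = 0;
	flags[0] |= leader_extra;
}

/* Group A (n SQEs) is linked to group B (m SQEs) via A's leader. */
static void fill_n_to_m(uint8_t *flags, size_t n, size_t m)
{
	fill_group(flags, n, IOSQE_IO_LINK);
	fill_group(flags + n, m, 0);
}
```

For use case 1 above, group A would hold the N reads and group B would
be the dependent write side; the array passed to fill_n_to_m() must
have room for n + m entries.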
Note: this grabs the last sqe flag bit available. If we need more flags
in the future, then see the linked patch/discussion for how that could
be done with a setup flag.
Link: https://lore.kernel.org/io-uring/e60a3dd3-3a74-4181-8430-90c106a202f6@kernel.dk/
Link: https://lore.kernel.org/io-uring/d86e060f-be37-4efe-8d58-95cf8a22d37e@kernel.dk/
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025122247.3709133-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>