xen/fpu: stts() before the local_irq_enable(), and clts() after the local_irq_disable().
Prior to Linux v4.2, the Linux scheduler allocated FPU state for a new
process in two stages. When a task is scheduled that hasn't
demonstrated a need for the FPU, the kernel sets CR0.TS=1. With
CR0.TS=1, any FPU operation the task attempts will trap (the CPU
won't execute it). This allows the OS to lazily allocate, for the
'task_struct', the memory where FPU registers will be saved/restored.
When the task performs an FPU operation (MMX/SSE/etc) for the first
time with CR0.TS=1, the hardware triggers an #NM exception
(do_device_not_available), and the exception handler
(math_state_restore) sets up the memory for the task's FPU
registers. It then returns to the application, allowing it to
execute the FPU operation (now with CR0.TS=0).
Thereafter, whenever a task that has used the FPU is loaded, CR0.TS
is cleared (0) so that the task can execute FPU operations unhindered.
Any task that is scheduled without having used the FPU gets
CR0.TS set (1). The kernel uses the PF_USED_MATH flag to tell the
two cases apart.
The example below should help cement this.
For simplicity we assume the guest/baremetal uses the lazy mechanism,
not the eager one. That makes 'switch_fpu_prepare' (called by schedule()) effectively:
if (previous task had PF_USED_MATH set)
stts (CR0.TS=1)
else
;
We also ignore the case where the task has used the FPU more than
five times, where we do things a bit differently.
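As a rough illustration of the simplified lazy scheme described above,
here is a small user-space model in C. It is only a sketch: the names
'struct task', 'cr0_ts', 'switch_from', 'device_not_available' and
'fpu_op' are made up for this example and are not the kernel's
definitions, and the #NM handler is assumed to never sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for illustration only; not the kernel's definitions. */
struct task {
	bool used_math;	/* stands in for PF_USED_MATH */
	int nm_traps;	/* how many #NM exceptions this task took */
};

bool cr0_ts = true;	/* models CR0.TS: when set, FPU operations trap */

/* Models the simplified switch_fpu_prepare() from the text. */
void switch_from(const struct task *prev)
{
	if (prev->used_math)
		cr0_ts = true;	/* stts() */
}

/* Models the #NM handler when the allocation does not sleep. */
void device_not_available(struct task *t)
{
	t->nm_traps++;
	cr0_ts = false;		/* clts() */
	t->used_math = true;
}

/* Models a task executing one FPU instruction. */
void fpu_op(struct task *t)
{
	if (cr0_ts)
		device_not_available(t);
	assert(!cr0_ts);	/* the instruction only runs with TS clear */
}
```

In this model a task traps on its first FPU operation, and switching
away from a task that has used the FPU re-arms the trap for the next
task, which is the behaviour the baremetal diagram walks through.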
The time diagram looks great at 132x42.
Let's assume that we have two tasks, A and B, neither of which has used
the FPU. This is on PVHVM (or baremetal):
CR0.TS=1 CR0.TS=1 CR0.TS=0 CR0.TS=1 CR0.TS=0
------------------------------------------------------------------------------------+--------+-------------------+-------+
task A | #NM |task B| |taskB | | task A | |taskA |
MMX |math_state_restore | | | | | | | |
op | \- fpu_init | | | | | | | |
| \- .. schedule() | | | | | | | |
| [swap task B] | | | | | | | |
| [since task A | | | | | | | |
| hadn't set | | | | | | | |
| PF_USED_MATH | | | | | | | |
| we don't muck| | | | | | | |
| with CR0.TS] | | | | | | | |
| |MMX op| | | | | | |
| | |#NM | | | | | |
| | |math_state_restore | | | | | |
| | | fpu_init worked | | | | | |
| | | clts() | | | | | |
| | |task_B->flags |= | | | | | |
| | | PF_USED_MATH | | | | | |
| | | return; | | | | | |
| | | |syscall| | | | |
| | | | |schedule() | | | |
| | | | |[swap task A] | | | |
| | | | |[taskB has | | | |
| | | | | PF_USED_MATH] | | | |
| | | | |[so CR0.TS=1] | | | |
| | | | | task A runs | | | |
| | | | | |MMX op | | |
| | | | | | |#NM | |
| | | | | | | fpu_init works | |
| | | | | | | clts() | |
| | | | | | | taskA->flags |= | |
| | | | | | | PF_USED_MATH | |
| | | | | | | return | |
| | | | | | | |MMX op |
However, the Xen PV ABI chose to take a shortcut. When the Xen hypervisor
receives an #NM it immediately clears the CR0.TS bit and executes the PV
kernel's do_device_not_available handler. That would be OK if the exception
handler immediately did 'clts' (CR0.TS=0), which it does 99% of the time,
except for the one case where:
* does a slab alloc which can sleep
*/
if (init_fpu(tsk)) {
which can end up calling 'schedule()' (and swapping to another task)
with the CR0.TS bit cleared.
The scheduler can schedule in an application that uses the FPU, and
since nobody has marked the task with PF_USED_MATH we end up
reusing the FPU registers across all the tasks. Ouch.
CR0.TS=1 CR0.TS=0 CR0.TS=0 CR0.TS=0 CR0.TS=0
[but Xen sets it to
CR0.TS=0 and calls
Linux #NM:]
------------------------------------------------------------------------------------+--------+-------------------+-------+
task A | #NM |task B| |taskB | | task A | |taskA |
MMX |math_state_restore | | | | | | | |
op | \- fpu_init | | | | | | | |
| \- .. schedule() | | | | | | | |
| [swap task B] | | | | | | | |
| [since task A | | | | | | | |
| hadn't set | | | | | | | |
| PF_USED_MATH | | | | | | | |
| we don't muck| | | | | | | |
| with CR0.TS] | | | | | | | |
| |MMX op| | | | | | |
| | |[no trap to Linux or| | | | | |
| | |Xen as CR0.TS=0] | | | | | |
| | | | | | | | |
| | |And task B clobbers | | | | | |
| | |task A FPU registers| | | | | |
| | |(or in the generic | | | | | |
| | |case whoever ran | | | | | |
| | |before task B). |syscall| | | | |
| | | | |schedule() | | | |
| | | | |[swap task A] | | | |
| | | | |[with task B] | | | |
| | | | | task A runs | | | |
| | | | | |MMX op | | |
| | | | | | |[again, no trap to | |
| | | | | | | Xen or Linux b/c | |
| | | | | | | CR0.TS=0 *1] | |
| | | | | | | |MMX op |
The [*1] refers to the Xen scheduler. If any of the syscalls that
the user application called ended up in the Linux kernel halt
routine (xen_safe_halt), Xen would deschedule the guest VCPU.
When that VCPU is rescheduled, Xen sets CR0.TS=1 again,
so #NM would function again.
Not pretty, and again, this only happens if fpu_alloc() ends
up calling schedule().
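The failure mode can be sketched with a small user-space model in C.
This is an illustration under assumptions, not kernel or Xen code: the
names 'struct pv_task', 'pv_cr0_ts', 'pv_switch_from',
'pv_device_not_available' and 'pv_fpu_op' are hypothetical, and the
'alloc_sleeps' parameter stands in for the slab allocation ending up
in schedule().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for illustration only; not the kernel's definitions. */
struct pv_task {
	bool used_math;	/* stands in for PF_USED_MATH */
	int nm_traps;	/* how many #NM exceptions this task took */
};

bool pv_cr0_ts = true;	/* models CR0.TS as seen by the PV guest */

/* Simplified switch_fpu_prepare(): only a PF_USED_MATH task sets TS. */
void pv_switch_from(const struct pv_task *prev)
{
	if (prev->used_math)
		pv_cr0_ts = true;	/* stts() */
}

/*
 * Models #NM delivery to a Xen PV guest as described in the text.
 * Returns true if the handler completed, false if the allocation
 * slept and the task was switched out first.
 */
bool pv_device_not_available(struct pv_task *t, bool alloc_sleeps)
{
	assert(pv_cr0_ts);	/* #NM is only raised with TS set */
	t->nm_traps++;
	pv_cr0_ts = false;	/* Xen clears TS before the guest handler */
	if (alloc_sleeps) {
		pv_switch_from(t);	/* schedule(): no PF_USED_MATH yet */
		return false;
	}
	t->used_math = true;
	return true;
}

/* Models one FPU instruction; returns true if it trapped. */
bool pv_fpu_op(struct pv_task *t, bool alloc_sleeps)
{
	if (!pv_cr0_ts)
		return false;	/* runs on whatever is in the registers */
	pv_device_not_available(t, alloc_sleeps);
	return true;
}
```

If task A traps and its allocation sleeps, the model leaves TS clear,
so a following FPU operation by task B does not trap and B is never
marked as having used the FPU.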
Upstream (v4.2), Ingo's FPU rewrite (~296 patches) fixed this.
(Tests ran for 2 weeks, whereas before they would have failed within
two hours.)
Digging in, it was due to:
commit 0c8c0f03e3a292e031596484275c14cf39c0ab7a
Author: Dave Hansen <dave@sr71.net>
Date: Fri Jul 17 12:28:11 2015 +0200
x86/fpu, sched: Dynamically allocate 'struct fpu'
The FPU rewrite removed the dynamic allocations of 'struct fpu'.
But, this potentially wastes massive amounts of memory (2k per
task on systems that do not have AVX-512 for instance).
Instead of having a separate slab, this patch just appends the
space that we need to the 'task_struct' which we dynamically
allocate already. This saves from doing an extra slab
allocation at fork().
When the Xen hypervisor calls the PV guest's #NM handler
('do_device_not_available'), the handler does:
fpu__restore(&current->thread.fpu); /* interrupts still off */
|+- fpu__activate_curr (which just inits the already allocated space)
| \- memset(state, 0, xstate_size);
|+- fpregs_activate
\- clts()
So there is no call to 'schedule()', and hence no leaking of the FPU
across different tasks.
This patch modifies (only for Xen PV guests) the state of
CR0.TS so that it is set whenever 'schedule()' may be called. If
'schedule()' is not called (fpu_alloc had no trouble getting
memory), we set CR0.TS back to zero (which may
not even be needed, as we do that later as well).
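The intended ordering can be sketched with a small user-space model in
C. This is not the patch itself: the names 'struct toy_task',
'toy_cr0_ts', 'toy_nm_fixed' and 'toy_fpu_op' are hypothetical, and
the 'alloc_sleeps' parameter stands in for the allocation ending up in
schedule(). The model also stops at the point where the task is
switched out and does not follow it when it later resumes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for illustration only; not the kernel's definitions. */
struct toy_task {
	bool used_math;	/* stands in for PF_USED_MATH */
	int nm_traps;	/* how many #NM exceptions this task took */
};

bool toy_cr0_ts = true;	/* models CR0.TS as seen by the PV guest */

/*
 * Models the patched #NM path for a Xen PV guest.
 * Returns true if the handler completed, false if the allocation slept.
 */
bool toy_nm_fixed(struct toy_task *t, bool alloc_sleeps)
{
	assert(toy_cr0_ts);	/* #NM is only raised with TS set */
	t->nm_traps++;
	toy_cr0_ts = false;	/* Xen clears TS before the guest handler */
	toy_cr0_ts = true;	/* the fix: stts() before local_irq_enable() */
	if (alloc_sleeps)
		return false;	/* schedule() is entered with TS still set */
	toy_cr0_ts = false;	/* clts() after local_irq_disable() */
	t->used_math = true;
	return true;
}

/* Models one FPU instruction; returns true if it trapped. */
bool toy_fpu_op(struct toy_task *t, bool alloc_sleeps)
{
	if (!toy_cr0_ts)
		return false;
	toy_nm_fixed(t, alloc_sleeps);
	return true;
}
```

With this ordering, a task that is scheduled in while the allocation
sleeps still takes its own #NM on its first FPU operation.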
Due to the wonder of paravirt and multicall batching, the
'stts' and 'clts' are not dispatched until arch_end_context_switch
is called (which is done in __switch_to, which 'schedule()' calls).
What that means is:
- If fpu_alloc() (well, SLAB) ends up calling 'schedule()'
the CR0.TS will get set when 'schedule()' is ready to start
the new thread.
- If fpu_alloc() had no trouble and there was no need for
'schedule()', then we flush out the multicall, effectively
doing CR0.TS=1 followed by CR0.TS=0, followed by CR0.TS=0 again.
The end result is the same.
P.S.
Multicalls are a mechanism to put a bunch of hypercalls into one
hypercall. A multicall can execute up to 32 hypercalls.
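To make the batching behaviour concrete, here is a toy user-space
model in C. It is a sketch only: 'struct batch', 'batch_add',
'batch_flush', 'hw_cr0_ts' and 'hypercalls' are made-up names and do
not correspond to the real Xen multicall interface. The limit of 32
entries is taken from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_MAX 32	/* batch size quoted in the text */

enum ts_op { OP_STTS, OP_CLTS };

/* Toy queue of pending CR0.TS updates; not the Xen multicall ABI. */
struct batch {
	enum ts_op ops[BATCH_MAX];
	size_t len;
};

bool hw_cr0_ts;		/* models the CR0.TS value the hardware sees */
int hypercalls;		/* how many times a batch was dispatched */

/* Apply all queued operations in order with a single "hypercall". */
void batch_flush(struct batch *b)
{
	if (b->len == 0)
		return;
	for (size_t i = 0; i < b->len; i++)
		hw_cr0_ts = (b->ops[i] == OP_STTS);
	b->len = 0;
	hypercalls++;
}

/* Queue one operation; nothing reaches the "hardware" until a flush. */
void batch_add(struct batch *b, enum ts_op op)
{
	if (b->len == BATCH_MAX)
		batch_flush(b);
	assert(b->len < BATCH_MAX);
	b->ops[b->len++] = op;
}
```

Queueing stts, clts, clts and then flushing leaves the modelled
CR0.TS clear after one dispatch, which matches the second case in the
list above. Queueing only stts and flushing leaves it set, which
matches the first.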
Oracle-Bug: 14768
Orabug: 20318090
Reported-and-Tested-by: Saar Maoz <Saar.Maoz@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>