Try and replace retpoline thunk calls with:
  LFENCE
  CALL    *%\reg
for spectre_v2=retpoline,amd.
Specifically, the sequence above is 5 bytes for the low 8 registers,
but 6 bytes for the high 8 registers. This means that unless the
compilers prefix stuff the call with higher registers this replacement
will fail.
Luckily GCC strongly favours RAX for the indirect calls and most (95%+
for defconfig-x86_64) will be converted. OTOH clang strongly favours
R11 and almost nothing gets converted.
Note: it will also generate a correct replacement for the Jcc.d32
case, except unless the compilers start to prefix stuff that, it'll
never fit. Specifically:
  Jncc.d8 1f
  LFENCE
  JMP     *%\reg
1:
is 7-8 bytes long, where the original instruction in unpadded form is
only 6 bytes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.359986601@infradead.org
  *
  *   CALL *%\reg
  *
+ * It also tries to inline spectre_v2=retpoline,amd when size permits.
  */
 static int patch_retpoline(void *addr, struct insn *insn, u8 *bytes)
 {
        /* If anyone ever does: CALL/JMP *%rsp, we're in deep trouble. */
        BUG_ON(reg == 4);
 
-       if (cpu_feature_enabled(X86_FEATURE_RETPOLINE))
+       if (cpu_feature_enabled(X86_FEATURE_RETPOLINE) &&
+           !cpu_feature_enabled(X86_FEATURE_RETPOLINE_AMD))
                return -1;
 
        op = insn->opcode.bytes[0];
         * into:
         *
         *   Jncc.d8 1f
+        *   [ LFENCE ]
         *   JMP *%\reg
-        *   NOP
+        *   [ NOP ]
         * 1:
         */
        /* Jcc.d32 second opcode byte is in the range: 0x80-0x8f */
                op = JMP32_INSN_OPCODE;
        }
 
+       /*
+        * For RETPOLINE_AMD: prepend the indirect CALL/JMP with an LFENCE.
+        */
+       if (cpu_feature_enabled(X86_FEATURE_RETPOLINE_AMD)) {
+               bytes[i++] = 0x0f;
+               bytes[i++] = 0xae;
+               bytes[i++] = 0xe8; /* LFENCE */
+       }
+
        ret = emit_indirect(op, reg, bytes + i);
        if (ret < 0)
                return ret;