Discussion:
[Xen-devel] Ongoing/future speculative mitigation work
Andrew Cooper
2018-10-18 17:46:22 UTC
Hello,

This is an accumulation and summary of various tasks which have been
discussed since the revelation of the speculative security issues in
January, and also an invitation to discuss alternative ideas.  They are
x86 specific, but a lot of the principles are architecture-agnostic.

1) A secrets-free hypervisor.

Basically every hypercall can be (ab)used by a guest, and used as an
arbitrary cache-load gadget.  Logically, this is the first half of a
Spectre SP1 gadget, and is usually the first stepping stone to
exploiting one of the speculative sidechannels.

Short of compiling Xen with LLVM's Speculative Load Hardening (which is
still experimental, and comes with a ~30% perf hit in the common case),
this is unavoidable.  Furthermore, throwing a few array_index_nospec()
into the code isn't a viable solution to the problem.
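
As an illustration of the problem (a sketch, not Xen code: Xen's real
clamp lives in xen/include/xen/nospec.h, and the table/function names
here are invented), the shape of such a gadget, and of the clamp, is
roughly:

    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE 64
    static uint64_t table[TABLE_SIZE];

    /* Semantic model of array_index_nospec(): yields idx if idx < size,
     * and 0 otherwise.  Real implementations compute the mask in a way
     * which also holds under speculative execution. */
    static inline size_t array_index_nospec(size_t idx, size_t size)
    {
        size_t mask = (size_t)0 - (size_t)(idx < size);

        return idx & mask;
    }

    /* With a guest-controlled idx, the bounds check can be bypassed
     * speculatively, turning table[idx] into a cache-load gadget
     * unless the index is clamped as below. */
    uint64_t lookup(size_t idx)
    {
        if ( idx >= TABLE_SIZE )
            return 0;

        return table[array_index_nospec(idx, TABLE_SIZE)];
    }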

An alternative option is to have less data mapped into Xen's virtual
address space - if a piece of memory isn't mapped, it can't be loaded
into the cache.

An easy first step here is to remove Xen's directmap, which will mean
that guests' general RAM isn't mapped by default into Xen's address
space.  This will come with some performance hit, as the
map_domain_page() infrastructure will now have to actually
create/destroy mappings, but removing the directmap will cause an
improvement for non-speculative security as well (No possibility of
ret2dir as an exploit technique).
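
For illustration, the access pattern would look roughly like this (a
sketch using Xen's existing map_domain_page() API from
xen/include/xen/domain_page.h; the helper name and the absent error
handling are mine):

    #include <xen/domain_page.h>
    #include <xen/string.h>

    /* With no directmap, guest RAM is reached via a transient mapping
     * rather than via mfn_to_virt().  Illustrative helper only. */
    static void copy_from_guest_frame(void *dst, mfn_t mfn,
                                      unsigned int offset, unsigned int len)
    {
        const void *va = map_domain_page(mfn);  /* create mapping on demand */

        memcpy(dst, va + offset, len);
        unmap_domain_page(va);                  /* and tear it down again */
    }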

Beyond the directmap, there are plenty of other interesting secrets in
the Xen heap and other mappings, such as the stacks of the other pcpus. 
Fixing this requires moving Xen to having a non-uniform memory layout,
and this is much harder to change.  I already experimented with this as
a meltdown mitigation around about a year ago, and posted the resulting
series on Jan 4th,
https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00274.html,
some trivial bits of which have already found their way upstream.

To have a non-uniform memory layout, Xen may not share L4 pagetables. 
i.e. Xen must never have two pcpus which reference the same pagetable in
%cr3.

This property already holds for 32bit PV guests, and all HVM guests, but
64bit PV guests are the sticking point.  Because Linux has a flat memory
layout, when a 64bit PV guest schedules two threads from the same
process on separate vcpus, those two vcpus have the same virtual %cr3,
and currently, Xen programs the same real %cr3 into hardware.

If we want Xen to have a non-uniform layout, our two options are:
* Fix Linux to have the same non-uniform layout that Xen wants
(Backwards compatibility for older 64bit PV guests can be achieved with
xen-shim).
* Use the XPTI algorithm (specifically, the pagetable sync/copy part)
forevermore.

Option 2 isn't great (especially for perf on hardware which has the
Meltdown fix, where XPTI would otherwise be unnecessary), but does
keep all the necessary changes in Xen.  Option 1 looks to be the better
option longterm.

As an interesting point to note: the 32bit PV ABI prohibits sharing of
L3 pagetables, because back in the 32bit hypervisor days, we used to
have linear mappings in the Xen virtual range.  This check is stale
(from a functionality point of view), but still present in Xen.  A
consequence of this is that 32bit PV guests definitely don't share
top-level pagetables across vcpus.

Juergen/Boris: Do you have any idea if/how easy this infrastructure
would be to implement for 64bit PV guests as well?  If a PV guest can
advertise via Elfnote that it won't share top-level pagetables, then we
can audit this trivially in Xen.
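
For illustration, such an advertisement could be carried in an ELF note
along these lines (entirely hypothetical: the note number and semantics
below are invented for this sketch; Linux's real Xen notes are emitted
from assembly in arch/x86/xen/xen-head.S):

    #include <stdint.h>

    #define XEN_ELFNOTE_NO_SHARED_TOPLEVEL 99  /* hypothetical number */

    /* "This kernel never shares top-level pagetables across vcpus." */
    static const struct {
        uint32_t namesz, descsz, type;
        char name[4];           /* "Xen" + NUL */
        uint32_t desc;
    } __attribute__((packed))
    no_shared_toplevel
    __attribute__((used, section(".note.Xen"), aligned(4))) = {
        .namesz = 4,
        .descsz = 4,
        .type   = XEN_ELFNOTE_NO_SHARED_TOPLEVEL,
        .name   = "Xen",
        .desc   = 1,
    };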


2) Scheduler improvements.

(I'm afraid this is rather more sparse because I'm less familiar with
the scheduler details.)

At the moment, all of Xen's schedulers will happily put two vcpus from
different domains on sibling hyperthreads.  There has been a lot of
sidechannel research over the past decade demonstrating ways for one
thread to infer what is going on on the other, but L1TF is the first
vulnerability I'm aware of which allows one thread to directly read data
out of the other.

Either way, it is now definitely a bad thing to run different guests
concurrently on siblings.  Fixing this by simply not scheduling vcpus
from a different guest on siblings does result in a lower resource
utilisation, most notably when there are an odd number of runnable vcpus in
a domain, as the other thread is forced to idle.
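
As a sketch of the basic constraint (illustrative only, not a patch: it
leans on Xen's cpu_sibling_mask / curr_on_cpu() / is_idle_vcpu(), but a
real version belongs inside each scheduler's pcpu-selection logic):

    /* May vcpu v run on pcpu 'cpu' without putting vcpus of two
     * different domains onto sibling hyperthreads? */
    static bool sibling_safe(const struct vcpu *v, unsigned int cpu)
    {
        unsigned int sib;

        for_each_cpu ( sib, per_cpu(cpu_sibling_mask, cpu) )
        {
            const struct vcpu *cur = curr_on_cpu(sib);

            if ( sib == cpu || cur == NULL || is_idle_vcpu(cur) )
                continue;

            if ( cur->domain != v->domain )
                return false;
        }

        return true;
    }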

A step beyond this is core-aware scheduling, where we schedule in units
of a virtual core rather than a virtual thread.  This has much better
behaviour from the guest's point of view, as the actually-scheduled
topology remains consistent, but does potentially come with even lower
utilisation if every other thread in the guest is idle.

A side requirement for core-aware scheduling is for Xen to have an
accurate idea of the topology presented to the guest.  I need to dust
off my Toolstack CPUID/MSR improvement series and get that upstream.

One of the most insidious problems with L1TF is that, with
hyperthreading enabled, a malicious guest kernel can engineer arbitrary
data leakage by having one thread scanning the expected physical
address, and the other thread using an arbitrary cache-load gadget in
hypervisor context.  This occurs because the L1 data cache is shared by
threads.

A solution to this issue was proposed, whereby Xen synchronises siblings
on vmexit/entry, so we are never executing code in two different
privilege levels.  Getting this working would make it safe to continue
using hyperthreading even in the presence of L1TF.  Obviously, it's going
to come with a perf hit, but compared to disabling hyperthreading, all it's
got to do is beat a ~60% perf hit to make it the preferable option for
making your system L1TF-proof.
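
Very roughly, the rendezvous would look like the sketch below (not Xen
code: the per-core state and the force_sibling_vmexit() helper are
assumed, and a real implementation has to worry about IPIs, NMIs and
the races this sketch ignores):

    /* One instance per physical core. */
    struct core_rendezvous {
        atomic_t in_xen;    /* number of threads currently in Xen */
    };

    /* On vmexit: kick the sibling out of guest context, then wait for
     * it to arrive, so guest and hypervisor code never run
     * concurrently on the two threads of a core. */
    void rendezvous_enter(struct core_rendezvous *r)
    {
        atomic_inc(&r->in_xen);
        force_sibling_vmexit();                  /* assumed helper */
        while ( atomic_read(&r->in_xen) < 2 )
            cpu_relax();
    }

    /* Before vmentry: leave together again. */
    void rendezvous_exit(struct core_rendezvous *r)
    {
        atomic_dec(&r->in_xen);
        while ( atomic_read(&r->in_xen) > 0 )
            cpu_relax();
    }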

Anyway - enough of my rambling for now.  Thoughts?

~Andrew
Dario Faggioli
2018-10-19 08:09:30 UTC
Post by Andrew Cooper
Hello,
Hey,

This is very accurate and useful... thanks for it. :-)
Post by Andrew Cooper
1) A secrets-free hypervisor.
Basically every hypercall can be (ab)used by a guest, and used as an
arbitrary cache-load gadget. Logically, this is the first half of a
Spectre SP1 gadget, and is usually the first stepping stone to
exploiting one of the speculative sidechannels.
Short of compiling Xen with LLVM's Speculative Load Hardening (which is
still experimental, and comes with a ~30% perf hit in the common case),
this is unavoidable. Furthermore, throwing a few
array_index_nospec()
into the code isn't a viable solution to the problem.
An alternative option is to have less data mapped into Xen's virtual
address space - if a piece of memory isn't mapped, it can't be loaded
into the cache.
[...]
2) Scheduler improvements.
(I'm afraid this is rather more sparse because I'm less familiar with
the scheduler details.)
At the moment, all of Xen's schedulers will happily put two vcpus from
different domains on sibling hyperthreads. There has been a lot of
sidechannel research over the past decade demonstrating ways for one
thread to infer what is going on on the other, but L1TF is the first
vulnerability I'm aware of which allows one thread to directly read data
out of the other.
Either way, it is now definitely a bad thing to run different guests
concurrently on siblings.
Well, yes. But, as you say, L1TF, and I'd say TLBleed as well, are the
first serious issues discovered so far and, for instance, even on x86,
not all Intel CPUs and none of the AMD ones, AFAIK, are affected.

Therefore, although I certainly think we _must_ have the proper
scheduler enhancements in place (and in fact I'm working on that :-D)
it should IMO still be possible for the user to decide whether or not
to use them (either by opting-in or opting-out, I don't care much at
this stage).
Post by Andrew Cooper
Fixing this by simply not scheduling vcpus
from a different guest on siblings does result in a lower resource
utilisation, most notably when there are an odd number of runnable vcpus in
a domain, as the other thread is forced to idle.
Right.
Post by Andrew Cooper
A step beyond this is core-aware scheduling, where we schedule in units
of a virtual core rather than a virtual thread. This has much better
behaviour from the guest's point of view, as the actually-scheduled
topology remains consistent, but does potentially come with even lower
utilisation if every other thread in the guest is idle.
Yes, basically, what you describe as 'core-aware scheduling' here can
be built on top of what you had described above as 'not scheduling
vcpus from different guests'.

I mean, we can/should put ourselves in a position where the user can
choose if he/she wants:
- just 'plain scheduling', as we have now,
- "just" that only vcpus of the same domains are scheduled on siblings
hyperthread,
- full 'core-aware scheduling', i.e., only vcpus that the guest
actually sees as virtual hyperthread siblings are scheduled on
hardware hyperthread siblings.

About the performance impact: indeed, it's even higher with core-aware
scheduling. Something we can look at doing is acting on the
guest scheduler, e.g., telling it to try to "pack the load", and keep
siblings busy, instead of trying to avoid doing that (which is what
happens by default in most cases).

In Linux, this can be done by playing with the sched-flags (see, e.g.,
https://elixir.bootlin.com/linux/v4.18/source/include/linux/sched/topology.h#L20 ,
and /proc/sys/kernel/sched_domain/cpu*/domain*/flags ).

The idea would be to avoid, as much as possible, the case when "every
other thread is idle in the guest". I'm not sure about being able to do
something by default, but we can certainly document things (like "if
you enable core-scheduling, also do `echo 1234 > /proc/sys/.../flags'
in your Linux guests").

I haven't checked whether other OSs' schedulers have something similar.
Post by Andrew Cooper
A side requirement for core-aware scheduling is for Xen to have an
accurate idea of the topology presented to the guest. I need to dust
off my Toolstack CPUID/MSR improvement series and get that upstream.
Indeed. Without knowing which one of the guest's vcpus are to be
considered virtual hyperthread siblings, I can only get you as far as
"only scheduling vcpus of the same domain on sibling hyperthreads". :-)
Post by Andrew Cooper
One of the most insidious problems with L1TF is that, with
hyperthreading enabled, a malicious guest kernel can engineer arbitrary
data leakage by having one thread scanning the expected physical
address, and the other thread using an arbitrary cache-load gadget in
hypervisor context.  This occurs because the L1 data cache is shared by
threads.
Right. So, sorry if this is a stupid question, but how does this relate
to the "secret-free hypervisor", and to the "if a piece of memory
isn't mapped, it can't be loaded into the cache" idea?

So, basically, I'm asking whether I am understanding it correctly that
secret-free Xen + core-aware scheduling would *not* be enough for
mitigating L1TF properly (and if the answer is no, why... but only if
you have 5 mins to explain it to me :-P).

In fact, ISTR that core-scheduling, plus something that looked to me
similar enough to "secret-free Xen", is how Microsoft claims to be
mitigating L1TF on Hyper-V...
Post by Andrew Cooper
A solution to this issue was proposed, whereby Xen synchronises siblings
on vmexit/entry, so we are never executing code in two different
privilege levels.  Getting this working would make it safe to continue
using hyperthreading even in the presence of L1TF.
Err... ok, but we still want core-aware scheduling, or at least we want
to avoid having vcpus from different domains on siblings, don't we? In
order to avoid leaks between guests, I mean.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
Andrew Cooper
2018-10-19 12:17:11 UTC
Post by Dario Faggioli
Hey,
This is very accurate and useful... thanks for it. :-)
Post by Andrew Cooper
Either way, it is now definitely a bad thing to run different guests
concurrently on siblings.
Well, yes. But, as you say, L1TF, and I'd say TLBleed as well, are the
first serious issues discovered so far and, for instance, even on x86,
not all Intel CPUs and none of the AMD ones, AFAIK, are affected.
TLBleed is an excellent paper and associated research, but is still just
inference - a vast quantity of post-processing is required to extract
the key.

There are plenty of other sidechannels which affect all SMT
implementations, such as the effects of executing an mfence instruction,
execution unit contention, and so on.
Post by Dario Faggioli
Therefore, although I certainly think we _must_ have the proper
scheduler enhancements in place (and in fact I'm working on that :-D)
it should IMO still be possible for the user to decide whether or not
to use them (either by opting-in or opting-out, I don't care much at
this stage).
I'm not suggesting that we leave people without a choice, but given an
option which doesn't share siblings between different guests, it should
be the default.
Post by Dario Faggioli
Post by Andrew Cooper
One of the most insidious problems with L1TF is that, with
hyperthreading enabled, a malicious guest kernel can engineer arbitrary
data leakage by having one thread scanning the expected physical
address, and the other thread using an arbitrary cache-load gadget in
hypervisor context.  This occurs because the L1 data cache is shared by
threads.
Right. So, sorry if this is a stupid question, but how does this relate
to the "secret-free hypervisor", and to the "if a piece of memory
isn't mapped, it can't be loaded into the cache" idea?
So, basically, I'm asking whether I am understanding it correctly that
secret-free Xen + core-aware scheduling would *not* be enough for
mitigating L1TF properly (and if the answer is no, why... but only if
you have 5 mins to explain it to me :-P).
In fact, ISTR that core-scheduling, plus something that looked to me
similar enough to "secret-free Xen", is how Microsoft claims to be
mitigating L1TF on Hyper-V...
Correct - that is what Hyper-V appears to be doing.

It's best to consider the secret-free Xen and scheduler improvements as
orthogonal.  In particular, the secret-free Xen is defence in depth
against SP1, and the risk of future issues, but does have
non-speculative benefits as well.

That said, the only way to use HT and definitely be safe from L1TF without
a secret-free Xen is to have the synchronised entry/exit logic working.
Post by Dario Faggioli
Post by Andrew Cooper
A solution to this issue was proposed, whereby Xen synchronises siblings
on vmexit/entry, so we are never executing code in two different
privilege levels.  Getting this working would make it safe to continue
using hyperthreading even in the presence of L1TF.
Err... ok, but we still want core-aware scheduling, or at least we want
to avoid having vcpus from different domains on siblings, don't we? In
order to avoid leaks between guests, I mean.
Ideally, we'd want all of these.  I expect the only reasonable way to
develop them is one on top of another.

~Andrew
Mihai Donțu
2018-10-22 09:32:54 UTC
Post by Andrew Cooper
[...]
Post by Dario Faggioli
Therefore, although I certainly think we _must_ have the proper
scheduler enhancements in place (and in fact I'm working on that :-D)
it should IMO still be possible for the user to decide whether or not
to use them (either by opting-in or opting-out, I don't care much at
this stage).
I'm not suggesting that we leave people without a choice, but given an
option which doesn't share siblings between different guests, it should
be the default.
+1
Post by Andrew Cooper
[...]
It's best to consider the secret-free Xen and scheduler improvements as
orthogonal. In particular, the secret-free Xen is defence in depth
against SP1, and the risk of future issues, but does have
non-speculative benefits as well.
That said, the only way to use HT and definitely be safe from L1TF without
a secret-free Xen is to have the synchronised entry/exit logic working.
Post by Dario Faggioli
Post by Andrew Cooper
A solution to this issue was proposed, whereby Xen synchronises
siblings on vmexit/entry, so we are never executing code in two different
privilege levels. Getting this working would make it safe to
continue using hyperthreading even in the presence of L1TF.
Err... ok, but we still want core-aware scheduling, or at least we want
to avoid having vcpus from different domains on siblings, don't we? In
order to avoid leaks between guests, I mean.
Ideally, we'd want all of these. I expect the only reasonable way to
develop them is one on top of another.
If there was a vote, I'd place the scheduler changes at the top.
--
Mihai Donțu
Wei Liu
2018-10-22 14:55:34 UTC
Post by Andrew Cooper
An easy first step here is to remove Xen's directmap, which will mean
that guests' general RAM isn't mapped by default into Xen's address
space.  This will come with some performance hit, as the
map_domain_page() infrastructure will now have to actually
create/destroy mappings, but removing the directmap will cause an
improvement for non-speculative security as well (No possibility of
ret2dir as an exploit technique).
I have looked into making the "separate xenheap domheap with partial
direct map" mode (see common/page_alloc.c) work, but found it not as
straightforward as it should've been.

Before I spend more time on this, I would like some opinions on whether
there is some other approach which might be more useful than that mode.
Post by Andrew Cooper
Beyond the directmap, there are plenty of other interesting secrets in
the Xen heap and other mappings, such as the stacks of the other pcpus. 
Fixing this requires moving Xen to having a non-uniform memory layout,
and this is much harder to change.  I already experimented with this as
a meltdown mitigation around about a year ago, and posted the resulting
series on Jan 4th,
https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00274.html,
some trivial bits of which have already found their way upstream.
To have a non-uniform memory layout, Xen may not share L4 pagetables. 
i.e. Xen must never have two pcpus which reference the same pagetable in
%cr3.
This property already holds for 32bit PV guests, and all HVM guests, but
64bit PV guests are the sticking point.  Because Linux has a flat memory
layout, when a 64bit PV guest schedules two threads from the same
process on separate vcpus, those two vcpus have the same virtual %cr3,
and currently, Xen programs the same real %cr3 into hardware.
Which bit of Linux code are you referring to? If you remember it off the
top of your head, it would save me some time digging around. If not,
never mind, I can look it up myself.
Post by Andrew Cooper
* Fix Linux to have the same non-uniform layout that Xen wants
(Backwards compatibility for older 64bit PV guests can be achieved with
xen-shim).
* Use the XPTI algorithm (specifically, the pagetable sync/copy part)
forevermore.
Option 2 isn't great (especially for perf on fixed hardware), but does
keep all the necessary changes in Xen.  Option 1 looks to be the better
option longterm.
What is the problem with 1+2 at the same time? I think XPTI can be
enabled / disabled on a per-guest basis?

Wei.
Woodhouse, David
2018-10-22 15:09:09 UTC
Adding Stefan to Cc.

Should we take this to the spexen or another mailing list?
Andrew Cooper
2018-10-22 15:14:02 UTC
Post by Woodhouse, David
Adding Stefan to Cc.
Should we take this to the spexen or another mailing list?
Now that L1TF is public, so is all of this.  I see no reason to continue
it in private.

~Andrew
Jan Beulich
2018-10-25 14:50:12 UTC
Post by Wei Liu
I have looked into making the "separate xenheap domheap with partial
direct map" mode (see common/page_alloc.c) work but found it not as
straight forward as it should've been.
Before I spend more time on this, I would like some opinions on if there
is other approach which might be more useful than that mode.
How would such a split heap model help with L1TF, where the
guest specifies host physical addresses in its vulnerable page
table entries (and hence could spy at xenheap but - due to not
being mapped - not domheap)?

Jan
George Dunlap
2018-10-25 14:56:50 UTC
Post by Jan Beulich
How would such a split heap model help with L1TF, where the
guest specifies host physical addresses in its vulnerable page
table entries
I don't think it would.
Post by Jan Beulich
(and hence could spy at xenheap but - due to not
being mapped - not domheap)?
Er, didn't follow this bit -- if L1TF is related to host physical
addresses, how does having a virtual mapping in Xen affect things in any
way?

-George
Jan Beulich
2018-10-25 15:02:04 UTC
Post by George Dunlap
Post by Jan Beulich
(and hence could spy at xenheap but - due to not
being mapped - not domheap)?
Er, didn't follow this bit -- if L1TF is related to host physical
addresses, how does having a virtual mapping in Xen affect things in any
way?
Hmm, indeed. Scratch that part.

Jan
Andrew Cooper
2018-10-25 16:29:15 UTC
Post by Jan Beulich
Post by George Dunlap
Post by Jan Beulich
How would such a split heap model help with L1TF, where the
guest specifies host physical addresses in its vulnerable page
table entries
I don't think it would.
Post by Jan Beulich
(and hence could spy at xenheap but - due to not
being mapped - not domheap)?
Er, didn't follow this bit -- if L1TF is related to host physical
addresses, how does having a virtual mapping in Xen affect things in any
way?
Hmm, indeed. Scratch that part.
There seems to be quite a bit of confusion in these replies.

To exploit L1TF, the data in question has to be present in the L1 cache
when the attack is performed.

In practice, an attacker has to arrange for target data to be resident
in the L1 cache.  One way it can do this when HT is enabled is via a
cache-load gadget such as the first half of an SP1 attack on the other
hyperthread.  A different mechanism is to try and cause Xen to
speculatively access a piece of data, and have the hardware prefetcher
bring it into the cache.

Everything which is virtually mapped in Xen is potentially vulnerable,
and the goal of the "secret-free Xen" is to make sure that in the
context of one vcpu pulling off an attack like this, there is no
interesting data which can be exfiltrated.

A single xenheap model means that everything allocated with
alloc_xenheap_page() (e.g. struct domain, struct vcpu, pcpu stacks) are
potentially exposed to all domains.

A split xenheap model means that data pertaining to other guests isn't
mapped in the context of this vcpu, so cannot be brought into the cache.

~Andrew
George Dunlap
2018-10-25 16:43:24 UTC
Post by Andrew Cooper
There seems to be quite a bit of confusion in these replies.
To exploit L1TF, the data in question has to be present in the L1 cache
when the attack is performed.
In practice, an attacker has to arrange for target data to be resident
in the L1 cache.  One way it can do this when HT is enabled is via a
cache-load gadget such as the first half of an SP1 attack on the other
hyperthread.  A different mechanism is to try and cause Xen to
speculatively access a piece of data, and have the hardware prefetcher
bring it into the cache.
Right -- so a split xen/domheap model doesn't prevent L1TF attacks, but
it does make L1TF much harder to pull off, because it now only works if
you can manage to get onto the same core as the victim, after the victim
has accessed the data you want.

So it would reduce the risk of L1TF significantly, but not enough (I
think) that we could recommend disabling other mitigations.

-George
Andrew Cooper
2018-10-25 16:50:03 UTC
Post by George Dunlap
Right -- so a split xen/domheap model doesn't prevent L1TF attacks, but
it does make L1TF much harder to pull off, because it now only works if
you can manage to get onto the same core as the victim, after the victim
has accessed the data you want.
So it would reduce the risk of L1TF significantly, but not enough (I
think) that we could recommend disabling other mitigations.
Correct.  All of these suggestions are for increased defence in depth. 
They are not replacements for the existing mitigations.

From a practical point of view, until people work out how to
comprehensively solve SP1, reducing the quantity of mapped data is the
only practical defence that an OS/Hypervisor has.

~Andrew
George Dunlap
2018-10-25 17:07:30 UTC
Post by Andrew Cooper
Correct.  All of these suggestions are for increased defence in depth. 
They are not replacements for the existing mitigations.
But it could be a mitigation for, say, Meltdown, yes? I'm trying to
remember the details; but wouldn't a "secret-free Xen" mean that
disabling XPTI entirely for 64-bit PV guests would be a reasonable
decision (even if many people left it enabled 'just in case')?

-George
Jan Beulich
2018-10-26 09:16:15 UTC
Post by Andrew Cooper
A split xenheap model means that data pertaining to other guests isn't
mapped in the context of this vcpu, so cannot be brought into the cache.
It was not clear to me from Wei's original mail that talk here is
about "split" in a sense of "per-domain"; I was assuming the
CONFIG_SEPARATE_XENHEAP mode instead.

Jan
Wei Liu
2018-10-26 09:28:20 UTC
Post by Jan Beulich
Post by Andrew Cooper
A split xenheap model means that data pertaining to other guests isn't
mapped in the context of this vcpu, so cannot be brought into the cache.
It was not clear to me from Wei's original mail that talk here is
about "split" in a sense of "per-domain"; I was assuming the
CONFIG_SEPARATE_XENHEAP mode instead.
The split heap was indeed referring to CONFIG_SEPARATE_XENHEAP mode, yet
what I wanted most is the partial direct map, which reduces the amount
of data mapped inside Xen context -- the original idea was removing the
direct map, discussed during one of the calls IIRC. I thought making the
partial direct map mode work, and making it as small as possible, will
get us 90% there.

The "per-domain" heap is a different work item.

Wei.
Jan Beulich
2018-10-26 09:56:58 UTC
Post by Wei Liu
The split heap was indeed referring to CONFIG_SEPARATE_XENHEAP mode, yet
what I wanted most is the partial direct map, which reduces the amount
of data mapped inside Xen context -- the original idea was removing the
direct map, discussed during one of the calls IIRC. I thought making the
partial direct map mode work, and making it as small as possible, will
get us 90% there.
The "per-domain" heap is a different work item.
But if we mean to go that route, going (back) to the separate
Xen heap model seems just like an extra complication to me.
Yet I agree that this would remove the need for a fair chunk of
the direct map. Otoh a statically partitioned Xen heap would
bring back scalability issues which we had specifically meant to
get rid of by moving away from that model.

Jan
George Dunlap
2018-10-26 10:51:13 UTC
Post by Jan Beulich
But if we mean to go that route, going (back) to the separate
Xen heap model seems just like an extra complication to me.
Yet I agree that this would remove the need for a fair chunk of
the direct map. Otoh a statically partitioned Xen heap would
bring back scalability issues which we had specifically meant to
get rid of by moving away from that model.
I think turning SEPARATE_XENHEAP back on would just be the first step.
We definitely would then need to sort things out so that it's scalable
again.

After system set-up, the key difference between xenheap and domheap
pages is that xenheap pages are assumed to be always mapped (i.e., you
can keep a pointer to them and it will be valid), whereas domheap pages
cannot be assumed to be mapped, and need to be wrapped with
[un]map_domain_page().

The basic solution involves having a xenheap virtual address mapping
area not tied to the physical layout of the memory. domheap and xenheap
memory would have to come from the same pool, but xenheap would need to
be mapped into the xenheap virtual memory region before being returned.

-George
Jan Beulich
2018-10-26 11:20:47 UTC
Post by George Dunlap
I think turning SEPARATE_XENHEAP back on would just be the first step.
We definitely would then need to sort things out so that it's scalable
again.
After system set-up, the key difference between xenheap and domheap
pages is that xenheap pages are assumed to be always mapped (i.e., you
can keep a pointer to them and it will be valid), whereas domheap pages
cannot be assumed to be mapped, and need to be wrapped with
[un]map_domain_page().
The basic solution involves having a xenheap virtual address mapping
area not tied to the physical layout of the memory. domheap and xenheap
memory would have to come from the same pool, but xenheap would need to
be mapped into the xenheap virtual memory region before being returned.
Wouldn't this most easily be done by making alloc_xenheap_pages()
call alloc_domheap_pages() and then vmap() the result? Of course
we may need to grow the vmap area in that case.
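
Roughly the following, then (a sketch of that suggestion only: it
ignores freeing, xenheap/domheap memflags differences and vmap-space
sizing, and the on-stack mfn array is purely for brevity):

    void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
    {
        struct page_info *pg = alloc_domheap_pages(NULL, order, memflags);
        mfn_t mfns[1u << order];
        unsigned int i;

        if ( pg == NULL )
            return NULL;

        for ( i = 0; i < (1u << order); i++ )
            mfns[i] = mfn_add(page_to_mfn(pg), i);

        /* Map into the vmap region instead of relying on the directmap. */
        return vmap(mfns, 1u << order);
    }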

Jan
George Dunlap
2018-10-26 11:24:43 UTC
Post by Jan Beulich
Wouldn't this most easily be done by making alloc_xenheap_pages()
call alloc_domheap_pages() and then vmap() the result? Of course
we may need to grow the vmap area in that case.
I couldn't answer that question without a lot more digging. :-) I'd
always assumed that the original reason for having the xenheap
direct-mapped on 32-bit was something to do with early-boot allocation;
if there is something tricky there, we'd need to special-case the
early-boot allocation somehow.

-George
Jan Beulich
2018-10-26 11:33:35 UTC
Post by George Dunlap
Post by Jan Beulich
Post by George Dunlap
The basic solution involves having a xenheap virtual address mapping
area not tied to the physical layout of the memory. domheap and xenheap
memory would have to come from the same pool, but xenheap would need to
be mapped into the xenheap virtual memory region before being returned.
Wouldn't this most easily be done by making alloc_xenheap_pages()
call alloc_domheap_pages() and then vmap() the result? Of course
we may need to grow the vmap area in that case.
I couldn't answer that question without a lot more digging. :-) I'd
always assumed that the original reason for having the
xenheap direct-mapped on 32-bit was something to do with early-boot
allocation; if there is something tricky there, we'd need to
special-case the early-boot allocation somehow.
The reason for the split on 32-bit was simply the lack of sufficient
VA space.

Jan
George Dunlap
2018-10-26 11:43:32 UTC
Post by Jan Beulich
Post by George Dunlap
I couldn't answer that question without a lot more digging. :-) I'd
always assumed that the original reason for having the
xenheap direct-mapped on 32-bit was something to do with early-boot
allocation; if there is something tricky there, we'd need to
special-case the early-boot allocation somehow.
The reason for the split on 32-bit was simply the lack of sufficient
VA space.
That tells me why the domheap was *not* direct-mapped; but it doesn't
tell me why the xenheap *was*. Was it perhaps just something that
evolved from what we inherited from Linux?

-George
Jan Beulich
2018-10-26 11:45:49 UTC
Post by George Dunlap
Post by Jan Beulich
The reason for the split on 32-bit was simply the lack of sufficient
VA space.
That tells me why the domheap was *not* direct-mapped; but it doesn't
tell me why the xenheap *was*. Was it perhaps just something that
evolved from what we inherited from Linux?
Presumably, but there I'm really the wrong one to ask. When I joined,
things had long been that way.

Jan
Tamas K Lengyel
2018-10-24 15:24:32 UTC
Post by Andrew Cooper
A solution to this issue was proposed, whereby Xen synchronises siblings
on vmexit/entry, so we are never executing code in two different
privilege levels. Getting this working would make it safe to continue
using hyperthreading even in the presence of L1TF. Obviously, it's going
to come with a perf hit, but compared to disabling hyperthreading, all it's
got to do is beat a ~60% perf hit to make it the preferable option for
making your system L1TF-proof.
Could you shed some light on what tests were done where that 60%
performance hit was observed? We have performed intensive stress-tests
to confirm this, but according to our findings turning off
hyper-threading is actually improving performance on all machines we
tested thus far.

Thanks,
Tamas
Dario Faggioli
2018-10-25 16:01:50 UTC
Post by Tamas K Lengyel
Post by Andrew Cooper
A solution to this issue was proposed, whereby Xen synchronises siblings
on vmexit/entry, so we are never executing code in two different
privilege levels. Getting this working would make it safe to continue
using hyperthreading even in the presence of L1TF. Obviously, it's going
to come with a perf hit, but compared to disabling hyperthreading, all it's
got to do is beat a ~60% perf hit to make it the preferable option for
making your system L1TF-proof.
Could you shed some light on what tests were done where that 60%
performance hit was observed?
I don't have any data handy right now, but I have certainly seen
hyperthreading being beneficial for performance in more than a few
benchmarks and workloads. How much so, this indeed varies *a lot* both
with the platform and with the workload itself.

That being said, I agree it would be good to have as much data as
possible. I'll try to do something about that.
Post by Tamas K Lengyel
We have performed intensive stress-tests
to confirm this but according to our findings turning off
hyper-threading is actually improving performance on all machines we
tested thus far.
Which is indeed very interesting. But, as we're discussing in the other
thread, I would, in your case, do some more measurements, varying the
configuration of the system, in order to be absolutely sure you are not
hitting some bug or anomaly.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
Tamas K Lengyel
2018-10-25 16:25:28 UTC
Post by Dario Faggioli
Post by Tamas K Lengyel
Could you shed some light what tests were done where that 60%
performance hit was observed?
I don't have any data handy right now, but I have certainly seen
hyperthreading being beneficial for performance in more than a few
benchmarks and workloads. How much so, this indeed varies *a lot* both
with the platform and with the workload itself.
That being said, I agree it would be good to have as much data as
possible. I'll try to do something about that.
Post by Tamas K Lengyel
We have performed intensive stress-tests
to confirm this but according to our findings turning off
hyper-threading is actually improving performance on all machines we
tested thus far.
Which is indeed very interesting. But, as we're discussing in the other
thread, I would, in your case, do some more measurements, varying the
configuration of the system, in order to be absolutely sure you are not
hitting some bug or anomaly.
Sure, I would be happy to repeat tests that were done in the past to
see whether they still hold. We have run this test with Xen 4.10, 4.11
and 4.12-unstable on laptops and desktops, using credit1 and credit2,
and it is consistent that hyperthreading yields the worst performance.
It varies between platforms, but it's around a 10-40% performance hit
with hyperthreading on. The test we run is very CPU intensive and
heavily oversubscribes the system. But I don't think it would be all
that unusual to run into such a setup in the real world from time to
time.
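
For concreteness, the shape of our stressor is roughly the sketch
below (illustrative only; the real harness, guest counts, vCPU counts
and durations differ): each guest spins every vCPU on pure ALU work,
so the total number of runnable vCPUs far exceeds the host's physical
cores.

    # Illustrative CPU-bound stressor, run inside each guest; all names
    # and numbers here are hypothetical, not our actual test harness.
    import multiprocessing
    import time

    def burn(seconds):
        # Spin on cheap integer work for the given wall-clock time.
        deadline = time.time() + seconds
        x = 0
        while time.time() < deadline:
            x = (x * 1103515245 + 12345) & 0xFFFFFFFF

    if __name__ == "__main__":
        # One spinning process per vCPU the guest sees; with several
        # such guests, runnable vCPUs greatly outnumber physical cores.
        procs = [multiprocessing.Process(target=burn, args=(60,))
                 for _ in range(multiprocessing.cpu_count())]
        for p in procs:
            p.start()
        for p in procs:
            p.join()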

Tamas
Dario Faggioli
2018-10-25 17:23:36 UTC
Permalink
Post by Tamas K Lengyel
Sure, I would be happy to repeat tests that were done in the past to
see whether they still hold. We have run this test with Xen 4.10, 4.11
and 4.12-unstable on laptops and desktops, using credit1 and credit2,
and it is consistent that hyperthreading yields the worst performance.
So, just to be clear, I'm not saying it's impossible to find a workload
for which HT is detrimental. Quite the opposite. And these benchmarks
you're running might well fall into that category.

I'm just suggesting to double check that. :-)
Post by Tamas K Lengyel
It varies between platforms, but it's around a 10-40% performance hit
with hyperthreading on. The test we run is very CPU intensive and
heavily oversubscribes the system. But I don't think it would be all
that unusual to run into such a setup in the real world from time to
time.
Ah, ok, so you're _heavily_ oversubscribing...

So, I don't think that a heavily oversubscribed host, where all vCPUs
want to run 100% CPU-intensive activities --and this not being some
transient situation-- is that common. And for the ones for which it
is, there is not much we can do, hyperthreading or not.

In any case, hyperthreading works best when the workload is mixed,
where it helps make sure that IO-bound tasks have enough chances to
issue a lot of IO requests, without conflicting too much with the
CPU-bound tasks doing their number/logic crunching.

Having _everyone_ wanting to do actual stuff on the CPUs is, IMO, one
of the worst workloads for hyperthreading, and it is in fact a workload
where I've always seen it having the least beneficial effect on
performance. I guess it's possible that, in your case, it's actually
really doing more harm than good.

It's an interesting data point, but I wouldn't use a workload like that
to measure the benefit, or the impact, of an SMT related change.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
Tamas K Lengyel
2018-10-25 17:29:08 UTC
Permalink
Post by Dario Faggioli
Having _everyone_ wanting to do actual stuff on the CPUs is, IMO, one
of the worst workloads for hyperthreading, and it is in fact a workload
where I've always seen it having the least beneficial effect on
performance. I guess it's possible that, in your case, it's actually
really doing more harm than good.
It's an interesting data point, but I wouldn't use a workload like that
to measure the benefit, or the impact, of an SMT related change.
Thanks, and indeed this test is the worst-case scenario for
hyperthreading; that was our goal. While a typical workload may not
be similar, it is a possible one for the system we are concerned
about. So if at any given time the benefit of hyperthreading ranges
between, say, +30% and -30%, and we can't predict the workload or
optimize for it, it looks like a safe bet to just disable
hyperthreading. Would you agree?

Tamas
Dario Faggioli
2018-10-26 07:31:12 UTC
Permalink
Post by Tamas K Lengyel
Thanks, and indeed this test is the worst-case scenario for
hyperthreading; that was our goal. While a typical workload may not
be similar, it is a possible one for the system we are concerned
about.
Sure, and that is fine. But at the same time, it has little, if
anything, to do with speculative execution, L1TF and coscheduling.
It's just that, with this workload, hyperthreading is bad, and there's
not much more to say.
Post by Tamas K Lengyel
So if at any given time the benefit of hyperthreading ranges
between, say, +30% and -30%, and we can't predict the workload or
optimize for it, it looks like a safe bet to just disable
hyperthreading. Would you agree?
That's, AFAICR, OpenBSD's take, back when TLBleed came out. But, no,
I don't really agree. Not entirely, at least.

The way I see it, is that there are special workloads where SMT gives,
say, -30%, and those should just disable it, and be done.

For others, it's perfectly fine to keep it on, and we should, ideally,
find a solution to the security issues it introduces, without
nullifying the performance benefit it introduces.

And when it comes to judging how good, or bad, such solutions are, we
should consider both the best and the worst case scenarios; and I'd
say that the best case scenario is more important since, for the worst
case, one could just disable SMT, as said above.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
Andrew Cooper
2018-10-25 16:55:42 UTC
Permalink
Post by Tamas K Lengyel
Could you shed some light on what tests were done where that 60%
performance hit was observed? We have performed intensive stress tests
to confirm this, but according to our findings, turning off
hyper-threading actually improves performance on all machines we have
tested thus far.
Aggregate inter- and intra-host disk and network throughput, which is
a reasonable approximation of a load of webserver VMs on a single
physical server.  Small packet IO was hit worst, as it has a very high
vcpu context switch rate between dom0 and domU.  Disabling HT means you
have half the number of logical cores to schedule on, which doubles the
mean time to next timeslice.
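
As a back-of-the-envelope illustration of that scheduling effect (a
crude round-robin model, not a simulation; the 64-vcpu load and the
16-core box are assumptions, while 30ms is credit1's default
timeslice):

    # Crude model: N runnable vcpus sharing C logical CPUs round-robin.
    # Mean time until a given vcpu next runs scales as (N / C) * slice,
    # so halving the logical CPUs (HT off) doubles it.
    TIMESLICE_MS = 30       # credit1's default timeslice
    RUNNABLE_VCPUS = 64     # assumed load, for illustration only

    def mean_wait_ms(logical_cpus):
        return RUNNABLE_VCPUS / logical_cpus * TIMESLICE_MS

    print(mean_wait_ms(32))  # HT on,  16 cores * 2 threads:  60.0ms
    print(mean_wait_ms(16))  # HT off, 16 cores:             120.0ms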

In principle, for a fully optimised workload, HT gets you ~30% extra due
to increased utilisation of the pipeline functional units.  Some
resources are statically partitioned, while some are competitively
shared, and it's now been well proven that actions on one thread can have
a large effect on others.

Two arbitrary vcpus are not an optimised workload.  If the perf
improvement you get from not competing in the pipeline is greater than
the perf loss from Xen's reduced capability to schedule, then disabling
HT would be an improvement.  I can certainly believe that this might be
the case for Qubes style workloads where you are probably not very
overprovisioned, and you probably don't have long running IO and CPU
bound tasks in the VMs.

~Andrew
George Dunlap
2018-10-25 17:01:55 UTC
Permalink
Post by Andrew Cooper
In principle, for a fully optimised workload, HT gets you ~30% extra due
to increased utilisation of the pipeline functional units.  Some
resources are statically partitioned, while some are competitively
shared, and it's now been well proven that actions on one thread can have
a large effect on others.
Two arbitrary vcpus are not an optimised workload.  If the perf
improvement you get from not competing in the pipeline is greater than
the perf loss from Xen's reduced capability to schedule, then disabling
HT would be an improvement.  I can certainly believe that this might be
the case for Qubes style workloads where you are probably not very
overprovisioned, and you probably don't have long running IO and CPU
bound tasks in the VMs.
As another data point, I think it was MSCI who said they always disabled
hyperthreading, because they also found that their workloads ran slower
with HT than without. Presumably they were doing massive number
crunching, such that each thread was waiting on the ALU a significant
portion of the time anyway; at which point the superscalar scheduling
and/or reduction in cache efficiency would have brought performance from
"no benefit" down to "negative benefit".

-George
Tamas K Lengyel
2018-10-25 17:35:18 UTC
Permalink
Post by George Dunlap
As another data point, I think it was MSCI who said they always disabled
hyperthreading, because they also found that their workloads ran slower
with HT than without. Presumably they were doing massive number
crunching, such that each thread was waiting on the ALU a significant
portion of the time anyway; at which point the superscalar scheduling
and/or reduction in cache efficiency would have brought performance from
"no benefit" down to "negative benefit".
Thanks for the insights. Indeed, we are primarily concerned with the
performance of Qubes-style workloads, which may range from no
oversubscription to heavily oversubscribed. It's not a workload we
can predict or optimize beforehand, so we are looking for a default
that would be 1) safe and 2) performant in the most general case
possible.

Tamas
Andrew Cooper
2018-10-25 17:43:00 UTC
Permalink
Post by Tamas K Lengyel
Thanks for the insights. Indeed, we are primarily concerned with the
performance of Qubes-style workloads, which may range from no
oversubscription to heavily oversubscribed. It's not a workload we
can predict or optimize beforehand, so we are looking for a default
that would be 1) safe and 2) performant in the most general case
possible.
So long as you've got the XSA-273 patches, you should be able to park
and reactivate hyperthreads using `xen-hptool cpu-{online,offline} $CPU`.

You should be able to effectively change the hyperthreading
configuration at runtime.  It's not quite the same as changing it in
the BIOS, but as far as competition for pipeline resources goes, it
should be good enough.
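
As a rough sketch of parking one thread per core from dom0
(assumptions: `xl info -n` prints a cpu_topology table with
`cpu: core socket node` rows, which varies by version, so check the
parsing on your system; xen-hptool comes from the Xen tools build):

    # Offline every pcpu whose (socket, core) pair has already been
    # seen, i.e. park all secondary hyperthreads. The parsing of
    # `xl info -n` output below is an assumption about its format.
    import re
    import subprocess

    out = subprocess.check_output(["xl", "info", "-n"], text=True)
    seen = set()
    for line in out.splitlines():
        m = re.match(r"\s*(\d+):\s+(\d+)\s+(\d+)\s+(\d+)\s*$", line)
        if not m:
            continue
        cpu, core, socket = int(m.group(1)), int(m.group(2)), int(m.group(3))
        if (socket, core) in seen:
            subprocess.run(["xen-hptool", "cpu-offline", str(cpu)],
                           check=True)
        else:
            seen.add((socket, core))

The reverse experiment just onlines the same set of pcpus again with
`xen-hptool cpu-online`.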

~Andrew
Tamas K Lengyel
2018-10-25 17:58:05 UTC
Permalink
On Thu, Oct 25, 2018 at 11:43 AM Andrew Cooper
Post by Andrew Cooper
So long as you've got the XSA-273 patches, you should be able to park
and reactivate hyperthreads using `xen-hptool cpu-{online,offline} $CPU`.
You should be able to effectively change the hyperthreading
configuration at runtime. It's not quite the same as changing it in
the BIOS, but as far as competition for pipeline resources goes, it
should be good enough.
Thanks, indeed that is a handy tool to have. We often can't disable
hyperthreading in the BIOS anyway, because most BIOSes don't allow you
to do that when TXT is used. That said, with this tool we still need
some way to determine when to park/reactivate hyperthreads. We could
certainly park hyperthreads when we see the system being oversubscribed
in terms of the number of active vCPUs, but for real optimization we
would have to understand the workloads running within the VMs, if I
understand correctly?

Tamas
Andrew Cooper
2018-10-25 18:13:01 UTC
Permalink
Post by Tamas K Lengyel
Thanks, indeed that is a handy tool to have. We often can't disable
hyperthreading in the BIOS anyway, because most BIOSes don't allow you
to do that when TXT is used.
Hmm - that's an odd restriction.  I don't immediately see why such a
restriction would be necessary.
Post by Tamas K Lengyel
That said, with this tool we still need some way to determine when to
park/reactivate hyperthreads. We could certainly park hyperthreads when
we see the system being oversubscribed in terms of the number of active
vCPUs, but for real optimization we would have to understand the
workloads running within the VMs, if I understand correctly?
TBH, I'd perhaps start with an admin control which lets them switch
between the two modes, and some instructions on how/why they might want
to try switching.

Trying to second-guess the best HT setting automatically is most likely
going to be a lost cause.  It will be system-specific whether the
same workload is better with or without HT.

~Andrew
Tamas K Lengyel
2018-10-25 18:35:51 UTC
Permalink
On Thu, Oct 25, 2018 at 12:13 PM Andrew Cooper
TBH, I'd perhaps start with an admin control which lets them switch
between the two modes, and some instructions on how/why they might want
to try switching.
Trying to second-guess the best HT setting automatically is most likely
going to be a lost cause. It will be system-specific whether the
same workload is better with or without HT.
In the end this may just not be practically possible, as the system
administrator may have no idea what workload will be running on any
given system. It may also vary from one user to the next on the same
system, without the users being allowed to tune such details of the
system. If we can show that, with core-scheduling deployed,
performance improves by x% for most workloads, it may be a safe
option. But if every system needs to be tuned and evaluated in terms
of its eventual workload, that task becomes problematic. I appreciate
the insights though!

Tamas
Andrew Cooper
2018-10-25 18:39:52 UTC
Permalink
Post by Tamas K Lengyel
In the end this may just not be practically possible, as the system
administrator may have no idea what workload will be running on any
given system. It may also vary from one user to the next on the same
system, without the users being allowed to tune such details of the
system. If we can show that, with core-scheduling deployed,
performance improves by x% for most workloads, it may be a safe
option. But if every system needs to be tuned and evaluated in terms
of its eventual workload, that task becomes problematic. I appreciate
the insights though!
To a first approximation, a superuser knob to "switch between single
and dual threaded mode" can be used by people to experiment with which
is faster overall.

If it really is the case that disabling HT makes things faster, then
you've suddenly gained (almost-)core scheduling "for free" alongside
that perf improvement.

~Andrew
Dario Faggioli
2018-10-26 07:49:15 UTC
Permalink
Post by Tamas K Lengyel
In the end this may just not be practically possible, as the system
administrator may have no idea what workload will be running on any
given system. It may also vary from one user to the next on the same
system, without the users being allowed to tune such details of the
system. If we can show that, with core-scheduling deployed,
performance improves by x% for most workloads, it may be a safe option.
I haven't done this kind of benchmark yet, but I'd say that, if every
vCPU of every domain is doing 100% CPU intensive work, core-scheduling
isn't going to make much difference, or help you much, as compared to
regular scheduling with hyperthreading enabled.

Actual numbers may vary depending on whether VMs have an odd or even
number of vCPUs but, e.g., on hardware with 2 threads per core, and
using VMs with at least 2 vCPUs each, a _perfect_ implementation of
core-scheduling would still manage to keep all the *threads* busy,
which is --as far as our speculations currently go-- what is causing
the performance degradation you're seeing.

So, again, if it is confirmed that this workload of yours is a
particularly bad one for SMT, then you are just better off disabling
hyperthreading. And, no, I don't think such a situation is common
enough to say "let's disable it for everyone by default".
Post by Tamas K Lengyel
But
if every system needs to be tuned and evaluated in terms of its
eventual workload, that task becomes problematic.
So, the scheduler has a notion of system load (at least, Credit2
does), and it is in theory possible to put together some heuristics
for basically no longer using hyperthreading under certain conditions.
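
Purely as an illustration of the shape such a heuristic could take
(done here as a dom0 loop rather than inside the scheduler; the core
count, the sibling enumeration, the vCPU-counting proxy and the poll
interval are all made-up assumptions):

    # Toy heuristic: park sibling threads when running domains' vCPUs
    # outnumber physical cores, un-park them when the pressure drops.
    # PHYS_CORES, SIBLINGS and the `xl list` proxy are assumptions.
    import subprocess
    import time

    PHYS_CORES = 16
    SIBLINGS = range(PHYS_CORES, 2 * PHYS_CORES)  # assumed layout

    def total_vcpus():
        # Crude load proxy: sum the VCPUs column of `xl list`.
        rows = subprocess.check_output(["xl", "list"],
                                       text=True).splitlines()[1:]
        return sum(int(r.split()[3]) for r in rows if r.strip())

    parked = False
    while True:
        busy = total_vcpus() > PHYS_CORES
        if busy != parked:
            cmd = "cpu-offline" if busy else "cpu-online"
            for c in SIBLINGS:
                subprocess.run(["xen-hptool", cmd, str(c)])
            parked = busy
        time.sleep(10)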

This, however, I see as something completely orthogonal to
security-related considerations and to core-scheduling.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
Tamas K Lengyel
2018-10-26 12:01:46 UTC
Permalink
Post by Dario Faggioli
I haven't done this kind of benchmark yet, but I'd say that, if every
vCPU of every domain is doing 100% CPU intensive work, core-scheduling
isn't going to make much difference, or help you much, as compared to
regular scheduling with hyperthreading enabled.
Understood; we actually went into this with the assumption that in such
cases core-scheduling would underperform plain credit1. The idea was to
measure the worst case with plain scheduling and with core-scheduling,
to be able to see the difference between the two clearly.
Post by Dario Faggioli
So, again, if it is confirmed that this workload of yours is a
particularly bad one for SMT, then you are just better off disabling
hyperthreading. And, no, I don't think such a situation is common
enough to say "let's disable it for everyone by default".
I wasn't asking to make it the default in Xen, but whether it would be
reasonable to make it the default for our deployment, where such
workloads are entirely possible. Again, we don't know the workload and
we can't predict it. We were hoping to use core-scheduling eventually,
but it was not expected that hyperthreading could cause such drops in
performance. If there are tests that I can run which are the "best
case" for hyperthreading, I would like to repeat those tests to see
where we are.

Thanks,
Tamas
Dario Faggioli
2018-10-26 14:17:39 UTC
Permalink
Post by Tamas K Lengyel
Post by Dario Faggioli
I haven't done this kind of benchmark yet, but I'd say that, if every
vCPU of every domain is doing 100% CPU intensive work, core-scheduling
isn't going to make much difference, or help you much, as compared to
regular scheduling with hyperthreading enabled.
Understood; we actually went into this with the assumption that in
such cases core-scheduling would underperform plain credit1.
Which may actually happen. Or it might improve things a little, because
there are higher chances that a core only has 1 thread busy. But then
we're not really benchmarking core-scheduling vs. plain-scheduling,
we're benchmarking a side-effect of core-scheduling, which is not
equally interesting.
Post by Tamas K Lengyel
The idea was to measure the worst case with plain scheduling and with
core-scheduling, to be able to see the difference between the two
clearly.
For the sake of benchmarking core-scheduling solutions, we should put
ourselves in a position where what we measure is actually its own
impact, and I don't think this very workload puts us there.

Then, of course, if this workload is relevant to you, you have every
right to benchmark and evaluate it, and we're always interested in
hearing what you find out. :-)
Post by Tamas K Lengyel
Post by Dario Faggioli
Actual numbers may vary depending on whether VMs have an odd or even
number of vCPUs but, e.g., on hardware with 2 threads per core, and
using VMs with at least 2 vCPUs each, a _perfect_ implementation of
core-scheduling would still manage to keep all the *threads* busy,
which is --as far as our speculations currently go-- what is causing
the performance degradation you're seeing.
So, again, if it is confirmed that this workload of yours is a
particularly bad one for SMT, then you are just better off disabling
hyperthreading. And, no, I don't think such a situation is common
enough to say "let's disable it for everyone by default".
I wasn't asking to make it the default in Xen, but whether it would be
reasonable to make it the default for our deployment, where such
workloads are entirely possible.
It all comes down to how common a situation is where you have a
massively oversubscribed system, with a fully CPU-bound workload, for
significant chunks of time.

As said in a previous email, I think that, if this is common enough,
and it is not something just transient, you're in trouble anyway.
And if it's not causing you/your customers trouble already, it might
not be that common, and hence it wouldn't be necessary/wise to disable
SMT.

But of course, you know your workload, and your requirements, much more
than me. If this kind of load really is what you experience, or what
you want to target, then yes, apparently disabling SMT is your best way
to go.
Post by Tamas K Lengyel
If there are
tests that I can run which are the "best case" for hyperthreading, I
would like to repeat those tests to see where we are.
If we come up with a good enough synthetic benchmark, I'll let you
know.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
George Dunlap
2018-10-26 10:11:18 UTC
Permalink
Post by Andrew Cooper
TBH, I'd perhaps start with an admin control which lets them switch
between the two modes, and some instructions on how/why they might want
to try switching.
Trying to second-guess the best HT setting automatically is most likely
going to be a lost cause.  It will be system-specific whether the
same workload is better with or without HT.
There may be hardware-specific performance counters that could be used
to detect when pathological cases are happening. But that would need to
be implemented and/or re-verified on basically every new piece of hardware.

-George
Wei Liu
2018-12-07 18:40:52 UTC
Permalink
Post by Andrew Cooper
As an interesting point to note.  The 32bit PV ABI prohibits sharing of
L3 pagetables, because back in the 32bit hypervisor days, we used to
have linear mappings in the Xen virtual range.  This check is stale
(from a functionality point of view), but still present in Xen.  A
consequence of this is that 32bit PV guests definitely don't share
top-level pagetables across vcpus.
Correction: the 32bit PV ABI prohibits sharing of L2 pagetables, but
L3 pagetables can be shared. So guests will schedule the same
top-level pagetables across vcpus.

But 64bit Xen creates a monitor table for a 32bit PAE guest and puts
the CR3 provided by the guest into the first slot, so pcpus don't
share the same L4 pagetables. The property we want still holds.
Post by Andrew Cooper
Juergen/Boris: Do you have any idea if/how easy this infrastructure
would be to implement for 64bit PV guests as well?  If a PV guest can
advertise via Elfnote that it won't share top-level pagetables, then we
can audit this trivially in Xen.
After reading the Linux kernel code, I think it is not going to be
trivial, as threads in Linux currently share one pagetable (as they
should).

In order to give each thread its own pagetable while still maintaining
the illusion of one address space, there needs to be synchronisation
under the hood.

There is code in Linux to synchronise vmalloc, but that's only for the
kernel portion. The infrastructure to synchronise the userspace
portion is missing.

One idea is to follow the same model as vmalloc -- maintain a
reference pagetable in struct mm and a list of pagetables for the
threads, then synchronise the pagetables in the page fault handler.
But this is probably a bit hard to sell to the Linux maintainers,
because it will touch a lot of non-Xen code, increase complexity and
decrease performance.
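
To make the shape of that scheme concrete, a toy model (purely
illustrative; real pagetables, faults and locking look nothing like
this): the reference table holds the truth, and each thread's private
table is lazily filled from it in the "fault handler".

    # Toy model of the proposed scheme: a reference pagetable per mm,
    # one private table per thread, reconciled on page fault.
    class MM:
        def __init__(self):
            self.reference = {}       # vpage -> frame, the master copy
            self.thread_tables = []   # one private mapping per thread

        def new_thread(self):
            table = {}                # starts empty, filled on demand
            self.thread_tables.append(table)
            return table

        def map_page(self, vpage, frame):
            # Updates land in the reference table; threads catch up
            # lazily, one fault at a time.
            self.reference[vpage] = frame

        def handle_fault(self, table, vpage):
            # Miss: copy the entry from the reference table if present
            # (the "synchronise in the fault handler" step), otherwise
            # it is a genuine fault.
            if vpage in self.reference:
                table[vpage] = self.reference[vpage]
                return table[vpage]
            raise MemoryError("genuine fault at %#x" % vpage)

    mm = MM()
    t1, t2 = mm.new_thread(), mm.new_thread()
    mm.map_page(0x1000, 42)
    assert mm.handle_fault(t1, 0x1000) == 42   # t1 syncs on first touch
    assert mm.handle_fault(t2, 0x1000) == 42   # t2 likewise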

Thoughts?

Wei.
George Dunlap
2018-12-10 12:19:18 UTC
Permalink
Post by Wei Liu
Correction: the 32bit PV ABI prohibits sharing of L2 pagetables, but L3
pagetables can be shared.  So guests can schedule the same top-level
pagetable across vcpus.
But 64bit Xen creates a monitor table for each 32bit PAE guest vcpu and
puts the CR3 provided by the guest into the first slot, so pcpus don't
share the same L4 pagetable.  The property we want still holds.
Ah, right -- but Xen can get away with this because in PAE mode, the
"L3" is just 4 entries that are loaded on a CR3 switch and not
automatically kept in sync by the hardware; i.e., the OS already needs
to do its own "manual syncing" if it updates any of the L3 entries, so
it's the same for Xen.
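To make "manual syncing" concrete, it amounts to something like this --
an illustrative sketch, not code from either codebase:

    /* The four PDPT ("L3") entries are cached inside the processor
     * when %cr3 is loaded, so updating one in memory has no effect
     * until %cr3 is written again. */
    #include <stdint.h>

    extern uint64_t pdpt[4];            /* the 4-entry PAE "L3" */

    static inline unsigned long read_cr3(void)
    {
        unsigned long val;
        asm volatile ("mov %%cr3, %0" : "=r" (val));
        return val;
    }

    static inline void write_cr3(unsigned long val)
    {
        asm volatile ("mov %0, %%cr3" :: "r" (val) : "memory");
    }

    static void set_pdpt_entry(int i, uint64_t entry)
    {
        pdpt[i] = entry;
        write_cr3(read_cr3());          /* force a re-read of the PDPT */
    }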
Post by Wei Liu
Post by Andrew Cooper
Juergen/Boris: Do you have any idea if/how easy this infrastructure
would be to implement for 64bit PV guests as well?  If a PV guest can
advertise via Elfnote that it won't share top-level pagetables, then we
can audit this trivially in Xen.
After reading the Linux kernel code, I think it is not going to be
trivial, because threads in Linux currently share one pagetable (as
they should).
In order to give each thread its own pagetable while still maintaining
the illusion of one address space, there needs to be synchronisation
under the hood.
There is code in Linux to synchronise the vmalloc area, but that only
covers the kernel portion of the address space.  The infrastructure to
synchronise the userspace portion is missing.
One idea is to follow the same model as vmalloc -- maintain a reference
pagetable in struct mm and a list of pagetables for the threads, then
synchronise the pagetables in the page fault handler.  But this is
probably a hard sell to the Linux maintainers, because it would touch a
lot of non-Xen code, increase complexity and decrease performance.
Sorry -- what do you mean "synchronize vmalloc"? If every thread has a
different view of the kernel's vmalloc area, then every thread must have
a different L4 table, right? And if every thread has a different L4
table, then we've already got the main thing we need from Linux, don't we?
Just had an IRL chat with Wei: the synchronization he was talking about
is a synchronization *of the kernel space* *between processes*.  What
we would need in Linux is a synchronization *of userspace* *between
threads*.  So the same basic idea is there, but it would require a
reasonable amount of extra work.

Since the work that would need to be done in Linux is exactly the same
work that we'd need to do in Xen, I think the Linux maintainers would be
pretty annoyed if we asked them to do it instead of doing it ourselves.

-George
