Discussion:
[Xen-devel] Ongoing/future speculative mitigation work
Andrew Cooper
2018-10-18 17:46:22 UTC
Hello,

This is an accumulation and summary of various tasks which have been
discussed since the revelation of the speculative security issues in
January, and also an invitation to discuss alternative ideas.  They are
x86 specific, but a lot of the principles are architecture-agnostic.

1) A secrets-free hypervisor.

Basically every hypercall can be (ab)used by a guest, and used as an
arbitrary cache-load gadget.  Logically, this is the first half of a
Spectre SP1 gadget, and is usually the first stepping stone to
exploiting one of the speculative sidechannels.

Short of compiling Xen with LLVM's Speculative Load Hardening (which is
still experimental, and comes with a ~30% perf hit in the common case),
this is unavoidable.  Furthermore, throwing a few array_index_nospec()
into the code isn't a viable solution to the problem.
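
As an illustration of the problem (a sketch, not Xen code: Xen's real
clamp lives in xen/include/xen/nospec.h, and the table/function names
here are invented), the shape of such a gadget, and of the clamp, is
roughly:

    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE 64
    static uint64_t table[TABLE_SIZE];

    /* Semantic model of array_index_nospec(): yields idx if idx < size,
     * and 0 otherwise.  Real implementations compute the mask in a way
     * which also holds under speculative execution. */
    static inline size_t array_index_nospec(size_t idx, size_t size)
    {
        size_t mask = (size_t)0 - (size_t)(idx < size);

        return idx & mask;
    }

    /* With a guest-controlled idx, the bounds check can be bypassed
     * speculatively, turning table[idx] into a cache-load gadget
     * unless the index is clamped as below. */
    uint64_t lookup(size_t idx)
    {
        if ( idx >= TABLE_SIZE )
            return 0;

        return table[array_index_nospec(idx, TABLE_SIZE)];
    }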

An alternative option is to have less data mapped into Xen's virtual
address space - if a piece of memory isn't mapped, it can't be loaded
into the cache.

An easy first step here is to remove Xen's directmap, which will mean
that guests' general RAM isn't mapped by default into Xen's address
space.  This will come with some performance hit, as the
map_domain_page() infrastructure will now have to actually
create/destroy mappings, but removing the directmap will cause an
improvement for non-speculative security as well (No possibility of
ret2dir as an exploit technique).
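
For illustration, the access pattern would look roughly like this (a
sketch using Xen's existing map_domain_page() API from
xen/include/xen/domain_page.h; the helper name and the absent error
handling are mine):

    #include <xen/domain_page.h>
    #include <xen/string.h>

    /* With no directmap, guest RAM is reached via a transient mapping
     * rather than via mfn_to_virt().  Illustrative helper only. */
    static void copy_from_guest_frame(void *dst, mfn_t mfn,
                                      unsigned int offset, unsigned int len)
    {
        const void *va = map_domain_page(mfn);  /* create mapping on demand */

        memcpy(dst, va + offset, len);
        unmap_domain_page(va);                  /* and tear it down again */
    }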

Beyond the directmap, there are plenty of other interesting secrets in
the Xen heap and other mappings, such as the stacks of the other pcpus. 
Fixing this requires moving Xen to having a non-uniform memory layout,
and this is much harder to change.  I already experimented with this as
a meltdown mitigation around about a year ago, and posted the resulting
series on Jan 4th,
https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00274.html,
some trivial bits of which have already found their way upstream.

To have a non-uniform memory layout, Xen may not share L4 pagetables. 
i.e. Xen must never have two pcpus which reference the same pagetable in
%cr3.

This property already holds for 32bit PV guests, and all HVM guests, but
64bit PV guests are the sticking point.  Because Linux has a flat memory
layout, when a 64bit PV guest schedules two threads from the same
process on separate vcpus, those two vcpus have the same virtual %cr3,
and currently, Xen programs the same real %cr3 into hardware.

If we want Xen to have a non-uniform layout, our two options are:
* Fix Linux to have the same non-uniform layout that Xen wants
(Backwards compatibility for older 64bit PV guests can be achieved with
xen-shim).
* Use the XPTI algorithm (specifically, the pagetable sync/copy part)
forevermore.

Option 2 isn't great (especially for perf on hardware which has the
Meltdown fix, where XPTI would otherwise be unnecessary), but does
keep all the necessary changes in Xen.  Option 1 looks to be the better
option longterm.

As an interesting point to note: the 32bit PV ABI prohibits sharing of
L3 pagetables, because back in the 32bit hypervisor days, we used to
have linear mappings in the Xen virtual range.  This check is stale
(from a functionality point of view), but still present in Xen.  A
consequence of this is that 32bit PV guests definitely don't share
top-level pagetables across vcpus.

Juergen/Boris: Do you have any idea if/how easy this infrastructure
would be to implement for 64bit PV guests as well?  If a PV guest can
advertise via Elfnote that it won't share top-level pagetables, then we
can audit this trivially in Xen.
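
For illustration, such an advertisement could be carried in an ELF note
along these lines (entirely hypothetical: the note number and semantics
below are invented for this sketch; Linux's real Xen notes are emitted
from assembly in arch/x86/xen/xen-head.S):

    #include <stdint.h>

    #define XEN_ELFNOTE_NO_SHARED_TOPLEVEL 99  /* hypothetical number */

    /* "This kernel never shares top-level pagetables across vcpus." */
    static const struct {
        uint32_t namesz, descsz, type;
        char name[4];           /* "Xen" + NUL */
        uint32_t desc;
    } __attribute__((packed))
    no_shared_toplevel
    __attribute__((used, section(".note.Xen"), aligned(4))) = {
        .namesz = 4,
        .descsz = 4,
        .type   = XEN_ELFNOTE_NO_SHARED_TOPLEVEL,
        .name   = "Xen",
        .desc   = 1,
    };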


2) Scheduler improvements.

(I'm afraid this is rather more sparse because I'm less familiar with
the scheduler details.)

At the moment, all of Xen's schedulers will happily put two vcpus from
different domains on sibling hyperthreads.  There has been a lot of
sidechannel research over the past decade demonstrating ways for one
thread to infer what is going on on the other, but L1TF is the first
vulnerability I'm aware of which allows one thread to directly read data
out of the other.

Either way, it is now definitely a bad thing to run different guests
concurrently on siblings.  Fixing this by simply not scheduling vcpus
from a different guest on siblings does result in a lower resource
utilisation, most notably when there are an odd number of runnable vcpus in
a domain, as the other thread is forced to idle.
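
As a sketch of the basic constraint (illustrative only, not a patch: it
leans on Xen's cpu_sibling_mask / curr_on_cpu() / is_idle_vcpu(), but a
real version belongs inside each scheduler's pcpu-selection logic):

    /* May vcpu v run on pcpu 'cpu' without putting vcpus of two
     * different domains onto sibling hyperthreads? */
    static bool sibling_safe(const struct vcpu *v, unsigned int cpu)
    {
        unsigned int sib;

        for_each_cpu ( sib, per_cpu(cpu_sibling_mask, cpu) )
        {
            const struct vcpu *cur = curr_on_cpu(sib);

            if ( sib == cpu || cur == NULL || is_idle_vcpu(cur) )
                continue;

            if ( cur->domain != v->domain )
                return false;
        }

        return true;
    }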

A step beyond this is core-aware scheduling, where we schedule in units
of a virtual core rather than a virtual thread.  This has much better
behaviour from the guest's point of view, as the actually-scheduled
topology remains consistent, but does potentially come with even lower
utilisation if every other thread in the guest is idle.

A side requirement for core-aware scheduling is for Xen to have an
accurate idea of the topology presented to the guest.  I need to dust
off my Toolstack CPUID/MSR improvement series and get that upstream.

One of the most insidious problems with L1TF is that, with
hyperthreading enabled, a malicious guest kernel can engineer arbitrary
data leakage by having one thread scanning the expected physical
address, and the other thread using an arbitrary cache-load gadget in
hypervisor context.  This occurs because the L1 data cache is shared by
threads.

A solution to this issue was proposed, whereby Xen synchronises siblings
on vmexit/entry, so we are never executing code in two different
privilege levels.  Getting this working would make it safe to continue
using hyperthreading even in the presence of L1TF.  Obviously, it's going
to come with a perf hit, but compared to disabling hyperthreading, all it's
got to do is beat a ~60% perf hit to make it the preferable option for
making your system L1TF-proof.
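
Very roughly, the rendezvous would look like the sketch below (not Xen
code: the per-core state and the force_sibling_vmexit() helper are
assumed, and a real implementation has to worry about IPIs, NMIs and
the races this sketch ignores):

    /* One instance per physical core. */
    struct core_rendezvous {
        atomic_t in_xen;    /* number of threads currently in Xen */
    };

    /* On vmexit: kick the sibling out of guest context, then wait for
     * it to arrive, so guest and hypervisor code never run
     * concurrently on the two threads of a core. */
    void rendezvous_enter(struct core_rendezvous *r)
    {
        atomic_inc(&r->in_xen);
        force_sibling_vmexit();                  /* assumed helper */
        while ( atomic_read(&r->in_xen) < 2 )
            cpu_relax();
    }

    /* Before vmentry: leave together again. */
    void rendezvous_exit(struct core_rendezvous *r)
    {
        atomic_dec(&r->in_xen);
        while ( atomic_read(&r->in_xen) > 0 )
            cpu_relax();
    }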

Anyway - enough of my rambling for now.  Thoughts?

~Andrew
Dario Faggioli
2018-10-19 08:09:30 UTC
Post by Andrew Cooper
Hello,
Hey,

This is very accurate and useful... thanks for it. :-)
Post by Andrew Cooper
1) A secrets-free hypervisor.
Basically every hypercall can be (ab)used by a guest, and used as an
arbitrary cache-load gadget. Logically, this is the first half of a
Spectre SP1 gadget, and is usually the first stepping stone to
exploiting one of the speculative sidechannels.
Short of compiling Xen with LLVM's Speculative Load Hardening (which is
still experimental, and comes with a ~30% perf hit in the common case),
this is unavoidable. Furthermore, throwing a few
array_index_nospec()
into the code isn't a viable solution to the problem.
An alternative option is to have less data mapped into Xen's virtual
address space - if a piece of memory isn't mapped, it can't be loaded
into the cache.
[...]
2) Scheduler improvements.
(I'm afraid this is rather more sparse because I'm less familiar with
the scheduler details.)
At the moment, all of Xen's schedulers will happily put two vcpus from
different domains on sibling hyperthreads. There has been a lot of
sidechannel research over the past decade demonstrating ways for one
thread to infer what is going on on the other, but L1TF is the first
vulnerability I'm aware of which allows one thread to directly read data
out of the other.
Either way, it is now definitely a bad thing to run different guests
concurrently on siblings.
Well, yes. But, as you say, L1TF, and I'd say TLBleed as well, are the
first serious issues discovered so far and, for instance, even on x86,
not all Intel CPUs and none of the AMD ones, AFAIK, are affected.

Therefore, although I certainly think we _must_ have the proper
scheduler enhancements in place (and in fact I'm working on that :-D)
it should IMO still be possible for the user to decide whether or not
to use them (either by opting-in or opting-out, I don't care much at
this stage).
Post by Andrew Cooper
Fixing this by simply not scheduling vcpus
from a different guest on siblings does result in a lower resource
utilisation, most notably when there are an odd number of runnable vcpus in
a domain, as the other thread is forced to idle.
Right.
Post by Andrew Cooper
A step beyond this is core-aware scheduling, where we schedule in units
of a virtual core rather than a virtual thread. This has much better
behaviour from the guest's point of view, as the actually-scheduled
topology remains consistent, but does potentially come with even lower
utilisation if every other thread in the guest is idle.
Yes, basically, what you describe as 'core-aware scheduling' here can
be built on top of what you had described above as 'not scheduling
vcpus from different guests'.

I mean, we can/should put ourselves in a position where the user can
choose if he/she wants:
- just 'plain scheduling', as we have now,
- "just" that only vcpus of the same domains are scheduled on siblings
hyperthread,
- full 'core-aware scheduling', i.e., only vcpus that the guest
actually sees as virtual hyperthread siblings are scheduled on
hardware hyperthread siblings.

About the performance impact: indeed, it's even higher with core-aware
scheduling. Something we can look at doing is acting on the
guest scheduler, e.g., telling it to try to "pack the load", and keep
siblings busy, instead of trying to avoid doing that (which is what
happens by default in most cases).

In Linux, this can be done by playing with the sched-flags (see, e.g.,
https://elixir.bootlin.com/linux/v4.18/source/include/linux/sched/topology.h#L20 ,
and /proc/sys/kernel/sched_domain/cpu*/domain*/flags ).

The idea would be to avoid, as much as possible, the case when "every
other thread is idle in the guest". I'm not sure about being able to do
something by default, but we can certainly document things (like "if
you enable core-scheduling, also do `echo 1234 > /proc/sys/.../flags'
in your Linux guests").

I haven't checked whether other OSs' schedulers have something similar.
Post by Andrew Cooper
A side requirement for core-aware scheduling is for Xen to have an
accurate idea of the topology presented to the guest. I need to dust
off my Toolstack CPUID/MSR improvement series and get that upstream.
Indeed. Without knowing which one of the guest's vcpus are to be
considered virtual hyperthread siblings, I can only get you as far as
"only scheduling vcpus of the same domain on sibling hyperthreads". :-)
Post by Andrew Cooper
One of the most insidious problems with L1TF is that, with
hyperthreading enabled, a malicious guest kernel can engineer arbitrary
data leakage by having one thread scanning the expected physical
address, and the other thread using an arbitrary cache-load gadget in
hypervisor context.  This occurs because the L1 data cache is shared by
threads.
Right. So, sorry if this is a stupid question, but how does this relate
to the "secret-free hypervisor", and to the "if a piece of memory
isn't mapped, it can't be loaded into the cache" idea?

So, basically, I'm asking whether I am understanding it correctly that
secret-free Xen + core-aware scheduling would *not* be enough for
mitigating L1TF properly (and if the answer is no, why... but only if
you have 5 mins to explain it to me :-P).

In fact, ISTR that core-scheduling, plus something that looked to me
similar enough to "secret-free Xen", is how Microsoft claims to be
mitigating L1TF on Hyper-V...
Post by Andrew Cooper
A solution to this issue was proposed, whereby Xen synchronises siblings
on vmexit/entry, so we are never executing code in two different
privilege levels.  Getting this working would make it safe to continue
using hyperthreading even in the presence of L1TF.
Err... ok, but we still want core-aware scheduling, or at least we want
to avoid having vcpus from different domains on siblings, don't we? In
order to avoid leaks between guests, I mean.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
Andrew Cooper
2018-10-19 12:17:11 UTC
Post by Dario Faggioli
Hey,
This is very accurate and useful... thanks for it. :-)
Post by Andrew Cooper
Either way, it is now definitely a bad thing to run different guests
concurrently on siblings.
Well, yes. But, as you say, L1TF, and I'd say TLBleed as well, are the
first serious issues discovered so far and, for instance, even on x86,
not all Intel CPUs and none of the AMD ones, AFAIK, are affected.
TLBleed is an excellent paper and associated research, but is still just
inference - a vast quantity of post-processing is required to extract
the key.

There are plenty of other sidechannels which affect all SMT
implementations, such as the effects of executing an mfence instruction,
execution unit contention, and so on.
Post by Dario Faggioli
Therefore, although I certainly think we _must_ have the proper
scheduler enhancements in place (and in fact I'm working on that :-D)
it should IMO still be possible for the user to decide whether or not
to use them (either by opting-in or opting-out, I don't care much at
this stage).
I'm not suggesting that we leave people without a choice, but given an
option which doesn't share siblings between different guests, it should
be the default.
Post by Dario Faggioli
Post by Andrew Cooper
One of the most insidious problems with L1TF is that, with
hyperthreading enabled, a malicious guest kernel can engineer arbitrary
data leakage by having one thread scanning the expected physical
address, and the other thread using an arbitrary cache-load gadget in
hypervisor context.  This occurs because the L1 data cache is shared by
threads.
Right. So, sorry if this is a stupid question, but how does this relate
to the "secret-free hypervisor", and to the "if a piece of memory
isn't mapped, it can't be loaded into the cache" idea?
So, basically, I'm asking whether I am understanding it correctly that
secret-free Xen + core-aware scheduling would *not* be enough for
mitigating L1TF properly (and if the answer is no, why... but only if
you have 5 mins to explain it to me :-P).
In fact, ISTR that core-scheduling, plus something that looked to me
similar enough to "secret-free Xen", is how Microsoft claims to be
mitigating L1TF on Hyper-V...
Correct - that is what Hyper-V appears to be doing.

It's best to consider the secret-free Xen and scheduler improvements as
orthogonal.  In particular, the secret-free Xen is defence in depth
against SP1, and the risk of future issues, but does have
non-speculative benefits as well.

That said, the only way to use HT and definitely be safe from L1TF without
a secret-free Xen is to have the synchronised entry/exit logic working.
Post by Dario Faggioli
Post by Andrew Cooper
A solution to this issue was proposed, whereby Xen synchronises siblings
on vmexit/entry, so we are never executing code in two different
privilege levels.  Getting this working would make it safe to continue
using hyperthreading even in the presence of L1TF.
Err... ok, but we still want core-aware scheduling, or at least we want
to avoid having vcpus from different domains on siblings, don't we? In
order to avoid leaks between guests, I mean.
Ideally, we'd want all of these.  I expect the only reasonable way to
develop them is one on top of another.

~Andrew
Mihai Donțu
2018-10-22 09:32:54 UTC
Post by Andrew Cooper
[...]
Post by Dario Faggioli
Therefore, although I certainly think we _must_ have the proper
scheduler enhancements in place (and in fact I'm working on that :-D)
it should IMO still be possible for the user to decide whether or not
to use them (either by opting-in or opting-out, I don't care much at
this stage).
I'm not suggesting that we leave people without a choice, but given an
option which doesn't share siblings between different guests, it should
be the default.
+1
Post by Andrew Cooper
[...]
It's best to consider the secret-free Xen and scheduler improvements as
orthogonal. In particular, the secret-free Xen is defence in depth
against SP1, and the risk of future issues, but does have
non-speculative benefits as well.
That said, the only way to use HT and definitely be safe from L1TF without
a secret-free Xen is to have the synchronised entry/exit logic working.
Post by Dario Faggioli
Post by Andrew Cooper
A solution to this issue was proposed, whereby Xen synchronises
siblings on vmexit/entry, so we are never executing code in two different
privilege levels. Getting this working would make it safe to
continue using hyperthreading even in the presence of L1TF.
Err... ok, but we still want core-aware scheduling, or at least we want
to avoid having vcpus from different domains on siblings, don't we? In
order to avoid leaks between guests, I mean.
Ideally, we'd want all of these. I expect the only reasonable way to
develop them is one on top of another.
If there was a vote, I'd place the scheduler changes at the top.
--
Mihai Donțu
Wei Liu
2018-10-22 14:55:34 UTC
Post by Andrew Cooper
An easy first step here is to remove Xen's directmap, which will mean
that guests' general RAM isn't mapped by default into Xen's address
space.  This will come with some performance hit, as the
map_domain_page() infrastructure will now have to actually
create/destroy mappings, but removing the directmap will cause an
improvement for non-speculative security as well (No possibility of
ret2dir as an exploit technique).
I have looked into making the "separate xenheap domheap with partial
direct map" mode (see common/page_alloc.c) work, but found it not as
straightforward as it should've been.

Before I spend more time on this, I would like some opinions on whether
there is some other approach which might be more useful than that mode.
Post by Andrew Cooper
Beyond the directmap, there are plenty of other interesting secrets in
the Xen heap and other mappings, such as the stacks of the other pcpus. 
Fixing this requires moving Xen to having a non-uniform memory layout,
and this is much harder to change.  I already experimented with this as
a meltdown mitigation around about a year ago, and posted the resulting
series on Jan 4th,
https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00274.html,
some trivial bits of which have already found their way upstream.
To have a non-uniform memory layout, Xen may not share L4 pagetables. 
i.e. Xen must never have two pcpus which reference the same pagetable in
%cr3.
This property already holds for 32bit PV guests, and all HVM guests, but
64bit PV guests are the sticking point.  Because Linux has a flat memory
layout, when a 64bit PV guest schedules two threads from the same
process on separate vcpus, those two vcpus have the same virtual %cr3,
and currently, Xen programs the same real %cr3 into hardware.
Which bit of Linux code are you referring to? If you remember it off the
top of your head, it would save me some time digging around. If not,
never mind, I can look it up myself.
Post by Andrew Cooper
* Fix Linux to have the same non-uniform layout that Xen wants
(Backwards compatibility for older 64bit PV guests can be achieved with
xen-shim).
* Use the XPTI algorithm (specifically, the pagetable sync/copy part)
forevermore.
Option 2 isn't great (especially for perf on fixed hardware), but does
keep all the necessary changes in Xen.  Option 1 looks to be the better
option longterm.
What is the problem with 1+2 at the same time? I think XPTI can be
enabled / disabled on a per-guest basis?

Wei.
Woodhouse, David
2018-10-22 15:09:09 UTC
Adding Stefan to Cc.

Should we take this to the spexen or another mailing list?
Andrew Cooper
2018-10-22 15:14:02 UTC
Post by Woodhouse, David
Adding Stefan to Cc.
Should we take this to the spexen or another mailing list?
Now that L1TF is public, so is all of this.  I see no reason to continue
it in private.

~Andrew
Jan Beulich
2018-10-25 14:50:12 UTC
Post by Wei Liu
I have looked into making the "separate xenheap domheap with partial
direct map" mode (see common/page_alloc.c) work but found it not as
straight forward as it should've been.
Before I spend more time on this, I would like some opinions on if there
is other approach which might be more useful than that mode.
How would such a split heap model help with L1TF, where the
guest specifies host physical addresses in its vulnerable page
table entries (and hence could spy at xenheap but - due to not
being mapped - not domheap)?

Jan
George Dunlap
2018-10-25 14:56:50 UTC
Post by Jan Beulich
How would such a split heap model help with L1TF, where the
guest specifies host physical addresses in its vulnerable page
table entries
I don't think it would.
Post by Jan Beulich
(and hence could spy at xenheap but - due to not
being mapped - not domheap)?
Er, didn't follow this bit -- if L1TF is related to host physical
addresses, how does having a virtual mapping in Xen affect things in any
way?

-George
Jan Beulich
2018-10-25 15:02:04 UTC
Post by George Dunlap
Post by Jan Beulich
(and hence could spy at xenheap but - due to not
being mapped - not domheap)?
Er, didn't follow this bit -- if L1TF is related to host physical
addresses, how does having a virtual mapping in Xen affect things in any
way?
Hmm, indeed. Scratch that part.

Jan
Andrew Cooper
2018-10-25 16:29:15 UTC
Post by Jan Beulich
Post by George Dunlap
Post by Jan Beulich
How would such a split heap model help with L1TF, where the
guest specifies host physical addresses in its vulnerable page
table entries
I don't think it would.
Post by Jan Beulich
(and hence could spy at xenheap but - due to not
being mapped - not domheap)?
Er, didn't follow this bit -- if L1TF is related to host physical
addresses, how does having a virtual mapping in Xen affect things in any
way?
Hmm, indeed. Scratch that part.
There seems to be quite a bit of confusion in these replies.

To exploit L1TF, the data in question has to be present in the L1 cache
when the attack is performed.

In practice, an attacker has to arrange for target data to be resident
in the L1 cache.  One way it can do this when HT is enabled is via a
cache-load gadget such as the first half of an SP1 attack on the other
hyperthread.  A different mechanism is to try and cause Xen to
speculatively access a piece of data, and have the hardware prefetcher
bring it into the cache.

Everything which is virtually mapped in Xen is potentially vulnerable,
and the goal of the "secret-free Xen" is to make sure that in the
context of one vcpu pulling off an attack like this, there is no
interesting data which can be exfiltrated.

A single xenheap model means that everything allocated with
alloc_xenheap_page() (e.g. struct domain, struct vcpu, pcpu stacks) are
potentially exposed to all domains.

A split xenheap model means that data pertaining to other guests isn't
mapped in the context of this vcpu, so cannot be brought into the cache.

~Andrew
George Dunlap
2018-10-25 16:43:24 UTC
Post by Andrew Cooper
There seems to be quite a bit of confusion in these replies.
To exploit L1TF, the data in question has to be present in the L1 cache
when the attack is performed.
In practice, an attacker has to arrange for target data to be resident
in the L1 cache.  One way it can do this when HT is enabled is via a
cache-load gadget such as the first half of an SP1 attack on the other
hyperthread.  A different mechanism is to try and cause Xen to
speculatively access a piece of data, and have the hardware prefetcher
bring it into the cache.
Right -- so a split xen/domheap model doesn't prevent L1TF attacks, but
it does make L1TF much harder to pull off, because it now only works if
you can manage to get onto the same core as the victim, after the victim
has accessed the data you want.

So it would reduce the risk of L1TF significantly, but not enough (I
think) that we could recommend disabling other mitigations.

-George
Andrew Cooper
2018-10-25 16:50:03 UTC
Post by George Dunlap
Right -- so a split xen/domheap model doesn't prevent L1TF attacks, but
it does make L1TF much harder to pull off, because it now only works if
you can manage to get onto the same core as the victim, after the victim
has accessed the data you want.
So it would reduce the risk of L1TF significantly, but not enough (I
think) that we could recommend disabling other mitigations.
Correct.  All of these suggestions are for increased defence in depth. 
They are not replacements for the existing mitigations.

From a practical point of view, until people work out how to
comprehensively solve SP1, reducing the quantity of mapped data is the
only practical defence that an OS/Hypervisor has.

~Andrew
George Dunlap
2018-10-25 17:07:30 UTC
Post by Andrew Cooper
Correct.  All of these suggestions are for increased defence in depth. 
They are not replacements for the existing mitigations.
But it could be a mitigation for, say, Meltdown, yes? I'm trying to
remember the details; but wouldn't a "secret-free Xen" mean that
disabling XPTI entirely for 64-bit PV guests would be a reasonable
decision (even if many people left it enabled 'just in case')?

-George
Jan Beulich
2018-10-26 09:16:15 UTC
Post by Andrew Cooper
A split xenheap model means that data pertaining to other guests isn't
mapped in the context of this vcpu, so cannot be brought into the cache.
It was not clear to me from Wei's original mail that talk here is
about "split" in a sense of "per-domain"; I was assuming the
CONFIG_SEPARATE_XENHEAP mode instead.

Jan
Wei Liu
2018-10-26 09:28:20 UTC
Post by Jan Beulich
Post by Andrew Cooper
A split xenheap model means that data pertaining to other guests isn't
mapped in the context of this vcpu, so cannot be brought into the cache.
It was not clear to me from Wei's original mail that talk here is
about "split" in a sense of "per-domain"; I was assuming the
CONFIG_SEPARATE_XENHEAP mode instead.
The split heap was indeed referring to CONFIG_SEPARATE_XENHEAP mode, yet
what I wanted most is the partial direct map, which reduces the amount
of data mapped inside Xen context -- the original idea was removing the
direct map, discussed during one of the calls IIRC. I thought making the
partial direct map mode work, and making it as small as possible, will
get us 90% there.

The "per-domain" heap is a different work item.

Wei.
Jan Beulich
2018-10-26 09:56:58 UTC
Post by Wei Liu
The split heap was indeed referring to CONFIG_SEPARATE_XENHEAP mode, yet
what I wanted most is the partial direct map, which reduces the amount
of data mapped inside Xen context -- the original idea was removing the
direct map, discussed during one of the calls IIRC. I thought making the
partial direct map mode work, and making it as small as possible, will
get us 90% there.
The "per-domain" heap is a different work item.
But if we mean to go that route, going (back) to the separate
Xen heap model seems just like an extra complication to me.
Yet I agree that this would remove the need for a fair chunk of
the direct map. Otoh a statically partitioned Xen heap would
bring back scalability issues which we had specifically meant to
get rid of by moving away from that model.

Jan
George Dunlap
2018-10-26 10:51:13 UTC
Post by Jan Beulich
But if we mean to go that route, going (back) to the separate
Xen heap model seems just like an extra complication to me.
Yet I agree that this would remove the need for a fair chunk of
the direct map. Otoh a statically partitioned Xen heap would
bring back scalability issues which we had specifically meant to
get rid of by moving away from that model.
I think turning SEPARATE_XENHEAP back on would just be the first step.
We definitely would then need to sort things out so that it's scalable
again.

After system set-up, the key difference between xenheap and domheap
pages is that xenheap pages are assumed to be always mapped (i.e., you
can keep a pointer to them and it will be valid), whereas domheap pages
cannot be assumed to be mapped, and need to be wrapped with
[un]map_domain_page().

The basic solution involves having a xenheap virtual address mapping
area not tied to the physical layout of the memory. domheap and xenheap
memory would have to come from the same pool, but xenheap would need to
be mapped into the xenheap virtual memory region before being returned.

-George
Jan Beulich
2018-10-26 11:20:47 UTC
Post by George Dunlap
I think turning SEPARATE_XENHEAP back on would just be the first step.
We definitely would then need to sort things out so that it's scalable
again.
After system set-up, the key difference between xenheap and domheap
pages is that xenheap pages are assumed to be always mapped (i.e., you
can keep a pointer to them and it will be valid), whereas domheap pages
cannot be assumed to be mapped, and need to be wrapped with
[un]map_domain_page().
The basic solution involves having a xenheap virtual address mapping
area not tied to the physical layout of the memory. domheap and xenheap
memory would have to come from the same pool, but xenheap would need to
be mapped into the xenheap virtual memory region before being returned.
Wouldn't this most easily be done by making alloc_xenheap_pages()
call alloc_domheap_pages() and then vmap() the result? Of course
we may need to grow the vmap area in that case.
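
Roughly the following, then (a sketch of that suggestion only: it
ignores freeing, xenheap/domheap memflags differences and vmap-space
sizing, and the on-stack mfn array is purely for brevity):

    void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
    {
        struct page_info *pg = alloc_domheap_pages(NULL, order, memflags);
        mfn_t mfns[1u << order];
        unsigned int i;

        if ( pg == NULL )
            return NULL;

        for ( i = 0; i < (1u << order); i++ )
            mfns[i] = mfn_add(page_to_mfn(pg), i);

        /* Map into the vmap region instead of relying on the directmap. */
        return vmap(mfns, 1u << order);
    }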

Jan
George Dunlap
2018-10-26 11:24:43 UTC
Post by Jan Beulich
Wouldn't this most easily be done by making alloc_xenheap_pages()
call alloc_domheap_pages() and then vmap() the result? Of course
we may need to grow the vmap area in that case.
I couldn't answer that question without a lot more digging. :-) I'd
always assumed that the original reason for having the xenheap
direct-mapped on 32-bit was something to do with early-boot allocation;
if there is something tricky there, we'd need to special-case the
early-boot allocation somehow.

-George
Jan Beulich
2018-10-26 11:33:35 UTC
Post by George Dunlap
Post by Jan Beulich
Post by George Dunlap
The basic solution involves having a xenheap virtual address mapping
area not tied to the physical layout of the memory. domheap and xenheap
memory would have to come from the same pool, but xenheap would need to
be mapped into the xenheap virtual memory region before being returned.
Wouldn't this most easily be done by making alloc_xenheap_pages()
call alloc_domheap_pages() and then vmap() the result? Of course
we may need to grow the vmap area in that case.
I couldn't answer that question without a lot more digging. :-) I'd
always assumed that the original reason for having the
xenheap direct-mapped on 32-bit was something to do with early-boot
allocation; if there is something tricky there, we'd need to
special-case the early-boot allocation somehow.
The reason for the split on 32-bit was simply the lack of sufficient
VA space.

Jan
George Dunlap
2018-10-26 11:43:32 UTC
Post by Jan Beulich
Post by George Dunlap
I couldn't answer that question without a lot more digging. :-) I'd
always assumed that the original reason for having the
xenheap direct-mapped on 32-bit was something to do with early-boot
allocation; if there is something tricky there, we'd need to
special-case the early-boot allocation somehow.
The reason for the split on 32-bit was simply the lack of sufficient
VA space.
That tells me why the domheap was *not* direct-mapped; but it doesn't
tell me why the xenheap *was*. Was it perhaps just something that
evolved from what we inherited from Linux?

-George
Jan Beulich
2018-10-26 11:45:49 UTC
Post by George Dunlap
Post by Jan Beulich
The reason for the split on 32-bit was simply the lack of sufficient
VA space.
That tells me why the domheap was *not* direct-mapped; but it doesn't
tell me why the xenheap *was*. Was it perhaps just something that
evolved from what we inherited from Linux?
Presumably, but there I'm really the wrong one to ask. When I joined,
things had long been that way.

Jan
Tamas K Lengyel
2018-10-24 15:24:32 UTC
Post by Andrew Cooper
A solution to this issue was proposed, whereby Xen synchronises siblings
on vmexit/entry, so we are never executing code in two different
privilege levels. Getting this working would make it safe to continue
using hyperthreading even in the presence of L1TF. Obviously, it's going
to come with a perf hit, but compared to disabling hyperthreading, all it's
got to do is beat a ~60% perf hit to make it the preferable option for
making your system L1TF-proof.
Could you shed some light on what tests were done where that 60%
performance hit was observed? We have performed intensive stress-tests
to confirm this, but according to our findings turning off
hyper-threading is actually improving performance on all machines we
tested thus far.

Thanks,
Tamas
Dario Faggioli
2018-10-25 16:01:50 UTC
Post by Tamas K Lengyel
Post by Andrew Cooper
A solution to this issue was proposed, whereby Xen synchronises siblings
on vmexit/entry, so we are never executing code in two different
privilege levels. Getting this working would make it safe to continue
using hyperthreading even in the presence of L1TF. Obviously, it's going
to come with a perf hit, but compared to disabling hyperthreading, all it's
got to do is beat a ~60% perf hit to make it the preferable option for
making your system L1TF-proof.
Could you shed some light on what tests were done where that 60%
performance hit was observed?
I don't have any data handy right now, but I have certainly seen
hyperthreading being beneficial for performance in more than a few
benchmarks and workloads. How much so, this indeed varies *a lot* both
with the platform and with the workload itself.

That being said, I agree it would be good to have as much data as
possible. I'll try to do something about that.
Post by Tamas K Lengyel
We have performed intensive stress-tests
to confirm this but according to our findings turning off
hyper-threading is actually improving performance on all machines we
tested thus far.
Which is indeed very interesting. But, as we're discussing in the other
thread, I would, in your case, do some more measurements, varying the
configuration of the system, in order to be absolutely sure you are not
hitting some bug or anomaly.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
Tamas K Lengyel
2018-10-25 16:25:28 UTC
Post by Dario Faggioli
Post by Tamas K Lengyel
Could you shed some light what tests were done where that 60%
performance hit was observed?
I don't have any data handy right now, but I have certainly seen
hyperthreading being beneficial for performance in more than a few
benchmarks and workloads. How much so, this indeed varies *a lot* both
with the platform and with the workload itself.
That being said, I agree it would be good to have as much data as
possible. I'll try to do something about that.
Post by Tamas K Lengyel
We have performed intensive stress-tests
to confirm this but according to our findings turning off
hyper-threading is actually improving performance on all machines we
tested thus far.
Which is indeed very interesting. But, as we're discussing in the other
thread, I would, in your case, do some more measurements, varying the
configuration of the system, in order to be absolutely sure you are not
hitting some bug or anomaly.
Sure, I would be happy to repeat tests that were done in the past to
see whether they still hold. We have run this test with Xen 4.10, 4.11
and 4.12-unstable on laptops and desktops, using credit1 and credit2,
and it is consistent that hyperthreading yields the worst performance.
It varies between platforms, but it's around a 10-40% performance hit
with hyperthreading on. The test we run is very CPU intensive and
heavily oversubscribes the system. But I don't think it would be all
that unusual to run into such a setup in the real world from time to
time.
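
For concreteness, the shape of our stressor is roughly the sketch
below (illustrative only; the real harness, guest counts, vCPU counts
and durations differ): each guest spins every vCPU on pure ALU work,
so the total number of runnable vCPUs far exceeds the host's physical
cores.

    # Illustrative CPU-bound stressor, run inside each guest; all names
    # and numbers here are hypothetical, not our actual test harness.
    import multiprocessing
    import time

    def burn(seconds):
        # Spin on cheap integer work for the given wall-clock time.
        deadline = time.time() + seconds
        x = 0
        while time.time() < deadline:
            x = (x * 1103515245 + 12345) & 0xFFFFFFFF

    if __name__ == "__main__":
        # One spinning process per vCPU the guest sees; with several
        # such guests, runnable vCPUs greatly outnumber physical cores.
        procs = [multiprocessing.Process(target=burn, args=(60,))
                 for _ in range(multiprocessing.cpu_count())]
        for p in procs:
            p.start()
        for p in procs:
            p.join()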

Tamas
Dario Faggioli
2018-10-25 17:23:36 UTC
Permalink
Post by Tamas K Lengyel
Sure, I would be happy to repeat tests that were done in the past to
see whether they still hold. We have run this test with Xen 4.10, 4.11
and 4.12-unstable on laptops and desktops, using credit1 and credit2,
and it is consistent that hyperthreading yields the worst performance.
So, just to be clear, I'm not saying it's impossible to find a workload
for which HT is detrimental. Quite the opposite. And these benchmarks
you're running might well fall into that category.

I'm just suggesting to double check that. :-)
Post by Tamas K Lengyel
It varies between platforms, but it's around a 10-40% performance hit
with hyperthreading on. The test we run is very CPU intensive and
heavily oversubscribes the system. But I don't think it would be all
that unusual to run into such a setup in the real world from time to
time.
Ah, ok, so you're _heavily_ oversubscribing...

So, I don't think that a heavily oversubscribed host, where all vCPUs
want to run 100% CPU-intensive activities --and this not being some
transient situation-- is that common. And for the ones for which it
is, there is not much we can do, hyperthreading or not.

In any case, hyperthreading works best when the workload is mixed,
where it helps make sure that IO-bound tasks have enough chances to
issue a lot of IO requests, without conflicting too much with the
CPU-bound tasks doing their number/logic crunching.

Having _everyone_ wanting to do actual stuff on the CPUs is, IMO, one
of the worst workloads for hyperthreading, and it is in fact a workload
where I've always seen it having the least beneficial effect on
performance. I guess it's possible that, in your case, it's actually
really doing more harm than good.

It's an interesting data point, but I wouldn't use a workload like that
to measure the benefit, or the impact, of an SMT related change.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
Tamas K Lengyel
2018-10-25 17:29:08 UTC
Permalink
Post by Dario Faggioli
Having _everyone_ wanting to do actual stuff on the CPUs is, IMO, one
of the worst workloads for hyperthreading, and it is in fact a workload
where I've always seen it having the least beneficial effect on
performance. I guess it's possible that, in your case, it's actually
really doing more harm than good.
It's an interesting data point, but I wouldn't use a workload like that
to measure the benefit, or the impact, of an SMT related change.
Thanks, and indeed this test is the worst-case scenario for
hyperthreading; that was our goal. While a typical workload may not
be similar, it is a possible one for the system we are concerned
about. So if at any given time the benefit of hyperthreading ranges
between, say, +30% and -30%, and we can't predict the workload or
optimize for it, it looks like a safe bet to just disable
hyperthreading. Would you agree?

Tamas
Dario Faggioli
2018-10-26 07:31:12 UTC
Permalink
Post by Tamas K Lengyel
Thanks, and indeed this test is the worst-case scenario for
hyperthreading; that was our goal. While a typical workload may not
be similar, it is a possible one for the system we are concerned
about.
Sure, and that is fine. But at the same time, it has little, if
anything, to do with speculative execution, L1TF and coscheduling.
It's just that, with this workload, hyperthreading is bad, and there's
not much more to say.
Post by Tamas K Lengyel
So if at any given time the benefit of hyperthreading ranges
between, say, +30% and -30%, and we can't predict the workload or
optimize for it, it looks like a safe bet to just disable
hyperthreading. Would you agree?
That's, AFAICR, OpenBSD's take, back when TLBleed came out. But, no,
I don't really agree. Not entirely, at least.

The way I see it, is that there are special workloads where SMT gives,
say, -30%, and those should just disable it, and be done.

For others, it's perfectly fine to keep it on, and we should, ideally,
find a solution to the security issues it introduces, without
nullifying the performance benefit it introduces.

And when it comes to judging how good, or bad, such solutions are, we
should consider both the best and the worst case scenarios; and I'd
say that the best case scenario is more important since, for the worst
case, one could just disable SMT, as said above.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
Andrew Cooper
2018-10-25 16:55:42 UTC
Permalink
Post by Tamas K Lengyel
Could you shed some light on what tests were done where that 60%
performance hit was observed? We have performed intensive stress tests
to confirm this, but according to our findings, turning off
hyper-threading actually improves performance on all machines we have
tested thus far.
Aggregate inter- and intra-host disk and network throughput, which is
a reasonable approximation of a load of webserver VMs on a single
physical server.  Small packet IO was hit worst, as it has a very high
vcpu context switch rate between dom0 and domU.  Disabling HT means you
have half the number of logical cores to schedule on, which doubles the
mean time to next timeslice.
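
As a back-of-the-envelope illustration of that scheduling effect (a
crude round-robin model, not a simulation; the 64-vcpu load and the
16-core box are assumptions, while 30ms is credit1's default
timeslice):

    # Crude model: N runnable vcpus sharing C logical CPUs round-robin.
    # Mean time until a given vcpu next runs scales as (N / C) * slice,
    # so halving the logical CPUs (HT off) doubles it.
    TIMESLICE_MS = 30       # credit1's default timeslice
    RUNNABLE_VCPUS = 64     # assumed load, for illustration only

    def mean_wait_ms(logical_cpus):
        return RUNNABLE_VCPUS / logical_cpus * TIMESLICE_MS

    print(mean_wait_ms(32))  # HT on,  16 cores * 2 threads:  60.0ms
    print(mean_wait_ms(16))  # HT off, 16 cores:             120.0ms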

In principle, for a fully optimised workload, HT gets you ~30% extra due
to increased utilisation of the pipeline functional units.  Some
resources are statically partitioned, while some are competitively
shared, and it's now been well proven that actions on one thread can have
a large effect on others.

Two arbitrary vcpus are not an optimised workload.  If the perf
improvement you get from not competing in the pipeline is greater than
the perf loss from Xen's reduced capability to schedule, then disabling
HT would be an improvement.  I can certainly believe that this might be
the case for Qubes style workloads where you are probably not very
overprovisioned, and you probably don't have long running IO and CPU
bound tasks in the VMs.

~Andrew
George Dunlap
2018-10-25 17:01:55 UTC
Permalink
Post by Andrew Cooper
In principle, for a fully optimised workload, HT gets you ~30% extra due
to increased utilisation of the pipeline functional units.  Some
resources are statically partitioned, while some are competitively
shared, and it's now been well proven that actions on one thread can have
a large effect on others.
Two arbitrary vcpus are not an optimised workload.  If the perf
improvement you get from not competing in the pipeline is greater than
the perf loss from Xen's reduced capability to schedule, then disabling
HT would be an improvement.  I can certainly believe that this might be
the case for Qubes style workloads where you are probably not very
overprovisioned, and you probably don't have long running IO and CPU
bound tasks in the VMs.
As another data point, I think it was MSCI who said they always disabled
hyperthreading, because they also found that their workloads ran slower
with HT than without. Presumably they were doing massive number
crunching, such that each thread was waiting on the ALU a significant
portion of the time anyway; at which point the superscalar scheduling
and/or reduction in cache efficiency would have brought performance from
"no benefit" down to "negative benefit".

-George
Tamas K Lengyel
2018-10-25 17:35:18 UTC
Permalink
Post by George Dunlap
As another data point, I think it was MSCI who said they always disabled
hyperthreading, because they also found that their workloads ran slower
with HT than without. Presumably they were doing massive number
crunching, such that each thread was waiting on the ALU a significant
portion of the time anyway; at which point the superscalar scheduling
and/or reduction in cache efficiency would have brought performance from
"no benefit" down to "negative benefit".
Thanks for the insights. Indeed, we are primarily concerned with the
performance of Qubes-style workloads, which may range from no
oversubscription to heavily oversubscribed. It's not a workload we
can predict or optimize beforehand, so we are looking for a default
that would be 1) safe and 2) performant in the most general case
possible.

Tamas
Andrew Cooper
2018-10-25 17:43:00 UTC
Permalink
Post by Tamas K Lengyel
Thanks for the insights. Indeed, we are primarily concerned with the
performance of Qubes-style workloads, which may range from no
oversubscription to heavily oversubscribed. It's not a workload we
can predict or optimize beforehand, so we are looking for a default
that would be 1) safe and 2) performant in the most general case
possible.
So long as you've got the XSA-273 patches, you should be able to park
and reactivate hyperthreads using `xen-hptool cpu-{online,offline} $CPU`.

You should be able to effectively change the hyperthreading
configuration at runtime.  It's not quite the same as changing it in
the BIOS, but as far as competition for pipeline resources goes, it
should be good enough.
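
As a rough sketch of parking one thread per core from dom0
(assumptions: `xl info -n` prints a cpu_topology table with
`cpu: core socket node` rows, which varies by version, so check the
parsing on your system; xen-hptool comes from the Xen tools build):

    # Offline every pcpu whose (socket, core) pair has already been
    # seen, i.e. park all secondary hyperthreads. The parsing of
    # `xl info -n` output below is an assumption about its format.
    import re
    import subprocess

    out = subprocess.check_output(["xl", "info", "-n"], text=True)
    seen = set()
    for line in out.splitlines():
        m = re.match(r"\s*(\d+):\s+(\d+)\s+(\d+)\s+(\d+)\s*$", line)
        if not m:
            continue
        cpu, core, socket = int(m.group(1)), int(m.group(2)), int(m.group(3))
        if (socket, core) in seen:
            subprocess.run(["xen-hptool", "cpu-offline", str(cpu)],
                           check=True)
        else:
            seen.add((socket, core))

The reverse experiment just onlines the same set of pcpus again with
`xen-hptool cpu-online`.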

~Andrew
Tamas K Lengyel
2018-10-25 17:58:05 UTC
Permalink
On Thu, Oct 25, 2018 at 11:43 AM Andrew Cooper
Post by Andrew Cooper
So long as you've got the XSA-273 patches, you should be able to park
and reactivate hyperthreads using `xen-hptool cpu-{online,offline} $CPU`.
You should be able to effectively change the hyperthreading
configuration at runtime. It's not quite the same as changing it in
the BIOS, but as far as competition for pipeline resources goes, it
should be good enough.
Thanks, indeed that is a handy tool to have. We often can't disable
hyperthreading in the BIOS anyway, because most BIOSes don't allow you
to do that when TXT is used. That said, with this tool we still need
some way to determine when to park/reactivate hyperthreads. We could
certainly park hyperthreads when we see the system being oversubscribed
in terms of the number of active vCPUs, but for real optimization we
would have to understand the workloads running within the VMs, if I
understand correctly?

Tamas
Andrew Cooper
2018-10-25 18:13:01 UTC
Permalink
Post by Tamas K Lengyel
Thanks, indeed that is a handy tool to have. We often can't disable
hyperthreading in the BIOS anyway, because most BIOSes don't allow you
to do that when TXT is used.
Hmm - that's an odd restriction.  I don't immediately see why such a
restriction would be necessary.
Post by Tamas K Lengyel
That said, with this tool we still need some way to determine when to
park/reactivate hyperthreads. We could certainly park hyperthreads when
we see the system being oversubscribed in terms of the number of active
vCPUs, but for real optimization we would have to understand the
workloads running within the VMs, if I understand correctly?
TBH, I'd perhaps start with an admin control which lets them switch
between the two modes, and some instructions on how/why they might want
to try switching.

Trying to second-guess the best HT setting automatically is most likely
going to be a lost cause.  It will be system-specific whether the
same workload is better with or without HT.

~Andrew
Tamas K Lengyel
2018-10-25 18:35:51 UTC
Permalink
On Thu, Oct 25, 2018 at 12:13 PM Andrew Cooper
TBH, I'd perhaps start with an admin control which lets them switch
between the two modes, and some instructions on how/why they might want
to try switching.
Trying to second-guess the best HT setting automatically is most likely
going to be a lost cause. It will be system-specific whether the
same workload is better with or without HT.
In the end this may just not be practically possible, as the system
administrator may have no idea what workload will be running on any
given system. It may also vary from one user to the next on the same
system, without the users being allowed to tune such details of the
system. If we can show that, with core-scheduling deployed,
performance improves by x% for most workloads, it may be a safe
option. But if every system needs to be tuned and evaluated in terms
of its eventual workload, that task becomes problematic. I appreciate
the insights though!

Tamas
Andrew Cooper
2018-10-25 18:39:52 UTC
Permalink
Post by Tamas K Lengyel
In the end this may just not be practically possible, as the system
administrator may have no idea what workload will be running on any
given system. It may also vary from one user to the next on the same
system, without the users being allowed to tune such details of the
system. If we can show that, with core-scheduling deployed,
performance improves by x% for most workloads, it may be a safe
option. But if every system needs to be tuned and evaluated in terms
of its eventual workload, that task becomes problematic. I appreciate
the insights though!
To a first approximation, a superuser knob to "switch between single
and dual threaded mode" can be used by people to experiment with which
is faster overall.

If it really is the case that disabling HT makes things faster, then
you've suddenly gained (almost-)core scheduling "for free" alongside
that perf improvement.

~Andrew
Dario Faggioli
2018-10-26 07:49:15 UTC
Permalink
Post by Tamas K Lengyel
In the end this may just not be practically possible, as the system
administrator may have no idea what workload will be running on any
given system. It may also vary from one user to the next on the same
system, without the users being allowed to tune such details of the
system. If we can show that, with core-scheduling deployed,
performance improves by x% for most workloads, it may be a safe option.
I haven't done this kind of benchmark yet, but I'd say that, if every
vCPU of every domain is doing 100% CPU intensive work, core-scheduling
isn't going to make much difference, or help you much, as compared to
regular scheduling with hyperthreading enabled.

Actual numbers may vary depending on whether VMs have an odd or even
number of vCPUs but, e.g., on hardware with 2 threads per core, and
using VMs with at least 2 vCPUs each, a _perfect_ implementation of
core-scheduling would still manage to keep all the *threads* busy,
which is --as far as our speculations currently go-- what is causing
the performance degradation you're seeing.

So, again, if it is confirmed that this workload of yours is a
particularly bad one for SMT, then you are just better off disabling
hyperthreading. And, no, I don't think such a situation is common
enough to say "let's disable it for everyone by default".
Post by Tamas K Lengyel
But
if every system needs to be tuned and evaluated in terms of its
eventual workload, that task becomes problematic.
So, the scheduler has a notion of system load (at least, Credit2
does), and it is in theory possible to put together some heuristics
for basically no longer using hyperthreading under certain conditions.
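
Purely as an illustration of the shape such a heuristic could take
(done here as a dom0 loop rather than inside the scheduler; the core
count, the sibling enumeration, the vCPU-counting proxy and the poll
interval are all made-up assumptions):

    # Toy heuristic: park sibling threads when running domains' vCPUs
    # outnumber physical cores, un-park them when the pressure drops.
    # PHYS_CORES, SIBLINGS and the `xl list` proxy are assumptions.
    import subprocess
    import time

    PHYS_CORES = 16
    SIBLINGS = range(PHYS_CORES, 2 * PHYS_CORES)  # assumed layout

    def total_vcpus():
        # Crude load proxy: sum the VCPUs column of `xl list`.
        rows = subprocess.check_output(["xl", "list"],
                                       text=True).splitlines()[1:]
        return sum(int(r.split()[3]) for r in rows if r.strip())

    parked = False
    while True:
        busy = total_vcpus() > PHYS_CORES
        if busy != parked:
            cmd = "cpu-offline" if busy else "cpu-online"
            for c in SIBLINGS:
                subprocess.run(["xen-hptool", cmd, str(c)])
            parked = busy
        time.sleep(10)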

This, however, I see as something completely orthogonal to
security-related considerations and to core-scheduling.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
Tamas K Lengyel
2018-10-26 12:01:46 UTC
Permalink
Post by Dario Faggioli
I haven't done this kind of benchmark yet, but I'd say that, if every
vCPU of every domain is doing 100% CPU intensive work, core-scheduling
isn't going to make much difference, or help you much, as compared to
regular scheduling with hyperthreading enabled.
Understood; we actually went into this with the assumption that in such
cases core-scheduling would underperform plain credit1. The idea was to
measure the worst case with plain scheduling and with core-scheduling,
to be able to see the difference between the two clearly.
Post by Dario Faggioli
So, again, if it is confirmed that this workload of yours is a
particularly bad one for SMT, then you are just better off disabling
hyperthreading. And, no, I don't think such a situation is common
enough to say "let's disable it for everyone by default".
I wasn't asking to make it the default in Xen, but whether it would be
reasonable to make it the default for our deployment, where such
workloads are entirely possible. Again, we don't know the workload and
we can't predict it. We were hoping to use core-scheduling eventually,
but it was not expected that hyperthreading could cause such drops in
performance. If there are tests that I can run which are the "best
case" for hyperthreading, I would like to repeat those tests to see
where we are.

Thanks,
Tamas
Dario Faggioli
2018-10-26 14:17:39 UTC
Permalink
Post by Tamas K Lengyel
Post by Dario Faggioli
I haven't done this kind of benchmark yet, but I'd say that, if every
vCPU of every domain is doing 100% CPU intensive work, core-scheduling
isn't going to make much difference, or help you much, as compared to
regular scheduling with hyperthreading enabled.
Understood; we actually went into this with the assumption that in
such cases core-scheduling would underperform plain credit1.
Which may actually happen. Or it might improve things a little, because
there are higher chances that a core only has 1 thread busy. But then
we're not really benchmarking core-scheduling vs. plain-scheduling,
we're benchmarking a side-effect of core-scheduling, which is not
equally interesting.
Post by Tamas K Lengyel
The idea was to measure the worst case with plain scheduling and with
core-scheduling, to be able to see the difference between the two
clearly.
For the sake of benchmarking core-scheduling solutions, we should put
ourselves in a position where what we measure is actually its own
impact, and I don't think this very workload puts us there.

Then, of course, if this workload is relevant to you, you have every
right to benchmark and evaluate it, and we're always interested in
hearing what you find out. :-)
Post by Tamas K Lengyel
Post by Dario Faggioli
Actual numbers may vary depending on whether VMs have an odd or even
number of vCPUs but, e.g., on hardware with 2 threads per core, and
using VMs with at least 2 vCPUs each, a _perfect_ implementation of
core-scheduling would still manage to keep all the *threads* busy,
which is --as far as our speculations currently go-- what is causing
the performance degradation you're seeing.
So, again, if it is confirmed that this workload of yours is a
particularly bad one for SMT, then you are just better off disabling
hyperthreading. And, no, I don't think such a situation is common
enough to say "let's disable it for everyone by default".
I wasn't asking to make it the default in Xen, but whether it would be
reasonable to make it the default for our deployment, where such
workloads are entirely possible.
It all comes down to how common a situation is where you have a
massively oversubscribed system, with a fully CPU-bound workload, for
significant chunks of time.

As said in a previous email, I think that, if this is common enough,
and it is not something just transient, you're in trouble anyway.
And if it's not causing you/your customers trouble already, it might
not be that common, and hence it wouldn't be necessary/wise to disable
SMT.

But of course, you know your workload, and your requirements, much more
than me. If this kind of load really is what you experience, or what
you want to target, then yes, apparently disabling SMT is your best way
to go.
Post by Tamas K Lengyel
If there are
tests that I can run which are the "best case" for hyperthreading, I
would like to repeat those tests to see where we are.
If we come up with a good enough synthetic benchmark, I'll let you
know.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/
George Dunlap
2018-10-26 10:11:18 UTC
Permalink
Post by Andrew Cooper
TBH, I'd perhaps start with an admin control which lets them switch
between the two modes, and some instructions on how/why they might want
to try switching.
Trying to second-guess the best HT setting automatically is most likely
going to be a lost cause.  It will be system-specific whether the
same workload is better with or without HT.
There may be hardware-specific performance counters that could be used
to detect when pathological cases are happening. But that would need to
be implemented and/or re-verified on basically every new piece of hardware.

-George
Wei Liu
2018-12-07 18:40:52 UTC
Permalink
Post by Andrew Cooper
As an interesting point to note.  The 32bit PV ABI prohibits sharing of
L3 pagetables, because back in the 32bit hypervisor days, we used to
have linear mappings in the Xen virtual range.  This check is stale
(from a functionality point of view), but still present in Xen.  A
consequence of this is that 32bit PV guests definitely don't share
top-level pagetables across vcpus.
Correction: the 32bit PV ABI prohibits sharing of L2 pagetables, but
L3 pagetables can be shared. So guests will schedule the same
top-level pagetables across vcpus.

But 64bit Xen creates a monitor table for a 32bit PAE guest and puts
the CR3 provided by the guest into the first slot, so pcpus don't
share the same L4 pagetables. The property we want still holds.
Post by Andrew Cooper
Juergen/Boris: Do you have any idea if/how easy this infrastructure
would be to implement for 64bit PV guests as well?  If a PV guest can
advertise via Elfnote that it won't share top-level pagetables, then we
can audit this trivially in Xen.
After reading the Linux kernel code, I think it is not going to be
trivial, as threads in Linux currently share one pagetable (as they
should).

In order to give each thread its own pagetable while still maintaining
the illusion of one address space, there needs to be synchronisation
under the hood.

There is code in Linux to synchronise vmalloc, but that's only for the
kernel portion. The infrastructure to synchronise the userspace
portion is missing.

One idea is to follow the same model as vmalloc -- maintain a
reference pagetable in struct mm and a list of pagetables for the
threads, then synchronise the pagetables in the page fault handler.
But this is probably a bit hard to sell to the Linux maintainers,
because it will touch a lot of non-Xen code, increase complexity and
decrease performance.
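
To make the shape of that scheme concrete, a toy model (purely
illustrative; real pagetables, faults and locking look nothing like
this): the reference table holds the truth, and each thread's private
table is lazily filled from it in the "fault handler".

    # Toy model of the proposed scheme: a reference pagetable per mm,
    # one private table per thread, reconciled on page fault.
    class MM:
        def __init__(self):
            self.reference = {}       # vpage -> frame, the master copy
            self.thread_tables = []   # one private mapping per thread

        def new_thread(self):
            table = {}                # starts empty, filled on demand
            self.thread_tables.append(table)
            return table

        def map_page(self, vpage, frame):
            # Updates land in the reference table; threads catch up
            # lazily, one fault at a time.
            self.reference[vpage] = frame

        def handle_fault(self, table, vpage):
            # Miss: copy the entry from the reference table if present
            # (the "synchronise in the fault handler" step), otherwise
            # it is a genuine fault.
            if vpage in self.reference:
                table[vpage] = self.reference[vpage]
                return table[vpage]
            raise MemoryError("genuine fault at %#x" % vpage)

    mm = MM()
    t1, t2 = mm.new_thread(), mm.new_thread()
    mm.map_page(0x1000, 42)
    assert mm.handle_fault(t1, 0x1000) == 42   # t1 syncs on first touch
    assert mm.handle_fault(t2, 0x1000) == 42   # t2 likewise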

Thoughts?

Wei.
George Dunlap
2018-12-10 12:19:18 UTC
Permalink
Post by Wei Liu
Correction: the 32bit PV ABI prohibits sharing of L2 pagetables, but L3
pagetables can be shared.  So guests can schedule the same top-level
pagetable across vcpus.
But 64bit Xen creates a monitor table for each 32bit PAE guest vcpu and
puts the CR3 provided by the guest into the first slot, so pcpus don't
share the same L4 pagetable.  The property we want still holds.
Ah, right -- but Xen can get away with this because in PAE mode, the
"L3" is just 4 entries that are loaded on a CR3 switch and not
automatically kept in sync by the hardware; i.e., the OS already needs
to do its own "manual syncing" if it updates any of the L3 entries, so
it's the same for Xen.
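To make "manual syncing" concrete, it amounts to something like this --
an illustrative sketch, not code from either codebase:

    /* The four PDPT ("L3") entries are cached inside the processor
     * when %cr3 is loaded, so updating one in memory has no effect
     * until %cr3 is written again. */
    #include <stdint.h>

    extern uint64_t pdpt[4];            /* the 4-entry PAE "L3" */

    static inline unsigned long read_cr3(void)
    {
        unsigned long val;
        asm volatile ("mov %%cr3, %0" : "=r" (val));
        return val;
    }

    static inline void write_cr3(unsigned long val)
    {
        asm volatile ("mov %0, %%cr3" :: "r" (val) : "memory");
    }

    static void set_pdpt_entry(int i, uint64_t entry)
    {
        pdpt[i] = entry;
        write_cr3(read_cr3());          /* force a re-read of the PDPT */
    }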
Post by Wei Liu
Post by Andrew Cooper
Juergen/Boris: Do you have any idea if/how easy this infrastructure
would be to implement for 64bit PV guests as well?  If a PV guest can
advertise via Elfnote that it won't share top-level pagetables, then we
can audit this trivially in Xen.
After reading the Linux kernel code, I think it is not going to be
trivial, because threads in Linux currently share one pagetable (as
they should).
In order to give each thread its own pagetable while still maintaining
the illusion of one address space, there needs to be synchronisation
under the hood.
There is code in Linux to synchronise the vmalloc area, but that only
covers the kernel portion of the address space.  The infrastructure to
synchronise the userspace portion is missing.
One idea is to follow the same model as vmalloc -- maintain a reference
pagetable in struct mm and a list of pagetables for the threads, then
synchronise the pagetables in the page fault handler.  But this is
probably a hard sell to the Linux maintainers, because it would touch a
lot of non-Xen code, increase complexity and decrease performance.
Sorry -- what do you mean "synchronize vmalloc"? If every thread has a
different view of the kernel's vmalloc area, then every thread must have
a different L4 table, right? And if every thread has a different L4
table, then we've already got the main thing we need from Linux, don't we?
Just had an IRL chat with Wei: the synchronization he was talking about
is a synchronization *of the kernel space* *between processes*.  What
we would need in Linux is a synchronization *of userspace* *between
threads*.  So the same basic idea is there, but it would require a
reasonable amount of extra work.

Since the work that would need to be done in Linux is exactly the same
work that we'd need to do in Xen, I think the Linux maintainers would be
pretty annoyed if we asked them to do it instead of doing it ourselves.

-George
