Discussion:
RFH: Kernel OOPS in xen_netbk_rx_action / xenvif_gop_skb
Philipp Hahn
2014-06-06 10:26:55 UTC
Hello,

on one of our hosts (Xen-4.1.3 with Linux-3.10.26 + Debian patches)
running 16 Linux VMs (linux-3.2.39 and others), netback crashed with the following oops:
[38551.549615] Oops: 0000 [#1] SMP
[38551.549665] Modules linked in: tun xt_physdev xen_blkback xen_netback ip6_tables
iptable_filter ip_tables ebtable_nat ebtables x_tables xen_gntdev nfsv3 nfsv4
rpcsec_gss_krb5 nfsd nfs_acl auth_rpcgss oid_registry nfs fscache dns_resolver lockd
sunrpc fuse loop xen_blkfront xen_evtchn blktap quota_v2 quota_tree xenfs xen_privcmd
coretemp crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
glue_helper aes_x86_64 snd_pcm snd_timer snd soundcore snd_page_alloc tpm_tis tpm lpc_ich
tpm_bios i7core_edac i2c_i801 psmouse microcode edac_core serio_raw pcspkr mperf ioatdma
mfd_core processor evdev thermal_sys ext4 jbd2 crc16 bonding bridge stp llc dm_snapshot
dm_mirror dm_region_hash dm_log dm_mod sd_mod crc_t10dif ehci_pci uhci_hcd ehci_hcd mptsas
mptscsih mptbase scsi_transport_sas usbcore usb_common igb dca i2c_algo_bit i2c_core ptp
pps_core button
[38551.550601] CPU: 0 PID: 12587 Comm: netback/0 Not tainted 3.10.0-ucs58-amd64 #1 Debian
3.10.11-1.58.201405060908
[38551.550693] Hardware name: FUJITSU PRIMERGY BX620 S6/D3051, BIOS 080015 Rev.3C78.3051
07/22/2011
[38551.550781] task: ffff880004b067c0 ti: ffff8800561ec000 task.ti: ffff8800561ec000
[38551.550865] RIP: e030:[<ffffffffa04147dc>] [<ffffffffa04147dc>]
xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
[38551.550959] RSP: e02b:ffff8800561edce8 EFLAGS: 00010202
[38551.551009] RAX: ffffc900104adac0 RBX: ffff8800541e95c0 RCX: ffffc90010864000
[38551.551064] RDX: 000000000000003b RSI: 0000000000000000 RDI: ffff880040014380
[38551.551120] RBP: ffff8800570e6800 R08: 0000000000000000 R09: ffff880004799800
[38551.551175] R10: ffffffff813ca115 R11: ffff88005e4fdb08 R12: ffff880054e6f800
[38551.551231] R13: ffff8800561edd58 R14: ffffc900104a1000 R15: 0000000000000000
[38551.551289] FS: 00007f19a54a8700(0000) GS:ffff88005da00000(0000)
knlGS:0000000000000000
[38551.551374] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[38551.551425] CR2: ffffc900108641d8 CR3: 0000000054cb3000 CR4: 0000000000002660
[38551.551481] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[38551.551537] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[38551.551630] ffff880004b06ba0 0000000000000000 ffff88005da13ec0 ffff88005da13ec0
[38551.551726] 0000000004b067c0 ffffc900104a8ac0 ffffc900104a1020 000000005da13ec0
[38551.551823] 0000000000000000 0000000000000001 ffffc900104a8ac0 ffffc900104adac0
[38551.551966] [<ffffffff813ca32d>] ? _raw_spin_lock_irqsave+0x11/0x2f
[38551.552021] [<ffffffffa0416033>] ? xen_netbk_kthread+0x174/0x841 [xen_netback]
[38551.552106] [<ffffffff8105d373>] ? wake_up_bit+0x20/0x20
[38551.560239] [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
[38551.560325] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[38551.560381] [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
[38551.560466] [<ffffffff8105ce1e>] ? kthread+0xab/0xb3
[38551.560518] [<ffffffff81003638>] ? xen_end_context_switch+0xe/0x1c
[38551.560572] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[38551.560628] [<ffffffff813cfbfc>] ? ret_from_fork+0x7c/0xb0
[38551.560680] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[38551.560734] Code: 8b b3 d0 00 00 00 48 8b bb d8 00 00 00 0f b7 74 37 02 89 70 08 eb 07
c7 40 08 00 00 00 00 89 d2 c7 40 04 00 00 00 00 48 83 c2 08 <0f> b7 34 d1 89 30 c7 44 24
60 00 00 00 00 8b 44 d1 04 89 44 24
[38551.561151] RIP [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
[38551.561238] RSP <ffff8800561edce8>
[38551.561283] CR2: ffffc900108641d8
[38551.561624] ---[ end trace 8c260c6af259c4aa ]---
The host itself is still alive and reachable over the network, but all VMs
are no longer reachable.
The crash does not happen on every reboot: the VM had been running fine for
1½ weeks after a dom0 kernel update, but has now crashed on each of the
past two nights.

I'm as yet unable to reproduce this on demand, but would like to be
prepared for the next time it happens.

@Ian: I found your mail "Re: [Xen-devel] Kernel 3.7.0-pre-rc1 kernel BUG
at drivers/net/xen-netback/netback.c:405 RIP: e030:[<ffffffff814714f9>]
[<ffffffff814714f9>] netbk_gop_frag_copy+0x379/0x380" from 2012-10-09,
which describes a crash in the same function, but at a completely
different (later) location. You hinted that a difference in hardware
might explain why I'm unable to reproduce it, as my test environment
has different HW (no "igb", but "e1000e").
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:606
meta->gso_size = skb_shinfo(skb)->gso_size;
7b1: 8b b3 d0 00 00 00 mov 0xd0(%rbx),%esi
7b7: 48 8b bb d8 00 00 00 mov 0xd8(%rbx),%rdi
7be: 0f b7 74 37 02 movzwl 0x2(%rdi,%rsi,1),%esi
7c3: 89 70 08 mov %esi,0x8(%rax)
7c6: eb 07 jmp 7cf <xen_netbk_rx_action+0x17e>
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:608
else
meta->gso_size = 0;
7c8: c7 40 08 00 00 00 00 movl $0x0,0x8(%rax)
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:611
meta->size = 0;
meta->id = req->id;
7cf: 89 d2 mov %edx,%edx
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:610
if (!vif->gso_prefix)
meta->gso_size = skb_shinfo(skb)->gso_size;
else
meta->gso_size = 0;
meta->size = 0;
7d1: c7 40 04 00 00 00 00 movl $0x0,0x4(%rax)
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:611
meta->id = req->id;
7d8: 48 83 c2 08 add $0x8,%rdx
7dc: 0f b7 34 d1 movzwl (%rcx,%rdx,8),%esi
0x651 + 0x18B = 0x7DC
7e0: 89 30 mov %esi,(%rax)
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:612
npo->copy_off = 0;
7e2: c7 44 24 60 00 00 00 movl $0x0,0x60(%rsp)
7e9: 00
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:613
npo->copy_gref = req->gref;
7ea: 8b 44 d1 04 mov 0x4(%rcx,%rdx,8),%eax
7ee: 89 44 24 64 mov %eax,0x64(%rsp)
Ignoring the rename from {netbk -> xenvif}_gop_skb() and the addition of
GSO for IPv6, the function looks unchanged compared to current git, so this
might still be a problem in the current implementation.
I tried to review the git commits myself but didn't see anything obvious,
and with all the recent changes to netback I'm unsure how best to proceed:
1. Is this a known bug and has someone observed it, too?
2. If yes, is there a fix in newer Linux kernels?
3. If not, what data should I collect in addition?

The Xen hypervisor is 4.1.3 from Debian, but as this is a kernel crash I
don't expect a newer version of Xen to fix it (correct me if I'm wrong).

Thanks in advance.

Philipp

PS: I'm not afraid of getting my hands dirty doing Linux coding, but
currently I'm out of ideas on how best to proceed.
--
Philipp Hahn
Open Source Software Engineer

Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen
Tel.: +49 421 22232-0
Fax : +49 421 22232-99
***@univention.de

http://www.univention.de/
Geschäftsführer: Peter H. Ganten
HRB 20755 Amtsgericht Bremen
Steuer-Nr.: 71-597-02876
Wei Liu
2014-06-06 10:58:04 UTC
Post by Philipp Hahn
Hello,
on one of our hosts (Xen-4.1.3 with Linux-3.10.26 + Debian patches)
running 16 Linux VMs (linux-3.2.39 and others) netback crashes during
[38551.549615] Oops: 0000 [#1] SMP
Is there any more output above this line? Is it a NULL pointer
dereference or something else?
Post by Philipp Hahn
[38551.549665] Modules linked in: tun xt_physdev xen_blkback xen_netback ip6_tables
iptable_filter ip_tables ebtable_nat ebtables x_tables xen_gntdev nfsv3 nfsv4
rpcsec_gss_krb5 nfsd nfs_acl auth_rpcgss oid_registry nfs fscache dns_resolver lockd
sunrpc fuse loop xen_blkfront xen_evtchn blktap quota_v2 quota_tree xenfs xen_privcmd
coretemp crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
glue_helper aes_x86_64 snd_pcm snd_timer snd soundcore snd_page_alloc tpm_tis tpm lpc_ich
tpm_bios i7core_edac i2c_i801 psmouse microcode edac_core serio_raw pcspkr mperf ioatdma
mfd_core processor evdev thermal_sys ext4 jbd2 crc16 bonding bridge stp llc dm_snapshot
dm_mirror dm_region_hash dm_log dm_mod sd_mod crc_t10dif ehci_pci uhci_hcd ehci_hcd mptsas
mptscsih mptbase scsi_transport_sas usbcore usb_common igb dca i2c_algo_bit i2c_core ptp
pps_core button
[38551.550601] CPU: 0 PID: 12587 Comm: netback/0 Not tainted 3.10.0-ucs58-amd64 #1 Debian
3.10.11-1.58.201405060908
[38551.550693] Hardware name: FUJITSU PRIMERGY BX620 S6/D3051, BIOS 080015 Rev.3C78.3051
07/22/2011
[38551.550781] task: ffff880004b067c0 ti: ffff8800561ec000 task.ti: ffff8800561ec000
[38551.550865] RIP: e030:[<ffffffffa04147dc>] [<ffffffffa04147dc>]
xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
Try addr2line?
Post by Philipp Hahn
[38551.550959] RSP: e02b:ffff8800561edce8 EFLAGS: 00010202
[38551.551009] RAX: ffffc900104adac0 RBX: ffff8800541e95c0 RCX: ffffc90010864000
[38551.551064] RDX: 000000000000003b RSI: 0000000000000000 RDI: ffff880040014380
[38551.551120] RBP: ffff8800570e6800 R08: 0000000000000000 R09: ffff880004799800
[38551.551175] R10: ffffffff813ca115 R11: ffff88005e4fdb08 R12: ffff880054e6f800
[38551.551231] R13: ffff8800561edd58 R14: ffffc900104a1000 R15: 0000000000000000
[38551.551289] FS: 00007f19a54a8700(0000) GS:ffff88005da00000(0000)
knlGS:0000000000000000
[38551.551374] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[38551.551425] CR2: ffffc900108641d8 CR3: 0000000054cb3000 CR4: 0000000000002660
[38551.551481] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[38551.551537] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[38551.551630] ffff880004b06ba0 0000000000000000 ffff88005da13ec0 ffff88005da13ec0
[38551.551726] 0000000004b067c0 ffffc900104a8ac0 ffffc900104a1020 000000005da13ec0
[38551.551823] 0000000000000000 0000000000000001 ffffc900104a8ac0 ffffc900104adac0
[38551.551966] [<ffffffff813ca32d>] ? _raw_spin_lock_irqsave+0x11/0x2f
[38551.552021] [<ffffffffa0416033>] ? xen_netbk_kthread+0x174/0x841 [xen_netback]
[38551.552106] [<ffffffff8105d373>] ? wake_up_bit+0x20/0x20
[38551.560239] [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
[38551.560325] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[38551.560381] [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
[38551.560466] [<ffffffff8105ce1e>] ? kthread+0xab/0xb3
[38551.560518] [<ffffffff81003638>] ? xen_end_context_switch+0xe/0x1c
[38551.560572] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[38551.560628] [<ffffffff813cfbfc>] ? ret_from_fork+0x7c/0xb0
[38551.560680] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[38551.560734] Code: 8b b3 d0 00 00 00 48 8b bb d8 00 00 00 0f b7 74 37 02 89 70 08 eb 07
c7 40 08 00 00 00 00 89 d2 c7 40 04 00 00 00 00 48 83 c2 08 <0f> b7 34 d1 89 30 c7 44 24
60 00 00 00 00 8b 44 d1 04 89 44 24
[38551.561151] RIP [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
[38551.561238] RSP <ffff8800561edce8>
[38551.561283] CR2: ffffc900108641d8
[38551.561624] ---[ end trace 8c260c6af259c4aa ]---
The host itself is still alive and reachable by network, but all VMs are
no longer reachable.
The crash does not happen on every reboot: The VM was running fine for
1½ week after a dom0 kernel update, but now crashed the following past
two nights.
What's the Dom0 kernel version before upgrading? That would help us
narrow down the range of changesets.

The oops happens in the guest receive path. Unfortunately that's a very
complex function; it's hard to identify the problem just by looking at the
code.

And as you seem to be using a distro kernel, have you reported this to
Debian yet? I don't quite understand which Debian release has a 3.10
kernel, though.
Post by Philipp Hahn
I'm yet unable to reproduce this on demand, but would like to prepared
next time it happens again.
@Ian: I found your mail "Re: [Xen-devel] Kernel 3.7.0-pre-rc1 kernel BUG
at drivers/net/xen-netback/netback.c:405 RIP: e030:[<ffffffff814714f9>]
[<ffffffff814714f9>] netbk_gop_frag_copy+0x379/0x380" from 2012-10-09,
which describes a crash in the same function, but at a complete
different (later) location. You hinted that a difference in hardware
might explain, why I'm unable to reproduce it, as my test environment
has different HW (no "igb", but "e1000e").
3.7.0 is too old. There has been lots of changes since then.
Post by Philipp Hahn
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:606
meta->gso_size = skb_shinfo(skb)->gso_size;
7b1: 8b b3 d0 00 00 00 mov 0xd0(%rbx),%esi
7b7: 48 8b bb d8 00 00 00 mov 0xd8(%rbx),%rdi
7be: 0f b7 74 37 02 movzwl 0x2(%rdi,%rsi,1),%esi
7c3: 89 70 08 mov %esi,0x8(%rax)
7c6: eb 07 jmp 7cf <xen_netbk_rx_action+0x17e>
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:608
You mentioned 3.10.26 at the beginning but now it's 3.10.11? I'm
confused.

If it's dereferencing a NULL pointer, is skb_shinfo(skb) == NULL?
Post by Philipp Hahn
else
meta->gso_size = 0;
7c8: c7 40 08 00 00 00 00 movl $0x0,0x8(%rax)
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:611
meta->size = 0;
meta->id = req->id;
7cf: 89 d2 mov %edx,%edx
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:610
if (!vif->gso_prefix)
meta->gso_size = skb_shinfo(skb)->gso_size;
else
meta->gso_size = 0;
meta->size = 0;
7d1: c7 40 04 00 00 00 00 movl $0x0,0x4(%rax)
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:611
meta->id = req->id;
7d8: 48 83 c2 08 add $0x8,%rdx
7dc: 0f b7 34 d1 movzwl (%rcx,%rdx,8),%esi
0x651 + 0x18B = 0x7DC
7e0: 89 30 mov %esi,(%rax)
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:612
npo->copy_off = 0;
7e2: c7 44 24 60 00 00 00 movl $0x0,0x60(%rsp)
7e9: 00
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:613
npo->copy_gref = req->gref;
7ea: 8b 44 d1 04 mov 0x4(%rcx,%rdx,8),%eax
7ee: 89 44 24 64 mov %eax,0x64(%rsp)
Ignoring the name change from {netbk -> xenvif}_gop_skb() and the
addition of GSO for IPv6 the function looks unchanged compared to
current GIT, so to me it looks like it might still be a problem with the
current implementation.
I tried to review the GIT commits myself, but I didn't see anything
obvious, but with all the recent additional changes to netback I'm
1. Is this a known bug and has someone observed it, too?
Not that I know of.
Post by Philipp Hahn
2. If yes, is there a fix in newer Linux kernels?
3. If no, What data should I collect in addition?
There's one more patch that you can pick up from the 3.10.y tree. I doubt
it will make much difference though.

I think the first thing to do is to identify which line of code is
causing the problem. If it is actually the line you're referring to in
your analysis then we need to figure out why skb_shinfo(skb) is NULL...
Post by Philipp Hahn
Xen-Hypervisor is 4.1.3 from Debian, but as this is a kernel crash, I
don't expect a newer version of Xen to fix it (correct me if I'm wrong).
You're correct. Upgrading hypervisor won't help.

Wei.
Philipp Hahn
2014-06-06 22:12:32 UTC
Hello,
Post by Wei Liu
Post by Philipp Hahn
on one of our hosts (Xen-4.1.3 with Linux-3.10.26 + Debian patches)
running 16 Linux VMs (linux-3.2.39 and others) netback crashes during
[38551.549615] Oops: 0000 [#1] SMP
Is there any more output above this line? Is it a NULL pointer
dereference or something else?
Sorry, those lines got lost somehow during copy&paste:

[38551.547728] XXXlan0: port 9(vif26.0) entered disabled state
[38551.549365] BUG: unable to handle kernel paging request at
ffffc900108641d8
[38551.549461] IP: [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0
[xen_netback]
[38551.549551] PGD 57e20067 PUD 57e21067 PMD 571a7067 PTE 0
[38551.549615] Oops: 0000 [#1] SMP
Post by Wei Liu
Post by Philipp Hahn
[38551.550865] RIP: e030:[<ffffffffa04147dc>] [<ffffffffa04147dc>]
xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
Try addr2line?
Good to know, but since that host has already been rebooted, I no longer
know the module load address, which seems to render addr2line useless.
Post by Wei Liu
Post by Philipp Hahn
The host itself is still alive and reachable by network, but all VMs are
no longer reachable.
The crash does not happen on every reboot: The VM was running fine for
1½ week after a dom0 kernel update, but now crashed the following past
two nights.
What's the Dom0 kernel version before upgrading? That would help us
narrow down the range of changesets.
The previous kernel was 3.10.15. The update was performed to get another
bug fixed, which went into the Debian update between .11 and .26:

commit 0ff773f59ff375c42af2238457bda98ed4ddcd25
Author: David Vrabel <***@citrix.com>
Date: Wed Sep 11 14:52:48 2013 +0100
xen-netback: count number required slots for an skb more carefully
[ Upstream commit 6e43fc04a6bc357d260583b8440882f28069207f ]
Post by Wei Liu
The oops happens in guest receive path. Unfortunately that's a very
complex function, it's hard to identify the problem by looking at the
code.
And as you seem to be using a distro kernel, have your reported to
Debian yet? I don't quite understand which Debian release has 3.10
kernel though.
Actually this is "Univention Corporate Server" (UCS), which is
Debian-Squeeze based but with a newer Xen-4.1.3 and newer Linux-3.10 kernel.
Post by Wei Liu
3.7.0 is too old. There has been lots of changes since then.
Probably so, but thanks for the confirmation.
Post by Wei Liu
Post by Philipp Hahn
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:606
...
Post by Wei Liu
You mentioned 3.10.26 at the beginning but now it's 3.10.11? I'm
confused.
This has something to do with Debian patch policy: the first 3.10 kernel
was 3.10.11, so that number stays, even when it is patched up to 3.10.26.
Post by Wei Liu
If it's dereferencing NULL pointer, skb_shinfo(skb) == NULL?
Post by Philipp Hahn
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:611
meta->id = req->id;
7d8: 48 83 c2 08 add $0x8,%rdx
7dc: 0f b7 34 d1 movzwl (%rcx,%rdx,8),%esi
0x651 + 0x18B = 0x7DC
0x651 is the start of xen_netbk_rx_action() from objdump.

...
Post by Wei Liu
There's one more patch that you can pick up from 3.10.y tree. I doubt it
will make much difference though.
I think the first thing to do is to identify which line of code is
causing the problem. If it is actually the line you're referring to in
your analyse then we need to figure out why skb_shinfo(skb) is NULL...
I'll try to add some debug output to yell if skb_shinfo() is NULL, but
it might take some time until the bug manifests again.
Post by Wei Liu
Wei.
Thank you for your feedback.

Philipp
--
Philipp Hahn
Philipp Hahn
2014-06-18 16:48:31 UTC
Hello,

We are now more or less able to reproduce the OOPS within one hour:
[32918.795695] XXXlan0: port 3(vif18.0) entered disabled state
[32918.798732] BUG: unable to handle kernel paging request at ffffc90010da2188
[32918.798823] IP: [<ffffffffa04287dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
[32918.798911] PGD 95822067 PUD 95823067 PMD 94f47067 PTE 0
[32918.798974] Oops: 0000 [#1] SMP
[32918.799023] Modules linked in: xt_physdev xen_blkback xen_netback ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables xen_gntdev nfsv3 nfsv4 rpcsec_gss_krb5 nfsd nfs_acl auth_rpcgss oid_registry nfs fscache dns_resolver lockd sunrpc fuse loop xen_blkfront xen_evtchn blktap quota_v2 quota_tree xenfs xen_privcmd coretemp crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw snd_pcm gf128mul snd_timer glue_helper snd aes_x86_64 soundcore snd_page_alloc microcode tpm_tis tpm tpm_bios pcspkr lpc_ich mfd_core acpi_power_meter i7core_edac mperf serio_raw i2c_i801 evdev edac_core processor ioatdma thermal_sys ext4 jbd2 crc16 bonding bridge stp llc dm_snapshot dm_mirror dm_region_hash dm_log dm_mod sd_mod crc_t10dif hid_generic usbhid hid mptsas mptscsih mptbase scs
i_transport_sas ehci_pci button uhci_hcd ehci_hcd usbcore usb_common igb dca i2c_algo_bit i2c_core ptp pps_core
[32918.799958] CPU: 0 PID: 6450 Comm: netback/0 Not tainted 3.10.0-ucs58-amd64 #1 Debian 3.10.11-1.58.201405060908
[32918.800050] Hardware name: FUJITSU PRIMERGY BX920 S2/D3030, BIOS 080015 Rev.3D94.3030 10/09/2012
[32918.800137] task: ffff880093864880 ti: ffff88009266c000 task.ti: ffff88009266c000
[32918.800220] RIP: e030:[<ffffffffa04287dc>] [<ffffffffa04287dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
[32918.800314] RSP: e02b:ffff88009266dce8 EFLAGS: 00010212
[32918.800364] RAX: ffffc9001082dac0 RBX: ffff880004d86ac0 RCX: ffffc90010da2000
[32918.800419] RDX: 0000000000000031 RSI: 0000000000000000 RDI: ffff880004bdd280
[32918.800474] RBP: ffff8800932db800 R08: 0000000000000000 R09: ffff8800952f3800
[32918.800529] R10: 0000000000007ff0 R11: ffff88009c611380 R12: ffff8800932db800
[32918.800584] R13: ffff88009266dd58 R14: ffffc90010821000 R15: 0000000000000000
[32918.800642] FS: 00007f2f8fdcd700(0000) GS:ffff88009c600000(0000) knlGS:0000000000000000
[32918.800728] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[32918.800778] CR2: ffffc90010da2188 CR3: 0000000093eb0000 CR4: 0000000000002660
[32918.800834] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[32918.800889] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[32918.800981] ffff880093864c60 000000008106d2af ffff88009c613ec0 ffff88009c613ec0
[32918.801077] 0000000093864880 ffffc90010828ac0 ffffc90010821020 000000009c613ec0
[32918.801173] 0000000000000000 0000000000000001 ffffc90010828ac0 ffffc9001082dac0
[32918.801314] [<ffffffff813ca32d>] ? _raw_spin_lock_irqsave+0x11/0x2f
[32918.801368] [<ffffffffa042a033>] ? xen_netbk_kthread+0x174/0x841 [xen_netback]
[32918.801454] [<ffffffff8105d373>] ? wake_up_bit+0x20/0x20
[32918.801504] [<ffffffffa0429ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
[32918.801590] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[32918.801645] [<ffffffffa0429ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
[32918.801730] [<ffffffff8105ce1e>] ? kthread+0xab/0xb3
[32918.801781] [<ffffffff81003638>] ? xen_end_context_switch+0xe/0x1c
[32918.801834] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[32918.801890] [<ffffffff813cfbfc>] ? ret_from_fork+0x7c/0xb0
[32918.801941] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[32918.801995] Code: 8b b3 d0 00 00 00 48 8b bb d8 00 00 00 0f b7 74 37 02 89 70 08 eb 07 c7 40 08 00 00 00 00 89 d2 c7 40 04 00 00 00 00 48 83 c2 08 <0f> b7 34 d1 89 30 c7 44 24 60 00 00 00 00 8b 44 d1 04 89 44 24
[32918.802400] RIP [<ffffffffa04287dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
[32918.802486] RSP <ffff88009266dce8>
[32918.802529] CR2: ffffc90010da2188
[32918.802859] ---[ end trace baf81e34c52eb41c ]---
(gdb) list *(xen_netbk_rx_action+0x18b)
0xffffffffa04287dc is in xen_netbk_rx_action
(/var/build/temp/tmp.hW3dNilayw/pbuilder/linux-3.10.11/drivers/net/xen-netback/netback
.c:611).
606 meta->gso_size = skb_shinfo(skb)->gso_size;
607 else
608 meta->gso_size = 0;
609
610 meta->size = 0;
611 meta->id = req->id;
612 npo->copy_off = 0;
613 npo->copy_gref = req->gref;
614
615 data = skb->data;


After more debugging today I think something like this happens:

1. The VM is receiving packets through bonding + bridge + netback +
netfront.

2. For some unknown reason at least one packet remains in the rx queue
and is not delivered to the domU immediately by netback.

3. The VM finishes shutting down.

4. The shared ring between dom0 and domU is freed.

5. xen-netback then continues processing the pending requests and tries
to put the packet into the now already released shared ring.
From reading the attached disassembly I guess that
AX = &meta
CX = &rx->sring
DX =~ rx.req_cons
CR2 = &req->id
where
CX + DX * sizeof(union xen_netif_rx_{request,response}) (= 8) = CR2


Any additional ideas or insight is appreciated.

FYI: The host has only a single CPU and is running >=2 VMs so far.
Post by Wei Liu
There's one more patch that you can pick up from 3.10.y tree. I doubt it
will make much difference though.
Which patch are you referring to?

Sincerely
Philipp
Wei Liu
2014-06-19 14:12:52 UTC
On Wed, Jun 18, 2014 at 06:48:31PM +0200, Philipp Hahn wrote:
[...]
Post by Philipp Hahn
(gdb) list *(xen_netbk_rx_action+0x18b)
0xffffffffa04287dc is in xen_netbk_rx_action
(/var/build/temp/tmp.hW3dNilayw/pbuilder/linux-3.10.11/drivers/net/xen-netback/netback
.c:611).
606 meta->gso_size = skb_shinfo(skb)->gso_size;
607 else
608 meta->gso_size = 0;
609
610 meta->size = 0;
611 meta->id = req->id;
612 npo->copy_off = 0;
613 npo->copy_gref = req->gref;
614
615 data = skb->data;
1. The VM is receiving packets through bonding + bridge + netback +
netfront.
2. For some unknown reason at least one packet remains in the rx queue
and is not delivered to the domU immediately by netback.
3. The VM finishes shutting down.
4. The shared ring between dom0 and domU is freed.
5. then xen-netback continues processing the pending requests and tries
to put the packet into the now already released shared ring.
From reading the attached disassembly I guess, that
AX = &meta
CX = &rx->string
DX =~ rx.req_cons
CR2 = &req->id
where
CX + DX * sizeof(union struct xen_netif_rx_{request,response})=8 = CR2
Any additional ideas or insight is appreciated.
I think your analysis makes sense. Netback does have its internal queue,
and the kthread can certainly be scheduled away. There doesn't seem to be
a synchronisation point between a vif getting disconnected and the
internal queue getting processed. I've attached a quick hack; if it works
to a degree then we can try to work out a proper fix.
Post by Philipp Hahn
FYI: The host has only a single CPU and is running >=2 VMs so far.
Post by Wei Liu
There's one more patch that you can pick up from 3.10.y tree. I doubt it
will make much difference though.
Which patch are you referring to?
You can have a look at 3.10.y tree for all the patches between your
current version and the latest stable version.

Wei.

---8<---
From d2f428a93e6e296bc5f55e16f44ac1ad63a951a8 Mon Sep 17 00:00:00 2001
From: Wei Liu <***@citrix.com>
Date: Thu, 19 Jun 2014 15:07:47 +0100
Subject: [PATCH] quick hack

---
drivers/net/xen-netback/common.h | 1 +
drivers/net/xen-netback/interface.c | 1 +
drivers/net/xen-netback/netback.c | 8 ++++++++
3 files changed, 10 insertions(+)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index f2faa77..9239824 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -66,6 +66,7 @@ struct xenvif {
/* The shared rings and indexes. */
struct xen_netif_tx_back_ring tx;
struct xen_netif_rx_back_ring rx;
+ bool mapped;

/* Frontend feature information. */
u8 can_sg:1;
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 540a796..5f11763 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -271,6 +271,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
vif->dev = dev;
INIT_LIST_HEAD(&vif->schedule_list);
INIT_LIST_HEAD(&vif->notify_list);
+ vif->mapped = false;

vif->credit_bytes = vif->remaining_credit = ~0UL;
vif->credit_usec = 0UL;
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 36efb41..f4f3693 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -720,6 +720,11 @@ static void xen_netbk_rx_action(struct xen_netbk *netbk)
vif = netdev_priv(skb->dev);
nr_frags = skb_shinfo(skb)->nr_frags;

+ if (!vif->mapped) {
+ dev_kfree_skb(skb);
+ continue;
+ }
+
sco = (struct skb_cb_overlay *)skb->cb;
sco->meta_slots_used = netbk_gop_skb(skb, &npo);

@@ -1864,6 +1869,8 @@ static int xen_netbk_kthread(void *data)

void xen_netbk_unmap_frontend_rings(struct xenvif *vif)
{
+ vif->mapped = false;
+
if (vif->tx.sring)
xenbus_unmap_ring_vfree(xenvif_to_xenbus_device(vif),
vif->tx.sring);
@@ -1899,6 +1906,7 @@ int xen_netbk_map_frontend_rings(struct xenvif *vif,
BACK_RING_INIT(&vif->rx, rxs, PAGE_SIZE);

vif->rx_req_cons_peek = 0;
+ vif->mapped = true;

return 0;
--
1.7.10.4
David Vrabel
2014-06-19 14:35:11 UTC
Post by Wei Liu
[...]
Post by Philipp Hahn
(gdb) list *(xen_netbk_rx_action+0x18b)
0xffffffffa04287dc is in xen_netbk_rx_action
(/var/build/temp/tmp.hW3dNilayw/pbuilder/linux-3.10.11/drivers/net/xen-netback/netback
.c:611).
606 meta->gso_size = skb_shinfo(skb)->gso_size;
607 else
608 meta->gso_size = 0;
609
610 meta->size = 0;
611 meta->id = req->id;
612 npo->copy_off = 0;
613 npo->copy_gref = req->gref;
614
615 data = skb->data;
1. The VM is receiving packets through bonding + bridge + netback +
netfront.
2. For some unknown reason at least one packet remains in the rx queue
and is not delivered to the domU immediately by netback.
3. The VM finishes shutting down.
4. The shared ring between dom0 and domU is freed.
5. then xen-netback continues processing the pending requests and tries
to put the packet into the now already released shared ring.
From reading the attached disassembly I guess, that
AX = &meta
CX = &rx->string
DX =~ rx.req_cons
CR2 = &req->id
where
CX + DX * sizeof(union struct xen_netif_rx_{request,response})=8 = CR2
Any additional ideas or insight is appreciated.
I think your analysis makes sense. Netback does have it's internal queue
and kthread can certainly be scheduled away. There doesn't seem to be a
synchronisation point between a vif getting disconnet and internal queue
gets processed. I attach a quick hack. If it does work to a degree then
we can try to work out a proper fix.
The kthread_stop() in xenvif_disconnect() waits for the kthread to exit
so I don't see how Philipp's analysis can be right.

David
Wei Liu
2014-06-19 14:41:10 UTC
Post by David Vrabel
Post by Wei Liu
[...]
Post by Philipp Hahn
(gdb) list *(xen_netbk_rx_action+0x18b)
0xffffffffa04287dc is in xen_netbk_rx_action
(/var/build/temp/tmp.hW3dNilayw/pbuilder/linux-3.10.11/drivers/net/xen-netback/netback
.c:611).
606 meta->gso_size = skb_shinfo(skb)->gso_size;
607 else
608 meta->gso_size = 0;
609
610 meta->size = 0;
611 meta->id = req->id;
612 npo->copy_off = 0;
613 npo->copy_gref = req->gref;
614
615 data = skb->data;
1. The VM is receiving packets through bonding + bridge + netback +
netfront.
2. For some unknown reason at least one packet remains in the rx queue
and is not delivered to the domU immediately by netback.
3. The VM finishes shutting down.
4. The shared ring between dom0 and domU is freed.
5. then xen-netback continues processing the pending requests and tries
to put the packet into the now already released shared ring.
From reading the attached disassembly I guess, that
AX = &meta
CX = &rx->string
DX =~ rx.req_cons
CR2 = &req->id
where
CX + DX * sizeof(union struct xen_netif_rx_{request,response})=8 = CR2
Any additional ideas or insight is appreciated.
I think your analysis makes sense. Netback does have it's internal queue
and kthread can certainly be scheduled away. There doesn't seem to be a
synchronisation point between a vif getting disconnet and internal queue
gets processed. I attach a quick hack. If it does work to a degree then
we can try to work out a proper fix.
The kthread_stop() in xenvif_disconnect() waits for the kthread to exit
so I don't see how Philipp's analysis can be right.
He's using a 3.10 kernel. One kthread serves many vifs, so that kthread
won't stop.

Wei.
Post by David Vrabel
David
Philipp Hahn
2014-06-27 08:42:24 UTC
Permalink
Hello Wei Liu,
...
Post by Wei Liu
Post by Philipp Hahn
5. then xen-netback continues processing the pending requests and tries
...
Post by Wei Liu
I think your analysis makes sense. Netback does have its internal queue
and the kthread can certainly be scheduled away. There doesn't seem to be
a synchronisation point between a vif getting disconnected and the
internal queue being processed. I attach a quick hack. If it does work to
a degree then we can try to work out a proper fix.
Your quick hack seems to have solved the problem: the network survived
the weekend, but we had to change the VMs as one of them was required
last weekend. We're currently re-checking that the bug still occurs with
the old kernel but the ne
We added some debug output (UniDEBUG) as we observed another OOPS in one
test run, but I think that was a mis-compiled kernel, as the size of the
function was the same as before although it should be 0x712 according to
[ 6196.712232] BUG: unable to handle kernel paging request at ffffc90010d94678
[ 6196.712322] IP: [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
[ 6196.712410] PGD 95822067 PUD 95823067 PMD 94721067 PTE 0
[ 6196.712473] Oops: 0000 [#1] SMP
...
[ 6196.713434] CPU: 0 PID: 11743 Comm: netback/0 Not tainted 3.10.0-ucs58-amd64 #1
Univention 3.10.11-1.58.201405060908a~xenXXX
...
[ 6196.713618] task: ffff8800917f7840 ti: ffff880004bde000 task.ti: ffff880004bde000
[ 6196.713701] RIP: e030:[<ffffffffa04147dc>] [<ffffffffa04147dc>]
xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
[ 84.833333] device eth2 entered promiscuous mode
[ 248.191165] UniDEBUG vif->mapped is set to false (xenvif_alloc)
[ 248.442727] device vif1.0 entered promiscuous mode
[ 250.721054] UniDEBUG vif->mapped is true (xen_netbk_map_frontend_rings)
[ 250.721099] XXXlan0: port 2(vif1.0) entered forwarding state
[ 250.721103] XXXlan0: port 2(vif1.0) entered forwarding state
[ 253.473859] UniDEBUG vif->mapped is set to false (xenvif_alloc)
[ 253.737812] device vif2.0 entered promiscuous mode
[ 255.639021] UniDEBUG vif->mapped is true (xen_netbk_map_frontend_rings)
[ 255.639067] XXXlan0: port 3(vif2.0) entered forwarding state
[ 255.639072] XXXlan0: port 3(vif2.0) entered forwarding state
[ 592.867375] UniDEBUG vif->mapped is set to false(xen_netbk_unmap_frontend_rings)
[ 592.868147] XXXlan0: port 3(vif2.0) entered disabled state
[ 593.499258] XXXlan0: port 3(vif2.0) entered disabled state
[ 593.499293] device vif2.0 left promiscuous mode
[ 593.499295] XXXlan0: port 3(vif2.0) entered disabled state
[ 595.386548] UniDEBUG vif->mapped is set to false (xenvif_alloc)
[ 595.633665] device vif3.0 entered promiscuous mode
[ 597.390410] UniDEBUG vif->mapped is true (xen_netbk_map_frontend_rings)
[ 597.390458] XXXlan0: port 3(vif3.0) entered forwarding state
[ 597.390462] XXXlan0: port 3(vif3.0) entered forwarding state
[ 936.549840] UniDEBUG vif->mapped is set to false(xen_netbk_unmap_frontend_rings)
[ 936.549869] XXXlan0: port 3(vif3.0) entered disabled state
[ 936.553024] UniDEBUG vif->mapped is false
Here it would previously oops.
[ 937.459565] device vif3.0 left promiscuous mode
[ 937.459570] XXXlan0: port 3(vif3.0) entered disabled state
[ 1115.250900] INFO: task xenwatch:14 blocked for more than 120 seconds.
[ 1115.250902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1115.250904] xenwatch D ffff8800952d9080 0 14 2 0x00000000
[ 1115.250907] ffff8800952d9080 0000000000000246 ffff880094510880 0000000000013ec0
[ 1115.250909] ffff88009530ffd8 0000000000013ec0 ffff88009530ffd8 0000000000013ec0
[ 1115.250911] ffff8800952d9080 0000000000013ec0 0000000000013ec0 ffff88009530e010
[ 1115.250921] [<ffffffff813ca32d>] ? _raw_spin_lock_irqsave+0x11/0x2f
[ 1115.250925] [<ffffffffa040a396>] ? xenvif_free+0x7a/0xb6 [xen_netback]
[ 1115.250930] [<ffffffff8105d373>] ? wake_up_bit+0x20/0x20
[ 1115.250934] [<ffffffff812697c2>] ? xenbus_rm+0x44/0x4f
[ 1115.250937] [<ffffffffa0409ead>] ? netback_remove+0x5d/0x7e [xen_netback]
[ 1115.250940] [<ffffffff8126a72f>] ? xenbus_dev_remove+0x29/0x4e
[ 1115.250943] [<ffffffff812a4ff0>] ? __device_release_driver+0x7f/0xd5
[ 1115.250946] [<ffffffff812a50fc>] ? device_release_driver+0x1d/0x29
[ 1115.250948] [<ffffffff812a4519>] ? bus_remove_device+0xee/0x103
[ 1115.250950] [<ffffffff812a2b49>] ? device_del+0x112/0x182
[ 1115.250952] [<ffffffff812a2bc2>] ? device_unregister+0x9/0x12
[ 1115.250955] [<ffffffff81268e20>] ? xenwatch_thread+0x122/0x15f
[ 1115.250957] [<ffffffff8105d373>] ? wake_up_bit+0x20/0x20
[ 1115.250959] [<ffffffff81268cfe>] ? xs_watch+0x57/0x57
[ 1115.250962] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[ 1115.250964] [<ffffffff81268cfe>] ? xs_watch+0x57/0x57
[ 1115.250966] [<ffffffff8105ce1e>] ? kthread+0xab/0xb3
[ 1115.250969] [<ffffffff81003638>] ? xen_end_context_switch+0xe/0x1c
[ 1115.250972] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[ 1115.250975] [<ffffffff813cfbfc>] ? ret_from_fork+0x7c/0xb0
[ 1115.250977] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
Any idea?

Sincerely
Philipp
Philipp Hahn
2014-06-27 17:48:36 UTC
Permalink
Hello Wei Liu,
...
Post by Philipp Hahn
[ 1115.250900] INFO: task xenwatch:14 blocked for more than 120 seconds.
[ 1115.250902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1115.250904] xenwatch D ffff8800952d9080 0 14 2 0x00000000
[ 1115.250907] ffff8800952d9080 0000000000000246 ffff880094510880 0000000000013ec0
[ 1115.250909] ffff88009530ffd8 0000000000013ec0 ffff88009530ffd8 0000000000013ec0
[ 1115.250911] ffff8800952d9080 0000000000013ec0 0000000000013ec0 ffff88009530e010
[ 1115.250921] [<ffffffff813ca32d>] ? _raw_spin_lock_irqsave+0x11/0x2f
[ 1115.250925] [<ffffffffa040a396>] ? xenvif_free+0x7a/0xb6 [xen_netback]
...
Post by Philipp Hahn
Any idea?
I guess we found the problem ourselves: for skbs removed this way, the
reference counter on the associated vif was not decremented, as is
normally done in two places at the end of the function
Post by Philipp Hahn
Subject: [PATCH] quick hack
...
Post by Philipp Hahn
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 36efb41..f4f3693 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -720,6 +720,11 @@ static void xen_netbk_rx_action(struct xen_netbk *netbk)
 		vif = netdev_priv(skb->dev);
 		nr_frags = skb_shinfo(skb)->nr_frags;
 
+		if (!vif->mapped) {
+			dev_kfree_skb(skb);
+			xenvif_put(vif);
+			continue;
+		}
+
 		sco = (struct skb_cb_overlay *)skb->cb;
 		sco->meta_slots_used = netbk_gop_skb(skb, &npo);
The test is currently running again for the weekend and on Monday we
will hopefully know more.

Thanks again.

Philipp
Philipp Hahn
2014-06-27 18:24:51 UTC
Permalink
Hello Wei Liu,

On 27.06.2014 19:48, Philipp Hahn wrote:
Post by Philipp Hahn
I guess we found the problem ourselves: for skbs removed this way, the
reference counter on the associated vif was not decremented, as is
normally done in two places at the end of the function
...
Post by Philipp Hahn
The test is currently running again for the weekend and on Monday we
will hopefully know more.
FYI: The test VM survived the first reboot without locking up:

Jun 27 19:43:43 xenmbint05b01 kernel: [ 1716.359537] UniDEBUG
vif->mapped is set to false (xenvif_alloc)
Jun 27 19:43:45 xenmbint05b01 kernel: [ 1718.558540] UniDEBUG
vif->mapped is true (xen_netbk_map_frontend_rings)
Jun 27 19:49:23 xenmbint05b01 kernel: [ 2055.895207] UniDEBUG
vif->mapped is set to false(xen_netbk_unmap_frontend_rings)
Jun 27 19:49:23 xenmbint05b01 kernel: [ 2055.898349] UniDEBUG
vif->mapped is false
Jun 27 19:49:23 xenmbint05b01 kernel: [ 2055.898352] UniDEBUG
vif->mapped is false
Jun 27 19:49:25 xenmbint05b01 kernel: [ 2058.529454] UniDEBUG
vif->mapped is set to false (xenvif_alloc)

BYtE
Philipp
Philipp Hahn
2014-07-02 07:45:44 UTC
Permalink
Hello Wei Liu,
Post by Philipp Hahn
On 27.06.2014 19:48, Philipp Hahn wrote:
Post by Philipp Hahn
I guess we found the problem ourselves: for skbs removed this way, the
reference counter on the associated vif was not decremented, as is
normally done in two places at the end of the function
...
Post by Philipp Hahn
The test is currently running again for the weekend and on Monday we
will hopefully know more.
...
Post by Philipp Hahn
Jun 27 19:49:23 xenmbint05b01 kernel: [ 2055.898349] UniDEBUG
vif->mapped is false
The host survived the weekend with the problematic VM rebooting every 5
minutes; the log shows the unmapped shared ring being accessed at the
point where the kernel previously crashed.

So the attached patch fixes the bug (or at least prevents the OOPS).

@Wei Liu: You said that the patch is only a quick hack to check whether
my analysis is correct and that a proper fix would be needed. For us the
attached patch works: the problem does not happen that often and is hard
to reproduce anyway, so spending more time on this issue is probably not
worth it. And the flag doesn't look that ugly.

@stable: at least 3.10 has the bug, and other long-term-stable kernels
have it too. The code in current is different since multi-queue was
added, so the patch wouldn't apply to current.

Sincerely
Philipp Hahn
Wei Liu
2014-07-10 12:41:22 UTC
Permalink
Post by Philipp Hahn
Hello Wei Liu,
Post by Philipp Hahn
On 27.06.2014 19:48, Philipp Hahn wrote:
Post by Philipp Hahn
I guess we found the problem ourselves: for skbs removed this way, the
reference counter on the associated vif was not decremented, as is
normally done in two places at the end of the function
...
Post by Philipp Hahn
The test is currently running again for the weekend and on Monday we
will hopefully know more.
...
Post by Philipp Hahn
Jun 27 19:49:23 xenmbint05b01 kernel: [ 2055.898349] UniDEBUG
vif->mapped is false
The host survived the weekend with the problematic VM rebooting every 5
minutes; the log shows the unmapped shared ring being accessed at the
point where the kernel previously crashed.
So the attached patch fixes the bug (or at least prevents the OOPS).
@Wei Liu: You said that the patch is only a quick hack to check whether
my analysis is correct and that a proper fix would be needed. For us the
attached patch works: the problem does not happen that often and is hard
to reproduce anyway, so spending more time on this issue is probably not
worth it. And the flag doesn't look that ugly.
Sorry for the late reply. I was away for two weeks.

I agree that we would like to avoid spending too much time on this
issue.

Since the problem is confirmed, I think a proper fix would be to
reference count the vif and prevent the ring from being unmapped before
all queued SKBs are consumed. But it might require much more work than
the quick hack. Would you be up for writing a patch? I won't be able to
write one in the near future. Furthermore, you're now the only party who
can verify a fix.
Post by Philipp Hahn
@stable: at least 3.10 has the bug, and other long-term-stable kernels
have it too. The code in current is different since multi-queue was
added, so the patch wouldn't apply to current.
FWIW this bug doesn't exist in kernel >=3.12.

Wei.
Philipp Hahn
2014-07-11 09:41:22 UTC
Permalink
Hello Wei Liu,
Post by Wei Liu
Post by Philipp Hahn
@Wei Liu: You said that the patch is only a quick hack to check whether
my analysis is correct and that a proper fix would be needed. For us the
attached patch works: the problem does not happen that often and is hard
to reproduce anyway, so spending more time on this issue is probably not
worth it. And the flag doesn't look that ugly.
...
Post by Wei Liu
I agree that we would like to avoid spending too much time on this
issue.
This is also what I'm thinking.
Post by Wei Liu
Since the problem is confirmed, I think a proper fix would be to
reference count the vif and prevent the ring from being unmapped before
all queued SKBs are consumed.
vif is already ref-counted; see attached *untested* patch for a start.

What I don't like is xenbus_unmap_ring_vfree() potentially printing an
error on double-free when called from xen_netbk_unmap_frontend_rings().
Post by Wei Liu
But it might require much more work than the quick hack.
I'm no network driver/Xen expert, but that sounds like more work for no
gain: we would still be copying packets for a guest that is already dead.
If the ring gets full, we would probably need to go to sleep and wait for
an answer that will never come.
Post by Wei Liu
FWIW this bug doesn't exist in kernel >=3.12.
That is one more point in favour of the hack, as the problem is properly
fixed there and is very obscure to trigger.
Post by Wei Liu
Would you be up for writing a patch? I won't be able to write one in the
near future. Furthermore, you're now the only party who can verify a fix.
I had a look and created the attached patch, which is untested, as I
currently can't access the faulting system and have been unable to
reproduce it in my development environment.

The quick hack has now been running on that system for several weeks
without problems.

Sincerely
Philipp
Wei Liu
2014-07-11 09:53:07 UTC
Permalink
Post by Philipp Hahn
Hello Wei Liu,
Post by Wei Liu
Post by Philipp Hahn
@Wei Liu: You said that the patch is only a quick hack to check whether
my analysis is correct and that a proper fix would be needed. For us the
attached patch works: the problem does not happen that often and is hard
to reproduce anyway, so spending more time on this issue is probably not
worth it. And the flag doesn't look that ugly.
...
Post by Wei Liu
I agree that we would like to avoid spending too much time on this
issue.
This is also what I'm thinking.
Post by Wei Liu
Since the problem is confirmed, I think a proper fix would be to
reference count the vif and prevent the ring from being unmapped before
all queued SKBs are consumed.
vif is already ref-counted; see attached *untested* patch for a start.
What I meant was that in xenvif_free we wait for refcnt to become 0 (or
1?) before freeing the vif structure. I think we should do a similar
thing before unmapping the ring.
Post by Philipp Hahn
What I don't like is xenbus_unmap_ring_vfree() potentially printing an
error on double-free when called from xen_netbk_unmap_frontend_rings().
How can it be called twice? Does it happen before or after applying my
patch?
Post by Philipp Hahn
Post by Wei Liu
But it might require much more work than that quick hack.
I'm no network driver/Xen expert, but that sounds like more work for no
gain: we would still be copying packets for a guest that is already dead.
If the ring gets full, we would probably need to go to sleep and wait for
an answer that will never come.
That's true, so dropping those packets seems to be a good solution.
Post by Philipp Hahn
Post by Wei Liu
FWIW this bug doesn't exist in kernel >=3.12.
That is one more point in favour of the hack, as the problem is properly
fixed there and is very obscure to trigger.
Post by Wei Liu
Would you be up for writing a patch? I won't be able to write one in the
near future. Furthermore, you're now the only party who can verify a fix.
I had a look and created the attached patch, which is untested, as I
currently can't access the faulting system and have been unable to
reproduce it in my development environment.
The quick hack has now been running on that system for several weeks
without problems.
Thanks for testing.

I will need to think further about whether this hack is really a proper
solution. I will keep you posted if there's any update on this. In the
meantime you can carry that patch in your own tree for a while.

Wei.
Wei Liu
2014-07-11 10:32:36 UTC
Permalink
Post by Philipp Hahn
Hello Wei Liu,
Post by Wei Liu
Post by Philipp Hahn
@Wei Liu: You said that the patch is only a quick hack to check whether
my analysis is correct and that a proper fix would be needed. For us the
attached patch works: the problem does not happen that often and is hard
to reproduce anyway, so spending more time on this issue is probably not
worth it. And the flag doesn't look that ugly.
...
Post by Wei Liu
I agree that we would like to avoid spending too much time on this
issue.
This is also what I'm thinking.
Post by Wei Liu
Since the problem is confirmed, I think a proper fix would be to
reference count the vif and prevent the ring from being unmapped before
all queued SKBs are consumed.
vif is already ref-counted; see attached *untested* patch for a start.
[...]
Post by Philipp Hahn
The quick hack has now been running on that system for several weeks
without problems.
I'm a bit confused: you said that patch was *untested*, but then you
said the hack runs fine.

I guess what you meant is: the approach is generally working, but the
patch you submitted is not the one running in your system? Is there a
significant difference between the patch you submitted and the one
you're using in production?

Wei.
Philipp Hahn
2014-07-11 11:02:18 UTC
Permalink
Hello Wei Liu,
Post by Wei Liu
Post by Philipp Hahn
Post by Wei Liu
Post by Philipp Hahn
@Wei Liu: You said that the patch is only a quick hack to check
whether my analysis is correct and that a proper fix would be
needed. For us the attached patch works: the problem does not
happen that often and is hard to reproduce anyway, so spending
more time on this issue is probably not worth it. And the flag
doesn't look that ugly.
...
Post by Wei Liu
I agree that we would like to avoid spending too much time on
this issue.
This is also what I'm thinking.
Post by Wei Liu
Since the problem is confirmed, I think a proper fix would be to
reference count the vif and prevent the ring from being unmapped
before all queued SKBs are consumed.
vif is already ref-counted; see attached *untested* patch for a
start.
v2 = using refcnt
Post by Wei Liu
Post by Philipp Hahn
The quick hack has now been running on that system for several
weeks without problems.
v1 = Your quick hack patch + my additional dev_kfree_skb(skb);
Post by Wei Liu
I'm a bit confused: you said that patch was *untested*, but then you
said the hack runs fine.
I guess what you meant is: the approach is generally working, but the
patch you submitted is not the one running in your system? Is there a
significant difference between the patch you submitted and the one
you're using in production?
The PATCH from 02.07.2014 09:45 using your "vif->mapped" flag has been
running fine on the system for ~2 weeks now.

The PATCHv2 using refcnt I posted today is untested and for RFC.
Post by Wei Liu
Post by Philipp Hahn
What I don't like is xenbus_unmap_ring_vfree() potentially
printing an error on double-free when called from
xen_netbk_unmap_frontend_rings().
How can it be called twice? Does it happen before or after applying
my patch?
I've not seen it called twice, but while studying the code I asked
myself whether it is safe to be called twice. I've not gone down the
road of proving that it never is; just being cautious.

The worst thing that could happen is an error message:
| xenbus_dev_error(dev, -ENOENT,
| "can't find mapped virtual address %p", vaddr);
Post by Wei Liu
I will need to think further if this hack is really a proper
solution. I will keep you posted if there's any update on this.
Thank you for your feedback and support.
Post by Wei Liu
In the mean time you can carry that patch in your own tree for a
while.
We will keep using v1 for now and I will keep you updated on this as well.

Sincerely
Philipp
Wei Liu
2014-07-11 11:16:06 UTC
Permalink
Post by Philipp Hahn
Hello Wei Liu,
Post by Philipp Hahn
Post by Wei Liu
Post by Philipp Hahn
@Wei Liu: You said that the patch is only a quick hack to check
whether my analysis is correct and that a proper fix would be
needed. For us the attached patch works: the problem does not
happen that often and is hard to reproduce anyway, so spending
more time on this issue is probably not worth it. And the flag
doesn't look that ugly.
...
Post by Wei Liu
I agree that we would like to avoid spending too much time on
this issue.
This is also what I'm thinking.
Post by Wei Liu
Since the problem is confirmed, I think a proper fix would be to
reference count the vif and prevent the ring from being unmapped
before all queued SKBs are consumed.
vif is already ref-counted; see attached *untested* patch for a
start.
v2 = using refcnt
Oh I see, I misread your email. You actually sent me two patches, sorry.
I looked at your first one and somehow missed the second one.

Unfortunately your patch with refcnt won't work.
Post by Philipp Hahn
Post by Philipp Hahn
The quick hack has now been running on that system for several
weeks without problems.
v1 = Your quick hack patch + my additional dev_kfree_skb(skb);
I guess you mean "my additional xenvif_put". :-)

I will see what I can do.

Wei.
