Philipp Hahn
2014-06-06 10:26:55 UTC
Hello,
on one of our hosts (Xen-4.1.3 with Linux-3.10.26 + Debian patches)
running 16 Linux VMs (linux-3.2.39 and others) netback crashes during
[38551.549615] Oops: 0000 [#1] SMP
[38551.549665] Modules linked in: tun xt_physdev xen_blkback xen_netback ip6_tables
iptable_filter ip_tables ebtable_nat ebtables x_tables xen_gntdev nfsv3 nfsv4
rpcsec_gss_krb5 nfsd nfs_acl auth_rpcgss oid_registry nfs fscache dns_resolver lockd
sunrpc fuse loop xen_blkfront xen_evtchn blktap quota_v2 quota_tree xenfs xen_privcmd
coretemp crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
glue_helper aes_x86_64 snd_pcm snd_timer snd soundcore snd_page_alloc tpm_tis tpm lpc_ich
tpm_bios i7core_edac i2c_i801 psmouse microcode edac_core serio_raw pcspkr mperf ioatdma
mfd_core processor evdev thermal_sys ext4 jbd2 crc16 bonding bridge stp llc dm_snapshot
dm_mirror dm_region_hash dm_log dm_mod sd_mod crc_t10dif ehci_pci uhci_hcd ehci_hcd mptsas
mptscsih mptbase scsi_transport_sas usbcore usb_common igb dca i2c_algo_bit i2c_core ptp
pps_core button
[38551.550601] CPU: 0 PID: 12587 Comm: netback/0 Not tainted 3.10.0-ucs58-amd64 #1 Debian
3.10.11-1.58.201405060908
[38551.550693] Hardware name: FUJITSU PRIMERGY BX620 S6/D3051, BIOS 080015 Rev.3C78.3051
07/22/2011
[38551.550781] task: ffff880004b067c0 ti: ffff8800561ec000 task.ti: ffff8800561ec000
[38551.550865] RIP: e030:[<ffffffffa04147dc>] [<ffffffffa04147dc>]
xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
[38551.550959] RSP: e02b:ffff8800561edce8 EFLAGS: 00010202
[38551.551009] RAX: ffffc900104adac0 RBX: ffff8800541e95c0 RCX: ffffc90010864000
[38551.551064] RDX: 000000000000003b RSI: 0000000000000000 RDI: ffff880040014380
[38551.551120] RBP: ffff8800570e6800 R08: 0000000000000000 R09: ffff880004799800
[38551.551175] R10: ffffffff813ca115 R11: ffff88005e4fdb08 R12: ffff880054e6f800
[38551.551231] R13: ffff8800561edd58 R14: ffffc900104a1000 R15: 0000000000000000
[38551.551289] FS: 00007f19a54a8700(0000) GS:ffff88005da00000(0000)
knlGS:0000000000000000
[38551.551374] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[38551.551425] CR2: ffffc900108641d8 CR3: 0000000054cb3000 CR4: 0000000000002660
[38551.551481] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[38551.551537] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[38551.551630] ffff880004b06ba0 0000000000000000 ffff88005da13ec0 ffff88005da13ec0
[38551.551726] 0000000004b067c0 ffffc900104a8ac0 ffffc900104a1020 000000005da13ec0
[38551.551823] 0000000000000000 0000000000000001 ffffc900104a8ac0 ffffc900104adac0
[38551.551966] [<ffffffff813ca32d>] ? _raw_spin_lock_irqsave+0x11/0x2f
[38551.552021] [<ffffffffa0416033>] ? xen_netbk_kthread+0x174/0x841 [xen_netback]
[38551.552106] [<ffffffff8105d373>] ? wake_up_bit+0x20/0x20
[38551.560239] [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
[38551.560325] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[38551.560381] [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
[38551.560466] [<ffffffff8105ce1e>] ? kthread+0xab/0xb3
[38551.560518] [<ffffffff81003638>] ? xen_end_context_switch+0xe/0x1c
[38551.560572] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[38551.560628] [<ffffffff813cfbfc>] ? ret_from_fork+0x7c/0xb0
[38551.560680] [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
[38551.560734] Code: 8b b3 d0 00 00 00 48 8b bb d8 00 00 00 0f b7 74 37 02 89 70 08 eb 07
c7 40 08 00 00 00 00 89 d2 c7 40 04 00 00 00 00 48 83 c2 08 <0f> b7 34 d1 89 30 c7 44 24
60 00 00 00 00 8b 44 d1 04 89 44 24
[38551.561151] RIP [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
[38551.561238] RSP <ffff8800561edce8>
[38551.561283] CR2: ffffc900108641d8
[38551.561624] ---[ end trace 8c260c6af259c4aa ]---
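The "Code:" line of the trace can be cross-checked mechanically: the byte wrapped in <..> is the first byte of the instruction at RIP. A minimal Python sketch, assuming only the bytes and the symbol offset quoted in the trace above (the helper name split_code_line is made up for illustration):

```python
# Bytes copied verbatim from the "Code:" line of the oops above.
CODE = (
    "8b b3 d0 00 00 00 48 8b bb d8 00 00 00 0f b7 74 37 02 89 70 08 eb 07 "
    "c7 40 08 00 00 00 00 89 d2 c7 40 04 00 00 00 00 48 83 c2 08 <0f> b7 34 d1 "
    "89 30 c7 44 24 60 00 00 00 00 8b 44 d1 04 89 44 24"
)

RIP_OFFSET = 0x18B  # from "xen_netbk_rx_action+0x18b/0x6f0"


def split_code_line(code):
    """Split an oops "Code:" dump into (bytes before RIP, bytes from RIP on)."""
    tokens = code.split()
    # The kernel marks the byte at RIP with angle brackets.
    marker = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))

    def to_bytes(toks):
        return bytes(int(tok.strip("<>"), 16) for tok in toks)

    return to_bytes(tokens[:marker]), to_bytes(tokens[marker:])


before, at_rip = split_code_line(CODE)
dump_start = RIP_OFFSET - len(before)  # function offset of the first dumped byte

print("bytes before RIP:", len(before))
print("dump starts at function offset:", hex(dump_start))
print("faulting instruction starts with:", " ".join(f"{b:02x}" for b in at_rip[:4]))
```

The dump holds 43 bytes before RIP, so it starts at xen_netbk_rx_action+0x160, and the faulting instruction begins with 0f b7 34 d1 (a movzwl).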
The host itself is still alive and reachable by network, but all VMs are
no longer reachable.
The crash does not happen on every reboot: The VM was running fine for
1½ weeks after a dom0 kernel update, but has now crashed on each of the
past two nights.
I'm not yet able to reproduce this on demand, but would like to be
prepared the next time it happens.
@Ian: I found your mail "Re: [Xen-devel] Kernel 3.7.0-pre-rc1 kernel BUG
at drivers/net/xen-netback/netback.c:405 RIP: e030:[<ffffffff814714f9>]
[<ffffffff814714f9>] netbk_gop_frag_copy+0x379/0x380" from 2012-10-09,
which describes a crash in the same function, but at a completely
different (later) location. You hinted that a difference in hardware
might explain why I'm unable to reproduce it, as my test environment
has different HW (no "igb", but "e1000e").
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:606
                meta->gso_size = skb_shinfo(skb)->gso_size;
     7b1:   8b b3 d0 00 00 00       mov    0xd0(%rbx),%esi
     7b7:   48 8b bb d8 00 00 00    mov    0xd8(%rbx),%rdi
     7be:   0f b7 74 37 02          movzwl 0x2(%rdi,%rsi,1),%esi
     7c3:   89 70 08                mov    %esi,0x8(%rax)
     7c6:   eb 07                   jmp    7cf <xen_netbk_rx_action+0x17e>
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:608
        else
                meta->gso_size = 0;
     7c8:   c7 40 08 00 00 00 00    movl   $0x0,0x8(%rax)
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:611
        meta->size = 0;
        meta->id = req->id;
     7cf:   89 d2                   mov    %edx,%edx
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:610
        if (!vif->gso_prefix)
                meta->gso_size = skb_shinfo(skb)->gso_size;
        else
                meta->gso_size = 0;
        meta->size = 0;
     7d1:   c7 40 04 00 00 00 00    movl   $0x0,0x4(%rax)
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:611
        meta->id = req->id;
     7d8:   48 83 c2 08             add    $0x8,%rdx
     7dc:   0f b7 34 d1             movzwl (%rcx,%rdx,8),%esi
0x651 (start of xen_netbk_rx_action) + 0x18B = 0x7DC
     7e0:   89 30                   mov    %esi,(%rax)
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:612
        npo->copy_off = 0;
     7e2:   c7 44 24 60 00 00 00    movl   $0x0,0x60(%rsp)
     7e9:   00
/root/linux-3.10.11/drivers/net/xen-netback/netback.c:613
        npo->copy_gref = req->gref;
     7ea:   8b 44 d1 04             mov    0x4(%rcx,%rdx,8),%eax
     7ee:   89 44 24 64             mov    %eax,0x64(%rsp)
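The offset arithmetic and the faulting address can be re-checked from the register dump. A small Python sketch: all values are copied from the oops and the objdump listing above, while reading RCX as the base pointer and RDX as the already-incremented index of the movzwl at 7dc is an interpretation of the listing, not something the trace states outright:

```python
# Values copied from the oops and the objdump listing above.
FUNC_START = 0x7CF - 0x17E    # from "jmp 7cf <xen_netbk_rx_action+0x17e>"
RIP_OFFSET = 0x18B            # from "xen_netbk_rx_action+0x18b/0x6f0"
RCX = 0xFFFFC90010864000      # base register of the faulting load
RDX = 0x3B                    # index register, after "add $0x8,%rdx"
CR2 = 0xFFFFC900108641D8      # faulting address reported in the oops


def effective_address(base, index, scale, disp=0):
    """x86 operand disp(base,index,scale) -> base + index * scale + disp."""
    return (base + index * scale + disp) & 0xFFFFFFFFFFFFFFFF


faulting_insn = FUNC_START + RIP_OFFSET
fault_addr = effective_address(RCX, RDX, 8)   # movzwl (%rcx,%rdx,8),%esi

print("function start:      ", hex(FUNC_START))
print("faulting instruction:", hex(faulting_insn))
print("effective address:   ", hex(fault_addr))
print("matches CR2:         ", fault_addr == CR2)
```

The computed address equals CR2, so the page fault is raised by the 16-bit load that implements "meta->id = req->id" (netback.c:611), reading at byte offset 0x1d8 from the pointer held in RCX.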
Ignoring the name change from {netbk -> xenvif}_gop_skb() and the
addition of GSO for IPv6, the function looks unchanged compared to
current Git, so to me it looks like the problem might still exist in the
current implementation.
I tried to review the Git commits myself but didn't see anything
obvious, and with all the recent changes to netback I'm unsure how best
to proceed:
1. Is this a known bug, and has someone else observed it too?
2. If yes, is there a fix in newer Linux kernels?
3. If not, what additional data should I collect?
Xen-Hypervisor is 4.1.3 from Debian, but as this is a kernel crash, I
don't expect a newer version of Xen to fix it (correct me if I'm wrong).
Thanks in advance.
Philipp
PS: I'm not afraid of getting my hands dirty doing Linux coding, but
currently I'm out of ideas on how best to proceed.
--
Philipp Hahn
Open Source Software Engineer
Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen
Tel.: +49 421 22232-0
Fax : +49 421 22232-99
***@univention.de
http://www.univention.de/
Geschäftsführer: Peter H. Ganten
HRB 20755 Amtsgericht Bremen
Steuer-Nr.: 71-597-02876