Григорий Пташко
2014-10-17 18:17:14 UTC
Hello.
The long story is this. I'm running CentOS 7 with a custom-built kernel.
My architecture is x86_64. I'm trying to pass different GPUs through to Xen
guests. I've got a problem with the AMD FirePro W9100. A Windows HVM guest
starts with the GPU, and even a 3D benchmark runs OK. But after running for
some time, both the domU and dom0 freeze.
I monitor the serial console for kernel panics but I don't see any.
So I decided to take a crash dump of the dom0 kernel to see what's going on,
and it appears that I just cannot do this.
I've tried specifying the crashkernel parameter both for xen.gz and for
my dom0 kernel (bzImage).
1. The first case: crashkernel=256M for dom0 cmdline:
bzImage crashkernel=256M
[***@kvmxen-centos7-test1-nb ~]# systemctl status kdump.service
kdump.service - Crash recovery kernel arming
...
Oct 17 21:19:38 kvmxen-centos7-test1-nb kdumpctl[1506]: kexec: loaded kdump
kernel
...
[***@kvmxen-centos7-test1-nb ~]# cat /sys/kernel/kexec_crash_loaded
1
Here we see that kexec, run by kdump.service, worked: it seems to have
loaded the dump capture kernel.
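For reference, the two checks above can be combined into one small script.
This is only a minimal sketch: the function names (crashkernel_arg,
read_sysfs, report_kdump_state) are made up for illustration, and it assumes
the standard /proc/cmdline and /sys/kernel/kexec_crash_loaded files.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of kdump or kexec-tools.

# Print the value of crashkernel= found in a command-line string,
# or nothing if the parameter is absent.
crashkernel_arg() {
    for word in $1; do
        case "$word" in
            crashkernel=*) printf '%s\n' "${word#crashkernel=}" ;;
        esac
    done
}

# Print the contents of a sysfs/procfs file, or "n/a" if it is not readable.
read_sysfs() {
    if [ -r "$1" ]; then cat "$1"; else echo "n/a"; fi
}

# Summarise what the running (dom0) kernel reports.
report_kdump_state() {
    echo "crashkernel arg:    $(crashkernel_arg "$(read_sysfs /proc/cmdline)")"
    echo "kexec_crash_loaded: $(read_sysfs /sys/kernel/kexec_crash_loaded)"
}
```

Calling report_kdump_state as root prints both values. Note that it only
reads the dom0 kernel's command line, not the one given to xen.gz.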
And now let's try to panic:
[***@kvmxen-centos7-test1-nb ~]# echo c > /proc/sysrq-trigger
In the console we see:
[ 421.673471] SysRq : Trigger a crash
[ 421.677110] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 421.685021] IP: [<ffffffff81484486>] sysrq_handle_crash+0x16/0x20
[ 421.691172] PGD 2d11e58067 PUD 2c95d3c067 PMD 0
[ 421.695900] Oops: 0002 [#1] SMP
[ 421.699210] Modules linked in: ip6table_filter ip6_tables iptable_filter
ip_tables ebtable_nat ebtables sg rpcsec_gss_krb5 nls_utf8 iTCO_wdt
iTCO_vendor_support x86_pkg_temp_thermal coretemp crct10dif_pclmul
crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
lrw gf128mul sb_edac glue_helper ablk_helper ipmi_si lpc_ich edac_core
cryptd i2c_i801 pcspkr mfd_core ipmi_msghandler mei_me ioatdma wmi mei
shpchp dca nfsd binfmt_misc mgag200 drm_kms_helper ttm drm ahci mlx4_core
libahci libata
[ 421.745725] CPU: 9 PID: 11422 Comm: bash Not tainted 3.17.0 #3
[ 421.751562] Hardware name: Supermicro
X9DRFF-iG+/-7G+/-iTG+/-7TG+/X9DRFF-iG+/-7G+/-iTG+/-7TG+, BIOS 3.0 07/29/2013
[ 421.761910] task: ffff882e94383640 ti: ffff882c71758000 task.ti:
ffff882c71758000
[ 421.769398] RIP: e030:[<ffffffff81484486>] [<ffffffff81484486>]
sysrq_handle_crash+0x16/0x20
[ 421.777961] RSP: e02b:ffff882c7175be88 EFLAGS: 00010246
[ 421.783276] RAX: 000000000000000f RBX: ffffffff81d2d780 RCX:
0000000000000000
[ 421.790416] RDX: 0000000000000000 RSI: ffff882eea52e5b8 RDI:
0000000000000063
[ 421.797557] RBP: ffff882c7175be88 R08: 0000000000000002 R09:
ffffffff82034afc
[ 421.804708] R10: 00000000000004a7 R11: 00000000000004a6 R12:
0000000000000063
[ 421.811839] R13: 0000000000000000 R14: 0000000000000007 R15:
0000000000000000
[ 421.818992] FS: 00007f1c0205b740(0000) GS:ffff882eea520000(0000)
knlGS:0000000000000000
[ 421.827075] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 421.832821] CR2: 0000000000000000 CR3: 0000002c2a879000 CR4:
0000000000042660
[ 421.839972] Stack:
[ 421.841998] ffff882c7175beb8 ffffffff81484cd7 0000000000000002
00007f1c0207f000
[ 421.849494] 0000000000000002 ffff882c7175bf48 ffff882c7175bed0
ffffffff8148517f
[ 421.857019] ffff882e94765380 ffff882c7175bef0 ffffffff81251afd
ffff882c7175bf48
[ 421.864514] Call Trace:
[ 421.866981] [<ffffffff81484cd7>] __handle_sysrq+0x107/0x170
[ 421.872645] [<ffffffff8148517f>] write_sysrq_trigger+0x2f/0x40
[ 421.878575] [<ffffffff81251afd>] proc_reg_write+0x3d/0x80
[ 421.884069] [<ffffffff811eaef7>] vfs_write+0xb7/0x1f0
[ 421.889209] [<ffffffff811ebb15>] SyS_write+0x55/0xd0
[ 421.894294] [<ffffffff8183fc29>] system_call_fastpath+0x16/0x1b
[ 421.900300] Code: 65 34 75 e5 4c 89 ef e8 d9 f7 ff ff eb db 0f 1f 80 00
00 00 00 66 66 66 66 90 55 c7 05 88 43 7f 00 01 00 00 00 48 89 e5 0f ae f8
<c6> 04 25 00 00 00 00 01 5d c3 66 66 66 66 90 55 31 c0 c7 05 2e
[ 421.920596] RIP [<ffffffff81484486>] sysrq_handle_crash+0x16/0x20
[ 421.926803] RSP <ffff882c7175be88>
[ 421.930302] CR2: 0000000000000000
And that's it. The dump capture kernel is never booted. After this kernel
panic my server simply reboots.
2. The second case: crashkernel=256M in xen.gz cmdline.
xen.gz crashkernel=256M
[***@kvmxen-centos7-test1-nb ~]# systemctl status kdump.service
kdump.service - Crash recovery kernel arming
...
Active: failed (Result: exit-code) since Fri 2014-10-17 19:56:57 MSK; 1h
9min ago
...
Oct 17 19:56:57 kvmxen-centos7-test1-nb kdumpctl[1536]: No memory reserved
for crash kernel.
Oct 17 19:56:57 kvmxen-centos7-test1-nb kdumpctl[1536]: Starting kdump:
[FAILED]
....
As we can see, kdump.service cannot load the dump capture kernel because
of 'No memory reserved for crash kernel'.
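To narrow this down, it may help to look at what dom0 itself reports. The
following is a minimal sketch, assuming (I have not verified this against
the kdumpctl source) that the message reflects the size the dom0 kernel
exposes in /sys/kernel/kexec_crash_size. The function names are hypothetical;
the default paths are the usual sysfs/procfs locations.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; paths can be overridden for testing.

# Print the crash reservation size in bytes as seen by this kernel,
# or "unknown" if the sysfs file is not readable.
crash_reserved_bytes() {
    size_file=${1:-/sys/kernel/kexec_crash_size}
    if [ -r "$size_file" ]; then cat "$size_file"; else echo unknown; fi
}

# Succeed if the system looks like a Xen dom0: the hypervisor type is "xen"
# and the capabilities file lists control_d.
is_xen_dom0() {
    type_file=${1:-/sys/hypervisor/type}
    caps_file=${2:-/proc/xen/capabilities}
    [ -r "$type_file" ] || return 1
    [ "$(cat "$type_file")" = "xen" ] || return 1
    [ -r "$caps_file" ] && grep -q control_d "$caps_file"
}
```

Running crash_reserved_bytes and is_xen_dom0 with no arguments in each of
the two boot configurations would show whether the dom0 kernel sees any
reserved crash memory when crashkernel= is given only to xen.gz.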
So the questions are:
1. How can I make crash dumps of the hypervisor and the dom0?
2. How am I supposed to diagnose the thing that causes such dom0 freezes?
I thought that simply reporting on the list that my dom0 freezes would be a
waste of time without any logs or crash dumps. But I cannot even produce them.
I really want to contribute by testing Xen and submitting bugs, but I'd like
to do it with more material for the developers.
Thank you,
Grigory.
--
Best regards,
Grigory Ptashko
+7 (916) 1489766
***@gmail.com
skype grigory_ptashko
linkedin.com/in/gptashko <http://ru.linkedin.com/in/gptashko/>
facebook.com/GrigoryPtashko <https://www.facebook.com/GrigoryPtashko>