[Xen-devel] kexec+kdump troubles on xen 4.5-unstable, centos 7, x86_64 (need to get a crash dump)

Григорий Пташко

2014-10-17 18:17:14 UTC

Hello.

The long story is this. I'm running CentOS 7 with a custom-built kernel
on x86_64, and I'm trying to pass different GPUs through to Xen guests.
I've got a problem with the AMD FirePro W9100. A Windows HVM guest starts with
the GPU, and even a 3D benchmark runs OK. But after some time both the
domU and dom0 freeze.
I monitor the serial console for kernel panics but I don't see any.
So I decided to take a crash dump of the dom0 kernel to see what's going on,
and it appears that I just cannot do this.
I've tried specifying the crashkernel parameter both for xen.gz and for
my dom0 kernel (bzImage).

1. The first case: crashkernel=256M for dom0 cmdline:

bzImage crashkernel=256M

[***@kvmxen-centos7-test1-nb ~]# systemctl status kdump.service
kdump.service - Crash recovery kernel arming
...
Oct 17 21:19:38 kvmxen-centos7-test1-nb kdumpctl[1506]: kexec: loaded kdump kernel
...

[***@kvmxen-centos7-test1-nb ~]# cat /sys/kernel/kexec_crash_loaded
1

Here we see that kexec, run by kdump.service, worked: it seems to have
loaded the dump-capture kernel.
Now let's trigger a panic:

[***@kvmxen-centos7-test1-nb ~]# echo c > /proc/sysrq-trigger

In the console we see:

[ 421.673471] SysRq : Trigger a crash
[ 421.677110] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 421.685021] IP: [<ffffffff81484486>] sysrq_handle_crash+0x16/0x20
[ 421.691172] PGD 2d11e58067 PUD 2c95d3c067 PMD 0
[ 421.695900] Oops: 0002 [#1] SMP
[ 421.699210] Modules linked in: ip6table_filter ip6_tables iptable_filter
ip_tables ebtable_nat ebtables sg rpcsec_gss_krb5 nls_utf8 iTCO_wdt
iTCO_vendor_support x86_pkg_temp_thermal coretemp crct10dif_pclmul
crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
lrw gf128mul sb_edac glue_helper ablk_helper ipmi_si lpc_ich edac_core
cryptd i2c_i801 pcspkr mfd_core ipmi_msghandler mei_me ioatdma wmi mei
shpchp dca nfsd binfmt_misc mgag200 drm_kms_helper ttm drm ahci mlx4_core
libahci libata
[ 421.745725] CPU: 9 PID: 11422 Comm: bash Not tainted 3.17.0 #3
[ 421.751562] Hardware name: Supermicro
X9DRFF-iG+/-7G+/-iTG+/-7TG+/X9DRFF-iG+/-7G+/-iTG+/-7TG+, BIOS 3.0 07/29/2013
[ 421.761910] task: ffff882e94383640 ti: ffff882c71758000 task.ti:
ffff882c71758000
[ 421.769398] RIP: e030:[<ffffffff81484486>] [<ffffffff81484486>]
sysrq_handle_crash+0x16/0x20
[ 421.777961] RSP: e02b:ffff882c7175be88 EFLAGS: 00010246
[ 421.783276] RAX: 000000000000000f RBX: ffffffff81d2d780 RCX:
0000000000000000
[ 421.790416] RDX: 0000000000000000 RSI: ffff882eea52e5b8 RDI:
0000000000000063
[ 421.797557] RBP: ffff882c7175be88 R08: 0000000000000002 R09:
ffffffff82034afc
[ 421.804708] R10: 00000000000004a7 R11: 00000000000004a6 R12:
0000000000000063
[ 421.811839] R13: 0000000000000000 R14: 0000000000000007 R15:
0000000000000000
[ 421.818992] FS: 00007f1c0205b740(0000) GS:ffff882eea520000(0000)
knlGS:0000000000000000
[ 421.827075] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 421.832821] CR2: 0000000000000000 CR3: 0000002c2a879000 CR4:
0000000000042660
[ 421.839972] Stack:
[ 421.841998] ffff882c7175beb8 ffffffff81484cd7 0000000000000002
00007f1c0207f000
[ 421.849494] 0000000000000002 ffff882c7175bf48 ffff882c7175bed0
ffffffff8148517f
[ 421.857019] ffff882e94765380 ffff882c7175bef0 ffffffff81251afd
ffff882c7175bf48
[ 421.864514] Call Trace:
[ 421.866981] [<ffffffff81484cd7>] __handle_sysrq+0x107/0x170
[ 421.872645] [<ffffffff8148517f>] write_sysrq_trigger+0x2f/0x40
[ 421.878575] [<ffffffff81251afd>] proc_reg_write+0x3d/0x80
[ 421.884069] [<ffffffff811eaef7>] vfs_write+0xb7/0x1f0
[ 421.889209] [<ffffffff811ebb15>] SyS_write+0x55/0xd0
[ 421.894294] [<ffffffff8183fc29>] system_call_fastpath+0x16/0x1b
[ 421.900300] Code: 65 34 75 e5 4c 89 ef e8 d9 f7 ff ff eb db 0f 1f 80 00
00 00 00 66 66 66 66 90 55 c7 05 88 43 7f 00 01 00 00 00 48 89 e5 0f ae f8
<c6> 04 25 00 00 00 00 01 5d c3 66 66 66 66 90 55 31 c0 c7 05 2e
[ 421.920596] RIP [<ffffffff81484486>] sysrq_handle_crash+0x16/0x20
[ 421.926803] RSP <ffff882c7175be88>
[ 421.930302] CR2: 0000000000000000

And that's it. The dump-capture kernel does not start. After this kernel
panic my server just reboots.

2. The second case: crashkernel=256M in xen.gz cmdline.

xen.gz crashkernel=256M

[***@kvmxen-centos7-test1-nb ~]# systemctl status kdump.service
kdump.service - Crash recovery kernel arming
...
Active: failed (Result: exit-code) since Fri 2014-10-17 19:56:57 MSK; 1h 9min ago
...
Oct 17 19:56:57 kvmxen-centos7-test1-nb kdumpctl[1536]: No memory reserved for crash kernel.
Oct 17 19:56:57 kvmxen-centos7-test1-nb kdumpctl[1536]: Starting kdump: [FAILED]
....

As we can see, kdump.service cannot load the dump-capture kernel because of
'No memory reserved for crash kernel'.

So the questions are:

1. How can I make crash dumps of the hypervisor and the dom0?

2. How am I supposed to diagnose the thing that causes such dom0 freezes?
I thought that if I ask on the list that my dom0 freezes, it will be a waste
of time without any logs or crash dumps.. But I cannot even make them..

I really want to contribute by testing xen and submitting bugs but I'd like
to do it with more material for the developers.

Thank you,
Grigory.

--
Best regards,
Grigory Ptashko

+7 (916) 1489766
***@gmail.com
skype grigory_ptashko
linkedin.com/in/gptashko <http://ru.linkedin.com/in/gptashko/>
facebook.com/GrigoryPtashko <https://www.facebook.com/GrigoryPtashko>

Andrew Cooper

2014-10-17 19:32:39 UTC

Post by Григорий Пташко
1. How can I make crash dumps of the hypervisor and the dom0?

Kexec of domains inside themselves is not supported. Effort is being
made to make it work, but there are some architectural challenges.

The correct method is method 2, by providing a crash region in Xen for
dom0 to load into. I suspect your problem is that systemd doesn't
understand that it is running in dom0, and is attempting to load a
normal crash kernel.

An up-to-date kexec-tools and running `kexec` manually ought to do the
right thing.
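
Before running `kexec` by hand, it can help to confirm which command line actually carries the crashkernel= reservation. The following is only a minimal sketch: `crashkernel_location` is a hypothetical helper name, not part of kexec-tools or Xen, and it merely inspects the two command-line strings it is given.

```shell
#!/bin/sh
# Hypothetical helper (not part of kexec-tools or Xen).
# $1: Xen command line, $2: dom0 kernel command line.
# Prints where a crashkernel= option was found: xen, dom0, both or none.
crashkernel_location() {
    xen_has=no
    dom0_has=no
    # Pad with spaces so the option matches only as a whole word.
    case " $1 " in *" crashkernel="*) xen_has=yes ;; esac
    case " $2 " in *" crashkernel="*) dom0_has=yes ;; esac
    case "$xen_has/$dom0_has" in
        yes/no)  echo "xen" ;;
        no/yes)  echo "dom0" ;;
        yes/yes) echo "both" ;;
        *)       echo "none" ;;
    esac
}

# Possible use on a running dom0 (commands taken from this thread):
#   xen_cmdline=$(xl dmesg | grep 'Command line:')
#   dom0_cmdline=$(cat /proc/cmdline)
#   crashkernel_location "$xen_cmdline" "$dom0_cmdline"
```

A result of "xen" corresponds to method 2 above. A result of "dom0" corresponds to the first case in the original report, where the reservation sits on the dom0 kernel only.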

Post by Григорий Пташко
2. How am I supposed to diagnose the thing that causes such dom0 freezes?
I thought that if I ask on the list that my dom0 freezes, it will be a waste
of time without any logs or crash dumps.. But I cannot even make them..

If dom0 freezes, Xen should still be usable on the serial console. Press
CTRL-a three times to switch the console input to Xen.

~Andrew

Григорий Пташко

2014-10-19 09:51:26 UTC

Post by Григорий Пташко
1. How can I make crash dumps of the hypervisor and the dom0?
Kexec of domains inside themselves is not supported. Effort is being made
to make it work, but there are some architectural challenges.
The correct method is method 2, by providing a crash region in Xen for
dom0 to load into. I suspect your problem is that systemd doesn't
understand that it is running in dom0, and is attempting to load a normal
crash kernel.
An up-to-date kexec-tools and running `kexec` manually ought to do the
right thing.

OK. I've tried it again. Here's my cmdline:

APPEND xen.gz console=com1 com1=115200,8n1 crashkernel=256M iommu=1 ---
bzImage ignore_loglevel serial console=ttyS1,115200n8 ...

Here's what I see in dom0:

[***@kvmxen-centos7-test1-nb admin]# xl dmesg | grep crash
(XEN) Command line: console=com1 com1=115200,8n1 crashkernel=256M iommu=1

[***@kvmxen-centos7-test1-nb admin]# kexec -p /boot/bzImage
Memory for crashkernel is not reserved
Please reserve memory by passing "crashkernel=X@Y" parameter to the kernel
Then try loading kdump kernel

Here's the kexec's version (I built it from source rpm):

[***@kvmxen-centos7-test1-nb admin]# kexec --version
kexec-tools 2.0.4 released 17 October 2014

kdump.service is disabled in systemd. What am I doing wrong?

I monitor serial console via SOL (serial over lan) with this command:

$ ipmitool -I lanplus -U user -P passwd -H host sol activate

With the cmdline above, I don't see any Xen messages on the SOL console.
I see only the dom0 dmesg and systemd logs while my server is starting up.
After the login prompt appears I press Ctrl-A A A or Ctrl-A Ctrl-A Ctrl-A,
but nothing changes. The login prompt does not go away and I don't see any Xen
logs.

Also, when I trigger the panic manually, I can't do anything on this SOL
console.
I just see a dom0 kernel panic and the server reboots after a few seconds.

How am I supposed to reach the still-running Xen from the SOL console when a
dom0 kernel panic occurs?
Is my cmdline wrong for using the Xen serial console the way I want
(I want to see Xen staying alive when dom0 freezes)?

Thank you very much,
Grigory.

Andrew Cooper

2014-10-19 11:30:28 UTC

Post by Andrew Cooper

Post by Григорий Пташко
1. How can I make crash dumps of the hypervisor and the dom0?

Kexec of domains inside themselves is not supported. Effort is
being made to make it work, but there are some architectural
challenges.
The correct method is method 2, by providing a crash region in Xen
for dom0 to load into. I suspect your problem is that systemd
doesn't understand that it is running in dom0, and is attempting
to load a normal crash kernel.
An up-to-date kexec-tools and running `kexec` manually ought to
do the right thing.
APPEND xen.gz console=com1 com1=115200,8n1 crashkernel=256M iommu=1
--- bzImage ignore_loglevel serial console=ttyS1,115200n8 ...

ttyS1 is the second serial console, not the first. Xen should be using
com2 not com1 on the command line.

Linux should be configured to use hvc0 which will then be muxed by Xen
onto the serial.

This should now get you the Xen console ring on the serial as well.
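
Put together, the boot entry quoted above would change roughly as follows. This is a sketch of the two suggestions, not a tested configuration: it assumes the SOL port really is the second UART (com2 in Xen's numbering), and the trailing `...` is the elided remainder of the original dom0 command line, left as it was posted.

```text
APPEND xen.gz console=com2 com2=115200,8n1 crashkernel=256M iommu=1 --- bzImage ignore_loglevel serial console=hvc0 ...
```

Only the console settings differ from the original entry: com1 becomes com2 on the Xen side, and the dom0 kernel's console=ttyS1,115200n8 becomes console=hvc0.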

Post by Andrew Cooper
(XEN) Command line: console=com1 com1=115200,8n1 crashkernel=256M iommu=1
Memory for crashkernel is not reserved
Then try loading kdump kernel
kexec-tools 2.0.4 released 17 October 2014

I don't know where that date comes from, but kexec-tools 2.0.4 is much older
than that. You want 2.0.5 or newer, which contains the Xen support.
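
A quick way to check for this is to compare the version in the `kexec --version` banner against 2.0.5. The snippet below is a rough sketch: `kexec_new_enough` is a hypothetical helper name, it assumes the banner has the "kexec-tools X.Y.Z ..." form shown earlier in this thread, and it relies on `sort -V` from GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the kexec-tools version found in "$1"
# (the output of `kexec --version`) is 2.0.5 or newer.
# Returns 1 if the version is older, 2 if no version could be parsed.
kexec_new_enough() {
    # Pull the dotted version number out of the banner line.
    ver=$(printf '%s\n' "$1" |
        sed -n 's/^kexec-tools \([0-9][0-9.]*\).*/\1/p' | head -n 1)
    [ -n "$ver" ] || return 2
    # sort -V orders version strings; if 2.0.5 sorts first (or is equal),
    # the installed version is at least 2.0.5.
    lowest=$(printf '%s\n%s\n' "2.0.5" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "2.0.5" ]
}

# Possible use on a running dom0:
#   kexec_new_enough "$(kexec --version)" || echo "kexec-tools is older than 2.0.5" >&2
```

Note that the check looks only at the version number, so a locally rebuilt 2.0.4 with a recent release date, as in the output above, is still reported as too old.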

~Andrew