Bug #1558

vkernel stuck in umtxsl state when booting from a NFS root mount

Added by rumcic about 5 years ago. Updated almost 5 years ago.

Status:ClosedStart date:
Priority:NormalDue date:
Assignee:-% Done:

0%

Category:-
Target version:-

Description

When booting a vkernel with '-n 1' I'm not sure when was the last time that it
got stuck, while booting with '-n 2' fails in most if not all cases.

I can either get a useful backtrace:
0x080bccf2 in lwkt_process_ipiq () at /usr/src/sys/kern/lwkt_ipiq.c:471
471 for (n = 0; n < ncpus; ++n) {
Current language: auto; currently c
(gdb) bt
#0 0x080bccf2 in lwkt_process_ipiq () at /usr/src/sys/kern/lwkt_ipiq.c:471
#1 0x080bceb9 in lwkt_send_ipiq3 (target=0x41400000, func=0x80b1428
<systimer_intr>, arg1=0x83c958c, arg2=0) at /usr/src/sys/kern/lwkt_ipiq.c:189
#2 0x081dcc58 in vktimer_intr (dummy=0x0, frame=0x0)
at /usr/src/sys/platform/vkernel/platform/systimer.c:218
#3 0x081d70cd in kqueue_intr (arg=0x0, frame=0x0)
at /usr/src/sys/platform/vkernel/platform/kqueue.c:201
#4 0x08091df3 in ithread_handler (arg=0x1)
at /usr/src/sys/kern/kern_intr.c:807
#5 0x080bb547 in lwkt_deschedule_self (td=Cannot access memory at address 0x8
) at /usr/src/sys/kern/lwkt_thread.c:271

Or it gets stuck in umtxsl state and never does anything useful (and the
backtrace doesn't seem useful to me):
./kernel.debug -m 128m -I
tap0:bridge0 -e 'boot.netif.ip=192.168.250.235:boot.netif.netmask=255.255.255.0:boot.netif.hwaddr="00:be:0e:1d:00:00":boot.nfsroot.server=192.168.250.1:boot.nfsroot.path=/var/vkernel/vkernel_root' -v -r
neki.img -r neki2.img -n 2
Using memory file: /var/vkernel/memimg.000000
KVM mapped at 0x40000000-0x80000000
TAP UNIT 0
Copyright (c) 2003-2009 The DragonFly Project.
Copyright (c) 1992-2003 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
Slab ZoneSize set to 64KB
Leaving critical section, allowing interrupts
DragonFly v2.5.1.25.gec87c-DEVELOPMENT #64: Mon Oct 5 11:08:28 CEST 2009
:/usr/obj/usr/src/sys/MYVKERNEL
real memory = 134217728 (131072K bytes)
avail memory = 127180800 (124200K bytes)
DragonFly/MP: Multiprocessor
cpu0 (BSP)
cpu1 (AP)
initclocks
Finish MP startup
SMP: AP CPU #1 Launched!
Active CPU Mask: 00000003
objcache(exec-args): too small for ncpus, adjusting cluster_limit 16->32
start scheduler helpers on cpus: 0 1
start dummy scheduler helpers on cpus: 0 1
vkd0: type 0xa5, start 1048576, end = 1048575, size 0
vkd0: C/H/S start 1/0/1 (0) != start 1048576: invalid
vkd0: C/H/S end 0/0/0 (4294967295) != end 1048575: invalid
vkd1: type 0xa5, start 1048576, end = 1048575, size 0
vkd1: C/H/S start 1/0/1 (0) != start 1048576: invalid
vkd1: C/H/S end 0/0/0 (4294967295) != end 1048575: invalid
vke0: bpf attached
vke0: MAC address: 00:be:0e:1d:00:00
md0: Malloc disk
Device configuration finished.
lo0: bpf attached
ip: MPSAFE
arp: MPSAFE
PXE: Warning, no NFS handle passed from loader
vinum: loaded
Mounting root from nfs:
nfs_mountroot: interface vke0 ip 192.168.250.235 bcast 192.168.250.255 mask
255.255.255.0
NFS_ROOT: 192.168.250.1:/var/vkernel/vkernel_root
NFS_ROOT: No FH passed from loader, attempting mount rpc...success!
Mounting devfs
start_init: trying /sbin/init
load: 0.07 cmd: sh 2 [getblk] 0.00u 0.00s 0% 504k
load: 0.07 cmd: sh 2 [getblk] 0.00u 0.00s 0% 504k
load: 0.07 cmd: sh 2 [getblk] 0.00u 0.00s 0% 504k

on host machine top shows:
94798 root 152 0 1030M 11192K umtxsl 0 0:00 0.00% 0.00% kernel.debug

gdb ./kernel.debug 94798
GNU gdb 6.7.1
Copyright (C) 2007 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i386-dragonfly"...
Attaching to program: /usr/obj/usr/src/sys/MYVKERNEL/kernel.debug, process
94798
Reading symbols from /usr/lib/libthread_xu.so.2...done.
Loaded symbols for /usr/lib/libthread_xu.so.2
Reading symbols from /usr/lib/libc.so.7...done.
Loaded symbols for /usr/lib/libc.so.7
Reading symbols from /usr/libexec/ld-elf.so.2...done.
Loaded symbols for /usr/libexec/ld-elf.so.2
0x282a9223 in umtx_sleep () at umtx_sleep.S:2
2 RSYSCALL(umtx_sleep)
(gdb) bt
#0 0x282a9223 in umtx_sleep () at umtx_sleep.S:2
#1 0x081d3cb9 in cpu_idle ()
at /usr/src/sys/platform/vkernel/i386/cpu_regs.c:722
#2 0x00000000 in ?? ()
Current language: auto; currently asm
--
Regards,
Rumko

History

#1 Updated by dillon about 5 years ago

:When booting a vkernel with '-n 1' I'm not sure when was the last time that it
:got stuck, while booting with '-n 2' fails in most if not all cases.

It will take some time to reproduce this, I have to set up a
NFS network root.

-Matt

#2 Updated by dillon about 5 years ago

:When booting a vkernel with '-n 1' I'm not sure when was the last time that it
:got stuck, while booting with '-n 2' fails in most if not all cases.

Ok, all the NFS issues with vkernels should now be fixed. There were
several problems but the ones causing the permanent stalls were

(1) NFS was running the kernel out of mbufs, and

(2) The NFS root mount was NFSv2 instead of NFSv3, and there appears
to be bugs with large write()s on NFSv2 mounts.

I also changed the default mount type from UDP to TCP which will improve
performance.

I have not figured out what the problem with NFSv2 write()'s is yet.
I'm still working on that one. But the vkernel should now mount w/
NFSv3 and thus work around that issue.

-Matt

#3 Updated by rumcic about 5 years ago

Matthew Dillon wrote:

>
> :When booting a vkernel with '-n 1' I'm not sure when was the last time that
> :it got stuck, while booting with '-n 2' fails in most if not all cases.
>
> Ok, all the NFS issues with vkernels should now be fixed. There were
> several problems but the ones causing the permanent stalls were
>
> (1) NFS was running the kernel out of mbufs, and
>
> (2) The NFS root mount was NFSv2 instead of NFSv3, and there appears
> to be bugs with large write()s on NFSv2 mounts.
>
> I also changed the default mount type from UDP to TCP which will improve
> performance.
>
> I have not figured out what the problem with NFSv2 write()'s is yet.
> I'm still working on that one. But the vkernel should now mount w/
> NFSv3 and thus work around that issue.
>
> -Matt

Can't seem to be able to reproduce any more hangs.

Thank you
--
Regards,
Rumko

#4 Updated by rumcic about 5 years ago

Rumko wrote:

> Matthew Dillon wrote:
>
>>
>> :When booting a vkernel with '-n 1' I'm not sure when was the last time that
>> :it got stuck, while booting with '-n 2' fails in most if not all cases.
>>
>> Ok, all the NFS issues with vkernels should now be fixed. There were
>> several problems but the ones causing the permanent stalls were
>>
>> (1) NFS was running the kernel out of mbufs, and
>>
>> (2) The NFS root mount was NFSv2 instead of NFSv3, and there appears
>> to be bugs with large write()s on NFSv2 mounts.
>>
>> I also changed the default mount type from UDP to TCP which will improve
>> performance.
>>
>> I have not figured out what the problem with NFSv2 write()'s is yet.
>> I'm still working on that one. But the vkernel should now mount w/
>> NFSv3 and thus work around that issue.
>>
>> -Matt
>
> Can't seem to be able to reproduce any more hangs.
>
> Thank you

Spoke too soon it seems, was able to make the vkernel hang in the umtxsl state
again.
The problem seems to be the combination of -n X (X > 1) and without -l (so
without locking the threads to specific cpus).

When it got stuck, but wasn't in the umtxsl state (it was using the CPU, but
doing nothing useful) the backtrace looked like (part of the gdb session):
Program received signal SIGINFO, Information request.
lwkt_process_ipiq_core (sgd=<value optimized out>, ip=0x5085d000, frame=0x0)
at /usr/src/sys/kern/lwkt_ipiq.c:582
582 return(wi != ip->ip_windex);
(gdb) bt
#0 lwkt_process_ipiq_core (sgd=<value optimized out>, ip=0x5085d000,
frame=0x0) at /usr/src/sys/kern/lwkt_ipiq.c:582
#1 0x080bc31f in lwkt_process_ipiq () at /usr/src/sys/kern/lwkt_ipiq.c:476
#2 0x080bc4a1 in lwkt_send_ipiq3 (target=0x41400000, func=0x80b0a10
<systimer_intr>, arg1=0x8318acc, arg2=0) at /usr/src/sys/kern/lwkt_ipiq.c:189
#3 0x081de3fc in vktimer_intr (dummy=0x0, frame=0x0)
at /usr/src/sys/platform/vkernel/platform/systimer.c:218
#4 0x081d8871 in kqueue_intr (arg=0x0, frame=0x0)
at /usr/src/sys/platform/vkernel/platform/kqueue.c:201
#5 0x080913eb in ithread_handler (arg=0x1)
at /usr/src/sys/kern/kern_intr.c:807
#6 0x080bab2f in lwkt_deschedule_self (td=Cannot access memory at address 0x8
) at /usr/src/sys/kern/lwkt_thread.c:271
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) c
Continuing.

Program received signal SIGINFO, Information request.
0x081d87f1 in globaldata_find (cpu=1)
at /usr/src/sys/platform/vkernel/platform/globaldata.c:61
61 KKASSERT(cpu >= 0 && cpu < ncpus);
(gdb) bt
#0 0x081d87f1 in globaldata_find (cpu=1)
at /usr/src/sys/platform/vkernel/platform/globaldata.c:61
#1 0x080bc2f2 in lwkt_process_ipiq () at /usr/src/sys/kern/lwkt_ipiq.c:473
#2 0x080bc4a1 in lwkt_send_ipiq3 (target=0x41400000, func=0x80b0a10
<systimer_intr>, arg1=0x8318acc, arg2=0) at /usr/src/sys/kern/lwkt_ipiq.c:189
#3 0x081de3fc in vktimer_intr (dummy=0x0, frame=0x0)
at /usr/src/sys/platform/vkernel/platform/systimer.c:218
#4 0x081d8871 in kqueue_intr (arg=0x0, frame=0x0)
at /usr/src/sys/platform/vkernel/platform/kqueue.c:201
#5 0x080913eb in ithread_handler (arg=0x1)
at /usr/src/sys/kern/kern_intr.c:807
#6 0x080bab2f in lwkt_deschedule_self (td=Cannot access memory at address 0x8
) at /usr/src/sys/kern/lwkt_thread.c:271
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
--
Regards,
Rumko

#5 Updated by rumcic about 5 years ago

Rumko wrote:
<snip>

Note: is not 100% reproducible ... did quite a few successful boots right now
and am not able to reproduce it ... must've been a fluke or something

#6 Updated by rumcic almost 5 years ago

Could not repeat, so mark as resolved

Also available in: Atom PDF