Bug #1079

panic: already on hash list

Added by qhwt+dfly over 5 years ago. Updated over 5 years ago.

Status: Closed
Start date:
Priority: Urgent
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

Hi.
Caught this panic while playing with my build box. I don't know the exact
moment when it panicked, as the monitor was connected to another box, but
here's what I was doing before the panic anyway:

- Logged in to the build box from the console, slogin'ed to the router box
(running DragonFly 1.8), and rebooted it.
- Switched the monitor/keyboard to the router box, waited for it to boot,
and logged in on the console. slogin'ed to the build machine, started GNU
screen, started w3m (a text-based web browser, similar to lynx), and tried
to visit Google, but it didn't work (this was expected, because for some
reason mpd starts up earlier than ipnat and I always have to restart before
I can connect to the Internet from behind the router).
- Split the screen inside GNU screen, and typed ctrl+C in w3m. A few seconds
later I realized that the build box was unresponsive, and then I heard
the machine reboot.

Probably this panic only occurs on SMP machines, but I'd like to make sure
it won't happen on the router box before upgrading it.

The kernel dump is uploaded as ~y0netan1/crash/{kernel,vmcore}.11 on my
leaf account, in case someone is interested.

Thanks.

Unread portion of the kernel message buffer:
panic: already on hash list
mp_lock = 00000000; cpuid = 0
boot() called on cpu#0

syncing disks... 3
done
Uptime: 1h31m54s

dumping to dev #ad/0x20001, blockno 640
[snipped]
(kgdb) bt
#0 dumpsys () at ./machine/thread.h:83
#1 0xc0198cbd in boot (howto=256)
at /home/source/dragonfly/R2_0/src/sys/kern/kern_shutdown.c:375
#2 0xc0198f80 in panic (fmt=0xc02e2386 "already on hash list")
at /home/source/dragonfly/R2_0/src/sys/kern/kern_shutdown.c:800
#3 0xc01ff460 in in_pcbinsconnhash (inp=0xd316c260)
at /home/source/dragonfly/R2_0/src/sys/netinet/in_pcb.c:1053
#4 0xc02124d7 in tcp_connect_oncpu (tp=0xd316c320, sin=0xd7ef3d80,
if_sin=0xd316c260)
at /home/source/dragonfly/R2_0/src/sys/netinet/tcp_usrreq.c:917
#5 0xc02126e1 in tcp_connect (tp=0xd316c320, nam=<value optimized out>,
td=<value optimized out>)
at /home/source/dragonfly/R2_0/src/sys/netinet/tcp_usrreq.c:1036
#6 0xc02138b1 in tcp_usr_connect (so=0xd2c80fc0, nam=0xd7ef3d80,
td=0xd4102590)
at /home/source/dragonfly/R2_0/src/sys/netinet/tcp_usrreq.c:479
#7 0xc01c9972 in netmsg_pru_connect (msg=0xd62d0bd0)
at /home/source/dragonfly/R2_0/src/sys/kern/uipc_msg.c:450
#8 0xc020f0bc in tcpmsg_service_loop (dummy=0x0)
at /home/source/dragonfly/R2_0/src/sys/netinet/tcp_subr.c:385
#9 0xc01a0961 in lwkt_deschedule_self (td=Cannot access memory at address 0x8
)
at /home/source/dragonfly/R2_0/src/sys/kern/lwkt_thread.c:223

History

#1 Updated by dillon over 5 years ago

:Hi.
:Caught this panic while playing with my build box. I don't know the exact
:moment when it panicked, as the monitor was connected to another box, but
:here's what I was doing before the panic anyway:
:
:...
:
:Probably this panic only occurs on SMP machines, but I'd like to make sure
:it won't happen on the router box before upgrading it.
:
:The kernel dump is uploaded as ~y0netan1/crash/{kernel,vmcore}.11 on my
:leaf account, in case someone is interested.
:
:Thanks.
: at /home/source/dragonfly/R2_0/src/sys/kern/kern_shutdown.c:375
:#2 0xc0198f80 in panic (fmt=0xc02e2386 "already on hash list")
: at /home/source/dragonfly/R2_0/src/sys/kern/kern_shutdown.c:800
:#3 0xc01ff460 in in_pcbinsconnhash (inp=0xd316c260)
: at /home/source/dragonfly/R2_0/src/sys/netinet/in_pcb.c:1053
:#4 0xc02124d7 in tcp_connect_oncpu (tp=0xd316c320, sin=0xd7ef3d80,
: if_sin=0xd316c260)
: at /home/source/dragonfly/R2_0/src/sys/netinet/tcp_usrreq.c:917
:#5 0xc02126e1 in tcp_connect (tp=0xd316c320, nam=<value optimized out>,
: td=<value optimized out>)
: at /home/source/dragonfly/R2_0/src/sys/netinet/tcp_usrreq.c:1036

Definitely one for Seph. The permissions are set properly on your
crash dump so Seph should be able to dive into it with kgdb on leaf.

-Matt
Matthew Dillon

#2 Updated by sepherosa over 5 years ago

I think this ctrl+C is important to the problem :)

One thing I need your help to confirm: does w3m put the socket
into non-blocking mode? I wasn't able to download the w3m source
code.

I think it may be caused by the following pattern of user code:

s = socket();
/* s is not put into nonblock mode */
while (1) {
    if (connect(s) < 0) {       <==== here you hit ctrl+C
        if (errno == EINTR)
            continue;           <==== another connect(2) attempt on 's'
        ...
    }
    ...
}

We probably could create a much simpler test program by using the
above code pattern to reproduce the panic ...
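
For anyone who wants to try this, the retry loop above could be written out roughly as follows. This is only a sketch: connect_retry_eintr() and its max_tries bound are made-up names for illustration, not anything taken from w3m or the kernel, and the caller is assumed to have created a blocking socket and filled in the destination address.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: retry a blocking connect(2) on the SAME socket
 * whenever the call is interrupted by a signal (EINTR).
 * Returns 0 on success, or -1 on any other error or once max_tries
 * attempts have all been interrupted; errno is left as set by the
 * last connect(2) call.
 */
static int
connect_retry_eintr(int s, const struct sockaddr *sa, socklen_t len,
    int max_tries)
{
	int i;

	assert(sa != NULL && max_tries > 0);

	for (i = 0; i < max_tries; i++) {
		if (connect(s, sa, len) == 0)
			return 0;
		if (errno != EINTR)
			return -1;	/* a real failure: do not retry */
		/* interrupted: issue another connect(2) on the same socket */
	}
	return -1;
}
```

The bound on retries is there only so that a test run cannot spin forever; the pattern suspected in w3m would loop unconditionally.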

The facts from the dump that support my assumption below are:
1) so_state is 0
2) the inpcb is on the hash list (both the flag and the link fields prove that)

I think the following happened, if w3m used the code pattern I listed above:
- connect(2) is blocking, so the first call to connect(2) makes
kern_connect() block in lwkt_domsg()
- ctrl+C makes the lwkt_domsg() in kern_connect() return.
SS_ISCONNECTING is cleared in so_state, so so_state becomes 0, but
the inpcb is left on the hash list since the earlier soconnect() succeeded.
- the second connect(2) syscall hits the wall (soconnect() calls
so_pru_connect(), since so_state is 0)
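
The suspected sequence can be replayed with a small userland toy model. Everything here is an assumption made for illustration: the toy_* names, the TOY_ISCONNECTING flag and the on_hash field are stand-ins, not the real kernel structures or code paths; the model only encodes the three steps listed above.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ISCONNECTING 0x1	/* stand-in for SS_ISCONNECTING */

enum toy_result { TOY_OK, TOY_BUSY, TOY_PANIC };

struct toy_sock {
	int  so_state;	/* socket state flags */
	bool on_hash;	/* stand-in for "inpcb is on the hash list" */
};

/* Stand-in for the connect path: refuse if a connect is in progress,
 * otherwise mark the socket as connecting and insert into the hash. */
static enum toy_result
toy_connect(struct toy_sock *so)
{
	if (so->so_state & TOY_ISCONNECTING)
		return TOY_BUSY;
	so->so_state |= TOY_ISCONNECTING;
	if (so->on_hash)
		return TOY_PANIC;	/* "already on hash list" */
	so->on_hash = true;
	return TOY_OK;
}

/* Stand-in for the interrupted wait: the state flag is cleared,
 * but the hash entry is left behind. */
static void
toy_interrupt(struct toy_sock *so)
{
	so->so_state &= ~TOY_ISCONNECTING;
}

/* Replay: connect, ctrl+C, connect again on the same socket. */
static enum toy_result
toy_replay(void)
{
	struct toy_sock so = { 0, false };
	enum toy_result first = toy_connect(&so);

	assert(first == TOY_OK);
	(void)first;
	toy_interrupt(&so);
	assert(so.so_state == 0 && so.on_hash);	/* matches the dump */
	return toy_connect(&so);		/* second connect(2) */
}
```

In this model the second toy_connect() gets past the state check because so_state is 0, and then finds the entry still on the hash, which is the combination seen in the dump.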

I would appreciate it if you could reproduce the panic using the user
code I mentioned above. I probably can't do it today; I won't get
back home before 10pm :(

Best Regards,
sephe

#3 Updated by nthery over 5 years ago

tcp_connect_oncpu() first looks up the inpcb in the hash list, then
adds it if it is not already there. So if it is on the hash list when
the OS panics, could it mean that it was inserted by another thread
after the failed lookup?

#4 Updated by sepherosa over 5 years ago

Hmm, might the user program be trying to connect to a different remote address?

I think all syscalls are under the BGL, and the TCP protocol threads are
under the BGL too. Even if they were not, connection creation (e.g. linking
the inpcb into the hash table) is serialized by the TCP thread on cpu0.
The SS_ISCONNECTING check in soconnect() may be racy, but it is probably
not the cause of this panic, since so_state is 0.

Best Regards,
sephe

#5 Updated by qhwt+dfly over 5 years ago

Maybe, or maybe not. I wasn't looking at the panicked machine's console
at that time, so I'm not sure if it was ctrl+C that triggered the panic,
or it was already dumping the kernel memory when I hit ctrl+C. Or it
could be SIGWINCH which triggered it. In any case, the panic is not
easy to reproduce. I also saw a complete lock up (no ctrl+alt+esc) when
I slogin'ed to this machine yesterday, but I'm not sure if it's related
to this problem at all.

I took a quick look at the w3m source code, and it doesn't seem to put
the socket into non-blocking mode.
In case you can't find it elsewhere, here's a copy:
http://122.249.219.233/distfiles/w3m-0.5.2.tar.gz

If I run the above code and hit ctrl+C, the program terminates. If I change
it to mask SIGINT, it ignores ctrl+C until the connection times out, and in
the following iterations connect() fails immediately. Hmm...

Thanks.

#6 Updated by qhwt+dfly over 5 years ago

The fix was committed to HEAD and MFCed by sephe@.
