Bug #988
closedFirefox/Thunderbird and SSL not working with libthread_xu
Added by jgordeev over 16 years ago. Updated over 16 years ago.
0%
Description
Simon "corecode" Schubert reported XXX some months ago an issue with
Firefox/Thunderbird and SSL when using libthread_xu as the threading
library:
http://leaf.dragonflybsd.org/mailarchive/bugs/2007-12/msg00144.html
I started looking into the issue and found something that seems like a
bug to me.
I traced an execution of firefox with ktrace and observed something
strange. Below is a commented extract from the 'kdump -T' output.
// thread #4 calls umtx_sleep on 0x2a92f664
// thread #4 never returns from this syscall, according to the kdump output
75875:4 firefox-bin 1207996217.713757 CALL umtx_sleep(0x2a92f664,0,0)
// thread #1 attempts to wake up thread #4
75875:1 firefox-bin 1207996244.646643 CALL
umtx_wakeup(0x2a92f664,0x7fffffff)
// I terminate the process
75875:5 firefox-bin 1207996413.782573 PSIG SIGQUIT SIG_DFL
The expected behaviour is for thread #4 to wake up when thread #1 calls
umtx_wakeup.
I'd appreciate any feedback on the issue.
If you need information about my environment, I'd be happy to provide it.
Updated by dillon over 16 years ago
:Simon "corecode" Schubert reported XXX some months ago an issue with
:Firefox/Thunderbird and SSL when using libthread_xu as the threading
:library:
: http://leaf.dragonflybsd.org/mailarchive/bugs/2007-12/msg00144.html
:I started looking into the issue and found something that seems like a
:bug to me.
:I traced an execution of firefox with ktrace and observed something
:strange. Below is a commented extract from the 'kdump -T' output.
:// thread #4 calls umtx_sleep on 0x2a92f664
:// thread #4 never returns from this syscall, according to the kdump output
:75875:4 firefox-bin 1207996217.713757 CALL umtx_sleep(0x2a92f664,0,0)
:// thread #1 attempts to wake up thread #4
:75875:1 firefox-bin 1207996244.646643 CALL
:umtx_wakeup(0x2a92f664,0x7fffffff)
:// I terminate the process
:75875:5 firefox-bin 1207996413.782573 PSIG SIGQUIT SIG_DFL
:
:The expected behaviour is for thread #4 to wake up when thread #1 calls
:umtx_wakeup.
:
:I'd appreciate any feedback on the issue.
:If you need information about my environment, I'd be happy to provide it.
Could you paste all the ofktrace output between those two points? It
definitely should have woken up thread 4.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by jgordeev over 16 years ago
There are about 2 million lines of kdump output between these two
points, which include the output of some debugging printfs and likely
information read from/written to my personal profile. The only mentions
of 0x2a92f664 in the kdump output are the lines pasted above. And, as I
said, no return from umtx_sleep for thread 4 is noticed by ktrace.
Before I terminated the process I ran 'ps -axHl' which resulted in the
following line, among others:
UID PID TID PPID CPU PRI NI VSZ RSS WCHAN STAT TT
TIME COMMAND
1001 75875 4 74331 79 200 0 54880 43764 umtxsl IL+ p6
0:00.00 /usr/pkg/lib/firefox/firefox-bin
Notice the 'I' in 'IL+'.
The issue is very easy to reproduce - just run 'ktrace -i firefox' and
try to open an SSL site. If opening the site fails, then close the
window, wait for some time, then terminate the process. If opening the
SSL site succeeds, just try again.
I've reproduced the issue with thunderbird too, where the thread than
never woke up happened to be thread #5. I'm running DragonFly 1.12.1
with UP kernel.
I ask anybody who doesn't trust my results to try and reproduce the issue.
Updated by dillon over 16 years ago
:> :strange. Below is a commented extract from the 'kdump -T' output.
:> :// thread #4 calls umtx_sleep on 0x2a92f664
:> :// thread #4 never returns from this syscall, according to the kdump output
:> :75875:4 firefox-bin 1207996217.713757 CALL umtx_sleep(0x2a92f664,0,0)
:> :// thread #1 attempts to wake up thread #4
:> :75875:1 firefox-bin 1207996244.646643 CALL
:> :umtx_wakeup(0x2a92f664,0x7fffffff)
:> :// I terminate the process
:> :75875:5 firefox-bin 1207996413.782573 PSIG SIGQUIT SIG_DFL
:> :
:> :The expected behaviour is for thread #4 to wake up when thread #1 calls
:> :umtx_wakeup.
:> :
:> :I'd appreciate any feedback on the issue.
:> :If you need information about my environment, I'd be happy to provide it.
:>
:> Could you paste all the ofktrace output between those two points? It
:> definitely should have woken up thread 4.
:>
:> -Matt
:> Matthew Dillon
:> <dillon@backplane.com>
:>
:There are about 2 million lines of kdump output between these two
:points, which include the output of some debugging printfs and likely
:information read from/written to my personal profile. The only mentions
:of 0x2a92f664 in the kdump output are the lines pasted above. And, as I
:said, no return from umtx_sleep for thread 4 is noticed by ktrace.
:Before I terminated the process I ran 'ps -axHl' which resulted in the
:following line, among others:
: UID PID TID PPID CPU PRI NI VSZ RSS WCHAN STAT TT
:TIME COMMAND
: 1001 75875 4 74331 79 200 0 54880 43764 umtxsl IL+ p6
:0:00.00 /usr/pkg/lib/firefox/firefox-bin
:Notice the 'I' in 'IL+'.
:The issue is very easy to reproduce - just run 'ktrace -i firefox' and
:try to open an SSL site. If opening the site fails, then close the
:window, wait for some time, then terminate the process. If opening the
:SSL site succeeds, just try again.
:I've reproduced the issue with thunderbird too, where the thread than
:never woke up happened to be thread #5. I'm running DragonFly 1.12.1
:with UP kernel.
:I ask anybody who doesn't trust my results to try and reproduce the issue.
I trust them, I just want to see what else is going on... for example,
if the memory in question is remapped inbetween the sleep and the
wakeup.
2 million lines isn't too bad, but not something to post to a mailing
list. Could you throw them up on a machine that I can download from?
I can give you a leaf account if you need a machine to upload the
output to.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by dillon over 16 years ago
The core of the umtx code in the kernel is this:
umtx_sleep:
acquire and hold VM page
tsleep loop using the page's physical address as the sleep address
release page
umtx_wakeup:
acquire and hold the VM page
lookup the physical address
wakeup based on physical address.
release page
The page is not supposed to be reassigned while a thread is sleeping
in umtx_sleep on that page, so the physical address shouldn't change
between the sleep and the wakeup.
There are two possibilities that I can think of:
- Firefox is remapping the underyling virtual address
- There is a bug in the kernel implementation somewhere
that is causing the underlying page to get reassigned somehow.
I have two patches for you to try. Please try each one separately
(not both together) and tell me which ones fix the problem, or that
neither of them fixes the problem.
-Matt
PATCH #1
Index: kern_umtx.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_umtx.c,v
retrieving revision 1.7
diff u -p -r1.7 kern_umtx.c kern_umtx.c 2 Jul 2007 01:30:07
--0000 1.7>ptr, VM_PROT_READ, &error);
++ kern_umtx.c 12 Apr 2008 18:12:12 -0000@ -105,6 +105,7
@ return (EFAULT);
m = vm_fault_page_quick((vm_offset_t)uap
if (m == NULL)
return (EFAULT);
vm_page_wire(m);
sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
offset = (vm_offset_t)uap->ptr & PAGE_MASK;
@ -121,6 +122,7
@ error = EBUSY;
}
sf_buf_free(sf);
+ vm_page_unwire(m, 1);
vm_page_unhold(m);
return(error);
}
PATCH #2
Index: vm_object.c
===================================================================
RCS file: /cvs/src/sys/vm/vm_object.c,v
retrieving revision 1.31
diff u -p -r1.31 vm_object.c vm_object.c 8 Jun 2007 02:00:47 -0000 1.31
--
+++ vm_object.c 12 Apr 2008 18:16:54 -0000@ -1605,12 +1605,15
@ }
/*
* limit is our clean_only flag. If set and the page is dirty, do
+ * not free it. If set and the page is being held by someone, do
* not free it.
*/
if (info->limit && p->valid) {
vm_page_test_dirty(p);
if (p->valid & p->dirty)
return(0);
+ if (p->hold_count)
+ return(0);
}
/*
Updated by jgordeev over 16 years ago
I've tried both patches.
The first gave me a kernel panic when I started firefox:
Unread portion of the kernel message buffer:
panic: vm_page_free: invalid wire count (2), pindex: 0x2
(kgdb) bt
#0 dumpsys () at ./machine/thread.h:83
#1 0xc02bb202 in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:375
#2 0xc02bb323 in panic (
fmt=0xc0546e94 "vm_page_free: invalid wire count (%d), pindex: 0x%lx")
at /usr/src/sys/kern/kern_shutdown.c:800
#3 0xc0421b2e in vm_page_free_toq (m=0xc0c509d0)
at /usr/src/sys/vm/vm_page.c:967
#4 0xc041f399 in vm_object_backing_scan_callback (p=0xc0c509d0,
data=0xcf29a978) at /usr/src/sys/vm/vm_page.h:567
#5 0xc041ff6a in vm_page_rb_tree_RB_SCAN (head=0xcf288074,
scancmp=0xc041fe74 <vm_page_rb_tree_SCANCMP_ALL>,
callback=0xc041f15d <vm_object_backing_scan_callback>, data=0xcf29a978)
at /usr/src/sys/vm/vm_page.c:101
#6 0xc041f815 in vm_object_collapse (object=0xcf2973c0)
at /usr/src/sys/vm/vm_object.c:1148
#7 0xc041fe16 in vm_object_deallocate (object=0xcf288060)
at /usr/src/sys/vm/vm_object.c:336
#8 0xc041a84c in vm_map_delete (map=0xc188d110, start=0, end=3217031168,
countp=0xcf29aa1c) at /usr/src/sys/vm/vm_map.c:2462
#9 0xc041b4a8 in vm_map_remove (map=0xc188d110, start=0, end=3217031168)
at /usr/src/sys/vm/vm_map.c:2608
#10 0xc02ab55a in exec_new_vmspace (imgp=0xcf29aba0, vmcopy=0x0)
at /usr/src/sys/kern/kern_exec.c:716
#11 0xc0298d7c in exec_elf_imgact (imgp=0xcf29aba0)
at /usr/src/sys/kern/imgact_elf.c:570
#12 0xc02abae4 in kern_execve (nd=0xcf29ac88, args=0xcf29ac68)
at /usr/src/sys/kern/kern_exec.c:276
#13 0xc02ac2db in sys_execve (uap=0xcf29acf0)
at /usr/src/sys/kern/kern_exec.c:539
#14 0xc04a2859 in syscall2 (frame=0xcf29ad40)
at /usr/src/sys/platform/pc32/i386/trap.c:1345
#15 0xc04923c5 in Xint0x80_syscall ()
at /usr/src/sys/platform/pc32/i386/exception.s:872
#16 0x28a2c98c in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
The second patch didn't fix the issue as I've been able to successfully
reproduce the problem after applying it.
I've uploaded kdump output from a failing firefox proccess at
http://sometimes.hit.bg/ktrace-firefox.txt.gz.
Updated by dillon over 16 years ago
:I've tried both patches.
:The first gave me a kernel panic when I started firefox:
:Unread portion of the kernel message buffer:
:panic: vm_page_free: invalid wire count (2), pindex: 0x2
:
:(kgdb) bt
Ok, I think I have reproduced it on my test box. I'll spend monday
trying to track down what is going on.
-Matt
Updated by dillon over 16 years ago
I've found the problem. What is happening is that firefox is
fork()ing and this is changing the pages the mutexes are waiting on
in to copy-on-write. The causes the page to be reassigned and
the original umtx_sleep() no longer matches the umtx_wakeup().
I also found a couple of races in libthread_xu, but those weren't
the main issue.
The ktrace output was wrong, due to some issues with ktrace and
multi-threaded programs. It was calling umtx_wakeup().
I am considering my options. It can be solved fairly easily by
polling the memory location once a second to catch COW's. Unfortunately
we can't really sleep on virtual addresses, since pages can be mapped
and I really want the umtx*() calls to work with shared memory.
At the moment I do not have a clean solution but I will commit a fix
later today even if it isn't perfect.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by dillon over 16 years ago
Firefox should no longer have any problems accessing SSL sites in HEAD.
I am recommending a full buildworld/buildkernel/installworld/
installkernel sequence.
If you don't want to do all of that then recompile the kernel and
recompile and reinstall libthread_xu.
-Matt