Bug #2927
closed
e2a21467e1 Updates to show "4.7" and other changes to major headers should be temporarily suspended
Added by davshao over 8 years ago.
Updated over 8 years ago.
Description
commit 5d920ec6b97613f06aba4a09bfb91413b1fd93c3 Fix excessive ipiq recursion (2)
does not correct the problem I have observed on at least two machines that
make -j7 buildworld
locks up the machine. (When using make buildworld I move /usr/obj/usr to start the build from scratch.) On one machine there seems to be a reproducible lockup at
sh /usr/src/tools/install.sh -o root -g wheel -m 555 lto1 /usr/obj/usr/src/ctools_x86_64_x86_64/usr/libexec/gcc50
Changes to major headers should be temporarily suspended until this problem is resolved, because I have observed that even
make quickworld
can fail to complete without a lockup. Someone using UFS as their filesystem may experience quite substantial filesystem corruption after a lockup. Limiting changes to major headers gives the greatest chance of a successful make quickworld once this problem is finally resolved.
I've seen lockups too doing make -j 8 buildworld.
Hmm. The 5d920ec commit fixed the issues on the three build machines we've tested on (4-core/8-thread pkgbox64, 16-core/32-thread 2xXeon box, and the 4-socket 48-core Opteron box). If you are still getting lockups, I need your machine configuration to try to reproduce it: CPU cores/threads, amount of memory, configured swap if any, and filesystems being used (UFS?).
Also, are you running the build from a console or from X? If from X, try running from a console and see if it still reproduces (that will tell us whether there's an X interaction or not). And see if you can break into the debugger when it locks up and get a kernel dump.
-Matt
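For anyone gathering the machine details requested above, a minimal shell sketch follows. It is not from the thread: the `report` and `collect_config` helpers are hypothetical names, and it assumes the BSD-style sysctl names `hw.model`, `hw.ncpu`, and `hw.physmem` plus the `swapinfo` and `mount` utilities are present. Any command that is missing is reported and skipped rather than treated as an error.

```sh
#!/bin/sh
# Sketch only: collect CPU, memory, swap, and filesystem details for a bug report.
# Assumes BSD-style tools; adjust the command list for your system.

report() {
    # Print a header naming the command, then run it only if it is installed.
    echo "== $* =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(command failed)"
    else
        echo "(not available: $1)"
    fi
}

collect_config() {
    # CPU model, logical CPU count, and physical memory in bytes.
    report sysctl hw.model hw.ncpu hw.physmem
    # Configured swap devices, if any.
    report swapinfo
    # Mounted filesystems and their types (UFS, hammer, ...).
    report mount
}

collect_config
```

Redirecting the output to a file (for example `sh collect.sh > config.txt`, where `collect.sh` is whatever name the script was saved under) gives something that can be pasted into a comment here.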
If you are running on UFS, the only thing I see is the possibility that it is related to vfs.ffs.ffsrawbufcnt. Try disabling the use of raw buffers with a sysctl vfs.ffs.allowrawread=0 and sysctl vfs.ffs.rawreadahead=0 and see if the lockup still occurs.
-Matt
Ach, followup on that last one... the rawread stuff is only used if the kernel is customized with the DIRECTIO option. If you aren't doing that, there is no rawread stuff so ignore that last bit.
-Matt
Update: I've been doing multiple full buildworlds on the current master and haven't seen a lockup. This is on a 4-core Skylake machine and a 6-core AMD machine, using hammer (no UFS).
Ok, I've reproduced the bug on my test box. I should be able to locate and fix it today, hopefully. It reproduces fairly quickly (less than an hour).
-Matt
- Status changed from New to In Progress
- Assignee set to dillon
- Priority changed from Normal to High
- % Done changed from 0 to 80
Ok, a fix has been committed to master and release. Please test at your convenience. We are testing on several machines here as well.
commits through to 7fb451cb3c27563ba7a (kernel - Fix excessive ipiq recursion (4)).
-Matt
Hi all.
I was previously able to reproduce lockups with hammer2 fairly reliably, in 5 minutes or so, by transferring large (1G, 4G) files to the filesystem. After the last patch I was able to transfer the files completely at least once. After transferring the files I tried to remove them, which caused a panic and a core:
chain 00000000c1140010.02 key=0000000000000400 meth=30 CHECK FAIL (flags=00146002, bref/data 81ee268bc3884e4f/c7a20c5a57036b22)
chain 00000000c1140010.02 key=0000000000000400 meth=30 CHECK FAIL (flags=00146002, bref/data 81ee268bc3884e4f/c7a20c5a57036b22)
chain 00000000c1140010.02 key=0000000000000400 meth=30 CHECK FAIL (flags=00146002, bref/data 81ee268bc3884e4f/c7a20c5a57036b22)
chain 00000000c1140010.02 key=0000000000000400 meth=30 CHECK FAIL (flags=00146002, bref/data 81ee268bc3884e4f/c7a20c5a57036b22)
chain 00000000c1140010.02 key=0000000000000400 meth=30 CHECK FAIL (flags=00146202, bref/data 81ee268bc3884e4f/c7a20c5a57036b22)
panic: delete base 0xffffffe14c85c000 element not found at 0/512 elm 0xffffffe3434ccde0
cpuid = 3
Trace beginning at frame 0xffffffe3430b8480
panic() at panic+0x25f 0xffffffff8027cb46
panic() at panic+0x25f 0xffffffff8027cb46
hammer2_base_delete() at hammer2_base_delete+0x9f 0xffffffff81783cfb
hammer2_flush_core() at hammer2_flush_core+0xa4e 0xffffffff81788f2c
hammer2_flush_recurse() at hammer2_flush_recurse+0xa0 0xffffffff81789100
hammer2_chain_tree_RB_SCAN() at hammer2_chain_tree_RB_SCAN+0x105 0xffffffff817822b1
boot() called on cpu#3
Uptime: 5m20s
Physical memory: 15536 MB
This can be totally unrelated though.
On the other hand, when I rebooted and savecore tried to save a core dump, I had a hard lockup.
I installed a brand new drive in a machine (hammer, encrypted) and tried to cpdup several hundred GB to it from an existing drive and the machine locked up part way through.
Even after
commit 1a5c7e0f9aa6bc9a632a92b6b832cfe676746f7f
Date: Sat Jul 23 19:19:46 2016 -0700
kernel - Refactor Xinvltlb a little, turn off the idle-thread invltlb opt
make -j7 buildworld
locks up on both an Intel i3-3225 (Ivy Bridge, 2-core with hyperthreading)
ASRock Z77M motherboard machine and an AMD Athlon II X4 610e (quad-core)
Asus M5A87 machine.
Definitely looking good so far after the most recent patch!
- % Done changed from 80 to 90
Yup, I think 743146ae2a6000 did the trick. No lockups overnight, and several people helping us test also reported no more lockups.
-Matt
Multiple machines, including the Intel i3-3225 and the AMD Athlon II X4 610e machines, have successfully completed a full
make -j7 buildworld buildkernel
Thanks for all the hard work to debug this!
All the hammer-related issues I was seeing have resolved. Thanks Matt!
- Status changed from In Progress to Closed
- % Done changed from 90 to 100
Fixed with 743146ae2a6000, closing bug. Thanks all!