Bug #2927

e2a21467e1 Updates to show "4.7" and other changes to major headers should be temporarily suspended

Added by davshao 8 months ago. Updated 8 months ago.

Status: Closed
Priority: High
Assignee:
Category: -
Target version: -
Start date: 07/20/2016
Due date:
% Done: 100%

Description

commit 5d920ec6b97613f06aba4a09bfb91413b1fd93c3 Fix excessive ipiq recursion (2)
does not correct the problem I have observed on at least two machines that

make -j7 buildworld

locks up the machine. (When using make buildworld I move /usr/obj/usr to start the build from scratch.) On one machine there seems to be a reproducible lock up at

sh /usr/src/tools/install.sh -o root -g wheel -m 555 lto1 /usr/obj/usr/src/ctools_x86_64_x86_64/usr/libexec/gcc50

Changes to major headers should be temporarily suspended until this problem is resolved, because I have observed that even

make quickworld

can fail to complete without a lockup. Someone using UFS as their filesystem may experience quite substantial filesystem corruption after a lockup. Limiting changes to major headers gives the greatest chance of a successful make quickworld once this problem is finally resolved.
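
The from-scratch build described above can be sketched as a small script. This is only an illustration: the `.old` destination, the `DRY_RUN` switch, and the `run` and `repro_buildworld` helpers are assumptions and not part of the report. Only the `/usr/obj/usr` path and the `make -j7 buildworld` invocation come from the description, and `/usr/src` is assumed to be the source tree because it appears in the install.sh path.

```shell
#!/bin/sh
# Sketch of the from-scratch build described in the report.
# DRY_RUN, run(), repro_buildworld() and the ".old" suffix are hypothetical.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Always print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

repro_buildworld() {
    # Move the old object tree aside so the build starts from scratch.
    run mv /usr/obj/usr /usr/obj/usr.old
    # Enter the source tree (assumed location).
    run cd /usr/src
    # Parallel build that was reported to lock up the machine.
    run make -j7 buildworld
}

repro_buildworld
```

By default the script only prints the commands. Running it with `DRY_RUN=0` as root would actually move the object tree and start the build, so on an affected kernel it may lock up the machine.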

History

#1 Updated by t_dfbsd 8 months ago

I've seen lockups too doing make -j 8 buildworld.

#2 Updated by dillon 8 months ago

Hmm. The 5d920ec commit fixed the issues on the three build machines we've tested on (4-core/8-thread pkgbox64, 16-core/32-thread 2xXeon box, and the 4-socket 48-core opteron box). If you are still getting lockups I need your machine configuration to try to reproduce it: CPU cores/threads, amount of memory, configured swap if any, and filesystems being used (UFS?).

Also, are you running the build from a console or from X? If from X, try running from a console and see if it still reproduces (that will tell us whether there's an X interaction or not). And see if you can break into the debugger when it locks up and get a kernel dump.

-Matt
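
One possible way to gather the details requested above into a single report is sketched below. The `show` and `collect_config` helpers are hypothetical names, and the choice of `sysctl hw.model hw.ncpu hw.physmem`, `swapinfo`, and `mount` is an assumption about which tools are convenient, not something the comment prescribes. Note that `hw.ncpu` reports logical CPUs only, so the core/thread split still has to be stated by hand.

```shell
#!/bin/sh
# Sketch: collect the machine details requested in the comment above.
# show() and collect_config() are hypothetical helper names.

show() {
    # Print a header, then the command output; never abort on failure.
    echo "== $* =="
    "$@" 2>/dev/null || echo "(unavailable on this system)"
}

collect_config() {
    show sysctl hw.model hw.ncpu hw.physmem   # CPU model, logical CPUs, RAM
    show swapinfo                             # configured swap, if any
    show mount                                # mounted filesystems and types
}

collect_config
```

The output can be pasted into the bug report as-is. Whether the build was run from a console or from X still needs to be mentioned separately.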

#3 Updated by dillon 8 months ago

If you are running on UFS, the only thing I see is the possibility that it is related to vfs.ffs.ffsrawbufcnt. Try disabling the use of raw buffers with sysctl vfs.ffs.allowrawread=0 and sysctl vfs.ffs.rawreadahead=0 and see if the lockup still occurs.

-Matt
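
The two settings named above could be applied with a short script like the following sketch. The `DRY_RUN` switch and the `run` and `disable_rawread` helpers are hypothetical. Only the two sysctl names and values come from the comment. As the follow-up comment points out, these sysctls are only relevant on a kernel built with the DIRECTIO option.

```shell
#!/bin/sh
# Sketch: disable UFS raw reads via the two sysctls named above.
# DRY_RUN, run() and disable_rawread() are hypothetical.
# By default nothing is changed; commands are only printed.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Always print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

disable_rawread() {
    run sysctl vfs.ffs.allowrawread=0
    run sysctl vfs.ffs.rawreadahead=0
}

disable_rawread
```

Run with `DRY_RUN=0` as root to apply the settings. They do not persist across a reboot unless they are also added to the system's sysctl configuration.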

#4 Updated by dillon 8 months ago

Ach, followup on that last one... the rawread stuff is only used if the kernel is customized with the DIRECTIO option. If you aren't doing that, there is no rawread stuff so ignore that last bit.

-Matt

#5 Updated by t_dfbsd 8 months ago

Update: I've been doing multiple full buildworlds on the current master and haven't seen a lockup. This is on a 4-core Skylake machine and a 6-core AMD machine, using hammer (no UFS).

#6 Updated by dillon 8 months ago

Ok, I've reproduced the bug on my test box. I should be able to locate and fix it today, hopefully. It reproduces fairly quickly (less than an hour).

-Matt

#7 Updated by dillon 8 months ago

  • Status changed from New to In Progress
  • Assignee set to dillon
  • Priority changed from Normal to High
  • % Done changed from 0 to 80

Ok, a fix has been committed to master and release. Please test at your convenience. We are testing on several machines here as well.

commits through to 7fb451cb3c27563ba7a (kernel - Fix excessive ipiq recursion (4)).

-Matt

#8 Updated by arcade@b1t.name 8 months ago

Hi all.

I was previously able to reproduce lockups with hammer2 fairly reliably, in 5 minutes or so, by transferring large files (1G, 4G) to the filesystem. After the last patch I was able to transfer the files completely at least once. After transferring the files I tried to remove them, which caused a core:

chain 00000000c1140010.02 key=0000000000000400 meth=30 CHECK FAIL (flags=00146002, bref/data 81ee268bc3884e4f/c7a20c5a57036b22)
chain 00000000c1140010.02 key=0000000000000400 meth=30 CHECK FAIL (flags=00146002, bref/data 81ee268bc3884e4f/c7a20c5a57036b22)
chain 00000000c1140010.02 key=0000000000000400 meth=30 CHECK FAIL (flags=00146002, bref/data 81ee268bc3884e4f/c7a20c5a57036b22)
chain 00000000c1140010.02 key=0000000000000400 meth=30 CHECK FAIL (flags=00146002, bref/data 81ee268bc3884e4f/c7a20c5a57036b22)
chain 00000000c1140010.02 key=0000000000000400 meth=30 CHECK FAIL (flags=00146202, bref/data 81ee268bc3884e4f/c7a20c5a57036b22)
panic: delete base 0xffffffe14c85c000 element not found at 0/512 elm 0xffffffe3434ccde0

cpuid = 3
Trace beginning at frame 0xffffffe3430b8480
panic() at panic+0x25f 0xffffffff8027cb46
panic() at panic+0x25f 0xffffffff8027cb46
hammer2_base_delete() at hammer2_base_delete+0x9f 0xffffffff81783cfb
hammer2_flush_core() at hammer2_flush_core+0xa4e 0xffffffff81788f2c
hammer2_flush_recurse() at hammer2_flush_recurse+0xa0 0xffffffff81789100
hammer2_chain_tree_RB_SCAN() at hammer2_chain_tree_RB_SCAN+0x105 0xffffffff817822b1
boot() called on cpu#3
Uptime: 5m20s
Physical memory: 15536 MB

This can be totally unrelated though.

On the other hand, when I rebooted and savecore tried to save a core dump, I had a hard lockup.

#9 Updated by t_dfbsd 8 months ago

I installed a brand new drive in a machine (hammer, encrypted) and tried to cpdup several hundred GB to it from an existing drive, and the machine locked up partway through.

#10 Updated by davshao 8 months ago

Even after

commit 1a5c7e0f9aa6bc9a632a92b6b832cfe676746f7f
Date: Sat Jul 23 19:19:46 2016 -0700

kernel - Refactor Xinvltlb a little, turn off the idle-thread invltlb opt

make -j7 buildworld

locks up on both an Intel i3-3225 CPU (Ivy Bridge, 2 cores with hyperthreading)
Asrock Z77M motherboard machine and an AMD Athlon II X4 610e CPU (quad-core)
Asus M5A87 machine.

#11 Updated by t_dfbsd 8 months ago

Definitely looking good so far after the most recent patch!

#12 Updated by dillon 8 months ago

  • % Done changed from 80 to 90

Yup, I think 743146ae2a6000 did the trick. No lockups overnight, and several people helping us test also reported no more lockups.

-Matt

#13 Updated by davshao 8 months ago

Multiple machines including the Intel i3-3225 CPU and the AMD Athlon II X4 610e CPU machines have successfully
completed a full
make -j7 buildworld buildkernel

Thanks for all the hard work to debug this!

#14 Updated by t_dfbsd 8 months ago

All the hammer-related issues I was seeing have resolved. Thanks Matt!

#15 Updated by dillon 8 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 90 to 100

Fixed with 743146ae2a6000, closing bug. Thanks all!
