Project

General

Profile

Actions

Bug #3226

open

Xorg freezes in vm: thread stuck in "objtrm1"

Added by peeter almost 4 years ago. Updated almost 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
03/21/2020
Due date:
% Done:

0%

Estimated time:

Description

Xorg freezes randomly, thread stuck in "objtrm1":

0 1064 1 1063 0 174 0 455496 258864 objtrm1 D5 ?? 0:59.12 /usr/local/bin/X :0 -auth /home/

The machines otherwise functional, can ssh into it and start kgdb. Backtrace in kgdb shows this:

---
(kgdb) thread 288
[Switching to thread 288 (pid 1064/1, Xorg)]
#0 0xffffffff8066bd17 in lwkt_switch () at /usr/src/sys/kern/lwkt_thread.c:807
807 lwkt_switch_return(td->td_switch(ntd));
(kgdb) back
#0 0xffffffff8066bd17 in lwkt_switch () at /usr/src/sys/kern/lwkt_thread.c:807
#1 0xffffffff806796b9 in tsleep (ident=ident@entry=0xfffff8057a331614, flags=flags@entry=1024,
wmesg=wmesg@entry=0xffffffff80c4ba77 "objtrm1", timo=1000) at /usr/src/sys/kern/kern_synch.c:711
#2 0xffffffff8066adc5 in _refcount_wait (countp=countp@entry=0xfffff8057a331614,
wstr=wstr@entry=0xffffffff80c4ba77 "objtrm1") at /usr/src/sys/kern/kern_refcount.c:81
#3 0xffffffff8098eaa3 in refcount_wait (wstr=0xffffffff80c4ba77 "objtrm1", countp=0xfffff8057a331614)
at /usr/src/sys/sys/refcount.h:111
#4 vm_object_pip_wait (waitid=0xffffffff80c4ba77 "objtrm1", object=0xfffff8057a331580)
at /usr/src/sys/vm/vm_object.h:299
#5 vm_object_terminate (object=0xfffff8057a331580) at /usr/src/sys/vm/vm_object.c:755
#6 0xffffffff8098eed5 in vm_object_deallocate (object=0xfffff8057a331580) at /usr/src/sys/vm/vm_object.c:632
#7 0xffffffff8098446f in vm_map_entry_dispose (entry=0xfffff80579cf5fa0, countp=0xfffff803960f98b4,
map=<optimized out>) at /usr/src/sys/vm/vm_map.c:1130
#8 0xffffffff80988333 in vm_map_entry_delete (countp=0xfffff803960f98b4, entry=0xfffff80579cf5fa0,
map=0xfffff801578f3880) at /usr/src/sys/vm/vm_map.c:3182
#9 vm_map_delete (map=0xfffff801578f3880, start=34760318976, end=34775064576, countp=0xfffff803960f98b4)
at /usr/src/sys/vm/vm_map.c:3344
#10 0xffffffff80988baf in vm_map_remove (map=map@entry=0xfffff801578f3880, start=<optimized out>,
start@entry=34760318976, end=<optimized out>, end@entry=34775064576) at /usr/src/sys/vm/vm_map.c:3381
#11 0xffffffff8098b634 in sys_munmap (uap=<optimized out>) at /usr/src/sys/vm/vm_mmap.c:583
#12 0xffffffff80ba980b in syscall2 (frame=0xfffff803960f99f8) at /usr/src/sys/platform/pc64/x86_64/trap.c:1317
#13 0xffffffff80b839cd in ?? () at /usr/src/sys/platform/pc64/x86_64/exception.S:450
#14 0x0000000817e06000 in ?? ()
#15 0x0000000000e10000 in ?? ()
#16 0x0000000000000000 in ?? ()
(kgdb)
---

I don't know where to look further.

Actions #1

Updated by peeter almost 4 years ago

Hm refcount is kind high

---
(kgdb) frame 2
#2 0xffffffff8066adc5 in _refcount_wait (countp=countp@entry=0xfffff8057a331614,
wstr=wstr@entry=0xffffffff80c4ba77 "objtrm1") at /usr/src/sys/kern/kern_refcount.c:81
81 tsleep(countp, PINTERLOCKED, wstr, hz*10);

(kgdb) info locals
n = 4294967295
base_ticks = 12530660
func = "_refcount_wait"

(kgdb) p countp
$1 = (volatile u_int *) 0xfffff8057a331614

(kgdb) p *countp
$2 = 4294967295

(kgdb)
---

Actions #2

Updated by peeter almost 4 years ago

OK there are corresponding messages in /var/log/messages too about the long wait:

---
Mar 21 08:12:56 mars kernel: unhandled error in i915_gem_fault: 28
Mar 21 08:12:56 mars kernel: vm_fault: pager read error, pid 1064 (Xorg)
Mar 21 08:13:23 mars kernel: pid 8610 (chrome), uid 1001: exited on signal 11
Mar 21 08:14:00 mars kernel: warning: refcount_wait objtrm1: long wait
Mar 21 08:15:00 mars kernel: warning: refcount_wait objtrm1: long wait
Mar 21 08:17:00 mars last message repeated 2 times
Mar 21 08:22:00 mars last message repeated 5 times
Mar 21 08:22:20 mars dntpd593: issuing offset adjustment: -0.002399s
Mar 21 08:23:00 mars kernel: warning: refcount_wait objtrm1: long wait
Mar 21 08:24:00 mars kernel: warning: refcount_wait objtrm1: long wait
Mar 21 08:26:00 mars last message repeated 2 times
Mar 21 08:36:00 mars last message repeated 10 times
Mar 21 08:46:00 mars last message repeated 10 times
Mar 21 08:56:00 mars last message repeated 10 times
Mar 21 09:06:00 mars last message repeated 10 times
Mar 21 09:16:00 mars last message repeated 10 times
Mar 21 09:26:00 mars last message repeated 10 times
Mar 21 09:33:00 mars last message repeated 7 times
Mar 21 09:33:43 mars kernel: [drm] HPD interrupt storm detected on connector HDMI-A-1: switching from hotplug detection to polling
Mar 21 09:34:00 mars kernel: warning: refcount_wait objtrm1: long wait
Mar 21 09:35:00 mars kernel: warning: refcount_wait objtrm1: long wait
Mar 21 09:37:00 mars last message repeated 2 times
Mar 21 09:47:00 mars last message repeated 10 times
Mar 21 09:57:00 mars last message repeated 10 times
Mar 21 10:07:00 mars last message repeated 10 times
--

Ignore the HDMI error, I unplugged the screen long after the freeze happened.

Actions #3

Updated by peeter almost 4 years ago

Looking at frame 6:

---
(kgdb) frame 6
#6 0xffffffff8098eed5 in vm_object_deallocate (object=0xfffff8057a331580) at /usr/src/sys/vm/vm_object.c:632
632 vm_object_deallocate_locked(object);

(kgdb) info locals
vp = <optimized out>
count = 1

(kgdb) p object->ref_count
$4 = 0

(kgdb) p count
$5 = 1

(kgdb) p object
$6 = (vm_object_t) 0xfffff8057a331580

(kgdb) p *object
$7 = {token = {t_count = 0, t_ref = 0x0, t_collisions = 0, t_desc = 0xffffffff80c4ba68 "vmobj"}, backing_lk = {
lk_flags = 0, lk_timo = 0, lk_count = 536870912, lk_wmesg = 0xffffffff80c4ba62 "baclk", lk_lockholder = 0x0},
object_entry = {tqe_next = 0xfffff804a79aee20, tqe_prev = 0xfffff80534f28b60}, backing_list = {tqh_first = 0x0,
tqh_last = 0xfffff8057a3315d0}, rb_memq = {rbh_root = 0x0, rbh_inprog = 0x0, rbh_spin = {lock = 0, update = 0}},
generation = 3, size = 3600, ref_count = 0, memattr = 6 '\006', type = 4 '\004', flags = 8, pg_color = 16780,
paging_in_progress = 4294967295, resident_page_count = 0, pager_object_entry = {tqe_next = 0xfffff80571026960,
tqe_prev = 0xfffff8057292b680}, handle = 0xfffff8057a3442c0, hold_count = 1, un_pager = {devp = {devp_pglist = {
tqh_first = 0x0, tqh_last = 0xfffff8057a331640}, ops = 0xffffffff83166420, dev = 0xfffff8057a3442c0}},
swblock_root = {rbh_root = 0x0, rbh_inprog = 0x0, rbh_spin = {lock = 0, update = 0}}, swblock_count = 0, md = {
dummy_unused = 0x0}}
(kgdb)
---

vm_object->ref_count is int and in frame 6, vm_object->ref_count = 0. It seems that when it arrives in frame 2, the value has been decreased to -1. In frame 2, _refcount_wait() assumes the reference count is u_int, hence the u_int value 4294967295.

Actions #4

Updated by peeter almost 4 years ago

I wonder if ref_count could be u_int? Interesting that FreeBSD have changed it recently to volatile u_int

https://reviews.freebsd.org/D21598

Actions #5

Updated by peeter almost 4 years ago

New core dump with DRIVER_DATE to 20161108 + drm58

core.txt
---
<118>Mar 25 10:34:52 mars kernel: WARN_ON(i915_gem_object_has_pinned_pages(obj))
<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))
<118>Mar 25 10:34:53 mars kernel: WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))
<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))
<118>Mar 25 10:34:53 mars kernel: WARN_ON(i915_gem_object_has_pinned_pages(obj))
<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))i915_gem_evict_something: 0000000000000000-0000000010000000
i915_gem_evict_something: ENOSPC 0,1
i915_gem_fault: CHUNKING PASS
i915_gem_evict_something: 0000000000000000-0000000010000000
i915_gem_evict_something: ENOSPC 0,1
i915_gem_fault: CHUNKING PASS
panic: BUG in i915_gem_object_get_sg at /usr/src/sys/dev/drm/i915/i915_gem.c:5408
cpuid = 4
Trace beginning at frame 0xfffff80073505378
i915_gem_object_get_sg() at i915_gem_object_get_sg+0x2ab 0xffffffff830e3abb
i915_gem_object_get_sg() at i915_gem_object_get_sg+0x2ab 0xffffffff830e3abb
i915_get_ggtt_vma_pages() at i915_get_ggtt_vma_pages+0x35f 0xffffffff830d765f
ggtt_bind_vma() at ggtt_bind_vma+0x2b 0xffffffff830d79db
i915_vma_bind() at i915_vma_bind+0xbe 0xffffffff830dac1e
__i915_vma_do_pin() at __i915_vma_do_pin+0x419 0xffffffff830e09d9
boot() called on cpu#4
---

backtrace:

---
(kgdb) bt
#0 _get_mycpu () at ./machine/thread.h:69
#1 panic (fmt=fmt@entry=0xffffffff8316f59d "BUG in %s at %s:%u") at /usr/src/sys/kern/kern_shutdown.c:868
#2 0xffffffff830e3abb in i915_gem_object_get_sg (obj=obj@entry=0xfffff8036db1fbc0, n=4096, offset=offset@entry=0xfffff80073505424)
at /usr/src/sys/dev/drm/i915/i915_gem.c:5504
#3 0xffffffff830d765f in intel_partial_pages (view=0xfffff8037628d368, view=0xfffff8037628d368, obj=0xfffff8036db1fbc0)
at /usr/src/sys/dev/drm/i915/i915_gem_gtt.c:3689
#4 i915_get_ggtt_vma_pages (vma=0xfffff8037628d200) at /usr/src/sys/dev/drm/i915/i915_gem_gtt.c:3745
#5 0xffffffff830d79db in ggtt_bind_vma (vma=0xfffff8037628d200, cache_level=I915_CACHE_LLC, flags=<optimized out>)
at /usr/src/sys/dev/drm/i915/i915_gem_gtt.c:2633
#6 0xffffffff830dac1e in i915_vma_bind (vma=vma@entry=0xfffff8037628d200, cache_level=I915_CACHE_LLC, flags=<optimized out>, flags@entry=66)
at /usr/src/sys/dev/drm/i915/i915_gem_gtt.c:3807
#7 0xffffffff830e09d9 in __i915_vma_do_pin (vma=0xfffff8037628d200, size=0, alignment=0, flags=<optimized out>)
at /usr/src/sys/dev/drm/i915/i915_gem.c:4220
#8 0xffffffff830e0dfc in i915_vma_pin (flags=66, alignment=0, size=0, vma=0xfffff8037628d200) at /usr/src/sys/dev/drm/i915/i915_gem_gtt.h:674
#9 i915_gem_object_ggtt_pin (obj=obj@entry=0xfffff8036db1fbc0, view=view@entry=0xfffff80073505600, size=size@entry=0,
alignment=alignment@entry=0, flags=flags@entry=2) at /usr/src/sys/dev/drm/i915/i915_gem.c:4305
#10 0xffffffff830e14f9 in i915_gem_fault (vm_obj=0xfffff8036d48b040, offset=4096, prot=<optimized out>, mres=0xfffff800735057f8)
at /usr/src/sys/dev/drm/i915/i915_gem.c:1964
#11 0xffffffff8097f38d in vm_pager_get_page (seqaccess=<optimized out>, m=0xfffff800735057f8, object=<optimized out>)
at /usr/src/sys/vm/vm_pager.h:134
#12 vm_fault_object (fs=0xfffff800735057f8, first_pindex=<optimized out>, fault_type=<optimized out>, allow_nofault=1)
at /usr/src/sys/vm/vm_fault.c:2090
#13 0xffffffff80980316 in vm_fault (map=map@entry=0xfffff80351421e80, vaddr=vaddr@entry=34793852928, fault_type=fault_type@entry=2 '\002',
fault_flags=fault_flags@entry=76) at /usr/src/sys/vm/vm_fault.c:710
#14 0xffffffff80ba7fa6 in trap_pfault (frame=0xfffff800735059f8, usermode=1) at /usr/src/sys/platform/pc64/x86_64/trap.c:980
#15 0xffffffff80ba8f89 in trap (frame=0xfffff800735059f8) at /usr/src/sys/platform/pc64/x86_64/trap.c:557
#16 0xffffffff80b830aa in calltrap () at /usr/src/sys/platform/pc64/x86_64/exception.S:321
#17 0x000000080139c282 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(kgdb)
---

Actions #6

Updated by peeter almost 4 years ago

Kernel panic with drm_linux_410_DRIVER_DATE_20161121_v1 + drm60. Unfortunately, no backtrace.

---
<118>Mar 26 10:46:58 mars kernel: WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))
<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))
<118>Mar 26 10:57:11 mars kernel: WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))<4>WARN_ON(i915_gem_object_has_pinned_pages(obj))
i915_gem_evict_something: 0000000000000000-0000000010000000
i915_gem_evict_something: ENOSPC 0,1
i915_gem_fault: CHUNKING PASS
i915_gem_evict_something: 0000000000000000-0000000010000000
i915_gem_evict_something: ENOSPC 0,1
i915_gem_fault: CHUNKING PASS

[. . . repeat 300 times]

i915_gem_evict_something: 0000000000000000-0000000010000000
i915_gem_evict_something: ENOSPC 0,1
i915_gem_fault: CHUNKING PASS
i915_gem_evict_something: 0000000000000000-0000000010000000
i915_gem_evict_something: ENOSPC 0,1
i915_gem_fault: CHUNKING PASS
panic: vm_page_insert: already inserted
cpuid = 7
Trace beginning at frame 0xfffff80073507568
vm_page_insert() at vm_page_insert+0x1f5 0xffffffff80992655
vm_page_insert() at vm_page_insert+0x1f5 0xffffffff80992655
i915_gem_fault() at i915_gem_fault+0x6e7 0xffffffff830dfee7
vm_fault_object() at vm_fault_object+0x31d 0xffffffff8097f38d
vm_fault() at vm_fault+0x306 0xffffffff80980316
trap_pfault() at trap_pfault+0x206 0xffffffff80ba7fa6
boot() called on cpu#7
Uptime: 55m48s
---

Actions #7

Updated by peeter almost 4 years ago

Just to note that this is with master cf2b316 (18 Mar 2020) and later.

Actions

Also available in: Atom PDF