Bug #3198

OpenGL app crash with Radeon driver

Added by yellowrabbit2010 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee: -
Category: Driver
Target version: -
Start date: 07/17/2019
Due date:
% Done: 0%
Estimated time:

Description

The programs I work with (FreeCAD, KiCad, MPV, Chromium) crash with ``vm_fault: pager read error'' at unpredictable times: sometimes immediately after starting, sometimes after some interaction.
I managed to reproduce the problem on a configuration with minimal changes:
- downloaded DragonFly-X86_64-LATEST.img.bz2 (2019-07-16 03:56, 285M)
- installed it on an 8 GB USB flash drive
- modified rc.conf and wpa_supplicant.conf in order to connect to WiFi
- pkg install Xorg mesa-demos
- added user rabbit
- startx
- glxgears

Glxgears crashed immediately after start. Unfortunately, the test installation does not contain packages with debug information, so its core file is not very informative. But since the crash is always reproducible, I built packages with debug information on my work machine (KERNCONF=X86_64_GENERIC) and took a screenshot of gdb with the core file loaded.
I should note that I use the core of the March version and it has no problems with graphics at all: there have been no failures for a long time with any of the programs listed above. So this is unlikely to be a video memory problem, a swap problem, or anything of that kind.
= 5.5-DEVELOPMENT DragonFly v5.5.0.325.gf6792-DEVELOPMENT #25: Sun Mar 24 09:54:07 VLAT 2019 =

I made an image of the flash drive immediately after the failure; you can download it from https://yellowrabbit.gitlab.io/pub/bugs/dfly-glxgears.img.xz . There are no passwords for either the rabbit user or root.


Files

IMG_20190718_080950_HDR-min.jpg (1.83 MB): glxgears crash, yellowrabbit2010, 07/17/2019 04:11 PM
Xorg.0.log (27.6 KB): test installation Xorg log, yellowrabbit2010, 07/17/2019 04:13 PM
gdb-0.txt (558 Bytes): gdb with test installation glxgears core loaded, yellowrabbit2010, 07/17/2019 04:13 PM
messages (34.9 KB): test installation system log, yellowrabbit2010, 07/17/2019 04:13 PM
gdb-1.txt (2.72 KB): gdb with work machine glxgears core loaded, yellowrabbit2010, 07/17/2019 04:13 PM
gdb-1.png (394 KB): gdb screenshot with work machine core loaded, yellowrabbit2010, 07/17/2019 04:13 PM
glxgears.core.xz (298 KB): glxgears core from test installation, yellowrabbit2010, 07/17/2019 04:18 PM
IMG_20190724_210557_HDR-min.jpg (2.57 MB): glxgears crash shot, yellowrabbit2010, 07/24/2019 05:11 AM
vga.txz (312 KB): logs & core, yellowrabbit2010, 07/24/2019 05:12 AM
IMG_20190727_194932_HDR.jpg (4.11 MB): yellowrabbit2010, 07/27/2019 03:09 AM
core.txt.0 (1.12 MB): mpv crash (vdpau), yellowrabbit2010, 08/15/2019 03:54 AM
messages (35.1 KB): /var/log/messages with kernel: ttm_bo_wait(): ret = 1, yellowrabbit2010, 08/15/2019 01:53 PM
core.txt.2 (1.47 MB): KiCad crash, yellowrabbit2010, 08/15/2019 02:41 PM
core.txt.3 (1.73 MB): FreeCAD crash, yellowrabbit2010, 08/15/2019 03:48 PM

History

#1

Updated by yellowrabbit2010 3 months ago

I use the core of the March version -> I use the kernel from March

#2

Updated by ftigeot 3 months ago

Thanks for this bug report.
The issue looks to be hardware-specific though; I cannot reproduce it at will on any of my machines.

The branch drm_ttm_radeon_4_4_180_v1 on leaf contains the individual changes leading to the Linux 4.4.180 update.

Can you bisect the changes on your hardware?

Some instructions on how to do that:
* git remote add leaf git://leaf.dragonflybsd.org/~ftigeot/dragonfly.git
* git fetch leaf
* git checkout drm_ttm_radeon_4_4_180_v1
* git checkout 211521498c6a0f6fdedfa4c7210a9b3f57aeef0e
* make kernel && reboot

That last git checkout will give you the state of master just before the drm/radeon 4.4.180 changes.
If all goes well with that commit, you will then be able to find out which of the following ones broke OpenGL on your hardware.

#3

Updated by yellowrabbit2010 3 months ago

Unfortunately, this commit is already bad:(
I will try to build a working kernel of an earlier version.

#4

Updated by ftigeot 3 months ago

I managed to dig up an old Radeon HD5450 which appears to exhibit the bug.

There is no fix yet, but there is a workaround:
disabling acceleration in xorg.conf will stop the crashes from happening.

Adding the line Option "NoAccel" "TRUE" in a Device section should do the trick:

Section "Device"
    Identifier "Card1"
    VendorName "Advanced Micro Devices [AMD] nee ATI"
    BoardName "RV370 [Radeon X550]"
    Driver "radeon"
    Option "NoAccel" "TRUE"
    BusID "PCI:1:0:0"
EndSection

#5

Updated by yellowrabbit2010 3 months ago

Thanks! Yes, disabling acceleration lets me work without failures on kernel 4685ca1cc305dbcd40d614ecc70f60a6a71ba453 :)

Meanwhile, I have built kernels up to March 20, 2019 (v5.5.0-300-g2bbc7733d6), all of which work, and I am going to keep moving forward until I hit a non-working kernel.

Of course, I'll stop if the reason for the failure is already clear to you :)

#6

Updated by yellowrabbit2010 3 months ago

The first commit on which the glxgears error occurs (vm_fault: pager read error) is 7dcf36dc33228b5b368783d7b6f7ada00ee671d6 on master (Thu Jun 20, drm/radeon: Upgrade to Linux 3.19.8).

Is there a branch with individual changes?

#7

Updated by ftigeot 3 months ago

I have pushed drm_ttm_radeon_3_19_8_bisect_vmfault to leaf and bisected what I could myself.
(It may be necessary to comment out drm/i915 from the tree to build successfully.)

The first bad commit in that branch is fca8eb81f38c6d4b27e5fa79e030f0647dee0739
"drm/radeon: Try placing NO_CPU_ACCESS BOs outside of CPU accessible VRAM"

I tried to revert it in master but am still getting weird vm faults+crashes after a while.

#8

Updated by yellowrabbit2010 3 months ago

You're right. I got the same commit (fca8eb81f38c6d4b27e5fa79e030f0647dee0739) as the first one that shows the bug in glxgears.

You are probably also right that this commit is not the cause of the failures; it just makes the problem quick and easy to reproduce.
The kernels that I tried after March 24 crashed with the same error, but only after 3–4 hours of intensive work with graphics applications or mpv, so it was impossible to write a proper bug report with an easily reproducible sequence of steps.

#9

Updated by ftigeot 2 months ago

The pager read error appears to always be from this chunk of code in ttm/ttm_bo_vm.c around line 550, in function ttm_bo_vm_fault_dfly():

    /*
     * Wait for buffer data in transit, due to a pipelined
     * move.
     */
    if (test_bit(TTM_BO_PRIV_FLAG_MOVING, &bo->priv_flags)) {
        /*
         * Here, the behavior differs between Linux and FreeBSD.
         *
         * On Linux, the wait is interruptible (3rd argument to
         * ttm_bo_wait). There must be some mechanism to resume
         * page fault handling, once the signal is processed.
         *
         * On FreeBSD, the wait is uninteruptible. This is not a
         * problem as we can't end up with an unkillable process
         * here, because the wait will eventually time out.
         *
         * An example of this situation is the Xorg process
         * which uses SIGALRM internally. The signal could
         * interrupt the wait, causing the page fault to fail
         * and the process to receive SIGSEGV.
         */
        ret = ttm_bo_wait(bo, false, false);
        if (unlikely(ret != 0)) {
            retval = VM_PAGER_ERROR;
            goto out_unlock;
        }
    }
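
To make the consequence of that check concrete, here is a minimal user-space sketch in C of the decision it makes. It is only a model: model_bo, model_bo_wait(), model_fault() and the MODEL_PAGER_* values are hypothetical names invented for illustration, not the real ttm or VM interfaces. It shows that while a buffer move is pending, any non-zero return from the wait is turned into a pager error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes; stand-ins for the kernel's pager results. */
enum model_pager_result {
    MODEL_PAGER_OK,
    MODEL_PAGER_ERROR
};

/* Hypothetical, heavily reduced model of a buffer object. */
struct model_bo {
    bool moving;       /* stands in for TTM_BO_PRIV_FLAG_MOVING */
    int  wait_result;  /* value the simulated wait will return */
};

/* Simulated wait: simply reports the pre-set result. */
static int model_bo_wait(const struct model_bo *bo)
{
    return bo->wait_result;
}

/*
 * Model of the quoted check: only wait when a move is pending, and
 * treat any non-zero wait result as a failed page fault.
 */
static enum model_pager_result model_fault(const struct model_bo *bo)
{
    assert(bo != NULL);

    if (bo->moving) {
        int ret = model_bo_wait(bo);

        if (ret != 0)
            return MODEL_PAGER_ERROR;
    }
    return MODEL_PAGER_OK;
}
```

With moving set and the simulated wait returning a non-zero value, model_fault() returns MODEL_PAGER_ERROR; in every other case it returns MODEL_PAGER_OK.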

#10

Updated by ftigeot 2 months ago

  • Status changed from New to In Progress

I have pushed a new branch to leaf with a possible fix: drm_ttm_radeon_4_7_10_rebased_v2

Can you check if this improves the situation on your hardware?

#11

Updated by yellowrabbit2010 2 months ago

v5.7.0-241-g7cc1c1be6e
The improvement is obvious! I was able to launch several glxgears instances and a cuberender. Chromium did not crash either.

However, there are solid (always reproducible) kernel panics with MPV when VDPAU is used as output.

I still need to check how this version behaves under load for many hours; I will write more tomorrow.

If necessary, I can upload vmcore.0 and kern.0 somewhere; it will just take a while.

#12

Updated by yellowrabbit2010 2 months ago

After about half an hour of working with Chromium, at the moment I moved the mouse, Xorg died :(

kernel: ttm_bo_wait(): ret = 1

#13

Updated by yellowrabbit2010 2 months ago

Quick crash of KiCad when trying to open the PCB editor.
panic: BUG in ttm_bo_add_to_lru at /usr/src/sys/dev/drm/drm/../ttm/ttm_bo.c:173
cpuid = 0
Trace beginning at frame 0xfffff803ab7bb198
ttm_bo_add_to_lru() at ttm_bo_add_to_lru+0x13a 0xffffffff84f5752a
ttm_bo_add_to_lru() at ttm_bo_add_to_lru+0x13a 0xffffffff84f5752a
ttm_eu_fence_buffer_objects() at ttm_eu_fence_buffer_objects+0x5d 0xffffffff84f5ae9d
radeon_cs_parser_fini() at radeon_cs_parser_fini+0x19d 0xffffffff8387cecd
radeon_cs_ioctl() at radeon_cs_ioctl+0x5ea 0xffffffff8387d9ba
drm_ioctl() at drm_ioctl+0xe9 0xffffffff84f3ea89
boot() called on cpu#0
Uptime: 26m16s

#14

Updated by yellowrabbit2010 2 months ago

Very fast crash in FreeCAD when trying to open a file with a model.
<118>Aug 16 08:17:07 fly kernel: [drm] Initialized radeon 2.45.0 20080528
panic: BUG in ttm_bo_add_to_lru at /usr/src/sys/dev/drm/drm/../ttm/ttm_bo.c:173
cpuid = 0
Trace beginning at frame 0xfffff803a4857198
ttm_bo_add_to_lru() at ttm_bo_add_to_lru+0x13a 0xffffffff84f5752a
ttm_bo_add_to_lru() at ttm_bo_add_to_lru+0x13a 0xffffffff84f5752a
ttm_eu_fence_buffer_objects() at ttm_eu_fence_buffer_objects+0x5d 0xffffffff84f5ae9d
radeon_cs_parser_fini() at radeon_cs_parser_fini+0x19d 0xffffffff8387cecd
radeon_cs_ioctl() at radeon_cs_ioctl+0x5ea 0xffffffff8387d9ba
drm_ioctl() at drm_ioctl+0xe9 0xffffffff84f3ea89
boot() called on cpu#0
Uptime: 4m52s
Physical memory: 20375 MB

#15

Updated by ftigeot about 2 months ago

  • Status changed from In Progress to Resolved

I sadly wasn't able to find the source of the crash among the differences we still have with Linux in the drm code.

It could well be an existing bug in the drm/ttm code of the Linux version we are using right now, 4.7.10.
There is at least one Red Hat bug report which corroborates this theory: https://bugzilla.redhat.com/show_bug.cgi?id=1027831

I have disabled 3D acceleration on Evergreen-class hardware for now in order to avoid these crashes (commit e4f26d7e5d0bf5eb2283f3953aa996332f16226b).
Regular Xorg 2D and video usage is still fast, much faster than with a dumb framebuffer driver.

#16

Updated by yellowrabbit2010 about 2 months ago

Thanks for your work.
As I understand it, you have video cards of this type that do not show any glitches.
At a local store I can purchase R7 240, R5 230 or RX 550.

#17

Updated by ftigeot about 2 months ago

Among the three choices, I'd say get the R7 240.
- R5 230: these are rebadged models and variants of your existing chips. They will most likely exhibit the same issue.
- RX 550: it's a new generation and requires the amdgpu driver.
A port of the driver is almost working but I can't say when the last issues will finally be ironed out.

- R7 240: that one is known to work fine and is a different Radeon generation than the R5.
Get a variant with a 128-bit memory bus if speed is important for you.
