Bug #2845
Hangs using dsched policies other than noop
Status: closed, 100% done
Description
When using fq or bfq with heavy disk I/O on one SATA HDD (where "heavy" means running make -j10 buildworld plus building chromium and firefox all at the same time: heavy for a single-user el-cheapo Satellite D55 laptop, really light for a server), occasionally all disk I/O stops and no new I/O can be started, not even a panic dump from the debugger(!). If top, ps, etc. happen to be in the cache, the processes all appear stuck in vnode wait status, and there seems to be no way to kick them out of that state.
The as policy, on the other hand, takes only about 3 I/Os to hang. I'm not sure yet whether processes end up in vnode wait status there as well.
Filing this as low priority, since the default (noop) seems to work perfectly. Target should read 4.3.CURRENT, but that isn't a choice.
Updated by deef over 9 years ago
Similar experience here... When using bfq during a parallel buildworld, the kernel produced several hundred of these messages:
kernel: dsched_thread_io: destroy race tdio=0xffffffe0f4c6ce00
After about an hour I found the system unresponsive.
When trying fq during dports compiling, the kernel produced just one such message, and the currently running pkg(8) process got stuck. On reboot the system "gave up on 232 buffers", which resulted in a corrupted pkg(8) database (ouch ;-)).
When using the noop scheduler, the system doesn't report any races and runs with no problems.
Updated by dillon about 9 years ago
- Status changed from New to Feedback
- Assignee set to dillon
- % Done changed from 0 to 100
We are going to remove dsched entirely. It doesn't work well with SSDs, and the complexity has made finding its bugs too painful; several people have tried over the years. We will need to rethink the whole disk-fairness concept/problem.
-Matt