Bug #833

cache_lock: blocked on 0xe29c3b08 ""

Added by dpwalters about 7 years ago. Updated almost 7 years ago.

Status: Closed
Start date:
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

The following message appears in dmesg and /var/log/messages: [diagnostic] cache_lock: blocked on 0xe29c3b08 "quota.user". The system almost completely hangs; I am unable to create a new shell, enter commands, or even reboot. Kernel options are SMP, IO_APIC, and QUOTA. The bug appears to manifest itself after applying userquota to a filesystem, mounting it, and then issuing a few commands such as mkdir and chmod on it. The system is a dual-processor Opteron 248.

bug.sh (502 Bytes) dpwalters, 11/04/2007 02:40 PM

History

#1 Updated by dillon about 7 years ago

I'll try to reproduce this one today. It sounds like it ought to be
easy to figure out.

-Matt
Matthew Dillon

#2 Updated by dillon about 7 years ago

I haven't had any luck reproducing it yet. Could you give me a
test script to run that will reproduce the problem? Also, are you
doing your tests as root or as a user?

Alternatively if you can get a kernel core and a kgdb backtrace of one
of the stuck processes I can probably figure out what is going on from
there.

-Matt

#3 Updated by dpwalters about 7 years ago

After trying to recreate this problem in a virtual machine, I am having
trouble reproducing it too. I apologize for this bug report. After inspecting quota.user
with the file command, it seems the file had become corrupted somehow, as the file command
reported it as "data". I'm not sure how it got corrupted, however, as fsck showed it as
clean. Anyway, why does the system become unresponsive after this file becomes
corrupted? Also, other than the obvious reasons, how might this file become corrupted?
I appreciate all the work you do.
-David

#4 Updated by dpwalters about 7 years ago

I seem to have spoken too soon, as I can now reproduce this in a virtual machine. This is how I
was able to reproduce it: qemu -smp 2 with kernel options SMP and QUOTA. It seems that after
running touch /home/quota.user as root (possibly after a reboot, possibly not), things start to get
weird (/home is a userquota filesystem). I get "sigreturn: eflags 0x206" messages on the console,
as well as others with values such as 0x80207 and 0x80203. Root's shell also begins to lock up, and
subsequent logins as root lock up after simple commands like "ls".

#5 Updated by dpwalters almost 7 years ago

It seems that these "sigreturn: eflags" messages and the subsequent lockups may be due to
a bug in qemu (even in the latest CVS sources, which I pulled on 10/30). After further
research into these messages, it seems VirtualBox is affected, and I can only assume qemu is
as well (as VirtualBox is based on qemu). It just so happened that these messages occurred
right after I had tried to recreate the issue I was having on my Opteron-based system, and they
produced the same sort of effect.

#6 Updated by dpwalters almost 7 years ago

OK, after having dumped qemu as a debugging solution for SMP kernels,
I have discovered vkernels. I am able to reproduce the bug in a vkernel
with options QUOTA and SMP (without the annoying qemu bugs). After attaching
gdb to the vkernel, I am unable to get a decent backtrace: gdb has trouble
accessing the memory and gives Device Busy errors. Anyway, I thought this
information might be of some help. For what it's worth, I have a picture of the
backtrace at http://woe.likewhoa.net/~david/vkernel_bt.jpg, taken after the
vkernel gets hung up. (Note: I still use qemu, just not its SMP functionality;
the picture shows qemu running DragonFly GENERIC with a vkernel inside
of that.)

#7 Updated by corecode almost 7 years ago

I guess a simple shell script should be sufficient so that Matt can
reproduce the bug. Unless of course you want to get your hands dirty
yourself. In this case I guess gdb needs to be extended to be able to
deal with multiple threads :)

cheers
simon

#8 Updated by dpwalters almost 7 years ago

Attached is a script that probably works 100% of the time. You may want
to back up your vkernel filesystem before running the script, as it may
become trashed because the filesystem cannot be synced once it has
locked up. Also, there is some preparation work to do before running this
script inside of a vkernel. Here are some example commands (assuming your
vkernel filesystem is a lot like the one in the vkernel man page) to prepare
the vkernel and then trigger the bug inside of the vkernel.

1. dd if=/dev/zero of=/mnt.img bs=1m count=5
2. vnconfig -c -s labels vn0 /mnt.img
3. disklabel -r -w vn0s0 auto
4. disklabel -e vn0s0 #edit the label to create a vn0s0a partition
5. newfs /dev/vn0s0a
6. echo "/dev/vn0s0a /mnt ufs rw,userquota 1 1" >> /etc/fstab
7. mount /mnt
8. ./bug.sh /mnt

After having run the script myself, it seems you can ctrl-z out of the
script and are just left with a zombied process. I'm not sure, but when /home
is the affected filesystem and you're trying to log in remotely as a user, this is
very bad (sshd gets zombied repeatedly, which is what I was experiencing).
Also, you cannot unmount the filesystem after locking it up. You may or
may not see the "cache_lock: blocked" message(s).

#9 Updated by dillon almost 7 years ago

I've reproduced at least one lockup with your script. I'm tracking it
down now.

-Matt

#10 Updated by dillon almost 7 years ago

I think the problem is that the quota.user file is not being pre-populated,
and the filesystem is recursing trying to allocate blocks for the file.
i.e. it is trying to check quotas on the quota file itself.

If you run quotacheck on the filesystem before turning on quotas the
quota.user file will be properly created.

That said, we certainly do not want it to crash. I'll adjust the code
to generate a kernel warning.

-Matt
Matthew Dillon
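
The workaround described above (run quotacheck before turning quotas on) can be sketched as a small shell helper. This is only an illustration, not part of the fix: prepare_quota and DRY_RUN are hypothetical names, and it assumes the filesystem is already mounted and listed in /etc/fstab with the userquota option. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: populate quota.user with quotacheck BEFORE enabling quotas.
# Assumptions (not from the original report): the filesystem is mounted
# and has "userquota" in /etc/fstab; prepare_quota and DRY_RUN are
# hypothetical names used only for this example.

# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

prepare_quota() {
    mnt="$1"
    if [ -z "$mnt" ]; then
        echo "usage: prepare_quota MOUNTPOINT" >&2
        return 1
    fi
    # Build the user quota file from current disk usage first ...
    run quotacheck -u "$mnt" || return 1
    # ... and only then turn user quotas on for that filesystem.
    run quotaon -u "$mnt"
}
```

Calling prepare_quota /mnt prints the two commands in order; running it as root with DRY_RUN=0 would execute them. Whether this avoids the lockup in every case is not confirmed in this thread.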
