Bug #602

closed

cache_lock: blocked on... extremely urgent

Added by elekktretterr almost 19 years ago. Updated almost 19 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:

Description

[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd77d7f38
"1176219194.V20d05I16c3a9M163630.daria.webgate.net.au"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd5d416e8 "cur"
[diagnostic] cache_lock: blocked on 0xd70d3118 "tmp"

This is off a mail server; these came up when I tried to cd into the cur/tmp
directories inside a maildir. There are processes hanging all over the place
in ps and performance is terrible. The machine is running 2 Fujitsu SCSI
drives (new but factory repaired, running in RAID 1) on an older
SCSI/RAID controller. This is the 2nd time this has happened in the last
week. Should I suspect a dying controller or the drives?

Thanks,
Petr

Actions #1

Updated by elekktretterr almost 19 years ago

It's running this RAID card:

mly0@pci1:6:1: class=0x010400 card=0x00541069 chip=0x00501069 rev=0x02
hdr=0x00
vendor = 'Mylex Corp'
device = 'AcceleRAID Disk Array'
class = mass storage
subclass = RAID

It seems that because of this problem, mails get stuck in postfix queues.

Actions #2

Updated by dillon almost 19 years ago

:This is off a mailserver, these came up when i tried to cd to cur/tmp
:directories inside a maildir. There are process in ps all over the place
:and performance is terrible. The machine is running 2 fujitsu scsi
:drives (new but factory repaired and running in raid 1) on an older
:scsi/raid controller. This is the 2nd time this happend in the last
:week. Should i rather suspect a dying controller or the drives?
:
:Thanks,
:Petr

Is access completely locked up, or just really, really slow? Can
you kill the postfix processes? Do they go away, or are they all
hung on disk I/O?
-Matt
Matthew Dillon
<>
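One way to answer the disk-I/O question above (a hedged sketch; `ps` flags and column names vary slightly between the BSDs and Linux) is to look for processes stuck in the uninterruptible "D" wait state:

```shell
# List processes whose state begins with "D" (uninterruptible disk wait),
# keeping the header row. Such processes cannot be killed, not even with
# kill -9, until the I/O they are blocked on completes -- the classic
# symptom of a wedged disk controller. The wchan names are system-specific.
ps -axo pid,stat,wchan,comm | awk 'NR == 1 || $2 ~ /^D/'
```

If the postfix processes all show up here with a disk-related wchan, they are stuck inside the driver or controller rather than in userland.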
Actions #3

Updated by c.turner almost 19 years ago

Not sure if it's related, but last night I had a flurry of:

[diagnostic] cache_resolve: EAGAIN ncp 0xc1d28ab8 home

messages, but didn't report it, as the system is running what is,
for lack of a better term, "1.8.0.1-optimized"

(i.e. the release branch after the initial flurry of various
post-release bugfixes, e.g. for chroot '/', but before
the bind/linker fixes of, for lack of a better term, "1.8.1 Pre"),

and I seemed to recall some kind of namecache-related changes
that I hadn't had a chance to verify in the lists.

Things happening @time of incident:

- various nfs / rpc / amd server related startups ..
- otherwise mostly idle
Actions #4

Updated by c.turner almost 19 years ago

Chris Turner wrote:

...

Things happening @time of incident:

- various nfs / rpc / amd server related startups ..
- otherwise mostly idle

forgot to mention /home is the only filesystem exported / configured for
amd .

Actions #5

Updated by elekktretterr almost 19 years ago

Completely locked, and they don't go away. They hang on disk I/O.

Petr

Actions #6

Updated by elekktretterr almost 19 years ago

By the way, this is an SMP box.

What could be causing this? Is there anything I should do next time it
happens?

Actions #7

Updated by dillon almost 19 years ago

:
:By the way, this is an SMP box.
:
:What could be causing this (maybe)? Anything i should do next time it
:happens?
:
:Petr
:>

It sounds like a bug in the driver for the RAID controller, or in the
firmware for the RAID controller.

I've had something similar happen to me on my backup box, which
uses one of the older 3ware IDE RAID controllers:

twe0: <3ware Storage Controller driver ver. 1.40.01.002> port 0xb000-0xb00f mem 0xce000000-0xce7fffff,0xced00000-0xced0000f irq 2 at device 11.0 on pci0
twe0: 4 ports, Firmware FE7X 1.05.00.068, BIOS BE7X 1.08.00.048
twed0: <Unit 0, RAID0, Normal> on twe0
twed0: 457880MB (937739136 sectors)

There is apparently a known bug in the 3ware firmware for this
controller where the controller locks up if a drive goes bad. When
this occurs I usually also get a bunch of controller-failure and
drive-failure messages on the console and in the dmesg output.
It's happened twice to me so far over the last year or so, and
replacing the drive fixed the problem.

I have two other 3ware controllers with SATA ports instead of PATA
ports (in apollo.backplane.com and in pkgbox.dragonflybsd.org), and
I don't recall either of them locking up.
-Matt
Matthew Dillon
<>
Actions #8

Updated by elekktretterr almost 19 years ago

Well, do you think I should get a different SCSI/RAID controller in that
case? Or is there any way I can help you find a bug in the RAID
controller code? Those cache_lock messages and the ssh shell freezing upon
entering a directory are the only information I have for you at the
moment. The biggest problem is that for some reason these problems leave
many emails stuck in either the active or incoming queue. The only way to
deliver them is to stop/start postfix and then watch some of them get
delivered, then repeat this a few times for the rest to get delivered.

Now, should this be a drive problem, is there some software that will
find out if there's something wrong with the disks?

Thanks a lot for your time,
Petr

Actions #9

Updated by qhwt+dfly almost 19 years ago

It looks to me like another critical section mismatch:

Index: mly.c
===================================================================
RCS file: /home/source/dragonfly/cvs/src/sys/dev/raid/mly/mly.c,v
retrieving revision 1.17
diff -u -p -r1.17 mly.c
--- mly.c	22 Dec 2006 23:26:24 -0000	1.17
+++ mly.c	12 Apr 2007 05:00:42 -0000
@@ -820,8 +820,10 @@ mly_immediate_command(struct mly_command
 
 	/* spinning at splcam is ugly, but we're only used during controller init */
 	crit_enter();
-	if ((error = mly_start(mc)))
+	if ((error = mly_start(mc))) {
+		crit_exit();
 		return(error);
+	}
 	if (sc->mly_state & MLY_STATE_INTERRUPTS_ON) {
 		/* sleep on the command */
Actions #10

Updated by elekktretterr almost 19 years ago

Thanks Yonetani,

I applied your patch and i am rebooting now. I will keep an eye on the
server for the next few days/weeks and see if this fixed it.

Petr

Actions #11

Updated by elekktretterr almost 19 years ago

I applied your patch and i am rebooting now. I will keep an eye on the
server for the next few days/weeks and see if this fixed it.

Petr

Also, please commit the code.

Actions #12

Updated by qhwt+dfly almost 19 years ago

I just committed it to HEAD, but after reading the commit log in FreeBSD,
it appears that support for the mly(4) driver in RELENG_4 was
abandoned well before we forked off, so you may need some newer fixes
from FreeBSD if it doesn't work.

Cheers.
