Bug #1556

many processes stuck in "hmrrcm", system unusable

Added by corecode about 5 years ago. Updated 9 months ago.

Status: New
Start date:
Priority: Normal
Due date:
Assignee: tuxillo
% Done: 0%

Category:VFS subsystem
Target version:3.8.0

Description

On chlamydia I have many processes stuck in "hmrrcm", making the system
unusable - can't open new shells, etc. After a minute or two the
situation seems to improve again.

History

#1 Updated by dillon about 5 years ago

:
:On chlamydia I have many processes stuck in "hmrrcm", making the system
:unusable - can't open new shells, etc. After a minute or two the
:situation seems to improve again.

This is an inode moderation mechanic which will typically occur if
large numbers (thousands) of inodes have been dirtied and need to
be updated on-media. It prevents hammer's inode structure allocations
from blowing away kernel memory and crashing the system.

A rm -rf or a tar extract is the most typical culprit but atime
updates can create issues as well if large portions of the filesystem
are being randomly read. You can try mounting noatime to address
atime issues.
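
As a rough way to see whether a given mount is still generating atime
updates, here is a minimal Python sketch. The function name
read_updates_atime is made up for illustration; it only observes what
stat() reports and knows nothing about HAMMER internals.

```python
import os
import tempfile
import time


def read_updates_atime(directory: str) -> bool:
    """Return True if reading a file in `directory` advances its atime.

    Hypothetical helper: creates a scratch file, backdates its timestamps,
    reads it once and compares the access time before and after.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"probe")
        os.close(fd)
        # Push atime/mtime two days into the past so any update is visible.
        past = time.time() - 2 * 86400
        os.utime(path, (past, past))
        before = os.stat(path).st_atime
        with open(path, "rb") as fh:
            fh.read()
        after = os.stat(path).st_atime
        return after > before
    finally:
        os.unlink(path)


if __name__ == "__main__":
    # Point this at a directory on the filesystem you want to check.
    print(read_updates_atime(tempfile.gettempdir()))
```

On a filesystem mounted noatime the function should report False; True
means reads there are still dirtying inodes.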

-Matt
Matthew Dillon

#2 Updated by corecode about 5 years ago

Matthew Dillon wrote:
> :
> :On chlamydia I have many processes stuck in "hmrrcm", making the system
> :unusable - can't open new shells, etc. After a minute or two the
> :situation seems to improve again.
>
> This is an inode moderation mechanic which will typically occur if
> large numbers (thousands) of inodes have been dirtied and need to
> be updated on-media. It prevents hammer's inode structure allocations
> from blowing away kernel memory and crashing the system.
>
> A rm -rf or a tar extract is the most typical culprit but atime
> updates can create issues as well if large portions of the filesystem
> are being randomly read. You can try mounting noatime to address
> atime issues.

Shouldn't we rather try to fix the issue, i.e. make hammer work just a
little bit performant and capable of concurrent use? I think now that
the code is stable we should start investigating performance (latency)
issues and address them.

cheers
simon

#3 Updated by dillon about 5 years ago

:Shouldn't we rather try to fix the issue, i.e. make hammer work just a
:little bit performant and capable of concurrent use? I think now that
:the code is stable we should start investigating performance (latency)
:issues and address them.
:
:cheers
: simon
:

I think the main culprit here is the background flusher. With UFS
any modifying operations can block the process context responsible
for them. With HAMMER *ALL* modifying operations are asynchronous
and do not block the process context responsible for them. Thus when
resources reach their limit, ANY process trying to make a modification
or even just load a new inode (hmrrcm) winds up taking a hit instead
of the one process that was responsible for eating up all the resources
in the first place.

These limits are quickly hit when rm -rf'ing or tar extracting tens of
thousands of files, but otherwise typically not hit.
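
The effect described above can be shown with a toy model. This is only a
sketch: the class InodePool, the limit of 1000 and the process names are
assumptions for illustration and do not reflect HAMMER's real data
structures or limits.

```python
class InodePool:
    """Toy model of a shared pool of dirty in-memory inodes with a hard limit."""

    def __init__(self, limit):
        self.limit = limit
        self.dirty = 0
        self.stalls = {}  # process name -> number of times it had to wait

    def modify(self, proc):
        """Dirty one inode for `proc`; whoever arrives at the limit waits."""
        if self.dirty >= self.limit:
            self.stalls[proc] = self.stalls.get(proc, 0) + 1
            self.flush()
        self.dirty += 1

    def flush(self):
        """Stand-in for the background flusher writing everything to media."""
        self.dirty = 0


def simulate(limit=1000, heavy_ops=50_000, light_every=1000):
    """One heavy process does almost all the work; a shell touches one
    inode per `light_every` heavy operations."""
    pool = InodePool(limit)
    for i in range(heavy_ops):
        pool.modify("rm -rf")
        if i % light_every == light_every - 1:
            pool.modify("shell")
    return pool.stalls


if __name__ == "__main__":
    print(simulate())
```

The shell contributes a tiny fraction of the modifications, yet it still
ends up in the stall table, because the limit is global rather than
charged to the process that filled the pool.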

In both cases the disk winds up being banged up, but with UFS it is
easier to prevent the resource starvation issue from bleeding over
into other processes. HAMMER can't really distinguish between
modifying operations belonging to a heavy-handed process versus
modifying operations incidental to processes which otherwise have
a light touch.

I do believe it is possible to solve the problem, but it isn't a
quick fix. Essentially we have to move meta-data modification
out of the backend flusher and into the frontend. This will shift
the cpu and buffer cache burden back to the processes responsible.

But it isn't easy to do this because those meta-data buffers cannot
be flushed to the media without first synchronizing the UNDO space.
Synchronizing the UNDO space and still maintaining a pipeline requires
double-buffering dirty meta-data buffers (because new changes to
meta-data which is already dirtied from a previous operation now
undergoing a flush cannot be made in-place).
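
A minimal sketch of the double-buffering idea, under the assumption that
a buffer being flushed must stay frozen while new changes go to a second
copy. MetaBuffer and its methods are hypothetical names, not the kernel
buffer cache API.

```python
class MetaBuffer:
    """Toy meta-data buffer that is never modified in place during a flush."""

    def __init__(self, data: bytes):
        self.active = bytearray(data)  # copy that front-end changes go to
        self.flushing = None           # frozen snapshot being written, if any

    def begin_flush(self) -> bytes:
        """Freeze the current contents; later writes only touch `active`."""
        if self.flushing is not None:
            raise RuntimeError("flush already in progress")
        self.flushing = bytes(self.active)  # second buffer (immutable copy)
        return self.flushing

    def write(self, offset: int, payload: bytes) -> None:
        """Apply a change to the active copy, leaving any snapshot untouched."""
        if offset < 0 or offset + len(payload) > len(self.active):
            raise ValueError("write outside buffer")
        self.active[offset:offset + len(payload)] = payload

    def end_flush(self) -> None:
        """The snapshot reached the media and can be dropped."""
        self.flushing = None


if __name__ == "__main__":
    buf = MetaBuffer(b"AAAA")
    snapshot = buf.begin_flush()
    buf.write(0, b"BB")          # arrives while the flush is still running
    print(snapshot, bytes(buf.active))
    buf.end_flush()
```

The cost is visible even in the toy: every buffer under flush exists
twice, which is the extra resource management burden mentioned above.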

I would have to abandon using the buffer cache entirely for meta-data
buffers and go with a roll-my-own scheme. That might make porters
happier but it won't make me happier as it opens a whole new can of
worms on how to manage the buffer resources.

I would much rather work on the clustering, but if people are going
to constantly complain about HAMMER's performance I will have to take
2-3 months and deal with this issue first I guess.

-Matt
Matthew Dillon

#4 Updated by steve about 5 years ago

On Sun, 4 Oct 2009 10:29:45 -0700 (PDT)
Matthew Dillon wrote:

> I would much rather work on the clustering, but if people are going
> to constantly complain about HAMMER's performance I will have to take
> 2-3 months and deal with this issue first I guess.

Voice from the sidelines - I would rather you worked on the
clustering, after all the hammer code won't really be stable until the
clustering is in.

#5 Updated by corecode about 5 years ago

Steve O'Hara-Smith (via DragonFly issue tracker) wrote:
> Voice from the sidelines - I would rather you worked on the
> clustering, after all the hammer code won't really be stable until the
> clustering is in.

Are you using hammer more than casually on any machine? It is PAINFULLY
SLOW. And adding clustering won't help, it might only make it even worse.

#6 Updated by steve about 5 years ago

On Sun, 04 Oct 2009 21:46:30 +0200
"Simon 'corecode' Schubert" wrote:

> Steve O'Hara-Smith (via DragonFly issue tracker) wrote:
> > Voice from the sidelines - I would rather you worked on the
> > clustering, after all the hammer code won't really be stable until the
> > clustering is in.
>
> Are you using hammer more than casually on any machine? It is PAINFULLY
> SLOW. And adding clustering won't help, it might only make it even worse.

Clearly I am not using it as hard as you are. For the way I am
using it, it is far from slow. I would not be at all surprised if adding
clustering made it worse, which is in fact the reason I'm suggesting it's
better to do that before attempting to optimise it.

I do note that Matt suggested two scenarios that would lead to the
problem you describe. One being large amounts of inode creation and
destruction caused by unpacking large archives or deleting large numbers of
files, I have done both of these without seeing problems but perhaps they
weren't large enough. The other being heavy atime churn caused by heavy
random file access, this is a load that I have not (as yet) had cause to
impose on a hammer filesystem.

If you are suffering the latter then I would certainly agree with
the suggestion to mount with noatime, I do this routinely for any file
system which gets heavy random file access because atime updates will
cripple performance on any filesystem I have ever used by making the discs
spend more time seeking than reading. It does sound like hammer may suffer
more than other filesystems from this.

#7 Updated by corecode about 5 years ago

Steve O'Hara-Smith wrote:
> If you are suffering the latter then I would certainly agree with
> the suggestion to mount with noatime, I do this routinely for any file
> system which gets heavy random file access because atime updates will
> cripple performance on any filesystem I have ever used by making the discs
> spend more time seeking than reading. It does sound like hammer may suffer
> more than other filesystems from this.

This is possible. However I don't want to mount all my file systems
noatime. I'm using atime regularly for various purposes, so I don't
really want to run without it. Maybe I can try it though.

I think even with atime turned on, a filesystem needs to perform
acceptably. I'm not talking about high throughput. I'm talking about
xterm+shell taking >10s until the prompt appears, or vim occasionally
hanging for several seconds.

cheers
simon

#8 Updated by justin about 5 years ago

On Sun, October 4, 2009 1:29 pm, Matthew Dillon wrote:

> I would much rather work on the clustering, but if people are going
> to constantly complain about HAMMER's performance I will have to take
> 2-3 months and deal with this issue first I guess.

Before we have Matt, who can work full-time on clustering - the _purpose
of this project_ - digress for months on filesystem performance, can we
quantify the actual problem?

Corecode's report is the only one we have, and we don't have a realistic
test scenario that measures when it happens or how much it happens. I've
been running pkgsrc bulk builds on Hammer disks for quite a while now, and
I have not seen any noticeable change in overall run time from using UFS
to using Hammer. Chasing a heisenbug can be very difficult.

#9 Updated by steve about 5 years ago

On Mon, 05 Oct 2009 00:53:46 +0200
"Simon 'corecode' Schubert" wrote:

> Steve O'Hara-Smith wrote:
> > If you are suffering the latter then I would certainly agree
> > with the suggestion to mount with noatime, I do this routinely for any
> > file system which gets heavy random file access because atime updates
> > will cripple performance on any filesystem I have ever used by making
> > the discs spend more time seeking than reading. It does sound like
> > hammer may suffer more than other filesystems from this.
>
> This is possible. However I don't want to mount all my file systems
> noatime. I'm using atime regularly for various purposes, so I don't
> really want to run without it. Maybe I can try it though.

I certainly wouldn't mount all filesystems noatime, just ones that
take a lot of random access traffic.

> I think even with atime turned on, a filesystem needs to perform
> acceptably. I'm not talking about high throughput. I'm talking about
> xterm+shell taking >10s until the prompt appears, or vim occasionally
> hanging for several seconds.

That's something I have never seen happening, even when I do
have a high load of filesystem activity, discs running at 100% for extended
periods while largeish trees get moved around. The system certainly gets
sluggish under that kind of load but nowhere near as bad as that.

#10 Updated by hasso about 5 years ago

Steve O'Hara-Smith wrote:
> That's something I have never seen happening, even when I do
> have a high load of filesystem activity, discs running at 100% for
> extended periods while largeish trees get moved around. The system
> certainly gets sluggish under that kind of load but nowhere near as bad
> as that.

My experience is comparable with corecode's experience. Even moderate IO
load makes my audio stutter, my prompts and apps hang for a few seconds from
time to time etc. It's HAMMER + AHCI. It's not the concurrent IO issue we
obviously have (due to lack of IO scheduler). Surprisingly my UFS + NATA
machine performs much better in this regard.

Not to mention http://bugs.dragonflybsd.org/issue1502 which became
a showstopper for deploying DragonFly + HAMMER for me.

#11 Updated by alexh about 5 years ago

: Corecode's report is the only one we have, and we don't have a realistic
: test scenario that measures when it happens or how much it happens.
This is not really true. Hasso has also been complaining about very bad
performance and even showed some test cases and results running dd in various
scenarios. I also notice quite a performance degradation, especially working
with git. git diff and checkout particularly take ages on hammer.

: Before we have Matt, who can work full-time on clustering - the _purpose
: of this project_ - digress for months on filesystem performance, can we
: quantify the actual problem?
Quantifying the problem should be straightforward. As I mentioned before,
hasso did some of it, but in any case there are test programs (thinking dbench
or similar here, don't know if any of them is in pkgsrc) which can quantify it
easily.
About clustering being the main goal... the original goal, as far as I know,
was to have a well performing SMP system. Right now our SMP performance isn't
exactly good. I know that the main page is now full of references to clustering
being *THE* goal, but what happened to the original goals?

Cheers,
Alex Hornung

#12 Updated by steve about 5 years ago

On Mon, 5 Oct 2009 08:49:59 +0300
Hasso Tepper wrote:

> Steve O'Hara-Smith wrote:
> > That's something I have never seen happening, even when I do
> > have a high load of filesystem activity, discs running at 100% for
> > extended periods while largeish trees get moved around. The system
> > certainly gets sluggish under that kind of load but nowhere near as bad
> > as that.
>
> My experience is comparable with corecode's experience. Even moderate IO
> load makes my audio stutter, my prompts and apps hang for a few seconds from
> time to time etc. It's HAMMER + AHCI. It's not the concurrent IO issue we

Interesting because I'm using HAMMER + NATA, heavy IO load will
make video playback stutter but I've yet to make audio stutter.

Corecode - are you using NATA or AHCI ?

> obviously have (due to lack of IO scheduler). Surprisingly my UFS + NATA
> machine performs much better in this regard.

Can you try HAMMER + NATA ?

> Not to mention http://bugs.dragonflybsd.org/issue1502 which became
> a showstopper for deploying DragonFly + HAMMER for me.

I have no nohistory mounts but I delete 5GB files fairly often with
no ill effects on normal mounts.

#13 Updated by corecode about 5 years ago

Justin C. Sherrill (via DragonFly issue tracker) wrote:
> Justin C. Sherrill added the comment:
>
> On Sun, October 4, 2009 1:29 pm, Matthew Dillon wrote:
>
>> I would much rather work on the clustering, but if people are going
>> to constantly complain about HAMMER's performance I will have to take
>> 2-3 months and deal with this issue first I guess.
>
> Before we have Matt, who can work full-time on clustering - the _purpose
> of this project_ - digress for months on filesystem performance, can we
> quantify the actual problem?

What purpose? Clustering is set as a goal, not a purpose. The purpose
is to have a good operating system that will eventually be capable of
cluster operation. If people don't agree that having a usable operating
system matters, then I think I'm betting on the wrong horse.

> Corecode's report is the only one we have, and we don't have a realistic
> test scenario that measures when it happens or how much it happens. I've
> been running pkgsrc bulk builds on Hammer disks for quite a while now, and
> I have not seen any noticeable change in overall run time from using UFS
> to using Hammer. Chasing a heisenbug can be very difficult.

ARE YOU KIDDING ME? Don't you read what other people, especially Hasso,
report?

I AM NOT TALKING ABOUT THROUGHPUT! This is all about latency. Opening
xterm and waiting 5 or 10 seconds for the prompt is not acceptable.

So to anybody who "can not reproduce" this issue: stop being such
obnoxious smartasses. There ARE serious problems, and anybody who is
just trying to reproduce this will notice issues at once.

I think it's as simple as that: If you don't see problems, you're not
qualified to talk.

Even the most featureful file system is worth nothing if nobody can use it
because it just performs badly. Now is the time to address this before
we complicate the system even more and make it even harder to diagnose
and fix.

#14 Updated by steve about 5 years ago

On Mon, 05 Oct 2009 11:33:47 +0200
"Simon 'corecode' Schubert" wrote:

> I AM NOT TALKING ABOUT THROUGHPUT! This is all about latency. Opening
> xterm and waiting 5 or 10 seconds for the prompt is not acceptable.

That is certainly true, although I can slug down just about any
system to the point where that happens if I pile on enough load.

> So to anybody who "can not reproduce" this issue: stop being such
> obnoxious smartasses.

I for one am not trying to be an obnoxious smartass. I am observing
that my experience and yours differ - perhaps due to your load pattern,
perhaps due to other factors. Let's find out which.

> There ARE serious problems, and anybody who is
> just trying to reproduce this will notice issues at once.

No they will not - I have not been able to reproduce these issues.
I can certainly make my system sluggish but nowhere near as bad as you are
reporting. I could keep on adding load until it happens but by then it'll
probably be because the system is swap thrashing.

> I think it's as simple as that: If you don't see problems, you're not
> qualified to talk.

It is not as simple as that - there is something in common between
your experience and Hasso's that is not present in my experience - even
when I try to stress the system with random file IO - which I am doing as I
type this. Negative evidence is useful.

I do know from experience that getting to the bottom of problems
like this on high load systems can be difficult and can produce surprising
results.

Hasso may have hit the nail on the head with the combination of
AHCI and HAMMER - an indication of this is that I do not see the problem
and I do not have AHCI. This possibility could do with confirmation (which
I cannot do) - are you using AHCI ? If so can you try reverting to NATA and
seeing if it makes the problem go away ? Might you and Hasso be using
similar hardware and exposing a hardware/driver problem/limitation ?

> Even the most featureful file system is worth nothing if nobody can use it
> because it just performs badly. Now is the time to address this before
> we complicate the system even more and make it even harder to diagnose
> and fix.

Significant point - it does not perform badly for everybody,
pinning down that difference is likely to go a long way towards identifying
the problem. For that those of us who do not suffer the problem can
usefully help - for example if you were to say "run this script and watch
it fall down around your ears" I would be happy to do so and report on
whether or not it did indeed fall down around my ears.
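
A minimal sketch of such a script, in Python: it measures latency rather
than throughput by timing small file operations while some other load
(a tar extract, an rm -rf, a cvs update) runs on the same filesystem.
The names probe_latency and summarize and the choice of operations are
assumptions for illustration, not an agreed test case.

```python
import os
import statistics
import tempfile
import time


def probe_latency(directory, samples=50, interval=0.0):
    """Time create + fsync + stat + unlink of a tiny file.

    Returns a list of per-iteration latencies in seconds.
    """
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, b"x")
            os.fsync(fd)
        finally:
            os.close(fd)
        os.stat(path)
        os.unlink(path)
        latencies.append(time.perf_counter() - start)
        time.sleep(interval)
    return latencies


def summarize(latencies):
    """Median, approximate 95th percentile and worst case, in seconds."""
    ordered = sorted(latencies)
    return {
        "median": statistics.median(ordered),
        "p95": ordered[int(0.95 * (len(ordered) - 1))],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Replace the directory with one on the filesystem under test.
    print(summarize(probe_latency(tempfile.gettempdir(), samples=20)))
```

Running it once on an idle system and once under load, on both the
affected and the unaffected machines, would give numbers that can be
compared instead of impressions.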

It would be a great shame for Matt to spend several months
reworking the background flusher only to find that it wasn't causing the
problems in the first place, even more so if it didn't cure them.

#15 Updated by wbh about 5 years ago

Alex Hornung (via DragonFly issue tracker) wrote:
> Alex Hornung added the comment:
>
> : Corecode's report is the only one we have, and we don't have a realistic
> : test scenario that measures when it happens or how much it happens.
> This is not really true. Hasso has also been complaining about very bad
> performance and even showed some test cases and results running dd in various
> scenarios. I also notice quite a performance degradation, especially working
> with git. git diff and checkout particularly take ages on hammer.
>
>
> : Before we have Matt, who can work full-time on clustering - the _purpose
> : of this project_ - digress for months on filesystem performance, can we
> : quantify the actual problem?
> Quantifying the problem should be straightforward. As I mentioned before,
> hasso did some of it, but in any case there are test programs (thinking dbench
> or similar here, don't know if any of them is in pkgsrc) which can quantify it
> easily.
> About clustering being the main goal... the original goal, as far as I know,
> was to have a well performing SMP system. Right now our SMP performance isn't
> exactly good. I know that the main page is now full of references to clustering
> being *THE* goal, but what happened to the original goals?
>
> Cheers,
> Alex Hornung
>

Not to put too fine a point on it, but ISTR that HAMMER was originally a
serendipitous target of opportunity *en route* to SMP and clustering.

That said, if HAMMER *does* need work, well.... surely the goal is not to
cluster a problematic fs.

The woods are full of those already.

Bill Hacker

#16 Updated by dillon about 5 years ago

What we have here is a situation where corecode's xterm+shell startup
is accessing somewhere north of 900 files for various reasons. Big
programs with many shared libraries are getting run. If those
files get knocked out of the cache the startup is going to be slow.
This is what is happening.

HAMMER v2 is better at doing directory lookups but most of the time
seems to be spent on it searching the B-Tree for the first file data
block... it doesn't take a large percentage of misses out of the
900 files to balloon into a multi-second startup. UFS happens to have
a direct blockmap from the inode. HAMMER caches an offset to the
disk block containing the B-Tree entry most likely to contain the
file data reference. HAMMER depends a lot more on B-Tree meta-data
caches not getting blown out of the system.
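
A back-of-the-envelope sketch of why a modest miss rate is enough. Only
the 900 files come from the observation above; the miss rate, the
number of disk reads per miss and the 10 ms per read are assumed values
for illustration, not measurements.

```python
def startup_seconds(files=900, miss_rate=0.25, reads_per_miss=2, seek_ms=10.0):
    """Estimate time spent waiting on the disk for files not in the cache.

    files          -- files touched during startup
    miss_rate      -- fraction of those files that miss the cache (assumed)
    reads_per_miss -- disk reads needed per miss, e.g. B-Tree node plus
                      data block (assumed)
    seek_ms        -- cost of one random read in milliseconds (assumed)
    """
    return files * miss_rate * reads_per_miss * seek_ms / 1000.0


if __name__ == "__main__":
    for rate in (0.05, 0.25, 0.50):
        print(f"miss rate {rate:.0%}: {startup_seconds(miss_rate=rate):.1f} s")
```

With these assumed numbers a 25% miss rate already costs 4.5 seconds,
which is in the range being reported.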

Some 400,000 files get accessed when using rdist or cvs to update
something like the NetBSD CVS repo (corecode's test). I can prevent
the vnodes used to read files from getting blown out by vnodes used
to stat files, but vnodes are not thrown away unless the related VM
pages are thrown away so there is probably a VM page priority
adjustment that also needs to be made to retain the longer-cached
meta-data in the face of multi-gigabyte directory tree scans.
Something corecode is doing from cron is physically reading (not just
stat()ing) a large number of files.

I will make some adjustments to the VM page priority for meta-data
returned by the buffer cache to the VM system as well as some
adjustments to the vnode reclamation code to reduce instances of
long-lived file vnodes getting blown out by read-once data.
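
To illustrate the general problem of read-once data evicting long-lived
entries, here is a toy comparison of a plain LRU cache with a simple
two-queue cache that only protects entries accessed more than once.
This is a generic caching sketch with made-up names and sizes; it is
not the vnode reclamation or VM page priority code, and the two-queue
policy is a swapped-in textbook technique, not the planned change.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        """Touch `key`; return True on a hit."""
        hit = key in self.items
        if hit:
            self.items.move_to_end(key)
        else:
            self.items[key] = True
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)  # evict least recently used
        return hit


class TwoQueueCache:
    """Entries seen once sit in `probation`; a second access promotes them."""

    def __init__(self, capacity):
        self.probation_cap = capacity // 2
        self.protected_cap = capacity - self.probation_cap
        self.probation = OrderedDict()
        self.protected = OrderedDict()

    def access(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
            return True
        if key in self.probation:
            del self.probation[key]
            self.protected[key] = True
            if len(self.protected) > self.protected_cap:
                # Demote the oldest protected entry instead of dropping it.
                old, _ = self.protected.popitem(last=False)
                self.probation[old] = True
                self._trim_probation()
            return True
        self.probation[key] = True
        self._trim_probation()
        return False

    def _trim_probation(self):
        while len(self.probation) > self.probation_cap:
            self.probation.popitem(last=False)


def hits_after_scan(cache, hot_keys, scan_len):
    """Warm the hot set, run a one-time scan, then count surviving hot keys."""
    for _ in range(2):          # touch the hot set twice so it counts as reused
        for key in hot_keys:
            cache.access(key)
    for i in range(scan_len):   # read-once traffic, e.g. a tree walk from cron
        cache.access(("scan", i))
    return sum(cache.access(key) for key in hot_keys)


if __name__ == "__main__":
    print("LRU:      ", hits_after_scan(LRUCache(100), range(40), 1000))
    print("two-queue:", hits_after_scan(TwoQueueCache(100), range(40), 1000))
```

With a capacity of 100, a hot set of 40 and a scan of 1000 entries, the
plain LRU keeps none of the hot set while the two-queue cache keeps all
of it.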

-Matt

#17 Updated by tuxillo 9 months ago

  • Description updated (diff)
  • Category set to VFS subsystem
  • Assignee changed from 0 to tuxillo
  • Target version set to 3.8.0

Grab.
