Bug #1129

hammer-inodes: malloc limit exceeded

Added by ftigeot over 6 years ago. Updated almost 6 years ago.

Status: Closed
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%

Description

Hi,

I have replaced my backup system with a new machine running DragonFly and
Hammer. It is an SMP Opteron with 2GB memory. The backup partition is a
single disk on a 3Ware RAID controller.

The previous machine (FreeBSD 7/UFS2) ran rsnapshot every 4 hours and this
one continues with the same configuration. I have copied the content of the
old rsnapshot directory to the new backup disk before putting it in production.

For details on rsnapshot, see http://www.rsnapshot.org/

/backup is a 400GB Hammer disk:

Filesystem  Size  Used  Avail  Capacity  iused    ifree  %iused
Backup      371G  239G  132G   64%       2652953  0      100%

I have just encountered this panic:

panic: hammer-inodes: malloc limit exceeded
mp_lock = 00000000; cpuid = 0
panic
panic
kmalloc
hammer_get_inode
hammer_vop_nresolve
vop_nresolve
cache_resolve
nlookup
kern_stat
sys_lstat
syscall2
Xint0x80_syscall
Debugger("panic")

The backtrace was quickly copied by hand. I may be able to post the full trace
tomorrow if needed.

History

#1 Updated by dillon over 6 years ago

:Hi,
:
:I have replaced my backup system with a new machine running DragonFly and
:Hammer. It is a SMP Opteron with 2GB memory. The backup partition is a
:single disk on a 3Ware RAID controller.
:
:The previous machine (FreeBSD 7/UFS2) ran rsnapshot every 4 hours and this
:one continues with the same configuration. I have copied the content of the
:old rsnapshot directory to the new backup disk before putting it in production.
:
:For details on rsnapshot, see http://www.rsnapshot.org/
:
:/backup is a 400GB Hammer disk:
:
:Filesystem Size Used Avail Capacity iused ifree %iused
:Backup 371G 239G 132G 64% 2652953 0 100%
:
:I have just encountered this panic:
:
:panic: hammer-inodes: malloc limit exceeded
:mp_lock = 00000000; cpuid = 0
:...
:The backtrace was quickly copied by hand. I may be able to post the full trace
:tomorrow if needed.
:
:--
:Francois Tigeot

Francois, is this on a 2.0 release system or a HEAD or latest
release (from CVS) system?

I was sure I fixed this issue for machines with large amounts of ram.
Please do this:

sysctl vfs.maxvnodes

And tell me what it says. If the value is greater than 70000, set it
to 70000.

-Matt
Matthew Dillon

#2 Updated by ftigeot over 6 years ago

This is 2.0 + patches (current DragonFly_RELEASE_2_0_Slip)

There is no such sysctl. I used kern.maxvnodes instead; the original value
was 129055.

#3 Updated by dillon over 6 years ago

:This is 2.0 + patches (current DragonFly_RELEASE_2_0_Slip)
:
:...
:> And tell me what it says. If the value is greater than 70000, set it
:> to 70000.
:
:There is no such sysctl. I used kern.maxvnodes instead; the original value
:was 129055.
:
:--
:Francois Tigeot

Yah, I mistyped that. It's kern.maxvnodes. The fix I had made was
MFC'd to 2.0_Slip so the calculation must still be off. Reducing
maxvnodes should solve the panic. The basic problem is that HAMMER's
struct hammer_inode is larger than struct vnode so the vnode limit
calculations wind up being off.

You don't need to use the hardlink trick if backing up to a HAMMER
filesystem. I still need to write utility support to streamline
the user interface but basically all you have to do is use rdist, rsync,
or cpdup (without the hardlink trick) to overwrite the same destination
directory on the HAMMER backup system, then generate a snapshot
softlink. Repeat each day.

This is how I back up DragonFly systems. I have all the systems
NFS-exported to the backup system and it uses cpdup and the hammer
snapshot feature to create a softlink for each day.

backup# df -g -i /backup
Filesystem  1G-blocks  Used  Avail  Capacity  iused    ifree  %iused  Mounted on
TEST        696        281   414    40%       3605109  0      100%    /backup

backup# cd /backup/mirrors
backup# ls -la
...
drwxr-xr-x 1 root wheel 0 Aug 31 03:20 pkgbox
lrwxr-xr-x 1 root wheel 26 Jul 14 22:22 pkgbox.20080714 -> pkgbox@@0x00000001061a92cd
lrwxr-xr-x 1 root wheel 26 Jul 16 01:58 pkgbox.20080716 -> pkgbox@@0x000000010c351e83
lrwxr-xr-x 1 root wheel 26 Jul 17 03:08 pkgbox.20080717 -> pkgbox@@0x000000010d9ee6ad
lrwxr-xr-x 1 root wheel 26 Jul 18 03:12 pkgbox.20080718 -> pkgbox@@0x000000010f78313d
lrwxr-xr-x 1 root wheel 26 Jul 19 03:25 pkgbox.20080719 -> pkgbox@@0x0000000112505014
...

Doing backups this way has some minor management issues, and we really
need an official user utility to address them. When the backup disk
gets over 90% full I will have to start deleting softlinks and running
hammer prune, and I run about 30 minutes worth of hammer reblocking
ops every night from cron.

HAMMER locks-down atime/mtime when accessed via a snapshot so tar | md5
can be used to create a sanity check for each snapshot.

By my estimation it is going to take at least another 200+ days of
daily backups before I get to that point on my /backup system. I
may speed it up by creating some filler files so I can write and test
a user utility to do the management.

--

Another way of doing backups is to use the mirroring feature. This only
works when both the source and target filesystems are HAMMER filesystems
though, and the snapshot softlink would have to be created manually
(so we need more utility support to make it easier for userland to do).

-Matt
Matthew Dillon

#4 Updated by fjwcash over 6 years ago

On Sun, Aug 31, 2008 at 3:12 PM, Matthew Dillon wrote:
> You don't need to use the hardlink trick if backing up to a HAMMER
> filesystem. I still need to write utility support to streamline
> the user interface but basically all you have to do is use rdist, rsync,
> or cpdup (without the hardlink trick) to overwrite the same destination
> directory on the HAMMER backup system, then generate a snapshot
> softlink. Repeat each day.

In-filesystem snapshot support is such a handy tool. It's something
that I really miss on our Linux systems (LVM snapshots are separate
volumes, and you have to guesstimate how much room each one will use,
and you have to leave empty space in your volume group to support
them).

We use a similar setup for our remote backups box at work. It's a 2x
dual-core Opteron system with 8 GB of RAM and 12x 400 GB SATA HDs on
one 3Ware controller and 12x 500 GB SATA HDs on a second 3Ware
controller (all configured as Single Disks), running FreeBSD 7-STABLE
off a pair of 2 GB CompactFlash cards (gmirror'd). / is on the CF,
everything else (/usr, /usr/ports, /usr/local, /usr/ports/distfiles,
/usr/src, /usr/obj, /home, /tmp, /var, /storage) are ZFS filesystems
(the 24 drives are configured as a single raidz2 pool). There's a
quad-port Intel Pro/1000 gigabit NIC configured via lagg(4) as a
single load-balancing interface.

Every night, a cronjob creates a snapshot of /storage, then the server
connects to the remote servers via SSH, runs rsync against the entire
harddrive and a directory under /storage. For 37 servers, it takes
just under 2 hours for the rsync runs (the initial rsync can take
upwards of 12 hours per server, depending on the amount of data that
needs to be transferred). A normal snapshot uses <4 GB.

similar fashion, with Hammer filesystems.

<snip>

We've used just over 1 TB to completely archive 37 servers. Daily
snapshots use <5 GB each. This particular server (9 TB) should last
us for a couple of years. :) Even after we get the full 75 remote
servers being backed up, we should be good to keep at least 6 months
of daily backups online. :)

Or, you can use the mirror feature to mirror your backup server to an
offsite server. :) That's what we're planning on doing with ours,
using the "snapshot send" and "snapshot receive" features in ZFS.

There's lots of great work going on in filesystems right now. It's
nice to see the BSDs up near the front (FreeBSD with ZFS, DFlyBSD with
Hammer) again.

#5 Updated by dillon over 6 years ago

:We've used just over 1 TB to completely archive 37 servers. Daily
:snapshots use <5 GB each. This particular server (9 TB) should last
:us for a couple of years. :) Even after we get the full 75 remote
:servers being backed up, we should be good to keep at least 6 months
:of daily backups online. :)

It is a far cry from the tape backups we all had to use a decade ago.

These days if the backups aren't live, they are virtually worthless.

:Or, you can use the mirror feature to mirror your backup server to an
:offsite server. :) That's what we're planning on doing with ours,
:using the "snapshot send" and "snapshot receive" features in ZFS.

Ah, yes. I should have mentioned that. It is an excellent way to
bridge from a non-HAMMER filesystem to a HAMMER filesystem. At the
moment my off-site backup system is running linux (I'm stealing a 700G
disk from a friend of mine) so I can't run HAMMER, but hopefully some
point before the 2.2 release I'll be able to get my new DFly colo server
installed in the same colo facility and then I will be able to use
the mirroring stream to backup from the LAN backup machine to the
off-site backup machine, HAMMER-to-HAMMER.

:There's lots of great work going on in filesystems right now. It's
:nice to see the BSDs up near the front (FreeBSD with ZFS, DFlyBSD with
:Hammer) again.
:
:--
:Freddie Cash
:

I think the linux folks have wandered a little, but it only goes to
show that major filesystem design is the work of individuals, not OS
projects.

-Matt
Matthew Dillon

#6 Updated by ftigeot over 6 years ago

The panic occurred again with kern.maxvnodes set to 70000.
I have reduced it to 35000; we will see if the system is still stable in a
few days...
>
> You don't need to use the hardlink trick if backing up to a HAMMER
> filesystem. I still need to write utility support to streamline
> the user interface but basically all you have to do is use rdist, rsync,
> or cpdup (without the hardlink trick) to overwrite the same destination
> directory on the HAMMER backup system, then generate a snapshot
> softlink. Repeat each day.

I agree this is a better way with Hammer. I just don't want to use something
too different from my other backup servers for the time being...

#7 Updated by dillon over 6 years ago

:
:> MFC'd to 2.0_Slip so the calculation must still be off. Reducing
:> maxvnodes should solve the panic. The basic problem is that HAMMER's
:> struct hammer_inode is larger than struct vnode so the vnode limit
:> calculations wind up being off.
:
:The panic occurred again with kern.maxvnodes set to 70000.
:I have reduced it to 35000; we will see if the system is still stable in a
:few days...

That doesn't sound right, it should have been fine at 70000. Do
you have a kernel & core crashdump I can look at? (email me privately
if you do).

-Matt
Matthew Dillon

#8 Updated by ftigeot over 6 years ago

Unfortunately, this machine wasn't configured to save a crash dump.

I have now set up dumpdev and reset maxvnodes to the default value. We
should get a dump in a few days, the interval between crashes never
exceeded a week.

#9 Updated by ftigeot over 6 years ago

I've got a new panic.

How can I be sure to get a crash dump ? This machine actually panicked
once again before but for some reason didn't dump the core.

It is sitting at the kernel debugger screen for the moment.

#10 Updated by dillon over 6 years ago

:I've got a new panic.
:
:How can I be sure to get a crash dump ? This machine actually panicked
:once again before but for some reason didn't dump the core.
:
:It is sitting at the kernel debugger screen for the moment.
:
:--
:Francois Tigeot

It has to be set up beforehand. If it isn't, there isn't much you can
do from the debugger. If it is set up beforehand you can type 'panic'
from the debugger prompt & hit return twice (usually) and it will dump
before rebooting.

Generally speaking you set up to get a crash dump like this:

* Have enough swap space to cover main memory. i.e. if you have 4g of
ram, you need 4g of swap.

* Set dumpdev to point to the swap device in /etc/rc.conf. Example:
'dumpdev=/dev/ad6s1b'. (This takes effect when you reboot; you can
also set the dumpdev on the running system manually by running
'dumpon /dev/ad6s1b'.)

* Add 'kern.sync_on_panic=0' to your /etc/sysctl.conf to tell the
system not to try to flush the buffer cache when it crashes. This
improves its chances of being able to get to the dump code.

You can set the kernel up to automatically reboot on a panic (and
generate a crash dump if it has been set up to do one) by compiling
the kernel with:

options DDB
options DDB_TRACE
options DDB_UNATTENDED

-Matt
Matthew Dillon

#11 Updated by ftigeot over 6 years ago

[...]

Thanks for the instructions, I was finally able to get a crash dump.

I have put the content of /var/crash at this location:
http://www.wolfpond.org/crash.dfly/

#12 Updated by dillon over 6 years ago

Ok, I'm looking at the core. There do not appear to be any
memory leaks but HAMMER got behind on reclaiming inodes whose ref
count has dropped to 0.

In looking at the code I see a case that I am not handling in
VOP_SETATTR. Was the code you were running doing a lot of chmod,
chown, or other operations on file paths that do not require open()ing
the file?

-Matt
Matthew Dillon

#13 Updated by ftigeot over 6 years ago

Definitely.

Every time I got a crash, the machine was re-creating an hourly rsnapshot
arborescence from the previous one. It should have been a mix of mkdir /
chmod / chown ...

#14 Updated by dillon over 6 years ago

:Definitely.
:
:Every time I got a crash, the machine was re-creating an hourly rsnapshot
:arborescence from the previous one. It should have been a mix of mkdir /
:chmod / chown ...
:
:--
:Francois Tigeot

Ok, please try the patch below. This is kinda a kitchen sink approach
and will reduce performance somewhat when doing lots of
hardlinks/chmods/etc but I want to see if it deals with the problem.

Also reduce kern.maxvnodes to 100000.

-Matt
Matthew Dillon

Index: hammer_vnops.c
===================================================================
RCS file: /cvs/src/sys/vfs/hammer/hammer_vnops.c,v
retrieving revision 1.96
diff -u -p -r1.96 hammer_vnops.c
--- hammer_vnops.c 9 Aug 2008 07:04:16 -0000 1.96
+++ hammer_vnops.c 16 Sep 2008 22:52:12 -0000
@@ -1038,6 +1038,7 @@ hammer_vop_nlink(struct vop_nlink_args *
 		cache_setvp(nch, ap->a_vp);
 	}
 	hammer_done_transaction(&trans);
+	hammer_inode_waitreclaims(dip->hmp);
 	return (error);
 }

@@ -1108,6 +1109,7 @@ hammer_vop_nmkdir(struct vop_nmkdir_args
 		}
 	}
 	hammer_done_transaction(&trans);
+	hammer_inode_waitreclaims(dip->hmp);
 	return (error);
 }

@@ -1873,6 +1875,8 @@ done:
 	if (error == 0)
 		hammer_modify_inode(ip, modflags);
 	hammer_done_transaction(&trans);
+	if (ap->a_vp->v_opencount == 0)
+		hammer_inode_waitreclaims(ip->hmp);
 	return (error);
 }

#15 Updated by ftigeot over 6 years ago

Done.

If you don't hear about any new crash for a week, it means this patch is
good.

#16 Updated by ftigeot over 6 years ago

The machine has been stable so far.

I just noticed these unusual messages in the logs today:

Sep 21 09:04:29 akane kernel: HAMMER: Warning: UNDO area too small!
Sep 21 09:05:00 akane kernel: HAMMER: Warning: UNDO area too small!
Sep 21 09:06:11 akane kernel: HAMMER: Warning: UNDO area too small!

The time corresponds to an rsnapshot hourly run.

I had to reboot this machine for an unrelated problem. We should wait
a few more days to be sure the patch really fixes things.

I was never able to get more than 5-6 days uptime before.

#17 Updated by dillon over 6 years ago

:The machine has been stable so far.
:
:I just noticed these unusual messages in the logs today:
:
:Sep 21 09:04:29 akane kernel: HAMMER: Warning: UNDO area too small!
:Sep 21 09:05:00 akane kernel: HAMMER: Warning: UNDO area too small!
:Sep 21 09:06:11 akane kernel: HAMMER: Warning: UNDO area too small!
:
:The time corresponds to an rsnapshot hourly run.
:...
:Francois Tigeot

How large is the filesystem your rsnapshot is writing to?

HAMMER tries to estimate how much space dependencies take up in the
UNDO FIFO and tries to split the work up into multiple flush cycles
such that each flush cycle does not exhaust the UNDO space.

The warning means that the HAMMER backend had to issue a flush cycle
before it really wanted to, potentially causing some directory
dependencies to get split between two flush cycles. If a crash were
to occur during those particular flush cycles the hard link count
between file and directory entry could wind up being wrong.
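
The splitting described above can be pictured with a toy model. This is only a sketch under stated assumptions, not HAMMER's estimator: each pending operation carries an estimated UNDO cost in bytes, and operations are packed in order into flush cycles so that no cycle's estimated total exceeds the UNDO capacity. The function name and the byte figures are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the flush cycles needed to process est[0..n-1] in order when
 * each cycle may use at most undo_cap bytes of estimated UNDO space.
 * An operation larger than undo_cap still gets a cycle of its own.
 */
size_t count_flush_cycles(const size_t *est, size_t n, size_t undo_cap)
{
    size_t cycles = 0;
    size_t used = 0;

    assert(est != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (cycles == 0 || used + est[i] > undo_cap) {
            ++cycles;       /* start a new flush cycle */
            used = 0;
        }
        used += est[i];
    }
    return cycles;
}
```

With four operations estimated at 40 bytes each and a capacity of 100, the model needs two cycles. If the estimates are too low, a cycle that looked safe on paper can run out of real UNDO space and has to be flushed early, which is the situation the warning reports.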

I think the problem may be caused by the hardlink trick you are using
to duplicate directory trees. HAMMER's estimator is probably not
taking into account the tens of thousands of hardlinks (directory-to-file
link count dependencies) and directory-to-directory dependencies from
creating the target directory hierarchy that can build up when files are
simply being linked.

For now, keep watch on it. The warning itself is not a big deal.
If HAMMER panics on insufficient UNDO space, though, that's a
different matter.

-Matt
Matthew Dillon

#18 Updated by ftigeot over 6 years ago

It is a single volume on a 400GB disk:

$ df -ih .
Filesystem  Size  Used  Avail  Capacity  iused    ifree  %iused  Mounted on
Backup      371G  314G  57G    85%       2642085  0      100%    /backup

#19 Updated by dillon over 6 years ago

:It is a single volume on a 400GB disk:
:
:$ df -ih .
:Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
:Backup 371G 314G 57G 85% 2642085 0 100% /backup
:
:--
:Francois Tigeot

Interesting. It should have a full-sized undo area, which means the
dependencies resulted in 600MB+ worth of undos. I'll have to start
testing with hardlinks.

Be sure to regularly prune and reblock that sucker. If you aren't
using the history feature I expect you'll want to mount it 'nohistory'
too. I found out the hard way that one still needs to spend about
5 minutes a day reblocking a HAMMER filesystem to keep fragmentation
in check.

-Matt
Matthew Dillon

#20 Updated by ftigeot about 6 years ago

The machine is still stable, with 4 days of uptime.

I have found a new strange warning in the logs:
Warning: BTREE_REMOVE: Defering parent removal2 @ 80000058efe06000, skipping

It occurred during a rsnapshot hourly run.

#21 Updated by dillon about 6 years ago

:The machine is still stable, with a 4 days uptime.
:
:I have found a new strange warning in the logs:
:Warning: BTREE_REMOVE: Defering parent removal2 @ 80000058efe06000, skipping
:
:It occurred during a rsnapshot hourly run.
:
:--
:Francois Tigeot

You can ignore that one, it's harmless.

-Matt
Matthew Dillon

#22 Updated by matthias about 6 years ago

Hi,

I have encountered the same panic on one of my machines running HAMMER
here:

panic: hammer-inodes: malloc limit exceeded
mp_lock = 00000000; cpuid = 0
Trace beginning at frame 0xdf0b1a20
panic(df0b1a44,c03f67c0,ff80048c,248,df0b1a68) at panic+0x142
panic(c039cecf,c03a501c,0,11,ff800000) at panic+0x142
kmalloc(248,c03f67c0,102,db54f000,c02e3a62) at kmalloc+0xa5
hammer_create_inode(df0b1ac0,df0b1b5c,de4efa88,e268b738,0) at hammer_create_inode+0x26
hammer_vop_ncreate(df0b1b00,c03eab70,c41090b8,0,0) at hammer_vop_ncreate+0x72
vop_ncreate(c41090b8,df0b1c84,e157a9b8,df0b1c08,de4efa88) at vop_ncreate+0x3d
vn_open(df0b1c84,d746e638,603,1a4,c410ecf8) at vn_open+0xf3
kern_open(df0b1c84,602,1b6,df0b1cf0,ec0636c8) at kern_open+0x84
sys_open(df0b1cf0,c192e8,0,da362b78,c03db85c) at sys_open+0x32
syscall2(df0b1d40) at syscall2+0x240
Xint0x80_syscall() at Xint0x80_syscall+0x36
Debugger("panic")

CPU0 stopping CPUs: 0x00000002
stopped
panic: from debugger
mp_lock = 00000000; cpuid = 0
boot() called on cpu#0
Uptime: 20d22h48m26s

dumping to dev #ad/0x20051, blockno 4269104
dump
Fatal double fault:
eip = 0xc03733f1
esp = 0xdf0aefb0
ebp = 0xdf0af024
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
panic: double fault
mp_lock = 00000000; cpuid = 0
boot() called on cpu#0
Uptime: 20d22h48m26s
Dump already in progress, bailing...
spin_lock: 0xc4107d6c, indefinite wait!
spin_lock: 0xc4107d64, indefinite wait!
Shutting down ACPI
Automatic reboot in 15 seconds - press a key on the console to abort
--> Press a key on the console to reboot,
--> or switch off the system now.
Rebooting...

Unfortunately no crash dump :( The machine is running HEAD from Tue Sep
30 11:47:27 CEST 2008. It is an Intel C2D 3GHz with 2GB RAM running an
SMP kernel. The fs layout is as follows:

ROOT                             292G   60G  232G  21%  /
/dev/ad10s1a                     252M  138M   94M  59%  /boot
/pfs/@@0xffffffffffffffff:00001  292G   60G  232G  21%  /usr
/pfs/@@0xffffffffffffffff:00003  292G   60G  232G  21%  /var
/pfs/@@0xffffffffffffffff:00006  292G   60G  232G  21%  /tmp
/pfs/@@0xffffffffffffffff:00007  292G   60G  232G  21%  /home
/pfs/@@0xffffffffffffffff:00005  292G   60G  232G  21%  /var/tmp
/pfs/@@0xffffffffffffffff:00002  292G   60G  232G  21%  /usr/obj
/pfs/@@0xffffffffffffffff:00004  292G   60G  232G  21%  /var/crash

The machine performed a pkgsrc "cvs update" before it crashed. If more
information is needed, I'll provide it. After the reboot kern.maxvnodes
is 129055 if that matters ...

Regards

Matthias

#23 Updated by dillon about 6 years ago

:The machine performed a pkgsrc "cvs update" before it crashed. If more
:information is needed, I'll provide it. After the reboot kern.maxvnodes
:is 129055 if that matters ...
:
:Regards
:
: Matthias

It's the same issue. Drop kern.maxvnodes to 100000.

I am going to add an API to set the kmalloc pool's limit so HAMMER
can size it according to the size of hammer_inode.
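
A minimal sketch of that idea, assuming a hypothetical helper rather than the API that was eventually added: derive the pool's byte limit from the vnode limit and the per-inode allocation size, with optional headroom for inodes that are waiting to be reclaimed. The function names and the headroom parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: bytes the inode pool should be allowed to hold.
 * extra_inodes is headroom for inodes that no longer have a vnode
 * (an assumption of this sketch, not a value taken from HAMMER).
 */
size_t inode_pool_limit(size_t maxvnodes, size_t extra_inodes,
    size_t inode_bytes)
{
    assert(inode_bytes > 0);
    return (maxvnodes + extra_inodes) * inode_bytes;
}

/* Inverse view: how many inodes fit under a given byte limit. */
size_t inodes_that_fit(size_t limit_bytes, size_t inode_bytes)
{
    assert(inode_bytes > 0);
    return limit_bytes / inode_bytes;
}
```

For example, with kern.maxvnodes = 129055 and the 0x248 (584) byte request seen in the trace above, inode_pool_limit(129055, 0, 584) comes to 75368120 bytes before any headroom is added.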

-Matt
Matthew Dillon

#24 Updated by aoiko about 6 years ago

Should this be closed?

#25 Updated by aoiko about 6 years ago

Fix committed by dillon@

#26 Updated by qhwt+dfly almost 6 years ago

Do I still need to lower kern.maxvnodes to avoid the panic on a machine
with >=2G bytes of RAM? I still see this panic while running blogbench
for a couple of hours, without increasing or decreasing kern.maxvnodes.

(kgdb) bt
:
#2 0xc0198e0c in panic (fmt=0xc02e71dd "%s: malloc limit exceeded")
at /home/source/dragonfly/current/src/sys/kern/kern_shutdown.c:800
#3 0xc0196a4f in kmalloc (size=584, type=0xc4170010, flags=258)
at /home/source/dragonfly/current/src/sys/kern/kern_slaballoc.c:490
#4 0xc0260056 in hammer_get_inode (trans=0xde16db20, dip=0xe47032d0,
obj_id=180316461440, asof=18446744073709551615, localization=131072,
flags=0, errorp=0xde16da68)
at /home/source/dragonfly/current/src/sys/vfs/hammer/hammer_inode.c:376
#5 0xc026fc95 in hammer_vop_nresolve (ap=0xde16db78)
at /home/source/dragonfly/current/src/sys/vfs/hammer/hammer_vnops.c:924
#6 0xc01ee2d4 in vop_nresolve_ap (ap=0xde16db78)
at /home/source/dragonfly/current/src/sys/kern/vfs_vopops.c:1613
#7 0xde35b032 in ?? ()
:

(kgdb) p *type
$10 = {ks_next = 0xc416ff50, ks_memuse = {55452800, 51921920,
0 <repeats 14 times>}, ks_loosememuse = 107374720, ks_limit = 107374182,
ks_size = 0, ks_inuse = {86645, 81128, 0 <repeats 14 times>},
ks_calls = 1694049, ks_maxused = 0, ks_magic = 877983977,
ks_shortdesc = 0xc02f039d "HAMMER-inodes", ks_limblocks = 0,
ks_mapblocks = 0, ks_reserved = {0, 0, 0, 0}}
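
The numbers in this dump can be cross-checked by hand. The sketch below is a user-space illustration, not kernel code: it sums the two non-zero per-CPU ks_memuse and ks_inuse entries and compares the total with ks_limit. The function names are invented for the example; only the figures come from the kgdb output above.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n per-CPU counters. */
long sum_longs(const long *v, size_t n)
{
    long total = 0;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        total += v[i];
    return total;
}

/* Returns 1 when the summed usage has reached the pool limit. */
int limit_reached(const long *memuse, size_t ncpus, long limit)
{
    return sum_longs(memuse, ncpus) >= limit;
}

/* Average bytes charged to the pool per live object (0 if none). */
long bytes_per_object(const long *memuse, const long *inuse, size_t ncpus)
{
    long objs = sum_longs(inuse, ncpus);

    return objs ? sum_longs(memuse, ncpus) / objs : 0;
}
```

With ks_memuse = {55452800, 51921920} and ks_inuse = {86645, 81128}, the totals are 107374720 bytes across 167773 objects, just past ks_limit = 107374182. That works out to 640 bytes per object even though kmalloc() was called with size=584, which would be consistent with the allocator rounding the request up to a larger chunk size.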

#27 Updated by dillon almost 6 years ago

:Do I still need to lower kern.maxvnodes to avoid the panic on machine
:with >=2G bytes of RAM? I still see this panic while running blogbench
:for a couple of hours, without increasing or decreasing kern.maxvnodes.
:
: (kgdb) bt
: :
: #2 0xc0198e0c in panic (fmt=0xc02e71dd "%s: malloc limit exceeded")
: at /home/source/dragonfly/current/src/sys/kern/kern_shutdown.c:800

I'd like to get another crash dump if possible, before you lower the
limit. There is still clearly an issue which I would like to get
fixed before the January release.

Once you get a crash dump over to leaf then please lower the limit
and see if you can panic the machine.

-Matt

#28 Updated by qhwt+dfly almost 6 years ago

Ok, scp'ed as ~y0netan1/crash/{kernel,vmcore}.4 .

Sure. Oh, BTW, although this machine has two HAMMER partitions mounted
(/HAMMER and /var/vkernel), I was only using the former for blogbench,
the latter was mounted but totally idle.

#29 Updated by qhwt+dfly almost 6 years ago

I don't know exactly how desiredvnodes limits the amount of kmalloc's,
but I did notice that there are two places where it's used to compute
HAMMER-related values:
- hammer_vfs_init(): vfs.hammer.limit_iqueued is computed at the first
call and set to desiredvnodes / 5; after that, you need to set it
manually
- hammer_vfs_mount(): if I understand the code correctly, malloc limit
is only updated when the HAMMER volume is unmounted then mounted.

So I went into single-user mode, set these parameters, unmounted and
re-mounted the HAMMER filesystems. It seems that with kern.maxvnodes=100000
it can still panic the machine.

BTW, I have a few questions WRT kmalloc():

kern_slaballoc.c:478
    while (type->ks_loosememuse >= type->ks_limit) {
        int i;
        long ttl;

        for (i = ttl = 0; i < ncpus; ++i)
            ttl += type->ks_memuse[i];
        type->ks_loosememuse = ttl; /* not MP synchronized */
        if (ttl >= type->ks_limit) {
            if (flags & M_NULLOK) {
                logmemory(malloc, NULL, type, size, flags);
                return(NULL);
            }
            panic("%s: malloc limit exceeded", type->ks_shortdesc);
        }
    }
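
As a reading aid, the check quoted above can be modelled in user space. This is a simplified sketch, not the kernel source: the struct, the flag value and the result codes are invented for illustration, and a result code stands in for the panic.

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS     2     /* assumption: two CPUs */
#define X_NULLOK  0x1   /* stand-in for M_NULLOK; the value is arbitrary */

enum check_result { CHECK_OK, CHECK_NULL, CHECK_PANIC };

struct pool {
    long memuse[NCPUS]; /* per-CPU byte counts */
    long loosememuse;   /* cheap, possibly stale running total */
    long limit;         /* pool limit in bytes */
};

/*
 * Same shape as the quoted loop: the loose counter is a cheap first
 * test, and only when it trips are the per-CPU counters summed.
 */
enum check_result check_limit(struct pool *p, int flags)
{
    assert(p != NULL);
    while (p->loosememuse >= p->limit) {
        long ttl = 0;

        for (int i = 0; i < NCPUS; ++i)
            ttl += p->memuse[i];
        p->loosememuse = ttl;   /* resynchronise the loose counter */
        if (ttl >= p->limit)
            return (flags & X_NULLOK) ? CHECK_NULL : CHECK_PANIC;
    }
    return CHECK_OK;
}
```

With memuse = {10, 20}, loosememuse = 100 and limit = 50, the function resets the loose counter to 30 and returns CHECK_OK. With memuse = {30, 30} and loosememuse = 60 it returns CHECK_PANIC, or CHECK_NULL when X_NULLOK is set.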

1. don't we need M_LOOPOK flag, which tells kmalloc() to wait until
the sum of ks_memuse[] becomes lower than ks_limit? of course
only when !M_NULLOK && M_WAITOK.
struct hammer_inode is fairly small in size, so there could be
a good chance that a couple of them gets reclaimed after a while.

2. I know ks_loosememuse is not MP synchronized, but ks_memuse[] is
summed up without any locks, either. couldn't there be a race?

3. shouldn't the conditionals be
while (type->ks_loosememuse + size >= type->ks_limit) {
...
if (ttl + size >= type->ks_limit) ...

to catch the situation earlier?

Thanks in advance.

#30 Updated by dillon almost 6 years ago

:1. don't we need M_LOOPOK flag, which tells kmalloc() to wait until
: the sum of ks_memuse[] becomes lower than ks_limit? of course
: only when !M_NULLOK && M_WAITOK.
: struct hammer_inode is fairly small in size, so there could be
: a good chance that a couple of them gets reclaimed after a while.

No, because there is no guarantee that the caller won't deadlock.
The bug is that the subsystem (HAMMER in this case) didn't control
the allocations it was making.

:2. I know ks_loosememuse is not MP synchronized, but ks_memuse[] is
: summed up without any locks, either. couldn't there be a race?

ks_loosememuse can be very wrong. Summing up ks_memuse[] will
give a correct result and while races can occur the difference
will only be the difference due to the race, not some potentially
wildly incorrect value.

:3. shouldn't the conditionals be
: while (type->ks_loosememuse + size >= type->ks_limit) {
: ...
: if (ttl + size >= type->ks_limit) ...
:
: to catch the situation earlier?
:
:Thanks in advance.

I don't think this will help. Subsystems have to control their
memory use. The kernel can't really save them. HAMMER has an
issue where it can allocate a virtually unlimited number of
hammer_inode structures. I have lots of code in there to try
to slow it down when it gets bloated but clearly some cases are
getting through and still causing the allocations to spiral out
of control.

-Matt
Matthew Dillon

#31 Updated by dillon almost 6 years ago

What arguments to blogbench are you using and how long does it run
before it hits the malloc panic?

I found one possible path where inodes can build up but it still
doesn't feel quite right because even that path has upwards of a
2-second tsleep. With so many blogbench threads running in
parallel it could be the cause but it still ought to take a while
to build up that many extra inodes. I think I need to reproduce the
problem locally to determine if that path is the cause.

-Matt

#32 Updated by qhwt+dfly almost 6 years ago

I almost forgot, but fortunately vmcore.4 contains it:
blogbench -d0 -i1000 -o

`0' is the work directory, which has nohistory flag on it.

I think it survived for about three hours according to `last':
reboot ~ Sat Dec 27 02:37
qhwt ttyp0 eden Fri Dec 26 23:40 - crash (02:56)
reboot ~ Fri Dec 26 23:39

#33 Updated by dillon almost 6 years ago

:I almost forgot, but fortunately vmcore.4 contains it:
:blogbench -d0 -i1000 -o
:
:`0' is the work directory, which has nohistory flag on it.
:
:I think it survived for about three hours according to `last':
:reboot ~ Sat Dec 27 02:37
:qhwt ttyp0 eden Fri Dec 26 23:40 - crash (02:56)
:reboot ~ Fri Dec 26 23:39

Cool, I reproduced the issue. It is quite interesting. What is
happening is that the load on HAMMER is causing the HAMMER flusher
to get excessive deadlocks, stalling it out and preventing it from
making any progress. Because of that no matter how much I slow down new
inode creation the inode count just keeps building up.

What is really weird is that if I ^Z the blogbench and let the flusher
catch up, then resume it, the flusher is then able to stay caught up
for a few minutes before starting to get behind again.

I am experimenting with a number of possible solutions, including
having the flusher try a different inode if it hits a deadlock, to
see if I can prevent the stall-outs.
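
One way to picture the "try a different inode" idea is a work list where an inode that hits a deadlock is set aside and retried later instead of blocking everything behind it. The sketch below is a toy model with invented names, not the change that was committed; the try_flush callback stands in for the real flush attempt.

```c
#include <assert.h>
#include <stddef.h>

/* Callback: returns 0 on success, non-zero if the inode hit a deadlock. */
typedef int (*try_flush_fn)(int inode_id, void *ctx);

/*
 * Make up to max_passes passes over ids[0..n-1]. Inodes that deadlock
 * are compacted to the front of the array and retried on the next pass.
 * Returns the number of inodes still unflushed.
 */
size_t flush_with_retry(int *ids, size_t n, size_t max_passes,
    try_flush_fn try_flush, void *ctx)
{
    assert(ids != NULL || n == 0);
    assert(try_flush != NULL);
    for (size_t pass = 0; pass < max_passes && n > 0; ++pass) {
        size_t kept = 0;

        for (size_t i = 0; i < n; ++i) {
            if (try_flush(ids[i], ctx) != 0)
                ids[kept++] = ids[i];   /* deadlocked: retry later */
        }
        n = kept;
    }
    return n;
}

/* Example callback: inode 2 deadlocks on its first attempt only. */
int flaky_flush(int inode_id, void *ctx)
{
    int *tries_on_2 = ctx;  /* caller-provided attempt counter */

    if (inode_id != 2)
        return 0;
    return ++*tries_on_2 == 1;
}
```

Given ids {1, 2, 3} and a single pass, inodes 1 and 3 are flushed and inode 2 is left at the front of the array for the next pass, so one stuck inode does not hold up the rest.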

-Matt
Matthew Dillon

#34 Updated by dillon almost 6 years ago

I've committed a HAMMER update to the master branch which should fix
the issue revealed by blogbench.

It is a bit of a hack but it seems to work in my tests so far.

-Matt

#35 Updated by qhwt+dfly almost 6 years ago

No more panics so far, thanks!

#36 Updated by dillon almost 6 years ago

Believed to be fixed now, closing.
