Bug #1031

panic: hammer_mount: malloc limit exceeded

Added by ftigeot over 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

Hi,

I have just upgraded my machine to today's 1.13 (HAMMER 53G) and
re-created a test Hammer filesystem. I then ran cpdup to copy /home to
the new fs.

The exact command was: cpdup -I -v -i0 /home /home.hammer

Everything seemed to be ok at first. After about 1 hour, the kernel panicked.

Details of configuration:

Core 2 Duo, 2GB RAM
ad4: 400GB Western Digital WD4000YS
ad8: 500GB Western Digital WD5001ABYS (faster than ad4)

Filesystems after panic:

Filesystem  Size  Used  Avail  Capacity  iused   ifree   %iused  Mounted on
/dev/ad8    466G  241G  187G   56%       196815  373807  34%     /home
Home        347G  139G  208G   40%       78725   0       100%    /home.hammer

/home.hammer is on ad4

Text of panic (recopied by hand):

panic: hammer_mount: malloc limit exceeded
mp_lock: 00000000; cpuid = 0
Trace beginning at frame 0xdc4aaabc
panic(dc4aaae0,0,0c04017e0,ff80048c,dc4aab08) at panic+0x14d
panic(c03a50ae,c03ae7c5,c40f1948,c42b2474,ff800000) at panic+0x14d
kmalloc(88,c04017e0,102,0,0) at kmalloc + 0xbb
hammer_alloc_mem_record(e6fb7d18,0,94000,0,2c9e634e) at hammer_alloc_mem_record + 0x22
hammer_ip_add_bulk(e6fb7d18,90000,0,c9248000,4000) at hammer_ip_add_bulk + 0x11c
hammer_vop_strategy(dc4aacb8,c03f6268,d3a0f510,c264b720,93000) at hammer_vop_strategy + 0x676
vop_strategy(d3a0f510,e000d18e8,c4435a48) at vop_strategy + 0x2c
vn_strategy(e000d18e8,c4435a48,c4435b30,c4435b30,4000) at vn_strategy + 0x50
bwrite(c4435a18) at bwrite + 0xfa
vfs_bio_awrite(c4435a18,c41b8ac0,ff800000,dc4aad84,c01f9f6e) at vfs_bio_awrite + 0x301
flushbufqueues(c41b89d0,0,c01b1a3,0,0) at flushbufqueues + 0x148
buf_daemon_hw(0,0,0,0,0) at buf_daemon_hw + 0x4f
kthread_exit() at kthread_exit
Debugger("panic")

CPU0 stopping: CPUs: 0x00000002
stopped
Stopped at Debugger + 0x34: movb $0,in_Debugger.4464
db>

I am fairly confident I can reproduce this given sufficient time.

History

#1 Updated by dillon over 6 years ago

:Hi,
:
:I have just upgraded my machine to today's 1.13 (HAMMER 53G) and
:re-created a test Hammer filesystem. I then ran cpdup to copy /home to
:the new fs.
:
:The exact command was: cpdup -I -v -i0 /home /home.hammer
:
:Everything seemed to be ok at first. After about 1 hour, the kernel panicked.
:
:Details of configuration:

Ok, I'll mess with it. It's a minor issue... HAMMER blowing out its
kmalloc bucket is just me forgetting to put in checks in certain places.
'H' should fix some of that.

-Matt

#2 Updated by ftigeot over 6 years ago

After the 53H commit, the panic doesn't occur anymore.

There's a deadlock instead: after about 5 minutes of copying the machine
stopped responding. No specific message on the console, display frozen.

I tried to break into the debugger with CTRL-ALT-ESC but it didn't change
anything.

#3 Updated by dillon over 6 years ago

:After the 53H commit, the panic doesn't occur anymore.
:
:There's a deadlock instead: after about 5 minutes of copying the machine
:stopped responding. No specific message on the console, display frozen.
:
:I tried to break into the debugger with CTRL-ALT-ESC but it didn't change
:anything.
:
:--
:Francois Tigeot

I have been able to reproduce this. I can reproduce it sometimes when
I run blogbench.

I think what is happening is that the buffer cache is getting blown
out with too many B_LOCKED buffers and for some reason is not
recovering, and the buffer cache is live-locking inside a critical
section.

I should be able to track it down soon. The same problem also appears
to be the cause of blogbench's performance drop-off.

-Matt
Matthew Dillon

#4 Updated by dillon over 6 years ago

::After the 53H commit, the panic doesn't occur anymore.
::
::There's a deadlock instead: after about 5 minutes of copying the machine
::stopped responding. No specific message on the console, display frozen.
::
::I tried to break into the debugger with CTRL-ALT-ESC but it didn't change
::anything.
::
::--
::Francois Tigeot

I believe I have fixed the lockup with commit 54. I have also, I
believe, completely fixed sequential read performance and have made
good progress on sequential write performance.

-Matt
Matthew Dillon

#5 Updated by ftigeot over 6 years ago

I have been unable to reproduce the lockup.

However, the original panic is back, this time after only about 5 minutes
of copying:

panic: hammer_mount: malloc limit exceeded
mp_lock = 00000001; cpuid = 1
Trace beginning at frame 0xdf1e28d4
panic(df1e28f8,0,c04026e0,ff80848c,df1e2920) at panic + 0x14d
panic(c03a5e6e,c03af5a9,c40f1948,c4772f5c,ff808000) at panic + 0x14d
kmalloc(88,c04026e0,102,0,0) at kmalloc + 0xbb
hammer_alloc_mem_record(e4a0d698,0,6e7c000,0,260a5c42) at hammer_alloc_mem_record + 0x22
hammer_ip_add_bulk(e4a0d698,6e78000,0,ccc30000,4000) at hammer_ip_add_bulk + 0x11c
hammer_vop_strategy(df1e2acc,c03f7148,d3a0db30,c1e72440,6e7b000) at hammer_vop_strategy + 0x64b
vop_strategy(d3a0db30,e2847168,c4608830) at vop_strategy + 0x2c
vn_strategy(e2847168,c4608830,c4608918,c4608918,0) at vn_strategy + 0x50
bwrite(c4608800) at bwrite + 0xfa
bawrite(c4608800) at bawrite + 0x13
hammer_vop_write(df1e2ba4,c03f70b8,d3a0db30,0,0) at hammer_vop_write + 0x3ef
vop_write(d3a0db30,e2847168,df1e2ca0,7f0001,def91948) at vop_write + 0x34
vn_write(de965fd0,df1e2ca0,def91948,0,de965fd0) at vn_write + 0x1e1
kern_pwritev(8,df1e2ca0,0,df1e2cf0) at kern_pwritev + 0xf6
sys_write(df1e2cf0,6,c6415ad,0,da413e18) at sys_write + 0x68
syscall2(df1e240) at syscall2 + 0x25f
Xint0x80_syscall() at Xint0x80_syscall + 0x36
Debugger("panic")

CPU1 stopping CPUs: 0x00000001
stopped
Stopped at Debugger + 0x34: movb $0,in_Debugger.4464
db>

#6 Updated by dillon over 6 years ago

:> I believe I have fixed the lockup with commit 54. I have also, I
:> believe, completely fixed sequential read performance and have made
:> good progress on sequential write performance.
:
:I have been unable to reproduce the lockup.
:
:However, the original panic is back, this time after only about 5 minutes
:of copying:
:
:panic: hammer_mount: malloc limit exceeded
:mp_lock = 00000001; cpuid = 1
:Trace beginning at frame 0xdf1e28d4
:...
:--
:Francois Tigeot
:

Ugh. Do me a favor. Do a 'sysctl vfs.hammer' every few seconds while
you are copying and cut and paste the last few as of when it crashes.
That will tell me which structure is blowing out the memory.

Unfortunately it also looks like I have not completely fixed the
lockup, my own testing (buildworld loop + blogbench + fsxtest) locked
up overnight.

-Matt
Matthew Dillon

#7 Updated by ftigeot over 6 years ago

Here it is. Each --- marks a one second pause.
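A loop along these lines produces samples like the ones below (a sketch; the exact invocation was not recorded in the report). The `poll_hammer` function name is mine, and `HAMMER_SYSCTL` is a hypothetical override so the sketch can be exercised on a non-DragonFly machine; on the real box it simply runs `sysctl vfs.hammer`:

```shell
# Sketch of polling HAMMER's vfs.hammer counters with a separator
# line between samples. HAMMER_SYSCTL is a hypothetical stand-in
# command for testing; it defaults to the real sysctl invocation.
poll_hammer() {
    # $1 = number of samples, $2 = seconds between samples
    n=$1
    while [ "$n" -gt 0 ]; do
        ${HAMMER_SYSCTL:-sysctl vfs.hammer}
        echo "---"
        sleep "$2"
        n=$((n - 1))
    done
}

# On the DragonFly box, something like: poll_hammer 9999 1
```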

vfs.hammer.contention_count: 1917
vfs.hammer.zone_limit: 0
vfs.hammer.count_io_running_write: 0
vfs.hammer.count_io_locked: 2
vfs.hammer.count_io_running_read: 1
vfs.hammer.count_reservations: 1172
vfs.hammer.count_refedbufs: 12
vfs.hammer.count_dirtybufs: 66
vfs.hammer.count_nodes: 22
vfs.hammer.count_buffers: 249
vfs.hammer.count_volumes: 1
vfs.hammer.count_record_datas: 4031
vfs.hammer.count_records: 601674
vfs.hammer.count_reclaiming: 1
vfs.hammer.count_iqueued: 4934
vfs.hammer.count_inodes: 33721
vfs.hammer.limit_iqueued: 26846
vfs.hammer.limit_recs: 320000
vfs.hammer.limit_irecs: 102400
vfs.hammer.limit_dirtybufs: 1610
---
vfs.hammer.contention_count: 1917
vfs.hammer.zone_limit: 0
vfs.hammer.count_io_running_write: 0
vfs.hammer.count_io_locked: 2
vfs.hammer.count_io_running_read: 1
vfs.hammer.count_reservations: 1179
vfs.hammer.count_refedbufs: 17
vfs.hammer.count_dirtybufs: 66
vfs.hammer.count_nodes: 28
vfs.hammer.count_buffers: 254
vfs.hammer.count_volumes: 1
vfs.hammer.count_record_datas: 4032
vfs.hammer.count_records: 605308
vfs.hammer.count_reclaiming: 1
vfs.hammer.count_iqueued: 4934
vfs.hammer.count_inodes: 33722
vfs.hammer.limit_iqueued: 26846
vfs.hammer.limit_recs: 320000
vfs.hammer.limit_irecs: 102400
vfs.hammer.limit_dirtybufs: 1610
---
vfs.hammer.contention_count: 1917
vfs.hammer.zone_limit: 0
vfs.hammer.count_io_running_write: 0
vfs.hammer.count_io_locked: 2
vfs.hammer.count_io_running_read: 1
vfs.hammer.count_reservations: 1186
vfs.hammer.count_refedbufs: 17
vfs.hammer.count_dirtybufs: 66
vfs.hammer.count_nodes: 28
vfs.hammer.count_buffers: 254
vfs.hammer.count_volumes: 1
vfs.hammer.count_record_datas: 4032
vfs.hammer.count_records: 608736
vfs.hammer.count_reclaiming: 1
vfs.hammer.count_iqueued: 4934
vfs.hammer.count_inodes: 33721
vfs.hammer.limit_iqueued: 26846
vfs.hammer.limit_recs: 320000
vfs.hammer.limit_irecs: 102400
vfs.hammer.limit_dirtybufs: 1610

panic: hammer_mount: malloc limit exceeded
...

#8 Updated by dillon over 6 years ago

:> Ugh. Do me a favor. Do a 'sysctl vfs.hammer' every few seconds while
:> you are copying and cut and paste the last few as of when it crashes.
:> That will tell me which structure is blowing out the memory.
:
:Here it is. Each --- marks a one second pause.
:
:vfs.hammer.count_records: 601674
:vfs.hammer.limit_recs: 320000
:vfs.hammer.limit_irecs: 102400

Yow. 600,000+ records cached in memory. Reducing the irecs limit
should fix the problem. Try changing it to 20000.
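As a sketch of that tuning step: at runtime the knob is set (as root) with `sysctl vfs.hammer.limit_irecs=20000`; to persist it across reboots the same line can go in /etc/sysctl.conf as a config fragment (assuming the stock rc handling of that file):

```
# /etc/sysctl.conf -- lower HAMMER's in-memory inode-record limit
vfs.hammer.limit_irecs=20000
```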

-Matt

#9 Updated by ftigeot over 6 years ago

Done. Panic still occurs, albeit after a longer time (about 30 minutes).
This time --- marks a 2 second pause.

vfs.hammer.contention_count: 2066
vfs.hammer.zone_limit: 0
vfs.hammer.count_io_running_write: 0
vfs.hammer.count_io_locked: 1
vfs.hammer.count_io_running_read: 1
vfs.hammer.count_reservations: 384
vfs.hammer.count_refedbufs: 9
vfs.hammer.count_dirtybufs: 90
vfs.hammer.count_nodes: 10
vfs.hammer.count_buffers: 195
vfs.hammer.count_volumes: 1
vfs.hammer.count_record_datas: 29
vfs.hammer.count_records: 195614
vfs.hammer.count_reclaiming: 0
vfs.hammer.count_iqueued: 18
vfs.hammer.count_inodes: 39332
vfs.hammer.limit_iqueued: 26846
vfs.hammer.limit_recs: 320000
vfs.hammer.limit_irecs: 20000
vfs.hammer.limit_dirtybufs: 1610
---
vfs.hammer.contention_count: 2066
vfs.hammer.zone_limit: 0
vfs.hammer.count_io_running_write: 0
vfs.hammer.count_io_locked: 1
vfs.hammer.count_io_running_read: 1
vfs.hammer.count_reservations: 397
vfs.hammer.count_refedbufs: 9
vfs.hammer.count_dirtybufs: 90
vfs.hammer.count_nodes: 10
vfs.hammer.count_buffers: 195
vfs.hammer.count_volumes: 1
vfs.hammer.count_record_datas: 29
vfs.hammer.count_records: 202090
vfs.hammer.count_reclaiming: 0
vfs.hammer.count_iqueued: 18
vfs.hammer.count_inodes: 39332
vfs.hammer.limit_iqueued: 26846
vfs.hammer.limit_recs: 320000
vfs.hammer.limit_irecs: 20000
vfs.hammer.limit_dirtybufs: 1610
---
vfs.hammer.contention_count: 2066
vfs.hammer.zone_limit: 0
vfs.hammer.count_io_running_write: 0
vfs.hammer.count_io_locked: 1
vfs.hammer.count_io_running_read: 1
vfs.hammer.count_reservations: 418
vfs.hammer.count_refedbufs: 10
vfs.hammer.count_dirtybufs: 90
vfs.hammer.count_nodes: 11
vfs.hammer.count_buffers: 196
vfs.hammer.count_volumes: 1
vfs.hammer.count_record_datas: 31
vfs.hammer.count_records: 213006
vfs.hammer.count_reclaiming: 0
vfs.hammer.count_iqueued: 18
vfs.hammer.count_inodes: 39333
vfs.hammer.limit_iqueued: 26846
vfs.hammer.limit_recs: 320000
vfs.hammer.limit_irecs: 20000
vfs.hammer.limit_dirtybufs: 1610
---
panic: hammer_mount: malloc limit exceeded
...

#10 Updated by dillon over 6 years ago

:Done. Panic still occurs, albeit after a longer time (about 30 minutes).
:This time --- marks a 2 second pause.

Ok, clearly I need to work on implementing limit_recs, which is a real
hack right now. That isn't my biggest concern at the moment so it may
be a week before I fix it.

My biggest concern right now is the machine lockup that was reported.
I thought I had fixed it but I can still reproduce it on my test box,
and since the box is locked up and I can't break into the debugger
finding the bug is going to be hard.

What I am doing to reproduce the lockup is (on a 2xCPU box), with
a HAMMER /usr/obj, is running three things in parallel:

window1: buildworld test

cd /usr/src
while (1)
    make buildworld -j 8
end

window2: fsx

cd /usr/obj
/tmp/fsx -S 123424 -c 100000 -l 4194304 -o 262144 -P /tmp/ test.fsx.dat

window3: blogbench

mkdir /usr/obj/bench1
blogbench --iterations=100 -d /usr/obj/bench1

Needless to say, doing these three things in parallel stresses the hell
out of the machine. I don't even know if the bug is related to HAMMER
yet or not.

There is one more panic which occurs, related to direct-writes, but since
it panics and I get a core dump I expect to get it fixed fairly quickly.
The problem I really want to solve is the machine lock-up.

-Matt

#11 Updated by dillon over 5 years ago

Should now be generally fixed.
