the evil interrupt stats monster is still around!
while doing the pkgsrc bulk build, I get weird numbers in top. I am pretty sure this has something to do with the multitude of shell scripts being used as wrappers in pkgsrc. command execution is also slower than I'd expect (for example wrapper execution).
load averages: 8.38, 7.96, 5.53 up 0+03:48:32 22:34:35
151 processes: 151 running
CPU0 states: 13.7% user, 0.0% nice, 27.5% system, 0.0% interrupt, 58.8% idle
CPU1 states: 21.4% user, 0.0% nice, 28.3% system, 0.0% interrupt, 50.4% idle
CPU2 states: 22.1% user, 0.0% nice, 22.9% system, 0.0% interrupt, 55.0% idle
CPU3 states: 34.7% user, 0.0% nice, 11.1% system, 42.0% interrupt, 12.2% idle
Mem: 71M Active, 509M Inact, 268M Wired, 199M Buf, 2665M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
36155 root 227 0 3060K 2736K wait 2 0:00 36.82% 8.15% bmake
37456 root 229 0 828K 704K RUN 1 0:00 16.14% 2.93% sh
37677 root 197 0 1576K 1216K wait 2 0:00 14.53% 2.64% bmake
40985 root 182 0 824K 700K RUN 3 0:00 40.00% 1.95% sh
41135 root 193 0 820K 696K wait 1 0:00 39.00% 1.90% sh
37807 root 172 0 828K 704K wait 0 0:00 9.69% 1.76% sh
37310 root 157 0 792K 668K wait 0 0:00 9.15% 1.66% sh
37961 root 210 0 1600K 1240K wait 0 0:00 10.85% 1.51% bmake
40898 root 157 0 784K 660K wait 2 0:00 30.00% 1.46% sh
37563 root 227 0 1464K 1104K wait 1 0:00 5.92% 1.07% bmake
37391 root 168 0 788K 664K wait 0 0:00 4.84% 0.88% sh
37316 root 162 0 1668K 1296K wait 0 0:00 4.57% 0.83% bmake
40213 root 182 0 1216K 1092K wait 1 0:00 16.00% 0.78% sh
39756 root 229 0 824K 700K wait 0 0:00 2.05% 0.20% sh
23163 root 163 0 1444K 1084K wait 2 0:00 0.26% 0.15% bmake
679 corecode 152 0 5948K 3116K select 0 0:05 0.00% 0.00% sshd
87928 corecode 152 0 2340K 1632K RUN 0 0:03 0.00% 0.00% top
5266 root 152 0 6548K 6120K kqread 0 0:02 0.00% 0.00% pbulk-build
2346 root 152 0 2384K 2052K select 2 0:01 0.00% 0.00% screen-4.0.3
672 corecode 152 0 5948K 3112K select 2 0:01 0.00% 0.00% sshd
6357 corecode 152 0 5948K 3124K select 0 0:00 0.00% 0.00% sshd
83074 root 157 0 3236K 2908K wait 0 0:00 0.00% 0.00% bmake
61160 root 152 0 2340K 1632K CPU0 0 0:00 0.00% 0.00% top
84410 root 192 0 3152K 2828K wait 0 0:00 0.00% 0.00% bmake
60999 root 157 0 3388K 3072K wait 1 0:00 0.00% 0.00% bmake
that's what systat -vm tells:
11 users Load 9.22 8.60 6.50 Jul 26 22:39
Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER
Tot Share Tot Share Free in out in out
Act 28560 5260 67456 9252 2717488 count
All 880132 8524 1793432 13480 pages
13651 zfod Interrupts
Proc:r p d s w Csw Trp Sys Int Sof Flt 16467 cow 556 total
10 45 727964259438586 475 6638986 284684 wire 429 clk
70624 act 13 uhci0
21.6%Sys 12.8%Intr 30.8%User 0.0%Nice 34.8%Idl 524220 inact 69 atapci1
| | | | | | | | | | cache 11 em0
===========++++++>>>>>>>>>>>>>>>> 2717552 free ata0
Namei Name-cache Dir-cache 21446 prcfr 34 irq21
Calls hits % hits % react
81740 79364 97 189 0 pdwake
Disks ad4 ad6 acd0 cd0 pass0 md0 intrn
KB/t 0.00 36.23 0.00 0.00 0.00 0.00 204096 buf
tps 0 34 0 0 0 0 1533 dirtybuf
MB/s 0.00 1.21 0.00 0.00 0.00 0.00 231015 desiredvnodes
% busy 0 3 0 0 0 0 81881 numvnodes
#1 Updated by corecode over 8 years ago
Okay, joerg and me could speed pkgsrc building up a bit by telling pkgsrc to use the builtin for test and [.
However, look at this:
avg fork parent: 0.000148
avg fork child: 0.004091
(rough rough numbers, will be off if there are multiple processes running concurrently)
5ms seems to be a good estimate for fork performance. but mind you, that's the child which is taking longer. This means the parent returns *first*, then starts wait4'ing (usually, or blocks on a read or so), then some time passes and then *later* the child returns from fork. Seems wrong to me.
so much statistics for now, cheers
#2 Updated by corecode over 8 years ago
Okay, less rough numbers:
122991 104/s fork
avg fork parent: 0.000157
avg fork child: 0.002563
so if we assume that the parent always waits for the child, it'll take 0.267 seconds per second (in average) to fork? That means 1/4 of the time is just spent until the child start running.
the above numbers include vfork, these are without:
avg fork parent: 0.000144
avg fork child: 0.002556
same, basically. Maybe teaching sh(1) to use vfork could give some performance improvement?
#3 Updated by dillon over 8 years ago
:Okay, less rough numbers:
:122991 104/s fork
:avg fork parent: 0.000157
:avg fork child: 0.002563
:so if we assume that the parent always waits for the child, it'll take 0.267 seconds per second (in average) to fork? That means 1/4 of the time is just spent until the child start running.
:the above numbers include vfork, these are without:
:avg fork parent: 0.000144
:avg fork child: 0.002556
:same, basically. Maybe teaching sh(1) to use vfork could give some performance improvement?
There is always something running somewhere when you fork. Either the
child is running or the parent is running. Either way both have to run
to their completion before the next program can be execed when running
a shell script since the script serializes everything.
It tends to be more efficient to allow the parent to continue to run
(that is, to return from the fork) because there are fewer context
wait - <<< 2 extra context switches
I would rather not try to optimize programs to use vfork(). There
just isn't a whole lot of difference between fork and vfork.
Most of the time spent by the child is not waiting for the parent
to switch it in, but instead is being spent in execl(). It may be
possible to speed up execution by making certain dynamic binaries
resident (with the 'resident' command). Everything in /bin is
statically linked alredy, though, and already execs as fast as it is
possible to exec a program. But perhaps pkgsrc is using certain
binaries internally that are dynamically linked. I suggest taking
a good look at what pkgsrc runs.
There is a fork test program in /usr/src/test/sysperf/fork1.c.
(make /tmp/fork1 from /usr/src/test/sysperf). I get around
154 uS for a complete fork/exit/wait sequence on my test box.
A time of 2 mS indicates that something else is going on.. either
other processes are running or whatever you are execing is taking
a long time to initialize (which would not be surprising if we are
talking about /bin/sh here).
The GNU configure script crap is just that... crap. In a typical
pkgsrc build that script will probably be run thousands of times,
doing the same tests with slow shell scripts over and over and over
and over and over and over again. It's really the stupidest gnu thing
I've ever seen in my life. It was stupid a decade ago, and its still
#4 Updated by corecode over 8 years ago
No, my point was that the child takes some time to even return from the fork, despite the parent already wait()ing. There is some gap involved where the system seemingly just schedules somebody else instead or so.
Ah, I see the point. This however does not explain why the child takes some time to be scheduled.
I think our vfork is not completely optimized. We could for instance make vfork not COW the vmspace, because the user is only allowed to call _exit and exec(). I don't think we are doing this right now.
That's actually not true:
Total runtime 1171.642231 sec
122991 104/s 0.002616 321.756458 fork_child
122991 104/s 0.000148 18.181783 fork_parent
112576 96/s 0.000235 26.489667 execve
239985 204/s 0.000080 19.174908 open
total exec time is not very high.
I'm not talking about fork/exec performance. This is from a build kdump, taking the CALL and RET times and calculating the duration of the syscall.
#6 Updated by corecode over 8 years ago
ktrace. These are real numbers, not a microbenchmark. There might be a little bit of load interfering, dunno. You can grab the dump from <http://chlamydia.fs.ei.tum.de/~corecode/unsorted/kdump.out.bz2>
#7 Updated by dillon over 8 years ago
:ktrace. These are real numbers, not a microbenchmark. There might be a little bit of load interfering, dunno. You can grab the dump from <http://chlamydia.fs.ei.tum.de/~corecode/unsorted/kdump.out.bz2>
Use kdump -R !!!
In anycase, it does look odd. But I can't reproduce the effect
when I ktrace a similar bmake from pkgsrc and there are cases
in your dump where you don't get large timeouts:
66255 sh 1185522939.738309 CALL fork
66255 sh 1185522939.738419 RET fork 66264/0x102d8
66255 sh 1185522939.738665 CALL getpgrp
66255 sh 1185522939.738676 RET getpgrp 5268/0x1494
66255 sh 1185522939.738730 CALL wait4(0xffffffff,0xbfbff318,0x2,0)
66264 sh 1185522939.738745 RET fork 0
<----------- 15 uS (326 uS from original
66264 sh 1185522939.739292 CALL open(0x280de1c8,0x601,0x1b6)
66264 sh 1185522939.739306 NAMI "/dev/null"
66264 sh 1185522939.739346 RET open 5
What it looks like is happening is that there are other processes
running that get the cpu after the fork/wait4, before the child
gets the cpu. Insofar as I can tell it looks like there is a cpu-bound
program running that is simply not making any system calls to be traced,
or is not being traced.
Are you running anything in parallel during this test or running
a parallel build of some sort?
#8 Updated by corecode over 8 years ago
heh. well, the absolute ones help me process the time automatically.
Yah, quite possible. However, why'd that only happen for for fork and not for other possibly blocking syscalls (or vfork)? I think we need some better, performance-counter based tracing tools, rather than plain kdump output.
I don't remember anymore, probably there was a build going on in parallel. But again, I'm not trying to benchmark, I'm trying to find a cause for the slow build speed.
#9 Updated by dillon over 8 years ago
:Yah, quite possible. However, why'd that only happen for for fork and not for other possibly blocking syscalls (or vfork)? I think we need some better, performance-counter based tracing tools, rather than plain kdump output.
There really isn't a whole lot of difference between fork and vfork,
but I noticed at the end of your dump that both fork and vfork had
much shorter delays then what I saw at the beginning of your dump.
:> Are you running anything in parallel during this test or running
:> a parallel build of some sort?
:I don't remember anymore, probably there was a build going on in parallel. But again, I'm not trying to benchmark, I'm trying to find a cause for the slow build speed.
Its called a script titled 'gnu configure'. Do a systat -vm 1 and look
at the name cache hit stats. Everything bmake, gcc, and especially the
gnu configure script do is extremely filesystem intensive. And in the
case of gnu configure, also mostly redundant.
#10 Updated by joerg over 8 years ago
When NetBSD implemented a real vfork back around NetBSD 1.4 or so, it
made difference in the percent range for a full buildworld. Basically,
you can avoid the duplication of the VM space *and* the COW faulting,
which sums up.
While a lot of bad things can be said about autoconf, it was the only
usable, semi-generic solution for over a decade. There is still only
*one* good alternative (and that is using C++...). The biggest complain
most maintainers on non-default (from the upstream POV) platform had in
all that time was that it is hard to use correctly. E.g. many tests can
be done a lot more intelligently than most configure scripts do. Given
that autoconf follows the "stand-alone" paradigm (as opposed to imake
for some parts of the problem domain), it actually implements it quite
#11 Updated by tuxillo over 8 years ago
- Description updated (diff)
- Status changed from New to In Progress
- Assignee deleted (
I've not seen that much intr% time during bulk builds, at least when I did some a couple months ago.
ftigeot/marino/dillon, can you please confirm this doesn't happen during poudriere builds now?
#13 Updated by tuxillo over 8 years ago
- Category changed from Other to Unverifiable
- Status changed from In Progress to Feedback
- Target version changed from 3.8.0 to Unverifiable
Moving to Unverifiable because:
- Many VM related work done to x64 at least.
- vfork/exec work was done by Matt in the end of 2013.
- We don't do pkgsrc bulkbuilds anymore but we do poudriere builds which are very fork-heavy and as far as I know this issue hasn't showed up.
Feel free to resurrect/provide feedback.