Bug #1309
closednew lwbuf api and mpsafe sf_bufs
0%
Description
The basic thought behind the new lwbuf api is that sf_buf's have been
abused more and more as time has passed for purposes that don't
necessarily match the original intent. Most uses of sf_bufs beyond
sendfile are exceedingly short-lived, short enough that they would be
better served by cpu-local caches of kvm space, even if that means a
bit of overlap now and again. lwbuf attempts to match the semantics of
xio, with one addition as suggested by Simon, a sanity check and tlb
invalidation in lwbuf_kva if the lwbuf strays to some other cpu.
Currently lwbuf allocates a page of kvm at a time on objcache object
construction and stores the kvm with the object in objcache when it is
returned to the cache.
sf_bufs, which are still most applicable to sendfile, have been
simplified on top of the lwbuf api, converted to objcache and the
global hash has been protected with a spinlock.
XIO and exec have been modified to use lwbuf's.
Please comment/hate/review/test
Thanks,
Sam
Updated by aoiko over 15 years ago
No time to comment atm, but a few questions to get a discussion started:
a) does this make any measurable difference in microbenchmarks?
b) have you tested it on mp?
c) is the extra complexity worth it? (just glanced at the patch, looks
simple enough to me)
d) do people think the extra kva usage is going to be an issue?
Hopefully I'll have more to ask tomorrow.
Aggelos
Updated by sjg over 15 years ago
On Mon, Mar 9, 2009 at 12:15 PM, Aggelos Economopoulos
<aoiko@cc.ece.ntua.gr> wrote:
This isn't the easiest thing to microbenchmark, but see the bottom of
this mail for some timings against the exec path.
I have now.
IMO this simplifies the code. sf_buf_alloc() in particular was
somewhat confusing internally, keeping used items on the freelist,
etc. I think ideally, assuming this goes in, all consumers of sf_*
would switch to being lwbuf consumers, and sf_* could move back into
uipc_syscalls.c (or the whole sendfile mess could be moved into
uipc_sendfile.c).
In terms of complexity, I had thought about using a per-cpu hashtable
for lwbuf, but upon consideration I didn't think it was worth the
added complexity or memory consumption. A middle ground might be a
simple open hashing scheme against small tables, but each lwbuf would
still have to track the cpu that owned it and ipi a deallocation if it
wandered to another cpu. This would likely be effective on
microbenchmarks but I am not so sure about real-world. You would be
trading the costs of pmap_kenter_quick() for a hash lookup, and while
the hash lookup is going to be faster that may be outweighed by the
extra overhead elsewhere (free) and in most cases (not
microbenchmarks) what you seek isn't going to be there anyway.
Currently, sf_bufs when used in contexts apart from sendfile(2) are
very short-lived. They are typically allocated, used and then free'd
immediately. If one wanted to do something more complicated I would
suggest never freeing the lwbufs except under memory pressure or in a
LRU fashion on allocation. Sharing the sf_bufs makes a lot of sense
for sendfile(2) itself, because webservers, ftp servers, etc. that
make most effective use of it typically have a fairly small common
working dataset.
I would suggest that in most cases this will use less kva. Not
changing the defaults you have 6656 pages dedicated to sf_bufs w/ 2GB
of memory. Typically I would expect you would see no more than a
couple of hundred pages in use unless you are using sendfile(2).
#!/bin/sh
if [ "$1" = "sub" ]
then
for i in `seq 3000`
do
sh &
done
wait
exit
fi
for i in `seq 10`
do
time $0 sub
done
SMP on T5450 (1.66GHz Core2Duo)
Unpatched
12.12 real 21.14 user 1.59 sys
12.14 real 21.42 user 1.48 sys
12.12 real 21.34 user 1.52 sys
12.13 real 21.06 user 1.71 sys
12.16 real 21.31 user 1.61 sys
12.14 real 21.14 user 1.57 sys
12.16 real 21.37 user 1.72 sys
12.17 real 21.29 user 1.68 sys
12.15 real 21.31 user 1.65 sys
12.19 real 21.07 user 1.61 sys
Patched
12.09 real 21.37 user 1.45 sys
12.07 real 21.27 user 1.51 sys
12.08 real 21.17 user 1.56 sys
12.11 real 21.31 user 1.51 sys
12.11 real 21.44 user 1.54 sys
12.10 real 21.33 user 1.49 sys
12.10 real 21.18 user 1.70 sys
12.12 real 21.45 user 1.53 sys
12.12 real 21.37 user 1.64 sys
12.14 real 21.32 user 1.64 sys
UP on T5450
Unpatched
22.33 real 20.79 user 0.81 sys
22.36 real 20.87 user 0.82 sys
22.35 real 20.57 user 0.87 sys
22.38 real 20.71 user 0.88 sys
22.39 real 20.82 user 0.84 sys
22.37 real 20.81 user 0.89 sys
22.38 real 20.84 user 0.80 sys
22.40 real 20.78 user 0.87 sys
22.40 real 20.71 user 0.90 sys
22.40 real 20.82 user 0.81 sys
Patched
22.34 real 20.75 user 0.82 sys
22.34 real 20.78 user 0.78 sys
22.38 real 20.75 user 0.89 sys
22.37 real 20.82 user 0.75 sys
22.38 real 20.91 user 0.83 sys
22.38 real 20.85 user 0.76 sys
22.40 real 20.82 user 0.95 sys
22.39 real 20.85 user 0.86 sys
22.40 real 20.85 user 0.82 sys
22.39 real 20.90 user 0.85 sys
Updated by dillon over 15 years ago
:http://gitweb.dragonflybsd.org/~sjg/dragonfly.git/commitdiff/caac5766643042650328f83531dc8fd767df5f60
:
:The basic thought behind the new lwbuf api is that sf_buf's have been
:abused more and more as time has passed for purposes that don't
:necessarily match the original intent. Most uses of sf_bufs beyond
:sendfile are exceedingly short-lived, short enough that they would be
:better served by cpu-local caches of kvm space, even if that means a
:bit of overlap now and again. lwbuf attempts to match the semantics of
:xio, with one addition as suggested by Simon, a sanity check and tlb
:invalidation in lwbuf_kva if the lwbuf strays to some other cpu.
:Currently lwbuf allocates a page of kvm at a time on objcache object
:construction and stores the kvm with the object in objcache when it is
:returned to the cache.
:
:sf_bufs, which are still most applicable to sendfile, have been
:simplified on top of the lwbuf api, converted to objcache and the
:global hash has been protected with a spinlock.
:
:XIO and exec have been modified to use lwbuf's.
:
:Please comment/hate/review/test
:
:Thanks,
:Sam
Hmm. If I understand this correctly you are removing SFB_CPUPRIVATE
from the SFBUF API and splitting the entire set of API procedures out
for private-cpu use into a new LWBUF API?
It looks like you moved the cpumask mmu syncing to lwbuf_kva(),
but I think I see a bug in lwbuf_alloc().
From my reading of the code lwbuf_alloc() is not able to reuse
the KVA mappings for previously cached LWBUF's because it is
unconditionally changing the vm_page (lwb->m = m; pmap_kenter_quick...).
Because of the pmap change, lwb->cpumask would have to be set to
gd->gd_cpumask, NOT or'd with the previous mask.
i.e. the bug is:
lwb->cpumask |= gd->gd_cpumask;
It needs to be:
lwb->cpumask = gd->gd_cpumask;
If you want to separate out private mappings like that I think the
LWBUFs have to cached just like they are with SFBUFs. I like the
use of the objcache but I think the LWBUF code needs to retain
a sysctl-programmable number of lwb elements in a RBTREE and try to
reuse the mappings. i.e. it needs to be a bit more sophisticated
then your current design.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by corecode over 15 years ago
Why do these bufs have to go into an rbtree? indexed by the page? is the
same page mapped so often?
cheers
simon
Updated by sjg over 15 years ago
On Tue, Mar 10, 2009 at 5:28 PM, Matthew Dillon
<dillon@apollo.backplane.com> wrote:
Hmm. If I understand this correctly you are removing SFB_CPUPRIVATE
from the SFBUF API and splitting the entire set of API procedures out
for private-cpu use into a new LWBUF API?It looks like you moved the cpumask mmu syncing to lwbuf_kva(),
but I think I see a bug in lwbuf_alloc().From my reading of the code lwbuf_alloc() is not able to reuse
the KVA mappings for previously cached LWBUF's because it is
unconditionally changing the vm_page (lwb->m = m; pmap_kenter_quick...).
Because of the pmap change, lwb->cpumask would have to be set to
gd->gd_cpumask, NOT or'd with the previous mask.i.e. the bug is:
lwb->cpumask |= gd->gd_cpumask;
It needs to be:
lwb->cpumask = gd->gd_cpumask;
Thanks.
If implementing caching at this level would it make sense to still
grab a page of kva at a time and cache forever, looking through our
hash/tree in an attempt to reuse, then looking at the end of an
overlapping LRU queue to see if we have one with a refcnt of 0, and
acquiring more resources if all else fails up to our maximum limit?
With some modest pre-allocation to avoid any initial sloshing.
With the duration that these allocations are held I am not convinced
that caching buys us anything unless we cache past the free.
Sam
Updated by dillon over 15 years ago
:If implementing caching at this level would it make sense to still
:grab a page of kva at a time and cache forever, looking through our
:hash/tree in an attempt to reuse, then looking at the end of an
:overlapping LRU queue to see if we have one with a refcnt of 0, and
:acquiring more resources if all else fails up to our maximum limit?
:With some modest pre-allocation to avoid any initial sloshing.
:
:With the duration that these allocations are held I am not convinced
:that caching buys us anything unless we cache past the free.
:
:Sam
Ok, I'll just reiterate some of what we talked about on IRC so
people reading the thread don't get confused by our skipping around.
Basically the original SFBUF system caches freed mappings which may
be reused later. The lwbuf API loses that ability. What I do like
about the lwbuf code is that it dynamically allocates the lwbufs.
The sfbufs are statically allocated and thus have scaling issues
(e.g. the example Samuel gave on IRC is when one running many parallel
sendfile()'s). I would like to see the SFBUF code use a more
dynamic model and a sysctl which sets the maximum number of excess
sfbufs instead of the maximum number of sfbufs.
The other part of this equation is how to optimize for MP operation.
Initially we should just use the same global index model that we
use now, though perhaps using the newly available atomic ops to
control the ref count and cpumask. As a second step we can figure
out a better model.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by sjg over 15 years ago
On Tue, Mar 10, 2009 at 9:35 PM, Matthew Dillon
<dillon@apollo.backplane.com> wrote:
:If implementing caching at this level would it make sense to still
:grab a page of kva at a time and cache forever, looking through our
:hash/tree in an attempt to reuse, then looking at the end of an
:overlapping LRU queue to see if we have one with a refcnt of 0, and
:acquiring more resources if all else fails up to our maximum limit?
:With some modest pre-allocation to avoid any initial sloshing.
:
:With the duration that these allocations are held I am not convinced
:that caching buys us anything unless we cache past the free.
:
:SamOk, I'll just reiterate some of what we talked about on IRC so
people reading the thread don't get confused by our skipping around.Basically the original SFBUF system caches freed mappings which may
be reused later. The lwbuf API loses that ability. What I do like
about the lwbuf code is that it dynamically allocates the lwbufs.The sfbufs are statically allocated and thus have scaling issues
(e.g. the example Samuel gave on IRC is when one running many parallel
sendfile()'s). I would like to see the SFBUF code use a more
dynamic model and a sysctl which sets the maximum number of excess
sfbufs instead of the maximum number of sfbufs.The other part of this equation is how to optimize for MP operation.
Initially we should just use the same global index model that we
use now, though perhaps using the newly available atomic ops to
control the ref count and cpumask. As a second step we can figure
out a better model.-Matt
Matthew Dillon
<dillon@backplane.com>
I had another thought on this (Simon-inspired), let me see if I can illiterate.
struct lwbuf_pcpu {
struct vm_page *m;
vm_offset_t kva;
int refcnt;
};
lwbuf_pcpu structures sit in a pcpu hash/rbtree
struct lwbuf {
struct vm_page *m;
cpumask_t cpumask;
};
lwbuf's kept in objcache, returned to alloc caller
lwbuf_alloc:
create/reuse a lwbuf_pcpu setting refcnt to 1, or bump existing
lwbuf_pcpu refcnt, set lwbuf cpumask to current cpu.
lwbuf_kva:
check lwbuf cpumask and if it is on our cpu we know there is a
lwbuf_pcpu there and held, look it up and return the kva. otherwise
re-use lwbuf_alloc code to create a held lwbuf on the current cpu and
return the kva.
lwbuf_free:
decref on current cpu if it is in our cpumask, if other cpu's
exist in the mask propagate the decref via passive ipi.
kva could be shared and mapped differently on each cpu, debugging
could get interesting in this case.
All other consumers beyond sendfile proper can use this api.
sf_buf continues to exist as now serving only sendfile proper, made
dynamic and spinlock protected as Dillon explained.
Thoughts?
Sam
Updated by sjg over 15 years ago
I thought I would ask this before laying down any code, you (Dillon)
mentioned putting this into an rbtree previously as opposed to the
existing hash. Even if we do a variable hash (maybe just one that
grows, not shrinks?) amortized costs should be lower than an rbtree.
Is there any rationale as to why an rbtree would be preferrable in
this case apart from simply being simpler in implementation compared
to a variable hash?
Sam
Updated by dillon over 15 years ago
:
:> =A0 =A0The sfbufs are statically allocated and thus have scaling issues
:> =A0 =A0(e.g. the example Samuel gave on IRC is when one running many para=
:llel
:> =A0 =A0sendfile()'s). =A0 I would like to see the SFBUF code use a more
:> =A0 =A0dynamic model and a sysctl which sets the maximum number of exces=
:s
:> =A0 =A0sfbufs instead of the maximum number of sfbufs.
:
:I thought I would ask this before laying down any code, you (Dillon)
:mentioned putting this into an rbtree previously as opposed to the
:existing hash. Even if we do a variable hash (maybe just one that
:grows, not shrinks?) amortized costs should be lower than an rbtree.
:Is there any rationale as to why an rbtree would be preferrable in
:this case apart from simply being simpler in implementation compared
:to a variable hash?
:
:Sam
No, not particularly. A hash will probably work just as well, it just
doesn't scale. I've found that even though binary trees have longer
access times they also require less code maintainance and deal with
scaling cases better.
Hashes do have one undesireable effect. If the hash is large enough
then accesses which do not translate to any sort of locality of reference
within the hash table can really mess up the cpu's L1 and L2 cache.
-Matt
Matthew Dillon
<dillon@backplane.com>