Bug #1513

Postfix kqueue(2) support broken

Added by hasso about 5 years ago. Updated almost 5 years ago.

Status: Closed
Start date:
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

At some point during the last few months (I can't give exact dates, sorry)
something broke postfix when it uses kqueue(2) (the default on DragonFly).
I've used postfix for ages on my machines; now it starts up and at best
receives a few mails, but then stops responding. I can telnet to port 25,
but postfix doesn't respond.

$ ps axl | grep master
0 27129 1 0 152 0 4908 1756 kqread ILs ?? 0:00.01 /usr/pkg/libexec/postfix/master
$

Overall, kqueue(2) seems to be extremely fragile in DragonFly. Many
packages use it, probably just by autodetecting it, but some packages have
been known for ages to be broken on DragonFly when built with kqueue(2)
support. sysutils/dbus is certainly the most notable one: its kqueue(2)
support works on every BSD except DragonFly.

History

#1 Updated by dillon about 5 years ago

Hasso, could you generate a kernel panic and kernel core on
your machine while postfix is stuck in this state? And upload
it to leaf?

I have an idea what might be wrong but I need a kernel core (and
the related kernel binary of course) to track it down and verify
the issue. The thing is the race that I see is really tiny and
shouldn't regularly affect something like postfix, so I'm not
sure if I'm looking at the same problem that you are reporting.

-Matt
Matthew Dillon

#2 Updated by dillon about 5 years ago

Also verify that the problem occurs w/ the latest master branch
(or 2.4.0 release). Some serious pipe(2) bugs were fixed around
September 8th in master which would also account for numerous
stalling issues.

-Matt
Matthew Dillon

#3 Updated by polachok about 5 years ago

I think I have a related bug with rtorrent. If libtorrent is compiled with
kqueue, rtorrent fails within 10 seconds of starting.
...
46247 rtorrent CALL kevent(0x3,0x28925000,0x400,0,0,0)
46247 rtorrent RET kevent -1 errno 9 Bad file descriptor
...
I can put the ktrace log somewhere if it's useful.
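
For readers following the trace, here is a minimal sketch that maps the traced arguments onto the kevent(2) parameter order (kq, changelist, nchanges, eventlist, nevents, timeout). The helper name decode_kevent_call is made up for illustration and is not part of any tool mentioned in this thread.

```python
import re

# Parameter order of kevent(2).
KEVENT_ARGS = ("kq", "changelist", "nchanges", "eventlist", "nevents", "timeout")


def decode_kevent_call(line):
    """Turn a kdump 'CALL kevent(...)' line into a dict of named integers."""
    match = re.search(r"CALL\s+kevent\(([^)]*)\)", line)
    if match is None:
        raise ValueError("not a kevent CALL line")
    # int(tok, 0) accepts both the hex (0x400) and plain (0) forms kdump prints.
    values = [int(tok, 0) for tok in match.group(1).split(",")]
    if len(values) != len(KEVENT_ARGS):
        raise ValueError("unexpected argument count")
    return dict(zip(KEVENT_ARGS, values))


call = decode_kevent_call("46247 rtorrent CALL kevent(0x3,0x28925000,0x400,0,0,0)")
for name in KEVENT_ARGS:
    print(name, hex(call[name]))
```

Read this way, the call submits 0x400 (1024) changes on descriptor 3 and passes a null event list with nevents set to 0.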

#4 Updated by hasso about 5 years ago

Alexander Polakov (via DragonFly issue tracker) wrote:
> I think I have a related bug with rtorrent. If libtorrent is compiled
> with kqueue, rtorrent fails in 10 seconds after start.
> ...
> 46247 rtorrent CALL kevent(0x3,0x28925000,0x400,0,0,0)
> 46247 rtorrent RET kevent -1 errno 9 Bad file descriptor
> ...

I think it is, yes. Postfix also stops responding after about 10 seconds.
I won't be able to generate a core before next week, though.

#5 Updated by polachok about 5 years ago

Okay, here it is: http://leaf.dragonflybsd.org/~polachok/kdump. I can't reproduce
the problem with a small number of torrents.

#6 Updated by dillon about 5 years ago

:Alexander Polakov added the comment:
:
:Okay, here it is: http://leaf.dragonflybsd.org/~polachok/kdump. I can't reproduce
:the problem with small number of torrents.

It could be hitting a file descriptor limit. I noticed from the
ktrace output that it is working on very high-numbered descriptors,
e.g. in the 2000+ range! This seems excessive even for rtorrent.

If it is a file descriptor limit it will be a different issue than
the postfix kqueue failure.

-Matt
Matthew Dillon
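
A quick way to check the descriptor-limit theory from inside a process is to compare the descriptor numbers seen in the trace with the per-process limit. This is only a sketch: it assumes RLIMIT_NOFILE is the limit that matters here, and the helper names are made up for illustration.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limit on open descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fits_under_limit(fd, soft):
    """Newly allocated descriptor numbers must be below the soft limit."""
    return soft == resource.RLIM_INFINITY or fd < soft


soft, hard = nofile_limits()
print("soft limit:", soft, "hard limit:", hard)
print("descriptor 2000 allocatable:", fits_under_limit(2000, soft))
```

If descriptors in the 2000+ range fit under the soft limit, the limit itself is unlikely to be what kevent() is complaining about.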

#7 Updated by polachok about 5 years ago

>If it is a file descriptor limit it will be a different issue than
>the postfix kqueue failure.
It works fine with kqueue disabled ("using select polling") with the same number
of torrents. So select() is not affected by the file descriptor limit?

#8 Updated by dillon about 5 years ago

:Alexander Polakov added the comment:
:
:>If it is a file descriptor limit it will be a different issue than
:>the postfix kqueue failure.
:It works fine with kqueue disabled ("using select polling") with the same number
:of torrents. So select() is not affected by the file descriptor limit?

No, but select() probably wouldn't return EBADF if a bad descriptor
were specified in the bitmap.

Did using kqueue w/rtorrent work in an earlier kernel? It still feels
like it's a different problem but the only way to really tell would be
to instrument the kernel with some kprintf()'s for kevent() to track
down which element in the event array being passed is causing the
error.

kevent() is being called with a change list but no event
return list, so any error in the change list will cause the call to
fail with that error (instead of recording it in a returned event
list). Presumably some descriptor within that change list has been
closed.

-Matt
Matthew Dillon
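
The error-reporting difference described in comment #8 can be seen from userland with a small experiment. The sketch below uses Python's select.kqueue wrapper rather than rtorrent's own code, and the function name is made up for illustration. It submits a change for an already-closed descriptor twice: once with no room for returned events, once with room for one. It returns None on systems without kqueue (e.g. Linux).

```python
import errno
import os
import select


def kevent_error_reporting():
    """Return (errno of the failed call, per-entry error events), or None."""
    if not hasattr(select, "kqueue"):
        return None  # kqueue is only available on the BSDs and macOS
    # Create the kqueue first so its descriptor cannot reuse the closed number.
    kq = select.kqueue()
    r, w = os.pipe()
    os.close(r)
    os.close(w)
    change = select.kevent(r, select.KQ_FILTER_READ, select.KQ_EV_ADD)
    try:
        # No event list (max_events=0): the whole call fails with the error.
        try:
            kq.control([change], 0, 0)
            call_errno = None
        except OSError as exc:
            call_errno = exc.errno
        # Room for one event: the error comes back per entry, flagged with
        # EV_ERROR and with the errno value stored in the data field.
        events = kq.control([change], 1, 0)
        per_entry = [
            (ev.ident, bool(ev.flags & select.KQ_EV_ERROR), ev.data)
            for ev in events
        ]
    finally:
        kq.close()
    return call_errno, per_entry


result = kevent_error_reporting()
print(result)
if result is not None:
    print("whole-call errno is EBADF:", result[0] == errno.EBADF)
```

With an event list supplied, the ident of the failing entry is returned, which is roughly the information the kprintf() instrumentation is meant to dig out on the kernel side.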

#9 Updated by dillon about 5 years ago

Another thing to test w/ kqueue. If this is a timing race against
sockets then try setting sysctl net.inet.tcp.mpsafe_thread=0.

If that fixes the problem it narrows down the list of possibilities.

-Matt

#10 Updated by polachok about 5 years ago

>If this is a timing race against
>sockets then try setting sysctl net.inet.tcp.mpsafe_thread=0.
Doesn't help.

>If it is a file descriptor limit it will be a different issue than
>the postfix kqueue failure.
I changed kern.maxfilesperproc to 5000000, still fails.

#11 Updated by dillon about 5 years ago

:Alexander Polakov added the comment:
:
:>If this is a timing race against
:>sockets then try setting sysctl net.inet.tcp.mpsafe_thread=0.
:Doesn't help.
:
:>If it is a file descriptor limit it will be a different issue than
:>the postfix kqueue failure.
:I changed kern.maxfilesperproc to 5000000, still fails.

Try this patch. All it does is add kprintf()'s in various failure
paths, hopefully it will tell us why the kevent call is failing.

fetch http://apollo.backplane.com/DFlyMisc/kqueue01.patch

-Matt
Matthew Dillon

#12 Updated by aoiko almost 5 years ago

Hasso Tepper wrote:
[...]
> At all kqueue(2) seems to be extremely fragile in DragonFly. There are
> many packages using it probably just via autodetecting it, but there is
> packages known to be broken in DragonFly for ages while built with
> kqueue(2) support. Sysutils/dbus is certainly most notable such one -
> kqueue(2) support works on every BSD except DragonFly.

Do those packages use pipes? Does this commit help with the postfix
issue at least?

commit d9dd0db189df92875f7bde80747910ad551eabdd
Author: Matthew Dillon
Date: Mon Sep 21 23:17:14 2009 -0700

kernel - Fix kqueue and SIGIO operation on pipes

* pipe reads and writes were not notifying kqueue and SIGIO consumers
due to an incorrect conditional which only tested for select/poll
consumers.
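
A userland check for the behavior this commit describes could look like the sketch below: register the read end of a pipe with kqueue, write to the other end, and see whether a read event is delivered. It uses Python's select.kqueue wrapper, the function name is made up for illustration, and it returns None on systems without kqueue.

```python
import os
import select


def pipe_read_event():
    """Return [(ident, readable_bytes)] reported for a pipe, or None."""
    if not hasattr(select, "kqueue"):
        return None  # kqueue is only available on the BSDs and macOS
    kq = select.kqueue()
    r, w = os.pipe()
    try:
        # Register interest in readability of the pipe's read end.
        kq.control([select.kevent(r, select.KQ_FILTER_READ, select.KQ_EV_ADD)], 0, 0)
        os.write(w, b"ping")
        # Wait up to one second for the kernel to report the write.
        events = kq.control(None, 1, 1.0)
        return [(ev.ident, ev.data) for ev in events]
    finally:
        os.close(r)
        os.close(w)
        kq.close()


result = pipe_read_event()
print(result)
```

On a kernel where pipe writes fail to notify kqueue consumers, the second control() call would be expected to time out and return an empty list, which matches a daemon sitting in kqread forever.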

#13 Updated by hasso almost 5 years ago

Yes, it's fixed now. Thanks, Matt. If there is still a problem with rtorrent,
it's a different one.

#14 Updated by dillon almost 5 years ago

:Hasso Tepper added the comment:
:
:Yes, it's fixed now. Thanks, Matt. If there is still problem with rtorrent,
:it's a different one.
:
:----------
:status: chatting -> resolved

Ok, excellent! I think we've fixed the biggest issues in 2.4.0.
USB is still a sticking point but generally speaking I think we
will be able to roll 2.4.1 this coming weekend.

-Matt
Matthew Dillon
