Bug #1416

dma(8): Fix race condition in multi-recipient delivery

Added by roe about 5 years ago. Updated about 5 years ago.

Status:          Closed
Priority:        Normal
Assignee:        -
% Done:          0%
Category:        -
Target version:  -
Start date:
Due date:

Description

When delivering mail to multiple recipients, dma(8) will fork for
each recipient and attempt parallel delivery. Separate delivery
processes for multiple recipients of the same message all use the
same FILE* on the same file descriptor. If a process loses
the CPU during the delivery process, another delivery process for
the same message will wreak havoc on the file descriptor. The
symptom is dma(8) logging a failed delivery and bouncing the
message for some or all of the recipients:

Jul 3 20:57:21 marvin dma[32640]: 15c.284010d0: remote delivery
failed:corrupted queue file

Messages may also be silently corrupted without dma(8) even noticing.

To reproduce:

$ dd if=/dev/urandom bs=1k count=100 | openssl base64 > /tmp/data
$ mail -s test < /tmp/data

Then compare what you received with /tmp/data.

The attached patch resolves the race condition by locking the
queue file during delivery.

Note that this is just a quick fix against message corruption.
I'm suspecting that a better fix would be to rewrite dma(8) to
fork a process per message, not per recipient, OR copy the spool
file per recipient and fork per spool file.

Disclaimer: Tested and fixed on FreeBSD 7.2, not DragonFlyBSD.

dma-lock-delivery.diff (1018 Bytes) roe, 07/03/2009 10:27 PM

0001-dma-prevent-races-from-sharing-fd-between-children.patch (1.33 KB) corecode, 07/09/2009 09:31 PM

0001-dma-prevent-races-from-sharing-fd-between-children.patch (1.33 KB) corecode, 07/09/2009 11:56 PM

0001-dma-prevent-races-from-sharing-fd-between-children.patch (1.33 KB) corecode, 07/09/2009 11:58 PM

d.c (433 Bytes) corecode, 07/10/2009 08:51 AM

0001-Fix-queue-file-seek-position-races-by-locking.patch (2.07 KB) roe, 07/10/2009 10:24 AM

History

#1 Updated by corecode about 5 years ago

Daniel Roethlisberger wrote:
> When delivering mail to multiple recipients, dma(8) will fork for
> each recipient and attempt parallel delivery. Separate delivery
> processes for multiple recipients of the same message all use the
> same FILE* on the same file descriptor. If a process loses
> the CPU during the delivery process, another delivery process for
> the same message will wreak havoc on the file descriptor. The
> symptom is dma(8) logging a failed delivery and bouncing the
> message for some or all of the recipients:
>
> Jul 3 20:57:21 marvin dma[32640]: 15c.284010d0: remote delivery
> failed:corrupted queue file

Thanks for your submission! I've taken a different approach, could you please test whether it works for you? (Untested patch).

thanks,
simon

#2 Updated by roe about 5 years ago

Simon 'corecode' Schubert <> 2009-07-09:
> Daniel Roethlisberger wrote:
> >When delivering mail to multiple recipients, dma(8) will fork for
> >each recipient and attempt parallel delivery. Separate delivery
> >processes for multiple recipients of the same message all use the
> >same FILE* on the same file descriptor. If a process loses
> >the CPU during the delivery process, another delivery process for
> >the same message will wreak havoc on the file descriptor. The
> >symptom is dma(8) logging a failed delivery and bouncing the
> >message for some or all of the recipients:
> >
> >Jul 3 20:57:21 marvin dma[32640]: 15c.284010d0: remote delivery
> >failed:corrupted queue file
>
> Thanks for your submission! I've taken a different approach, could you
> please test whether it works for you? (Untested patch).
>
> thanks,
> simon

This patch doesn't work for me: dma delivers fine to the first
recipient, but fails delivery for all other recipients, logging
an illegal seek to syslog:

Jul 10 01:15:52 marvin dma[45137]: e7.28401130: remote delivery
deferred: cannot seek: Illegal seek

#3 Updated by corecode about 5 years ago

Daniel Roethlisberger wrote:
> Simon 'corecode' Schuber <> 2009-07-09:
>> --- a/libexec/dma/dma.c
>> +++ b/libexec/dma/dma.c
>> @@ -484,6 +484,7 @@ go_background(struct queue *queue)
>> {
>> struct sigaction sa;
>> struct qitem *it;
>> + FILE *newqf;
>> pid_t pid;
>>
>> if (daemonize && daemon(0, 0) != 0) {
>> @@ -515,6 +516,17 @@ go_background(struct queue *queue)
>> *
>> * return and deliver mail
>> */
>> + /*
>> + * We have to prevent sharing of fds between children, so
>> + * we have to dup the queue fd.
>> + */
>> + newqf = fdopen(fileno(it->queuef), "r");
>
> Actually, fdopen() will not dup() the file descriptors, just
> create a new FILE* handle operating on the same fd.

Uh. I meant to add a dup() there, but I moved a lot of code, so I
forgot... What about the attached patch?

cheers
simon

#4 Updated by corecode about 5 years ago

Simon 'corecode' Schubert wrote:
> Uh. I meant to add a dup() there, but I moved a lot of code, so I
> forgot... What about the attached patch?

And of course I attached the wrong one. Here's the right one.

#5 Updated by roe about 5 years ago

[...]
> --- a/libexec/dma/dma.c
> +++ b/libexec/dma/dma.c
> @@ -484,6 +484,7 @@ go_background(struct queue *queue)
> {
> struct sigaction sa;
> struct qitem *it;
> + FILE *newqf;
> pid_t pid;
>
> if (daemonize && daemon(0, 0) != 0) {
> @@ -515,6 +516,17 @@ go_background(struct queue *queue)
> *
> * return and deliver mail
> */
> + /*
> + * We have to prevent sharing of fds between children, so
> + * we have to dup the queue fd.
> + */
> + newqf = fdopen(dup(fileno(it->queuef)), "r");
> + if (newqf == NULL) {
> + syslog(LOG_ERR, "can not dup queue fd: %m");
> + exit(1);
> + }
> + fclose(it->queuef);
> + it->queuef = newqf;
> return (it);
>
> default:

This is still not sufficient (logs below; of 3 recipients, 2 were
successful but with corrupted contents, 1 failed due to queue
file corruption).

To go down that route, I think you'd need to fully reopen the
file starting with open(). This opens another can of worms:
unlink() races. I guess it is easier to lock the queue file
instead of fully reopening and fixing the unlink() race. But
maybe I'm missing something.

Jul 10 10:41:30 marvin dma[52950]: d6.28401130: mail from=<> to=<>
Jul 10 10:41:30 marvin dma[52951]: d6.28401100: mail from=<> to=<>
Jul 10 10:41:30 marvin dma[52951]: d6.28401100: trying delivery
Jul 10 10:41:30 marvin dma[52951]: d6.28401100: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:30 marvin dma[52949]: d6.284010d0: mail from=<> to=<>
Jul 10 10:41:30 marvin dma[52949]: d6.284010d0: trying delivery
Jul 10 10:41:30 marvin dma[52949]: d6.284010d0: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:30 marvin dma[52950]: d6.28401130: trying delivery
Jul 10 10:41:30 marvin dma[52950]: d6.28401130: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:34 marvin dma[52951]: d6.28401100: SSL initialization successful
Jul 10 10:41:34 marvin dma[52951]: d6.28401100: Use SMTP authentication
Jul 10 10:41:37 marvin dma[52950]: d6.28401130: SSL initialization successful
Jul 10 10:41:37 marvin dma[52949]: d6.284010d0: SSL initialization successful
Jul 10 10:41:37 marvin dma[52950]: d6.28401130: Use SMTP authentication
Jul 10 10:41:37 marvin dma[52949]: d6.284010d0: Use SMTP authentication
Jul 10 10:41:45 marvin dma[52951]: d6.28401100: delivery successful
Jul 10 10:41:48 marvin dma[52949]: d6.284010d0: remote delivery failed:corrupted queue file
Jul 10 10:41:48 marvin dma[52949]: d6.284010d0: delivery failed, bouncing
Jul 10 10:41:48 marvin dma[52949]: d8.28401250: mail from=<> to=<roe>
Jul 10 10:41:48 marvin dma[52949]: d8.28401250: trying delivery
Jul 10 10:41:48 marvin dma[52949]: d8.28401250: delivery successful
Jul 10 10:41:57 marvin dma[52950]: d6.28401130: delivery successful

#6 Updated by corecode about 5 years ago

Matthew Dillon wrote:
> dup() doesn't solve the problem. A dup()'d descriptor still
> points to a common file pointer and thus a common seek position.
>
> Only open() will generate a new file pointer backing the file
> descriptor.

That's not true.

cheers
simon

#7 Updated by corecode about 5 years ago

Simon 'corecode' Schubert wrote:
> Matthew Dillon wrote:
>> dup() doesn't solve the problem. A dup()'d descriptor still
>> points to a common file pointer and thus a common seek position.
>>
>> Only open() will generate a new file pointer backing the file
>> descriptor.
>
> That's not true.

Nevermind, you're right. I'm too close to sleeping times, and stdio is
doing some heavy buffering.

#8 Updated by roe about 5 years ago

Daniel Roethlisberger <> 2009-07-10:
> I guess it is easier to lock the queue file instead of fully
> reopening and fixing the unlink() race.

I cleaned up my original hack, maybe this version is a tad more
convincing :)

#9 Updated by dillon about 5 years ago

:Uh. I meant to add a dup() there, but I moved a lot of code, so I
:forgot... What about the attached patch?
:
:cheers
: simon

dup() doesn't solve the problem. A dup()'d descriptor still
points to a common file pointer and thus a common seek position.

Only open() will generate a new file pointer backing the file
descriptor.

-Matt

#10 Updated by corecode about 5 years ago

Daniel Roethlisberger wrote:
> Daniel Roethlisberger <> 2009-07-10:
>> I guess it is easier to lock the queue file instead of fully
>> reopening and fixing the unlink() race.
>
> I cleaned up my original hack, maybe this version is a tad more
> convincing :)

This doesn't work for me. All processes are stuck in state lockf. I
think we can't reliably combine fcntl locks and flock locks.

cheers
simon

#11 Updated by roe about 5 years ago

Simon 'corecode' Schubert <> 2009-07-16:
>
> Simon 'corecode' Schubert <> added the comment:
>
> Daniel Roethlisberger wrote:
> > Daniel Roethlisberger <> 2009-07-10:
> >> I guess it is easier to lock the queue file instead of fully
> >> reopening and fixing the unlink() race.
> >
> > I cleaned up my original hack, maybe this version is a tad more
> > convincing :)
>
> This doesn't work for me. All processes are stuck in state lockf. I
> think we can't reliably combine fcntl locks and flock locks.

Is that a DragonFlyBSD issue? On FreeBSD they seem to be compatible
(both according to the manual page and practical testing). Using
flock to synchronize on the dup()'d / fork()'d queue file
descriptor doesn't work in this case due to the semantics of
flock.

In any case, I'm sorry that I cannot test / respond again for 3
weeks since I'm abroad, far away from keyboard.

#12 Updated by corecode about 5 years ago

Daniel Roethlisberger wrote:
> Simon 'corecode' Schubert <> 2009-07-16:
>> Simon 'corecode' Schubert <> added the comment:
>>
>> Daniel Roethlisberger wrote:
>>> Daniel Roethlisberger <> 2009-07-10:
>>>> I guess it is easier to lock the queue file instead of fully
>>>> reopening and fixing the unlink() race.
>>> I cleaned up my original hack, maybe this version is a tad more
>>> convincing :)
>> This doesn't work for me. All processes are stuck in state lockf. I
>> think we can't reliably combine fcntl locks and flock locks.
>
> Is that a DragonFlyBSD issue? On FreeBSD they seem to be compatible
> (both according to the manual page and practical testing). Using
> flock to synchronize on the dup()'d / fork()'d queue file
> descriptor doesn't work in this case due to the semantics of
> flock.
>
> In any case, I'm sorry that I cannot test / respond again for 3
> weeks since I'm abroad, far away from keyboard.

No problem, I fixed the issue by re-opening the queue file. Thanks for
your bug report!

cheers
simon

#13 Updated by dillon about 5 years ago

flock locks are effectively lockf locks but cover the ENTIRE range.
It is not a good idea to try to combine them. Use one or the other.

lockf locks (via fcntl) follow POSIX semantics, I believe, which means
that when you close() a descriptor, ANY LOCK HELD BY THAT PROCESS ON
THAT FILE will also be released, even if it was obtained by another
descriptor. This is badly broken, but it's what POSIX says and what we
have to do.

If you are fork()ing you want to be sure that you only have one file
descriptor open for the file that you intend to maintain a lock on.
If you open/close the file within that same process, the original lock
on the still-open descriptor will be lost (for fcntl-based lockf locks).

I believe flock() behaves more rationally.

-Matt

#14 Updated by corecode about 5 years ago

Matthew Dillon wrote:
> I believe flock() behaves more rationally.

Yah, I simply went and re-opened the file instead of using locks. I also
have a new queueing code in the works that will get rid of this problem
entirely.

cheers
simon

#15 Updated by dillon about 5 years ago

:Matthew Dillon wrote:
:> I believe flock() behaves more rationally.
:
:Yah, I simply went and re-opened the file instead of using locks. I also
:have a new queueing code in the works that will get rid of this problem
:entirely.
:
:cheers
: simon

The last time I looked at the code I also noticed that the file
descriptors were getting spammed by the fork()... that the children
would inherit the parent's entire fd set, even open fd's related
to queue files other than the one the child is supposed to
process.

-Matt

#16 Updated by corecode about 5 years ago

Matthew Dillon wrote:
> :Matthew Dillon wrote:
> :> I believe flock() behaves more rationally.
> :
> :Yah, I simply went and re-opened the file instead of using locks. I also
> :have a new queueing code in the works that will get rid of this problem
> :entirely.
> :
> :cheers
> : simon
>
> The last time I looked at the code I also noticed that the file
> descriptors were getting spammed by the fork()... that the children
> would inherit the parent's entire fd set, even open fd's related
> to queue files other than the one the child is supposed to
> process.

Yes, I'm working on all that.

cheers
simon
