Bug #1416

closed

dma(8): Fix race condition in multi-recipient delivery

Added by roe almost 15 years ago. Updated almost 15 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:

Description

When delivering mail to multiple recipients, dma(8) will fork for
each recipient and attempt parallel delivery. Separate delivery
processes for multiple recipients of the same message all use the
same FILE* on the same file descriptor. If a process loses the
CPU during delivery, another delivery process for
the same message will wreak havoc on the file descriptor. The
symptom is dma(8) logging a failed delivery and bouncing the
message for some or all of the recipients:

Jul 3 20:57:21 marvin dma32640: 15c.284010d0: remote delivery
failed:corrupted queue file

Messages may also be silently corrupted without dma(8) even noticing.

To reproduce:

$ dd if=/dev/urandom bs=1k count=100 | openssl base64 > /tmp/data
$ mail -s test < /tmp/data

Then compare what you received with /tmp/data.

The attached patch resolves the race condition by locking the
queue file during delivery.

Note that this is just a quick fix against message corruption.
I'm suspecting that a better fix would be to rewrite dma(8) to
fork a process per message, not per recipient, OR copy the spool
file per recipient and fork per spool file.

Disclaimer: Tested and fixed on FreeBSD 7.2, not DragonFlyBSD.


Actions #1

Updated by corecode almost 15 years ago

Daniel Roethlisberger wrote:

When delivering mail to multiple recipients, dma(8) will fork for
each recipient and attempt parallel delivery. Separate delivery
processes for multiple recipients of the same message all use the
same FILE* on the same file descriptor. If a process loses the
CPU during delivery, another delivery process for
the same message will wreak havoc on the file descriptor. The
symptom is dma(8) logging a failed delivery and bouncing the
message for some or all of the recipients:

Jul 3 20:57:21 marvin dma32640: 15c.284010d0: remote delivery
failed:corrupted queue file

Thanks for your submission! I've taken a different approach, could you please test whether it works for you? (Untested patch).

thanks,
simon

Actions #2

Updated by roe almost 15 years ago

Simon 'corecode' Schubert <> 2009-07-09:

Daniel Roethlisberger wrote:

When delivering mail to multiple recipients, dma(8) will fork for
each recipient and attempt parallel delivery. Separate delivery
processes for multiple recipients of the same message all use the
same FILE* on the same file descriptor. If a process loses the
CPU during delivery, another delivery process for
the same message will wreak havoc on the file descriptor. The
symptom is dma(8) logging a failed delivery and bouncing the
message for some or all of the recipients:

Jul 3 20:57:21 marvin dma32640: 15c.284010d0: remote delivery
failed:corrupted queue file

Thanks for your submission! I've taken a different approach, could you
please test whether it works for you? (Untested patch).

thanks,
simon

This patch doesn't work for me: dma delivers fine to the first
recipient, but fails delivery for all other recipients, logging
an illegal seek to syslog:

Jul 10 01:15:52 marvin dma45137: e7.28401130: remote delivery
deferred: cannot seek: Illegal seek

Actions #3

Updated by corecode almost 15 years ago

Daniel Roethlisberger wrote:

Simon 'corecode' Schubert <> 2009-07-09:

--- a/libexec/dma/dma.c
+++ b/libexec/dma/dma.c
@@ -484,6 +484,7 @@ go_background(struct queue *queue)
 {
 	struct sigaction sa;
 	struct qitem *it;
+	FILE *newqf;
 	pid_t pid;
 
 	if (daemonize && daemon(0, 0) != 0) {
@@ -515,6 +516,17 @@ go_background(struct queue *queue)
 			 * return and deliver mail
 			 */
+			/*
+			 * We have to prevent sharing of fds between children, so
+			 * we have to dup the queue fd.
+			 */
+			newqf = fdopen(fileno(it->queuef), "r");

Actually, fdopen() will not dup() the file descriptors, just
create a new FILE* handle operating on the same fd.

Uh. I meant to add a dup() there, but I moved a lot of code, so I
forgot... What about the attached patch?

cheers
simon

Actions #4

Updated by corecode almost 15 years ago

Simon 'corecode' Schubert wrote:

Uh. I meant to add a dup() there, but I moved a lot of code, so I
forgot... What about the attached patch?

And of course I attached the wrong one. Here the right one.

Actions #5

Updated by roe almost 15 years ago

[...]

--- a/libexec/dma/dma.c
+++ b/libexec/dma/dma.c
@@ -484,6 +484,7 @@ go_background(struct queue *queue)
 {
 	struct sigaction sa;
 	struct qitem *it;
+	FILE *newqf;
 	pid_t pid;
 
 	if (daemonize && daemon(0, 0) != 0) {
@@ -515,6 +516,17 @@ go_background(struct queue *queue)
 			 * return and deliver mail
 			 */
+			/*
+			 * We have to prevent sharing of fds between children, so
+			 * we have to dup the queue fd.
+			 */
+			newqf = fdopen(dup(fileno(it->queuef)), "r");
+			if (newqf == NULL) {
+				syslog(LOG_ERR, "can not dup queue fd: %m");
+				exit(1);
+			}
+			fclose(it->queuef);
+			it->queuef = newqf;
 			return (it);
 
 	default:

This is still not sufficient (logs below; of 3 recipients, 2 were
successful but with corrupted contents, 1 failed due to queue
file corruption).

To go down that route, I think you'd need to fully reopen the
file starting with open(). This opens another can of worms:
unlink() races. I guess it is easier to lock the queue file
instead of fully reopening and fixing the unlink() race. But
maybe I'm missing something.

Jul 10 10:41:30 marvin dma52950: d6.28401130: mail from=<> to=<>
Jul 10 10:41:30 marvin dma52951: d6.28401100: mail from=<> to=<>
Jul 10 10:41:30 marvin dma52951: d6.28401100: trying delivery
Jul 10 10:41:30 marvin dma52951: d6.28401100: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:30 marvin dma52949: d6.284010d0: mail from=<> to=<>
Jul 10 10:41:30 marvin dma52949: d6.284010d0: trying delivery
Jul 10 10:41:30 marvin dma52949: d6.284010d0: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:30 marvin dma52950: d6.28401130: trying delivery
Jul 10 10:41:30 marvin dma52950: d6.28401130: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:34 marvin dma52951: d6.28401100: SSL initialization successful
Jul 10 10:41:34 marvin dma52951: d6.28401100: Use SMTP authentication
Jul 10 10:41:37 marvin dma52950: d6.28401130: SSL initialization successful
Jul 10 10:41:37 marvin dma52949: d6.284010d0: SSL initialization successful
Jul 10 10:41:37 marvin dma52950: d6.28401130: Use SMTP authentication
Jul 10 10:41:37 marvin dma52949: d6.284010d0: Use SMTP authentication
Jul 10 10:41:45 marvin dma52951: d6.28401100: delivery successful
Jul 10 10:41:48 marvin dma52949: d6.284010d0: remote delivery failed:corrupted queue file
Jul 10 10:41:48 marvin dma52949: d6.284010d0: delivery failed, bouncing
Jul 10 10:41:48 marvin dma52949: d8.28401250: mail from=<> to=<roe>
Jul 10 10:41:48 marvin dma52949: d8.28401250: trying delivery
Jul 10 10:41:48 marvin dma52949: d8.28401250: delivery successful
Jul 10 10:41:57 marvin dma52950: d6.28401130: delivery successful

Actions #6

Updated by corecode almost 15 years ago

Matthew Dillon wrote:

dup() doesn't solve the problem. A dup()'d descriptor still
points to a common file pointer and thus a common seek position.

Only open() will generate a new file pointer backing the file
descriptor.

That's not true.

cheers
simon

Actions #7

Updated by corecode almost 15 years ago

Simon 'corecode' Schubert wrote:

Matthew Dillon wrote:

dup() doesn't solve the problem. A dup()'d descriptor still
points to a common file pointer and thus a common seek position.

Only open() will generate a new file pointer backing the file
descriptor.

That's not true.

Nevermind, you're right. I'm too close to sleeping times, and stdio is
doing some heavy buffering.

Actions #8

Updated by roe almost 15 years ago

Daniel Roethlisberger <> 2009-07-10:

I guess it is easier to lock the queue file instead of fully
reopening and fixing the unlink() race.

I cleaned up my original hack, maybe this version is a tad more
convincing :)

Actions #9

Updated by dillon almost 15 years ago

:Uh. I ment to add a dup() there, but I moved a lot of code, so I
:forgot... What about the attached patch?
:
:cheers
: simon

dup() doesn't solve the problem.  A dup()'d descriptor still
points to a common file pointer and thus a common seek position.
Only open() will generate a new file pointer backing the file
descriptor.
-Matt
Actions #10

Updated by corecode almost 15 years ago

Daniel Roethlisberger wrote:

Daniel Roethlisberger <> 2009-07-10:

I guess it is easier to lock the queue file instead of fully
reopening and fixing the unlink() race.

I cleaned up my original hack, maybe this version is a tad more
convincing :)

This doesn't work for me. All processes are stuck in state lockf. I
think we can't reliably combine fcntl locks and flock locks.

cheers
simon

Actions #11

Updated by roe almost 15 years ago

Simon 'corecode' Schubert <> 2009-07-16:

Simon 'corecode' Schubert <> added the comment:

Daniel Roethlisberger wrote:

Daniel Roethlisberger <> 2009-07-10:

I guess it is easier to lock the queue file instead of fully
reopening and fixing the unlink() race.

I cleaned up my original hack, maybe this version is a tad more
convincing :)

This doesn't work for me. All processes are stuck in state lockf. I
think we can't reliably combine fcntl locks and flock locks.

Is that a DragonFlyBSD issue? On FreeBSD they seem to be compatible
(both according to the manual page and practical testing). Using
flock to synchronize on the dup()d / fork()d queue file
descriptor doesn't work in this case due to the semantics of
flock.

In any case, I'm sorry that I cannot test / respond again for 3
weeks since I'm abroad, far away from keyboard.

Actions #12

Updated by corecode almost 15 years ago

Daniel Roethlisberger wrote:

Simon 'corecode' Schubert <> 2009-07-16:

Simon 'corecode' Schubert <> added the comment:

Daniel Roethlisberger wrote:

Daniel Roethlisberger <> 2009-07-10:

I guess it is easier to lock the queue file instead of fully
reopening and fixing the unlink() race.

I cleaned up my original hack, maybe this version is a tad more
convincing :)

This doesn't work for me. All processes are stuck in state lockf. I
think we can't reliably combine fcntl locks and flock locks.

Is that a DragonFlyBSD issue? On FreeBSD they seem to be compatible
(both according to the manual page and practical testing). Using
flock to synchronize on the dup()d / fork()d queue file
descriptor doesn't work in this case due to the semantics of
flock.

In any case, I'm sorry that I cannot test / respond again for 3
weeks since I'm abroad, far away from keyboard.

No problem, I fixed the issue by re-opening the queue file. Thanks for
your bug report!

cheers
simon

Actions #13

Updated by dillon almost 15 years ago

flock locks are effectively lockf locks but cover the ENTIRE range.
It is not a good idea to try to combine them. Use one or the other.

lockf locks (via fcntl) follow POSIX semantics, I believe, which means
that when you close() a descriptor, ANY LOCK HELD BY THAT PROCESS ON
THAT FILE will also be released, even if it was obtained by another
descriptor. This is badly broken but it's what POSIX says and what we
have to do.

If you are fork()ing you want to be sure that you only have one file
descriptor open for the file that you intend to maintain a lock on.
If you open/close the file within that same process the original lock
on the still-open descriptor will be lost. For fcntl-based lockf locks.
I believe flock() behaves more rationally.

-Matt
Actions #14

Updated by corecode almost 15 years ago

Matthew Dillon wrote:

I believe flock() behaves more rationally.

Yah, I simply went and re-opened the file instead of using locks. I also
have a new queueing code in the works that will get rid of this problem
entirely.

cheers
simon

Actions #15

Updated by dillon almost 15 years ago

:Matthew Dillon wrote:
:> I believe flock() behaves more rationally.
:
:Yah, I simply went and re-opened the file instead of using locks. I also
:have a new queueing code in the works that will get rid of this problem
:entirely.
:
:cheers
: simon

The last time I looked at the code I also noticed that the file
descriptors were getting spammed by the fork()... that the children
would inherit the parent's entire fd set, even open fd's related
to queue files other than the one the child is supposed to
process.
-Matt
Actions #16

Updated by corecode almost 15 years ago

Matthew Dillon wrote:

:Matthew Dillon wrote:
:> I believe flock() behaves more rationally.
:
:Yah, I simply went and re-opened the file instead of using locks. I also
:have a new queueing code in the works that will get rid of this problem
:entirely.
:
:cheers
: simon

The last time I looked at the code I also noticed that the file
descriptors were getting spammed by the fork()... that the children
would inherit the parent's entire fd set, even open fd's related
to queue files other than the one the child is supposed to
process.

Yes, I'm working on all that.

cheers
simon
