Bug #1416

dma(8): Fix race condition in multi-recipient delivery

Added by roe about 5 years ago. Updated about 5 years ago.

Status:          Closed
Priority:        Normal
Assignee:        -
% Done:          0%
Category:        -
Target version:  -
Start date:
Due date:

Description

When delivering mail to multiple recipients, dma(8) will fork for
each recipient and attempt parallel delivery. Separate delivery
processes for multiple recipients of the same message all use the
same FILE* on the same file descriptor. If a process loses
the CPU during the delivery process, another delivery process for
the same message will wreak havoc on the file descriptor. The
symptom is dma(8) logging a failed delivery and bouncing the
message for some or all of the recipients:

Jul 3 20:57:21 marvin dma[32640]: 15c.284010d0: remote delivery
failed:corrupted queue file

Messages may also be silently corrupted without dma(8) even noticing.

To reproduce:

$ dd if=/dev/urandom bs=1k count=100 | openssl base64 > /tmp/data
$ mail -s test < /tmp/data

Then compare what you received with /tmp/data.

The attached patch resolves the race condition by locking the
queue file during delivery.

Note that this is just a quick fix against message corruption.
I'm suspecting that a better fix would be to rewrite dma(8) to
fork a process per message, not per recipient, OR copy the spool
file per recipient and fork per spool file.

Disclaimer: Tested and fixed on FreeBSD 7.2, not DragonFlyBSD.

dma-lock-delivery.diff (1018 Bytes) roe, 07/03/2009 10:27 PM

0001-dma-prevent-races-from-sharing-fd-between-children.patch (1.33 KB) corecode, 07/09/2009 09:31 PM

0001-dma-prevent-races-from-sharing-fd-between-children.patch (1.33 KB) corecode, 07/09/2009 11:56 PM

0001-dma-prevent-races-from-sharing-fd-between-children.patch (1.33 KB) corecode, 07/09/2009 11:58 PM

d.c (433 Bytes) corecode, 07/10/2009 08:51 AM

0001-Fix-queue-file-seek-position-races-by-locking.patch (2.07 KB) roe, 07/10/2009 10:24 AM

History

#1 Updated by corecode about 5 years ago

Daniel Roethlisberger wrote:
> When delivering mail to multiple recipients, dma(8) will fork for
> each recipient and attempt parallel delivery. Separate delivery
> processes for multiple recipients of the same message all use the
> same FILE* on the same file descriptor. If a process loses
> the CPU during the delivery process, another delivery process for
> the same message will wreak havoc on the file descriptor. The
> symptom is dma(8) logging a failed delivery and bouncing the
> message for some or all of the recipients:
>
> Jul 3 20:57:21 marvin dma[32640]: 15c.284010d0: remote delivery
> failed:corrupted queue file

Thanks for your submission! I've taken a different approach, could you please test whether it works for you? (Untested patch).

thanks,
simon

#2 Updated by roe about 5 years ago

Simon 'corecode' Schubert <> 2009-07-09:
> Daniel Roethlisberger wrote:
> >When delivering mail to multiple recipients, dma(8) will fork for
> >each recipient and attempt parallel delivery. Separate delivery
> >processes for multiple recipients of the same message all use the
> >same FILE* on the same file descriptor. If a process loses
> >the CPU during the delivery process, another delivery process for
> >the same message will wreak havoc on the file descriptor. The
> >symptom is dma(8) logging a failed delivery and bouncing the
> >message for some or all of the recipients:
> >
> >Jul 3 20:57:21 marvin dma[32640]: 15c.284010d0: remote delivery
> >failed:corrupted queue file
>
> Thanks for your submission! I've taken a different approach, could you
> please test whether it works for you? (Untested patch).
>
> thanks,
> simon

This patch doesn't work for me: dma delivers fine to the first
recipient, but fails delivery for all other recipients, logging
an illegal seek to syslog:

Jul 10 01:15:52 marvin dma[45137]: e7.28401130: remote delivery
deferred: cannot seek: Illegal seek

#3 Updated by corecode about 5 years ago

Daniel Roethlisberger wrote:
> Simon 'corecode' Schuber <> 2009-07-09:
>> --- a/libexec/dma/dma.c
>> +++ b/libexec/dma/dma.c
>> @@ -484,6 +484,7 @@ go_background(struct queue *queue)
>> {
>> struct sigaction sa;
>> struct qitem *it;
>> + FILE *newqf;
>> pid_t pid;
>>
>> if (daemonize && daemon(0, 0) != 0) {
>> @@ -515,6 +516,17 @@ go_background(struct queue *queue)
>> *
>> * return and deliver mail
>> */
>> + /*
>> + * We have to prevent sharing of fds between children, so
>> + * we have to dup the queue fd.
>> + */
>> + newqf = fdopen(fileno(it->queuef), "r");
>
> Actually, fdopen() will not dup() the file descriptors, just
> create a new FILE* handle operating on the same fd.

Uh. I meant to add a dup() there, but I moved a lot of code, so I
forgot... What about the attached patch?

cheers
simon

#4 Updated by corecode about 5 years ago

Simon 'corecode' Schubert wrote:
> Uh. I meant to add a dup() there, but I moved a lot of code, so I
> forgot... What about the attached patch?

And of course I attached the wrong one. Here's the right one.

#5 Updated by roe about 5 years ago

[...]
> --- a/libexec/dma/dma.c
> +++ b/libexec/dma/dma.c
> @@ -484,6 +484,7 @@ go_background(struct queue *queue)
> {
> struct sigaction sa;
> struct qitem *it;
> + FILE *newqf;
> pid_t pid;
>
> if (daemonize && daemon(0, 0) != 0) {
> @@ -515,6 +516,17 @@ go_background(struct queue *queue)
> *
> * return and deliver mail
> */
> + /*
> + * We have to prevent sharing of fds between children, so
> + * we have to dup the queue fd.
> + */
> + newqf = fdopen(dup(fileno(it->queuef)), "r");
> + if (newqf == NULL) {
> + syslog(LOG_ERR, "can not dup queue fd: %m");
> + exit(1);
> + }
> + fclose(it->queuef);
> + it->queuef = newqf;
> return (it);
>
> default:

This is still not sufficient (logs below; of 3 recipients, 2 were
successful but with corrupted contents, 1 failed due to queue
file corruption).

To go down that route, I think you'd need to fully reopen the
file starting with open(). This opens another can of worms:
unlink() races. I guess it is easier to lock the queue file
instead of fully reopening and fixing the unlink() race. But
maybe I'm missing something.

Jul 10 10:41:30 marvin dma[52950]: d6.28401130: mail from=<> to=<>
Jul 10 10:41:30 marvin dma[52951]: d6.28401100: mail from=<> to=<>
Jul 10 10:41:30 marvin dma[52951]: d6.28401100: trying delivery
Jul 10 10:41:30 marvin dma[52951]: d6.28401100: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:30 marvin dma[52949]: d6.284010d0: mail from=<> to=<>
Jul 10 10:41:30 marvin dma[52949]: d6.284010d0: trying delivery
Jul 10 10:41:30 marvin dma[52949]: d6.284010d0: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:30 marvin dma[52950]: d6.28401130: trying delivery
Jul 10 10:41:30 marvin dma[52950]: d6.28401130: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:34 marvin dma[52951]: d6.28401100: SSL initialization successful
Jul 10 10:41:34 marvin dma[52951]: d6.28401100: Use SMTP authentication
Jul 10 10:41:37 marvin dma[52950]: d6.28401130: SSL initialization successful
Jul 10 10:41:37 marvin dma[52949]: d6.284010d0: SSL initialization successful
Jul 10 10:41:37 marvin dma[52950]: d6.28401130: Use SMTP authentication
Jul 10 10:41:37 marvin dma[52949]: d6.284010d0: Use SMTP authentication
Jul 10 10:41:45 marvin dma[52951]: d6.28401100: delivery successful
Jul 10 10:41:48 marvin dma[52949]: d6.284010d0: remote delivery failed:corrupted queue file
Jul 10 10:41:48 marvin dma[52949]: d6.284010d0: delivery failed, bouncing
Jul 10 10:41:48 marvin dma[52949]: d8.28401250: mail from=<> to=<roe>
Jul 10 10:41:48 marvin dma[52949]: d8.28401250: trying delivery
Jul 10 10:41:48 marvin dma[52949]: d8.28401250: delivery successful
Jul 10 10:41:57 marvin dma[52950]: d6.28401130: delivery successful

#6 Updated by corecode about 5 years ago

Matthew Dillon wrote:
> dup() doesn't solve the problem. A dup()'d descriptor still
> points to a common file pointer and thus a common seek position.
>
> Only open() will generate a new file pointer backing the file
> descriptor.

That's not true.

cheers
simon

#7 Updated by corecode about 5 years ago

Simon 'corecode' Schubert wrote:
> Matthew Dillon wrote:
>> dup() doesn't solve the problem. A dup()'d descriptor still
>> points to a common file pointer and thus a common seek position.
>>
>> Only open() will generate a new file pointer backing the file
>> descriptor.
>
> That's not true.

Nevermind, you're right. I'm too close to sleeping times, and stdio is
doing some heavy buffering.

#8 Updated by roe about 5 years ago

Daniel Roethlisberger <> 2009-07-10:
> I guess it is easier to lock the queue file instead of fully
> reopening and fixing the unlink() race.

I cleaned up my original hack, maybe this version is a tad more
convincing :)

#9 Updated by dillon about 5 years ago

:Uh. I meant to add a dup() there, but I moved a lot of code, so I
:forgot... What about the attached patch?
:
:cheers
: simon

dup() doesn't solve the problem. A dup()'d descriptor still
points to a common file pointer and thus a common seek position.

Only open() will generate a new file pointer backing the file
descriptor.

-Matt

#10 Updated by corecode about 5 years ago

Daniel Roethlisberger wrote:
> Daniel Roethlisberger <> 2009-07-10:
>> I guess it is easier to lock the queue file instead of fully
>> reopening and fixing the unlink() race.
>
> I cleaned up my original hack, maybe this version is a tad more
> convincing :)

This doesn't work for me. All processes are stuck in state lockf. I
think we can't reliably combine fcntl locks and flock locks.

cheers
simon

#11 Updated by roe about 5 years ago

Simon 'corecode' Schubert <> 2009-07-16:
>
> Simon 'corecode' Schubert <> added the comment:
>
> Daniel Roethlisberger wrote:
> > Daniel Roethlisberger <> 2009-07-10:
> >> I guess it is easier to lock the queue file instead of fully
> >> reopening and fixing the unlink() race.
> >
> > I cleaned up my original hack, maybe this version is a tad more
> > convincing :)
>
> This doesn't work for me. All processes are stuck in state lockf. I
> think we can't reliably combine fcntl locks and flock locks.

Is that a DragonFlyBSD issue? On FreeBSD they seem to be compatible
(both according to the manual page and practical testing). Using
flock to synchronize on the dup()'d / fork()'d queue file
descriptor doesn't work in this case due to the semantics of
flock.

In any case, I'm sorry that I cannot test / respond again for 3
weeks since I'm abroad, far away from keyboard.

#12 Updated by corecode about 5 years ago

Daniel Roethlisberger wrote:
> Simon 'corecode' Schubert <> 2009-07-16:
>> Simon 'corecode' Schubert <> added the comment:
>>
>> Daniel Roethlisberger wrote:
>>> Daniel Roethlisberger <> 2009-07-10:
>>>> I guess it is easier to lock the queue file instead of fully
>>>> reopening and fixing the unlink() race.
>>> I cleaned up my original hack, maybe this version is a tad more
>>> convincing :)
>> This doesn't work for me. All processes are stuck in state lockf. I
>> think we can't reliably combine fcntl locks and flock locks.
>
> Is that a DragonFlyBSD issue? On FreeBSD they seem to be compatible
> (both according to the manual page and practical testing). Using
> flock to synchronize on the dup()'d / fork()'d queue file
> descriptor doesn't work in this case due to the semantics of
> flock.
>
> In any case, I'm sorry that I cannot test / respond again for 3
> weeks since I'm abroad, far away from keyboard.

No problem, I fixed the issue by re-opening the queue file. Thanks for
your bug report!

cheers
simon

#13 Updated by dillon about 5 years ago

flock locks are effectively lockf locks but cover the ENTIRE range.
It is not a good idea to try to combine them. Use one or the other.

lockf locks (via fcntl) follow POSIX semantics, I believe, which means
that when you close() a descriptor, ANY LOCK HELD BY THAT PROCESS ON
THAT FILE will also be released, even if it was obtained by another
descriptor. This is badly broken, but it's what POSIX says and what we
have to do.

If you are fork()ing you want to be sure that you only have one file
descriptor open for the file that you intend to maintain a lock on.
If you open/close the file within that same process, the original lock
on the still-open descriptor will be lost (for fcntl-based lockf locks).

I believe flock() behaves more rationally.

-Matt

#14 Updated by corecode about 5 years ago

Matthew Dillon wrote:
> I believe flock() behaves more rationally.

Yah, I simply went and re-opened the file instead of using locks. I also
have a new queueing code in the works that will get rid of this problem
entirely.

cheers
simon

#15 Updated by dillon about 5 years ago

:Matthew Dillon wrote:
:> I believe flock() behaves more rationally.
:
:Yah, I simply went and re-opened the file instead of using locks. I also
:have a new queueing code in the works that will get rid of this problem
:entirely.
:
:cheers
: simon

The last time I looked at the code I also noticed that the file
descriptors were getting spammed by the fork()... that the children
would inherit the parent's entire fd set, even open fd's related
to queue files other than the one the child is supposed to
process.

-Matt

#16 Updated by corecode about 5 years ago

Matthew Dillon wrote:
> :Matthew Dillon wrote:
> :> I believe flock() behaves more rationally.
> :
> :Yah, I simply went and re-opened the file instead of using locks. I also
> :have a new queueing code in the works that will get rid of this problem
> :entirely.
> :
> :cheers
> : simon
>
> The last time I looked at the code I also noticed that the file
> descriptors were getting spammed by the fork()... that the children
> would inherit the parent's entire fd set, even open fd's related
> to queue files other than the one the child is supposed to
> process.

Yes, I'm working on all that.

cheers
simon
