Bug #1416
dma(8): Fix race condition in multi-recipient delivery
| Status: | Closed | Start date: | ||
|---|---|---|---|---|
| Priority: | Normal | Due date: | ||
| Assignee: | - | % Done: | 0% |
|
| Category: | - | |||
| Target version: | - |
Description
When delivering mail to multiple recipients, dma(8) will fork for
each recipient and attempt parallel delivery. Separate delivery
processes for multiple recipients of the same message all use the
same FILE* on the same file descriptor. If a process is loosing
the CPU during the delivery process, another delivery process for
the same message will wreak havoc on the file descriptor. The
symptom is dma(8) logging a failed delivery and bouncing the
message for some or all of the recipients:
Jul 3 20:57:21 marvin dma[32640]: 15c.284010d0: remote delivery
failed:corrupted queue file
Messages may also silently corrupt without dma(8) even noticing.
To reproduce:
$ dd if=/dev/urandom bs=1k count=100 | openssl base64 > /tmp/data
$ mail -s test your@email.com your_second@email.com < /tmp/data
Then compare what you received with /tmp/data.
The attached patch resolves the race condition by locking the
queue file during delivery.
Note that this is just a quick fix against message corruption.
I'm suspecting that a better fix would be to rewrite dma(8) to
fork a process per message, not per recipient, OR copy the spool
file per recipient and fork per spool file.
Disclaimer: Tested and fixed on FreeBSD 7.2, not DragonFlyBSD.
Related todos
History
Updated by corecode almost 4 years ago
Daniel Roethlisberger wrote:
> When delivering mail to multiple recipients, dma(8) will fork for
> each recipient and attempt parallel delivery. Separate delivery
> processes for multiple recipients of the same message all use the
> same FILE* on the same file descriptor. If a process is loosing
> the CPU during the delivery process, another delivery process for
> the same message will wreak havoc on the file descriptor. The
> symptom is dma(8) logging a failed delivery and bouncing the
> message for some or all of the recipients:
>
> Jul 3 20:57:21 marvin dma[32640]: 15c.284010d0: remote delivery
> failed:corrupted queue file
Thanks for your submission! I've taken a different approach, could you please test whether it works for you? (Untested patch).
thanks,
simon
Updated by roe almost 4 years ago
Simon 'corecode' Schuber <corecode@fs.ei.tum.de> 2009-07-09:
> Daniel Roethlisberger wrote:
> >When delivering mail to multiple recipients, dma(8) will fork for
> >each recipient and attempt parallel delivery. Separate delivery
> >processes for multiple recipients of the same message all use the
> >same FILE* on the same file descriptor. If a process is loosing
> >the CPU during the delivery process, another delivery process for
> >the same message will wreak havoc on the file descriptor. The
> >symptom is dma(8) logging a failed delivery and bouncing the
> >message for some or all of the recipients:
> >
> >Jul 3 20:57:21 marvin dma[32640]: 15c.284010d0: remote delivery
> >failed:corrupted queue file
>
> Thanks for your submission! I've taken a different approach, could you
> please test whether it works for you? (Untested patch).
>
> thanks,
> simon
This patch doesn't work for me: dma delivers fine to the first
recipient, but fails delivery for all other recipients, logging
an illegal seek to syslog:
Jul 10 01:15:52 marvin dma[45137]: e7.28401130: remote delivery
deferred: cannot seek: Illegal seek
Updated by corecode almost 4 years ago
Daniel Roethlisberger wrote:
> Simon 'corecode' Schuber <corecode@fs.ei.tum.de> 2009-07-09:
>> --- a/libexec/dma/dma.c
>> +++ b/libexec/dma/dma.c
>> @@ -484,6 +484,7 @@ go_background(struct queue *queue)
>> {
>> struct sigaction sa;
>> struct qitem *it;
>> + FILE *newqf;
>> pid_t pid;
>>
>> if (daemonize && daemon(0, 0) != 0) {
>> @@ -515,6 +516,17 @@ go_background(struct queue *queue)
>> *
>> * return and deliver mail
>> */
>> + /*
>> + * We have to prevent sharing of fds between children, so
>> + * we have to dup the queue fd.
>> + */
>> + newqf = fdopen(fileno(it->queuef), "r");
>
> Actually, fdopen() will not dup() the file descriptors, just
> create a new FILE* handle operating on the same fd.
Uh. I ment to add a dup() there, but I moved a lot of code, so I
forgot... What about the attached patch?
cheers
simon
Updated by corecode almost 4 years ago
Simon 'corecode' Schubert wrote:
> Uh. I ment to add a dup() there, but I moved a lot of code, so I
> forgot... What about the attached patch?
And of course I attached the wrong one. Here the right one.
Updated by roe almost 4 years ago
[...]
> --- a/libexec/dma/dma.c
> +++ b/libexec/dma/dma.c
> @@ -484,6 +484,7 @@ go_background(struct queue *queue)
> {
> struct sigaction sa;
> struct qitem *it;
> + FILE *newqf;
> pid_t pid;
>
> if (daemonize && daemon(0, 0) != 0) {
> @@ -515,6 +516,17 @@ go_background(struct queue *queue)
> *
> * return and deliver mail
> */
> + /*
> + * We have to prevent sharing of fds between children, so
> + * we have to dup the queue fd.
> + */
> + newqf = fdopen(dup(fileno(it->queuef)), "r");
> + if (newqf == NULL) {
> + syslog(LOG_ERR, "can not dup queue fd: %m");
> + exit(1);
> + }
> + fclose(it->queuef);
> + it->queuef = newqf;
> return (it);
>
> default:
This is still not sufficient (logs below; of 3 recipients, 2 were
successful but with corrupted contents, 1 failed due to queue
file corruption).
To go down that route, I think you'd need to fully reopen the
file starting with open(). This opens another can of worms:
unlink() races. I guess it is easier to lock the queue file
instead of fully reopening and fixing the unlink() race. But
maybe I'm missing something.
Jul 10 10:41:30 marvin dma[52950]: d6.28401130: mail from=<roe@marvin.ustcor.roe.ch> to=<daniel+3@roe.ch>
Jul 10 10:41:30 marvin dma[52951]: d6.28401100: mail from=<roe@marvin.ustcor.roe.ch> to=<daniel+2@roe.ch>
Jul 10 10:41:30 marvin dma[52951]: d6.28401100: trying delivery
Jul 10 10:41:30 marvin dma[52951]: d6.28401100: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:30 marvin dma[52949]: d6.284010d0: mail from=<roe@marvin.ustcor.roe.ch> to=<daniel+1@roe.ch>
Jul 10 10:41:30 marvin dma[52949]: d6.284010d0: trying delivery
Jul 10 10:41:30 marvin dma[52949]: d6.284010d0: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:30 marvin dma[52950]: d6.28401130: trying delivery
Jul 10 10:41:30 marvin dma[52950]: d6.28401130: using smarthost (calvin.ustdmz.roe.ch:587)
Jul 10 10:41:34 marvin dma[52951]: d6.28401100: SSL initialization successful
Jul 10 10:41:34 marvin dma[52951]: d6.28401100: Use SMTP authentication
Jul 10 10:41:37 marvin dma[52950]: d6.28401130: SSL initialization successful
Jul 10 10:41:37 marvin dma[52949]: d6.284010d0: SSL initialization successful
Jul 10 10:41:37 marvin dma[52950]: d6.28401130: Use SMTP authentication
Jul 10 10:41:37 marvin dma[52949]: d6.284010d0: Use SMTP authentication
Jul 10 10:41:45 marvin dma[52951]: d6.28401100: delivery successful
Jul 10 10:41:48 marvin dma[52949]: d6.284010d0: remote delivery failed:corrupted queue file
Jul 10 10:41:48 marvin dma[52949]: d6.284010d0: delivery failed, bouncing
Jul 10 10:41:48 marvin dma[52949]: d8.28401250: mail from=<> to=<roe>
Jul 10 10:41:48 marvin dma[52949]: d8.28401250: trying delivery
Jul 10 10:41:48 marvin dma[52949]: d8.28401250: delivery successful
Jul 10 10:41:57 marvin dma[52950]: d6.28401130: delivery successful
Updated by corecode almost 4 years ago
Matthew Dillon wrote:
> dup() doesn't solve the problem. A dup()'d descriptor still
> points to a common file pointer and thus a common seek position.
>
> Only open() will generate a new file pointer backing the file
> descriptor.
That's not true.
cheers
simon
Updated by corecode almost 4 years ago
Simon 'corecode' Schubert wrote:
> Matthew Dillon wrote:
>> dup() doesn't solve the problem. A dup()'d descriptor still
>> points to a common file pointer and thus a common seek position.
>>
>> Only open() will generate a new file pointer backing the file
>> descriptor.
>
> That's not true.
Nevermind, you're right. I'm too close to sleeping times, and stdio is
doing some heavy buffering.
Updated by roe almost 4 years ago
Daniel Roethlisberger <daniel@roe.ch> 2009-07-10:
> I guess it is easier to lock the queue file instead of fully
> reopening and fixing the unlink() race.
I cleaned up my original hack, maybe this version is a tad more
convincing :)
Updated by dillon almost 4 years ago
:Uh. I ment to add a dup() there, but I moved a lot of code, so I
:forgot... What about the attached patch?
:
:cheers
: simon
dup() doesn't solve the problem. A dup()'d descriptor still
points to a common file pointer and thus a common seek position.
Only open() will generate a new file pointer backing the file
descriptor.
-Matt
Updated by corecode almost 4 years ago
Daniel Roethlisberger wrote:
> Daniel Roethlisberger <daniel@roe.ch> 2009-07-10:
>> I guess it is easier to lock the queue file instead of fully
>> reopening and fixing the unlink() race.
>
> I cleaned up my original hack, maybe this version is a tad more
> convincing :)
This doesn't work for me. All processes are stuck in state lockf. I
think we can't reliably combine fcntl locks and flock locks.
cheers
simon
Updated by roe almost 4 years ago
Simon 'corecode' Schuber <submit@crater.dragonflybsd.org> 2009-07-16:
>
> Simon 'corecode' Schubert <corecode@fs.ei.tum.de> added the comment:
>
> Daniel Roethlisberger wrote:
> > Daniel Roethlisberger <daniel@roe.ch> 2009-07-10:
> >> I guess it is easier to lock the queue file instead of fully
> >> reopening and fixing the unlink() race.
> >
> > I cleaned up my original hack, maybe this version is a tad more
> > convincing :)
>
> This doesn't work for me. All processes are stuck in state lockf. I
> think we can't reliably combine fcntl locks and flock locks.
Is that a DragonFlyBSD issue? On FreeBSD seem to be compatible
(both according to manual page and practical testing). Using
flock to synchronize on the dup()d / fork()d queue file
descriptor doesnŽt work in this case due to the semantics of
flock.
In any case, IŽm sorry that I cannot test / respond again for 3
weeks since IŽm abroad, far away from keyboard.
Updated by corecode almost 4 years ago
Daniel Roethlisberger wrote:
> Simon 'corecode' Schuber <submit@crater.dragonflybsd.org> 2009-07-16:
>> Simon 'corecode' Schubert <corecode@fs.ei.tum.de> added the comment:
>>
>> Daniel Roethlisberger wrote:
>>> Daniel Roethlisberger <daniel@roe.ch> 2009-07-10:
>>>> I guess it is easier to lock the queue file instead of fully
>>>> reopening and fixing the unlink() race.
>>> I cleaned up my original hack, maybe this version is a tad more
>>> convincing :)
>> This doesn't work for me. All processes are stuck in state lockf. I
>> think we can't reliably combine fcntl locks and flock locks.
>
> Is that a DragonFlyBSD issue? On FreeBSD seem to be compatible
> (both according to manual page and practical testing). Using
> flock to synchronize on the dup()d / fork()d queue file
> descriptor doesnŽt work in this case due to the semantics of
> flock.
>
> In any case, IŽm sorry that I cannot test / respond again for 3
> weeks since IŽm abroad, far away from keyboard.
No problem, I fixed the issue by re-opening the queue file. Thanks for
your bug report!
cheers
simon
Updated by dillon almost 4 years ago
flock locks are effective lockf locks but cover the ENTIRE range.
It is not a good idea to try to combine them. Use one or the other.
lockf locks (via fcntl) follow POSIX semanics, I believe, which means
that when you close() a descriptor, ANY LOCK HELD BY THAT PROCESS ON
THAT FILE will also be released, even if it was obtained by another
descriptor. This is badly broken but its what POSIX says and what we
have to do.
If you are fork()ing you want to be sure that you only have one file
descriptor open for the file that you intend to maintain a lock on.
If you open/close the file within that same process the original lock
on the still-open descriptor will be lost. For fcntl-based lockf locks.
I believe flock() behaves more rationally.
-Matt
Updated by corecode almost 4 years ago
Matthew Dillon wrote:
> I believe flock() behaves more rationally.
Yah, I simply went and re-opened the file instead of using locks. I also
have a new queueing code in the works that will get rid of this problem
entirely.
cheers
simon
Updated by dillon almost 4 years ago
:Matthew Dillon wrote:
:> I believe flock() behaves more rationally.
:
:Yah, I simply went and re-opened the file instead of using locks. I also
:have a new queueing code in the works that will get rid of this problem
:entirely.
:
:cheers
: simon
The last time I looked at the code I also noticed that the file
descriptors were getting spammed by the fork()... that the children
would inherit the parent's entire fd set, even open fd's related
to queue files other then the one the child is supposed to
process.
-Matt
Updated by corecode almost 4 years ago
Matthew Dillon wrote:
> :Matthew Dillon wrote:
> :> I believe flock() behaves more rationally.
> :
> :Yah, I simply went and re-opened the file instead of using locks. I also
> :have a new queueing code in the works that will get rid of this problem
> :entirely.
> :
> :cheers
> : simon
>
> The last time I looked at the code I also noticed that the file
> descriptors were getting spammed by the fork()... that the children
> would inherit the parent's entire fd set, even open fd's related
> to queue files other then the one the child is supposed to
> process.
Yes, I'm working on all that.
cheers
simon