Bug #1302

Checkpoint regression?

Added by sjg over 5 years ago. Updated about 1 year ago.

Status: In Progress
Start date:
Priority: Normal
Due date:
Assignee: sjg
% Done: 0%
Category: -
Target version: -

Description

DragonFly 2.3.0-DEVELOPMENT #10: Fri Feb 27 20:15:21 PST 2009
:/usr/obj/usr/src/sys/GENERIC

Either CKPT or CKPTEXIT will create a .ckpt, but CKPTEXIT does not cause the
process to terminate (unless explicitly handled).

Attempting to thaw any checkpoint seems to result in:
thaw failed error -1 Unknown error: -1

History

#1 Updated by dillon over 5 years ago

:New submission from Samuel J. Greear <>:
:
:DragonFly 2.3.0-DEVELOPMENT #10: Fri Feb 27 20:15:21 PST 2009
::/usr/obj/usr/src/sys/GENERIC
:
:Either CKPT or CKPTEXIT will create a .ckpt, but CKPTEXIT does not cause the
:process to terminate (unless explicitly handled).

I'm not sure it ever worked this way. They were designed so they
could be handled differently by the program if it wanted to handle
them.

:Attempting to thaw any checkpoint seems to result in:
:thaw failed error -1 Unknown error: -1

This one is now fixed! The reordering of the ELF coredump code broke
the checkpoint restore code.

I'll note here that the new ELF coredump code includes thread info.
I'm not sure if it has enough (i.e. also the thread id for each thread),
but if it does we would be able to have the checkpoint restore code
also restore multiple threads. Now THAT would be very, very cool
if someone wanted to mess with it.

-Matt

#2 Updated by sjg over 5 years ago

For posterity, the language from the sys_checkpoint(2) manpage is as follows and
I believe we agreed that it was the desired behavior.

SIGNALS
Two signals are associated with checkpointing. SIGCKPT is delivered via
the tty ckpt character, usually control-E. Its default action is to
checkpoint a program and continue running it. The SIGCKPTEXIT signal can
only be delivered by kill(2). Its default action is to checkpoint a program
and then exit. SIGCKPTEXIT might not be implemented by the system.

I've tested your changes and checkpoints will now thaw, but it looks like more
work is going to be needed for them to be useful. This is not just a matter of
threads: TLS also appears to be wonky in the single-threaded case. Below is a
traceback from my rwhod, hacked to natively handle SIGCKPTEXIT and properly
reinit on thaw.

Core was generated by `rwhod'.
Program terminated with signal 11, Segmentation fault.
#0 0x28053a2d in ___tls_get_addr () from /usr/libexec/ld-elf.so.2
(gdb) bt
#0 0x28053a2d in ___tls_get_addr () from /usr/libexec/ld-elf.so.2
#1 0x28115ef9 in nsdispatch () from /usr/lib/libc.so.6
#2 0x281093d0 in GETSERVBYNAME_R () from /usr/lib/libc.so.6
#3 0x28109410 in ?? () from /usr/lib/libc.so.6
#4 0x28108e08 in ?? () from /usr/lib/libc.so.6
#5 0x28108f30 in getservbyname () from /usr/lib/libc.so.6
#6 0x08049ff4 in main (argc=0, argv=<value optimized out>) at rwhod.c:251

I haven't investigated this further, but I will unless someone beats me to it.
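
For readers unfamiliar with the "explicitly handled" case discussed above, the
following is a minimal sketch of how a daemon might catch SIGCKPTEXIT itself
and defer the work to its main loop. It is not the rwhod patch referred to in
this report. The names on_ckpt_exit, install_ckpt_handler and
consume_ckpt_request are hypothetical, the SIGUSR1 fallback exists only so the
sketch compiles on systems without SIGCKPTEXIT, and the actual freeze and
post-thaw socket reinitialization are left as a comment rather than guessed at.

```c
#include <signal.h>

/*
 * SIGCKPTEXIT is DragonFly-specific. The SIGUSR1 fallback is an assumption
 * made purely so this sketch builds elsewhere; it is not part of the
 * checkpoint interface.
 */
#ifdef SIGCKPTEXIT
#define CKPT_EXIT_SIGNAL SIGCKPTEXIT
#else
#define CKPT_EXIT_SIGNAL SIGUSR1
#endif

/* Set from the signal handler, read and cleared from the main loop. */
static volatile sig_atomic_t ckpt_requested = 0;

/* Handler does the minimum: record the request and return. */
static void on_ckpt_exit(int signo)
{
    (void)signo;
    ckpt_requested = 1;
}

/*
 * Install the handler. Returns 0 on success, -1 on failure.
 * signal(3) is used for brevity; real code would likely prefer sigaction(2).
 */
static int install_ckpt_handler(void)
{
    return signal(CKPT_EXIT_SIGNAL, on_ckpt_exit) == SIG_ERR ? -1 : 0;
}

/*
 * Called from the main loop. Returns 1 if a request was pending (and
 * clears it), 0 otherwise.
 */
static int consume_ckpt_request(void)
{
    if (!ckpt_requested)
        return 0;
    ckpt_requested = 0;
    /*
     * Hypothetical place for the program-specific work: write the
     * checkpoint (see sys_checkpoint(2)), then, once execution resumes
     * after a thaw, reopen sockets and redo any other initialization
     * that does not survive in the checkpoint file.
     */
    return 1;
}
```

A main loop would call install_ckpt_handler() once at startup and then check
consume_ckpt_request() on each iteration. Keeping the handler down to a flag
assignment avoids calling non-async-signal-safe functions from signal context.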

#3 Updated by sjg over 5 years ago

I should note, because I was not explicit, that the process dumps core while
attempting to reinitialize socket state after thaw.

#4 Updated by sjg about 1 year ago

  • Status changed from New to In Progress
  • Assignee changed from 0 to sjg

This is fixed in my student's GSoC branch, to be merged around mid-term?
