Bug #871

closed

gtk2 related: X mouse pointer jumps and sticks to top left corner

Added by floid over 16 years ago. Updated about 15 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:

Description

I mentioned this at the end of [issue818]. After largely fruitless research,
the subject line here is formulated to contain as many keywords as possible.

I believe I've found the culprit, but I haven't quite found the fix.

The bug:
Running certain clients under Xorg will, after a brief period of mouse activity,
'stick' the cursor to the top left of the screen; click events still work,
movement does not. "Certain clients" are Metacity, Firefox, and gtk-demo,
suggesting gtk2 is the common cause.

Other clients, such as twm, xmms, and xterm, are fine. gdm is also fine, go figure.

Assumed steps to reproduce:
Build pkgsrc-2007Q3's x11/gtk2 and its dependencies from scratch. Build
pkgsrc/wm/metacity (or just use gtk-demo). startx, exit your wm if necessary,
attempt to use any of the culprit programs.

All packages in question are from pkgsrc-2007Q3, run on my 1.11.0-PREVIEW built
23-Nov-2007. The gtk2 in that tree is version 2.12.0.

Questions:

How should I force pkgsrc to build with -O at most?

Looks like Metacity, at least, eventually winds up ignoring the definition of
BSD_INSTALL_PROGRAM in mk.conf even after detecting it during configure. Is
there really no nice way to preserve debugging symbols other than 'overloading'
strip to do nothing? Is this a job for a varsym?

Actions #1

Updated by sepherosa over 16 years ago

On Dec 2, 2007 11:27 AM, Joe Floid Kanowitz wrote:

New submission from Joe "Floid" Kanowitz:

I mentioned this at the end of [issue818]. After largely fruitless research,
the subject line here is formulated to contain as many keywords as possible.

I believe I've found the culprit, but I haven't quite found the fix.

If you are using a PS/2 mouse, I have the following workaround:
use /dev/psm0 instead of /dev/sysmouse and stop using moused.

Best Regards,
sephe

Actions #2

Updated by floid over 16 years ago

sephe's workaround -- switch to a PS/2 mouse and have Xorg use the psm0 device
directly -- works for me. It's too bad ums0 can't play as nicely, since once
moused has it nothing else can touch it.

Freaky. I'll try to investigate; does anyone have answers re: the best, most
"system-friendly" way to preserve debugging symbols in pkgsrc ports?

Actions #3

Updated by c.turner over 16 years ago

There was a thread on this a couple of weeks back. The summary
is that the FPU state isn't preserved in signal handlers. I said I'd
summarize a quick investigation of what needs to be changed where but
never did. Perhaps it's time to do so; I'm digging in my notes and will
respond.

Actions #4

Updated by c.turner over 16 years ago

summary of thread on user@ circa 11/08/2007:

With respect to Xorg & the actual mouse issue reported here:

- using psm0 instead of sysmouse is a workaround
- using xf86-input-mouse-1.2.1 will reportedly prevent the problem

According to Joerg, the issue is actually that the signal handlers
(presumably in xf86-input-mouse) use floating point state, which is
currently not saved/restored in the dragonfly signal handling code.

Not sure why, as discussed in this bug, GTK appears to trigger
the problem. My guess is that it is a red herring, although hacking
the specific pkg Makefile / patches might result in a good debug binary.

The specific signal-handling error has been filed as #875.

Actions #5

Updated by joerg over 16 years ago

No, the signal handler is elsewhere in the Xorg server. Reports include
that Mesa could be the source for that. It is very strange what they do
in the signal handlers :-( Xrender might be another source, no idea.

Joerg

Actions #6

Updated by wa1ter over 16 years ago

On Wed, 5 Dec 2007, Joe "Floid" Kanowitz wrote:

This seems to work for me:
# CFLAGS+='-g' INSTALL_UNSTRIPPED=yes bmake

I came up with that just by grepping through /usr/pkgsrc/mk/* which
is worth a look (or several) just to get a feel for what kinds of
bullets you can put through your own foot :o)

Actions #7

Updated by joerg over 16 years ago

Use CFLAGS='-O -g' INSTALL_UNSTRIPPED=yes bmake.

The shell can't really use += and I am surprised it works at all.
Don't disable optimisation completely, you can run into issues with
that.

Joerg

Actions #8

Updated by c.turner over 16 years ago

Joerg Sonnenberger wrote:

Sorry to misquote. When you say X server, does this mean it's driver
specific? (I was using radeon in my case.)

I don't think I was using Mesa on my machine unless there is some kind
of nested dependency I'm unaware of (via GTK, xscreensaver initializes
it on startup, etc.).

Anyhow, my bug report (#875) seems to have inspired Matt, so perhaps
we'll have a fix soon, as much as I wish I could have been the hero :)

Actions #9

Updated by wa1ter over 16 years ago

Sorry for spamming the list with this. I tried to answer your private
email but spamhaus.org is blacklisting my IP address. I complained to
my ISP, so the problem will be resolved soon. (Not!)

Actions #10

Updated by dillon over 16 years ago

Ok, I would like people with this problem to test this patch.

(1) BEFORE applying the patch, get your X server set up so the mouse pointer
bugs out.
(2) Apply the patch.
(3) Re-test.

This code isn't very efficient but, hey, it's a signal handler so at
the moment I don't care.

p.s. for some reason the regression test failed to detect the FP
trashing in my pre-patch tests. I don't know why.

-Matt

Index: platform/pc32/i386/machdep.c
===================================================================
RCS file: /cvs/src/sys/platform/pc32/i386/machdep.c,v
retrieving revision 1.128
diff -u -p -r1.128 machdep.c
--- platform/pc32/i386/machdep.c 7 Nov 2007 17:42:50 -0000 1.128
+++ platform/pc32/i386/machdep.c 7 Dec 2007 02:56:39 -0000
@@ -498,6 +498,11 @@ tf->tf_eflags &= ~(PSL_VM | PSL_NT | P
 }
Index: platform/pc32/include/md_var.h
===================================================================
RCS file: /cvs/src/sys/platform/pc32/include/md_var.h,v
retrieving revision 1.25
diff -u -p -r1.25 md_var.h
--- platform/pc32/include/md_var.h 9 Jan 2007 23:34:03 -0000 1.25
+++ platform/pc32/include/md_var.h 7 Dec 2007 01:15:37 -0000
@@ -69,6 +69,7 @@
 struct dbreg;
 struct mdglobaldata;
 struct thread;
+struct __mcontext;
Index: platform/pc32/isa/npx.c
===================================================================
RCS file: /cvs/src/sys/platform/pc32/isa/npx.c,v
retrieving revision 1.42
diff -u -p -r1.42 npx.c
--- platform/pc32/isa/npx.c 22 Feb 2007 15:50:49 -0000 1.42
+++ platform/pc32/isa/npx.c 7 Dec 2007 03:12:46 -0000
@@ -513,7 +513,7 @@ */
 void
 npxinit(u_short control)
 {
-    static union savefpu dummy;
+    static union savefpu dummy __aligned(16);

     if (!npx_exists)
         return;
@@ -861,6 +861,17 @@ kprintf("npxdna: npxthread = %p, curth
                mdcpu->gd_npxthread, curthread);
         panic("npxdna");
     }
+
+    /*
+     * Setup the initial saved state if the thread has never before
+     * used the FP unit. This also occurs when a thread pushes a
+     * signal handler and uses FP in the handler.
+     */
+    if ((curthread->td_flags & TDF_USINGFP) == 0) {
+        curthread->td_flags |= TDF_USINGFP;
+        npxinit(INITIAL_NPXCW);
+    }
+
     /*
      * The setting of gd_npxthread and the call to fpurstor() must not
      * be preempted by an interrupt thread or we will take an npxdna
@@ -975,6 +986,78 @@ #endif
         fnsave(addr);
 }

+/*
+ * Save the FP state to the mcontext structure.
+ *
+ * WARNING: If you want to try to npxsave() directly to mctx->mc_fpregs,
+ * then it MUST be 16-byte aligned. Currently this is not guarenteed.
+ */
+void
+npxpush(mcontext_t *mctx)
+{
+    thread_t td = curthread;
+
+    if (td->td_flags & TDF_USINGFP) {
+        if (mdcpu->gd_npxthread == td) {
+            /*
+             * XXX Note: This is a bit inefficient if the signal
+             * handler uses floating point, extra faults will
+             * occur.
+             */
+            mctx->mc_ownedfp = _MC_FPOWNED_FPU;
+            npxsave(td->td_savefpu);
+        } else {
+            mctx->mc_ownedfp = _MC_FPOWNED_PCB;
+        }
+        bcopy(td->td_savefpu, mctx->mc_fpregs, sizeof(mctx->mc_fpregs));
+        td->td_flags &= ~TDF_USINGFP;
+    } else {
+        mctx->mc_ownedfp = _MC_FPOWNED_NONE;
+    }
+}
+
+/*
+ * Restore the FP state from the mcontext structure.
+ */
+void
+npxpop(mcontext_t *mctx)
+{
+    thread_t td = curthread;
+
+    switch(mctx->mc_ownedfp) {
+    case _MC_FPOWNED_NONE:
+        /*
+         * If the signal handler used the FP unit but the interrupted
+         * code did not, release the FP unit. Clear TDF_USINGFP will
+         * force the FP unit to reinit so the interrupted code sees
+         * a clean slate.
+         */
+        if (td->td_flags & TDF_USINGFP) {
+            if (td == mdcpu->gd_npxthread)
+                npxsave(td->td_savefpu);
+            td->td_flags &= ~TDF_USINGFP;
+        }
+        break;
+    case _MC_FPOWNED_FPU:
+    case _MC_FPOWNED_PCB:
+        /*
+         * Clear ownership of the FP unit and restore our saved state.
+         *
+         * NOTE: The signal handler may have set-up some FP state and
+         * enabled the FP unit, so we have to restore no matter what.
+         *
+         * XXX: This is bit inefficient, if the code being returned
+         * to is actively using the FP this results in multiple
+         * kernel faults.
+         */
+        if (td == mdcpu->gd_npxthread)
+            npxsave(td->td_savefpu);
+        bcopy(mctx->mc_fpregs, td->td_savefpu, sizeof(*td->td_savefpu));
+        td->td_flags |= TDF_USINGFP;
+        break;
+    }
+}
+
 #ifndef CPU_DISABLE_SSE
 /*
  * On AuthenticAMD processors, the fxrstor instruction does not restore
Index: sys/thread.h
===================================================================
RCS file: /cvs/src/sys/sys/thread.h,v
retrieving revision 1.89
diff -u -p -r1.89 thread.h
--- sys/thread.h 18 Nov 2007 09:53:19 -0000 1.89
+++ sys/thread.h 7 Dec 2007 01:12:37 -0000
@@ -284,6 +284,7 @@ #define TDF_PANICWARN 0x00080000 /* pan
 #define TDF_BLOCKQ 0x00100000 /* on block queue */
 #define TDF_MPSAFE 0x00200000 /* (thread creation) */
 #define TDF_EXITING 0x00400000 /* thread exiting */
+#define TDF_USINGFP 0x00800000 /* thread using fp coproc */

 /*
  * Thread priorities.  Typically only one thread from any given

Actions #11

Updated by dillon over 16 years ago

: p.s. for some reason the regression test failed to detect the FP
: trashing in my pre-patch tests. I don't know why.
:
: -Matt

Clarification:  That is, some of the time it failed to detect the
trashing. Other times it did detect it. Very odd. Still don't know
why.
-Matt
Actions #12

Updated by floid over 16 years ago

Mixed results! gdm and twm still work, Metacity seems happy with the patch
(though I should let it go longer), but Firefox or gnome-session now freeze
everything -- no response to ping -- instantly, instead of killing the mouse but
otherwise working.

Still using an Athlon 64 3800+ x2 with a SMP kernel if that matters.

It does feel like it cured a subtle (thought the switch might have gotten
corroded from disuse) glitch where clicks were occasionally not registering
properly.

Actions #13

Updated by dillon over 16 years ago

:Joe "Floid" Kanowitz <> added the comment:
:
:Mixed results! gdm and twm still work, Metacity seems happy with the patch
:(though I should let it go longer), but Firefox or gnome-session now freeze
:everything -- no response to ping -- instantly, instead of killing the mouse but
:otherwise working.
:
:Still using an Athlon 64 3800+ x2 with a SMP kernel if that matters.
:
:It does feel like it cured a subtle (thought the switch might have gotten
:corroded from disuse) glitch where clicks were occasionally not registering
:properly.

If no response to a ping then it sounds like the machine crashed.
Is it possible to connect that box to another one via a serial
port and boot with the console over the serial port? Then you'd
see the panic.
It may also be possible to build it with auto-dumping turned on
so it automatically tries to dump the kernel core instead of sitting
at DDB prompt. Make sure a dump device is configured and try building
a kernel with DDB_UNATTENDED:

options DDB
options DDB_UNATTENDED
options INVARIANTS

I think we are making progress and I'll bet the freeze is just a panic
occurring due to something I must have missed w/ the patch.

I will also try to reproduce it here.
-Matt
Actions #14

Updated by dillon over 16 years ago

Ok, it looks like firefox is messing up the signal stack and this
caused a GP fault in the kernel due to reserved bits in the MXCSR
field in the floating point saved state being set by userland.

I have added code to report and clear the bits.  I do not know why
firefox is blowing up the state, though.
I also fixed a bug in krateprintf() in kern/subr_prf.c (not included)...
I committed that fix directly.
PATCH #2 enclosed.
-Matt
Matthew Dillon

Index: platform/pc32/i386/machdep.c
===================================================================
RCS file: /cvs/src/sys/platform/pc32/i386/machdep.c,v
retrieving revision 1.128
diff -u -p -r1.128 machdep.c
--- platform/pc32/i386/machdep.c 7 Nov 2007 17:42:50 -0000 1.128
+++ platform/pc32/i386/machdep.c 7 Dec 2007 02:56:39 -0000
@@ -498,6 +498,11 @@ tf->tf_eflags &= ~(PSL_VM | PSL_NT | P
 }
Index: platform/pc32/include/md_var.h
===================================================================
RCS file: /cvs/src/sys/platform/pc32/include/md_var.h,v
retrieving revision 1.25
diff -u -p -r1.25 md_var.h
--- platform/pc32/include/md_var.h 9 Jan 2007 23:34:03 -0000 1.25
+++ platform/pc32/include/md_var.h 7 Dec 2007 01:15:37 -0000
@@ -69,6 +69,7 @@ struct fpreg;
 struct dbreg;
 struct mdglobaldata;
 struct thread;
+struct __mcontext;

 void busdma_swi (void);
 void cpu_gdinit (struct mdglobaldata *gd, int cpu);
@@ -113,5 +114,7 @@ int selec);
 void userconfig (void);
 int user_dbreg_trap (void);
 int npxdna(void);
+void npxpush(struct __mcontext *mctx);
+void npxpop(struct __mcontext *mctx);

 #endif /* !_MACHINE_MD_VAR_H_ */
Index: platform/pc32/isa/npx.c
===================================================================
RCS file: /cvs/src/sys/platform/pc32/isa/npx.c,v
retrieving revision 1.42
diff -u -p -r1.42 npx.c
--- platform/pc32/isa/npx.c 22 Feb 2007 15:50:49 -0000 1.42
+++ platform/pc32/isa/npx.c 8 Dec 2007 19:56:47 -0000
@@ -210,6 +210,8 @@ iret \n\
 ");
 #endif /* SMP */

+static struct krate badfprate = { 1 };
+
 /*
  * Probe routine. Initialize cr0 to give correct behaviour for [f]wait
  * whether the device exists or not (XXX should be elsewhere). Set flags
@@ -513,7 +515,7 @@ */
 void
 npxinit(u_short control)
 {
-    static union savefpu dummy;
+    static union savefpu dummy __aligned(16);

     if (!npx_exists)
         return;
@@ -852,15 +854,29 @@ */
 int
 npxdna(void)
 {
+    thread_t td = curthread;
     u_long *exstat;
+    int didinit = 0;

     if (!npx_exists)
         return (0);
     if (mdcpu->gd_npxthread != NULL) {
         kprintf("npxdna: npxthread = %p, curthread = %p\n",
-               mdcpu->gd_npxthread, curthread);
+               mdcpu->gd_npxthread, td);
         panic("npxdna");
     }
+
+    /*
+     * Setup the initial saved state if the thread has never before
+     * used the FP unit. This also occurs when a thread pushes a
+     * signal handler and uses FP in the handler.
+     */
+    if ((td->td_flags & TDF_USINGFP) == 0) {
+        td->td_flags |= TDF_USINGFP;
+        npxinit(INITIAL_NPXCW);
+        didinit = 1;
+    }
+
     /*
      * The setting of gd_npxthread and the call to fpurstor() must not
      * be preempted by an interrupt thread or we will take an npxdna
@@ -873,8 +889,8 @@ stop_emulating();
     /*
      * Record new context early in case frstor causes an IRQ13.
      */
-    mdcpu->gd_npxthread = curthread;
-    exstat = GET_FPU_EXSW_PTR(curthread);
+    mdcpu->gd_npxthread = td;
+    exstat = GET_FPU_EXSW_PTR(td);
     *exstat = 0;
     /*
      * The following frstor may cause an IRQ13 when the state being
@@ -888,7 +904,13 @@ * 386/Cyrix 387 system, fnclex works c
      * fnsave are broken, so our treatment breaks fnclex if it is the
      * first FPU instruction after a context switch.
      */
-    fpurstor(curthread->td_savefpu);
+    if (td->td_savefpu->sv_xmm.sv_env.en_mxcsr & ~0xFFBF) {
+        krateprintf(&badfprate,
+                    "FXRSTR: illegal FP MXCSR %08x didinit = %d\n",
+                    td->td_savefpu->sv_xmm.sv_env.en_mxcsr, didinit);
+        td->td_savefpu->sv_xmm.sv_env.en_mxcsr &= 0xFFBF;
+    }
+    fpurstor(td->td_savefpu);
     crit_exit();

     return (1);
@@ -975,6 +997,90 @@ #endif
         fnsave(addr);
 }

+/*
+ * Save the FP state to the mcontext structure.
+ *
+ * WARNING: If you want to try to npxsave() directly to mctx->mc_fpregs,
+ * then it MUST be 16-byte aligned. Currently this is not guarenteed.
+ */
+void
+npxpush(mcontext_t *mctx)
+{
+    thread_t td = curthread;
+
+    if (td->td_flags & TDF_USINGFP) {
+        if (mdcpu->gd_npxthread == td) {
+            /*
+             * XXX Note: This is a bit inefficient if the signal
+             * handler uses floating point, extra faults will
+             * occur.
+             */
+            mctx->mc_ownedfp = _MC_FPOWNED_FPU;
+            npxsave(td->td_savefpu);
+        } else {
+            mctx->mc_ownedfp = _MC_FPOWNED_PCB;
+        }
+        bcopy(td->td_savefpu, mctx->mc_fpregs, sizeof(mctx->mc_fpregs));
+        td->td_flags &= ~TDF_USINGFP;
+    } else {
+        mctx->mc_ownedfp = _MC_FPOWNED_NONE;
+    }
+}
+
+/*
+ * Restore the FP state from the mcontext structure.
+ */
+void
+npxpop(mcontext_t *mctx)
+{
+    thread_t td = curthread;
+
+    switch(mctx->mc_ownedfp) {
+    case _MC_FPOWNED_NONE:
+        /*
+         * If the signal handler used the FP unit but the interrupted
+         * code did not, release the FP unit. Clear TDF_USINGFP will
+         * force the FP unit to reinit so the interrupted code sees
+         * a clean slate.
+         */
+        if (td->td_flags & TDF_USINGFP) {
+            if (td == mdcpu->gd_npxthread)
+                npxsave(td->td_savefpu);
+            td->td_flags &= ~TDF_USINGFP;
+        }
+        break;
+    case _MC_FPOWNED_FPU:
+    case _MC_FPOWNED_PCB:
+        /*
+         * Clear ownership of the FP unit and restore our saved state.
+         *
+         * NOTE: The signal handler may have set-up some FP state and
+         * enabled the FP unit, so we have to restore no matter what.
+         *
+         * XXX: This is bit inefficient, if the code being returned
+         * to is actively using the FP this results in multiple
+         * kernel faults.
+         *
+         * WARNING: The saved state was exposed to userland and may
+         * have to be sanitized to avoid a GP fault in the kernel.
+         */
+        if (td == mdcpu->gd_npxthread)
+            npxsave(td->td_savefpu);
+        bcopy(mctx->mc_fpregs, td->td_savefpu, sizeof(*td->td_savefpu));
+        if (td->td_savefpu->sv_xmm.sv_env.en_mxcsr & ~0xFFBF) {
+            krateprintf(&badfprate,
+                        "pid %d (%s) signal return from user: "
+                        "illegal FP MXCSR %08x\n",
+                        td->td_proc->p_pid,
+                        td->td_proc->p_comm,
+                        td->td_savefpu->sv_xmm.sv_env.en_mxcsr);
+            td->td_savefpu->sv_xmm.sv_env.en_mxcsr &= 0xFFBF;
+        }
+        td->td_flags |= TDF_USINGFP;
+        break;
+    }
+}
+
 #ifndef CPU_DISABLE_SSE
 /*
  * On AuthenticAMD processors, the fxrstor instruction does not restore
Index: sys/thread.h
===================================================================
RCS file: /cvs/src/sys/sys/thread.h,v
retrieving revision 1.89
diff -u -p -r1.89 thread.h
--- sys/thread.h 18 Nov 2007 09:53:19 -0000 1.89
+++ sys/thread.h 7 Dec 2007 01:12:37 -0000
@@ -284,6 +284,7 @@ #define TDF_PANICWARN 0x00080000 /* pan
 #define TDF_BLOCKQ 0x00100000 /* on block queue */
 #define TDF_MPSAFE 0x00200000 /* (thread creation) */
 #define TDF_EXITING 0x00400000 /* thread exiting */
+#define TDF_USINGFP 0x00800000 /* thread using fp coproc */

 /*
  * Thread priorities.  Typically only one thread from any given

Actions #15

Updated by rumcic over 16 years ago

I tested this patch (I'm another one who got his mouse stuck in X) and I
get many messages from firefox, "kernel: pid 1158 (firefox-bin) signal return
from user: illegal FP MXCSR ffff0010" (the message repeats and does not
change), and the firefox process itself seems to be stuck, eating up as much
CPU as it can without doing anything.

Matthew Dillon wrote:

..

/*
 * Thread priorities. Typically only one thread from any given

--
Regards,
Rumko

Actions #16

Updated by dillon over 16 years ago

:I tested this patch (I'm also another one, who got his mouse stuck in X) and I
:get many messages from firefox "kernel: pid 1158 (firefox-bin) signal return
:from user: illegal FP MXCSR ffff0010" (the message repeats and does not
:change) and the firefox process itself seems to be stuck, eating up as much
:CPU as it can, but doesn't do anything.

Update /usr/src/sys/kern/subr_prf.c to the latest, which fixes krateprintf().
You will see only one message / sec then.
When I test this, the non-GTK firefox seg-faults but the GTK firefox does in
fact seem to work. The non-GTK firefox seg-faults even without the FP changes
so I don't think they made things worse.
firefox-gtk1-2.0.0.8 seems to work.
-Matt
Matthew Dillon
Actions #17

Updated by rumcic over 16 years ago

My subr_prf.c is the latest (1.19), and yes, I only got 1 message per second,
but that still seems a lot. As far as firefox is concerned, I'm running
2.0.0.9, but I also noticed something else: I left the machine on overnight
(after killing firefox) and the exact same message appears for knode, kmail
and one of the kdeinit processes. These don't seem to get stuck, though; they
continue working without problems.

Actions #18

Updated by rumcic over 16 years ago

Also, just now I managed to get my mouse stuck to the far top-left corner. USB
mouse and X was using sysmouse.

Actions #19

Updated by dillon over 16 years ago

:Also, just now I managed to get my mouse stuck to the far top-left corner. USB
:mouse and X was using sysmouse.

I don't know what else can be done on the OS side. When I look at the
floating point save area on the stack from firefox dumps, it contains
total garbage. I think firefox and/or gtk is munging the stack somewhere.

I think the changes are worth committing since this is what other OSes are
doing. We'll just have to work from this base to resolve the remaining
issues.
-Matt