Oracle bug 7708133
FP corruption running a JRockit test program
Chuck Anderson
chuck.anderson@oracle.com
06/10/2009

This bug was discovered by running a JRockit test program that did roughly the
following:

	- alloc a page; mark it as not writeable.
	- create a bunch of threads that have a SIGSEGV signal handler
	- the threads do the following in a loop:
		- store a marker value in xmm0
		- attempt to write to the non-writeable page to trigger a
		  #GP.  The kernel's #GP handler calls the thread's SIGSEGV
		  signal handler.
		- the SIGSEGV signal handler:
			- writes a different marker value in xmm0
			- marks the page as writeable
		- on return from the signal handler:
			- if xmm0 does not have the original marker value then
			  issue an error message and cause the program to exit
			- else mark page as not writeable

The test program running under EL5U3 fails within 10 seconds in both a PV
guest and on bare metal.

With the fixes described below, three concurrent instances of the test program
ran without error for several hours.

TBD: There are other issues that need to be fixed
	- cleaning up FP state after an error saving/restoring it.
	- sequences that I believe need to be made non-preemptable

TBD: Need to patch 32-bit code.

******************
i387.c save_i387()
******************

A test program causes the following scenario:

	- user thread writes to xmm0.  This will cause PF_USED_MATH to be set
	  in struct task_struct.flags indicating that this task has used the
	  FPU (there is state in FP regs such as xmm0).  used_math() will
	  return a non-zero value if the current task's PF_USED_MATH is set.
	- user thread attempts to write to a non-writeable page.  The user
	  thread has a SIGSEGV handler.
	- kernel general protection exception handler is invoked.
	- kernel #GP path sets up to call user thread SIGSEGV handler.
		- used_math(), called from setup_rt_frame(),  returns non-zero
		  (current task's PF_USED_MATH is set) so save_i387() is
		  called to save the user thread's FP state so that it can be
		  restored after returning from the signal handler.

This code sequence in save_i387() is not correct:

	call clear_used_math()
	save FP state in a user buffer
	return 1

clear_used_math() clears PF_USED_MATH indicating that the task does not have
FP state that needs to be saved before switching to another context.  If we
get preempted before we save our FP state, our FP state would be cleared
because PF_USED_MATH is not set.  For example, we could get a page fault
attempting to save our FP state to the user's buffer.  When the page fault
handling completes and returns to the interrupted context (the #GP handler),
the user thread FP state is not restored (because it wasn't saved) and the
xmm registers are cleared.  The original #GP path continues, saving the now
cleared FP state.  That cleared state will be "restored" when we return to
the user thread from the signal handler.  This causes the test program to
complain that xmm0's value is NULL rather than the non-zero value it
had prior to the general protection fault.

This problem affects 32 and 64-bit flavors of EL3, EL4 and EL5 (at least
through 2.6.18-128.1.1.0.1), both bare metal and OVM guests.

The fix is to move the call to clear_used_math() after the save of the FP
state.

*****************************
signal.c restore_sigcontext()
*****************************

If we needed to save FP state prior to calling a signal handler (used_math()
was true) then we saved it to the address pointed to by
struct sigcontext.fpstate.  If !used_math() then we set fpstate = NULL.
restore_sigcontext() is called to restore that context after returning from
the signal handler.  If fpstate != NULL then restore_i387() is called to
restore FP state.

The problem is that TS_USEDFPU may not be set.  When set, TS_USEDFPU indicates
that the FP state for the task is in the FP registers.  We just restored
state to the FP registers so TS_USEDFPU must also be set.  Otherwise, if
we are preempted after restoring FP state then context switching code will
assume that there is no context in the FP registers that needs to be saved.
It will assume that the FP context for this task has been saved to
struct task_struct.thread.i387.fxsave, which has a stale context.  It could
be from the signal handler causing us to inherit the signal handler's FP
values.  The correct state is in the FP registers.

The fix is to check if TS_USEDFPU is set.  If not, set it and make sure
CR0.ts is not set (clts()) to indicate that the FPU has this task's
context and is ready for use.  We must also clear TS_USEDFPU when we save our
FP context in save_i387() and set CR0.ts (stts()) to indicate that the
FPU must be initialized on first use by the signal handler.

diff -up linux-2.6.18.x86_64/arch/x86_64/kernel/i387.c.orig linux-2.6.18.x86_64/arch/x86_64/kernel/i387.c
--- linux-2.6.18.x86_64/arch/x86_64/kernel/i387.c.orig	2009-05-29 13:15:35.000000000 -0700
+++ linux-2.6.18.x86_64/arch/x86_64/kernel/i387.c	2009-06-10 10:09:41.000000000 -0700
@@ -93,17 +93,18 @@ int save_i387(struct _fpstate __user *bu
 
 	if (!used_math())
 		return 0;
-	clear_used_math(); /* trigger finit */
 	if (task_thread_info(tsk)->status & TS_USEDFPU) {
 		err = save_i387_checking((struct i387_fxsave_struct __user *)buf);
 		if (err) return err;
+		task_thread_info(tsk)->status &= ~TS_USEDFPU;
 		stts();
-		} else {
+	} else {
 		if (__copy_to_user(buf, &tsk->thread.i387.fxsave, 
 				   sizeof(struct i387_fxsave_struct)))
 			return -1;
 	} 
-		return 1;
+	clear_used_math(); /* trigger finit */
+	return 1;
 }
 
 /*
diff -up linux-2.6.18.x86_64/arch/x86_64/kernel/signal.c.orig linux-2.6.18.x86_64/arch/x86_64/kernel/signal.c
--- linux-2.6.18.x86_64/arch/x86_64/kernel/signal.c.orig	2009-05-27 13:15:01.000000000 -0700
+++ linux-2.6.18.x86_64/arch/x86_64/kernel/signal.c	2009-06-10 10:07:47.000000000 -0700
@@ -95,14 +95,22 @@ restore_sigcontext(struct pt_regs *regs,
 
 	{
 		struct _fpstate __user * buf;
+		struct task_struct *me = current;
+
 		err |= __get_user(buf, &sc->fpstate);
 
 		if (buf) {
 			if (!access_ok(VERIFY_READ, buf, sizeof(*buf)))
 				goto badframe;
+			if (!used_math()) {
+				init_fpu(me);
+			}
+			if (!(task_thread_info(current)->status & TS_USEDFPU)) {
+				clts();
+				task_thread_info(current)->status |= TS_USEDFPU;
+			}
 			err |= restore_i387(buf);
 		} else {
-			struct task_struct *me = current;
 			if (used_math()) {
 				clear_fpu(me);
 				clear_used_math();
