Initial commit; kernel source import

2025-04-06 23:50:55 -05:00
commit 25c6d769f4
45093 changed files with 18199410 additions and 0 deletions
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -0,0 +1,32 @@
+00-INDEX
+	- This file
+arrayRCU.txt
+	- Using RCU to Protect Read-Mostly Arrays
+checklist.txt
+	- Review Checklist for RCU Patches
+listRCU.txt
+	- Using RCU to Protect Read-Mostly Linked Lists
+lockdep.txt
+	- RCU and lockdep checking
+NMI-RCU.txt
+	- Using RCU to Protect Dynamic NMI Handlers
+rcubarrier.txt
+	- RCU and Unloadable Modules
+rculist_nulls.txt
+	- RCU list primitives for use with SLAB_DESTROY_BY_RCU
+rcuref.txt
+	- Reference-count design for elements of lists/arrays protected by RCU
+rcu.txt
+	- RCU Concepts
+RTFP.txt
+	- List of RCU papers (bibliography) going back to 1980.
+stallwarn.txt
+	- RCU CPU stall warnings (module parameter rcu_cpu_stall_suppress)
+torture.txt
+	- RCU Torture Test Operation (CONFIG_RCU_TORTURE_TEST)
+trace.txt
+	- CONFIG_RCU_TRACE debugfs files and formats
+UP.txt
+	- RCU on Uniprocessor Systems
+whatisRCU.txt
+	- What is RCU?
--- a/Documentation/RCU/NMI-RCU.txt
+++ b/Documentation/RCU/NMI-RCU.txt
@@ -0,0 +1,120 @@
+Using RCU to Protect Dynamic NMI Handlers
+
+
+Although RCU is usually used to protect read-mostly data structures,
+it is possible to use RCU to provide dynamic non-maskable interrupt
+handlers, as well as dynamic irq handlers.  This document describes
+how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer
+work in "arch/x86/oprofile/nmi_timer_int.c" and in
+"arch/x86/kernel/traps.c".
+
+The relevant pieces of code are listed below, each followed by a
+brief explanation.
+
+	static int dummy_nmi_callback(struct pt_regs *regs, int cpu)
+	{
+		return 0;
+	}
+
+The dummy_nmi_callback() function is a "dummy" NMI handler that does
+nothing, but returns zero, thus saying that it did nothing, allowing
+the NMI handler to take the default machine-specific action.
+
+	static nmi_callback_t nmi_callback = dummy_nmi_callback;
+
+This nmi_callback variable is a global function pointer to the current
+NMI handler.
+
+	void do_nmi(struct pt_regs * regs, long error_code)
+	{
+		int cpu;
+
+		nmi_enter();
+
+		cpu = smp_processor_id();
+		++nmi_count(cpu);
+
+		if (!rcu_dereference_sched(nmi_callback)(regs, cpu))
+			default_do_nmi(regs);
+
+		nmi_exit();
+	}
+
+The do_nmi() function processes each NMI.  It first disables preemption
+in the same way that a hardware irq would, then increments the per-CPU
+count of NMIs.  It then invokes the NMI handler stored in the nmi_callback
+function pointer.  If this handler returns zero, do_nmi() invokes the
+default_do_nmi() function to handle a machine-specific NMI.  Finally,
+preemption is restored.
+
+In theory, rcu_dereference_sched() is not needed, since this code runs
+only on i386, which in theory does not need rcu_dereference_sched()
+anyway.  However, in practice it is a good documentation aid, particularly
+for anyone attempting to do something similar on Alpha or on systems
+with aggressive optimizing compilers.
+
+Quick Quiz:  Why might the rcu_dereference_sched() be necessary on Alpha,
+	     given that the code referenced by the pointer is read-only?
+
+
+Back to the discussion of NMI and RCU...
+
+	void set_nmi_callback(nmi_callback_t callback)
+	{
+		rcu_assign_pointer(nmi_callback, callback);
+	}
+
+The set_nmi_callback() function registers an NMI handler.  Note that any
+data that is to be used by the callback must be initialized up -before-
+the call to set_nmi_callback().  On architectures that do not order
+writes, the rcu_assign_pointer() ensures that the NMI handler sees the
+initialized values.
+
+	void unset_nmi_callback(void)
+	{
+		rcu_assign_pointer(nmi_callback, dummy_nmi_callback);
+	}
+
+This function unregisters an NMI handler, restoring the original
+dummy_nmi_handler().  However, there may well be an NMI handler
+currently executing on some other CPU.  We therefore cannot free
+up any data structures used by the old NMI handler until execution
+of it completes on all other CPUs.
+
+One way to accomplish this is via synchronize_sched(), perhaps as
+follows:
+
+	unset_nmi_callback();
+	synchronize_sched();
+	kfree(my_nmi_data);
+
+This works because synchronize_sched() blocks until all CPUs complete
+any preemption-disabled segments of code that they were executing.
+Since NMI handlers disable preemption, synchronize_sched() is guaranteed
+not to return until all ongoing NMI handlers exit.  It is therefore safe
+to free up the handler's data as soon as synchronize_sched() returns.
+
+Important note: for this to work, the architecture in question must
+invoke nmi_enter() and nmi_exit() on NMI entry and exit, respectively.
+
+
+Answer to Quick Quiz
+
+	Why might the rcu_dereference_sched() be necessary on Alpha, given
+	that the code referenced by the pointer is read-only?
+
+	Answer: The caller to set_nmi_callback() might well have
+		initialized some data that is to be used by the new NMI
+		handler.  In this case, the rcu_dereference_sched() would
+		be needed, because otherwise a CPU that received an NMI
+		just after the new handler was set might see the pointer
+		to the new NMI handler, but the old pre-initialized
+		version of the handler's data.
+
+		This same sad story can happen on other CPUs when using
+		a compiler with aggressive pointer-value speculation
+		optimizations.
+
+		More important, the rcu_dereference_sched() makes it
+		clear to someone reading the code that the pointer is
+		being protected by RCU-sched.
--- a/Documentation/RCU/RTFP.txt
+++ b/Documentation/RCU/RTFP.txt
--- a/Documentation/RCU/UP.txt
+++ b/Documentation/RCU/UP.txt
@@ -0,0 +1,135 @@
+RCU on Uniprocessor Systems
+
+
+A common misconception is that, on UP systems, the call_rcu() primitive
+may immediately invoke its function.  The basis of this misconception
+is that since there is only one CPU, it should not be necessary to
+wait for anything else to get done, since there are no other CPUs for
+anything else to be happening on.  Although this approach will -sort- -of-
+work a surprising amount of the time, it is a very bad idea in general.
+This document presents three examples that demonstrate exactly how bad
+an idea this is.
+
+
+Example 1: softirq Suicide
+
+Suppose that an RCU-based algorithm scans a linked list containing
+elements A, B, and C in process context, and can delete elements from
+this same list in softirq context.  Suppose that the process-context scan
+is referencing element B when it is interrupted by softirq processing,
+which deletes element B, and then invokes call_rcu() to free element B
+after a grace period.
+
+Now, if call_rcu() were to directly invoke its arguments, then upon return
+from softirq, the list scan would find itself referencing a newly freed
+element B.  This situation can greatly decrease the life expectancy of
+your kernel.
+
+This same problem can occur if call_rcu() is invoked from a hardware
+interrupt handler.
+
+
+Example 2: Function-Call Fatality
+
+Of course, one could avert the suicide described in the preceding example
+by having call_rcu() directly invoke its arguments only if it was called
+from process context.  However, this can fail in a similar manner.
+
+Suppose that an RCU-based algorithm again scans a linked list containing
+elements A, B, and C in process contexts, but that it invokes a function
+on each element as it is scanned.  Suppose further that this function
+deletes element B from the list, then passes it to call_rcu() for deferred
+freeing.  This may be a bit unconventional, but it is perfectly legal
+RCU usage, since call_rcu() must wait for a grace period to elapse.
+Therefore, in this case, allowing call_rcu() to immediately invoke
+its arguments would cause it to fail to make the fundamental guarantee
+underlying RCU, namely that call_rcu() defers invoking its arguments until
+all RCU read-side critical sections currently executing have completed.
+
+Quick Quiz #1: why is it -not- legal to invoke synchronize_rcu() in
+	this case?
+
+
+Example 3: Death by Deadlock
+
+Suppose that call_rcu() is invoked while holding a lock, and that the
+callback function must acquire this same lock.  In this case, if
+call_rcu() were to directly invoke the callback, the result would
+be self-deadlock.
+
+In some cases, it would possible to restructure to code so that
+the call_rcu() is delayed until after the lock is released.  However,
+there are cases where this can be quite ugly:
+
+1.	If a number of items need to be passed to call_rcu() within
+	the same critical section, then the code would need to create
+	a list of them, then traverse the list once the lock was
+	released.
+
+2.	In some cases, the lock will be held across some kernel API,
+	so that delaying the call_rcu() until the lock is released
+	requires that the data item be passed up via a common API.
+	It is far better to guarantee that callbacks are invoked
+	with no locks held than to have to modify such APIs to allow
+	arbitrary data items to be passed back up through them.
+
+If call_rcu() directly invokes the callback, painful locking restrictions
+or API changes would be required.
+
+Quick Quiz #2: What locking restriction must RCU callbacks respect?
+
+
+Summary
+
+Permitting call_rcu() to immediately invoke its arguments breaks RCU,
+even on a UP system.  So do not do it!  Even on a UP system, the RCU
+infrastructure -must- respect grace periods, and -must- invoke callbacks
+from a known environment in which no locks are held.
+
+It -is- safe for synchronize_sched() and synchronize_rcu_bh() to return
+immediately on an UP system.  It is also safe for synchronize_rcu()
+to return immediately on UP systems, except when running preemptable
+RCU.
+
+Quick Quiz #3: Why can't synchronize_rcu() return immediately on
+	UP systems running preemptable RCU?
+
+
+Answer to Quick Quiz #1:
+	Why is it -not- legal to invoke synchronize_rcu() in this case?
+
+	Because the calling function is scanning an RCU-protected linked
+	list, and is therefore within an RCU read-side critical section.
+	Therefore, the called function has been invoked within an RCU
+	read-side critical section, and is not permitted to block.
+
+Answer to Quick Quiz #2:
+	What locking restriction must RCU callbacks respect?
+
+	Any lock that is acquired within an RCU callback must be
+	acquired elsewhere using an _irq variant of the spinlock
+	primitive.  For example, if "mylock" is acquired by an
+	RCU callback, then a process-context acquisition of this
+	lock must use something like spin_lock_irqsave() to
+	acquire the lock.
+
+	If the process-context code were to simply use spin_lock(),
+	then, since RCU callbacks can be invoked from softirq context,
+	the callback might be called from a softirq that interrupted
+	the process-context critical section.  This would result in
+	self-deadlock.
+
+	This restriction might seem gratuitous, since very few RCU
+	callbacks acquire locks directly.  However, a great many RCU
+	callbacks do acquire locks -indirectly-, for example, via
+	the kfree() primitive.
+
+Answer to Quick Quiz #3:
+	Why can't synchronize_rcu() return immediately on UP systems
+	running preemptable RCU?
+
+	Because some other task might have been preempted in the middle
+	of an RCU read-side critical section.  If synchronize_rcu()
+	simply immediately returned, it would prematurely signal the
+	end of the grace period, which would come as a nasty shock to
+	that other thread when it started running again.
--- a/Documentation/RCU/arrayRCU.txt
+++ b/Documentation/RCU/arrayRCU.txt
@@ -0,0 +1,141 @@
+Using RCU to Protect Read-Mostly Arrays
+
+
+Although RCU is more commonly used to protect linked lists, it can
+also be used to protect arrays.  Three situations are as follows:
+
+1.  Hash Tables
+
+2.  Static Arrays
+
+3.  Resizeable Arrays
+
+Each of these situations are discussed below.
+
+
+Situation 1: Hash Tables
+
+Hash tables are often implemented as an array, where each array entry
+has a linked-list hash chain.  Each hash chain can be protected by RCU
+as described in the listRCU.txt document.  This approach also applies
+to other array-of-list situations, such as radix trees.
+
+
+Situation 2: Static Arrays
+
+Static arrays, where the data (rather than a pointer to the data) is
+located in each array element, and where the array is never resized,
+have not been used with RCU.  Rik van Riel recommends using seqlock in
+this situation, which would also have minimal read-side overhead as long
+as updates are rare.
+
+Quick Quiz:  Why is it so important that updates be rare when
+	     using seqlock?
+
+
+Situation 3: Resizeable Arrays
+
+Use of RCU for resizeable arrays is demonstrated by the grow_ary()
+function used by the System V IPC code.  The array is used to map from
+semaphore, message-queue, and shared-memory IDs to the data structure
+that represents the corresponding IPC construct.  The grow_ary()
+function does not acquire any locks; instead its caller must hold the
+ids->sem semaphore.
+
+The grow_ary() function, shown below, does some limit checks, allocates a
+new ipc_id_ary, copies the old to the new portion of the new, initializes
+the remainder of the new, updates the ids->entries pointer to point to
+the new array, and invokes ipc_rcu_putref() to free up the old array.
+Note that rcu_assign_pointer() is used to update the ids->entries pointer,
+which includes any memory barriers required on whatever architecture
+you are running on.
+
+	static int grow_ary(struct ipc_ids* ids, int newsize)
+	{
+		struct ipc_id_ary* new;
+		struct ipc_id_ary* old;
+		int i;
+		int size = ids->entries->size;
+
+		if(newsize > IPCMNI)
+			newsize = IPCMNI;
+		if(newsize <= size)
+			return newsize;
+
+		new = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *)*newsize +
+				    sizeof(struct ipc_id_ary));
+		if(new == NULL)
+			return size;
+		new->size = newsize;
+		memcpy(new->p, ids->entries->p,
+		       sizeof(struct kern_ipc_perm *)*size +
+		       sizeof(struct ipc_id_ary));
+		for(i=size;i<newsize;i++) {
+			new->p[i] = NULL;
+		}
+		old = ids->entries;
+
+		/*
+		 * Use rcu_assign_pointer() to make sure the memcpyed
+		 * contents of the new array are visible before the new
+		 * array becomes visible.
+		 */
+		rcu_assign_pointer(ids->entries, new);
+
+		ipc_rcu_putref(old);
+		return newsize;
+	}
+
+The ipc_rcu_putref() function decrements the array's reference count
+and then, if the reference count has dropped to zero, uses call_rcu()
+to free the array after a grace period has elapsed.
+
+The array is traversed by the ipc_lock() function.  This function
+indexes into the array under the protection of rcu_read_lock(),
+using rcu_dereference() to pick up the pointer to the array so
+that it may later safely be dereferenced -- memory barriers are
+required on the Alpha CPU.  Since the size of the array is stored
+with the array itself, there can be no array-size mismatches, so
+a simple check suffices.  The pointer to the structure corresponding
+to the desired IPC object is placed in "out", with NULL indicating
+a non-existent entry.  After acquiring "out->lock", the "out->deleted"
+flag indicates whether the IPC object is in the process of being
+deleted, and, if not, the pointer is returned.
+
+	struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id)
+	{
+		struct kern_ipc_perm* out;
+		int lid = id % SEQ_MULTIPLIER;
+		struct ipc_id_ary* entries;
+
+		rcu_read_lock();
+		entries = rcu_dereference(ids->entries);
+		if(lid >= entries->size) {
+			rcu_read_unlock();
+			return NULL;
+		}
+		out = entries->p[lid];
+		if(out == NULL) {
+			rcu_read_unlock();
+			return NULL;
+		}
+		spin_lock(&out->lock);
+
+		/* ipc_rmid() may have already freed the ID while ipc_lock
+		 * was spinning: here verify that the structure is still valid
+		 */
+		if (out->deleted) {
+			spin_unlock(&out->lock);
+			rcu_read_unlock();
+			return NULL;
+		}
+		return out;
+	}
+
+
+Answer to Quick Quiz:
+
+	The reason that it is important that updates be rare when
+	using seqlock is that frequent updates can livelock readers.
+	One way to avoid this problem is to assign a seqlock for
+	each array entry rather than to the entire array.
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -0,0 +1,431 @@
+Review Checklist for RCU Patches
+
+
+This document contains a checklist for producing and reviewing patches
+that make use of RCU.  Violating any of the rules listed below will
+result in the same sorts of problems that leaving out a locking primitive
+would cause.  This list is based on experiences reviewing such patches
+over a rather long period of time, but improvements are always welcome!
+
+0.	Is RCU being applied to a read-mostly situation?  If the data
+	structure is updated more than about 10% of the time, then you
+	should strongly consider some other approach, unless detailed
+	performance measurements show that RCU is nonetheless the right
+	tool for the job.  Yes, RCU does reduce read-side overhead by
+	increasing write-side overhead, which is exactly why normal uses
+	of RCU will do much more reading than updating.
+
+	Another exception is where performance is not an issue, and RCU
+	provides a simpler implementation.  An example of this situation
+	is the dynamic NMI code in the Linux 2.6 kernel, at least on
+	architectures where NMIs are rare.
+
+	Yet another exception is where the low real-time latency of RCU's
+	read-side primitives is critically important.
+
+1.	Does the update code have proper mutual exclusion?
+
+	RCU does allow -readers- to run (almost) naked, but -writers- must
+	still use some sort of mutual exclusion, such as:
+
+	a.	locking,
+	b.	atomic operations, or
+	c.	restricting updates to a single task.
+
+	If you choose #b, be prepared to describe how you have handled
+	memory barriers on weakly ordered machines (pretty much all of
+	them -- even x86 allows later loads to be reordered to precede
+	earlier stores), and be prepared to explain why this added
+	complexity is worthwhile.  If you choose #c, be prepared to
+	explain how this single task does not become a major bottleneck on
+	big multiprocessor machines (for example, if the task is updating
+	information relating to itself that other tasks can read, there
+	by definition can be no bottleneck).
+
+2.	Do the RCU read-side critical sections make proper use of
+	rcu_read_lock() and friends?  These primitives are needed
+	to prevent grace periods from ending prematurely, which
+	could result in data being unceremoniously freed out from
+	under your read-side code, which can greatly increase the
+	actuarial risk of your kernel.
+
+	As a rough rule of thumb, any dereference of an RCU-protected
+	pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(),
+	rcu_read_lock_sched(), or by the appropriate update-side lock.
+	Disabling of preemption can serve as rcu_read_lock_sched(), but
+	is less readable.
+
+3.	Does the update code tolerate concurrent accesses?
+
+	The whole point of RCU is to permit readers to run without
+	any locks or atomic operations.  This means that readers will
+	be running while updates are in progress.  There are a number
+	of ways to handle this concurrency, depending on the situation:
+
+	a.	Use the RCU variants of the list and hlist update
+		primitives to add, remove, and replace elements on
+		an RCU-protected list.	Alternatively, use the other
+		RCU-protected data structures that have been added to
+		the Linux kernel.
+
+		This is almost always the best approach.
+
+	b.	Proceed as in (a) above, but also maintain per-element
+		locks (that are acquired by both readers and writers)
+		that guard per-element state.  Of course, fields that
+		the readers refrain from accessing can be guarded by
+		some other lock acquired only by updaters, if desired.
+
+		This works quite well, also.
+
+	c.	Make updates appear atomic to readers.  For example,
+		pointer updates to properly aligned fields will
+		appear atomic, as will individual atomic primitives.
+		Sequences of perations performed under a lock will -not-
+		appear to be atomic to RCU readers, nor will sequences
+		of multiple atomic primitives.
+
+		This can work, but is starting to get a bit tricky.
+
+	d.	Carefully order the updates and the reads so that
+		readers see valid data at all phases of the update.
+		This is often more difficult than it sounds, especially
+		given modern CPUs' tendency to reorder memory references.
+		One must usually liberally sprinkle memory barriers
+		(smp_wmb(), smp_rmb(), smp_mb()) through the code,
+		making it difficult to understand and to test.
+
+		It is usually better to group the changing data into
+		a separate structure, so that the change may be made
+		to appear atomic by updating a pointer to reference
+		a new structure containing updated values.
+
+4.	Weakly ordered CPUs pose special challenges.  Almost all CPUs
+	are weakly ordered -- even x86 CPUs allow later loads to be
+	reordered to precede earlier stores.  RCU code must take all of
+	the following measures to prevent memory-corruption problems:
+
+	a.	Readers must maintain proper ordering of their memory
+		accesses.  The rcu_dereference() primitive ensures that
+		the CPU picks up the pointer before it picks up the data
+		that the pointer points to.  This really is necessary
+		on Alpha CPUs.	If you don't believe me, see:
+
+			http://www.openvms.compaq.com/wizard/wiz_2637.html
+
+		The rcu_dereference() primitive is also an excellent
+		documentation aid, letting the person reading the code
+		know exactly which pointers are protected by RCU.
+		Please note that compilers can also reorder code, and
+		they are becoming increasingly aggressive about doing
+		just that.  The rcu_dereference() primitive therefore
+		also prevents destructive compiler optimizations.
+
+		The rcu_dereference() primitive is used by the
+		various "_rcu()" list-traversal primitives, such
+		as the list_for_each_entry_rcu().  Note that it is
+		perfectly legal (if redundant) for update-side code to
+		use rcu_dereference() and the "_rcu()" list-traversal
+		primitives.  This is particularly useful in code that
+		is common to readers and updaters.  However, lockdep
+		will complain if you access rcu_dereference() outside
+		of an RCU read-side critical section.  See lockdep.txt
+		to learn what to do about this.
+
+		Of course, neither rcu_dereference() nor the "_rcu()"
+		list-traversal primitives can substitute for a good
+		concurrency design coordinating among multiple updaters.
+
+	b.	If the list macros are being used, the list_add_tail_rcu()
+		and list_add_rcu() primitives must be used in order
+		to prevent weakly ordered machines from misordering
+		structure initialization and pointer planting.
+		Similarly, if the hlist macros are being used, the
+		hlist_add_head_rcu() primitive is required.
+
+	c.	If the list macros are being used, the list_del_rcu()
+		primitive must be used to keep list_del()'s pointer
+		poisoning from inflicting toxic effects on concurrent
+		readers.  Similarly, if the hlist macros are being used,
+		the hlist_del_rcu() primitive is required.
+
+		The list_replace_rcu() and hlist_replace_rcu() primitives
+		may be used to replace an old structure with a new one
+		in their respective types of RCU-protected lists.
+
+	d.	Rules similar to (4b) and (4c) apply to the "hlist_nulls"
+		type of RCU-protected linked lists.
+
+	e.	Updates must ensure that initialization of a given
+		structure happens before pointers to that structure are
+		publicized.  Use the rcu_assign_pointer() primitive
+		when publicizing a pointer to a structure that can
+		be traversed by an RCU read-side critical section.
+
+5.	If call_rcu(), or a related primitive such as call_rcu_bh(),
+	call_rcu_sched(), or call_srcu() is used, the callback function
+	must be written to be called from softirq context.  In particular,
+	it cannot block.
+
+6.	Since synchronize_rcu() can block, it cannot be called from
+	any sort of irq context.  The same rule applies for
+	synchronize_rcu_bh(), synchronize_sched(), synchronize_srcu(),
+	synchronize_rcu_expedited(), synchronize_rcu_bh_expedited(),
+	synchronize_sched_expedite(), and synchronize_srcu_expedited().
+
+	The expedited forms of these primitives have the same semantics
+	as the non-expedited forms, but expediting is both expensive
+	and unfriendly to real-time workloads.	Use of the expedited
+	primitives should be restricted to rare configuration-change
+	operations that would not normally be undertaken while a real-time
+	workload is running.
+
+	In particular, if you find yourself invoking one of the expedited
+	primitives repeatedly in a loop, please do everyone a favor:
+	Restructure your code so that it batches the updates, allowing
+	a single non-expedited primitive to cover the entire batch.
+	This will very likely be faster than the loop containing the
+	expedited primitive, and will be much much easier on the rest
+	of the system, especially to real-time workloads running on
+	the rest of the system.
+
+	In addition, it is illegal to call the expedited forms from
+	a CPU-hotplug notifier, or while holding a lock that is acquired
+	by a CPU-hotplug notifier.  Failing to observe this restriction
+	will result in deadlock.
+
+7.	If the updater uses call_rcu() or synchronize_rcu(), then the
+	corresponding readers must use rcu_read_lock() and
+	rcu_read_unlock().  If the updater uses call_rcu_bh() or
+	synchronize_rcu_bh(), then the corresponding readers must
+	use rcu_read_lock_bh() and rcu_read_unlock_bh().  If the
+	updater uses call_rcu_sched() or synchronize_sched(), then
+	the corresponding readers must disable preemption, possibly
+	by calling rcu_read_lock_sched() and rcu_read_unlock_sched().
+	If the updater uses synchronize_srcu() or call_srcu(),
+	the the corresponding readers must use srcu_read_lock() and
+	srcu_read_unlock(), and with the same srcu_struct.  The rules for
+	the expedited primitives are the same as for their non-expedited
+	counterparts.  Mixing things up will result in confusion and
+	broken kernels.
+
+	One exception to this rule: rcu_read_lock() and rcu_read_unlock()
+	may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh()
+	in cases where local bottom halves are already known to be
+	disabled, for example, in irq or softirq context.  Commenting
+	such cases is a must, of course!  And the jury is still out on
+	whether the increased speed is worth it.
+
+8.	Although synchronize_rcu() is slower than is call_rcu(), it
+	usually results in simpler code.  So, unless update performance is
+	critically important, the updaters cannot block, or the latency of
+	synchronize_rcu() is visible from userspace, synchronize_rcu()
+	should be used in preference to call_rcu().  Furthermore,
+	kfree_rcu() usually results in even simpler code than does
+	synchronize_rcu() without synchronize_rcu()'s multi-millisecond
+	latency.  So please take advantage of kfree_rcu()'s "fire and
+	forget" memory-freeing capabilities where it applies.
+
+	An especially important property of the synchronize_rcu()
+	primitive is that it automatically self-limits: if grace periods
+	are delayed for whatever reason, then the synchronize_rcu()
+	primitive will correspondingly delay updates.  In contrast,
+	code using call_rcu() should explicitly limit update rate in
+	cases where grace periods are delayed, as failing to do so can
+	result in excessive realtime latencies or even OOM conditions.
+
+	Ways of gaining this self-limiting property when using call_rcu()
+	include:
+
+	a.	Keeping a count of the number of data-structure elements
+		used by the RCU-protected data structure, including
+		those waiting for a grace period to elapse.  Enforce a
+		limit on this number, stalling updates as needed to allow
+		previously deferred frees to complete.	Alternatively,
+		limit only the number awaiting deferred free rather than
+		the total number of elements.
+
+		One way to stall the updates is to acquire the update-side
+		mutex.	(Don't try this with a spinlock -- other CPUs
+		spinning on the lock could prevent the grace period
+		from ever ending.)  Another way to stall the updates
+		is for the updates to use a wrapper function around
+		the memory allocator, so that this wrapper function
+		simulates OOM when there is too much memory awaiting an
+		RCU grace period.  There are of course many other
+		variations on this theme.
+
+	b.	Limiting update rate.  For example, if updates occur only
+		once per hour, then no explicit rate limiting is required,
+		unless your system is already badly broken.  The dcache
+		subsystem takes this approach -- updates are guarded
+		by a global lock, limiting their rate.
+
+	c.	Trusted update -- if updates can only be done manually by
+		superuser or some other trusted user, then it might not
+		be necessary to automatically limit them.  The theory
+		here is that superuser already has lots of ways to crash
+		the machine.
+
+	d.	Use call_rcu_bh() rather than call_rcu(), in order to take
+		advantage of call_rcu_bh()'s faster grace periods.
+
+	e.	Periodically invoke synchronize_rcu(), permitting a limited
+		number of updates per grace period.
+
+	The same cautions apply to call_rcu_bh(), call_rcu_sched(),
+	call_srcu(), and kfree_rcu().
+
+9.	All RCU list-traversal primitives, which include
+	rcu_dereference(), list_for_each_entry_rcu(), and
+	list_for_each_safe_rcu(), must be either within an RCU read-side
+	critical section or must be protected by appropriate update-side
+	locks.	RCU read-side critical sections are delimited by
+	rcu_read_lock() and rcu_read_unlock(), or by similar primitives
+	such as rcu_read_lock_bh() and rcu_read_unlock_bh(), in which
+	case the matching rcu_dereference() primitive must be used in
+	order to keep lockdep happy, in this case, rcu_dereference_bh().
+
+	The reason that it is permissible to use RCU list-traversal
+	primitives when the update-side lock is held is that doing so
+	can be quite helpful in reducing code bloat when common code is
+	shared between readers and updaters.  Additional primitives
+	are provided for this case, as discussed in lockdep.txt.
+
+10.	Conversely, if you are in an RCU read-side critical section,
+	and you don't hold the appropriate update-side lock, you -must-
+	use the "_rcu()" variants of the list macros.  Failing to do so
+	will break Alpha, cause aggressive compilers to generate bad code,
+	and confuse people trying to read your code.
+
+11.	Note that synchronize_rcu() -only- guarantees to wait until
+	all currently executing rcu_read_lock()-protected RCU read-side
+	critical sections complete.  It does -not- necessarily guarantee
+	that all currently running interrupts, NMIs, preempt_disable()
+	code, or idle loops will complete.  Therefore, if your
+	read-side critical sections are protected by something other
+	than rcu_read_lock(), do -not- use synchronize_rcu().
+
+	Similarly, disabling preemption is not an acceptable substitute
+	for rcu_read_lock().  Code that attempts to use preemption
+	disabling where it should be using rcu_read_lock() will break
+	in real-time kernel builds.
+
+	If you want to wait for interrupt handlers, NMI handlers, and
+	code under the influence of preempt_disable(), you instead
+	need to use synchronize_irq() or synchronize_sched().
+
+	This same limitation also applies to synchronize_rcu_bh()
+	and synchronize_srcu(), as well as to the asynchronous and
+	expedited forms of the three primitives, namely call_rcu(),
+	call_rcu_bh(), call_srcu(), synchronize_rcu_expedited(),
+	synchronize_rcu_bh_expedited(), and synchronize_srcu_expedited().
+
+12.	Any lock acquired by an RCU callback must be acquired elsewhere
+	with softirq disabled, e.g., via spin_lock_irqsave(),
+	spin_lock_bh(), etc.  Failing to disable irq on a given
+	acquisition of that lock will result in deadlock as soon as
+	the RCU softirq handler happens to run your RCU callback while
+	interrupting that acquisition's critical section.
+
+13.	RCU callbacks can be and are executed in parallel.  In many cases,
+	the callback code simply wrappers around kfree(), so that this
+	is not an issue (or, more accurately, to the extent that it is
+	an issue, the memory-allocator locking handles it).  However,
+	if the callbacks do manipulate a shared data structure, they
+	must use whatever locking or other synchronization is required
+	to safely access and/or modify that data structure.
+
+	RCU callbacks are -usually- executed on the same CPU that executed
+	the corresponding call_rcu(), call_rcu_bh(), or call_rcu_sched(),
+	but are by -no- means guaranteed to be.  For example, if a given
+	CPU goes offline while having an RCU callback pending, then that
+	RCU callback will execute on some surviving CPU.  (If this was
+	not the case, a self-spawning RCU callback would prevent the
+	victim CPU from ever going offline.)
+
+14.	SRCU (srcu_read_lock(), srcu_read_unlock(), srcu_dereference(),
+	synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu())
+	may only be invoked from process context.  Unlike other forms of
+	RCU, it -is- permissible to block in an SRCU read-side critical
+	section (demarked by srcu_read_lock() and srcu_read_unlock()),
+	hence the "SRCU": "sleepable RCU".  Please note that if you
+	don't need to sleep in read-side critical sections, you should be
+	using RCU rather than SRCU, because RCU is almost always faster
+	and easier to use than is SRCU.
+
+	If you need to enter your read-side critical section in a
+	hardirq or exception handler, and then exit that same read-side
+	critical section in the task that was interrupted, then you need
+	to srcu_read_lock_raw() and srcu_read_unlock_raw(), which avoid
+	the lockdep checking that would otherwise this practice illegal.
+
+	Also unlike other forms of RCU, explicit initialization
+	and cleanup is required via init_srcu_struct() and
+	cleanup_srcu_struct().	These are passed a "struct srcu_struct"
+	that defines the scope of a given SRCU domain.	Once initialized,
+	the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock()
+	synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu().
+	A given synchronize_srcu() waits only for SRCU read-side critical
+	sections governed by srcu_read_lock() and srcu_read_unlock()
+	calls that have been passed the same srcu_struct.  This property
+	is what makes sleeping read-side critical sections tolerable --
+	a given subsystem delays only its own updates, not those of other
+	subsystems using SRCU.	Therefore, SRCU is less prone to OOM the
+	system than RCU would be if RCU's read-side critical sections
+	were permitted to sleep.
+
+	The ability to sleep in read-side critical sections does not
+	come for free.	First, corresponding srcu_read_lock() and
+	srcu_read_unlock() calls must be passed the same srcu_struct.
+	Second, grace-period-detection overhead is amortized only
+	over those updates sharing a given srcu_struct, rather than
+	being globally amortized as they are for other forms of RCU.
+	Therefore, SRCU should be used in preference to rw_semaphore
+	only in extremely read-intensive situations, or in situations
+	requiring SRCU's read-side deadlock immunity or low read-side
+	realtime latency.
+
+	Note that, rcu_assign_pointer() relates to SRCU just as it does
+	to other forms of RCU.
+
+15.	The whole point of call_rcu(), synchronize_rcu(), and friends
+	is to wait until all pre-existing readers have finished before
+	carrying out some otherwise-destructive operation.  It is
+	therefore critically important to -first- remove any path
+	that readers can follow that could be affected by the
+	destructive operation, and -only- -then- invoke call_rcu(),
+	synchronize_rcu(), or friends.
+
+	Because these primitives only wait for pre-existing readers, it
+	is the caller's responsibility to guarantee that any subsequent
+	readers will execute safely.
+
+16.	The various RCU read-side primitives do -not- necessarily contain
+	memory barriers.  You should therefore plan for the CPU
+	and the compiler to freely reorder code into and out of RCU
+	read-side critical sections.  It is the responsibility of the
+	RCU update-side primitives to deal with this.
+
+17.	Use CONFIG_PROVE_RCU, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the
+	__rcu sparse checks (enabled by CONFIG_SPARSE_RCU_POINTER) to
+	validate your RCU code.  These can help find problems as follows:
+
+	CONFIG_PROVE_RCU: check that accesses to RCU-protected data
+		structures are carried out under the proper RCU
+		read-side critical section, while holding the right
+		combination of locks, or whatever other conditions
+		are appropriate.
+
+	CONFIG_DEBUG_OBJECTS_RCU_HEAD: check that you don't pass the
+		same object to call_rcu() (or friends) before an RCU
+		grace period has elapsed since the last time that you
+		passed that same object to call_rcu() (or friends).
+
+	__rcu sparse checks: tag the pointer to the RCU-protected data
+		structure with __rcu, and sparse will warn you if you
+		access that pointer without the services of one of the
+		variants of rcu_dereference().
+
+	These debugging aids can help you find problems that are
+	otherwise extremely difficult to spot.
--- a/Documentation/RCU/listRCU.txt
+++ b/Documentation/RCU/listRCU.txt
@@ -0,0 +1,315 @@
+Using RCU to Protect Read-Mostly Linked Lists
+
+
+One of the best applications of RCU is to protect read-mostly linked lists
+("struct list_head" in list.h).  One big advantage of this approach
+is that all of the required memory barriers are included for you in
+the list macros.  This document describes several applications of RCU,
+with the best fits first.
+
+
+Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates
+
+The best applications are cases where, if reader-writer locking were
+used, the read-side lock would be dropped before taking any action
+based on the results of the search.  The most celebrated example is
+the routing table.  Because the routing table is tracking the state of
+equipment outside of the computer, it will at times contain stale data.
+Therefore, once the route has been computed, there is no need to hold
+the routing table static during transmission of the packet.  After all,
+you can hold the routing table static all you want, but that won't keep
+the external Internet from changing, and it is the state of the external
+Internet that really matters.  In addition, routing entries are typically
+added or deleted, rather than being modified in place.
+
+A straightforward example of this use of RCU may be found in the
+system-call auditing support.  For example, a reader-writer locked
+implementation of audit_filter_task() might be as follows:
+
+	static enum audit_state audit_filter_task(struct task_struct *tsk)
+	{
+		struct audit_entry *e;
+		enum audit_state   state;
+
+		read_lock(&auditsc_lock);
+		/* Note: audit_netlink_sem held by caller. */
+		list_for_each_entry(e, &audit_tsklist, list) {
+			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
+				read_unlock(&auditsc_lock);
+				return state;
+			}
+		}
+		read_unlock(&auditsc_lock);
+		return AUDIT_BUILD_CONTEXT;
+	}
+
+Here the list is searched under the lock, but the lock is dropped before
+the corresponding value is returned.  By the time that this value is acted
+on, the list may well have been modified.  This makes sense, since if
+you are turning auditing off, it is OK to audit a few extra system calls.
+
+This means that RCU can be easily applied to the read side, as follows:
+
+	static enum audit_state audit_filter_task(struct task_struct *tsk)
+	{
+		struct audit_entry *e;
+		enum audit_state   state;
+
+		rcu_read_lock();
+		/* Note: audit_netlink_sem held by caller. */
+		list_for_each_entry_rcu(e, &audit_tsklist, list) {
+			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
+				rcu_read_unlock();
+				return state;
+			}
+		}
+		rcu_read_unlock();
+		return AUDIT_BUILD_CONTEXT;
+	}
+
+The read_lock() and read_unlock() calls have become rcu_read_lock()
+and rcu_read_unlock(), respectively, and the list_for_each_entry() has
+become list_for_each_entry_rcu().  The _rcu() list-traversal primitives
+insert the read-side memory barriers that are required on DEC Alpha CPUs.
+
+The changes to the update side are also straightforward.  A reader-writer
+lock might be used as follows for deletion and insertion:
+
+	static inline int audit_del_rule(struct audit_rule *rule,
+					 struct list_head *list)
+	{
+		struct audit_entry  *e;
+
+		write_lock(&auditsc_lock);
+		list_for_each_entry(e, list, list) {
+			if (!audit_compare_rule(rule, &e->rule)) {
+				list_del(&e->list);
+				write_unlock(&auditsc_lock);
+				return 0;
+			}
+		}
+		write_unlock(&auditsc_lock);
+		return -EFAULT;		/* No matching rule */
+	}
+
+	static inline int audit_add_rule(struct audit_entry *entry,
+					 struct list_head *list)
+	{
+		write_lock(&auditsc_lock);
+		if (entry->rule.flags & AUDIT_PREPEND) {
+			entry->rule.flags &= ~AUDIT_PREPEND;
+			list_add(&entry->list, list);
+		} else {
+			list_add_tail(&entry->list, list);
+		}
+		write_unlock(&auditsc_lock);
+		return 0;
+	}
+
+Following are the RCU equivalents for these two functions:
+
+	static inline int audit_del_rule(struct audit_rule *rule,
+					 struct list_head *list)
+	{
+		struct audit_entry  *e;
+
+		/* Do not use the _rcu iterator here, since this is the only
+		 * deletion routine. */
+		list_for_each_entry(e, list, list) {
+			if (!audit_compare_rule(rule, &e->rule)) {
+				list_del_rcu(&e->list);
+				call_rcu(&e->rcu, audit_free_rule);
+				return 0;
+			}
+		}
+		return -EFAULT;		/* No matching rule */
+	}
+
+	static inline int audit_add_rule(struct audit_entry *entry,
+					 struct list_head *list)
+	{
+		if (entry->rule.flags & AUDIT_PREPEND) {
+			entry->rule.flags &= ~AUDIT_PREPEND;
+			list_add_rcu(&entry->list, list);
+		} else {
+			list_add_tail_rcu(&entry->list, list);
+		}
+		return 0;
+	}
+
+Normally, the write_lock() and write_unlock() would be replaced by
+a spin_lock() and a spin_unlock(), but in this case, all callers hold
+audit_netlink_sem, so no additional locking is required.  The auditsc_lock
+can therefore be eliminated, since use of RCU eliminates the need for
+writers to exclude readers.  Normally, the write_lock() calls would
+be converted into spin_lock() calls.
+
+The list_del(), list_add(), and list_add_tail() primitives have been
+replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
+The _rcu() list-manipulation primitives add memory barriers that are
+needed on weakly ordered CPUs (most of them!).  The list_del_rcu()
+primitive omits the pointer poisoning debug-assist code that would
+otherwise cause concurrent readers to fail spectacularly.
+
+So, when readers can tolerate stale data and when entries are either added
+or deleted, without in-place modification, it is very easy to use RCU!
+
+
+Example 2: Handling In-Place Updates
+
+The system-call auditing code does not update auditing rules in place.
+However, if it did, reader-writer-locked code to do so might look as
+follows (presumably, the field_count is only permitted to decrease,
+otherwise, the added fields would need to be filled in):
+
+	static inline int audit_upd_rule(struct audit_rule *rule,
+					 struct list_head *list,
+					 __u32 newaction,
+					 __u32 newfield_count)
+	{
+		struct audit_entry  *e;
+		struct audit_newentry *ne;
+
+		write_lock(&auditsc_lock);
+		/* Note: audit_netlink_sem held by caller. */
+		list_for_each_entry(e, list, list) {
+			if (!audit_compare_rule(rule, &e->rule)) {
+				e->rule.action = newaction;
+				e->rule.file_count = newfield_count;
+				write_unlock(&auditsc_lock);
+				return 0;
+			}
+		}
+		write_unlock(&auditsc_lock);
+		return -EFAULT;		/* No matching rule */
+	}
+
+The RCU version creates a copy, updates the copy, then replaces the old
+entry with the newly updated entry.  This sequence of actions, allowing
+concurrent reads while doing a copy to perform an update, is what gives
+RCU ("read-copy update") its name.  The RCU code is as follows:
+
+	static inline int audit_upd_rule(struct audit_rule *rule,
+					 struct list_head *list,
+					 __u32 newaction,
+					 __u32 newfield_count)
+	{
+		struct audit_entry  *e;
+		struct audit_newentry *ne;
+
+		list_for_each_entry(e, list, list) {
+			if (!audit_compare_rule(rule, &e->rule)) {
+				ne = kmalloc(sizeof(*entry), GFP_ATOMIC);
+				if (ne == NULL)
+					return -ENOMEM;
+				audit_copy_rule(&ne->rule, &e->rule);
+				ne->rule.action = newaction;
+				ne->rule.file_count = newfield_count;
+				list_replace_rcu(&e->list, &ne->list);
+				call_rcu(&e->rcu, audit_free_rule);
+				return 0;
+			}
+		}
+		return -EFAULT;		/* No matching rule */
+	}
+
+Again, this assumes that the caller holds audit_netlink_sem.  Normally,
+the reader-writer lock would become a spinlock in this sort of code.
+
+
+Example 3: Eliminating Stale Data
+
+The auditing examples above tolerate stale data, as do most algorithms
+that are tracking external state.  Because there is a delay from the
+time the external state changes before Linux becomes aware of the change,
+additional RCU-induced staleness is normally not a problem.
+
+However, there are many examples where stale data cannot be tolerated.
+One example in the Linux kernel is the System V IPC (see the ipc_lock()
+function in ipc/util.c).  This code checks a "deleted" flag under a
+per-entry spinlock, and, if the "deleted" flag is set, pretends that the
+entry does not exist.  For this to be helpful, the search function must
+return holding the per-entry spinlock, as ipc_lock() does in fact do.
+
+Quick Quiz:  Why does the search function need to return holding the
+	per-entry lock for this deleted-flag technique to be helpful?
+
+If the system-call audit module were to ever need to reject stale data,
+one way to accomplish this would be to add a "deleted" flag and a "lock"
+spinlock to the audit_entry structure, and modify audit_filter_task()
+as follows:
+
+	static enum audit_state audit_filter_task(struct task_struct *tsk)
+	{
+		struct audit_entry *e;
+		enum audit_state   state;
+
+		rcu_read_lock();
+		list_for_each_entry_rcu(e, &audit_tsklist, list) {
+			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
+				spin_lock(&e->lock);
+				if (e->deleted) {
+					spin_unlock(&e->lock);
+					rcu_read_unlock();
+					return AUDIT_BUILD_CONTEXT;
+				}
+				rcu_read_unlock();
+				return state;
+			}
+		}
+		rcu_read_unlock();
+		return AUDIT_BUILD_CONTEXT;
+	}
+
+Note that this example assumes that entries are only added and deleted.
+Additional mechanism is required to deal correctly with the
+update-in-place performed by audit_upd_rule().  For one thing,
+audit_upd_rule() would need additional memory barriers to ensure
+that the list_add_rcu() was really executed before the list_del_rcu().
+
+The audit_del_rule() function would need to set the "deleted"
+flag under the spinlock as follows:
+
+	static inline int audit_del_rule(struct audit_rule *rule,
+					 struct list_head *list)
+	{
+		struct audit_entry  *e;
+
+		/* Do not need to use the _rcu iterator here, since this
+		 * is the only deletion routine. */
+		list_for_each_entry(e, list, list) {
+			if (!audit_compare_rule(rule, &e->rule)) {
+				spin_lock(&e->lock);
+				list_del_rcu(&e->list);
+				e->deleted = 1;
+				spin_unlock(&e->lock);
+				call_rcu(&e->rcu, audit_free_rule);
+				return 0;
+			}
+		}
+		return -EFAULT;		/* No matching rule */
+	}
+
+
+Summary
+
+Read-mostly list-based data structures that can tolerate stale data are
+the most amenable to use of RCU.  The simplest case is where entries are
+either added or deleted from the data structure (or atomically modified
+in place), but non-atomic in-place modifications can be handled by making
+a copy, updating the copy, then replacing the original with the copy.
+If stale data cannot be tolerated, then a "deleted" flag may be used
+in conjunction with a per-entry spinlock in order to allow the search
+function to reject newly deleted data.
+
+
+Answer to Quick Quiz
+	Why does the search function need to return holding the per-entry
+	lock for this deleted-flag technique to be helpful?
+
+	If the search function drops the per-entry lock before returning,
+	then the caller will be processing stale data in any case.  If it
+	is really OK to be processing stale data, then you don't need a
+	"deleted" flag.  If processing stale data really is a problem,
+	then you need to hold the per-entry lock across all of the code
+	that uses the value that was returned.
--- a/Documentation/RCU/lockdep-splat.txt
+++ b/Documentation/RCU/lockdep-splat.txt
@@ -0,0 +1,110 @@
+Lockdep-RCU was added to the Linux kernel in early 2010
+(http://lwn.net/Articles/371986/).  This facility checks for some common
+misuses of the RCU API, most notably using one of the rcu_dereference()
+family to access an RCU-protected pointer without the proper protection.
+When such misuse is detected, an lockdep-RCU splat is emitted.
+
+The usual cause of a lockdep-RCU slat is someone accessing an
+RCU-protected data structure without either (1) being in the right kind of
+RCU read-side critical section or (2) holding the right update-side lock.
+This problem can therefore be serious: it might result in random memory
+overwriting or worse.  There can of course be false positives, this
+being the real world and all that.
+
+So let's look at an example RCU lockdep splat from 3.0-rc5, one that
+has long since been fixed:
+
+===============================
+[ INFO: suspicious RCU usage. ]
+-------------------------------
+block/cfq-iosched.c:2776 suspicious rcu_dereference_protected() usage!
+
+other info that might help us debug this:
+
+
+rcu_scheduler_active = 1, debug_locks = 0
+3 locks held by scsi_scan_6/1552:
+ #0:  (&shost->scan_mutex){+.+.+.}, at: [<ffffffff8145efca>]
+scsi_scan_host_selected+0x5a/0x150
+ #1:  (&eq->sysfs_lock){+.+...}, at: [<ffffffff812a5032>]
+elevator_exit+0x22/0x60
+ #2:  (&(&q->__queue_lock)->rlock){-.-...}, at: [<ffffffff812b6233>]
+cfq_exit_queue+0x43/0x190
+
+stack backtrace:
+Pid: 1552, comm: scsi_scan_6 Not tainted 3.0.0-rc5 #17
+Call Trace:
+ [<ffffffff810abb9b>] lockdep_rcu_dereference+0xbb/0xc0
+ [<ffffffff812b6139>] __cfq_exit_single_io_context+0xe9/0x120
+ [<ffffffff812b626c>] cfq_exit_queue+0x7c/0x190
+ [<ffffffff812a5046>] elevator_exit+0x36/0x60
+ [<ffffffff812a802a>] blk_cleanup_queue+0x4a/0x60
+ [<ffffffff8145cc09>] scsi_free_queue+0x9/0x10
+ [<ffffffff81460944>] __scsi_remove_device+0x84/0xd0
+ [<ffffffff8145dca3>] scsi_probe_and_add_lun+0x353/0xb10
+ [<ffffffff817da069>] ? error_exit+0x29/0xb0
+ [<ffffffff817d98ed>] ? _raw_spin_unlock_irqrestore+0x3d/0x80
+ [<ffffffff8145e722>] __scsi_scan_target+0x112/0x680
+ [<ffffffff812c690d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
+ [<ffffffff817da069>] ? error_exit+0x29/0xb0
+ [<ffffffff812bcc60>] ? kobject_del+0x40/0x40
+ [<ffffffff8145ed16>] scsi_scan_channel+0x86/0xb0
+ [<ffffffff8145f0b0>] scsi_scan_host_selected+0x140/0x150
+ [<ffffffff8145f149>] do_scsi_scan_host+0x89/0x90
+ [<ffffffff8145f170>] do_scan_async+0x20/0x160
+ [<ffffffff8145f150>] ? do_scsi_scan_host+0x90/0x90
+ [<ffffffff810975b6>] kthread+0xa6/0xb0
+ [<ffffffff817db154>] kernel_thread_helper+0x4/0x10
+ [<ffffffff81066430>] ? finish_task_switch+0x80/0x110
+ [<ffffffff817d9c04>] ? retint_restore_args+0xe/0xe
+ [<ffffffff81097510>] ? __init_kthread_worker+0x70/0x70
+ [<ffffffff817db150>] ? gs_change+0xb/0xb
+
+Line 2776 of block/cfq-iosched.c in v3.0-rc5 is as follows:
+
+	if (rcu_dereference(ioc->ioc_data) == cic) {
+
+This form says that it must be in a plain vanilla RCU read-side critical
+section, but the "other info" list above shows that this is not the
+case.  Instead, we hold three locks, one of which might be RCU related.
+And maybe that lock really does protect this reference.  If so, the fix
+is to inform RCU, perhaps by changing __cfq_exit_single_io_context() to
+take the struct request_queue "q" from cfq_exit_queue() as an argument,
+which would permit us to invoke rcu_dereference_protected as follows:
+
+	if (rcu_dereference_protected(ioc->ioc_data,
+				      lockdep_is_held(&q->queue_lock)) == cic) {
+
+With this change, there would be no lockdep-RCU splat emitted if this
+code was invoked either from within an RCU read-side critical section
+or with the ->queue_lock held.  In particular, this would have suppressed
+the above lockdep-RCU splat because ->queue_lock is held (see #2 in the
+list above).
+
+On the other hand, perhaps we really do need an RCU read-side critical
+section.  In this case, the critical section must span the use of the
+return value from rcu_dereference(), or at least until there is some
+reference count incremented or some such.  One way to handle this is to
+add rcu_read_lock() and rcu_read_unlock() as follows:
+
+	rcu_read_lock();
+	if (rcu_dereference(ioc->ioc_data) == cic) {
+		spin_lock(&ioc->lock);
+		rcu_assign_pointer(ioc->ioc_data, NULL);
+		spin_unlock(&ioc->lock);
+	}
+	rcu_read_unlock();
+
+With this change, the rcu_dereference() is always within an RCU
+read-side critical section, which again would have suppressed the
+above lockdep-RCU splat.
+
+But in this particular case, we don't actually deference the pointer
+returned from rcu_dereference().  Instead, that pointer is just compared
+to the cic pointer, which means that the rcu_dereference() can be replaced
+by rcu_access_pointer() as follows:
+
+	if (rcu_access_pointer(ioc->ioc_data) == cic) {
+
+Because it is legal to invoke rcu_access_pointer() without protection,
+this change would also suppress the above lockdep-RCU splat.
--- a/Documentation/RCU/lockdep.txt
+++ b/Documentation/RCU/lockdep.txt
@@ -0,0 +1,112 @@
+RCU and lockdep checking
+
+All flavors of RCU have lockdep checking available, so that lockdep is
+aware of when each task enters and leaves any flavor of RCU read-side
+critical section.  Each flavor of RCU is tracked separately (but note
+that this is not the case in 2.6.32 and earlier).  This allows lockdep's
+tracking to include RCU state, which can sometimes help when debugging
+deadlocks and the like.
+
+In addition, RCU provides the following primitives that check lockdep's
+state:
+
+	rcu_read_lock_held() for normal RCU.
+	rcu_read_lock_bh_held() for RCU-bh.
+	rcu_read_lock_sched_held() for RCU-sched.
+	srcu_read_lock_held() for SRCU.
+
+These functions are conservative, and will therefore return 1 if they
+aren't certain (for example, if CONFIG_DEBUG_LOCK_ALLOC is not set).
+This prevents things like WARN_ON(!rcu_read_lock_held()) from giving false
+positives when lockdep is disabled.
+
+In addition, a separate kernel config parameter CONFIG_PROVE_RCU enables
+checking of rcu_dereference() primitives:
+
+	rcu_dereference(p):
+		Check for RCU read-side critical section.
+	rcu_dereference_bh(p):
+		Check for RCU-bh read-side critical section.
+	rcu_dereference_sched(p):
+		Check for RCU-sched read-side critical section.
+	srcu_dereference(p, sp):
+		Check for SRCU read-side critical section.
+	rcu_dereference_check(p, c):
+		Use explicit check expression "c" along with
+		rcu_read_lock_held().  This is useful in code that is
+		invoked by both RCU readers and updaters.
+	rcu_dereference_bh_check(p, c):
+		Use explicit check expression "c" along with
+		rcu_read_lock_bh_held().  This is useful in code that
+		is invoked by both RCU-bh readers and updaters.
+	rcu_dereference_sched_check(p, c):
+		Use explicit check expression "c" along with
+		rcu_read_lock_sched_held().  This is useful in code that
+		is invoked by both RCU-sched readers and updaters.
+	srcu_dereference_check(p, c):
+		Use explicit check expression "c" along with
+		srcu_read_lock_held()().  This is useful in code that
+		is invoked by both SRCU readers and updaters.
+	rcu_dereference_index_check(p, c):
+		Use explicit check expression "c", but the caller
+		must supply one of the rcu_read_lock_held() functions.
+		This is useful in code that uses RCU-protected arrays
+		that is invoked by both RCU readers and updaters.
+	rcu_dereference_raw(p):
+		Don't check.  (Use sparingly, if at all.)
+	rcu_dereference_protected(p, c):
+		Use explicit check expression "c", and omit all barriers
+		and compiler constraints.  This is useful when the data
+		structure cannot change, for example, in code that is
+		invoked only by updaters.
+	rcu_access_pointer(p):
+		Return the value of the pointer and omit all barriers,
+		but retain the compiler constraints that prevent duplicating
+		or coalescsing.  This is useful when when testing the
+		value of the pointer itself, for example, against NULL.
+	rcu_access_index(idx):
+		Return the value of the index and omit all barriers, but
+		retain the compiler constraints that prevent duplicating
+		or coalescsing.  This is useful when when testing the
+		value of the index itself, for example, against -1.
+
+The rcu_dereference_check() check expression can be any boolean
+expression, but would normally include a lockdep expression.  However,
+any boolean expression can be used.  For a moderately ornate example,
+consider the following:
+
+	file = rcu_dereference_check(fdt->fd[fd],
+				     lockdep_is_held(&files->file_lock) ||
+				     atomic_read(&files->count) == 1);
+
+This expression picks up the pointer "fdt->fd[fd]" in an RCU-safe manner,
+and, if CONFIG_PROVE_RCU is configured, verifies that this expression
+is used in:
+
+1.	An RCU read-side critical section (implicit), or
+2.	with files->file_lock held, or
+3.	on an unshared files_struct.
+
+In case (1), the pointer is picked up in an RCU-safe manner for vanilla
+RCU read-side critical sections, in case (2) the ->file_lock prevents
+any change from taking place, and finally, in case (3) the current task
+is the only task accessing the file_struct, again preventing any change
+from taking place.  If the above statement was invoked only from updater
+code, it could instead be written as follows:
+
+	file = rcu_dereference_protected(fdt->fd[fd],
+					 lockdep_is_held(&files->file_lock) ||
+					 atomic_read(&files->count) == 1);
+
+This would verify cases #2 and #3 above, and furthermore lockdep would
+complain if this was used in an RCU read-side critical section unless one
+of these two cases held.  Because rcu_dereference_protected() omits all
+barriers and compiler constraints, it generates better code than do the
+other flavors of rcu_dereference().  On the other hand, it is illegal
+to use rcu_dereference_protected() if either the RCU-protected pointer
+or the RCU-protected data that it points to can change concurrently.
+
+There are currently only "universal" versions of the rcu_assign_pointer()
+and RCU list-/tree-traversal primitives, which do not (yet) check for
+being in an RCU read-side critical section.  In the future, separate
+versions of these primitives might be created.
--- a/Documentation/RCU/rcu.txt
+++ b/Documentation/RCU/rcu.txt
@@ -0,0 +1,96 @@
+RCU Concepts
+
+
+The basic idea behind RCU (read-copy update) is to split destructive
+operations into two parts, one that prevents anyone from seeing the data
+item being destroyed, and one that actually carries out the destruction.
+A "grace period" must elapse between the two parts, and this grace period
+must be long enough that any readers accessing the item being deleted have
+since dropped their references.  For example, an RCU-protected deletion
+from a linked list would first remove the item from the list, wait for
+a grace period to elapse, then free the element.  See the listRCU.txt
+file for more information on using RCU with linked lists.
+
+
+Frequently Asked Questions
+
+o	Why would anyone want to use RCU?
+
+	The advantage of RCU's two-part approach is that RCU readers need
+	not acquire any locks, perform any atomic instructions, write to
+	shared memory, or (on CPUs other than Alpha) execute any memory
+	barriers.  The fact that these operations are quite expensive
+	on modern CPUs is what gives RCU its performance advantages
+	in read-mostly situations.  The fact that RCU readers need not
+	acquire locks can also greatly simplify deadlock-avoidance code.
+
+o	How can the updater tell when a grace period has completed
+	if the RCU readers give no indication when they are done?
+
+	Just as with spinlocks, RCU readers are not permitted to
+	block, switch to user-mode execution, or enter the idle loop.
+	Therefore, as soon as a CPU is seen passing through any of these
+	three states, we know that that CPU has exited any previous RCU
+	read-side critical sections.  So, if we remove an item from a
+	linked list, and then wait until all CPUs have switched context,
+	executed in user mode, or executed in the idle loop, we can
+	safely free up that item.
+
+	Preemptible variants of RCU (CONFIG_TREE_PREEMPT_RCU) get the
+	same effect, but require that the readers manipulate CPU-local
+	counters.  These counters allow limited types of blocking within
+	RCU read-side critical sections.  SRCU also uses CPU-local
+	counters, and permits general blocking within RCU read-side
+	critical sections.  These variants of RCU detect grace periods
+	by sampling these counters.
+
+o	If I am running on a uniprocessor kernel, which can only do one
+	thing at a time, why should I wait for a grace period?
+
+	See the UP.txt file in this directory.
+
+o	How can I see where RCU is currently used in the Linux kernel?
+
+	Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
+	"rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh",
+	"srcu_read_lock", "srcu_read_unlock", "synchronize_rcu",
+	"synchronize_net", "synchronize_srcu", and the other RCU
+	primitives.  Or grab one of the cscope databases from:
+
+	http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html
+
+o	What guidelines should I follow when writing code that uses RCU?
+
+	See the checklist.txt file in this directory.
+
+o	Why the name "RCU"?
+
+	"RCU" stands for "read-copy update".  The file listRCU.txt has
+	more information on where this name came from, search for
+	"read-copy update" to find it.
+
+o	I hear that RCU is patented?  What is with that?
+
+	Yes, it is.  There are several known patents related to RCU,
+	search for the string "Patent" in RTFP.txt to find them.
+	Of these, one was allowed to lapse by the assignee, and the
+	others have been contributed to the Linux kernel under GPL.
+	There are now also LGPL implementations of user-level RCU
+	available (http://lttng.org/?q=node/18).
+
+o	I hear that RCU needs work in order to support realtime kernels?
+
+	This work is largely completed.  Realtime-friendly RCU can be
+	enabled via the CONFIG_TREE_PREEMPT_RCU kernel configuration
+	parameter.  However, work is in progress for enabling priority
+	boosting of preempted RCU read-side critical sections.	This is
+	needed if you have CPU-bound realtime threads.
+
+o	Where can I find more information on RCU?
+
+	See the RTFP.txt file in this directory.
+	Or point your browser at http://www.rdrop.com/users/paulmck/RCU/.
+
+o	What are all these files in this directory?
+
+	See 00-INDEX for the list.
--- a/Documentation/RCU/rcubarrier.txt
+++ b/Documentation/RCU/rcubarrier.txt
@@ -0,0 +1,317 @@
+RCU and Unloadable Modules
+
+[Originally published in LWN Jan. 14, 2007: http://lwn.net/Articles/217484/]
+
+RCU (read-copy update) is a synchronization mechanism that can be thought
+of as a replacement for read-writer locking (among other things), but with
+very low-overhead readers that are immune to deadlock, priority inversion,
+and unbounded latency. RCU read-side critical sections are delimited
+by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPT
+kernels, generate no code whatsoever.
+
+This means that RCU writers are unaware of the presence of concurrent
+readers, so that RCU updates to shared data must be undertaken quite
+carefully, leaving an old version of the data structure in place until all
+pre-existing readers have finished. These old versions are needed because
+such readers might hold a reference to them. RCU updates can therefore be
+rather expensive, and RCU is thus best suited for read-mostly situations.
+
+How can an RCU writer possibly determine when all readers are finished,
+given that readers might well leave absolutely no trace of their
+presence? There is a synchronize_rcu() primitive that blocks until all
+pre-existing readers have completed. An updater wishing to delete an
+element p from a linked list might do the following, while holding an
+appropriate lock, of course:
+
+	list_del_rcu(p);
+	synchronize_rcu();
+	kfree(p);
+
+But the above code cannot be used in IRQ context -- the call_rcu()
+primitive must be used instead. This primitive takes a pointer to an
+rcu_head struct placed within the RCU-protected data structure and
+another pointer to a function that may be invoked later to free that
+structure. Code to delete an element p from the linked list from IRQ
+context might then be as follows:
+
+	list_del_rcu(p);
+	call_rcu(&p->rcu, p_callback);
+
+Since call_rcu() never blocks, this code can safely be used from within
+IRQ context. The function p_callback() might be defined as follows:
+
+	static void p_callback(struct rcu_head *rp)
+	{
+		struct pstruct *p = container_of(rp, struct pstruct, rcu);
+
+		kfree(p);
+	}
+
+
+Unloading Modules That Use call_rcu()
+
+But what if p_callback is defined in an unloadable module?
+
+If we unload the module while some RCU callbacks are pending,
+the CPUs executing these callbacks are going to be severely
+disappointed when they are later invoked, as fancifully depicted at
+http://lwn.net/images/ns/kernel/rcu-drop.jpg.
+
+We could try placing a synchronize_rcu() in the module-exit code path,
+but this is not sufficient. Although synchronize_rcu() does wait for a
+grace period to elapse, it does not wait for the callbacks to complete.
+
+One might be tempted to try several back-to-back synchronize_rcu()
+calls, but this is still not guaranteed to work. If there is a very
+heavy RCU-callback load, then some of the callbacks might be deferred
+in order to allow other processing to proceed. Such deferral is required
+in realtime kernels in order to avoid excessive scheduling latencies.
+
+
+rcu_barrier()
+
+We instead need the rcu_barrier() primitive. This primitive is similar
+to synchronize_rcu(), but instead of waiting solely for a grace
+period to elapse, it also waits for all outstanding RCU callbacks to
+complete. Pseudo-code using rcu_barrier() is as follows:
+
+   1. Prevent any new RCU callbacks from being posted.
+   2. Execute rcu_barrier().
+   3. Allow the module to be unloaded.
+
+There are also rcu_barrier_bh(), rcu_barrier_sched(), and srcu_barrier()
+functions for the other flavors of RCU, and you of course must match
+the flavor of rcu_barrier() with that of call_rcu().  If your module
+uses multiple flavors of call_rcu(), then it must also use multiple
+flavors of rcu_barrier() when unloading that module.  For example, if
+it uses call_rcu_bh(), call_srcu() on srcu_struct_1, and call_srcu() on
+srcu_struct_2(), then the following three lines of code will be required
+when unloading:
+
+ 1 rcu_barrier_bh();
+ 2 srcu_barrier(&srcu_struct_1);
+ 3 srcu_barrier(&srcu_struct_2);
+
+The rcutorture module makes use of rcu_barrier() in its exit function
+as follows:
+
+ 1 static void
+ 2 rcu_torture_cleanup(void)
+ 3 {
+ 4   int i;
+ 5
+ 6   fullstop = 1;
+ 7   if (shuffler_task != NULL) {
+ 8     VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task");
+ 9     kthread_stop(shuffler_task);
+10   }
+11   shuffler_task = NULL;
+12
+13   if (writer_task != NULL) {
+14     VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task");
+15     kthread_stop(writer_task);
+16   }
+17   writer_task = NULL;
+18
+19   if (reader_tasks != NULL) {
+20     for (i = 0; i < nrealreaders; i++) {
+21       if (reader_tasks[i] != NULL) {
+22         VERBOSE_PRINTK_STRING(
+23           "Stopping rcu_torture_reader task");
+24         kthread_stop(reader_tasks[i]);
+25       }
+26       reader_tasks[i] = NULL;
+27     }
+28     kfree(reader_tasks);
+29     reader_tasks = NULL;
+30   }
+31   rcu_torture_current = NULL;
+32
+33   if (fakewriter_tasks != NULL) {
+34     for (i = 0; i < nfakewriters; i++) {
+35       if (fakewriter_tasks[i] != NULL) {
+36         VERBOSE_PRINTK_STRING(
+37           "Stopping rcu_torture_fakewriter task");
+38         kthread_stop(fakewriter_tasks[i]);
+39       }
+40       fakewriter_tasks[i] = NULL;
+41     }
+42     kfree(fakewriter_tasks);
+43     fakewriter_tasks = NULL;
+44   }
+45
+46   if (stats_task != NULL) {
+47     VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task");
+48     kthread_stop(stats_task);
+49   }
+50   stats_task = NULL;
+51
+52   /* Wait for all RCU callbacks to fire. */
+53   rcu_barrier();
+54
+55   rcu_torture_stats_print(); /* -After- the stats thread is stopped! */
+56
+57   if (cur_ops->cleanup != NULL)
+58     cur_ops->cleanup();
+59   if (atomic_read(&n_rcu_torture_error))
+60     rcu_torture_print_module_parms("End of test: FAILURE");
+61   else
+62     rcu_torture_print_module_parms("End of test: SUCCESS");
+63 }
+
+Line 6 sets a global variable that prevents any RCU callbacks from
+re-posting themselves. This will not be necessary in most cases, since
+RCU callbacks rarely include calls to call_rcu(). However, the rcutorture
+module is an exception to this rule, and therefore needs to set this
+global variable.
+
+Lines 7-50 stop all the kernel tasks associated with the rcutorture
+module. Therefore, once execution reaches line 53, no more rcutorture
+RCU callbacks will be posted. The rcu_barrier() call on line 53 waits
+for any pre-existing callbacks to complete.
+
+Then lines 55-62 print status and do operation-specific cleanup, and
+then return, permitting the module-unload operation to be completed.
+
+Quick Quiz #1: Is there any other situation where rcu_barrier() might
+	be required?
+
+Your module might have additional complications. For example, if your
+module invokes call_rcu() from timers, you will need to first cancel all
+the timers, and only then invoke rcu_barrier() to wait for any remaining
+RCU callbacks to complete.
+
+Of course, if you module uses call_rcu_bh(), you will need to invoke
+rcu_barrier_bh() before unloading.  Similarly, if your module uses
+call_rcu_sched(), you will need to invoke rcu_barrier_sched() before
+unloading.  If your module uses call_rcu(), call_rcu_bh(), -and-
+call_rcu_sched(), then you will need to invoke each of rcu_barrier(),
+rcu_barrier_bh(), and rcu_barrier_sched().
+
+
+Implementing rcu_barrier()
+
+Dipankar Sarma's implementation of rcu_barrier() makes use of the fact
+that RCU callbacks are never reordered once queued on one of the per-CPU
+queues. His implementation queues an RCU callback on each of the per-CPU
+callback queues, and then waits until they have all started executing, at
+which point, all earlier RCU callbacks are guaranteed to have completed.
+
+The original code for rcu_barrier() was as follows:
+
+ 1 void rcu_barrier(void)
+ 2 {
+ 3   BUG_ON(in_interrupt());
+ 4   /* Take cpucontrol mutex to protect against CPU hotplug */
+ 5   mutex_lock(&rcu_barrier_mutex);
+ 6   init_completion(&rcu_barrier_completion);
+ 7   atomic_set(&rcu_barrier_cpu_count, 0);
+ 8   on_each_cpu(rcu_barrier_func, NULL, 0, 1);
+ 9   wait_for_completion(&rcu_barrier_completion);
+10   mutex_unlock(&rcu_barrier_mutex);
+11 }
+
+Line 3 verifies that the caller is in process context, and lines 5 and 10
+use rcu_barrier_mutex to ensure that only one rcu_barrier() is using the
+global completion and counters at a time, which are initialized on lines
+6 and 7. Line 8 causes each CPU to invoke rcu_barrier_func(), which is
+shown below. Note that the final "1" in on_each_cpu()'s argument list
+ensures that all the calls to rcu_barrier_func() will have completed
+before on_each_cpu() returns. Line 9 then waits for the completion.
+
+This code was rewritten in 2008 to support rcu_barrier_bh() and
+rcu_barrier_sched() in addition to the original rcu_barrier().
+
+The rcu_barrier_func() runs on each CPU, where it invokes call_rcu()
+to post an RCU callback, as follows:
+
+ 1 static void rcu_barrier_func(void *notused)
+ 2 {
+ 3 int cpu = smp_processor_id();
+ 4 struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+ 5 struct rcu_head *head;
+ 6
+ 7 head = &rdp->barrier;
+ 8 atomic_inc(&rcu_barrier_cpu_count);
+ 9 call_rcu(head, rcu_barrier_callback);
+10 }
+
+Lines 3 and 4 locate RCU's internal per-CPU rcu_data structure,
+which contains the struct rcu_head that needed for the later call to
+call_rcu(). Line 7 picks up a pointer to this struct rcu_head, and line
+8 increments a global counter. This counter will later be decremented
+by the callback. Line 9 then registers the rcu_barrier_callback() on
+the current CPU's queue.
+
+The rcu_barrier_callback() function simply atomically decrements the
+rcu_barrier_cpu_count variable and finalizes the completion when it
+reaches zero, as follows:
+
+ 1 static void rcu_barrier_callback(struct rcu_head *notused)
+ 2 {
+ 3 if (atomic_dec_and_test(&rcu_barrier_cpu_count))
+ 4 complete(&rcu_barrier_completion);
+ 5 }
+
+Quick Quiz #2: What happens if CPU 0's rcu_barrier_func() executes
+	immediately (thus incrementing rcu_barrier_cpu_count to the
+	value one), but the other CPU's rcu_barrier_func() invocations
+	are delayed for a full grace period? Couldn't this result in
+	rcu_barrier() returning prematurely?
+
+
+rcu_barrier() Summary
+
+The rcu_barrier() primitive has seen relatively little use, since most
+code using RCU is in the core kernel rather than in modules. However, if
+you are using RCU from an unloadable module, you need to use rcu_barrier()
+so that your module may be safely unloaded.
+
+
+Answers to Quick Quizzes
+
+Quick Quiz #1: Is there any other situation where rcu_barrier() might
+	be required?
+
+Answer: Interestingly enough, rcu_barrier() was not originally
+	implemented for module unloading. Nikita Danilov was using
+	RCU in a filesystem, which resulted in a similar situation at
+	filesystem-unmount time. Dipankar Sarma coded up rcu_barrier()
+	in response, so that Nikita could invoke it during the
+	filesystem-unmount process.
+
+	Much later, yours truly hit the RCU module-unload problem when
+	implementing rcutorture, and found that rcu_barrier() solves
+	this problem as well.
+
+Quick Quiz #2: What happens if CPU 0's rcu_barrier_func() executes
+	immediately (thus incrementing rcu_barrier_cpu_count to the
+	value one), but the other CPU's rcu_barrier_func() invocations
+	are delayed for a full grace period? Couldn't this result in
+	rcu_barrier() returning prematurely?
+
+Answer: This cannot happen. The reason is that on_each_cpu() has its last
+	argument, the wait flag, set to "1". This flag is passed through
+	to smp_call_function() and further to smp_call_function_on_cpu(),
+	causing this latter to spin until the cross-CPU invocation of
+	rcu_barrier_func() has completed. This by itself would prevent
+	a grace period from completing on non-CONFIG_PREEMPT kernels,
+	since each CPU must undergo a context switch (or other quiescent
+	state) before the grace period can complete. However, this is
+	of no use in CONFIG_PREEMPT kernels.
+
+	Therefore, on_each_cpu() disables preemption across its call
+	to smp_call_function() and also across the local call to
+	rcu_barrier_func(). This prevents the local CPU from context
+	switching, again preventing grace periods from completing. This
+	means that all CPUs have executed rcu_barrier_func() before
+	the first rcu_barrier_callback() can possibly execute, in turn
+	preventing rcu_barrier_cpu_count from prematurely reaching zero.
+
+	Currently, -rt implementations of RCU keep but a single global
+	queue for RCU callbacks, and thus do not suffer from this
+	problem. However, when the -rt RCU eventually does have per-CPU
+	callback queues, things will have to change. One simple change
+	is to add an rcu_read_lock() before line 8 of rcu_barrier()
+	and an rcu_read_unlock() after line 8 of this same function. If
+	you can think of a better change, please let me know!
--- a/Documentation/RCU/rculist_nulls.txt
+++ b/Documentation/RCU/rculist_nulls.txt
@@ -0,0 +1,172 @@
+Using hlist_nulls to protect read-mostly linked lists and
+objects using SLAB_DESTROY_BY_RCU allocations.
+
+Please read the basics in Documentation/RCU/listRCU.txt
+
+Using special makers (called 'nulls') is a convenient way
+to solve following problem :
+
+A typical RCU linked list managing objects which are
+allocated with SLAB_DESTROY_BY_RCU kmem_cache can
+use following algos :
+
+1) Lookup algo
+--------------
+rcu_read_lock()
+begin:
+obj = lockless_lookup(key);
+if (obj) {
+  if (!try_get_ref(obj)) // might fail for free objects
+    goto begin;
+  /*
+   * Because a writer could delete object, and a writer could
+   * reuse these object before the RCU grace period, we
+   * must check key after getting the reference on object
+   */
+  if (obj->key != key) { // not the object we expected
+     put_ref(obj);
+     goto begin;
+   }
+}
+rcu_read_unlock();
+
+Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu()
+but a version with an additional memory barrier (smp_rmb())
+
+lockless_lookup(key)
+{
+   struct hlist_node *node, *next;
+   for (pos = rcu_dereference((head)->first);
+          pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
+          ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
+          pos = rcu_dereference(next))
+      if (obj->key == key)
+         return obj;
+   return NULL;
+
+And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb() :
+
+   struct hlist_node *node;
+   for (pos = rcu_dereference((head)->first);
+		pos && ({ prefetch(pos->next); 1; }) &&
+		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
+		pos = rcu_dereference(pos->next))
+      if (obj->key == key)
+         return obj;
+   return NULL;
+}
+
+Quoting Corey Minyard :
+
+"If the object is moved from one list to another list in-between the
+ time the hash is calculated and the next field is accessed, and the
+ object has moved to the end of a new list, the traversal will not
+ complete properly on the list it should have, since the object will
+ be on the end of the new list and there's not a way to tell it's on a
+ new list and restart the list traversal.  I think that this can be
+ solved by pre-fetching the "next" field (with proper barriers) before
+ checking the key."
+
+2) Insert algo :
+----------------
+
+We need to make sure a reader cannot read the new 'obj->obj_next' value
+and previous value of 'obj->key'. Or else, an item could be deleted
+from a chain, and inserted into another chain. If new chain was empty
+before the move, 'next' pointer is NULL, and lockless reader can
+not detect it missed following items in original chain.
+
+/*
+ * Please note that new inserts are done at the head of list,
+ * not in the middle or end.
+ */
+obj = kmem_cache_alloc(...);
+lock_chain(); // typically a spin_lock()
+obj->key = key;
+/*
+ * we need to make sure obj->key is updated before obj->next
+ * or obj->refcnt
+ */
+smp_wmb();
+atomic_set(&obj->refcnt, 1);
+hlist_add_head_rcu(&obj->obj_node, list);
+unlock_chain(); // typically a spin_unlock()
+
+
+3) Remove algo
+--------------
+Nothing special here, we can use a standard RCU hlist deletion.
+But thanks to SLAB_DESTROY_BY_RCU, beware a deleted object can be reused
+very very fast (before the end of RCU grace period)
+
+if (put_last_reference_on(obj) {
+   lock_chain(); // typically a spin_lock()
+   hlist_del_init_rcu(&obj->obj_node);
+   unlock_chain(); // typically a spin_unlock()
+   kmem_cache_free(cachep, obj);
+}
+
+
+
+--------------------------------------------------------------------------
+With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup()
+and extra smp_wmb() in insert function.
+
+For example, if we choose to store the slot number as the 'nulls'
+end-of-list marker for each slot of the hash table, we can detect
+a race (some writer did a delete and/or a move of an object
+to another chain) checking the final 'nulls' value if
+the lookup met the end of chain. If final 'nulls' value
+is not the slot number, then we must restart the lookup at
+the beginning. If the object was moved to the same chain,
+then the reader doesn't care : It might eventually
+scan the list again without harm.
+
+
+1) lookup algo
+
+ head = &table[slot];
+ rcu_read_lock();
+begin:
+ hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
+   if (obj->key == key) {
+      if (!try_get_ref(obj)) // might fail for free objects
+         goto begin;
+      if (obj->key != key) { // not the object we expected
+         put_ref(obj);
+         goto begin;
+      }
+  goto out;
+ }
+/*
+ * if the nulls value we got at the end of this lookup is
+ * not the expected one, we must restart lookup.
+ * We probably met an item that was moved to another chain.
+ */
+ if (get_nulls_value(node) != slot)
+   goto begin;
+ obj = NULL;
+
+out:
+ rcu_read_unlock();
+
+2) Insert function :
+--------------------
+
+/*
+ * Please note that new inserts are done at the head of list,
+ * not in the middle or end.
+ */
+obj = kmem_cache_alloc(cachep);
+lock_chain(); // typically a spin_lock()
+obj->key = key;
+/*
+ * changes to obj->key must be visible before refcnt one
+ */
+smp_wmb();
+atomic_set(&obj->refcnt, 1);
+/*
+ * insert obj in RCU way (readers might be traversing chain)
+ */
+hlist_nulls_add_head_rcu(&obj->obj_node, list);
+unlock_chain(); // typically a spin_unlock()
--- a/Documentation/RCU/rcuref.txt
+++ b/Documentation/RCU/rcuref.txt
@@ -0,0 +1,123 @@
+Reference-count design for elements of lists/arrays protected by RCU.
+
+Reference counting on elements of lists which are protected by traditional
+reader/writer spinlocks or semaphores are straightforward:
+
+1.				2.
+add()				search_and_reference()
+{				{
+    alloc_object		    read_lock(&list_lock);
+    ...				    search_for_element
+    atomic_set(&el->rc, 1);	    atomic_inc(&el->rc);
+    write_lock(&list_lock);	     ...
+    add_element			    read_unlock(&list_lock);
+    ...				    ...
+    write_unlock(&list_lock);	}
+}
+
+3.					4.
+release_referenced()			delete()
+{					{
+    ...					    write_lock(&list_lock);
+    atomic_dec(&el->rc, relfunc)	    ...
+    ...					    remove_element
+}					    write_unlock(&list_lock);
+ 					    ...
+					    if (atomic_dec_and_test(&el->rc))
+					        kfree(el);
+					    ...
+					}
+
+If this list/array is made lock free using RCU as in changing the
+write_lock() in add() and delete() to spin_lock() and changing read_lock()
+in search_and_reference() to rcu_read_lock(), the atomic_inc() in
+search_and_reference() could potentially hold reference to an element which
+has already been deleted from the list/array.  Use atomic_inc_not_zero()
+in this scenario as follows:
+
+1.					2.
+add()					search_and_reference()
+{					{
+    alloc_object			    rcu_read_lock();
+    ...					    search_for_element
+    atomic_set(&el->rc, 1);		    if (!atomic_inc_not_zero(&el->rc)) {
+    spin_lock(&list_lock);		        rcu_read_unlock();
+					        return FAIL;
+    add_element				    }
+    ...					    ...
+    spin_unlock(&list_lock);		    rcu_read_unlock();
+}					}
+3.					4.
+release_referenced()			delete()
+{					{
+    ...					    spin_lock(&list_lock);
+    if (atomic_dec_and_test(&el->rc))       ...
+        call_rcu(&el->head, el_free);       remove_element
+    ...                                     spin_unlock(&list_lock);
+} 					    ...
+					    if (atomic_dec_and_test(&el->rc))
+					        call_rcu(&el->head, el_free);
+					    ...
+					}
+
+Sometimes, a reference to the element needs to be obtained in the
+update (write) stream.  In such cases, atomic_inc_not_zero() might be
+overkill, since we hold the update-side spinlock.  One might instead
+use atomic_inc() in such cases.
+
+It is not always convenient to deal with "FAIL" in the
+search_and_reference() code path.  In such cases, the
+atomic_dec_and_test() may be moved from delete() to el_free()
+as follows:
+
+1.					2.
+add()					search_and_reference()
+{					{
+    alloc_object			    rcu_read_lock();
+    ...					    search_for_element
+    atomic_set(&el->rc, 1);		    atomic_inc(&el->rc);
+    spin_lock(&list_lock);		    ...
+
+    add_element				    rcu_read_unlock();
+    ...					}
+    spin_unlock(&list_lock);		4.
+}					delete()
+3.					{
+release_referenced()			    spin_lock(&list_lock);
+{					    ...
+    ...					    remove_element
+    if (atomic_dec_and_test(&el->rc))       spin_unlock(&list_lock);
+        kfree(el);			    ...
+    ...                                     call_rcu(&el->head, el_free);
+} 					    ...
+5.					}
+void el_free(struct rcu_head *rhp)
+{
+    release_referenced();
+}
+
+The key point is that the initial reference added by add() is not removed
+until after a grace period has elapsed following removal.  This means that
+search_and_reference() cannot find this element, which means that the value
+of el->rc cannot increase.  Thus, once it reaches zero, there are no
+readers that can or ever will be able to reference the element.  The
+element can therefore safely be freed.  This in turn guarantees that if
+any reader finds the element, that reader may safely acquire a reference
+without checking the value of the reference counter.
+
+In cases where delete() can sleep, synchronize_rcu() can be called from
+delete(), so that el_free() can be subsumed into delete as follows:
+
+4.
+delete()
+{
+    spin_lock(&list_lock);
+    ...
+    remove_element
+    spin_unlock(&list_lock);
+    ...
+    synchronize_rcu();
+    if (atomic_dec_and_test(&el->rc))
+    	kfree(el);
+    ...
+}
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -0,0 +1,219 @@
+Using RCU's CPU Stall Detector
+
+The rcu_cpu_stall_suppress module parameter enables RCU's CPU stall
+detector, which detects conditions that unduly delay RCU grace periods.
+This module parameter enables CPU stall detection by default, but
+may be overridden via boot-time parameter or at runtime via sysfs.
+The stall detector's idea of what constitutes "unduly delayed" is
+controlled by a set of kernel configuration variables and cpp macros:
+
+CONFIG_RCU_CPU_STALL_TIMEOUT
+
+	This kernel configuration parameter defines the period of time
+	that RCU will wait from the beginning of a grace period until it
+	issues an RCU CPU stall warning.  This time period is normally
+	sixty seconds.
+
+	This configuration parameter may be changed at runtime via the
+	/sys/module/rcutree/parameters/rcu_cpu_stall_timeout, however
+	this parameter is checked only at the beginning of a cycle.
+	So if you are 30 seconds into a 70-second stall, setting this
+	sysfs parameter to (say) five will shorten the timeout for the
+	-next- stall, or the following warning for the current stall
+	(assuming the stall lasts long enough).  It will not affect the
+	timing of the next warning for the current stall.
+
+	Stall-warning messages may be enabled and disabled completely via
+	/sys/module/rcutree/parameters/rcu_cpu_stall_suppress.
+
+CONFIG_RCU_CPU_STALL_VERBOSE
+
+	This kernel configuration parameter causes the stall warning to
+	also dump the stacks of any tasks that are blocking the current
+	RCU-preempt grace period.
+
+RCU_CPU_STALL_INFO
+
+	This kernel configuration parameter causes the stall warning to
+	print out additional per-CPU diagnostic information, including
+	information on scheduling-clock ticks and RCU's idle-CPU tracking.
+
+RCU_STALL_DELAY_DELTA
+
+	Although the lockdep facility is extremely useful, it does add
+	some overhead.  Therefore, under CONFIG_PROVE_RCU, the
+	RCU_STALL_DELAY_DELTA macro allows five extra seconds before
+	giving an RCU CPU stall warning message.
+
+RCU_STALL_RAT_DELAY
+
+	The CPU stall detector tries to make the offending CPU print its
+	own warnings, as this often gives better-quality stack traces.
+	However, if the offending CPU does not detect its own stall in
+	the number of jiffies specified by RCU_STALL_RAT_DELAY, then
+	some other CPU will complain.  This delay is normally set to
+	two jiffies.
+
+When a CPU detects that it is stalling, it will print a message similar
+to the following:
+
+INFO: rcu_sched_state detected stall on CPU 5 (t=2500 jiffies)
+
+This message indicates that CPU 5 detected that it was causing a stall,
+and that the stall was affecting RCU-sched.  This message will normally be
+followed by a stack dump of the offending CPU.  On TREE_RCU kernel builds,
+RCU and RCU-sched are implemented by the same underlying mechanism,
+while on TREE_PREEMPT_RCU kernel builds, RCU is instead implemented
+by rcu_preempt_state.
+
+On the other hand, if the offending CPU fails to print out a stall-warning
+message quickly enough, some other CPU will print a message similar to
+the following:
+
+INFO: rcu_bh_state detected stalls on CPUs/tasks: { 3 5 } (detected by 2, 2502 jiffies)
+
+This message indicates that CPU 2 detected that CPUs 3 and 5 were both
+causing stalls, and that the stall was affecting RCU-bh.  This message
+will normally be followed by stack dumps for each CPU.  Please note that
+TREE_PREEMPT_RCU builds can be stalled by tasks as well as by CPUs,
+and that the tasks will be indicated by PID, for example, "P3421".
+It is even possible for a rcu_preempt_state stall to be caused by both
+CPUs -and- tasks, in which case the offending CPUs and tasks will all
+be called out in the list.
+
+Finally, if the grace period ends just as the stall warning starts
+printing, there will be a spurious stall-warning message:
+
+INFO: rcu_bh_state detected stalls on CPUs/tasks: { } (detected by 4, 2502 jiffies)
+
+This is rare, but does happen from time to time in real life.
+
+If the CONFIG_RCU_CPU_STALL_INFO kernel configuration parameter is set,
+more information is printed with the stall-warning message, for example:
+
+	INFO: rcu_preempt detected stall on CPU
+	0: (63959 ticks this GP) idle=241/3fffffffffffffff/0 softirq=82/543
+	   (t=65000 jiffies)
+
+In kernels with CONFIG_RCU_FAST_NO_HZ, even more information is
+printed:
+
+	INFO: rcu_preempt detected stall on CPU
+	0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 nonlazy_posted: 25 .D
+	   (t=65000 jiffies)
+
+The "(64628 ticks this GP)" indicates that this CPU has taken more
+than 64,000 scheduling-clock interrupts during the current stalled
+grace period.  If the CPU was not yet aware of the current grace
+period (for example, if it was offline), then this part of the message
+indicates how many grace periods behind the CPU is.
+
+The "idle=" portion of the message prints the dyntick-idle state.
+The hex number before the first "/" is the low-order 12 bits of the
+dynticks counter, which will have an even-numbered value if the CPU is
+in dyntick-idle mode and an odd-numbered value otherwise.  The hex
+number between the two "/"s is the value of the nesting, which will
+be a small positive number if in the idle loop and a very large positive
+number (as shown above) otherwise.
+
+The "softirq=" portion of the message tracks the number of RCU softirq
+handlers that the stalled CPU has executed.  The number before the "/"
+is the number that had executed since boot at the time that this CPU
+last noted the beginning of a grace period, which might be the current
+(stalled) grace period, or it might be some earlier grace period (for
+example, if the CPU might have been in dyntick-idle mode for an extended
+time period.  The number after the "/" is the number that have executed
+since boot until the current time.  If this latter number stays constant
+across repeated stall-warning messages, it is possible that RCU's softirq
+handlers are no longer able to execute on this CPU.  This can happen if
+the stalled CPU is spinning with interrupts are disabled, or, in -rt
+kernels, if a high-priority process is starving RCU's softirq handler.
+
+For CONFIG_RCU_FAST_NO_HZ kernels, the "last_accelerate:" prints the
+low-order 16 bits (in hex) of the jiffies counter when this CPU last
+invoked rcu_try_advance_all_cbs() from rcu_needs_cpu() or last invoked
+rcu_accelerate_cbs() from rcu_prepare_for_idle().  The "nonlazy_posted:"
+prints the number of non-lazy callbacks posted since the last call to
+rcu_needs_cpu().  Finally, an "L" indicates that there are currently
+no non-lazy callbacks ("." is printed otherwise, as shown above) and
+"D" indicates that dyntick-idle processing is enabled ("." is printed
+otherwise, for example, if disabled via the "nohz=" kernel boot parameter).
+
+
+Multiple Warnings From One Stall
+
+If a stall lasts long enough, multiple stall-warning messages will be
+printed for it.  The second and subsequent messages are printed at
+longer intervals, so that the time between (say) the first and second
+message will be about three times the interval between the beginning
+of the stall and the first message.
+
+
+What Causes RCU CPU Stall Warnings?
+
+So your kernel printed an RCU CPU stall warning.  The next question is
+"What caused it?"  The following problems can result in RCU CPU stall
+warnings:
+
+o	A CPU looping in an RCU read-side critical section.
+	
+o	A CPU looping with interrupts disabled.  This condition can
+	result in RCU-sched and RCU-bh stalls.
+
+o	A CPU looping with preemption disabled.  This condition can
+	result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
+	stalls.
+
+o	A CPU looping with bottom halves disabled.  This condition can
+	result in RCU-sched and RCU-bh stalls.
+
+o	For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
+	without invoking schedule().
+
+o	A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
+	happen to preempt a low-priority task in the middle of an RCU
+	read-side critical section.   This is especially damaging if
+	that low-priority task is not permitted to run on any other CPU,
+	in which case the next RCU grace period can never complete, which
+	will eventually cause the system to run out of memory and hang.
+	While the system is in the process of running itself out of
+	memory, you might see stall-warning messages.
+
+o	A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
+	is running at a higher priority than the RCU softirq threads.
+	This will prevent RCU callbacks from ever being invoked,
+	and in a CONFIG_TREE_PREEMPT_RCU kernel will further prevent
+	RCU grace periods from ever completing.  Either way, the
+	system will eventually run out of memory and hang.  In the
+	CONFIG_TREE_PREEMPT_RCU case, you might see stall-warning
+	messages.
+
+o	A hardware or software issue shuts off the scheduler-clock
+	interrupt on a CPU that is not in dyntick-idle mode.  This
+	problem really has happened, and seems to be most likely to
+	result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
+
+o	A bug in the RCU implementation.
+
+o	A hardware failure.  This is quite unlikely, but has occurred
+	at least once in real life.  A CPU failed in a running system,
+	becoming unresponsive, but not causing an immediate crash.
+	This resulted in a series of RCU CPU stall warnings, eventually
+	leading the realization that the CPU had failed.
+
+The RCU, RCU-sched, and RCU-bh implementations have CPU stall warning.
+SRCU does not have its own CPU stall warnings, but its calls to
+synchronize_sched() will result in RCU-sched detecting RCU-sched-related
+CPU stalls.  Please note that RCU only detects CPU stalls when there is
+a grace period in progress.  No grace period, no CPU stall warnings.
+
+To diagnose the cause of the stall, inspect the stack traces.
+The offending function will usually be near the top of the stack.
+If you have a series of stall warnings from a single extended stall,
+comparing the stack traces can often help determine where the stall
+is occurring, which will usually be in the function nearest the top of
+that portion of the stack which remains the same from trace to trace.
+If you can reliably trigger the stall, ftrace can be quite helpful.
+
+RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
+and with RCU's event tracing.
--- a/Documentation/RCU/torture.txt
+++ b/Documentation/RCU/torture.txt
@@ -0,0 +1,330 @@
+RCU Torture Test Operation
+
+
+CONFIG_RCU_TORTURE_TEST
+
+The CONFIG_RCU_TORTURE_TEST config option is available for all RCU
+implementations.  It creates an rcutorture kernel module that can
+be loaded to run a torture test.  The test periodically outputs
+status messages via printk(), which can be examined via the dmesg
+command (perhaps grepping for "torture").  The test is started
+when the module is loaded, and stops when the module is unloaded.
+
+CONFIG_RCU_TORTURE_TEST_RUNNABLE
+
+It is also possible to specify CONFIG_RCU_TORTURE_TEST=y, which will
+result in the tests being loaded into the base kernel.  In this case,
+the CONFIG_RCU_TORTURE_TEST_RUNNABLE config option is used to specify
+whether the RCU torture tests are to be started immediately during
+boot or whether the /proc/sys/kernel/rcutorture_runnable file is used
+to enable them.  This /proc file can be used to repeatedly pause and
+restart the tests, regardless of the initial state specified by the
+CONFIG_RCU_TORTURE_TEST_RUNNABLE config option.
+
+You will normally -not- want to start the RCU torture tests during boot
+(and thus the default is CONFIG_RCU_TORTURE_TEST_RUNNABLE=n), but doing
+this can sometimes be useful in finding boot-time bugs.
+
+
+MODULE PARAMETERS
+
+This module has the following parameters:
+
+fqs_duration	Duration (in microseconds) of artificially induced bursts
+		of force_quiescent_state() invocations.  In RCU
+		implementations having force_quiescent_state(), these
+		bursts help force races between forcing a given grace
+		period and that grace period ending on its own.
+
+fqs_holdoff	Holdoff time (in microseconds) between consecutive calls
+		to force_quiescent_state() within a burst.
+
+fqs_stutter	Wait time (in seconds) between consecutive bursts
+		of calls to force_quiescent_state().
+
+irqreader	Says to invoke RCU readers from irq level.  This is currently
+		done via timers.  Defaults to "1" for variants of RCU that
+		permit this.  (Or, more accurately, variants of RCU that do
+		-not- permit this know to ignore this variable.)
+
+n_barrier_cbs	If this is nonzero, RCU barrier testing will be conducted,
+		in which case n_barrier_cbs specifies the number of
+		RCU callbacks (and corresponding kthreads) to use for
+		this testing.  The value cannot be negative.  If you
+		specify this to be non-zero when torture_type indicates a
+		synchronous RCU implementation (one for which a member of
+		the synchronize_rcu() rather than the call_rcu() family is
+		used -- see the documentation for torture_type below), an
+		error will be reported and no testing will be carried out.
+
+nfakewriters	This is the number of RCU fake writer threads to run.  Fake
+		writer threads repeatedly use the synchronous "wait for
+		current readers" function of the interface selected by
+		torture_type, with a delay between calls to allow for various
+		different numbers of writers running in parallel.
+		nfakewriters defaults to 4, which provides enough parallelism
+		to trigger special cases caused by multiple writers, such as
+		the synchronize_srcu() early return optimization.
+
+nreaders	This is the number of RCU reading threads supported.
+		The default is twice the number of CPUs.  Why twice?
+		To properly exercise RCU implementations with preemptible
+		read-side critical sections.
+
+onoff_interval
+		The number of seconds between each attempt to execute a
+		randomly selected CPU-hotplug operation.  Defaults to
+		zero, which disables CPU hotplugging.  In HOTPLUG_CPU=n
+		kernels, rcutorture will silently refuse to do any
+		CPU-hotplug operations regardless of what value is
+		specified for onoff_interval.
+
+onoff_holdoff	The number of seconds to wait until starting CPU-hotplug
+		operations.  This would normally only be used when
+		rcutorture was built into the kernel and started
+		automatically at boot time, in which case it is useful
+		in order to avoid confusing boot-time code with CPUs
+		coming and going.
+
+shuffle_interval
+		The number of seconds to keep the test threads affinitied
+		to a particular subset of the CPUs, defaults to 3 seconds.
+		Used in conjunction with test_no_idle_hz.
+
+shutdown_secs	The number of seconds to run the test before terminating
+		the test and powering off the system.  The default is
+		zero, which disables test termination and system shutdown.
+		This capability is useful for automated testing.
+
+stall_cpu	The number of seconds that a CPU should be stalled while
+		within both an rcu_read_lock() and a preempt_disable().
+		This stall happens only once per rcutorture run.
+		If you need multiple stalls, use modprobe and rmmod to
+		repeatedly run rcutorture.  The default for stall_cpu
+		is zero, which prevents rcutorture from stalling a CPU.
+
+		Note that attempts to rmmod rcutorture while the stall
+		is ongoing will hang, so be careful what value you
+		choose for this module parameter!  In addition, too-large
+		values for stall_cpu might well induce failures and
+		warnings in other parts of the kernel.  You have been
+		warned!
+
+stall_cpu_holdoff
+		The number of seconds to wait after rcutorture starts
+		before stalling a CPU.  Defaults to 10 seconds.
+
+stat_interval	The number of seconds between output of torture
+		statistics (via printk()).  Regardless of the interval,
+		statistics are printed when the module is unloaded.
+		Setting the interval to zero causes the statistics to
+		be printed -only- when the module is unloaded, and this
+		is the default.
+
+stutter		The length of time to run the test before pausing for this
+		same period of time.  Defaults to "stutter=5", so as
+		to run and pause for (roughly) five-second intervals.
+		Specifying "stutter=0" causes the test to run continuously
+		without pausing, which is the old default behavior.
+
+test_boost	Whether or not to test the ability of RCU to do priority
+		boosting.  Defaults to "test_boost=1", which performs
+		RCU priority-inversion testing only if the selected
+		RCU implementation supports priority boosting.  Specifying
+		"test_boost=0" never performs RCU priority-inversion
+		testing.  Specifying "test_boost=2" performs RCU
+		priority-inversion testing even if the selected RCU
+		implementation does not support RCU priority boosting,
+		which can be used to test rcutorture's ability to
+		carry out RCU priority-inversion testing.
+
+test_boost_interval
+		The number of seconds in an RCU priority-inversion test
+		cycle.	Defaults to "test_boost_interval=7".  It is
+		usually wise for this value to be relatively prime to
+		the value selected for "stutter".
+
+test_boost_duration
+		The number of seconds to do RCU priority-inversion testing
+		within any given "test_boost_interval".  Defaults to
+		"test_boost_duration=4".
+
+test_no_idle_hz	Whether or not to test the ability of RCU to operate in
+		a kernel that disables the scheduling-clock interrupt to
+		idle CPUs.  Boolean parameter, "1" to test, "0" otherwise.
+		Defaults to omitting this test.
+
+torture_type	The type of RCU to test, with string values as follows:
+
+		"rcu":  rcu_read_lock(), rcu_read_unlock() and call_rcu().
+
+		"rcu_sync":  rcu_read_lock(), rcu_read_unlock(), and
+			synchronize_rcu().
+
+		"rcu_expedited": rcu_read_lock(), rcu_read_unlock(), and
+			synchronize_rcu_expedited().
+
+		"rcu_bh": rcu_read_lock_bh(), rcu_read_unlock_bh(), and
+			call_rcu_bh().
+
+		"rcu_bh_sync": rcu_read_lock_bh(), rcu_read_unlock_bh(),
+			and synchronize_rcu_bh().
+
+		"rcu_bh_expedited": rcu_read_lock_bh(), rcu_read_unlock_bh(),
+			and synchronize_rcu_bh_expedited().
+
+		"srcu": srcu_read_lock(), srcu_read_unlock() and
+			call_srcu().
+
+		"srcu_sync": srcu_read_lock(), srcu_read_unlock() and
+			synchronize_srcu().
+
+		"srcu_expedited": srcu_read_lock(), srcu_read_unlock() and
+			synchronize_srcu_expedited().
+
+		"srcu_raw": srcu_read_lock_raw(), srcu_read_unlock_raw(),
+			and call_srcu().
+
+		"srcu_raw_sync": srcu_read_lock_raw(), srcu_read_unlock_raw(),
+			and synchronize_srcu().
+
+		"sched": preempt_disable(), preempt_enable(), and
+			call_rcu_sched().
+
+		"sched_sync": preempt_disable(), preempt_enable(), and
+			synchronize_sched().
+
+		"sched_expedited": preempt_disable(), preempt_enable(), and
+			synchronize_sched_expedited().
+
+		Defaults to "rcu".
+
+verbose		Enable debug printk()s.  Default is disabled.
+
+
+OUTPUT
+
+The statistics output is as follows:
+
+	rcu-torture:--- Start of test: nreaders=16 nfakewriters=4 stat_interval=30 verbose=0 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4
+	rcu-torture: rtc:           (null) ver: 155441 tfle: 0 rta: 155441 rtaf: 8884 rtf: 155440 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 3055767
+	rcu-torture: Reader Pipe:  727860534 34213 0 0 0 0 0 0 0 0 0
+	rcu-torture: Reader Batch:  727877838 17003 0 0 0 0 0 0 0 0 0
+	rcu-torture: Free-Block Circulation:  155440 155440 155440 155440 155440 155440 155440 155440 155440 155440 0
+	rcu-torture:--- End of test: SUCCESS: nreaders=16 nfakewriters=4 stat_interval=30 verbose=0 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4
+
+The command "dmesg | grep torture:" will extract this information on
+most systems.  On more esoteric configurations, it may be necessary to
+use other commands to access the output of the printk()s used by
+the RCU torture test.  The printk()s use KERN_ALERT, so they should
+be evident.  ;-)
+
+The first and last lines show the rcutorture module parameters, and the
+last line shows either "SUCCESS" or "FAILURE", based on rcutorture's
+automatic determination as to whether RCU operated correctly.
+
+The entries are as follows:
+
+o	"rtc": The hexadecimal address of the structure currently visible
+	to readers.
+
+o	"ver": The number of times since boot that the RCU writer task
+	has changed the structure visible to readers.
+
+o	"tfle": If non-zero, indicates that the "torture freelist"
+	containing structures to be placed into the "rtc" area is empty.
+	This condition is important, since it can fool you into thinking
+	that RCU is working when it is not.  :-/
+
+o	"rta": Number of structures allocated from the torture freelist.
+
+o	"rtaf": Number of allocations from the torture freelist that have
+	failed due to the list being empty.  It is not unusual for this
+	to be non-zero, but it is bad for it to be a large fraction of
+	the value indicated by "rta".
+
+o	"rtf": Number of frees into the torture freelist.
+
+o	"rtmbe": A non-zero value indicates that rcutorture believes that
+	rcu_assign_pointer() and rcu_dereference() are not working
+	correctly.  This value should be zero.
+
+o	"rtbe": A non-zero value indicates that one of the rcu_barrier()
+	family of functions is not working correctly.
+
+o	"rtbke": rcutorture was unable to create the real-time kthreads
+	used to force RCU priority inversion.  This value should be zero.
+
+o	"rtbre": Although rcutorture successfully created the kthreads
+	used to force RCU priority inversion, it was unable to set them
+	to the real-time priority level of 1.  This value should be zero.
+
+o	"rtbf": The number of times that RCU priority boosting failed
+	to resolve RCU priority inversion.
+
+o	"rtb": The number of times that rcutorture attempted to force
+	an RCU priority inversion condition.  If you are testing RCU
+	priority boosting via the "test_boost" module parameter, this
+	value should be non-zero.
+
+o	"nt": The number of times rcutorture ran RCU read-side code from
+	within a timer handler.  This value should be non-zero only
+	if you specified the "irqreader" module parameter.
+
+o	"Reader Pipe": Histogram of "ages" of structures seen by readers.
+	If any entries past the first two are non-zero, RCU is broken.
+	And rcutorture prints the error flag string "!!!" to make sure
+	you notice.  The age of a newly allocated structure is zero,
+	it becomes one when removed from reader visibility, and is
+	incremented once per grace period subsequently -- and is freed
+	after passing through (RCU_TORTURE_PIPE_LEN-2) grace periods.
+
+	The output displayed above was taken from a correctly working
+	RCU.  If you want to see what it looks like when broken, break
+	it yourself.  ;-)
+
+o	"Reader Batch": Another histogram of "ages" of structures seen
+	by readers, but in terms of counter flips (or batches) rather
+	than in terms of grace periods.  The legal number of non-zero
+	entries is again two.  The reason for this separate view is that
+	it is sometimes easier to get the third entry to show up in the
+	"Reader Batch" list than in the "Reader Pipe" list.
+
+o	"Free-Block Circulation": Shows the number of torture structures
+	that have reached a given point in the pipeline.  The first element
+	should closely correspond to the number of structures allocated,
+	the second to the number that have been removed from reader view,
+	and all but the last remaining to the corresponding number of
+	passes through a grace period.  The last entry should be zero,
+	as it is only incremented if a torture structure's counter
+	somehow gets incremented farther than it should.
+
+Different implementations of RCU can provide implementation-specific
+additional information.  For example, SRCU provides the following
+additional line:
+
+	srcu-torture: per-CPU(idx=1): 0(0,1) 1(0,1) 2(0,0) 3(0,1)
+
+This line shows the per-CPU counter state.  The numbers in parentheses are
+the values of the "old" and "current" counters for the corresponding CPU.
+The "idx" value maps the "old" and "current" values to the underlying
+array, and is useful for debugging.
+
+
+USAGE
+
+The following script may be used to torture RCU:
+
+	#!/bin/sh
+
+	modprobe rcutorture
+	sleep 3600
+	rmmod rcutorture
+	dmesg | grep torture:
+
+The output can be manually inspected for the error flag of "!!!".
+One could of course create a more elaborate script that automatically
+checked for such errors.  The "rmmod" command forces a "SUCCESS",
+"FAILURE", or "RCU_HOTPLUG" indication to be printk()ed.  The first
+two are self-explanatory, while the last indicates that while there
+were no RCU failures, CPU-hotplug problems were detected.
--- a/Documentation/RCU/trace.txt
+++ b/Documentation/RCU/trace.txt
@@ -0,0 +1,642 @@
+CONFIG_RCU_TRACE debugfs Files and Formats
+
+
+The rcutree and rcutiny implementations of RCU provide debugfs trace
+output that summarizes counters and state.  This information is useful for
+debugging RCU itself, and can sometimes also help to debug abuses of RCU.
+The following sections describe the debugfs files and formats, first
+for rcutree and next for rcutiny.
+
+
+CONFIG_TREE_RCU and CONFIG_TREE_PREEMPT_RCU debugfs Files and Formats
+
+These implementations of RCU provide several debugfs directories under the
+top-level directory "rcu":
+
+rcu/rcu_bh
+rcu/rcu_preempt
+rcu/rcu_sched
+
+Each directory contains files for the corresponding flavor of RCU.
+Note that rcu/rcu_preempt is only present for CONFIG_TREE_PREEMPT_RCU.
+For CONFIG_TREE_RCU, the RCU flavor maps onto the RCU-sched flavor,
+so that activity for both appears in rcu/rcu_sched.
+
+In addition, the following file appears in the top-level directory:
+rcu/rcutorture.  This file displays rcutorture test progress.  The output
+of "cat rcu/rcutorture" looks as follows:
+
+rcutorture test sequence: 0 (test in progress)
+rcutorture update version number: 615
+
+The first line shows the number of rcutorture tests that have completed
+since boot.  If a test is currently running, the "(test in progress)"
+string will appear as shown above.  The second line shows the number of
+update cycles that the current test has started, or zero if there is
+no test in progress.
+
+
+Within each flavor directory (rcu/rcu_bh, rcu/rcu_sched, and possibly
+also rcu/rcu_preempt) the following files will be present:
+
+rcudata:
+	Displays fields in struct rcu_data.
+rcuexp:
+	Displays statistics for expedited grace periods.
+rcugp:
+	Displays grace-period counters.
+rcuhier:
+	Displays the struct rcu_node hierarchy.
+rcu_pending:
+	Displays counts of the reasons rcu_pending() decided that RCU had
+	work to do.
+rcuboost:
+	Displays RCU boosting statistics.  Only present if
+	CONFIG_RCU_BOOST=y.
+
+The output of "cat rcu/rcu_preempt/rcudata" looks as follows:
+
+  0!c=30455 g=30456 pq=1 qp=1 dt=126535/140000000000000/0 df=2002 of=4 ql=0/0 qs=N... b=10 ci=74572 nci=0 co=1131 ca=716
+  1!c=30719 g=30720 pq=1 qp=0 dt=132007/140000000000000/0 df=1874 of=10 ql=0/0 qs=N... b=10 ci=123209 nci=0 co=685 ca=982
+  2!c=30150 g=30151 pq=1 qp=1 dt=138537/140000000000000/0 df=1707 of=8 ql=0/0 qs=N... b=10 ci=80132 nci=0 co=1328 ca=1458
+  3 c=31249 g=31250 pq=1 qp=0 dt=107255/140000000000000/0 df=1749 of=6 ql=0/450 qs=NRW. b=10 ci=151700 nci=0 co=509 ca=622
+  4!c=29502 g=29503 pq=1 qp=1 dt=83647/140000000000000/0 df=965 of=5 ql=0/0 qs=N... b=10 ci=65643 nci=0 co=1373 ca=1521
+  5 c=31201 g=31202 pq=1 qp=1 dt=70422/0/0 df=535 of=7 ql=0/0 qs=.... b=10 ci=58500 nci=0 co=764 ca=698
+  6!c=30253 g=30254 pq=1 qp=1 dt=95363/140000000000000/0 df=780 of=5 ql=0/0 qs=N... b=10 ci=100607 nci=0 co=1414 ca=1353
+  7 c=31178 g=31178 pq=1 qp=0 dt=91536/0/0 df=547 of=4 ql=0/0 qs=.... b=10 ci=109819 nci=0 co=1115 ca=969
+
+This file has one line per CPU, or eight for this 8-CPU system.
+The fields are as follows:
+
+o	The number at the beginning of each line is the CPU number.
+	CPUs numbers followed by an exclamation mark are offline,
+	but have been online at least once since boot.	There will be
+	no output for CPUs that have never been online, which can be
+	a good thing in the surprisingly common case where NR_CPUS is
+	substantially larger than the number of actual CPUs.
+
+o	"c" is the count of grace periods that this CPU believes have
+	completed.  Offlined CPUs and CPUs in dynticks idle mode may lag
+	quite a ways behind, for example, CPU 4 under "rcu_sched" above,
+	which has been offline through 16 RCU grace periods.  It is not
+	unusual to see offline CPUs lagging by thousands of grace periods.
+	Note that although the grace-period number is an unsigned long,
+	it is printed out as a signed long to allow more human-friendly
+	representation near boot time.
+
+o	"g" is the count of grace periods that this CPU believes have
+	started.  Again, offlined CPUs and CPUs in dynticks idle mode
+	may lag behind.  If the "c" and "g" values are equal, this CPU
+	has already reported a quiescent state for the last RCU grace
+	period that it is aware of, otherwise, the CPU believes that it
+	owes RCU a quiescent state.
+
+o	"pq" indicates that this CPU has passed through a quiescent state
+	for the current grace period.  It is possible for "pq" to be
+	"1" and "c" different than "g", which indicates that although
+	the CPU has passed through a quiescent state, either (1) this
+	CPU has not yet reported that fact, (2) some other CPU has not
+	yet reported for this grace period, or (3) both.
+
+o	"qp" indicates that RCU still expects a quiescent state from
+	this CPU.  Offlined CPUs and CPUs in dyntick idle mode might
+	well have qp=1, which is OK: RCU is still ignoring them.
+
+o	"dt" is the current value of the dyntick counter that is incremented
+	when entering or leaving idle, either due to a context switch or
+	due to an interrupt.  This number is even if the CPU is in idle
+	from RCU's viewpoint and odd otherwise.  The number after the
+	first "/" is the interrupt nesting depth when in idle state,
+	or a large number added to the interrupt-nesting depth when
+	running a non-idle task.  Some architectures do not accurately
+	count interrupt nesting when running in non-idle kernel context,
+	which can result in interesting anomalies such as negative
+	interrupt-nesting levels.  The number after the second "/"
+	is the NMI nesting depth.
+
+o	"df" is the number of times that some other CPU has forced a
+	quiescent state on behalf of this CPU due to this CPU being in
+	idle state.
+
+o	"of" is the number of times that some other CPU has forced a
+	quiescent state on behalf of this CPU due to this CPU being
+	offline.  In a perfect world, this might never happen, but it
+	turns out that offlining and onlining a CPU can take several grace
+	periods, and so there is likely to be an extended period of time
+	when RCU believes that the CPU is online when it really is not.
+	Please note that erring in the other direction (RCU believing a
+	CPU is offline when it is really alive and kicking) is a fatal
+	error, so it makes sense to err conservatively.
+
+o	"ql" is the number of RCU callbacks currently residing on
+	this CPU.  The first number is the number of "lazy" callbacks
+	that are known to RCU to only be freeing memory, and the number
+	after the "/" is the total number of callbacks, lazy or not.
+	These counters count callbacks regardless of what phase of
+	grace-period processing that they are in (new, waiting for
+	grace period to start, waiting for grace period to end, ready
+	to invoke).
+
+o	"qs" gives an indication of the state of the callback queue
+	with four characters:
+
+	"N"	Indicates that there are callbacks queued that are not
+		ready to be handled by the next grace period, and thus
+		will be handled by the grace period following the next
+		one.
+
+	"R"	Indicates that there are callbacks queued that are
+		ready to be handled by the next grace period.
+
+	"W"	Indicates that there are callbacks queued that are
+		waiting on the current grace period.
+
+	"D"	Indicates that there are callbacks queued that have
+		already been handled by a prior grace period, and are
+		thus waiting to be invoked.  Note that callbacks in
+		the process of being invoked are not counted here.
+		Callbacks in the process of being invoked are those
+		that have been removed from the rcu_data structures
+		queues by rcu_do_batch(), but which have not yet been
+		invoked.
+
+	If there are no callbacks in a given one of the above states,
+	the corresponding character is replaced by ".".
+
+o	"b" is the batch limit for this CPU.  If more than this number
+	of RCU callbacks is ready to invoke, then the remainder will
+	be deferred.
+
+o	"ci" is the number of RCU callbacks that have been invoked for
+	this CPU.  Note that ci+nci+ql is the number of callbacks that have
+	been registered in absence of CPU-hotplug activity.
+
+o	"nci" is the number of RCU callbacks that have been offloaded from
+	this CPU.  This will always be zero unless the kernel was built
+	with CONFIG_RCU_NOCB_CPU=y and the "rcu_nocbs=" kernel boot
+	parameter was specified.
+
+o	"co" is the number of RCU callbacks that have been orphaned due to
+	this CPU going offline.  These orphaned callbacks have been moved
+	to an arbitrarily chosen online CPU.
+
+o	"ca" is the number of RCU callbacks that have been adopted by this
+	CPU due to other CPUs going offline.  Note that ci+co-ca+ql is
+	the number of RCU callbacks registered on this CPU.
+
+
+Kernels compiled with CONFIG_RCU_BOOST=y display the following from
+/debug/rcu/rcu_preempt/rcudata:
+
+  0!c=12865 g=12866 pq=1 qp=1 dt=83113/140000000000000/0 df=288 of=11 ql=0/0 qs=N... kt=0/O ktl=944 b=10 ci=60709 nci=0 co=748 ca=871
+  1 c=14407 g=14408 pq=1 qp=0 dt=100679/140000000000000/0 df=378 of=7 ql=0/119 qs=NRW. kt=0/W ktl=9b6 b=10 ci=109740 nci=0 co=589 ca=485
+  2 c=14407 g=14408 pq=1 qp=0 dt=105486/0/0 df=90 of=9 ql=0/89 qs=NRW. kt=0/W ktl=c0c b=10 ci=83113 nci=0 co=533 ca=490
+  3 c=14407 g=14408 pq=1 qp=0 dt=107138/0/0 df=142 of=8 ql=0/188 qs=NRW. kt=0/W ktl=b96 b=10 ci=121114 nci=0 co=426 ca=290
+  4 c=14405 g=14406 pq=1 qp=1 dt=50238/0/0 df=706 of=7 ql=0/0 qs=.... kt=0/W ktl=812 b=10 ci=34929 nci=0 co=643 ca=114
+  5!c=14168 g=14169 pq=1 qp=0 dt=45465/140000000000000/0 df=161 of=11 ql=0/0 qs=N... kt=0/O ktl=b4d b=10 ci=47712 nci=0 co=677 ca=722
+  6 c=14404 g=14405 pq=1 qp=0 dt=59454/0/0 df=94 of=6 ql=0/0 qs=.... kt=0/W ktl=e57 b=10 ci=55597 nci=0 co=701 ca=811
+  7 c=14407 g=14408 pq=1 qp=1 dt=68850/0/0 df=31 of=8 ql=0/0 qs=.... kt=0/W ktl=14bd b=10 ci=77475 nci=0 co=508 ca=1042
+
+This is similar to the output discussed above, but contains the following
+additional fields:
+
+o	"kt" is the per-CPU kernel-thread state.  The digit preceding
+	the first slash is zero if there is no work pending and 1
+	otherwise.  The character between the first pair of slashes is
+	as follows:
+
+	"S"	The kernel thread is stopped, in other words, all
+		CPUs corresponding to this rcu_node structure are
+		offline.
+
+	"R"	The kernel thread is running.
+
+	"W"	The kernel thread is waiting because there is no work
+		for it to do.
+
+	"O"	The kernel thread is waiting because it has been
+		forced off of its designated CPU or because its
+		->cpus_allowed mask permits it to run on other than
+		its designated CPU.
+
+	"Y"	The kernel thread is yielding to avoid hogging CPU.
+
+	"?"	Unknown value, indicates a bug.
+
+	The number after the final slash is the CPU that the kthread
+	is actually running on.
+
+	This field is displayed only for CONFIG_RCU_BOOST kernels.
+
+o	"ktl" is the low-order 16 bits (in hexadecimal) of the count of
+	the number of times that this CPU's per-CPU kthread has gone
+	through its loop servicing invoke_rcu_cpu_kthread() requests.
+
+	This field is displayed only for CONFIG_RCU_BOOST kernels.
+
+
+The output of "cat rcu/rcu_preempt/rcuexp" looks as follows:
+
+s=21872 d=21872 w=0 tf=0 wd1=0 wd2=0 n=0 sc=21872 dt=21872 dl=0 dx=21872
+
+These fields are as follows:
+
+o	"s" is the starting sequence number.
+
+o	"d" is the ending sequence number.  When the starting and ending
+	numbers differ, there is an expedited grace period in progress.
+
+o	"w" is the number of times that the sequence numbers have been
+	in danger of wrapping.
+
+o	"tf" is the number of times that contention has resulted in a
+	failure to begin an expedited grace period.
+
+o	"wd1" and "wd2" are the number of times that an attempt to
+	start an expedited grace period found that someone else had
+	completed an expedited grace period that satisfies the
+	attempted request.  "Our work is done."
+
+o	"n" is number of times that contention was so great that
+	the request was demoted from an expedited grace period to
+	a normal grace period.
+
+o	"sc" is the number of times that the attempt to start a
+	new expedited grace period succeeded.
+
+o	"dt" is the number of times that we attempted to update
+	the "d" counter.
+
+o	"dl" is the number of times that we failed to update the "d"
+	counter.
+
+o	"dx" is the number of times that we succeeded in updating
+	the "d" counter.
+
+
+The output of "cat rcu/rcu_preempt/rcugp" looks as follows:
+
+completed=31249  gpnum=31250  age=1  max=18
+
+These fields are taken from the rcu_state structure, and are as follows:
+
+o	"completed" is the number of grace periods that have completed.
+	It is comparable to the "c" field from rcu/rcudata in that a
+	CPU whose "c" field matches the value of "completed" is aware
+	that the corresponding RCU grace period has completed.
+
+o	"gpnum" is the number of grace periods that have started.  It is
+	similarly comparable to the "g" field from rcu/rcudata in that
+	a CPU whose "g" field matches the value of "gpnum" is aware that
+	the corresponding RCU grace period has started.
+
+	If these two fields are equal, then there is no grace period
+	in progress, in other words, RCU is idle.  On the other hand,
+	if the two fields differ (as they are above), then an RCU grace
+	period is in progress.
+
+o	"age" is the number of jiffies that the current grace period
+	has extended for, or zero if there is no grace period currently
+	in effect.
+
+o	"max" is the age in jiffies of the longest-duration grace period
+	thus far.
+
+The output of "cat rcu/rcu_preempt/rcuhier" looks as follows:
+
+c=14407 g=14408 s=0 jfq=2 j=c863 nfqs=12040/nfqsng=0(12040) fqlh=1051 oqlen=0/0
+3/3 ..>. 0:7 ^0
+e/e ..>. 0:3 ^0    d/d ..>. 4:7 ^1
+
+The fields are as follows:
+
+o	"c" is exactly the same as "completed" under rcu/rcu_preempt/rcugp.
+
+o	"g" is exactly the same as "gpnum" under rcu/rcu_preempt/rcugp.
+
+o	"s" is the current state of the force_quiescent_state()
+	state machine.
+
+o	"jfq" is the number of jiffies remaining for this grace period
+	before force_quiescent_state() is invoked to help push things
+	along.	Note that CPUs in idle mode throughout the grace period
+	will not report on their own, but rather must be check by some
+	other CPU via force_quiescent_state().
+
+o	"j" is the low-order four hex digits of the jiffies counter.
+	Yes, Paul did run into a number of problems that turned out to
+	be due to the jiffies counter no longer counting.  Why do you ask?
+
+o	"nfqs" is the number of calls to force_quiescent_state() since
+	boot.
+
+o	"nfqsng" is the number of useless calls to force_quiescent_state(),
+	where there wasn't actually a grace period active.  This can
+	no longer happen due to grace-period processing being pushed
+	into a kthread.  The number in parentheses is the difference
+	between "nfqs" and "nfqsng", or the number of times that
+	force_quiescent_state() actually did some real work.
+
+o	"fqlh" is the number of calls to force_quiescent_state() that
+	exited immediately (without even being counted in nfqs above)
+	due to contention on ->fqslock.
+
+o	Each element of the form "3/3 ..>. 0:7 ^0" represents one rcu_node
+	structure.  Each line represents one level of the hierarchy,
+	from root to leaves.  It is best to think of the rcu_data
+	structures as forming yet another level after the leaves.
+	Note that there might be either one, two, three, or even four
+	levels of rcu_node structures, depending on the relationship
+	between CONFIG_RCU_FANOUT, CONFIG_RCU_FANOUT_LEAF (possibly
+	adjusted using the rcu_fanout_leaf kernel boot parameter), and
+	CONFIG_NR_CPUS (possibly adjusted using the nr_cpu_ids count of
+	possible CPUs for the booting hardware).
+
+	o	The numbers separated by the "/" are the qsmask followed
+		by the qsmaskinit.  The qsmask will have one bit
+		set for each entity in the next lower level that has
+		not yet checked in for the current grace period ("e"
+		indicating CPUs 5, 6, and 7 in the example above).
+		The qsmaskinit will have one bit for each entity that is
+		currently expected to check in during each grace period.
+		The value of qsmaskinit is assigned to that of qsmask
+		at the beginning of each grace period.
+
+	o	The characters separated by the ">" indicate the state
+		of the blocked-tasks lists.  A "G" preceding the ">"
+		indicates that at least one task blocked in an RCU
+		read-side critical section blocks the current grace
+		period, while a "E" preceding the ">" indicates that
+		at least one task blocked in an RCU read-side critical
+		section blocks the current expedited grace period.
+		A "T" character following the ">" indicates that at
+		least one task is blocked within an RCU read-side
+		critical section, regardless of whether any current
+		grace period (expedited or normal) is inconvenienced.
+		A "." character appears if the corresponding condition
+		does not hold, so that "..>." indicates that no tasks
+		are blocked.  In contrast, "GE>T" indicates maximal
+		inconvenience from blocked tasks.  CONFIG_TREE_RCU
+		builds of the kernel will always show "..>.".
+
+	o	The numbers separated by the ":" are the range of CPUs
+		served by this struct rcu_node.  This can be helpful
+		in working out how the hierarchy is wired together.
+
+		For example, the example rcu_node structure shown above
+		has "0:7", indicating that it covers CPUs 0 through 7.
+
+	o	The number after the "^" indicates the bit in the
+		next higher level rcu_node structure that this rcu_node
+		structure corresponds to.  For example, the "d/d ..>. 4:7
+		^1" has a "1" in this position, indicating that it
+		corresponds to the "1" bit in the "3" shown in the
+		"3/3 ..>. 0:7 ^0" entry on the next level up.
+
+
+The output of "cat rcu/rcu_sched/rcu_pending" looks as follows:
+
+  0!np=26111 qsp=29 rpq=5386 cbr=1 cng=570 gpc=3674 gps=577 nn=15903
+  1!np=28913 qsp=35 rpq=6097 cbr=1 cng=448 gpc=3700 gps=554 nn=18113
+  2!np=32740 qsp=37 rpq=6202 cbr=0 cng=476 gpc=4627 gps=546 nn=20889
+  3 np=23679 qsp=22 rpq=5044 cbr=1 cng=415 gpc=3403 gps=347 nn=14469
+  4!np=30714 qsp=4 rpq=5574 cbr=0 cng=528 gpc=3931 gps=639 nn=20042
+  5 np=28910 qsp=2 rpq=5246 cbr=0 cng=428 gpc=4105 gps=709 nn=18422
+  6!np=38648 qsp=5 rpq=7076 cbr=0 cng=840 gpc=4072 gps=961 nn=25699
+  7 np=37275 qsp=2 rpq=6873 cbr=0 cng=868 gpc=3416 gps=971 nn=25147
+
+The fields are as follows:
+
+o	The leading number is the CPU number, with "!" indicating
+	an offline CPU.
+
+o	"np" is the number of times that __rcu_pending() has been invoked
+	for the corresponding flavor of RCU.
+
+o	"qsp" is the number of times that the RCU was waiting for a
+	quiescent state from this CPU.
+
+o	"rpq" is the number of times that the CPU had passed through
+	a quiescent state, but not yet reported it to RCU.
+
+o	"cbr" is the number of times that this CPU had RCU callbacks
+	that had passed through a grace period, and were thus ready
+	to be invoked.
+
+o	"cng" is the number of times that this CPU needed another
+	grace period while RCU was idle.
+
+o	"gpc" is the number of times that an old grace period had
+	completed, but this CPU was not yet aware of it.
+
+o	"gps" is the number of times that a new grace period had started,
+	but this CPU was not yet aware of it.
+
+o	"nn" is the number of times that this CPU needed nothing.
+
+
+The output of "cat rcu/rcuboost" looks as follows:
+
+0:3 tasks=.... kt=W ntb=0 neb=0 nnb=0 j=c864 bt=c894
+    balk: nt=0 egt=4695 bt=0 nb=0 ny=56 nos=0
+4:7 tasks=.... kt=W ntb=0 neb=0 nnb=0 j=c864 bt=c894
+    balk: nt=0 egt=6541 bt=0 nb=0 ny=126 nos=0
+
+This information is output only for rcu_preempt.  Each two-line entry
+corresponds to a leaf rcu_node strcuture.  The fields are as follows:
+
+o	"n:m" is the CPU-number range for the corresponding two-line
+	entry.  In the sample output above, the first entry covers
+	CPUs zero through three and the second entry covers CPUs four
+	through seven.
+
+o	"tasks=TNEB" gives the state of the various segments of the
+	rnp->blocked_tasks list:
+
+	"T"	This indicates that there are some tasks that blocked
+		while running on one of the corresponding CPUs while
+		in an RCU read-side critical section.
+
+	"N"	This indicates that some of the blocked tasks are preventing
+		the current normal (non-expedited) grace period from
+		completing.
+
+	"E"	This indicates that some of the blocked tasks are preventing
+		the current expedited grace period from completing.
+
+	"B"	This indicates that some of the blocked tasks are in
+		need of RCU priority boosting.
+
+	Each character is replaced with "." if the corresponding
+	condition does not hold.
+
+o	"kt" is the state of the RCU priority-boosting kernel
+	thread associated with the corresponding rcu_node structure.
+	The state can be one of the following:
+
+	"S"	The kernel thread is stopped, in other words, all
+		CPUs corresponding to this rcu_node structure are
+		offline.
+
+	"R"	The kernel thread is running.
+
+	"W"	The kernel thread is waiting because there is no work
+		for it to do.
+
+	"Y"	The kernel thread is yielding to avoid hogging CPU.
+
+	"?"	Unknown value, indicates a bug.
+
+o	"ntb" is the number of tasks boosted.
+
+o	"neb" is the number of tasks boosted in order to complete an
+	expedited grace period.
+
+o	"nnb" is the number of tasks boosted in order to complete a
+	normal (non-expedited) grace period.  When boosting a task
+	that was blocking both an expedited and a normal grace period,
+	it is counted against the expedited total above.
+
+o	"j" is the low-order 16 bits of the jiffies counter in
+	hexadecimal.
+
+o	"bt" is the low-order 16 bits of the value that the jiffies
+	counter will have when we next start boosting, assuming that
+	the current grace period does not end beforehand.  This is
+	also in hexadecimal.
+
+o	"balk: nt" counts the number of times we didn't boost (in
+	other words, we balked) even though it was time to boost because
+	there were no blocked tasks to boost.  This situation occurs
+	when there is one blocked task on one rcu_node structure and
+	none on some other rcu_node structure.
+
+o	"egt" counts the number of times we balked because although
+	there were blocked tasks, none of them were blocking the
+	current grace period, whether expedited or otherwise.
+
+o	"bt" counts the number of times we balked because boosting
+	had already been initiated for the current grace period.
+
+o	"nb" counts the number of times we balked because there
+	was at least one task blocking the current non-expedited grace
+	period that never had blocked.  If it is already running, it
+	just won't help to boost its priority!
+
+o	"ny" counts the number of times we balked because it was
+	not yet time to start boosting.
+
+o	"nos" counts the number of times we balked for other
+	reasons, e.g., the grace period ended first.
+
+
+CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU debugfs Files and Formats
+
+These implementations of RCU provides a single debugfs file under the
+top-level directory RCU, namely rcu/rcudata, which displays fields in
+rcu_bh_ctrlblk, rcu_sched_ctrlblk and, for CONFIG_TINY_PREEMPT_RCU,
+rcu_preempt_ctrlblk.
+
+The output of "cat rcu/rcudata" is as follows:
+
+rcu_preempt: qlen=24 gp=1097669 g197/p197/c197 tasks=...
+             ttb=. btg=no ntb=184 neb=0 nnb=183 j=01f7 bt=0274
+             normal balk: nt=1097669 gt=0 bt=371 b=0 ny=25073378 nos=0
+             exp balk: bt=0 nos=0
+rcu_sched: qlen: 0
+rcu_bh: qlen: 0
+
+This is split into rcu_preempt, rcu_sched, and rcu_bh sections, with the
+rcu_preempt section appearing only in CONFIG_TINY_PREEMPT_RCU builds.
+The last three lines of the rcu_preempt section appear only in
+CONFIG_RCU_BOOST kernel builds.  The fields are as follows:
+
+o	"qlen" is the number of RCU callbacks currently waiting either
+	for an RCU grace period or waiting to be invoked.  This is the
+	only field present for rcu_sched and rcu_bh, due to the
+	short-circuiting of grace period in those two cases.
+
+o	"gp" is the number of grace periods that have completed.
+
+o	"g197/p197/c197" displays the grace-period state, with the
+	"g" number being the number of grace periods that have started
+	(mod 256), the "p" number being the number of grace periods
+	that the CPU has responded to (also mod 256), and the "c"
+	number being the number of grace periods that have completed
+	(once again mode 256).
+
+	Why have both "gp" and "g"?  Because the data flowing into
+	"gp" is only present in a CONFIG_RCU_TRACE kernel.
+
+o	"tasks" is a set of bits.  The first bit is "T" if there are
+	currently tasks that have recently blocked within an RCU
+	read-side critical section, the second bit is "N" if any of the
+	aforementioned tasks are blocking the current RCU grace period,
+	and the third bit is "E" if any of the aforementioned tasks are
+	blocking the current expedited grace period.  Each bit is "."
+	if the corresponding condition does not hold.
+
+o	"ttb" is a single bit.  It is "B" if any of the blocked tasks
+	need to be priority boosted and "." otherwise.
+
+o	"btg" indicates whether boosting has been carried out during
+	the current grace period, with "exp" indicating that boosting
+	is in progress for an expedited grace period, "no" indicating
+	that boosting has not yet started for a normal grace period,
+	"begun" indicating that boosting has bebug for a normal grace
+	period, and "done" indicating that boosting has completed for
+	a normal grace period.
+
+o	"ntb" is the total number of tasks subjected to RCU priority boosting
+	periods since boot.
+
+o	"neb" is the number of expedited grace periods that have had
+	to resort to RCU priority boosting since boot.
+
+o	"nnb" is the number of normal grace periods that have had
+	to resort to RCU priority boosting since boot.
+
+o	"j" is the low-order 16 bits of the jiffies counter in hexadecimal.
+
+o	"bt" is the low-order 16 bits of the value that the jiffies counter
+	will have at the next time that boosting is scheduled to begin.
+
+o	In the line beginning with "normal balk", the fields are as follows:
+
+	o	"nt" is the number of times that the system balked from
+		boosting because there were no blocked tasks to boost.
+		Note that the system will balk from boosting even if the
+		grace period is overdue when the currently running task
+		is looping within an RCU read-side critical section.
+		There is no point in boosting in this case, because
+		boosting a running task won't make it run any faster.
+
+	o	"gt" is the number of times that the system balked
+		from boosting because, although there were blocked tasks,
+		none of them were preventing the current grace period
+		from completing.
+
+	o	"bt" is the number of times that the system balked
+		from boosting because boosting was already in progress.
+
+	o	"b" is the number of times that the system balked from
+		boosting because boosting had already completed for
+		the grace period in question.
+
+	o	"ny" is the number of times that the system balked from
+		boosting because it was not yet time to start boosting
+		the grace period in question.
+
+	o	"nos" is the number of times that the system balked from
+		boosting for inexplicable ("not otherwise specified")
+		reasons.  This can actually happen due to races involving
+		increments of the jiffies counter.
+
+o	In the line beginning with "exp balk", the fields are as follows:
+
+	o	"bt" is the number of times that the system balked from
+		boosting because there were no blocked tasks to boost.
+
+	o	"nos" is the number of times that the system balked from
+		 boosting for inexplicable ("not otherwise specified")
+		 reasons.
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt