Diffstat:
-rw-r--r--  Documentation/core-api/asm-annotations.rst (renamed from Documentation/asm-annotations.rst)  |  18
-rw-r--r--  Documentation/core-api/bus-virt-phys-mapping.rst  | 220
-rw-r--r--  Documentation/core-api/cpu_hotplug.rst  |   2
-rw-r--r--  Documentation/core-api/dma-api-howto.rst  |  14
-rw-r--r--  Documentation/core-api/dma-api.rst  |  14
-rw-r--r--  Documentation/core-api/entry.rst  | 279
-rw-r--r--  Documentation/core-api/idr.rst  |   3
-rw-r--r--  Documentation/core-api/index.rst  |  18
-rw-r--r--  Documentation/core-api/irq/irq-domain.rst  |   2
-rw-r--r--  Documentation/core-api/kernel-api.rst  |  12
-rw-r--r--  Documentation/core-api/kobject.rst  |  16
-rw-r--r--  Documentation/core-api/maple_tree.rst  | 217
-rw-r--r--  Documentation/core-api/mm-api.rst  |  30
-rw-r--r--  Documentation/core-api/pin_user_pages.rst  |  18
-rw-r--r--  Documentation/core-api/printk-formats.rst  |  10
-rw-r--r--  Documentation/core-api/printk-index.rst  | 137
-rw-r--r--  Documentation/core-api/protection-keys.rst  |  44
-rw-r--r--  Documentation/core-api/symbol-namespaces.rst  |   4
-rw-r--r--  Documentation/core-api/timekeeping.rst  |   1
-rw-r--r--  Documentation/core-api/watch_queue.rst (renamed from Documentation/watch_queue.rst)  |   0
-rw-r--r--  Documentation/core-api/wrappers/atomic_bitops.rst  |  18
-rw-r--r--  Documentation/core-api/wrappers/atomic_t.rst  |  19
-rw-r--r--  Documentation/core-api/wrappers/memory-barriers.rst  |  18
-rw-r--r--  Documentation/core-api/xarray.rst  |  14
24 files changed, 820 insertions(+), 308 deletions(-)
diff --git a/Documentation/asm-annotations.rst b/Documentation/core-api/asm-annotations.rst
index f4bf0f6395fb..bc514ed59887 100644
--- a/Documentation/asm-annotations.rst
+++ b/Documentation/core-api/asm-annotations.rst
@@ -43,10 +43,11 @@ annotated objects like this, tools can be run on them to generate more useful
information. In particular, on properly annotated objects, ``objtool`` can be
run to check and fix the object if needed. Currently, ``objtool`` can report
missing frame pointer setup/destruction in functions. It can also
-automatically generate annotations for :doc:`ORC unwinder <x86/orc-unwinder>`
+automatically generate annotations for the ORC unwinder
+(Documentation/x86/orc-unwinder.rst)
for most code. Both of these are especially important to support reliable
-stack traces which are in turn necessary for :doc:`Kernel live patching
-<livepatch/livepatch>`.
+stack traces which are in turn necessary for kernel live patching
+(Documentation/livepatch/livepatch.rst).
Caveat and Discussion
---------------------
@@ -130,14 +131,13 @@ denoting a range of code via ``SYM_*_START/END`` annotations.
In fact, this kind of annotation corresponds to the now deprecated ``ENTRY``
and ``ENDPROC`` macros.
-* ``SYM_FUNC_START_ALIAS`` and ``SYM_FUNC_START_LOCAL_ALIAS`` serve for those
- who decided to have two or more names for one function. The typical use is::
+* ``SYM_FUNC_ALIAS``, ``SYM_FUNC_ALIAS_LOCAL``, and ``SYM_FUNC_ALIAS_WEAK`` can
+ be used to define multiple names for a function. The typical use is::
- SYM_FUNC_START_ALIAS(__memset)
- SYM_FUNC_START(memset)
+ SYM_FUNC_START(__memset)
... asm insns ...
- SYM_FUNC_END(memset)
- SYM_FUNC_END_ALIAS(__memset)
+	SYM_FUNC_END(__memset)
+ SYM_FUNC_ALIAS(memset, __memset)
In this example, one can call ``__memset`` or ``memset`` with the same
result, except the debug information for the instructions is generated to
diff --git a/Documentation/core-api/bus-virt-phys-mapping.rst b/Documentation/core-api/bus-virt-phys-mapping.rst
deleted file mode 100644
index c72b24a7d52c..000000000000
--- a/Documentation/core-api/bus-virt-phys-mapping.rst
+++ /dev/null
@@ -1,220 +0,0 @@
-==========================================================
-How to access I/O mapped memory from within device drivers
-==========================================================
-
-:Author: Linus
-
-.. warning::
-
- The virt_to_bus() and bus_to_virt() functions have been
- superseded by the functionality provided by the PCI DMA interface
- (see Documentation/core-api/dma-api-howto.rst). They continue
- to be documented below for historical purposes, but new code
- must not use them. --davidm 00/12/12
-
-::
-
- [ This is a mail message in response to a query on IO mapping, thus the
- strange format for a "document" ]
-
-The AHA-1542 is a bus-master device, and your patch makes the driver give the
-controller the physical address of the buffers, which is correct on x86
-(because all bus master devices see the physical memory mappings directly).
-
-However, on many setups, there are actually **three** different ways of looking
-at memory addresses, and in this case we actually want the third, the
-so-called "bus address".
-
-Essentially, the three ways of addressing memory are (this is "real memory",
-that is, normal RAM--see later about other details):
-
- - CPU untranslated. This is the "physical" address. Physical address
- 0 is what the CPU sees when it drives zeroes on the memory bus.
-
- - CPU translated address. This is the "virtual" address, and is
- completely internal to the CPU itself with the CPU doing the appropriate
- translations into "CPU untranslated".
-
- - bus address. This is the address of memory as seen by OTHER devices,
- not the CPU. Now, in theory there could be many different bus
- addresses, with each device seeing memory in some device-specific way, but
- happily most hardware designers aren't actually actively trying to make
- things any more complex than necessary, so you can assume that all
- external hardware sees the memory the same way.
-
-Now, on normal PCs the bus address is exactly the same as the physical
-address, and things are very simple indeed. However, they are that simple
-because the memory and the devices share the same address space, and that is
-not generally necessarily true on other PCI/ISA setups.
-
-Now, just as an example, on the PReP (PowerPC Reference Platform), the
-CPU sees a memory map something like this (this is from memory)::
-
- 0-2 GB "real memory"
- 2 GB-3 GB "system IO" (inb/out and similar accesses on x86)
- 3 GB-4 GB "IO memory" (shared memory over the IO bus)
-
-Now, that looks simple enough. However, when you look at the same thing from
-the viewpoint of the devices, you have the reverse, and the physical memory
-address 0 actually shows up as address 2 GB for any IO master.
-
-So when the CPU wants any bus master to write to physical memory 0, it
-has to give the master address 0x80000000 as the memory address.
-
-So, for example, depending on how the kernel is actually mapped on the
-PPC, you can end up with a setup like this::
-
- physical address: 0
- virtual address: 0xC0000000
- bus address: 0x80000000
-
-where all the addresses actually point to the same thing. It's just seen
-through different translations..
-
-Similarly, on the Alpha, the normal translation is::
-
- physical address: 0
- virtual address: 0xfffffc0000000000
- bus address: 0x40000000
-
-(but there are also Alphas where the physical address and the bus address
-are the same).
-
-Anyway, the way to look up all these translations, you do::
-
- #include <asm/io.h>
-
- phys_addr = virt_to_phys(virt_addr);
- virt_addr = phys_to_virt(phys_addr);
- bus_addr = virt_to_bus(virt_addr);
- virt_addr = bus_to_virt(bus_addr);
-
-Now, when do you need these?
-
-You want the **virtual** address when you are actually going to access that
-pointer from the kernel. So you can have something like this::
-
- /*
- * this is the hardware "mailbox" we use to communicate with
- * the controller. The controller sees this directly.
- */
- struct mailbox {
- __u32 status;
- __u32 bufstart;
- __u32 buflen;
- ..
- } mbox;
-
- unsigned char * retbuffer;
-
- /* get the address from the controller */
- retbuffer = bus_to_virt(mbox.bufstart);
- switch (retbuffer[0]) {
- case STATUS_OK:
- ...
-
-on the other hand, you want the bus address when you have a buffer that
-you want to give to the controller::
-
- /* ask the controller to read the sense status into "sense_buffer" */
- mbox.bufstart = virt_to_bus(&sense_buffer);
- mbox.buflen = sizeof(sense_buffer);
- mbox.status = 0;
- notify_controller(&mbox);
-
-And you generally **never** want to use the physical address, because you can't
-use that from the CPU (the CPU only uses translated virtual addresses), and
-you can't use it from the bus master.
-
-So why do we care about the physical address at all? We do need the physical
-address in some cases, it's just not very often in normal code. The physical
-address is needed if you use memory mappings, for example, because the
-"remap_pfn_range()" mm function wants the physical address of the memory to
-be remapped as measured in units of pages, a.k.a. the pfn (the memory
-management layer doesn't know about devices outside the CPU, so it
-shouldn't need to know about "bus addresses" etc).
-
-.. note::
-
- The above is only one part of the whole equation. The above
- only talks about "real memory", that is, CPU memory (RAM).
-
-There is a completely different type of memory too, and that's the "shared
-memory" on the PCI or ISA bus. That's generally not RAM (although in the case
-of a video graphics card it can be normal DRAM that is just used for a frame
-buffer), but can be things like a packet buffer in a network card etc.
-
-This memory is called "PCI memory" or "shared memory" or "IO memory" or
-whatever, and there is only one way to access it: the readb/writeb and
-related functions. You should never take the address of such memory, because
-there is really nothing you can do with such an address: it's not
-conceptually in the same memory space as "real memory" at all, so you cannot
-just dereference a pointer. (Sadly, on x86 it **is** in the same memory space,
-so on x86 it actually works to just deference a pointer, but it's not
-portable).
-
-For such memory, you can do things like:
-
- - reading::
-
- /*
- * read first 32 bits from ISA memory at 0xC0000, aka
- * C000:0000 in DOS terms
- */
- unsigned int signature = isa_readl(0xC0000);
-
- - remapping and writing::
-
- /*
- * remap framebuffer PCI memory area at 0xFC000000,
- * size 1MB, so that we can access it: We can directly
- * access only the 640k-1MB area, so anything else
- * has to be remapped.
- */
- void __iomem *baseptr = ioremap(0xFC000000, 1024*1024);
-
- /* write a 'A' to the offset 10 of the area */
- writeb('A',baseptr+10);
-
- /* unmap when we unload the driver */
- iounmap(baseptr);
-
- - copying and clearing::
-
- /* get the 6-byte Ethernet address at ISA address E000:0040 */
- memcpy_fromio(kernel_buffer, 0xE0040, 6);
- /* write a packet to the driver */
- memcpy_toio(0xE1000, skb->data, skb->len);
- /* clear the frame buffer */
- memset_io(0xA0000, 0, 0x10000);
-
-OK, that just about covers the basics of accessing IO portably. Questions?
-Comments? You may think that all the above is overly complex, but one day you
-might find yourself with a 500 MHz Alpha in front of you, and then you'll be
-happy that your driver works ;)
-
-Note that kernel versions 2.0.x (and earlier) mistakenly called the
-ioremap() function "vremap()". ioremap() is the proper name, but I
-didn't think straight when I wrote it originally. People who have to
-support both can do something like::
-
- /* support old naming silliness */
- #if LINUX_VERSION_CODE < 0x020100
- #define ioremap vremap
- #define iounmap vfree
- #endif
-
-at the top of their source files, and then they can use the right names
-even on 2.0.x systems.
-
-And the above sounds worse than it really is. Most real drivers really
-don't do all that complex things (or rather: the complexity is not so
-much in the actual IO accesses as in error handling and timeouts etc).
-It's generally not hard to fix drivers, and in many cases the code
-actually looks better afterwards::
-
- unsigned long signature = *(unsigned int *) 0xC0000;
- vs
- unsigned long signature = readl(0xC0000);
-
-I think the second version actually is more readable, no?
diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst
index c6f4ba2fb32d..f75778d37488 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -560,7 +560,7 @@ available:
* cpuhp_state_remove_instance(state, node)
* cpuhp_state_remove_instance_nocalls(state, node)
-The arguments are the same as for the the cpuhp_state_add_instance*()
+The arguments are the same as for the cpuhp_state_add_instance*()
variants above.
The functions differ in the way how the installed callbacks are treated:
diff --git a/Documentation/core-api/dma-api-howto.rst b/Documentation/core-api/dma-api-howto.rst
index 358d495456d1..828846804e25 100644
--- a/Documentation/core-api/dma-api-howto.rst
+++ b/Documentation/core-api/dma-api-howto.rst
@@ -707,20 +707,6 @@ to use the dma_sync_*() interfaces::
}
}
-Drivers converted fully to this interface should not use virt_to_bus() any
-longer, nor should they use bus_to_virt(). Some drivers have to be changed a
-little bit, because there is no longer an equivalent to bus_to_virt() in the
-dynamic DMA mapping scheme - you have to always store the DMA addresses
-returned by the dma_alloc_coherent(), dma_pool_alloc(), and dma_map_single()
-calls (dma_map_sg() stores them in the scatterlist itself if the platform
-supports dynamic DMA mapping in hardware) in your driver structures and/or
-in the card registers.
-
-All drivers should be using these interfaces with no exceptions. It
-is planned to completely remove virt_to_bus() and bus_to_virt() as
-they are entirely deprecated. Some ports already do not provide these
-as it is impossible to correctly support them.
-
Handling Errors
===============
diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 6d6d0edd2d27..829f20a193ca 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -206,6 +206,20 @@ others should not be larger than the returned value.
::
+ size_t
+ dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device.
+
+Mapping larger buffers may take much longer in certain scenarios. In
+addition, for high-rate short-lived streaming mappings, the upfront time
+spent on the mapping may account for an appreciable part of the total
+request lifetime. As such, if splitting larger requests incurs no
+significant performance penalty, then device drivers are advised to
+limit the total length of DMA streaming mappings to the returned value.
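+
+As an illustration only (``foo_max_request()`` is a hypothetical driver
+helper, not part of the DMA API), a driver might clamp its request sizes
+like this::
+
+	static size_t foo_max_request(struct device *dev, size_t len)
+	{
+		/* Split requests at the optimal mapping size when cheap. */
+		return min_t(size_t, len, dma_opt_mapping_size(dev));
+	}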
+
+::
+
bool
dma_need_sync(struct device *dev, dma_addr_t dma_addr);
diff --git a/Documentation/core-api/entry.rst b/Documentation/core-api/entry.rst
new file mode 100644
index 000000000000..e12f22ab33c7
--- /dev/null
+++ b/Documentation/core-api/entry.rst
@@ -0,0 +1,279 @@
+Entry/exit handling for exceptions, interrupts, syscalls and KVM
+================================================================
+
+All transitions between execution domains require state updates which are
+subject to strict ordering constraints. State updates are required for the
+following:
+
+ * Lockdep
+ * RCU / Context tracking
+ * Preemption counter
+ * Tracing
+ * Time accounting
+
+The update order depends on the transition type and is explained below in
+the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
+exceptions`_, `NMI and NMI-like exceptions`_.
+
+Non-instrumentable code - noinstr
+---------------------------------
+
+Most instrumentation facilities depend on RCU, so instrumentation is prohibited
+for entry code before RCU starts watching and exit code after RCU stops
+watching. In addition, many architectures must save and restore register state,
+which means that (for example) a breakpoint in the breakpoint entry code would
+overwrite the debug registers of the initial breakpoint.
+
+Such code must be marked with the 'noinstr' attribute, placing that code into a
+special section inaccessible to instrumentation and debug facilities. Some
+functions are partially instrumentable, which is handled by marking them
+noinstr and using instrumentation_begin() and instrumentation_end() to flag the
+instrumentable ranges of code:
+
+.. code-block:: c
+
+ noinstr void entry(void)
+ {
+ handle_entry(); // <-- must be 'noinstr' or '__always_inline'
+ ...
+
+ instrumentation_begin();
+ handle_context(); // <-- instrumentable code
+ instrumentation_end();
+
+ ...
+ handle_exit(); // <-- must be 'noinstr' or '__always_inline'
+ }
+
+This allows verification of the 'noinstr' restrictions via objtool on
+supported architectures.
+
+Invoking non-instrumentable functions from instrumentable context has no
+restrictions and is useful to protect e.g. state switching which would
+cause malfunction if instrumented.
+
+All non-instrumentable entry/exit code sections before and after the RCU
+state transitions must run with interrupts disabled.
+
+Syscalls
+--------
+
+Syscall-entry code starts in assembly code and calls out into low-level C code
+after establishing low-level architecture-specific state and stack frames. This
+low-level C code must not be instrumented. A typical syscall handling function
+invoked from low-level assembly code looks like this:
+
+.. code-block:: c
+
+ noinstr void syscall(struct pt_regs *regs, int nr)
+ {
+ arch_syscall_enter(regs);
+ nr = syscall_enter_from_user_mode(regs, nr);
+
+ instrumentation_begin();
+ if (!invoke_syscall(regs, nr) && nr != -1)
+ result_reg(regs) = __sys_ni_syscall(regs);
+ instrumentation_end();
+
+ syscall_exit_to_user_mode(regs);
+ }
+
+syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
+establishes state in the following order:
+
+ * Lockdep
+ * RCU / Context tracking
+ * Tracing
+
+and then invokes the various entry work functions like ptrace, seccomp, audit,
+syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
+function can be invoked. The instrumentable code section then ends, after which
+syscall_exit_to_user_mode() is invoked.
+
+syscall_exit_to_user_mode() handles all work which needs to be done before
+returning to user space like tracing, audit, signals, task work etc. After
+that it invokes exit_to_user_mode() which again handles the state
+transition in the reverse order:
+
+ * Tracing
+ * RCU / Context tracking
+ * Lockdep
+
+syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
+available as fine-grained subfunctions in cases where the architecture code
+has to do extra work between the various steps, as sketched below. In such
+cases it has to ensure that enter_from_user_mode() is called first on entry
+and exit_to_user_mode() is called last on exit.
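+
+A minimal sketch of such a split-up variant; arch_extra_work() is a
+hypothetical placeholder and the entry/exit work calls are elided:
+
+.. code-block:: c
+
+    noinstr void syscall(struct pt_regs *regs, int nr)
+    {
+        enter_from_user_mode(regs);    // <-- must be first
+
+        arch_extra_work(regs);         // hypothetical arch-specific step
+
+        instrumentation_begin();
+        if (!invoke_syscall(regs, nr) && nr != -1)
+            result_reg(regs) = __sys_ni_syscall(regs);
+        instrumentation_end();
+
+        exit_to_user_mode();           // <-- must be last
+    }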
+
+Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
+to print a warning.
+
+KVM
+---
+
+Entering or exiting guest mode is very similar to syscalls. From the host
+kernel point of view the CPU goes off into user space when entering the
+guest and returns to the kernel on exit.
+
+kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
+and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
+The state operations have the same ordering.
+
+Task work handling is done separately for guests at the boundary of the
+vcpu_run() loop via xfer_to_guest_mode_handle_work(), which is a subset of
+the work handled on return to user space.
+
+Do not nest KVM entry/exit transitions because doing so is nonsensical.
+
+Interrupts and regular exceptions
+---------------------------------
+
+Interrupt entry and exit handling is slightly more complex than syscalls
+and KVM transitions.
+
+If an interrupt is raised while the CPU executes in user space, the entry
+and exit handling is exactly the same as for syscalls.
+
+If the interrupt is raised while the CPU executes in kernel space the entry and
+exit handling is slightly different. RCU state is only updated when the
+interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
+already be watching. Lockdep and tracing have to be updated unconditionally.
+
+irqentry_enter() and irqentry_exit() provide the implementation for this.
+
+The architecture-specific part looks similar to syscall handling:
+
+.. code-block:: c
+
+ noinstr void interrupt(struct pt_regs *regs, int nr)
+ {
+ arch_interrupt_enter(regs);
+ state = irqentry_enter(regs);
+
+ instrumentation_begin();
+
+ irq_enter_rcu();
+ invoke_irq_handler(regs, nr);
+ irq_exit_rcu();
+
+ instrumentation_end();
+
+ irqentry_exit(regs, state);
+ }
+
+Note that the invocation of the actual interrupt handler is within an
+irq_enter_rcu() and irq_exit_rcu() pair.
+
+irq_enter_rcu() updates the preemption count which makes in_hardirq()
+return true, handles NOHZ tick state and interrupt time accounting. This
+means that up to the point where irq_enter_rcu() is invoked in_hardirq()
+returns false.
+
+irq_exit_rcu() handles interrupt time accounting, undoes the preemption
+count update and eventually handles soft interrupts and NOHZ tick state.
+
+In theory, the preemption count could be updated in irqentry_enter(). In
+practice, deferring this update to irq_enter_rcu() allows the preemption-count
+code to be traced, while also maintaining symmetry with irq_exit_rcu() and
+irqentry_exit(), which are described in the next paragraph. The only downside
+is that the early entry code up to irq_enter_rcu() must be aware that the
+preemption count has not yet been updated with the HARDIRQ_OFFSET state.
+
+Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
+before it handles soft interrupts, whose handlers must run in BH context rather
+than irq-disabled context. In addition, irqentry_exit() might schedule, which
+also requires that HARDIRQ_OFFSET has been removed from the preemption count.
+
+Even though interrupt handlers are expected to run with local interrupts
+disabled, interrupt nesting is common from an entry/exit perspective. For
+example, softirq handling happens within an irqentry_{enter,exit}() block with
+local interrupts enabled. Also, although uncommon, nothing prevents an
+interrupt handler from re-enabling interrupts.
+
+Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
+runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
+the entry code is shared between the two.
+
+NMI and NMI-like exceptions
+---------------------------
+
+NMIs and NMI-like exceptions (machine checks, double faults, debug
+interrupts, etc.) can hit any context and must be extra careful with
+the state.
+
+State changes for debug exceptions and machine-check exceptions depend on
+whether these exceptions happened in user-space (breakpoints or watchpoints) or
+in kernel mode (code patching). From user-space, they are treated like
+interrupts, while from kernel mode they are treated like NMIs.
+
+NMIs and other NMI-like exceptions handle state transitions without
+distinguishing between user-mode and kernel-mode origin.
+
+The state update on entry is handled in irqentry_nmi_enter() which updates
+state in the following order:
+
+ * Preemption counter
+ * Lockdep
+ * RCU / Context tracking
+ * Tracing
+
+The exit counterpart irqentry_nmi_exit() does the reverse operation in the
+reverse order.
+
+Note that the update of the preemption counter has to be the first
+operation on enter and the last operation on exit. The reason is that both
+lockdep and RCU rely on in_nmi() returning true in this case. The
+preemption count modification in the NMI entry/exit case must not be
+traced.
+
+Architecture-specific code looks like this:
+
+.. code-block:: c
+
+ noinstr void nmi(struct pt_regs *regs)
+ {
+ arch_nmi_enter(regs);
+ state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+ nmi_handler(regs);
+ instrumentation_end();
+
+	irqentry_nmi_exit(regs, state);
+ }
+
+and for a debug exception, for example, it can look like this:
+
+.. code-block:: c
+
+ noinstr void debug(struct pt_regs *regs)
+ {
+ arch_nmi_enter(regs);
+
+ debug_regs = save_debug_regs();
+
+ if (user_mode(regs)) {
+ state = irqentry_enter(regs);
+
+ instrumentation_begin();
+ user_mode_debug_handler(regs, debug_regs);
+ instrumentation_end();
+
+ irqentry_exit(regs, state);
+ } else {
+ state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+ kernel_mode_debug_handler(regs, debug_regs);
+ instrumentation_end();
+
+ irqentry_nmi_exit(regs, state);
+ }
+ }
+
+There is no combined irqentry_nmi_if_kernel() function available as the
+above cannot be handled in an exception-agnostic way.
+
+NMIs can happen in any context; for example, an NMI-like exception can be
+triggered while an NMI is being handled. So NMI entry code has to be
+reentrant and state updates need to handle nesting.
diff --git a/Documentation/core-api/idr.rst b/Documentation/core-api/idr.rst
index 2eb5afdb9931..18d724867064 100644
--- a/Documentation/core-api/idr.rst
+++ b/Documentation/core-api/idr.rst
@@ -17,6 +17,9 @@ solution to the problem to avoid everybody inventing their own. The IDR
provides the ability to map an ID to a pointer, while the IDA provides
only ID allocation, and as a result is much more memory-efficient.
+The IDR interface is deprecated; please use the :doc:`XArray <xarray>`
+instead.
+
IDR usage
=========
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 5de2c7a4b1b3..77eb775b8b42 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -18,9 +18,12 @@ it.
kernel-api
workqueue
+ watch_queue
printk-basics
printk-formats
+ printk-index
symbol-namespaces
+ asm-annotations
Data structures and low-level utilities
=======================================
@@ -34,15 +37,25 @@ Library functionality that is used throughout the kernel.
kref
assoc_array
xarray
+ maple_tree
idr
circular-buffers
rbtree
generic-radix-tree
packing
- bus-virt-phys-mapping
this_cpu_ops
timekeeping
errseq
+ wrappers/atomic_t
+ wrappers/atomic_bitops
+
+Low level entry and exit
+========================
+
+.. toctree::
+ :maxdepth: 1
+
+ entry
Concurrency primitives
======================
@@ -58,6 +71,7 @@ Documentation/locking/index.rst for more related documentation.
local_ops
padata
../RCU/index
+ wrappers/memory-barriers.rst
Low-level hardware management
=============================
@@ -77,7 +91,7 @@ Memory management
=================
How to allocate and use memory in the kernel. Note that there is a lot
-more memory-management documentation in Documentation/vm/index.rst.
+more memory-management documentation in Documentation/mm/index.rst.
.. toctree::
:maxdepth: 1
diff --git a/Documentation/core-api/irq/irq-domain.rst b/Documentation/core-api/irq/irq-domain.rst
index d30b4d0a9769..f88a6ee67a35 100644
--- a/Documentation/core-api/irq/irq-domain.rst
+++ b/Documentation/core-api/irq/irq-domain.rst
@@ -71,7 +71,7 @@ variety of methods:
Note that irq domain lookups must happen in contexts that are
compatible with a RCU read-side critical section.
-The irq_create_mapping() function must be called *atleast once*
+The irq_create_mapping() function must be called *at least once*
before any call to irq_find_mapping(), lest the descriptor will not
be allocated.
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 2e7186805148..06f4ab122697 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -118,6 +118,12 @@ Text Searching
CRC and Math Functions in Linux
===============================
+Arithmetic Overflow Checking
+----------------------------
+
+.. kernel-doc:: include/linux/overflow.h
+ :internal:
+
CRC Functions
-------------
@@ -223,7 +229,7 @@ Module Loading
Inter Module support
--------------------
-Refer to the file kernel/module.c for more information.
+Refer to the files in kernel/module/ for more information.
Hardware Interfaces
===================
@@ -279,6 +285,7 @@ Accounting Framework
Block Devices
=============
+.. kernel-doc:: include/linux/bio.h
.. kernel-doc:: block/blk-core.c
:export:
@@ -294,9 +301,6 @@ Block Devices
.. kernel-doc:: block/blk-settings.c
:export:
-.. kernel-doc:: block/blk-exec.c
- :export:
-
.. kernel-doc:: block/blk-flush.c
:export:
diff --git a/Documentation/core-api/kobject.rst b/Documentation/core-api/kobject.rst
index 2739f8b72575..7310247310a0 100644
--- a/Documentation/core-api/kobject.rst
+++ b/Documentation/core-api/kobject.rst
@@ -118,7 +118,7 @@ Initialization of kobjects
Code which creates a kobject must, of course, initialize that object. Some
of the internal fields are setup with a (mandatory) call to kobject_init()::
- void kobject_init(struct kobject *kobj, struct kobj_type *ktype);
+ void kobject_init(struct kobject *kobj, const struct kobj_type *ktype);
The ktype is required for a kobject to be created properly, as every kobject
must have an associated kobj_type. After calling kobject_init(), to
@@ -156,7 +156,7 @@ kobject_name()::
There is a helper function to both initialize and add the kobject to the
kernel at the same time, called surprisingly enough kobject_init_and_add()::
- int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype,
+ int kobject_init_and_add(struct kobject *kobj, const struct kobj_type *ktype,
struct kobject *parent, const char *fmt, ...);
The arguments are the same as the individual kobject_init() and
@@ -299,7 +299,6 @@ kobj_type::
struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;
- struct attribute **default_attrs;
const struct attribute_group **default_groups;
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
const void *(*namespace)(struct kobject *kobj);
@@ -313,10 +312,10 @@ call kobject_init() or kobject_init_and_add().
The release field in struct kobj_type is, of course, a pointer to the
release() method for this type of kobject. The other two fields (sysfs_ops
-and default_attrs) control how objects of this type are represented in
+and default_groups) control how objects of this type are represented in
sysfs; they are beyond the scope of this document.
-The default_attrs pointer is a list of default attributes that will be
+The default_groups pointer is a list of default attributes that will be
automatically created for any kobject that is registered with this ktype.
@@ -373,10 +372,9 @@ If a kset wishes to control the uevent operations of the kobjects
associated with it, it can use the struct kset_uevent_ops to handle it::
struct kset_uevent_ops {
- int (* const filter)(struct kset *kset, struct kobject *kobj);
- const char *(* const name)(struct kset *kset, struct kobject *kobj);
- int (* const uevent)(struct kset *kset, struct kobject *kobj,
- struct kobj_uevent_env *env);
+ int (* const filter)(struct kobject *kobj);
+ const char *(* const name)(struct kobject *kobj);
+ int (* const uevent)(struct kobject *kobj, struct kobj_uevent_env *env);
};
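+
+As an illustrative sketch only (``my_filter`` and the name check are
+made up, not from the kobject documentation), a filter callback under the
+new signature could look like::
+
+	static int my_filter(struct kobject *kobj)
+	{
+		/* Returning 0 suppresses the uevent for this kobject. */
+		return strcmp(kobject_name(kobj), "hidden") != 0;
+	}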
diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst
new file mode 100644
index 000000000000..45defcf15da7
--- /dev/null
+++ b/Documentation/core-api/maple_tree.rst
@@ -0,0 +1,217 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+
+==========
+Maple Tree
+==========
+
+:Author: Liam R. Howlett
+
+Overview
+========
+
+The Maple Tree is a B-Tree data type which is optimized for storing
+non-overlapping ranges, including ranges of size 1. The tree was designed to
+be simple to use and does not require a user written search method. It
+supports iterating over a range of entries and going to the previous or next
+entry in a cache-efficient manner. The tree can also be put into an RCU-safe
+mode of operation which allows reading and writing concurrently. Writers must
+synchronize on a lock, which can be the default spinlock, or the user can set
+the lock to an external lock of a different type.
+
+The Maple Tree maintains a small memory footprint and was designed to use
+modern processor cache efficiently. The majority of the users will be able to
+use the normal API. An :ref:`maple-tree-advanced-api` exists for more complex
+scenarios. The most important usage of the Maple Tree is the tracking of the
+virtual memory areas.
+
+The Maple Tree can store values between ``0`` and ``ULONG_MAX``. The Maple
+Tree reserves values with the bottom two bits set to '10' which are below 4096
+(i.e. 2, 6, 10 .. 4094) for internal use. If entries might collide with these
+reserved values, users can convert such entries using xa_mk_value() and
+convert them back by calling xa_to_value(). If a user needs to store a
+reserved value, this can be done with the :ref:`maple-tree-advanced-api`,
+but it is blocked by the normal API.
+
+The Maple Tree can also be configured to support searching for a gap of a given
+size (or larger).
+
+Pre-allocation of nodes is also supported using the
+:ref:`maple-tree-advanced-api`. This is useful for users who must guarantee a
+successful store operation within a code segment where allocation cannot be
+performed. Allocations of nodes are relatively small at around 256 bytes.
+
+.. _maple-tree-normal-api:
+
+Normal API
+==========
+
+Start by initialising a maple tree, either with DEFINE_MTREE() for statically
+allocated maple trees or mt_init() for dynamically allocated ones. A
+freshly-initialised maple tree contains a ``NULL`` pointer for the range ``0``
+- ``ULONG_MAX``. There are currently two types of maple trees supported: the
+allocation tree and the regular tree. The regular tree has a higher branching
+factor for internal nodes. The allocation tree has a lower branching factor
+but allows the user to search for a gap of a given size or larger from either
+``0`` upwards or ``ULONG_MAX`` down. An allocation tree can be used by
+passing in the ``MT_FLAGS_ALLOC_RANGE`` flag when initialising the tree.
+
+You can then set entries using mtree_store() or mtree_store_range().
+mtree_store() will overwrite any entry with the new entry and return 0 on
+success or an error code otherwise. mtree_store_range() works in the same way
+but takes a range. mtree_load() is used to retrieve the entry stored at a
+given index. You can use mtree_erase() to erase an entire range by knowing
+only one value within that range, or a call to mtree_store() with a NULL
+entry may be used to partially erase a range or many ranges at once.
+
+If you want to only store a new entry to a range (or index) if that range is
+currently ``NULL``, you can use mtree_insert_range() or mtree_insert() which
+return -EEXIST if the range is not empty.
+
+You can search for an entry from an index upwards by using mt_find().
+
+You can walk each entry within a range by calling mt_for_each(). You must
+provide a temporary variable to store a cursor. If you want to walk each
+element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If
+the caller is going to hold the lock for the duration of the walk then it is
+worth looking at the mas_for_each() API in the :ref:`maple-tree-advanced-api`
+section.
+
+Sometimes it is necessary to ensure that the next store to a maple tree does
+not allocate memory; please see :ref:`maple-tree-advanced-api` for this use
+case.
+
+Finally, you can remove all entries from a maple tree by calling
+mtree_destroy(). If the maple tree entries are pointers, you may wish to free
+the entries first.
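+
+A short usage sketch (``ptr`` stands in for whatever pointer type you
+store; error handling is elided):
+
+.. code-block:: c
+
+    DEFINE_MTREE(mt);
+    void *entry;
+    unsigned long index = 0;
+
+    /* Store ptr over the range 12-15, then read one index back. */
+    mtree_store_range(&mt, 12, 15, ptr, GFP_KERNEL);
+    entry = mtree_load(&mt, 14);            /* returns ptr */
+
+    /* Visit every entry in the tree. */
+    mt_for_each(&mt, entry, index, ULONG_MAX)
+        pr_info("%lu -> %p\n", index, entry);
+
+    mtree_destroy(&mt);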
+
+Allocating Nodes
+----------------
+
+The allocations are handled by the internal tree code. See
+:ref:`maple-tree-advanced-alloc` for other options.
+
+Locking
+-------
+
+You do not have to worry about locking. See :ref:`maple-tree-advanced-locks`
+for other options.
+
+The Maple Tree uses RCU and an internal spinlock to synchronise access:
+
+Takes RCU read lock:
+ * mtree_load()
+ * mt_find()
+ * mt_for_each()
+ * mt_next()
+ * mt_prev()
+
+Takes ma_lock internally:
+ * mtree_store()
+ * mtree_store_range()
+ * mtree_insert()
+ * mtree_insert_range()
+ * mtree_erase()
+ * mtree_destroy()
+ * mt_set_in_rcu()
+ * mt_clear_in_rcu()
+
+If you want to take advantage of the internal lock to protect the data
+structures that you are storing in the Maple Tree, you can call mtree_lock()
+before calling mtree_load(), then take a reference count on the object you
+have found before calling mtree_unlock(). This will prevent stores from
+removing the object from the tree between looking up the object and
+incrementing the refcount. You can also use RCU to avoid dereferencing
+freed memory, but an explanation of that is beyond the scope of this
+document.
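+
+A sketch of that pattern on a tree ``mt``, assuming the stored objects
+embed a hypothetical ``struct kref ref`` member:
+
+.. code-block:: c
+
+    struct my_obj *obj;
+
+    mtree_lock(&mt);
+    obj = mtree_load(&mt, index);
+    if (obj)
+        kref_get(&obj->ref);    /* hypothetical refcount member */
+    mtree_unlock(&mt);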
+
+.. _maple-tree-advanced-api:
+
+Advanced API
+============
+
+The advanced API offers more flexibility and better performance at the
+cost of an interface which can be harder to use and has fewer safeguards.
+You must take care of your own locking while using the advanced API.
+You can use the ma_lock, RCU or an external lock for protection.
+You can mix advanced and normal operations on the same array, as long
+as the locking is compatible. The :ref:`maple-tree-normal-api` is implemented
+in terms of the advanced API.
+
+The advanced API is based around the ma_state; this is where the 'mas'
+prefix originates. The ma_state struct keeps track of tree operations to make
+life easier for both internal and external tree users.
+
+Initialising the maple tree is the same as in the :ref:`maple-tree-normal-api`.
+Please see above.
+
+The maple state keeps track of the range start and end in mas->index and
+mas->last, respectively.
+
+mas_walk() will walk the tree to the location of mas->index and set the
+mas->index and mas->last according to the range for the entry.
+
+You can set entries using mas_store(). mas_store() will overwrite any entry
+with the new entry and return the first existing entry that is overwritten.
+The range is passed in as members of the maple state: index and last.
+
+You can use mas_erase() to erase an entire range by setting index and
+last of the maple state to the desired range to erase. This will erase
+the first range that is found in that range, set the maple state index
+and last as the range that was erased and return the entry that existed
+at that location.
+
+You can walk each entry within a range by using mas_for_each(). If you want
+to walk each element of the tree then ``0`` and ``ULONG_MAX`` may be used as
+the range. If the lock needs to be periodically dropped, see the locking
+section mas_pause().
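+
+For example, a read-side walk over the whole tree might look like this
+sketch (RCU is used for protection):
+
+.. code-block:: c
+
+    MA_STATE(mas, &mt, 0, 0);
+    void *entry;
+
+    rcu_read_lock();
+    mas_for_each(&mas, entry, ULONG_MAX) {
+        /* handle entry */
+    }
+    rcu_read_unlock();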
+
+Using a maple state allows mas_next() and mas_prev() to function as if the
+tree was a linked list. With such a high branching factor the amortized
+performance penalty is outweighed by cache optimization. mas_next() will
+return the next entry which occurs after the entry at index. mas_prev()
+will return the previous entry which occurs before the entry at index.
+
+mas_find() will find the first entry which exists at or above index on
+the first call, and the next entry from every subsequent call.
+
+mas_find_rev() will find the first entry which exists at or below the last on
+the first call, and the previous entry from every subsequent call.
+
+If the user needs to yield the lock during an operation, then the maple state
+must be paused using mas_pause().
+
+There are a few extra interfaces provided when using an allocation tree.
+If you wish to search for a gap within a range, then mas_empty_area()
+or mas_empty_area_rev() can be used. mas_empty_area() searches for a gap
+starting at the lowest index given up to the maximum of the range.
+mas_empty_area_rev() searches for a gap starting at the highest index given
+and continues downward to the lower bound of the range.
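+
+A gap-search sketch (find a 16-slot gap within ``[0, 1023]``; the lock is
+assumed to be held by the caller):
+
+.. code-block:: c
+
+    MA_STATE(mas, &mt, 0, 0);
+    unsigned long start;
+
+    if (mas_empty_area(&mas, 0, 1023, 16) == 0)
+        start = mas.index;      /* the gap begins here */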
+
+.. _maple-tree-advanced-alloc:
+
+Advanced Allocating Nodes
+-------------------------
+
+Allocations are usually handled internally to the tree, however if allocations
+need to occur before a write occurs then calling mas_expected_entries() will
+allocate the worst-case number of needed nodes to insert the provided number of
+ranges. This also causes the tree to enter mass insertion mode. Once
+insertions are complete calling mas_destroy() on the maple state will free the
+unused allocations.
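+
+A mass-insertion sketch (``start``, ``end`` and ``ptrs`` are hypothetical
+arrays; the tree lock must be held across the loop):
+
+.. code-block:: c
+
+    MA_STATE(mas, &mt, 0, 0);
+    int i;
+
+    if (mas_expected_entries(&mas, nr_ranges))
+        return -ENOMEM;
+
+    for (i = 0; i < nr_ranges; i++) {
+        mas_set_range(&mas, start[i], end[i]);
+        mas_store(&mas, ptrs[i]);
+    }
+    mas_destroy(&mas);      /* free any unused pre-allocations */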
+
+.. _maple-tree-advanced-locks:
+
+Advanced Locking
+----------------
+
+The maple tree uses a spinlock by default, but external locks can be used for
+tree updates as well. To use an external lock, the tree must be initialized
+with the ``MT_FLAGS_LOCK_EXTERN`` flag. This is usually done with the
+MTREE_INIT_EXT() #define, which takes an external lock as an argument.
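+
+A sketch of external locking, assuming a hypothetical mutex ``my_lock``:
+
+.. code-block:: c
+
+    static DEFINE_MUTEX(my_lock);
+    static struct maple_tree mt =
+        MTREE_INIT_EXT(mt, MT_FLAGS_LOCK_EXTERN, my_lock);
+
+    MA_STATE(mas, &mt, index, index);
+
+    mutex_lock(&my_lock);
+    mas_store_gfp(&mas, ptr, GFP_KERNEL);
+    mutex_unlock(&my_lock);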
+
+Functions and structures
+========================
+
+.. kernel-doc:: include/linux/maple_tree.h
+.. kernel-doc:: lib/maple_tree.c
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index 395835f9289f..f5dde5bceaea 100644
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -19,19 +19,16 @@ User Space Memory Access
Memory Allocation Controls
==========================
-.. kernel-doc:: include/linux/gfp.h
- :internal:
-
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Page mobility and placement hints
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Watermark modifiers
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Reclaim modifiers
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Useful GFP flag combinations
The Slab Cache
@@ -58,15 +55,30 @@ Virtually Contiguous Mappings
File Mapping and Page Cache
===========================
-.. kernel-doc:: mm/readahead.c
- :export:
+Filemap
+-------
.. kernel-doc:: mm/filemap.c
:export:
+Readahead
+---------
+
+.. kernel-doc:: mm/readahead.c
+ :doc: Readahead Overview
+
+.. kernel-doc:: mm/readahead.c
+ :export:
+
+Writeback
+---------
+
.. kernel-doc:: mm/page-writeback.c
:export:
+Truncate
+--------
+
.. kernel-doc:: mm/truncate.c
:export:
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index fcf605be43d0..b18416f4500f 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -55,18 +55,18 @@ flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing each by a special
value: GUP_PIN_COUNTING_BIAS.
-For huge pages (and in fact, any compound page of more than 2 pages), the
-GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
-is achieved, by using the 3rd struct page in the compound page. A new struct
-page field, hpage_pinned_refcount, has been added in order to support this.
+For compound pages, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
+an exact form of pin counting is achieved, by using the 2nd struct page
+in the compound page. A new struct page field, compound_pincount, has
+been added in order to support this.
This approach for compound pages avoids the counting upper limit problems that
are discussed below. Those limitations would have been aggravated severely by
huge pages, because each tail page adds a refcount to the head page. And in
-fact, testing revealed that, without a separate hpage_pinned_refcount field,
+fact, testing revealed that, without a separate compound_pincount field,
page overflows were seen in some huge page stress tests.
-This also means that huge pages and compound pages (of order > 1) do not suffer
+This also means that huge pages and compound pages do not suffer
from the false positives problem that is mentioned below.::
Function
@@ -264,9 +264,9 @@ place.)
Other diagnostics
=================
-dump_page() has been enhanced slightly, to handle these new counting fields, and
-to better report on compound pages in general. Specifically, for compound pages
-with order > 1, the exact (hpage_pinned_refcount) pincount is reported.
+dump_page() has been enhanced slightly, to handle these new counting
+fields, and to better report on compound pages in general. Specifically,
+for compound pages, the exact (compound_pincount) pincount is reported.
References
==========
diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst
index 5e89497ba314..dbe1aacc79d0 100644
--- a/Documentation/core-api/printk-formats.rst
+++ b/Documentation/core-api/printk-formats.rst
@@ -625,6 +625,16 @@ Examples::
%p4cc Y10 little-endian (0x20303159)
%p4cc NV12 big-endian (0xb231564e)
+Rust
+----
+
+::
+
+ %pA
+
+Only intended to be used from Rust code to format ``core::fmt::Arguments``.
+Do *not* use it from C.
+
Thanks
======
diff --git a/Documentation/core-api/printk-index.rst b/Documentation/core-api/printk-index.rst
new file mode 100644
index 000000000000..3062f37d119b
--- /dev/null
+++ b/Documentation/core-api/printk-index.rst
@@ -0,0 +1,137 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Printk Index
+============
+
+There are many ways to monitor the state of the system. One important
+source of information is the system log. It provides a lot of information,
+including more or less important warnings and error messages.
+
+There are monitoring tools that filter and take action based on messages
+logged.
+
+The kernel messages are evolving together with the code. As a result,
+particular kernel messages are not KABI and never will be!
+
+Maintaining system log monitors is therefore a huge challenge. It requires
+knowing what messages were updated in a particular kernel version and why.
+Finding these changes in the sources would require non-trivial parsers.
+Also it would require matching the sources with the binary kernel, which
+is not always trivial. Various changes might be backported. Various kernel
+versions might be used on different monitored systems.
+
+This is where the printk index feature might become useful. It provides
+a dump of the printk formats used throughout the source code of the kernel
+and the modules on the running system. It is accessible at runtime via debugfs.
+
+The printk index helps to find changes in the message formats. Also it helps
+to track the strings back to the kernel sources and the related commit.
+
+
+User Interface
+==============
+
+The index of printk formats is split into separate files. The files are
+named according to the binaries where the printk formats are built-in. There
+is always "vmlinux" and optionally also modules, for example::
+
+ /sys/kernel/debug/printk/index/vmlinux
+ /sys/kernel/debug/printk/index/ext4
+ /sys/kernel/debug/printk/index/scsi_mod
+
+Note that only loaded modules are shown. Also printk formats from a module
+might appear in "vmlinux" when the module is built-in.
+
+The content is inspired by the dynamic debug interface and looks like::
+
+ $> head -1 /sys/kernel/debug/printk/index/vmlinux; shuf -n 5 vmlinux
+ # <level[,flags]> filename:line function "format"
+ <5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n"
+ <4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n"
+ <6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n"
+ <6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n"
+ <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"
+
+where the meaning of each field is:
+
+ - :level: log level value: 0-7 for particular severity, -1 as default,
+ 'c' as continuous line without an explicit log level
+ - :flags: optional flags: currently only 'c' for KERN_CONT
+ - :filename\:line: source filename and line number of the related
+ printk() call. Note that there are many wrappers, for example,
+ pr_warn(), pr_warn_once(), dev_warn().
+ - :function: function name where the printk() call is used.
+ - :format: format string
+
+The extra information makes it a bit harder to find differences
+between various kernels. The line number especially might change
+very often. On the other hand, it helps a lot to confirm that
+it is the same string, or to find the commit responsible
+for any changes.
+
+
+printk() Is Not a Stable KABI
+=============================
+
+Several developers are afraid that exporting all these implementation
+details into the user space will transform particular printk() calls
+into KABI.
+
+But it is exactly the opposite. printk() calls must _not_ be KABI.
+And the printk index helps user space tools to deal with this.
+
+
+Subsystem specific printk wrappers
+==================================
+
+The printk index is generated using extra metadata that are stored in
+a dedicated ELF section ".printk_index". It is achieved using macro
+wrappers that call __printk_index_emit() together with the real printk()
+call. The same technique is also used for the metadata used by
+the dynamic debug feature.
+
+The metadata are stored for a particular message only when it is printed
+using these special wrappers. It is implemented for the commonly
+used printk() calls, including, for example, pr_warn(), or pr_once().
+
+Additional changes are necessary for various subsystem-specific wrappers
+that call the original printk() via a common helper function. These need
+their own wrappers that add __printk_index_emit(), as in the sketch below.
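+
+A sketch of such a wrapper (``foo_warn()`` and ``_foo_printk()`` are
+made-up names for a subsystem's macro and its existing helper)::
+
+    #define foo_warn(fmt, ...)                                     \
+        ({                                                         \
+            __printk_index_emit(fmt, KERN_WARNING, NULL);          \
+            _foo_printk(KERN_WARNING fmt, ##__VA_ARGS__);          \
+        })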
+
+Only a few subsystem-specific wrappers have been updated so far,
+for example, dev_printk(). As a result, the printk formats from
+some subsystems can be missing in the printk index.
+
+
+Subsystem specific prefix
+=========================
+
+The pr_fmt() macro allows defining a prefix that is printed
+before the string generated by the related printk() calls.
+
+Subsystem specific wrappers usually add even more complicated
+prefixes.
+
+These prefixes can be stored into the printk index metadata
+by an optional parameter of __printk_index_emit(). The debugfs
+interface might then show the printk formats including these prefixes.
+For example, drivers/acpi/osl.c contains::
+
+ #define pr_fmt(fmt) "ACPI: OSL: " fmt
+
+ static int __init acpi_no_auto_serialize_setup(char *str)
+ {
+ acpi_gbl_auto_serialize_methods = FALSE;
+ pr_info("Auto-serialization disabled\n");
+
+ return 1;
+ }
+
+This results in the following printk index entry::
+
+  <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: OSL: Auto-serialization disabled\n"
+
+It helps match messages from the real log with the printk index.
+Then the source file name, line number, and function name can
+be used to match the string with the source code.
diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..bf28ac0401f3 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,31 +4,29 @@
Memory Protection Keys
======================
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains. It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key. Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
+Memory Protection Keys provide a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables when an
+application changes protection domains.
+
+Pkeys Userspace (PKU) is a feature which can be found on:
+ * Intel server CPUs, Skylake and later
+ * Intel client CPUs, Tiger Lake (11th Gen Core) and later
+ * Future AMD CPUs
+
+Pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.
+
+Protections for each key are defined with a per-CPU user-accessible register
+(PKRU). Each of these is a 32-bit register storing two bits (Access Disable
+and Write Disable) for each of 16 keys.
+
+Being a CPU register, PKRU is inherently thread-local, potentially giving each
thread a different set of protections from every other thread.
-There are two new instructions (RDPKRU/WRPKRU) for reading and writing
-to the new register. The feature is only available in 64-bit mode,
-even though there is theoretically space in the PAE PTEs. These
-permissions are enforced on data access only and have no effect on
-instruction fetches.
+There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
+register. The feature is only available in 64-bit mode, even though there is
+theoretically space in the PAE PTEs. These permissions are enforced on data
+access only and have no effect on instruction fetches.
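+
+For illustration, the PKRU bit layout can be sketched like this (these
+helpers are illustrative, not kernel API)::
+
+	/* Two bits per key: even bit = Access Disable, odd = Write Disable. */
+	#define PKEY_AD(k)	(1u << (2 * (k)))
+	#define PKEY_WD(k)	(1u << (2 * (k) + 1))
+
+	static inline bool pkru_allows_write(u32 pkru, int pkey)
+	{
+		return !(pkru & (PKEY_AD(pkey) | PKEY_WD(pkey)));
+	}
+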
Syscalls
========
diff --git a/Documentation/core-api/symbol-namespaces.rst b/Documentation/core-api/symbol-namespaces.rst
index 5ad9e0abe42c..12e4aecdae94 100644
--- a/Documentation/core-api/symbol-namespaces.rst
+++ b/Documentation/core-api/symbol-namespaces.rst
@@ -51,8 +51,8 @@ namespace ``USB_STORAGE``, use::
The corresponding ksymtab entry struct ``kernel_symbol`` will have the member
``namespace`` set accordingly. A symbol that is exported without a namespace will
refer to ``NULL``. There is no default namespace if none is defined. ``modpost``
-and kernel/module.c make use the namespace at build time or module load time,
-respectively.
+and kernel/module/main.c make use of the namespace at build time or module
+load time, respectively.
2.2 Using the DEFAULT_SYMBOL_NAMESPACE define
=============================================
diff --git a/Documentation/core-api/timekeeping.rst b/Documentation/core-api/timekeeping.rst
index 729e24864fe7..22ec68f24421 100644
--- a/Documentation/core-api/timekeeping.rst
+++ b/Documentation/core-api/timekeeping.rst
@@ -132,6 +132,7 @@ Some additional variants exist for more specialized cases:
.. c:function:: u64 ktime_get_mono_fast_ns( void )
u64 ktime_get_raw_fast_ns( void )
u64 ktime_get_boot_fast_ns( void )
+ u64 ktime_get_tai_fast_ns( void )
u64 ktime_get_real_fast_ns( void )
These variants are safe to call from any context, including from
diff --git a/Documentation/watch_queue.rst b/Documentation/core-api/watch_queue.rst
index 54f13ad5fc17..54f13ad5fc17 100644
--- a/Documentation/watch_queue.rst
+++ b/Documentation/core-api/watch_queue.rst
diff --git a/Documentation/core-api/wrappers/atomic_bitops.rst b/Documentation/core-api/wrappers/atomic_bitops.rst
new file mode 100644
index 000000000000..bf24e4081a8f
--- /dev/null
+++ b/Documentation/core-api/wrappers/atomic_bitops.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+ This is a simple wrapper to bring atomic_bitops.txt into the RST world
+ until such a time as that file can be converted directly.
+
+=============
+Atomic bitops
+=============
+
+.. raw:: latex
+
+ \footnotesize
+
+.. include:: ../../atomic_bitops.txt
+ :literal:
+
+.. raw:: latex
+
+ \normalsize
diff --git a/Documentation/core-api/wrappers/atomic_t.rst b/Documentation/core-api/wrappers/atomic_t.rst
new file mode 100644
index 000000000000..ed109a964c77
--- /dev/null
+++ b/Documentation/core-api/wrappers/atomic_t.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+ This is a simple wrapper to bring atomic_t.txt into the RST world
+ until such a time as that file can be converted directly.
+
+============
+Atomic types
+============
+
+.. raw:: latex
+
+ \footnotesize
+
+.. include:: ../../atomic_t.txt
+ :literal:
+
+.. raw:: latex
+
+ \normalsize
+
diff --git a/Documentation/core-api/wrappers/memory-barriers.rst b/Documentation/core-api/wrappers/memory-barriers.rst
new file mode 100644
index 000000000000..532460b5e3eb
--- /dev/null
+++ b/Documentation/core-api/wrappers/memory-barriers.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+ This is a simple wrapper to bring memory-barriers.txt into the RST world
+ until such a time as that file can be converted directly.
+
+============================
+Linux kernel memory barriers
+============================
+
+.. raw:: latex
+
+ \footnotesize
+
+.. include:: ../../memory-barriers.txt
+ :literal:
+
+.. raw:: latex
+
+ \normalsize
diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
index a137a0e6d068..77e0ece2b1d6 100644
--- a/Documentation/core-api/xarray.rst
+++ b/Documentation/core-api/xarray.rst
@@ -315,11 +315,15 @@ indeed the normal API is implemented in terms of the advanced API. The
advanced API is only available to modules with a GPL-compatible license.
The advanced API is based around the xa_state. This is an opaque data
-structure which you declare on the stack using the XA_STATE()
-macro. This macro initialises the xa_state ready to start walking
-around the XArray. It is used as a cursor to maintain the position
-in the XArray and let you compose various operations together without
-having to restart from the top every time.
+structure which you declare on the stack using the XA_STATE() macro.
+This macro initialises the xa_state ready to start walking around the
+XArray. It is used as a cursor to maintain the position in the XArray
+and let you compose various operations together without having to restart
+from the top every time. The contents of the xa_state are protected by
+the rcu_read_lock() or the xas_lock(). If you need to drop whichever of
+those locks is protecting your state and tree, you must call xas_pause()
+so that future calls do not rely on the parts of the state which were
+left unprotected.
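+
+A sketch of pausing mid-walk under RCU (``array`` is a stand-in XArray;
+entries are only processed, not modified)::
+
+    XA_STATE(xas, &array, 0);
+    void *entry;
+
+    rcu_read_lock();
+    xas_for_each(&xas, entry, ULONG_MAX) {
+        /* process entry */
+        if (need_resched()) {
+            xas_pause(&xas);        /* walk state must not be trusted */
+            rcu_read_unlock();
+            cond_resched();
+            rcu_read_lock();
+        }
+    }
+    rcu_read_unlock();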
The xa_state is also used to store errors. You can call
xas_error() to retrieve the error. All operations check whether