Diffstat (limited to 'Documentation/core-api')
59 files changed, 5558 insertions, 1358 deletions
diff --git a/Documentation/core-api/asm-annotations.rst b/Documentation/core-api/asm-annotations.rst new file mode 100644 index 000000000000..11c96d3f9ad6 --- /dev/null +++ b/Documentation/core-api/asm-annotations.rst @@ -0,0 +1,222 @@ +Assembler Annotations +===================== + +Copyright (c) 2017-2019 Jiri Slaby + +This document describes the new macros for annotation of data and code in +assembly. In particular, it contains information about ``SYM_FUNC_START``, +``SYM_FUNC_END``, ``SYM_CODE_START``, and similar. + +Rationale +--------- +Some code like entries, trampolines, or boot code needs to be written in +assembly. The same as in C, such code is grouped into functions and +accompanied with data. Standard assemblers do not force users into precisely +marking these pieces as code, data, or even specifying their length. +Nevertheless, assemblers provide developers with such annotations to aid +debuggers throughout assembly. On top of that, developers also want to mark +some functions as *global* in order to be visible outside of their translation +units. + +Over time, the Linux kernel has adopted macros from various projects (like +``binutils``) to facilitate such annotations. So for historic reasons, +developers have been using ``ENTRY``, ``END``, ``ENDPROC``, and other +annotations in assembly. Due to the lack of their documentation, the macros +are used in rather wrong contexts at some locations. Clearly, ``ENTRY`` was +intended to denote the beginning of global symbols (be it data or code). +``END`` used to mark the end of data or end of special functions with +*non-standard* calling convention. In contrast, ``ENDPROC`` should annotate +only ends of *standard* functions. + +When these macros are used correctly, they help assemblers generate a nice +object with both sizes and types set correctly. For example, the result of +``arch/x86/lib/putuser.S``:: + + Num: Value Size Type Bind Vis Ndx Name + 25: 0000000000000000 33 FUNC GLOBAL DEFAULT 1 __put_user_1 + 29: 0000000000000030 37 FUNC GLOBAL DEFAULT 1 __put_user_2 + 32: 0000000000000060 36 FUNC GLOBAL DEFAULT 1 __put_user_4 + 35: 0000000000000090 37 FUNC GLOBAL DEFAULT 1 __put_user_8 + +This is not only important for debugging purposes. When there are properly +annotated objects like this, tools can be run on them to generate more useful +information. In particular, on properly annotated objects, ``objtool`` can be +run to check and fix the object if needed. Currently, ``objtool`` can report +missing frame pointer setup/destruction in functions. It can also +automatically generate annotations for the ORC unwinder +(Documentation/arch/x86/orc-unwinder.rst) +for most code. Both of these are especially important to support reliable +stack traces which are in turn necessary for kernel live patching +(Documentation/livepatch/livepatch.rst). + +Caveat and Discussion +--------------------- +As one might realize, there were only three macros previously. That is indeed +insufficient to cover all the combinations of cases: + +* standard/non-standard function +* code/data +* global/local symbol + +There was a discussion_ and instead of extending the current ``ENTRY/END*`` +macros, it was decided that brand new macros should be introduced instead:: + + So how about using macro names that actually show the purpose, instead + of importing all the crappy, historic, essentially randomly chosen + debug symbol macro names from the binutils and older kernels? + +.. 
_discussion: https://lore.kernel.org/r/20170217104757.28588-1-jslaby@suse.cz + +Macros Description +------------------ + +The new macros are prefixed with the ``SYM_`` prefix and can be divided into +three main groups: + +1. ``SYM_FUNC_*`` -- to annotate C-like functions. This means functions with + standard C calling conventions. For example, on x86, this means that the + stack contains a return address at the predefined place and a return from + the function can happen in a standard way. When frame pointers are enabled, + save/restore of frame pointer shall happen at the start/end of a function, + respectively, too. + + Checking tools like ``objtool`` should ensure such marked functions conform + to these rules. The tools can also easily annotate these functions with + debugging information (like *ORC data*) automatically. + +2. ``SYM_CODE_*`` -- special functions called with special stack. Be it + interrupt handlers with special stack content, trampolines, or startup + functions. + + Checking tools mostly ignore checking of these functions. But some debug + information still can be generated automatically. For correct debug data, + this code needs hints like ``UNWIND_HINT_REGS`` provided by developers. + +3. ``SYM_DATA*`` -- obviously data belonging to ``.data`` sections and not to + ``.text``. Data do not contain instructions, so they have to be treated + specially by the tools: they should not treat the bytes as instructions, + nor assign any debug information to them. + +Instruction Macros +~~~~~~~~~~~~~~~~~~ +This section covers ``SYM_FUNC_*`` and ``SYM_CODE_*`` enumerated above. + +``objtool`` requires that all code must be contained in an ELF symbol. Symbol +names that have a ``.L`` prefix do not emit symbol table entries. ``.L`` +prefixed symbols can be used within a code region, but should be avoided for +denoting a range of code via ``SYM_*_START/END`` annotations. + +* ``SYM_FUNC_START`` and ``SYM_FUNC_START_LOCAL`` are supposed to be **the + most frequent markings**. They are used for functions with standard calling + conventions -- global and local. Like in C, they both align the functions to + architecture specific ``__ALIGN`` bytes. There are also ``_NOALIGN`` variants + for special cases where developers do not want this implicit alignment. + + ``SYM_FUNC_START_WEAK`` and ``SYM_FUNC_START_WEAK_NOALIGN`` markings are + also offered as an assembler counterpart to the *weak* attribute known from + C. + + All of these **shall** be coupled with ``SYM_FUNC_END``. First, it marks + the sequence of instructions as a function and computes its size to the + generated object file. Second, it also eases checking and processing such + object files as the tools can trivially find exact function boundaries. + + So in most cases, developers should write something like in the following + example, having some asm instructions in between the macros, of course:: + + SYM_FUNC_START(memset) + ... asm insns ... + SYM_FUNC_END(memset) + + In fact, this kind of annotation corresponds to the now deprecated ``ENTRY`` + and ``ENDPROC`` macros. + +* ``SYM_FUNC_ALIAS``, ``SYM_FUNC_ALIAS_LOCAL``, and ``SYM_FUNC_ALIAS_WEAK`` can + be used to define multiple names for a function. The typical use is:: + + SYM_FUNC_START(__memset) + ... asm insns ... 
+ SYN_FUNC_END(__memset) + SYM_FUNC_ALIAS(memset, __memset) + + In this example, one can call ``__memset`` or ``memset`` with the same + result, except the debug information for the instructions is generated to + the object file only once -- for the non-``ALIAS`` case. + +* ``SYM_CODE_START`` and ``SYM_CODE_START_LOCAL`` should be used only in + special cases -- if you know what you are doing. This is used exclusively + for interrupt handlers and similar where the calling convention is not the C + one. ``_NOALIGN`` variants exist too. The use is the same as for the ``FUNC`` + category above:: + + SYM_CODE_START_LOCAL(bad_put_user) + ... asm insns ... + SYM_CODE_END(bad_put_user) + + Again, every ``SYM_CODE_START*`` **shall** be coupled by ``SYM_CODE_END``. + + To some extent, this category corresponds to deprecated ``ENTRY`` and + ``END``. Except ``END`` had several other meanings too. + +* ``SYM_INNER_LABEL*`` is used to denote a label inside some + ``SYM_{CODE,FUNC}_START`` and ``SYM_{CODE,FUNC}_END``. They are very similar + to C labels, except they can be made global. An example of use:: + + SYM_CODE_START(ftrace_caller) + /* save_mcount_regs fills in first two parameters */ + ... + + SYM_INNER_LABEL(ftrace_caller_op_ptr, SYM_L_GLOBAL) + /* Load the ftrace_ops into the 3rd parameter */ + ... + + SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL) + call ftrace_stub + ... + retq + SYM_CODE_END(ftrace_caller) + +Data Macros +~~~~~~~~~~~ +Similar to instructions, there is a couple of macros to describe data in the +assembly. + +* ``SYM_DATA_START`` and ``SYM_DATA_START_LOCAL`` mark the start of some data + and shall be used in conjunction with either ``SYM_DATA_END``, or + ``SYM_DATA_END_LABEL``. The latter adds also a label to the end, so that + people can use ``lstack`` and (local) ``lstack_end`` in the following + example:: + + SYM_DATA_START_LOCAL(lstack) + .skip 4096 + SYM_DATA_END_LABEL(lstack, SYM_L_LOCAL, lstack_end) + +* ``SYM_DATA`` and ``SYM_DATA_LOCAL`` are variants for simple, mostly one-line + data:: + + SYM_DATA(HEAP, .long rm_heap) + SYM_DATA(heap_end, .long rm_stack) + + In the end, they expand to ``SYM_DATA_START`` with ``SYM_DATA_END`` + internally. + +Support Macros +~~~~~~~~~~~~~~ +All the above reduce themselves to some invocation of ``SYM_START``, +``SYM_END``, or ``SYM_ENTRY`` at last. Normally, developers should avoid using +these. + +Further, in the above examples, one could see ``SYM_L_LOCAL``. There are also +``SYM_L_GLOBAL`` and ``SYM_L_WEAK``. All are intended to denote linkage of a +symbol marked by them. They are used either in ``_LABEL`` variants of the +earlier macros, or in ``SYM_START``. + + +Overriding Macros +~~~~~~~~~~~~~~~~~ +Architecture can also override any of the macros in their own +``asm/linkage.h``, including macros specifying the type of a symbol +(``SYM_T_FUNC``, ``SYM_T_OBJECT``, and ``SYM_T_NONE``). As every macro +described in this file is surrounded by ``#ifdef`` + ``#endif``, it is enough +to define the macros differently in the aforementioned architecture-dependent +header. diff --git a/Documentation/core-api/atomic_ops.rst b/Documentation/core-api/atomic_ops.rst deleted file mode 100644 index 724583453e1f..000000000000 --- a/Documentation/core-api/atomic_ops.rst +++ /dev/null @@ -1,664 +0,0 @@ -======================================================= -Semantics and Behavior of Atomic and Bitmask Operations -======================================================= - -:Author: David S. 
Miller - -This document is intended to serve as a guide to Linux port -maintainers on how to implement atomic counter, bitops, and spinlock -interfaces properly. - -Atomic Type And Operations -========================== - -The atomic_t type should be defined as a signed integer and -the atomic_long_t type as a signed long integer. Also, they should -be made opaque such that any kind of cast to a normal C integer type -will fail. Something like the following should suffice:: - - typedef struct { int counter; } atomic_t; - typedef struct { long counter; } atomic_long_t; - -Historically, counter has been declared volatile. This is now discouraged. -See :ref:`Documentation/process/volatile-considered-harmful.rst -<volatile_considered_harmful>` for the complete rationale. - -local_t is very similar to atomic_t. If the counter is per CPU and only -updated by one CPU, local_t is probably more appropriate. Please see -:ref:`Documentation/core-api/local_ops.rst <local_ops>` for the semantics of -local_t. - -The first operations to implement for atomic_t's are the initializers and -plain writes. :: - - #define ATOMIC_INIT(i) { (i) } - #define atomic_set(v, i) ((v)->counter = (i)) - -The first macro is used in definitions, such as:: - - static atomic_t my_counter = ATOMIC_INIT(1); - -The initializer is atomic in that the return values of the atomic operations -are guaranteed to be correct reflecting the initialized value if the -initializer is used before runtime. If the initializer is used at runtime, a -proper implicit or explicit read memory barrier is needed before reading the -value with atomic_read from another thread. - -As with all of the ``atomic_`` interfaces, replace the leading ``atomic_`` -with ``atomic_long_`` to operate on atomic_long_t. - -The second interface can be used at runtime, as in:: - - struct foo { atomic_t counter; }; - ... - - struct foo *k; - - k = kmalloc(sizeof(*k), GFP_KERNEL); - if (!k) - return -ENOMEM; - atomic_set(&k->counter, 0); - -The setting is atomic in that the return values of the atomic operations by -all threads are guaranteed to be correct reflecting either the value that has -been set with this operation or set with another operation. A proper implicit -or explicit memory barrier is needed before the value set with the operation -is guaranteed to be readable with atomic_read from another thread. - -Next, we have:: - - #define atomic_read(v) ((v)->counter) - -which simply reads the counter value currently visible to the calling thread. -The read is atomic in that the return value is guaranteed to be one of the -values initialized or modified with the interface operations if a proper -implicit or explicit memory barrier is used after possible runtime -initialization by any other thread and the value is modified only with the -interface operations. atomic_read does not guarantee that the runtime -initialization by any other thread is visible yet, so the user of the -interface must take care of that with a proper implicit or explicit memory -barrier. - -.. warning:: - - ``atomic_read()`` and ``atomic_set()`` DO NOT IMPLY BARRIERS! - - Some architectures may choose to use the volatile keyword, barriers, or - inline assembly to guarantee some degree of immediacy for atomic_read() - and atomic_set(). 
This is not uniformly guaranteed, and may change in - the future, so all users of atomic_t should treat atomic_read() and - atomic_set() as simple C statements that may be reordered or optimized - away entirely by the compiler or processor, and explicitly invoke the - appropriate compiler and/or memory barrier for each use case. Failure - to do so will result in code that may suddenly break when used with - different architectures or compiler optimizations, or even changes in - unrelated code which changes how the compiler optimizes the section - accessing atomic_t variables. - -Properly aligned pointers, longs, ints, and chars (and unsigned -equivalents) may be atomically loaded from and stored to in the same -sense as described for atomic_read() and atomic_set(). The READ_ONCE() -and WRITE_ONCE() macros should be used to prevent the compiler from using -optimizations that might otherwise optimize accesses out of existence on -the one hand, or that might create unsolicited accesses on the other. - -For example consider the following code:: - - while (a > 0) - do_something(); - -If the compiler can prove that do_something() does not store to the -variable a, then the compiler is within its rights transforming this to -the following:: - - if (a > 0) - for (;;) - do_something(); - -If you don't want the compiler to do this (and you probably don't), then -you should use something like the following:: - - while (READ_ONCE(a) > 0) - do_something(); - -Alternatively, you could place a barrier() call in the loop. - -For another example, consider the following code:: - - tmp_a = a; - do_something_with(tmp_a); - do_something_else_with(tmp_a); - -If the compiler can prove that do_something_with() does not store to the -variable a, then the compiler is within its rights to manufacture an -additional load as follows:: - - tmp_a = a; - do_something_with(tmp_a); - tmp_a = a; - do_something_else_with(tmp_a); - -This could fatally confuse your code if it expected the same value -to be passed to do_something_with() and do_something_else_with(). - -The compiler would be likely to manufacture this additional load if -do_something_with() was an inline function that made very heavy use -of registers: reloading from variable a could save a flush to the -stack and later reload. To prevent the compiler from attacking your -code in this manner, write the following:: - - tmp_a = READ_ONCE(a); - do_something_with(tmp_a); - do_something_else_with(tmp_a); - -For a final example, consider the following code, assuming that the -variable a is set at boot time before the second CPU is brought online -and never changed later, so that memory barriers are not needed:: - - if (a) - b = 9; - else - b = 42; - -The compiler is within its rights to manufacture an additional store -by transforming the above code into the following:: - - b = 42; - if (a) - b = 9; - -This could come as a fatal surprise to other code running concurrently -that expected b to never have the value 42 if a was zero. To prevent -the compiler from doing this, write something like:: - - if (a) - WRITE_ONCE(b, 9); - else - WRITE_ONCE(b, 42); - -Don't even -think- about doing this without proper use of memory barriers, -locks, or atomic operations if variable a can change at runtime! - -.. warning:: - - ``READ_ONCE()`` OR ``WRITE_ONCE()`` DO NOT IMPLY A BARRIER! - -Now, we move onto the atomic operation interfaces typically implemented with -the help of assembly code. 
:: - - void atomic_add(int i, atomic_t *v); - void atomic_sub(int i, atomic_t *v); - void atomic_inc(atomic_t *v); - void atomic_dec(atomic_t *v); - -These four routines add and subtract integral values to/from the given -atomic_t value. The first two routines pass explicit integers by -which to make the adjustment, whereas the latter two use an implicit -adjustment value of "1". - -One very important aspect of these two routines is that they DO NOT -require any explicit memory barriers. They need only perform the -atomic_t counter update in an SMP safe manner. - -Next, we have:: - - int atomic_inc_return(atomic_t *v); - int atomic_dec_return(atomic_t *v); - -These routines add 1 and subtract 1, respectively, from the given -atomic_t and return the new counter value after the operation is -performed. - -Unlike the above routines, it is required that these primitives -include explicit memory barriers that are performed before and after -the operation. It must be done such that all memory operations before -and after the atomic operation calls are strongly ordered with respect -to the atomic operation itself. - -For example, it should behave as if a smp_mb() call existed both -before and after the atomic operation. - -If the atomic instructions used in an implementation provide explicit -memory barrier semantics which satisfy the above requirements, that is -fine as well. - -Let's move on:: - - int atomic_add_return(int i, atomic_t *v); - int atomic_sub_return(int i, atomic_t *v); - -These behave just like atomic_{inc,dec}_return() except that an -explicit counter adjustment is given instead of the implicit "1". -This means that like atomic_{inc,dec}_return(), the memory barrier -semantics are required. - -Next:: - - int atomic_inc_and_test(atomic_t *v); - int atomic_dec_and_test(atomic_t *v); - -These two routines increment and decrement by 1, respectively, the -given atomic counter. They return a boolean indicating whether the -resulting counter value was zero or not. - -Again, these primitives provide explicit memory barrier semantics around -the atomic operation:: - - int atomic_sub_and_test(int i, atomic_t *v); - -This is identical to atomic_dec_and_test() except that an explicit -decrement is given instead of the implicit "1". This primitive must -provide explicit memory barrier semantics around the operation:: - - int atomic_add_negative(int i, atomic_t *v); - -The given increment is added to the given atomic counter value. A boolean -is return which indicates whether the resulting counter value is negative. -This primitive must provide explicit memory barrier semantics around -the operation. - -Then:: - - int atomic_xchg(atomic_t *v, int new); - -This performs an atomic exchange operation on the atomic variable v, setting -the given new value. It returns the old value that the atomic variable v had -just before the operation. - -atomic_xchg must provide explicit memory barriers around the operation. :: - - int atomic_cmpxchg(atomic_t *v, int old, int new); - -This performs an atomic compare exchange operation on the atomic value v, -with the given old and new values. Like all atomic_xxx operations, -atomic_cmpxchg will only satisfy its atomicity semantics as long as all -other accesses of \*v are performed through atomic_xxx operations. - -atomic_cmpxchg must provide explicit memory barriers around the operation, -although if the comparison fails then no memory ordering guarantees are -required. - -The semantics for atomic_cmpxchg are the same as those defined for 'cas' -below. 
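As an illustrative aside (not part of the patch), the usual retry pattern built
on atomic_cmpxchg() might look like the following; the helper name and the
limit parameter are invented for the example::

    #include <linux/atomic.h>

    /* Increment v, but only while it stays below @limit. */
    static bool example_inc_below(atomic_t *v, int limit)
    {
            int old, new;

            do {
                    old = atomic_read(v);
                    if (old >= limit)
                            return false;
                    new = old + 1;
            } while (atomic_cmpxchg(v, old, new) != old);

            return true;
    }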
- -Finally:: - - int atomic_add_unless(atomic_t *v, int a, int u); - -If the atomic value v is not equal to u, this function adds a to v, and -returns non zero. If v is equal to u then it returns zero. This is done as -an atomic operation. - -atomic_add_unless must provide explicit memory barriers around the -operation unless it fails (returns 0). - -atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0) - - -If a caller requires memory barrier semantics around an atomic_t -operation which does not return a value, a set of interfaces are -defined which accomplish this:: - - void smp_mb__before_atomic(void); - void smp_mb__after_atomic(void); - -Preceding a non-value-returning read-modify-write atomic operation with -smp_mb__before_atomic() and following it with smp_mb__after_atomic() -provides the same full ordering that is provided by value-returning -read-modify-write atomic operations. - -For example, smp_mb__before_atomic() can be used like so:: - - obj->dead = 1; - smp_mb__before_atomic(); - atomic_dec(&obj->ref_count); - -It makes sure that all memory operations preceding the atomic_dec() -call are strongly ordered with respect to the atomic counter -operation. In the above example, it guarantees that the assignment of -"1" to obj->dead will be globally visible to other cpus before the -atomic counter decrement. - -Without the explicit smp_mb__before_atomic() call, the -implementation could legally allow the atomic counter update visible -to other cpus before the "obj->dead = 1;" assignment. - -A missing memory barrier in the cases where they are required by the -atomic_t implementation above can have disastrous results. Here is -an example, which follows a pattern occurring frequently in the Linux -kernel. It is the use of atomic counters to implement reference -counting, and it works such that once the counter falls to zero it can -be guaranteed that no other entity can be accessing the object:: - - static void obj_list_add(struct obj *obj, struct list_head *head) - { - obj->active = 1; - list_add(&obj->list, head); - } - - static void obj_list_del(struct obj *obj) - { - list_del(&obj->list); - obj->active = 0; - } - - static void obj_destroy(struct obj *obj) - { - BUG_ON(obj->active); - kfree(obj); - } - - struct obj *obj_list_peek(struct list_head *head) - { - if (!list_empty(head)) { - struct obj *obj; - - obj = list_entry(head->next, struct obj, list); - atomic_inc(&obj->refcnt); - return obj; - } - return NULL; - } - - void obj_poke(void) - { - struct obj *obj; - - spin_lock(&global_list_lock); - obj = obj_list_peek(&global_list); - spin_unlock(&global_list_lock); - - if (obj) { - obj->ops->poke(obj); - if (atomic_dec_and_test(&obj->refcnt)) - obj_destroy(obj); - } - } - - void obj_timeout(struct obj *obj) - { - spin_lock(&global_list_lock); - obj_list_del(obj); - spin_unlock(&global_list_lock); - - if (atomic_dec_and_test(&obj->refcnt)) - obj_destroy(obj); - } - -.. note:: - - This is a simplification of the ARP queue management in the generic - neighbour discover code of the networking. Olaf Kirch found a bug wrt. - memory barriers in kfree_skb() that exposed the atomic_t memory barrier - requirements quite clearly. - -Given the above scheme, it must be the case that the obj->active -update done by the obj list deletion be visible to other processors -before the atomic counter decrement is performed. - -Otherwise, the counter could fall to zero, yet obj->active would still -be set, thus triggering the assertion in obj_destroy(). 
The error -sequence looks like this:: - - cpu 0 cpu 1 - obj_poke() obj_timeout() - obj = obj_list_peek(); - ... gains ref to obj, refcnt=2 - obj_list_del(obj); - obj->active = 0 ... - ... visibility delayed ... - atomic_dec_and_test() - ... refcnt drops to 1 ... - atomic_dec_and_test() - ... refcount drops to 0 ... - obj_destroy() - BUG() triggers since obj->active - still seen as one - obj->active update visibility occurs - -With the memory barrier semantics required of the atomic_t operations -which return values, the above sequence of memory visibility can never -happen. Specifically, in the above case the atomic_dec_and_test() -counter decrement would not become globally visible until the -obj->active update does. - -As a historical note, 32-bit Sparc used to only allow usage of -24-bits of its atomic_t type. This was because it used 8 bits -as a spinlock for SMP safety. Sparc32 lacked a "compare and swap" -type instruction. However, 32-bit Sparc has since been moved over -to a "hash table of spinlocks" scheme, that allows the full 32-bit -counter to be realized. Essentially, an array of spinlocks are -indexed into based upon the address of the atomic_t being operated -on, and that lock protects the atomic operation. Parisc uses the -same scheme. - -Another note is that the atomic_t operations returning values are -extremely slow on an old 386. - - -Atomic Bitmask -============== - -We will now cover the atomic bitmask operations. You will find that -their SMP and memory barrier semantics are similar in shape and scope -to the atomic_t ops above. - -Native atomic bit operations are defined to operate on objects aligned -to the size of an "unsigned long" C data type, and are least of that -size. The endianness of the bits within each "unsigned long" are the -native endianness of the cpu. :: - - void set_bit(unsigned long nr, volatile unsigned long *addr); - void clear_bit(unsigned long nr, volatile unsigned long *addr); - void change_bit(unsigned long nr, volatile unsigned long *addr); - -These routines set, clear, and change, respectively, the bit number -indicated by "nr" on the bit mask pointed to by "ADDR". - -They must execute atomically, yet there are no implicit memory barrier -semantics required of these interfaces. :: - - int test_and_set_bit(unsigned long nr, volatile unsigned long *addr); - int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); - int test_and_change_bit(unsigned long nr, volatile unsigned long *addr); - -Like the above, except that these routines return a boolean which -indicates whether the changed bit was set _BEFORE_ the atomic bit -operation. - - -.. warning:: - It is incredibly important that the value be a boolean, ie. "0" or "1". - Do not try to be fancy and save a few instructions by declaring the - above to return "long" and just returning something like "old_val & - mask" because that will not work. - -For one thing, this return value gets truncated to int in many code -paths using these interfaces, so on 64-bit if the bit is set in the -upper 32-bits then testers will never see that. - -One great example of where this problem crops up are the thread_info -flag operations. Routines such as test_and_set_ti_thread_flag() chop -the return value into an int. There are other places where things -like this occur as well. - -These routines, like the atomic_t counter operations returning values, -must provide explicit memory barrier semantics around their execution. 
-All memory operations before the atomic bit operation call must be -made visible globally before the atomic bit operation is made visible. -Likewise, the atomic bit operation must be visible globally before any -subsequent memory operation is made visible. For example:: - - obj->dead = 1; - if (test_and_set_bit(0, &obj->flags)) - /* ... */; - obj->killed = 1; - -The implementation of test_and_set_bit() must guarantee that -"obj->dead = 1;" is visible to cpus before the atomic memory operation -done by test_and_set_bit() becomes visible. Likewise, the atomic -memory operation done by test_and_set_bit() must become visible before -"obj->killed = 1;" is visible. - -Finally there is the basic operation:: - - int test_bit(unsigned long nr, __const__ volatile unsigned long *addr); - -Which returns a boolean indicating if bit "nr" is set in the bitmask -pointed to by "addr". - -If explicit memory barriers are required around {set,clear}_bit() (which do -not return a value, and thus does not need to provide memory barrier -semantics), two interfaces are provided:: - - void smp_mb__before_atomic(void); - void smp_mb__after_atomic(void); - -They are used as follows, and are akin to their atomic_t operation -brothers:: - - /* All memory operations before this call will - * be globally visible before the clear_bit(). - */ - smp_mb__before_atomic(); - clear_bit( ... ); - - /* The clear_bit() will be visible before all - * subsequent memory operations. - */ - smp_mb__after_atomic(); - -There are two special bitops with lock barrier semantics (acquire/release, -same as spinlocks). These operate in the same way as their non-_lock/unlock -postfixed variants, except that they are to provide acquire/release semantics, -respectively. This means they can be used for bit_spin_trylock and -bit_spin_unlock type operations without specifying any more barriers. :: - - int test_and_set_bit_lock(unsigned long nr, unsigned long *addr); - void clear_bit_unlock(unsigned long nr, unsigned long *addr); - void __clear_bit_unlock(unsigned long nr, unsigned long *addr); - -The __clear_bit_unlock version is non-atomic, however it still implements -unlock barrier semantics. This can be useful if the lock itself is protecting -the other bits in the word. - -Finally, there are non-atomic versions of the bitmask operations -provided. They are used in contexts where some other higher-level SMP -locking scheme is being used to protect the bitmask, and thus less -expensive non-atomic operations may be used in the implementation. -They have names similar to the above bitmask operation interfaces, -except that two underscores are prefixed to the interface name. :: - - void __set_bit(unsigned long nr, volatile unsigned long *addr); - void __clear_bit(unsigned long nr, volatile unsigned long *addr); - void __change_bit(unsigned long nr, volatile unsigned long *addr); - int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr); - int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); - int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr); - -These non-atomic variants also do not require any special memory -barrier semantics. - -The routines xchg() and cmpxchg() must provide the same exact -memory-barrier semantics as the atomic and bit operations returning -values. - -.. note:: - - If someone wants to use xchg(), cmpxchg() and their variants, - linux/atomic.h should be included rather than asm/cmpxchg.h, unless the - code is in arch/* and can take care of itself. 
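To make the acquire/release bitops described above concrete, here is a minimal
bit-lock sketch (not from the patch; the flag word and bit number are invented
for the example)::

    #include <linux/bitops.h>

    #define EXAMPLE_LOCK_BIT  0

    static unsigned long example_flags;

    static bool example_trylock(void)
    {
            /* acquire semantics: succeeds only if the bit was previously clear */
            return !test_and_set_bit_lock(EXAMPLE_LOCK_BIT, &example_flags);
    }

    static void example_unlock(void)
    {
            /* release semantics: earlier stores are visible before the bit clears */
            clear_bit_unlock(EXAMPLE_LOCK_BIT, &example_flags);
    }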
- -Spinlocks and rwlocks have memory barrier expectations as well. -The rule to follow is simple: - -1) When acquiring a lock, the implementation must make it globally - visible before any subsequent memory operation. - -2) When releasing a lock, the implementation must make it such that - all previous memory operations are globally visible before the - lock release. - -Which finally brings us to _atomic_dec_and_lock(). There is an -architecture-neutral version implemented in lib/dec_and_lock.c, -but most platforms will wish to optimize this in assembler. :: - - int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock); - -Atomically decrement the given counter, and if will drop to zero -atomically acquire the given spinlock and perform the decrement -of the counter to zero. If it does not drop to zero, do nothing -with the spinlock. - -It is actually pretty simple to get the memory barrier correct. -Simply satisfy the spinlock grab requirements, which is make -sure the spinlock operation is globally visible before any -subsequent memory operation. - -We can demonstrate this operation more clearly if we define -an abstract atomic operation:: - - long cas(long *mem, long old, long new); - -"cas" stands for "compare and swap". It atomically: - -1) Compares "old" with the value currently at "mem". -2) If they are equal, "new" is written to "mem". -3) Regardless, the current value at "mem" is returned. - -As an example usage, here is what an atomic counter update -might look like:: - - void example_atomic_inc(long *counter) - { - long old, new, ret; - - while (1) { - old = *counter; - new = old + 1; - - ret = cas(counter, old, new); - if (ret == old) - break; - } - } - -Let's use cas() in order to build a pseudo-C atomic_dec_and_lock():: - - int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) - { - long old, new, ret; - int went_to_zero; - - went_to_zero = 0; - while (1) { - old = atomic_read(atomic); - new = old - 1; - if (new == 0) { - went_to_zero = 1; - spin_lock(lock); - } - ret = cas(atomic, old, new); - if (ret == old) - break; - if (went_to_zero) { - spin_unlock(lock); - went_to_zero = 0; - } - } - - return went_to_zero; - } - -Now, as far as memory barriers go, as long as spin_lock() -strictly orders all subsequent memory operations (including -the cas()) with respect to itself, things will be fine. - -Said another way, _atomic_dec_and_lock() must guarantee that -a counter dropping to zero is never made visible before the -spinlock being acquired. - -.. note:: - - Note that this also means that for the case where the counter is not - dropping to zero, there are no memory ordering requirements. diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst index a1582cc79f0f..889fc84ccd1b 100644 --- a/Documentation/core-api/cachetlb.rst +++ b/Documentation/core-api/cachetlb.rst @@ -88,13 +88,17 @@ changes occur: This is used primarily during fault processing. -5) ``void update_mmu_cache(struct vm_area_struct *vma, - unsigned long address, pte_t *ptep)`` +5) ``void update_mmu_cache_range(struct vm_fault *vmf, + struct vm_area_struct *vma, unsigned long address, pte_t *ptep, + unsigned int nr)`` - At the end of every page fault, this routine is invoked to - tell the architecture specific code that a translation - now exists at virtual address "address" for address space - "vma->vm_mm", in the software page tables. 
+ At the end of every page fault, this routine is invoked to tell + the architecture specific code that translations now exists + in the software page tables for address space "vma->vm_mm" + at virtual address "address" for "nr" consecutive pages. + + This routine is also invoked in various other places which pass + a NULL "vmf". A port may use this information in any way it so chooses. For example, it could use this event to pre-load TLB @@ -213,9 +217,9 @@ Here are the routines, one by one: there will be no entries in the cache for the kernel address space for virtual addresses in the range 'start' to 'end-1'. - The first of these two routines is invoked after map_kernel_range() + The first of these two routines is invoked after vmap_range() has installed the page table entries. The second is invoked - before unmap_kernel_range() deletes the page table entries. + before vunmap_range() deletes the page table entries. There exists another whole class of cpu cache issues which currently require a whole different set of interfaces to handle properly. @@ -269,12 +273,17 @@ maps this page at its virtual address. If D-cache aliasing is not an issue, these two routines may simply call memcpy/memset directly and do nothing more. - ``void flush_dcache_page(struct page *page)`` + ``void flush_dcache_folio(struct folio *folio)`` + + This routines must be called when: - Any time the kernel writes to a page cache page, _OR_ - the kernel is about to read from a page cache page and - user space shared/writable mappings of this page potentially - exist, this routine is called. + a) the kernel did write to a page that is in the page cache page + and / or in high memory + b) the kernel is about to read from a page cache page and user space + shared/writable mappings of this page potentially exist. Note + that {get,pin}_user_pages{_fast} already call flush_dcache_folio + on any page found in the user address space and thus driver + code rarely needs to take this into account. .. note:: @@ -284,38 +293,35 @@ maps this page at its virtual address. handling vfs symlinks in the page cache need not call this interface at all. - The phrase "kernel writes to a page cache page" means, - specifically, that the kernel executes store instructions - that dirty data in that page at the page->virtual mapping - of that page. It is important to flush here to handle - D-cache aliasing, to make sure these kernel stores are - visible to user space mappings of that page. + The phrase "kernel writes to a page cache page" means, specifically, + that the kernel executes store instructions that dirty data in that + page at the kernel virtual mapping of that page. It is important to + flush here to handle D-cache aliasing, to make sure these kernel stores + are visible to user space mappings of that page. - The corollary case is just as important, if there are users - which have shared+writable mappings of this file, we must make - sure that kernel reads of these pages will see the most recent - stores done by the user. + The corollary case is just as important, if there are users which have + shared+writable mappings of this file, we must make sure that kernel + reads of these pages will see the most recent stores done by the user. - If D-cache aliasing is not an issue, this routine may - simply be defined as a nop on that architecture. + If D-cache aliasing is not an issue, this routine may simply be defined + as a nop on that architecture. - There is a bit set aside in page->flags (PG_arch_1) as - "architecture private". 
The kernel guarantees that, - for pagecache pages, it will clear this bit when such - a page first enters the pagecache. + There is a bit set aside in folio->flags (PG_arch_1) as "architecture + private". The kernel guarantees that, for pagecache pages, it will + clear this bit when such a page first enters the pagecache. This allows these interfaces to be implemented much more - efficiently. It allows one to "defer" (perhaps indefinitely) - the actual flush if there are currently no user processes - mapping this page. See sparc64's flush_dcache_page and - update_mmu_cache implementations for an example of how to go - about doing this. - - The idea is, first at flush_dcache_page() time, if - page->mapping->i_mmap is an empty tree, just mark the architecture - private page flag bit. Later, in update_mmu_cache(), a check is - made of this flag bit, and if set the flush is done and the flag - bit is cleared. + efficiently. It allows one to "defer" (perhaps indefinitely) the + actual flush if there are currently no user processes mapping this + page. See sparc64's flush_dcache_folio and update_mmu_cache_range + implementations for an example of how to go about doing this. + + The idea is, first at flush_dcache_folio() time, if + folio_flush_mapping() returns a mapping, and mapping_mapped() on that + mapping returns %false, just mark the architecture private page + flag bit. Later, in update_mmu_cache_range(), a check is made + of this flag bit, and if set the flush is done and the flag bit + is cleared. .. important:: @@ -345,25 +351,12 @@ maps this page at its virtual address. When the kernel needs to access the contents of an anonymous page, it calls this function (currently only - get_user_pages()). Note: flush_dcache_page() deliberately + get_user_pages()). Note: flush_dcache_folio() deliberately doesn't work for an anonymous page. The default implementation is a nop (and should remain so for all coherent architectures). For incoherent architectures, it should flush the cache of the page at vmaddr. - ``void flush_kernel_dcache_page(struct page *page)`` - - When the kernel needs to modify a user page is has obtained - with kmap, it calls this function after all modifications are - complete (but before kunmapping it) to bring the underlying - page up to date. It is assumed here that the user has no - incoherent cached copies (i.e. the original page was obtained - from a mechanism like get_user_pages()). The default - implementation is a nop and should remain so on all coherent - architectures. On incoherent architectures, this should flush - the kernel cache for page (using page_address(page)). - - ``void flush_icache_range(unsigned long start, unsigned long end)`` When the kernel stores into addresses that it will execute @@ -375,7 +368,7 @@ maps this page at its virtual address. ``void flush_icache_page(struct vm_area_struct *vma, struct page *page)`` All the functionality of flush_icache_page can be implemented in - flush_dcache_page and update_mmu_cache. In the future, the hope + flush_dcache_folio and update_mmu_cache_range. In the future, the hope is to remove this interface completely. 
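As a brief sketch of the flush_dcache_folio() rule described above (not part of
the patch; the helper and its arguments are hypothetical), a driver dirtying a
pagecache folio through a kernel mapping would follow this pattern::

    #include <linux/highmem.h>
    #include <linux/string.h>

    static void example_fill_folio(struct folio *folio, const void *src, size_t len)
    {
            void *kaddr = kmap_local_folio(folio, 0);

            memcpy(kaddr, src, len);        /* kernel stores dirty the folio */
            kunmap_local(kaddr);
            flush_dcache_folio(folio);      /* make the stores visible to user mappings */
    }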
The final category of APIs is for I/O to deliberately aliased address diff --git a/Documentation/core-api/cgroup.rst b/Documentation/core-api/cgroup.rst new file mode 100644 index 000000000000..734ea21e1e17 --- /dev/null +++ b/Documentation/core-api/cgroup.rst @@ -0,0 +1,9 @@ +================== +Cgroup Kernel APIs +================== + +Device Memory Cgroup API (dmemcg) +================================= +.. kernel-doc:: kernel/cgroup/dmem.c + :export: + diff --git a/Documentation/core-api/cleanup.rst b/Documentation/core-api/cleanup.rst new file mode 100644 index 000000000000..527eb2f8ec6e --- /dev/null +++ b/Documentation/core-api/cleanup.rst @@ -0,0 +1,8 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +Scope-based Cleanup Helpers +=========================== + +.. kernel-doc:: include/linux/cleanup.h + :doc: scope-based cleanup helpers diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst index 4a50ab7817f7..e1b0eeabbb5e 100644 --- a/Documentation/core-api/cpu_hotplug.rst +++ b/Documentation/core-api/cpu_hotplug.rst @@ -2,12 +2,13 @@ CPU hotplug in the Kernel ========================= -:Date: December, 2016 +:Date: September, 2021 :Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>, - Rusty Russell <rusty@rustcorp.com.au>, - Srivatsa Vaddagiri <vatsa@in.ibm.com>, - Ashok Raj <ashok.raj@intel.com>, - Joel Schopp <jschopp@austin.ibm.com> + Rusty Russell <rusty@rustcorp.com.au>, + Srivatsa Vaddagiri <vatsa@in.ibm.com>, + Ashok Raj <ashok.raj@intel.com>, + Joel Schopp <jschopp@austin.ibm.com>, + Thomas Gleixner <tglx@linutronix.de> Introduction ============ @@ -30,33 +31,20 @@ which didn't support these methods. Command Line Switches ===================== ``maxcpus=n`` - Restrict boot time CPUs to *n*. Say if you have fourV CPUs, using + Restrict boot time CPUs to *n*. Say if you have four CPUs, using ``maxcpus=2`` will only boot two. You can choose to bring the other CPUs later online. ``nr_cpus=n`` - Restrict the total amount CPUs the kernel will support. If the number - supplied here is lower than the number of physically available CPUs than + Restrict the total amount of CPUs the kernel will support. If the number + supplied here is lower than the number of physically available CPUs, then those CPUs can not be brought online later. -``additional_cpus=n`` - Use this to limit hotpluggable CPUs. This option sets - ``cpu_possible_mask = cpu_present_mask + additional_cpus`` - - This option is limited to the IA64 architecture. - ``possible_cpus=n`` This option sets ``possible_cpus`` bits in ``cpu_possible_mask``. This option is limited to the X86 and S390 architecture. -``cede_offline={"off","on"}`` - Use this option to disable/enable putting offlined processors to an extended - ``H_CEDE`` state on supported pseries platforms. If nothing is specified, - ``cede_offline`` is set to "on". - - This option is limited to the PowerPC architecture. - ``cpu0_hotplug`` Allow to shutdown CPU0. @@ -98,9 +86,10 @@ Never use anything other than ``cpumask_t`` to represent bitmap of CPUs. Using CPU hotplug ================= + The kernel option *CONFIG_HOTPLUG_CPU* needs to be enabled. It is currently available on multiple architectures including ARM, MIPS, PowerPC and X86. 
The -configuration is done via the sysfs interface: :: +configuration is done via the sysfs interface:: $ ls -lh /sys/devices/system/cpu total 0 @@ -120,35 +109,27 @@ configuration is done via the sysfs interface: :: The files *offline*, *online*, *possible*, *present* represent the CPU masks. Each CPU folder contains an *online* file which controls the logical on (1) and -off (0) state. To logically shutdown CPU4: :: +off (0) state. To logically shutdown CPU4:: $ echo 0 > /sys/devices/system/cpu/cpu4/online smpboot: CPU 4 is now offline Once the CPU is shutdown, it will be removed from */proc/interrupts*, */proc/cpuinfo* and should also not be shown visible by the *top* command. To -bring CPU4 back online: :: +bring CPU4 back online:: $ echo 1 > /sys/devices/system/cpu/cpu4/online smpboot: Booting Node 0 Processor 4 APIC 0x1 -The CPU is usable again. This should work on all CPUs. CPU0 is often special -and excluded from CPU hotplug. On X86 the kernel option -*CONFIG_BOOTPARAM_HOTPLUG_CPU0* has to be enabled in order to be able to -shutdown CPU0. Alternatively the kernel command option *cpu0_hotplug* can be -used. Some known dependencies of CPU0: - -* Resume from hibernate/suspend. Hibernate/suspend will fail if CPU0 is offline. -* PIC interrupts. CPU0 can't be removed if a PIC interrupt is detected. - -Please let Fenghua Yu <fenghua.yu@intel.com> know if you find any dependencies -on CPU0. +The CPU is usable again. This should work on all CPUs, but CPU0 is often special +and excluded from CPU hotplug. The CPU hotplug coordination ============================ The offline case ---------------- + Once a CPU has been logically shutdown the teardown callbacks of registered hotplug states will be invoked, starting with ``CPUHP_ONLINE`` and terminating at state ``CPUHP_OFFLINE``. This includes: @@ -163,105 +144,491 @@ at state ``CPUHP_OFFLINE``. This includes: * Once all services are migrated, kernel calls an arch specific routine ``__cpu_disable()`` to perform arch specific cleanup. -Using the hotplug API ---------------------- -It is possible to receive notifications once a CPU is offline or onlined. This -might be important to certain drivers which need to perform some kind of setup -or clean up functions based on the number of available CPUs: :: - - #include <linux/cpuhotplug.h> - - ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "X/Y:online", - Y_online, Y_prepare_down); - -*X* is the subsystem and *Y* the particular driver. The *Y_online* callback -will be invoked during registration on all online CPUs. If an error -occurs during the online callback the *Y_prepare_down* callback will be -invoked on all CPUs on which the online callback was previously invoked. -After registration completed, the *Y_online* callback will be invoked -once a CPU is brought online and *Y_prepare_down* will be invoked when a -CPU is shutdown. All resources which were previously allocated in -*Y_online* should be released in *Y_prepare_down*. -The return value *ret* is negative if an error occurred during the -registration process. Otherwise a positive value is returned which -contains the allocated hotplug for dynamically allocated states -(*CPUHP_AP_ONLINE_DYN*). It will return zero for predefined states. - -The callback can be remove by invoking ``cpuhp_remove_state()``. In case of a -dynamically allocated state (*CPUHP_AP_ONLINE_DYN*) use the returned state. -During the removal of a hotplug state the teardown callback will be invoked. 
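A minimal sketch of that dynamic-state registration pattern (not from the
patch; the callback and subsystem names are invented, and the same API is
described again in the new text below)::

    #include <linux/cpuhotplug.h>
    #include <linux/init.h>

    static enum cpuhp_state example_hp_state;

    static int example_online(unsigned int cpu)
    {
            /* set up the per-CPU resources needed on @cpu */
            return 0;
    }

    static int example_prepare_down(unsigned int cpu)
    {
            /* release whatever example_online() set up for @cpu */
            return 0;
    }

    static int __init example_init(void)
    {
            int ret;

            ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "example/driver:online",
                                    example_online, example_prepare_down);
            if (ret < 0)
                    return ret;
            example_hp_state = ret;        /* dynamically allocated state number */
            return 0;
    }

    static void __exit example_exit(void)
    {
            cpuhp_remove_state(example_hp_state);
    }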
- -Multiple instances -~~~~~~~~~~~~~~~~~~ -If a driver has multiple instances and each instance needs to perform the -callback independently then it is likely that a ''multi-state'' should be used. -First a multi-state state needs to be registered: :: - - ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "X/Y:online, - Y_online, Y_prepare_down); - Y_hp_online = ret; - -The ``cpuhp_setup_state_multi()`` behaves similar to ``cpuhp_setup_state()`` -except it prepares the callbacks for a multi state and does not invoke -the callbacks. This is a one time setup. -Once a new instance is allocated, you need to register this new instance: :: - - ret = cpuhp_state_add_instance(Y_hp_online, &d->node); - -This function will add this instance to your previously allocated -*Y_hp_online* state and invoke the previously registered callback -(*Y_online*) on all online CPUs. The *node* element is a ``struct -hlist_node`` member of your per-instance data structure. - -On removal of the instance: :: - cpuhp_state_remove_instance(Y_hp_online, &d->node) - -should be invoked which will invoke the teardown callback on all online -CPUs. - -Manual setup -~~~~~~~~~~~~ -Usually it is handy to invoke setup and teardown callbacks on registration or -removal of a state because usually the operation needs to performed once a CPU -goes online (offline) and during initial setup (shutdown) of the driver. However -each registration and removal function is also available with a ``_nocalls`` -suffix which does not invoke the provided callbacks if the invocation of the -callbacks is not desired. During the manual setup (or teardown) the functions -``get_online_cpus()`` and ``put_online_cpus()`` should be used to inhibit CPU -hotplug operations. - - -The ordering of the events --------------------------- -The hotplug states are defined in ``include/linux/cpuhotplug.h``: - -* The states *CPUHP_OFFLINE* … *CPUHP_AP_OFFLINE* are invoked before the - CPU is up. -* The states *CPUHP_AP_OFFLINE* … *CPUHP_AP_ONLINE* are invoked - just the after the CPU has been brought up. The interrupts are off and - the scheduler is not yet active on this CPU. Starting with *CPUHP_AP_OFFLINE* - the callbacks are invoked on the target CPU. -* The states between *CPUHP_AP_ONLINE_DYN* and *CPUHP_AP_ONLINE_DYN_END* are - reserved for the dynamic allocation. -* The states are invoked in the reverse order on CPU shutdown starting with - *CPUHP_ONLINE* and stopping at *CPUHP_OFFLINE*. Here the callbacks are - invoked on the CPU that will be shutdown until *CPUHP_AP_OFFLINE*. - -A dynamically allocated state via *CPUHP_AP_ONLINE_DYN* is often enough. -However if an earlier invocation during the bring up or shutdown is required -then an explicit state should be acquired. An explicit state might also be -required if the hotplug event requires specific ordering in respect to -another hotplug event. + +The CPU hotplug API +=================== + +CPU hotplug state machine +------------------------- + +CPU hotplug uses a trivial state machine with a linear state space from +CPUHP_OFFLINE to CPUHP_ONLINE. Each state has a startup and a teardown +callback. + +When a CPU is onlined, the startup callbacks are invoked sequentially until +the state CPUHP_ONLINE is reached. They can also be invoked when the +callbacks of a state are set up or an instance is added to a multi-instance +state. + +When a CPU is offlined the teardown callbacks are invoked in the reverse +order sequentially until the state CPUHP_OFFLINE is reached. 
They can also +be invoked when the callbacks of a state are removed or an instance is +removed from a multi-instance state. + +If a usage site requires only a callback in one direction of the hotplug +operations (CPU online or CPU offline) then the other not-required callback +can be set to NULL when the state is set up. + +The state space is divided into three sections: + +* The PREPARE section + + The PREPARE section covers the state space from CPUHP_OFFLINE to + CPUHP_BRINGUP_CPU. + + The startup callbacks in this section are invoked before the CPU is + started during a CPU online operation. The teardown callbacks are invoked + after the CPU has become dysfunctional during a CPU offline operation. + + The callbacks are invoked on a control CPU as they can't obviously run on + the hotplugged CPU which is either not yet started or has become + dysfunctional already. + + The startup callbacks are used to setup resources which are required to + bring a CPU successfully online. The teardown callbacks are used to free + resources or to move pending work to an online CPU after the hotplugged + CPU became dysfunctional. + + The startup callbacks are allowed to fail. If a callback fails, the CPU + online operation is aborted and the CPU is brought down to the previous + state (usually CPUHP_OFFLINE) again. + + The teardown callbacks in this section are not allowed to fail. + +* The STARTING section + + The STARTING section covers the state space between CPUHP_BRINGUP_CPU + 1 + and CPUHP_AP_ONLINE. + + The startup callbacks in this section are invoked on the hotplugged CPU + with interrupts disabled during a CPU online operation in the early CPU + setup code. The teardown callbacks are invoked with interrupts disabled + on the hotplugged CPU during a CPU offline operation shortly before the + CPU is completely shut down. + + The callbacks in this section are not allowed to fail. + + The callbacks are used for low level hardware initialization/shutdown and + for core subsystems. + +* The ONLINE section + + The ONLINE section covers the state space between CPUHP_AP_ONLINE + 1 and + CPUHP_ONLINE. + + The startup callbacks in this section are invoked on the hotplugged CPU + during a CPU online operation. The teardown callbacks are invoked on the + hotplugged CPU during a CPU offline operation. + + The callbacks are invoked in the context of the per CPU hotplug thread, + which is pinned on the hotplugged CPU. The callbacks are invoked with + interrupts and preemption enabled. + + The callbacks are allowed to fail. When a callback fails the hotplug + operation is aborted and the CPU is brought back to the previous state. + +CPU online/offline operations +----------------------------- + +A successful online operation looks like this:: + + [CPUHP_OFFLINE] + [CPUHP_OFFLINE + 1]->startup() -> success + [CPUHP_OFFLINE + 2]->startup() -> success + [CPUHP_OFFLINE + 3] -> skipped because startup == NULL + ... + [CPUHP_BRINGUP_CPU]->startup() -> success + === End of PREPARE section + [CPUHP_BRINGUP_CPU + 1]->startup() -> success + ... + [CPUHP_AP_ONLINE]->startup() -> success + === End of STARTUP section + [CPUHP_AP_ONLINE + 1]->startup() -> success + ... + [CPUHP_ONLINE - 1]->startup() -> success + [CPUHP_ONLINE] + +A successful offline operation looks like this:: + + [CPUHP_ONLINE] + [CPUHP_ONLINE - 1]->teardown() -> success + ... + [CPUHP_AP_ONLINE + 1]->teardown() -> success + === Start of STARTUP section + [CPUHP_AP_ONLINE]->teardown() -> success + ... + [CPUHP_BRINGUP_ONLINE - 1]->teardown() + ... 
+ === Start of PREPARE section + [CPUHP_BRINGUP_CPU]->teardown() + [CPUHP_OFFLINE + 3]->teardown() + [CPUHP_OFFLINE + 2] -> skipped because teardown == NULL + [CPUHP_OFFLINE + 1]->teardown() + [CPUHP_OFFLINE] + +A failed online operation looks like this:: + + [CPUHP_OFFLINE] + [CPUHP_OFFLINE + 1]->startup() -> success + [CPUHP_OFFLINE + 2]->startup() -> success + [CPUHP_OFFLINE + 3] -> skipped because startup == NULL + ... + [CPUHP_BRINGUP_CPU]->startup() -> success + === End of PREPARE section + [CPUHP_BRINGUP_CPU + 1]->startup() -> success + ... + [CPUHP_AP_ONLINE]->startup() -> success + === End of STARTUP section + [CPUHP_AP_ONLINE + 1]->startup() -> success + --- + [CPUHP_AP_ONLINE + N]->startup() -> fail + [CPUHP_AP_ONLINE + (N - 1)]->teardown() + ... + [CPUHP_AP_ONLINE + 1]->teardown() + === Start of STARTUP section + [CPUHP_AP_ONLINE]->teardown() + ... + [CPUHP_BRINGUP_ONLINE - 1]->teardown() + ... + === Start of PREPARE section + [CPUHP_BRINGUP_CPU]->teardown() + [CPUHP_OFFLINE + 3]->teardown() + [CPUHP_OFFLINE + 2] -> skipped because teardown == NULL + [CPUHP_OFFLINE + 1]->teardown() + [CPUHP_OFFLINE] + +A failed offline operation looks like this:: + + [CPUHP_ONLINE] + [CPUHP_ONLINE - 1]->teardown() -> success + ... + [CPUHP_ONLINE - N]->teardown() -> fail + [CPUHP_ONLINE - (N - 1)]->startup() + ... + [CPUHP_ONLINE - 1]->startup() + [CPUHP_ONLINE] + +Recursive failures cannot be handled sensibly. Look at the following +example of a recursive fail due to a failed offline operation: :: + + [CPUHP_ONLINE] + [CPUHP_ONLINE - 1]->teardown() -> success + ... + [CPUHP_ONLINE - N]->teardown() -> fail + [CPUHP_ONLINE - (N - 1)]->startup() -> success + [CPUHP_ONLINE - (N - 2)]->startup() -> fail + +The CPU hotplug state machine stops right here and does not try to go back +down again because that would likely result in an endless loop:: + + [CPUHP_ONLINE - (N - 1)]->teardown() -> success + [CPUHP_ONLINE - N]->teardown() -> fail + [CPUHP_ONLINE - (N - 1)]->startup() -> success + [CPUHP_ONLINE - (N - 2)]->startup() -> fail + [CPUHP_ONLINE - (N - 1)]->teardown() -> success + [CPUHP_ONLINE - N]->teardown() -> fail + +Lather, rinse and repeat. In this case the CPU left in state:: + + [CPUHP_ONLINE - (N - 1)] + +which at least lets the system make progress and gives the user a chance to +debug or even resolve the situation. + +Allocating a state +------------------ + +There are two ways to allocate a CPU hotplug state: + +* Static allocation + + Static allocation has to be used when the subsystem or driver has + ordering requirements versus other CPU hotplug states. E.g. the PERF core + startup callback has to be invoked before the PERF driver startup + callbacks during a CPU online operation. During a CPU offline operation + the driver teardown callbacks have to be invoked before the core teardown + callback. The statically allocated states are described by constants in + the cpuhp_state enum which can be found in include/linux/cpuhotplug.h. + + Insert the state into the enum at the proper place so the ordering + requirements are fulfilled. The state constant has to be used for state + setup and removal. + + Static allocation is also required when the state callbacks are not set + up at runtime and are part of the initializer of the CPU hotplug state + array in kernel/cpu.c. + +* Dynamic allocation + + When there are no ordering requirements for the state callbacks then + dynamic allocation is the preferred method. 
The state number is allocated + by the setup function and returned to the caller on success. + + Only the PREPARE and ONLINE sections provide a dynamic allocation + range. The STARTING section does not as most of the callbacks in that + section have explicit ordering requirements. + +Setup of a CPU hotplug state +---------------------------- + +The core code provides the following functions to setup a state: + +* cpuhp_setup_state(state, name, startup, teardown) +* cpuhp_setup_state_nocalls(state, name, startup, teardown) +* cpuhp_setup_state_cpuslocked(state, name, startup, teardown) +* cpuhp_setup_state_nocalls_cpuslocked(state, name, startup, teardown) + +For cases where a driver or a subsystem has multiple instances and the same +CPU hotplug state callbacks need to be invoked for each instance, the CPU +hotplug core provides multi-instance support. The advantage over driver +specific instance lists is that the instance related functions are fully +serialized against CPU hotplug operations and provide the automatic +invocations of the state callbacks on add and removal. To set up such a +multi-instance state the following function is available: + +* cpuhp_setup_state_multi(state, name, startup, teardown) + +The @state argument is either a statically allocated state or one of the +constants for dynamically allocated states - CPUHP_BP_PREPARE_DYN, +CPUHP_AP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for +which a dynamic state should be allocated. + +The @name argument is used for sysfs output and for instrumentation. The +naming convention is "subsys:mode" or "subsys/driver:mode", +e.g. "perf:mode" or "perf/x86:mode". The common mode names are: + +======== ======================================================= +prepare For states in the PREPARE section + +dead For states in the PREPARE section which do not provide + a startup callback + +starting For states in the STARTING section + +dying For states in the STARTING section which do not provide + a startup callback + +online For states in the ONLINE section + +offline For states in the ONLINE section which do not provide + a startup callback +======== ======================================================= + +As the @name argument is only used for sysfs and instrumentation other mode +descriptors can be used as well if they describe the nature of the state +better than the common ones. + +Examples for @name arguments: "perf/online", "perf/x86:prepare", +"RCU/tree:dying", "sched/waitempty" + +The @startup argument is a function pointer to the callback which should be +invoked during a CPU online operation. If the usage site does not require a +startup callback set the pointer to NULL. + +The @teardown argument is a function pointer to the callback which should +be invoked during a CPU offline operation. If the usage site does not +require a teardown callback set the pointer to NULL. + +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_setup_state_nocalls(), cpuhp_setup_state_nocalls_cpuslocked() + and cpuhp_setup_state_multi() only install the callbacks + + * cpuhp_setup_state() and cpuhp_setup_state_cpuslocked() install the + callbacks and invoke the @startup callback (if not NULL) for all online + CPUs which have currently a state greater than the newly installed + state. Depending on the state section the callback is either invoked on + the current CPU (PREPARE section) or on each online CPU (ONLINE + section) in the context of the CPU's hotplug thread. 
+ + If a callback fails for CPU N then the teardown callback for CPU + 0 .. N-1 is invoked to rollback the operation. The state setup fails, + the callbacks for the state are not installed and in case of dynamic + allocation the allocated state is freed. + +The state setup and the callback invocations are serialized against CPU +hotplug operations. If the setup function has to be called from a CPU +hotplug read locked region, then the _cpuslocked() variants have to be +used. These functions cannot be used from within CPU hotplug callbacks. + +The function return values: + ======== =================================================================== + 0 Statically allocated state was successfully set up + + >0 Dynamically allocated state was successfully set up. + + The returned number is the state number which was allocated. If + the state callbacks have to be removed later, e.g. module + removal, then this number has to be saved by the caller and used + as @state argument for the state remove function. For + multi-instance states the dynamically allocated state number is + also required as @state argument for the instance add/remove + operations. + + <0 Operation failed + ======== =================================================================== + +Removal of a CPU hotplug state +------------------------------ + +To remove a previously set up state, the following functions are provided: + +* cpuhp_remove_state(state) +* cpuhp_remove_state_nocalls(state) +* cpuhp_remove_state_nocalls_cpuslocked(state) +* cpuhp_remove_multi_state(state) + +The @state argument is either a statically allocated state or the state +number which was allocated in the dynamic range by cpuhp_setup_state*(). If +the state is in the dynamic range, then the state number is freed and +available for dynamic allocation again. + +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_remove_state_nocalls(), cpuhp_remove_state_nocalls_cpuslocked() + and cpuhp_remove_multi_state() only remove the callbacks. + + * cpuhp_remove_state() removes the callbacks and invokes the teardown + callback (if not NULL) for all online CPUs which have currently a state + greater than the removed state. Depending on the state section the + callback is either invoked on the current CPU (PREPARE section) or on + each online CPU (ONLINE section) in the context of the CPU's hotplug + thread. + + In order to complete the removal, the teardown callback should not fail. + +The state removal and the callback invocations are serialized against CPU +hotplug operations. If the remove function has to be called from a CPU +hotplug read locked region, then the _cpuslocked() variants have to be +used. These functions cannot be used from within CPU hotplug callbacks. + +If a multi-instance state is removed then the caller has to remove all +instances first. + +Multi-Instance state instance management +---------------------------------------- + +Once the multi-instance state is set up, instances can be added to the +state: + + * cpuhp_state_add_instance(state, node) + * cpuhp_state_add_instance_nocalls(state, node) + +The @state argument is either a statically allocated state or the state +number which was allocated in the dynamic range by cpuhp_setup_state_multi(). + +The @node argument is a pointer to an hlist_node which is embedded in the +instance's data structure. The pointer is handed to the multi-instance +state callbacks and can be used by the callback to retrieve the instance +via container_of(). 
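+
+As an illustration, a hypothetical instance structure could embed the
+hlist_node and a callback could recover the instance like this (the
+``subsys`` names below are made up for this sketch)::
+
+    struct subsys_instance {
+        struct hlist_node node;
+        /* instance specific data */
+    };
+
+    static int subsys_cpu_online(unsigned int cpu, struct hlist_node *node)
+    {
+        struct subsys_instance *inst;
+
+        inst = container_of(node, struct subsys_instance, node);
+        /* set up this instance for the CPU which came online */
+        return 0;
+    }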
+ +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_state_add_instance_nocalls() and only adds the instance to the + multi-instance state's node list. + + * cpuhp_state_add_instance() adds the instance and invokes the startup + callback (if not NULL) associated with @state for all online CPUs which + have currently a state greater than @state. The callback is only + invoked for the to be added instance. Depending on the state section + the callback is either invoked on the current CPU (PREPARE section) or + on each online CPU (ONLINE section) in the context of the CPU's hotplug + thread. + + If a callback fails for CPU N then the teardown callback for CPU + 0 .. N-1 is invoked to rollback the operation, the function fails and + the instance is not added to the node list of the multi-instance state. + +To remove an instance from the state's node list these functions are +available: + + * cpuhp_state_remove_instance(state, node) + * cpuhp_state_remove_instance_nocalls(state, node) + +The arguments are the same as for the cpuhp_state_add_instance*() +variants above. + +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_state_remove_instance_nocalls() only removes the instance from the + state's node list. + + * cpuhp_state_remove_instance() removes the instance and invokes the + teardown callback (if not NULL) associated with @state for all online + CPUs which have currently a state greater than @state. The callback is + only invoked for the to be removed instance. Depending on the state + section the callback is either invoked on the current CPU (PREPARE + section) or on each online CPU (ONLINE section) in the context of the + CPU's hotplug thread. + + In order to complete the removal, the teardown callback should not fail. + +The node list add/remove operations and the callback invocations are +serialized against CPU hotplug operations. These functions cannot be used +from within CPU hotplug callbacks and CPU hotplug read locked regions. + +Examples +-------- + +Setup and teardown a statically allocated state in the STARTING section for +notifications on online and offline operations:: + + ret = cpuhp_setup_state(CPUHP_SUBSYS_STARTING, "subsys:starting", subsys_cpu_starting, subsys_cpu_dying); + if (ret < 0) + return ret; + .... + cpuhp_remove_state(CPUHP_SUBSYS_STARTING); + +Setup and teardown a dynamically allocated state in the ONLINE section +for notifications on offline operations:: + + state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline); + if (state < 0) + return state; + .... + cpuhp_remove_state(state); + +Setup and teardown a dynamically allocated state in the ONLINE section +for notifications on online operations without invoking the callbacks:: + + state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "subsys:online", subsys_cpu_online, NULL); + if (state < 0) + return state; + .... + cpuhp_remove_state_nocalls(state); + +Setup, use and teardown a dynamically allocated multi-instance state in the +ONLINE section for notifications on online and offline operation:: + + state = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline); + if (state < 0) + return state; + .... + ret = cpuhp_state_add_instance(state, &inst1->node); + if (ret) + return ret; + .... + ret = cpuhp_state_add_instance(state, &inst2->node); + if (ret) + return ret; + .... + cpuhp_remove_instance(state, &inst1->node); + .... 
+ cpuhp_remove_instance(state, &inst2->node); + .... + cpuhp_remove_multi_state(state); + Testing of hotplug states ========================= + One way to verify whether a custom state is working as expected or not is to shutdown a CPU and then put it online again. It is also possible to put the CPU to certain state (for instance *CPUHP_AP_ONLINE*) and then go back to *CPUHP_ONLINE*. This would simulate an error one state after *CPUHP_AP_ONLINE* which would lead to rollback to the online state. -All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states``: :: +All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states`` :: $ tail /sys/devices/system/cpu/hotplug/states 138: mm/vmscan:online @@ -275,7 +642,7 @@ All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states 168: sched:active 169: online -To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue: :: +To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue:: $ cat /sys/devices/system/cpu/cpu4/hotplug/state 169 @@ -283,14 +650,14 @@ To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue: :: $ cat /sys/devices/system/cpu/cpu4/hotplug/state 140 -It is important to note that the teardown callbac of state 140 have been -invoked. And now get back online: :: +It is important to note that the teardown callback of state 140 have been +invoked. And now get back online:: $ echo 169 > /sys/devices/system/cpu/cpu4/hotplug/target $ cat /sys/devices/system/cpu/cpu4/hotplug/state 169 -With trace events enabled, the individual steps are visible, too: :: +With trace events enabled, the individual steps are visible, too:: # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | @@ -325,6 +692,7 @@ trace. Architecture's requirements =========================== + The following functions and configurations are required: ``CONFIG_HOTPLUG_CPU`` @@ -346,11 +714,12 @@ The following functions and configurations are required: User Space Notification ======================= -After CPU successfully onlined or offline udev events are sent. A udev rule like: :: + +After CPU successfully onlined or offline udev events are sent. A udev rule like:: SUBSYSTEM=="cpu", DRIVERS=="processor", DEVPATH=="/devices/system/cpu/*", RUN+="the_hotplug_receiver.sh" -will receive all events. A script like: :: +will receive all events. A script like:: #!/bin/sh @@ -366,6 +735,26 @@ will receive all events. A script like: :: can process the event further. +When changes to the CPUs in the system occur, the sysfs file +/sys/devices/system/cpu/crash_hotplug contains '1' if the kernel +updates the kdump capture kernel list of CPUs itself (via elfcorehdr and +other relevant kexec segment), or '0' if userspace must update the kdump +capture kernel list of CPUs. + +The availability depends on the CONFIG_HOTPLUG_CPU kernel configuration +option. + +To skip userspace processing of CPU hot un/plug events for kdump +(i.e. the unload-then-reload to obtain a current list of CPUs), this sysfs +file can be used in a udev rule as follows: + + SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" + +For a CPU hot un/plug event, if the architecture supports kernel updates +of the elfcorehdr (which contains the list of CPUs) and other relevant +kexec segments, then the rule skips the unload-then-reload of the kdump +capture kernel. 
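+
+As a sketch, the skip rule could be placed in front of an existing kdump
+reload rule, so that the reload only runs when the kernel does not update
+the kexec segments itself (the script name below is hypothetical)::
+
+    SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
+    SUBSYSTEM=="cpu", DRIVERS=="processor", DEVPATH=="/devices/system/cpu/*", RUN+="kdump-reload.sh"
+    LABEL="kdump_reload_end"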
+ Kernel Inline Documentations Reference ====================================== diff --git a/Documentation/core-api/debugging-via-ohci1394.rst b/Documentation/core-api/debugging-via-ohci1394.rst index 981ad4f89fd3..cb3d3228dfc8 100644 --- a/Documentation/core-api/debugging-via-ohci1394.rst +++ b/Documentation/core-api/debugging-via-ohci1394.rst @@ -23,9 +23,9 @@ Retrieving a full system memory dump is also possible over the FireWire, using data transfer rates in the order of 10MB/s or more. With most FireWire controllers, memory access is limited to the low 4 GB -of physical address space. This can be a problem on IA64 machines where -memory is located mostly above that limit, but it is rarely a problem on -more common hardware such as x86, x86-64 and PowerPC. +of physical address space. This can be a problem on machines where memory is +located mostly above that limit, but it is rarely a problem on more common +hardware such as x86, x86-64 and PowerPC. At least LSI FW643e and FW643e2 controllers are known to support access to physical addresses above 4 GB, but this feature is currently not enabled by diff --git a/Documentation/core-api/dma-api-howto.rst b/Documentation/core-api/dma-api-howto.rst index 358d495456d1..0bf31b6c4383 100644 --- a/Documentation/core-api/dma-api-howto.rst +++ b/Documentation/core-api/dma-api-howto.rst @@ -8,7 +8,7 @@ Dynamic DMA mapping Guide This is a guide to device driver writers on how to use the DMA API with example pseudo-code. For a concise description of the API, see -DMA-API.txt. +Documentation/core-api/dma-api.rst. CPU and DMA addresses ===================== @@ -185,7 +185,7 @@ device struct of your device is embedded in the bus-specific device struct of your device. For example, &pdev->dev is a pointer to the device struct of a PCI device (pdev is a pointer to the PCI device struct of your device). -These calls usually return zero to indicated your device can perform DMA +These calls usually return zero to indicate your device can perform DMA properly on the machine given the address mask you provided, but they might return an error if the mask is too small to be supportable on the given system. If it returns non-zero, your device cannot perform DMA properly on @@ -203,13 +203,33 @@ setting the DMA mask fails. In this manner, if a user of your driver reports that performance is bad or that the device is not even detected, you can ask them for the kernel messages to find out exactly why. -The standard 64-bit addressing device would do something like this:: +The 24-bit addressing device would do something like this:: - if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64))) { + if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(24))) { dev_warn(dev, "mydev: No suitable DMA available\n"); goto ignore_this_device; } +The standard 64-bit addressing device would do something like this:: + + dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)) + +dma_set_mask_and_coherent() never return fail when DMA_BIT_MASK(64). Typical +error code like:: + + /* Wrong code */ + if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64))) + dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32)) + +dma_set_mask_and_coherent() will never return failure when bigger than 32. 
+So typical code like:: + + /* Recommended code */ + if (support_64bit) + dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)); + else + dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32)); + If the device only supports 32-bit addressing for descriptors in the coherent allocations, but supports full 64-bits for streaming mappings it would look like this:: @@ -707,20 +727,6 @@ to use the dma_sync_*() interfaces:: } } -Drivers converted fully to this interface should not use virt_to_bus() any -longer, nor should they use bus_to_virt(). Some drivers have to be changed a -little bit, because there is no longer an equivalent to bus_to_virt() in the -dynamic DMA mapping scheme - you have to always store the DMA addresses -returned by the dma_alloc_coherent(), dma_pool_alloc(), and dma_map_single() -calls (dma_map_sg() stores them in the scatterlist itself if the platform -supports dynamic DMA mapping in hardware) in your driver structures and/or -in the card registers. - -All drivers should be using these interfaces with no exceptions. It -is planned to completely remove virt_to_bus() and bus_to_virt() as -they are entirely deprecated. Some ports already do not provide these -as it is impossible to correctly support them. - Handling Errors =============== diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst index f41620439ef3..2ad08517e626 100644 --- a/Documentation/core-api/dma-api.rst +++ b/Documentation/core-api/dma-api.rst @@ -5,7 +5,7 @@ Dynamic DMA mapping using the generic device :Author: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> This document describes the DMA API. For a more gentle introduction -of the API (and actual examples), see Documentation/DMA-API-HOWTO.txt. +of the API (and actual examples), see Documentation/core-api/dma-api-howto.rst. This API is split into two pieces. Part I describes the basic API. Part II describes extensions for supporting non-consistent memory @@ -206,6 +206,20 @@ others should not be larger than the returned value. :: + size_t + dma_opt_mapping_size(struct device *dev); + +Returns the maximum optimal size of a mapping for the device. + +Mapping larger buffers may take much longer in certain scenarios. In +addition, for high-rate short-lived streaming mappings, the upfront time +spent on the mapping may account for an appreciable part of the total +request lifetime. As such, if splitting larger requests incurs no +significant performance penalty, then device drivers are advised to +limit total DMA streaming mappings length to the returned value. + +:: + bool dma_need_sync(struct device *dev, dma_addr_t dma_addr); @@ -434,7 +448,7 @@ DMA address entries returned. Synchronise a single contiguous or scatter/gather mapping for the CPU and device. With the sync_sg API, all the parameters must be the same -as those passed into the single mapping API. With the sync_single API, +as those passed into the sg mapping API. With the sync_single API, you can use dma_handle and size parameters that aren't identical to those passed into the single mapping API to do a partial sync. @@ -479,7 +493,8 @@ without the _attrs suffixes, except that they pass an optional dma_attrs. The interpretation of DMA attributes is architecture-specific, and -each attribute should be documented in Documentation/DMA-attributes.txt. +each attribute should be documented in +Documentation/core-api/dma-attributes.rst. 
If dma_attrs are 0, the semantics of each of these functions is identical to those of the corresponding function @@ -492,7 +507,7 @@ for DMA:: #include <linux/dma-mapping.h> /* DMA_ATTR_FOO should be defined in linux/dma-mapping.h and - * documented in Documentation/DMA-attributes.txt */ + * documented in Documentation/core-api/dma-attributes.rst */ ... unsigned long attr; @@ -515,101 +530,250 @@ routines, e.g.::: .... } +Part Ie - IOVA-based DMA mappings +--------------------------------- -Part II - Advanced dma usage ----------------------------- +These APIs allow a very efficient mapping when using an IOMMU. They are an +optional path that requires extra code and are only recommended for drivers +where DMA mapping performance, or the space usage for storing the DMA addresses +matter. All the considerations from the previous section apply here as well. -Warning: These pieces of the DMA API should not be used in the -majority of cases, since they cater for unlikely corner cases that -don't belong in usual drivers. +:: + + bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state, + phys_addr_t phys, size_t size); -If you don't understand how cache line coherency works between a -processor and an I/O device, you should not be using this part of the -API at all. +Is used to try to allocate IOVA space for mapping operation. If it returns +false this API can't be used for the given device and the normal streaming +DMA mapping API should be used. The ``struct dma_iova_state`` is allocated +by the driver and must be kept around until unmap time. :: - void * - dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle, - gfp_t flag, unsigned long attrs) + static inline bool dma_use_iova(struct dma_iova_state *state) + +Can be used by the driver to check if the IOVA-based API is used after a +call to dma_iova_try_alloc. This can be useful in the unmap path. + +:: -Identical to dma_alloc_coherent() except that when the -DMA_ATTR_NON_CONSISTENT flags is passed in the attrs argument, the -platform will choose to return either consistent or non-consistent memory -as it sees fit. By using this API, you are guaranteeing to the platform -that you have all the correct and necessary sync points for this memory -in the driver should it choose to return non-consistent memory. + int dma_iova_link(struct device *dev, struct dma_iova_state *state, + phys_addr_t phys, size_t offset, size_t size, + enum dma_data_direction dir, unsigned long attrs); + +Is used to link ranges to the IOVA previously allocated. The start of all +but the first call to dma_iova_link for a given state must be aligned +to the DMA merge boundary returned by ``dma_get_merge_boundary())``, and +the size of all but the last range must be aligned to the DMA merge boundary +as well. + +:: + + int dma_iova_sync(struct device *dev, struct dma_iova_state *state, + size_t offset, size_t size); + +Must be called to sync the IOMMU page tables for IOVA-range mapped by one or +more calls to ``dma_iova_link()``. + +For drivers that use a one-shot mapping, all ranges can be unmapped and the +IOVA freed by calling: + +:: + + void dma_iova_destroy(struct device *dev, struct dma_iova_state *state, + size_t mapped_len, enum dma_data_direction dir, + unsigned long attrs); + +Alternatively drivers can dynamically manage the IOVA space by unmapping +and mapping individual regions. In that case + +:: -Note: where the platform can return consistent memory, it will -guarantee that the sync points become nops. 
+ void dma_iova_unlink(struct device *dev, struct dma_iova_state *state, + size_t offset, size_t size, enum dma_data_direction dir, + unsigned long attrs); -Warning: Handling non-consistent memory is a real pain. You should -only use this API if you positively know your driver will be -required to work on one of the rare (usually non-PCI) architectures -that simply cannot make consistent memory. +is used to unmap a range previously mapped, and + +:: + + void dma_iova_free(struct device *dev, struct dma_iova_state *state); + +is used to free the IOVA space. All regions must have been unmapped using +``dma_iova_unlink()`` before calling ``dma_iova_free()``. + +Part II - Non-coherent DMA allocations +-------------------------------------- + +These APIs allow to allocate pages that are guaranteed to be DMA addressable +by the passed in device, but which need explicit management of memory ownership +for the kernel vs the device. + +If you don't understand how cache line coherency works between a processor and +an I/O device, you should not be using this part of the API. + +:: + + struct page * + dma_alloc_pages(struct device *dev, size_t size, dma_addr_t *dma_handle, + enum dma_data_direction dir, gfp_t gfp) + +This routine allocates a region of <size> bytes of non-coherent memory. It +returns a pointer to first struct page for the region, or NULL if the +allocation failed. The resulting struct page can be used for everything a +struct page is suitable for. + +It also returns a <dma_handle> which may be cast to an unsigned integer the +same width as the bus and given to the device as the DMA address base of +the region. + +The dir parameter specified if data is read and/or written by the device, +see dma_map_single() for details. + +The gfp parameter allows the caller to specify the ``GFP_`` flags (see +kmalloc()) for the allocation, but rejects flags used to specify a memory +zone such as GFP_DMA or GFP_HIGHMEM. + +Before giving the memory to the device, dma_sync_single_for_device() needs +to be called, and before reading memory written by the device, +dma_sync_single_for_cpu(), just like for streaming DMA mappings that are +reused. :: void - dma_free_attrs(struct device *dev, size_t size, void *cpu_addr, - dma_addr_t dma_handle, unsigned long attrs) + dma_free_pages(struct device *dev, size_t size, struct page *page, + dma_addr_t dma_handle, enum dma_data_direction dir) -Free memory allocated by the dma_alloc_attrs(). All common -parameters must be identical to those otherwise passed to dma_free_coherent, -and the attrs argument must be identical to the attrs passed to -dma_alloc_attrs(). +Free a region of memory previously allocated using dma_alloc_pages(). +dev, size, dma_handle and dir must all be the same as those passed into +dma_alloc_pages(). page must be the pointer returned by dma_alloc_pages(). :: int - dma_get_cache_alignment(void) + dma_mmap_pages(struct device *dev, struct vm_area_struct *vma, + size_t size, struct page *page) -Returns the processor cache alignment. This is the absolute minimum -alignment *and* width that you must observe when either mapping -memory or doing partial flushes. +Map an allocation returned from dma_alloc_pages() into a user address space. +dev and size must be the same as those passed into dma_alloc_pages(). +page must be the pointer returned by dma_alloc_pages(). -.. note:: +:: - This API may return a number *larger* than the actual cache - line, but it will guarantee that one or more cache lines fit exactly - into the width returned by this call. 
It will also always be a power - of two for easy alignment. + void * + dma_alloc_noncoherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, enum dma_data_direction dir, + gfp_t gfp) + +This routine is a convenient wrapper around dma_alloc_pages that returns the +kernel virtual address for the allocated memory instead of the page structure. :: void - dma_cache_sync(struct device *dev, void *vaddr, size_t size, - enum dma_data_direction direction) + dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle, enum dma_data_direction dir) + +Free a region of memory previously allocated using dma_alloc_noncoherent(). +dev, size, dma_handle and dir must all be the same as those passed into +dma_alloc_noncoherent(). cpu_addr must be the virtual address returned by +dma_alloc_noncoherent(). + +:: + + struct sg_table * + dma_alloc_noncontiguous(struct device *dev, size_t size, + enum dma_data_direction dir, gfp_t gfp, + unsigned long attrs); + +This routine allocates <size> bytes of non-coherent and possibly non-contiguous +memory. It returns a pointer to struct sg_table that describes the allocated +and DMA mapped memory, or NULL if the allocation failed. The resulting memory +can be used for struct page mapped into a scatterlist are suitable for. + +The return sg_table is guaranteed to have 1 single DMA mapped segment as +indicated by sgt->nents, but it might have multiple CPU side segments as +indicated by sgt->orig_nents. + +The dir parameter specified if data is read and/or written by the device, +see dma_map_single() for details. + +The gfp parameter allows the caller to specify the ``GFP_`` flags (see +kmalloc()) for the allocation, but rejects flags used to specify a memory +zone such as GFP_DMA or GFP_HIGHMEM. + +The attrs argument must be either 0 or DMA_ATTR_ALLOC_SINGLE_PAGES. + +Before giving the memory to the device, dma_sync_sgtable_for_device() needs +to be called, and before reading memory written by the device, +dma_sync_sgtable_for_cpu(), just like for streaming DMA mappings that are +reused. + +:: + + void + dma_free_noncontiguous(struct device *dev, size_t size, + struct sg_table *sgt, + enum dma_data_direction dir) + +Free memory previously allocated using dma_alloc_noncontiguous(). dev, size, +and dir must all be the same as those passed into dma_alloc_noncontiguous(). +sgt must be the pointer returned by dma_alloc_noncontiguous(). + +:: + + void * + dma_vmap_noncontiguous(struct device *dev, size_t size, + struct sg_table *sgt) + +Return a contiguous kernel mapping for an allocation returned from +dma_alloc_noncontiguous(). dev and size must be the same as those passed into +dma_alloc_noncontiguous(). sgt must be the pointer returned by +dma_alloc_noncontiguous(). + +Once a non-contiguous allocation is mapped using this function, the +flush_kernel_vmap_range() and invalidate_kernel_vmap_range() APIs must be used +to manage the coherency between the kernel mapping, the device and user space +mappings (if any). + +:: + + void + dma_vunmap_noncontiguous(struct device *dev, void *vaddr) + +Unmap a kernel mapping returned by dma_vmap_noncontiguous(). dev must be the +same the one passed into dma_alloc_noncontiguous(). vaddr must be the pointer +returned by dma_vmap_noncontiguous(). -Do a partial sync of memory that was allocated by dma_alloc_attrs() with -the DMA_ATTR_NON_CONSISTENT flag starting at virtual address vaddr and -continuing on for size. Again, you *must* observe the cache line -boundaries when doing this. 
:: int - dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, - dma_addr_t device_addr, size_t size); + dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma, + size_t size, struct sg_table *sgt) -Declare region of memory to be handed out by dma_alloc_coherent() when -it's asked for coherent memory for this device. +Map an allocation returned from dma_alloc_noncontiguous() into a user address +space. dev and size must be the same as those passed into +dma_alloc_noncontiguous(). sgt must be the pointer returned by +dma_alloc_noncontiguous(). -phys_addr is the CPU physical address to which the memory is currently -assigned (this will be ioremapped so the CPU can access the region). +:: -device_addr is the DMA address the device needs to be programmed -with to actually address this memory (this will be handed out as the -dma_addr_t in dma_alloc_coherent()). + int + dma_get_cache_alignment(void) -size is the size of the area (must be multiples of PAGE_SIZE). +Returns the processor cache alignment. This is the absolute minimum +alignment *and* width that you must observe when either mapping +memory or doing partial flushes. -As a simplification for the platforms, only *one* such region of -memory may be declared per device. +.. note:: + + This API may return a number *larger* than the actual cache + line, but it will guarantee that one or more cache lines fit exactly + into the width returned by this call. It will also always be a power + of two for easy alignment. -For reasons of efficiency, most platforms choose to track the declared -region only at the granularity of a page. For smaller allocations, -you should use the dma_pool() API. Part III - Debug drivers use of the DMA-API ------------------------------------------- diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst index 29dcbe8826e8..1887d92e8e92 100644 --- a/Documentation/core-api/dma-attributes.rst +++ b/Documentation/core-api/dma-attributes.rst @@ -25,14 +25,6 @@ Since it is optional for platforms to implement DMA_ATTR_WRITE_COMBINE, those that do not will simply ignore the attribute and exhibit default behavior. -DMA_ATTR_NON_CONSISTENT ------------------------ - -DMA_ATTR_NON_CONSISTENT lets the platform to choose to return either -consistent or non-consistent memory as it sees fit. By using this API, -you are guaranteeing to the platform that you have all the correct and -necessary sync points for this memory in the driver. - DMA_ATTR_NO_KERNEL_MAPPING -------------------------- diff --git a/Documentation/core-api/dma-isa-lpc.rst b/Documentation/core-api/dma-isa-lpc.rst index b1ec7b16c21f..17b193603f0a 100644 --- a/Documentation/core-api/dma-isa-lpc.rst +++ b/Documentation/core-api/dma-isa-lpc.rst @@ -17,7 +17,7 @@ To do ISA style DMA you need to include two headers:: #include <asm/dma.h> The first is the generic DMA API used to convert virtual addresses to -bus addresses (see Documentation/DMA-API.txt for details). +bus addresses (see Documentation/core-api/dma-api.rst for details). The second contains the routines specific to ISA DMA transfers. 
Since this is not present on all platforms make sure you construct your diff --git a/Documentation/core-api/entry.rst b/Documentation/core-api/entry.rst new file mode 100644 index 000000000000..a15f9b1767a2 --- /dev/null +++ b/Documentation/core-api/entry.rst @@ -0,0 +1,279 @@ +Entry/exit handling for exceptions, interrupts, syscalls and KVM +================================================================ + +All transitions between execution domains require state updates which are +subject to strict ordering constraints. State updates are required for the +following: + + * Lockdep + * RCU / Context tracking + * Preemption counter + * Tracing + * Time accounting + +The update order depends on the transition type and is explained below in +the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular +exceptions`_, `NMI and NMI-like exceptions`_. + +Non-instrumentable code - noinstr +--------------------------------- + +Most instrumentation facilities depend on RCU, so instrumentation is prohibited +for entry code before RCU starts watching and exit code after RCU stops +watching. In addition, many architectures must save and restore register state, +which means that (for example) a breakpoint in the breakpoint entry code would +overwrite the debug registers of the initial breakpoint. + +Such code must be marked with the 'noinstr' attribute, placing that code into a +special section inaccessible to instrumentation and debug facilities. Some +functions are partially instrumentable, which is handled by marking them +noinstr and using instrumentation_begin() and instrumentation_end() to flag the +instrumentable ranges of code: + +.. code-block:: c + + noinstr void entry(void) + { + handle_entry(); // <-- must be 'noinstr' or '__always_inline' + ... + + instrumentation_begin(); + handle_context(); // <-- instrumentable code + instrumentation_end(); + + ... + handle_exit(); // <-- must be 'noinstr' or '__always_inline' + } + +This allows verification of the 'noinstr' restrictions via objtool on +supported architectures. + +Invoking non-instrumentable functions from instrumentable context has no +restrictions and is useful to protect e.g. state switching which would +cause malfunction if instrumented. + +All non-instrumentable entry/exit code sections before and after the RCU +state transitions must run with interrupts disabled. + +Syscalls +-------- + +Syscall-entry code starts in assembly code and calls out into low-level C code +after establishing low-level architecture-specific state and stack frames. This +low-level C code must not be instrumented. A typical syscall handling function +invoked from low-level assembly code looks like this: + +.. code-block:: c + + noinstr void syscall(struct pt_regs *regs, int nr) + { + arch_syscall_enter(regs); + nr = syscall_enter_from_user_mode(regs, nr); + + instrumentation_begin(); + if (!invoke_syscall(regs, nr) && nr != -1) + result_reg(regs) = __sys_ni_syscall(regs); + instrumentation_end(); + + syscall_exit_to_user_mode(regs); + } + +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which +establishes state in the following order: + + * Lockdep + * RCU / Context tracking + * Tracing + +and then invokes the various entry work functions like ptrace, seccomp, audit, +syscall tracing, etc. After all that is done, the instrumentable invoke_syscall +function can be invoked. The instrumentable code section then ends, after which +syscall_exit_to_user_mode() is invoked. 
+ +syscall_exit_to_user_mode() handles all work which needs to be done before +returning to user space like tracing, audit, signals, task work etc. After +that it invokes exit_to_user_mode() which again handles the state +transition in the reverse order: + + * Tracing + * RCU / Context tracking + * Lockdep + +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also +available as fine grained subfunctions in cases where the architecture code +has to do extra work between the various steps. In such cases it has to +ensure that enter_from_user_mode() is called first on entry and +exit_to_user_mode() is called last on exit. + +Do not nest syscalls. Nested systcalls will cause RCU and/or context tracking +to print a warning. + +KVM +--- + +Entering or exiting guest mode is very similar to syscalls. From the host +kernel point of view the CPU goes off into user space when entering the +guest and returns to the kernel on exit. + +kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode() +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode(). +The state operations have the same ordering. + +Task work handling is done separately for guest at the boundary of the +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of +the work handled on return to user space. + +Do not nest KVM entry/exit transitions because doing so is nonsensical. + +Interrupts and regular exceptions +--------------------------------- + +Interrupts entry and exit handling is slightly more complex than syscalls +and KVM transitions. + +If an interrupt is raised while the CPU executes in user space, the entry +and exit handling is exactly the same as for syscalls. + +If the interrupt is raised while the CPU executes in kernel space the entry and +exit handling is slightly different. RCU state is only updated when the +interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will +already be watching. Lockdep and tracing have to be updated unconditionally. + +irqentry_enter() and irqentry_exit() provide the implementation for this. + +The architecture-specific part looks similar to syscall handling: + +.. code-block:: c + + noinstr void interrupt(struct pt_regs *regs, int nr) + { + arch_interrupt_enter(regs); + state = irqentry_enter(regs); + + instrumentation_begin(); + + irq_enter_rcu(); + invoke_irq_handler(regs, nr); + irq_exit_rcu(); + + instrumentation_end(); + + irqentry_exit(regs, state); + } + +Note that the invocation of the actual interrupt handler is within a +irq_enter_rcu() and irq_exit_rcu() pair. + +irq_enter_rcu() updates the preemption count which makes in_hardirq() +return true, handles NOHZ tick state and interrupt time accounting. This +means that up to the point where irq_enter_rcu() is invoked in_hardirq() +returns false. + +irq_exit_rcu() handles interrupt time accounting, undoes the preemption +count update and eventually handles soft interrupts and NOHZ tick state. + +In theory, the preemption count could be updated in irqentry_enter(). In +practice, deferring this update to irq_enter_rcu() allows the preemption-count +code to be traced, while also maintaining symmetry with irq_exit_rcu() and +irqentry_exit(), which are described in the next paragraph. The only downside +is that the early entry code up to irq_enter_rcu() must be aware that the +preemption count has not yet been updated with the HARDIRQ_OFFSET state. 
+ +Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count +before it handles soft interrupts, whose handlers must run in BH context rather +than irq-disabled context. In addition, irqentry_exit() might schedule, which +also requires that HARDIRQ_OFFSET has been removed from the preemption count. + +Even though interrupt handlers are expected to run with local interrupts +disabled, interrupt nesting is common from an entry/exit perspective. For +example, softirq handling happens within an irqentry_{enter,exit}() block with +local interrupts enabled. Also, although uncommon, nothing prevents an +interrupt handler from re-enabling interrupts. + +Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it +runs with local interrupts disabled. But NMIs can happen anytime, and a lot of +the entry code is shared between the two. + +NMI and NMI-like exceptions +--------------------------- + +NMIs and NMI-like exceptions (machine checks, double faults, debug +interrupts, etc.) can hit any context and must be extra careful with +the state. + +State changes for debug exceptions and machine-check exceptions depend on +whether these exceptions happened in user-space (breakpoints or watchpoints) or +in kernel mode (code patching). From user-space, they are treated like +interrupts, while from kernel mode they are treated like NMIs. + +NMIs and other NMI-like exceptions handle state transitions without +distinguishing between user-mode and kernel-mode origin. + +The state update on entry is handled in irqentry_nmi_enter() which updates +state in the following order: + + * Preemption counter + * Lockdep + * RCU / Context tracking + * Tracing + +The exit counterpart irqentry_nmi_exit() does the reverse operation in the +reverse order. + +Note that the update of the preemption counter has to be the first +operation on enter and the last operation on exit. The reason is that both +lockdep and RCU rely on in_nmi() returning true in this case. The +preemption count modification in the NMI entry/exit case must not be +traced. + +Architecture-specific code looks like this: + +.. code-block:: c + + noinstr void nmi(struct pt_regs *regs) + { + arch_nmi_enter(regs); + state = irqentry_nmi_enter(regs); + + instrumentation_begin(); + nmi_handler(regs); + instrumentation_end(); + + irqentry_nmi_exit(regs); + } + +and for e.g. a debug exception it can look like this: + +.. code-block:: c + + noinstr void debug(struct pt_regs *regs) + { + arch_nmi_enter(regs); + + debug_regs = save_debug_regs(); + + if (user_mode(regs)) { + state = irqentry_enter(regs); + + instrumentation_begin(); + user_mode_debug_handler(regs, debug_regs); + instrumentation_end(); + + irqentry_exit(regs, state); + } else { + state = irqentry_nmi_enter(regs); + + instrumentation_begin(); + kernel_mode_debug_handler(regs, debug_regs); + instrumentation_end(); + + irqentry_nmi_exit(regs, state); + } + } + +There is no combined irqentry_nmi_if_kernel() function available as the +above cannot be handled in an exception-agnostic way. + +NMIs can happen in any context. For example, an NMI-like exception triggered +while handling an NMI. So NMI entry code has to be reentrant and state updates +need to handle nesting. diff --git a/Documentation/core-api/floating-point.rst b/Documentation/core-api/floating-point.rst new file mode 100644 index 000000000000..a8d0d4b05052 --- /dev/null +++ b/Documentation/core-api/floating-point.rst @@ -0,0 +1,78 @@ +.. 
SPDX-License-Identifier: GPL-2.0+ + +Floating-point API +================== + +Kernel code is normally prohibited from using floating-point (FP) registers or +instructions, including the C float and double data types. This rule reduces +system call overhead, because the kernel does not need to save and restore the +userspace floating-point register state. + +However, occasionally drivers or library functions may need to include FP code. +This is supported by isolating the functions containing FP code to a separate +translation unit (a separate source file), and saving/restoring the FP register +state around calls to those functions. This creates "critical sections" of +floating-point usage. + +The reason for this isolation is to prevent the compiler from generating code +touching the FP registers outside these critical sections. Compilers sometimes +use FP registers to optimize inlined ``memcpy`` or variable assignment, as +floating-point registers may be wider than general-purpose registers. + +Usability of floating-point code within the kernel is architecture-specific. +Additionally, because a single kernel may be configured to support platforms +both with and without a floating-point unit, FPU availability must be checked +both at build time and at run time. + +Several architectures implement the generic kernel floating-point API from +``linux/fpu.h``, as described below. Some other architectures implement their +own unique APIs, which are documented separately. + +Build-time API +-------------- + +Floating-point code may be built if the option ``ARCH_HAS_KERNEL_FPU_SUPPORT`` +is enabled. For C code, such code must be placed in a separate file, and that +file must have its compilation flags adjusted using the following pattern:: + + CFLAGS_foo.o += $(CC_FLAGS_FPU) + CFLAGS_REMOVE_foo.o += $(CC_FLAGS_NO_FPU) + +Architectures are expected to define one or both of these variables in their +top-level Makefile as needed. For example:: + + CC_FLAGS_FPU := -mhard-float + +or:: + + CC_FLAGS_NO_FPU := -msoft-float + +Normal kernel code is assumed to use the equivalent of ``CC_FLAGS_NO_FPU``. + +Runtime API +----------- + +The runtime API is provided in ``linux/fpu.h``. This header cannot be included +from files implementing FP code (those with their compilation flags adjusted as +above). Instead, it must be included when defining the FP critical sections. + +.. c:function:: bool kernel_fpu_available( void ) + + This function reports if floating-point code can be used on this CPU or + platform. The value returned by this function is not expected to change + at runtime, so it only needs to be called once, not before every + critical section. + +.. c:function:: void kernel_fpu_begin( void ) + void kernel_fpu_end( void ) + + These functions create a floating-point critical section. It is only + valid to call ``kernel_fpu_begin()`` after a previous call to + ``kernel_fpu_available()`` returned ``true``. These functions are only + guaranteed to be callable from (preemptible or non-preemptible) process + context. + + Preemption may be disabled inside critical sections, so their size + should be minimized. They are *not* required to be reentrant. If the + caller expects to nest critical sections, it must implement its own + reference counting. diff --git a/Documentation/core-api/folio_queue.rst b/Documentation/core-api/folio_queue.rst new file mode 100644 index 000000000000..83cfbc157e49 --- /dev/null +++ b/Documentation/core-api/folio_queue.rst @@ -0,0 +1,209 @@ +.. 
SPDX-License-Identifier: GPL-2.0+ + +=========== +Folio Queue +=========== + +:Author: David Howells <dhowells@redhat.com> + +.. Contents: + + * Overview + * Initialisation + * Adding and removing folios + * Querying information about a folio + * Querying information about a folio_queue + * Folio queue iteration + * Folio marks + * Lockless simultaneous production/consumption issues + + +Overview +======== + +The folio_queue struct forms a single segment in a segmented list of folios +that can be used to form an I/O buffer. As such, the list can be iterated over +using the ITER_FOLIOQ iov_iter type. + +The publicly accessible members of the structure are:: + + struct folio_queue { + struct folio_queue *next; + struct folio_queue *prev; + ... + }; + +A pair of pointers are provided, ``next`` and ``prev``, that point to the +segments on either side of the segment being accessed. Whilst this is a +doubly-linked list, it is intentionally not a circular list; the outward +sibling pointers in terminal segments should be NULL. + +Each segment in the list also stores: + + * an ordered sequence of folio pointers, + * the size of each folio and + * three 1-bit marks per folio, + +but hese should not be accessed directly as the underlying data structure may +change, but rather the access functions outlined below should be used. + +The facility can be made accessible by:: + + #include <linux/folio_queue.h> + +and to use the iterator:: + + #include <linux/uio.h> + + +Initialisation +============== + +A segment should be initialised by calling:: + + void folioq_init(struct folio_queue *folioq); + +with a pointer to the segment to be initialised. Note that this will not +necessarily initialise all the folio pointers, so care must be taken to check +the number of folios added. + + +Adding and removing folios +========================== + +Folios can be set in the next unused slot in a segment struct by calling one +of:: + + unsigned int folioq_append(struct folio_queue *folioq, + struct folio *folio); + + unsigned int folioq_append_mark(struct folio_queue *folioq, + struct folio *folio); + +Both functions update the stored folio count, store the folio and note its +size. The second function also sets the first mark for the folio added. Both +functions return the number of the slot used. [!] Note that no attempt is made +to check that the capacity wasn't overrun and the list will not be extended +automatically. + +A folio can be excised by calling:: + + void folioq_clear(struct folio_queue *folioq, unsigned int slot); + +This clears the slot in the array and also clears all the marks for that folio, +but doesn't change the folio count - so future accesses of that slot must check +if the slot is occupied. + + +Querying information about a folio +================================== + +Information about the folio in a particular slot may be queried by the +following function:: + + struct folio *folioq_folio(const struct folio_queue *folioq, + unsigned int slot); + +If a folio has not yet been set in that slot, this may yield an undefined +pointer. The size of the folio in a slot may be queried with either of:: + + unsigned int folioq_folio_order(const struct folio_queue *folioq, + unsigned int slot); + + size_t folioq_folio_size(const struct folio_queue *folioq, + unsigned int slot); + +The first function returns the size as an order and the second as a number of +bytes. 
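+
+Putting the above together, a minimal (hypothetical) sequence that
+initialises a segment, appends a folio and inspects the result might look
+like this; folio reference handling and cleanup are omitted for brevity::
+
+    struct folio_queue *folioq;
+    unsigned int slot;
+
+    folioq = kmalloc(sizeof(*folioq), GFP_KERNEL);
+    if (!folioq)
+        return -ENOMEM;
+    folioq_init(folioq);
+
+    slot = folioq_append(folioq, folio);    /* folio obtained elsewhere */
+
+    if (folioq_folio(folioq, slot) == folio)
+        pr_debug("slot %u holds %zu bytes\n",
+                 slot, folioq_folio_size(folioq, slot));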
+
+
+Querying information about a folio_queue
+========================================
+
+Information may be retrieved about a particular segment with the following
+functions::
+
+    unsigned int folioq_nr_slots(const struct folio_queue *folioq);
+
+    unsigned int folioq_count(struct folio_queue *folioq);
+
+    bool folioq_full(struct folio_queue *folioq);
+
+The first function returns the maximum capacity of a segment. It must not be
+assumed that this won't vary between segments. The second returns the number
+of folios added to a segment and the third is a shorthand to indicate if the
+segment has been filled to capacity.
+
+Note that the count and fullness are not affected by clearing folios from the
+segment. These are more about indicating how many slots in the array have been
+initialised, and it is assumed that slots won't get reused, but rather that the
+segment will get discarded as the queue is consumed.
+
+
+Folio marks
+===========
+
+Folios within a queue can also have marks assigned to them. These marks can be
+used to note information such as if a folio needs folio_put() calling upon it.
+There are three marks available to be set for each folio.
+
+The marks can be set by::
+
+    void folioq_mark(struct folio_queue *folioq, unsigned int slot);
+    void folioq_mark2(struct folio_queue *folioq, unsigned int slot);
+
+Cleared by::
+
+    void folioq_unmark(struct folio_queue *folioq, unsigned int slot);
+    void folioq_unmark2(struct folio_queue *folioq, unsigned int slot);
+
+And the marks can be queried by::
+
+    bool folioq_is_marked(const struct folio_queue *folioq, unsigned int slot);
+    bool folioq_is_marked2(const struct folio_queue *folioq, unsigned int slot);
+
+The marks can be used for any purpose and are not interpreted by this API.
+
+
+Folio queue iteration
+=====================
+
+A list of segments may be iterated over using the I/O iterator facility, using
+an ``iov_iter`` iterator of ``ITER_FOLIOQ`` type. The iterator may be
+initialised with::
+
+    void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
+                              const struct folio_queue *folioq,
+                              unsigned int first_slot, unsigned int offset,
+                              size_t count);
+
+This may be told to start at a particular segment, slot and offset within a
+queue. The iov iterator functions will follow the next pointers when advancing
+and the prev pointers when reverting, as needed.
+
+
+Lockless simultaneous production/consumption issues
+===================================================
+
+If properly managed, the list can be extended by the producer at the head end
+and shortened by the consumer at the tail end simultaneously without the need
+to take locks. The ITER_FOLIOQ iterator inserts appropriate barriers to aid
+with this.
+
+Care must be taken when simultaneously producing and consuming a list. If the
+last segment is reached and the folios it refers to are entirely consumed by
+the IOV iterators, an iov_iter struct will be left pointing to the last segment
+with a slot number equal to the capacity of that segment. The iterator will
+try to continue on from this if there's another segment available when it is
+used again, but care must be taken lest the segment get removed and freed by
+the consumer before the iterator is advanced.
+
+It is recommended that the queue always contain at least one segment, even if
+that segment has never been filled or is entirely spent. This prevents the
+head and tail pointers from collapsing.
+
+
+API Function Reference
+======================
+
+.. 
kernel-doc:: include/linux/folio_queue.h diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst index 8f06d885c310..582bde9bf5a9 100644 --- a/Documentation/core-api/genericirq.rst +++ b/Documentation/core-api/genericirq.rst @@ -210,7 +210,7 @@ implemented (simplified excerpt):: } } - noop(struct irq_data *data)) + noop(struct irq_data *data) { } @@ -264,7 +264,7 @@ The following control flow is implemented (simplified excerpt):: desc->irq_data.chip->irq_unmask(); desc->status &= ~pending; handle_irq_event(desc->action); - } while (status & pending); + } while (desc->status & pending); desc->status &= ~running; @@ -419,6 +419,7 @@ functions which are exported. .. kernel-doc:: kernel/irq/manage.c .. kernel-doc:: kernel/irq/chip.c + :export: Internal Functions Provided =========================== @@ -431,6 +432,7 @@ functions. .. kernel-doc:: kernel/irq/handle.c .. kernel-doc:: kernel/irq/chip.c + :internal: Credits ======= diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst index e7c32a8de126..858b2fbcb36c 100644 --- a/Documentation/core-api/gfp_mask-from-fs-io.rst +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst @@ -55,14 +55,16 @@ scope. What about __vmalloc(GFP_NOFS) ============================== -vmalloc doesn't support GFP_NOFS semantic because there are hardcoded -GFP_KERNEL allocations deep inside the allocator which are quite non-trivial -to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is -almost always a bug. The good news is that the NOFS/NOIO semantic can be -achieved by the scope API. +Since v5.17, and specifically after the commit 451769ebb7e79 ("mm/vmalloc: +alloc GFP_NO{FS,IO} for vmalloc"), GFP_NOFS/GFP_NOIO are now supported in +``[k]vmalloc`` by implicitly using scope API. + +In earlier kernels ``vmalloc`` didn't support GFP_NOFS semantic because there +were hardcoded GFP_KERNEL allocations deep inside the allocator. That means +that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO was almost always a bug. In the ideal world, upper layers should already mark dangerous contexts -and so no special care is required and vmalloc should be called without -any problems. Sometimes if the context is not really clear or there are -layering violations then the recommended way around that is to wrap ``vmalloc`` -by the scope API with a comment explaining the problem. +and so no special care is required and ``vmalloc`` should be called without any +problems. Sometimes if the context is not really clear or there are layering +violations then the recommended way around that (on pre-v5.17 kernels) is to +wrap ``vmalloc`` by the scope API with a comment explaining the problem. diff --git a/Documentation/core-api/idr.rst b/Documentation/core-api/idr.rst index a2738050c4f0..18d724867064 100644 --- a/Documentation/core-api/idr.rst +++ b/Documentation/core-api/idr.rst @@ -17,51 +17,54 @@ solution to the problem to avoid everybody inventing their own. The IDR provides the ability to map an ID to a pointer, while the IDA provides only ID allocation, and as a result is much more memory-efficient. +The IDR interface is deprecated; please use the :doc:`XArray <xarray>` +instead. + IDR usage ========= -Start by initialising an IDR, either with :c:func:`DEFINE_IDR` -for statically allocated IDRs or :c:func:`idr_init` for dynamically +Start by initialising an IDR, either with DEFINE_IDR() +for statically allocated IDRs or idr_init() for dynamically allocated IDRs. 
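+
+For illustration, here is a minimal sketch of the basic allocate/look
+up/remove pattern described just below; the ``foo`` names are hypothetical
+and any locking needed to serialise these calls is left to the caller::
+
+    struct foo {
+        int id;
+        /* ... */
+    };
+
+    static DEFINE_IDR(foo_idr);
+
+    int foo_register(struct foo *foo)
+    {
+        int id = idr_alloc(&foo_idr, foo, 0, 0, GFP_KERNEL);
+
+        if (id < 0)
+            return id;
+        foo->id = id;
+        return 0;
+    }
+
+    struct foo *foo_lookup(int id)
+    {
+        return idr_find(&foo_idr, id);
+    }
+
+    void foo_unregister(struct foo *foo)
+    {
+        idr_remove(&foo_idr, foo->id);
+    }
+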
-You can call :c:func:`idr_alloc` to allocate an unused ID. Look up -the pointer you associated with the ID by calling :c:func:`idr_find` -and free the ID by calling :c:func:`idr_remove`. +You can call idr_alloc() to allocate an unused ID. Look up +the pointer you associated with the ID by calling idr_find() +and free the ID by calling idr_remove(). If you need to change the pointer associated with an ID, you can call -:c:func:`idr_replace`. One common reason to do this is to reserve an +idr_replace(). One common reason to do this is to reserve an ID by passing a ``NULL`` pointer to the allocation function; initialise the object with the reserved ID and finally insert the initialised object into the IDR. Some users need to allocate IDs larger than ``INT_MAX``. So far all of these users have been content with a ``UINT_MAX`` limit, and they use -:c:func:`idr_alloc_u32`. If you need IDs that will not fit in a u32, +idr_alloc_u32(). If you need IDs that will not fit in a u32, we will work with you to address your needs. If you need to allocate IDs sequentially, you can use -:c:func:`idr_alloc_cyclic`. The IDR becomes less efficient when dealing +idr_alloc_cyclic(). The IDR becomes less efficient when dealing with larger IDs, so using this function comes at a slight cost. To perform an action on all pointers used by the IDR, you can -either use the callback-based :c:func:`idr_for_each` or the -iterator-style :c:func:`idr_for_each_entry`. You may need to use -:c:func:`idr_for_each_entry_continue` to continue an iteration. You can -also use :c:func:`idr_get_next` if the iterator doesn't fit your needs. +either use the callback-based idr_for_each() or the +iterator-style idr_for_each_entry(). You may need to use +idr_for_each_entry_continue() to continue an iteration. You can +also use idr_get_next() if the iterator doesn't fit your needs. -When you have finished using an IDR, you can call :c:func:`idr_destroy` +When you have finished using an IDR, you can call idr_destroy() to release the memory used by the IDR. This will not free the objects pointed to from the IDR; if you want to do that, use one of the iterators to do it. -You can use :c:func:`idr_is_empty` to find out whether there are any +You can use idr_is_empty() to find out whether there are any IDs currently allocated. If you need to take a lock while allocating a new ID from the IDR, you may need to pass a restrictive set of GFP flags, which can lead to the IDR being unable to allocate memory. To work around this, -you can call :c:func:`idr_preload` before taking the lock, and then -:c:func:`idr_preload_end` after the allocation. +you can call idr_preload() before taking the lock, and then +idr_preload_end() after the allocation. .. kernel-doc:: include/linux/idr.h :doc: idr sync diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 15ab86112627..7a4ca18ca6e2 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -18,9 +18,12 @@ it. kernel-api workqueue + watch_queue printk-basics printk-formats + printk-index symbol-namespaces + asm-annotations Data structures and low-level utilities ======================================= @@ -32,31 +35,49 @@ Library functionality that is used throughout the kernel. 
kobject kref + cleanup assoc_array + folio_queue xarray + maple_tree idr circular-buffers rbtree generic-radix-tree packing + this_cpu_ops timekeeping errseq + wrappers/atomic_t + wrappers/atomic_bitops + floating-point + union_find + min_heap + parser + +Low level entry and exit +======================== + +.. toctree:: + :maxdepth: 1 + + entry Concurrency primitives ====================== How Linux keeps everything from happening at the same time. See -:doc:`/locking/index` for more related documentation. +Documentation/locking/index.rst for more related documentation. .. toctree:: :maxdepth: 1 - atomic_ops refcount-vs-atomic irq/index local_ops padata ../RCU/index + wrappers/memory-barriers.rst Low-level hardware management ============================= @@ -76,21 +97,25 @@ Memory management ================= How to allocate and use memory in the kernel. Note that there is a lot -more memory-management documentation in :doc:`/vm/index`. +more memory-management documentation in Documentation/mm/index.rst. .. toctree:: :maxdepth: 1 memory-allocation + unaligned-memory-access dma-api dma-api-howto dma-attributes dma-isa-lpc + swiotlb mm-api + cgroup genalloc pin_user_pages boot-time-mm gfp_mask-from-fs-io + kho/index Interfaces for kernel debugging =============================== @@ -111,6 +136,7 @@ Documents that don't fit elsewhere or which have yet to be categorized. :maxdepth: 1 librs + netlink .. only:: subproject and html diff --git a/Documentation/core-api/irq/concepts.rst b/Documentation/core-api/irq/concepts.rst index 4273806a606b..7c4564f3cbdf 100644 --- a/Documentation/core-api/irq/concepts.rst +++ b/Documentation/core-api/irq/concepts.rst @@ -2,23 +2,24 @@ What is an IRQ? =============== -An IRQ is an interrupt request from a device. -Currently they can come in over a pin, or over a packet. -Several devices may be connected to the same pin thus -sharing an IRQ. +An IRQ is an interrupt request from a device. Currently, they can come +in over a pin, or over a packet. Several devices may be connected to +the same pin thus sharing an IRQ. Such as on legacy PCI bus: All devices +typically share 4 lanes/pins. Note that each device can request an +interrupt on each of the lanes. An IRQ number is a kernel identifier used to talk about a hardware -interrupt source. Typically this is an index into the global irq_desc -array, but except for what linux/interrupt.h implements the details -are architecture specific. +interrupt source. Typically, this is an index into the global irq_desc +array or sparse_irqs tree. But except for what linux/interrupt.h +implements, the details are architecture specific. An IRQ number is an enumeration of the possible interrupt sources on a -machine. Typically what is enumerated is the number of input pins on -all of the interrupt controller in the system. In the case of ISA -what is enumerated are the 16 input pins on the two i8259 interrupt -controllers. +machine. Typically, what is enumerated is the number of input pins on +all of the interrupt controllers in the system. In the case of ISA, +what is enumerated are the 8 input pins on each of the two i8259 +interrupt controllers. Architectures can assign additional meaning to the IRQ numbers, and -are encouraged to in the case where there is any manual configuration -of the hardware involved. The ISA IRQs are a classic example of +are encouraged to in the case where there is any manual configuration +of the hardware involved. The ISA IRQs are a classic example of assigning this kind of additional meaning. 
diff --git a/Documentation/core-api/irq/irq-domain.rst b/Documentation/core-api/irq/irq-domain.rst index 096db12f32d5..a01c6ead1bc0 100644 --- a/Documentation/core-api/irq/irq-domain.rst +++ b/Documentation/core-api/irq/irq-domain.rst @@ -1,72 +1,102 @@ =============================================== -The irq_domain interrupt number mapping library +The irq_domain Interrupt Number Mapping Library =============================================== The current design of the Linux kernel uses a single large number -space where each separate IRQ source is assigned a different number. -This is simple when there is only one interrupt controller, but in -systems with multiple interrupt controllers the kernel must ensure +space where each separate IRQ source is assigned a unique number. +This is simple when there is only one interrupt controller. But in +systems with multiple interrupt controllers, the kernel must ensure that each one gets assigned non-overlapping allocations of Linux IRQ numbers. The number of interrupt controllers registered as unique irqchips -show a rising tendency: for example subdrivers of different kinds +shows a rising tendency. For example, subdrivers of different kinds such as GPIO controllers avoid reimplementing identical callback mechanisms as the IRQ core system by modelling their interrupt -handlers as irqchips, i.e. in effect cascading interrupt controllers. +handlers as irqchips. I.e. in effect cascading interrupt controllers. -Here the interrupt number loose all kind of correspondence to -hardware interrupt numbers: whereas in the past, IRQ numbers could -be chosen so they matched the hardware IRQ line into the root -interrupt controller (i.e. the component actually fireing the -interrupt line to the CPU) nowadays this number is just a number. +So in the past, IRQ numbers could be chosen so that they match the +hardware IRQ line into the root interrupt controller (i.e. the +component actually firing the interrupt line to the CPU). Nowadays, +this number is just a number and the number loose all kind of +correspondence to hardware interrupt numbers. -For this reason we need a mechanism to separate controller-local -interrupt numbers, called hardware irq's, from Linux IRQ numbers. +For this reason, we need a mechanism to separate controller-local +interrupt numbers, called hardware IRQs, from Linux IRQ numbers. The irq_alloc_desc*() and irq_free_desc*() APIs provide allocation of -irq numbers, but they don't provide any support for reverse mapping of +IRQ numbers, but they don't provide any support for reverse mapping of the controller-local IRQ (hwirq) number into the Linux IRQ number space. -The irq_domain library adds mapping between hwirq and IRQ numbers on -top of the irq_alloc_desc*() API. An irq_domain to manage mapping is -preferred over interrupt controller drivers open coding their own +The irq_domain library adds a mapping between hwirq and IRQ numbers on +top of the irq_alloc_desc*() API. An irq_domain to manage the mapping +is preferred over interrupt controller drivers open coding their own reverse mapping scheme. -irq_domain also implements translation from an abstract irq_fwspec -structure to hwirq numbers (Device Tree and ACPI GSI so far), and can -be easily extended to support other IRQ topology data sources. +irq_domain also implements a translation from an abstract struct +irq_fwspec to hwirq numbers (Device Tree, non-DT firmware node, ACPI +GSI, and software node so far), and can be easily extended to support +other IRQ topology data sources. 
The implementation is performed +without any extra platform support code. -irq_domain usage +irq_domain Usage ================ - -An interrupt controller driver creates and registers an irq_domain by -calling one of the irq_domain_add_*() functions (each mapping method -has a different allocator function, more on that later). The function -will return a pointer to the irq_domain on success. The caller must -provide the allocator function with an irq_domain_ops structure. +struct irq_domain could be defined as an irq domain controller. That +is, it handles the mapping between hardware and virtual interrupt +numbers for a given interrupt domain. The domain structure is +generally created by the PIC code for a given PIC instance (though a +domain can cover more than one PIC if they have a flat number model). +It is the domain callbacks that are responsible for setting the +irq_chip on a given irq_desc after it has been mapped. + +The host code and data structures use a fwnode_handle pointer to +identify the domain. In some cases, and in order to preserve source +code compatibility, this fwnode pointer is "upgraded" to a DT +device_node. For those firmware infrastructures that do not provide a +unique identifier for an interrupt controller, the irq_domain code +offers a fwnode allocator. + +An interrupt controller driver creates and registers a struct irq_domain +by calling one of the irq_domain_create_*() functions (each mapping +method has a different allocator function, more on that later). The +function will return a pointer to the struct irq_domain on success. The +caller must provide the allocator function with a struct irq_domain_ops +pointer. In most cases, the irq_domain will begin empty without any mappings between hwirq and IRQ numbers. Mappings are added to the irq_domain by calling irq_create_mapping() which accepts the irq_domain and a -hwirq number as arguments. If a mapping for the hwirq doesn't already -exist then it will allocate a new Linux irq_desc, associate it with -the hwirq, and call the .map() callback so the driver can perform any -required hardware setup. - -When an interrupt is received, irq_find_mapping() function should -be used to find the Linux IRQ number from the hwirq number. - -The irq_create_mapping() function must be called *atleast once* +hwirq number as arguments. If a mapping for the hwirq doesn't already +exist, irq_create_mapping() allocates a new Linux irq_desc, associates +it with the hwirq, and calls the :c:member:`irq_domain_ops.map()` +callback. In there, the driver can perform any required hardware +setup. + +Once a mapping has been established, it can be retrieved or used via a +variety of methods: + +- irq_resolve_mapping() returns a pointer to the irq_desc structure + for a given domain and hwirq number, and NULL if there was no + mapping. +- irq_find_mapping() returns a Linux IRQ number for a given domain and + hwirq number, and 0 if there was no mapping +- generic_handle_domain_irq() handles an interrupt described by a + domain and a hwirq number + +Note that irq domain lookups must happen in contexts that are +compatible with a RCU read-side critical section. + +The irq_create_mapping() function must be called *at least once* before any call to irq_find_mapping(), lest the descriptor will not be allocated. If the driver has the Linux IRQ number or the irq_data pointer, and needs to know the associated hwirq number (such as in the irq_chip -callbacks) then it can be directly obtained from irq_data->hwirq. 
+callbacks) then it can be directly obtained from +:c:member:`irq_data.hwirq`. -Types of irq_domain mappings +Types of irq_domain Mappings ============================ There are several mechanisms available for reverse mapping from hwirq @@ -79,7 +109,6 @@ Linear :: - irq_domain_add_linear() irq_domain_create_linear() The linear reverse map maintains a fixed size table indexed by the @@ -92,19 +121,13 @@ map are fixed time lookup for IRQ numbers, and irq_descs are only allocated for in-use IRQs. The disadvantage is that the table must be as large as the largest possible hwirq number. -irq_domain_add_linear() and irq_domain_create_linear() are functionally -equivalent, except for the first argument is different - the former -accepts an Open Firmware specific 'struct device_node', while the latter -accepts a more general abstraction 'struct fwnode_handle'. - -The majority of drivers should use the linear map. +The majority of drivers should use the Linear map. Tree ---- :: - irq_domain_add_tree() irq_domain_create_tree() The irq_domain maintains a radix tree map from hwirq numbers to Linux @@ -116,11 +139,6 @@ since it doesn't need to allocate a table as large as the largest hwirq number. The disadvantage is that hwirq to IRQ number lookup is dependent on how many entries are in the table. -irq_domain_add_tree() and irq_domain_create_tree() are functionally -equivalent, except for the first argument is different - the former -accepts an Open Firmware specific 'struct device_node', while the latter -accepts a more general abstraction 'struct fwnode_handle'. - Very few drivers should need this mapping. No Map @@ -128,7 +146,7 @@ No Map :: - irq_domain_add_nomap() + irq_domain_create_nomap() The No Map mapping is to be used when the hwirq number is programmable in the hardware. In this case it is best to program the @@ -137,16 +155,17 @@ required. Calling irq_create_direct_mapping() will allocate a Linux IRQ number and call the .map() callback so that driver can program the Linux IRQ number into the hardware. -Most drivers cannot use this mapping. +Most drivers cannot use this mapping, and it is now gated on the +CONFIG_IRQ_DOMAIN_NOMAP option. Please refrain from introducing new +users of this API. Legacy ------ :: - irq_domain_add_simple() - irq_domain_add_legacy() - irq_domain_add_legacy_isa() + irq_domain_create_simple() + irq_domain_create_legacy() The Legacy mapping is a special case for drivers that already have a range of irq_descs allocated for the hwirqs. It is used when the @@ -156,6 +175,11 @@ for IRQ numbers that are passed to struct device registrations. In that case the Linux IRQ numbers cannot be dynamically assigned and the legacy mapping should be used. +As the name implies, the \*_legacy() functions are deprecated and only +exist to ease the support of ancient platforms. No new users should be +added. Same goes for the \*_simple() functions when their use results +in the legacy behaviour. + The legacy map assumes a contiguous range of IRQ numbers has already been allocated for the controller and that the IRQ number can be calculated by adding a fixed offset to the hwirq number, and @@ -168,13 +192,13 @@ supported. For example, ISA controllers would use the legacy map for mapping Linux IRQs 0-15 so that existing ISA drivers get the correct IRQ numbers. -Most users of legacy mappings should use irq_domain_add_simple() which -will use a legacy domain only if an IRQ range is supplied by the -system and will otherwise use a linear domain mapping. 
The semantics -of this call are such that if an IRQ range is specified then -descriptors will be allocated on-the-fly for it, and if no range is -specified it will fall through to irq_domain_add_linear() which means -*no* irq descriptors will be allocated. +Most users of legacy mappings should use irq_domain_create_simple() +which will use a legacy domain only if an IRQ range is supplied by the +system and will otherwise use a linear domain mapping. The semantics of +this call are such that if an IRQ range is specified then descriptors +will be allocated on-the-fly for it, and if no range is specified it +will fall through to irq_domain_create_linear() which means *no* irq +descriptors will be allocated. A typical use case for simple domains is where an irqchip provider is supporting both dynamic and static IRQ assignments. @@ -185,7 +209,7 @@ that the driver using the simple domain call irq_create_mapping() before any irq_find_mapping() since the latter will actually work for the static IRQ assignment case. -Hierarchy IRQ domain +Hierarchy IRQ Domain -------------------- On some architectures, there may be multiple interrupt controllers @@ -226,20 +250,40 @@ There are four major interfaces to use hierarchy irq_domain: 4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware to stop delivering the interrupt. -Following changes are needed to support hierarchy irq_domain: +The following is needed to support hierarchy irq_domain: -1) a new field 'parent' is added to struct irq_domain; it's used to +1) The :c:member:`parent` field in struct irq_domain is used to maintain irq_domain hierarchy information. -2) a new field 'parent_data' is added to struct irq_data; it's used to - build hierarchy irq_data to match hierarchy irq_domains. The irq_data - is used to store irq_domain pointer and hardware irq number. -3) new callbacks are added to struct irq_domain_ops to support hierarchy - irq_domain operations. - -With support of hierarchy irq_domain and hierarchy irq_data ready, an -irq_domain structure is built for each interrupt controller, and an +2) The :c:member:`parent_data` field in struct irq_data is used to + build hierarchy irq_data to match hierarchy irq_domains. The + irq_data is used to store irq_domain pointer and hardware irq + number. +3) The :c:member:`alloc()`, :c:member:`free()`, and other callbacks in + struct irq_domain_ops to support hierarchy irq_domain operations. + +With the support of hierarchy irq_domain and hierarchy irq_data ready, +an irq_domain structure is built for each interrupt controller, and an irq_data structure is allocated for each irq_domain associated with an -IRQ. Now we could go one step further to support stacked(hierarchy) +IRQ. + +For an interrupt controller driver to support hierarchy irq_domain, it +needs to: + +1) Implement irq_domain_ops.alloc() and irq_domain_ops.free() +2) Optionally, implement irq_domain_ops.activate() and + irq_domain_ops.deactivate(). +3) Optionally, implement an irq_chip to manage the interrupt controller + hardware. +4) There is no need to implement irq_domain_ops.map() and + irq_domain_ops.unmap(). They are unused with hierarchy irq_domain. + +Note the hierarchy irq_domain is in no way x86-specific, and is +heavily used to support other architectures, such as ARM, ARM64 etc. + +Stacked irq_chip +~~~~~~~~~~~~~~~~ + +Now, we could go one step further to support stacked (hierarchy) irq_chip. That is, an irq_chip is associated with each irq_data along the hierarchy. 
A child irq_chip may implement a required action by itself or by cooperating with its parent irq_chip. @@ -249,22 +293,28 @@ with the hardware managed by itself and may ask for services from its parent irq_chip when needed. So we could achieve a much cleaner software architecture. -For an interrupt controller driver to support hierarchy irq_domain, it -needs to: - -1) Implement irq_domain_ops.alloc and irq_domain_ops.free -2) Optionally implement irq_domain_ops.activate and - irq_domain_ops.deactivate. -3) Optionally implement an irq_chip to manage the interrupt controller - hardware. -4) No need to implement irq_domain_ops.map and irq_domain_ops.unmap, - they are unused with hierarchy irq_domain. - -Hierarchy irq_domain is in no way x86 specific, and is heavily used to -support other architectures, such as ARM, ARM64 etc. - Debugging ========= Most of the internals of the IRQ subsystem are exposed in debugfs by turning CONFIG_GENERIC_IRQ_DEBUGFS on. + +Structures and Public Functions Provided +======================================== + +This chapter contains the autogenerated documentation of the structures +and exported kernel API functions which are used for IRQ domains. + +.. kernel-doc:: include/linux/irqdomain.h + +.. kernel-doc:: kernel/irq/irqdomain.c + :export: + +Internal Functions Provided +=========================== + +This chapter contains the autogenerated documentation of the internal +functions. + +.. kernel-doc:: kernel/irq/irqdomain.c + :internal: diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst index 4ac53a1363f6..ae92a2571388 100644 --- a/Documentation/core-api/kernel-api.rst +++ b/Documentation/core-api/kernel-api.rst @@ -24,11 +24,8 @@ String Conversions .. kernel-doc:: lib/vsprintf.c :export: -.. kernel-doc:: include/linux/kernel.h - :functions: kstrtol - -.. kernel-doc:: include/linux/kernel.h - :functions: kstrtoul +.. kernel-doc:: include/linux/kstrtox.h + :functions: kstrtol kstrtoul .. kernel-doc:: lib/kstrtox.c :export: @@ -39,6 +36,9 @@ String Conversions String Manipulation ------------------- +.. kernel-doc:: include/linux/fortify-string.h + :internal: + .. kernel-doc:: lib/string.c :export: @@ -96,6 +96,12 @@ Command-line Parsing .. kernel-doc:: lib/cmdline.c :export: +Error Pointers +-------------- + +.. kernel-doc:: include/linux/err.h + :internal: + Sorting ------- @@ -121,6 +127,12 @@ Text Searching CRC and Math Functions in Linux =============================== +Arithmetic Overflow Checking +---------------------------- + +.. kernel-doc:: include/linux/overflow.h + :internal: + CRC Functions ------------- @@ -150,8 +162,10 @@ Base 2 log and power Functions .. kernel-doc:: include/linux/log2.h :internal: -Integer power Functions ------------------------ +Integer log and power Functions +------------------------------- + +.. kernel-doc:: include/linux/int_log.h .. kernel-doc:: lib/math/int_pow.c :export: @@ -168,9 +182,6 @@ Division Functions .. kernel-doc:: include/linux/math64.h :internal: -.. kernel-doc:: lib/math/div64.c - :functions: div_s64_rem div64_u64_rem div64_u64 div64_s64 - .. kernel-doc:: lib/math/gcd.c :export: @@ -217,26 +228,38 @@ relay interface Module Support ============== -Module Loading --------------- +Kernel module auto-loading +-------------------------- -.. kernel-doc:: kernel/kmod.c +.. kernel-doc:: kernel/module/kmod.c :export: +Module debugging +---------------- + +.. 
kernel-doc:: kernel/module/stats.c + :doc: module debugging statistics overview + +dup_failed_modules - tracks duplicate failed modules +**************************************************** + +.. kernel-doc:: kernel/module/stats.c + :doc: dup_failed_modules - tracks duplicate failed modules + +module statistics debugfs counters +********************************** + +.. kernel-doc:: kernel/module/stats.c + :doc: module statistics debugfs counters + Inter Module support -------------------- -Refer to the file kernel/module.c for more information. +Refer to the files in kernel/module/ for more information. Hardware Interfaces =================== -Interrupt Handling ------------------- - -.. kernel-doc:: kernel/irq/manage.c - :export: - DMA Channels ------------ @@ -288,6 +311,7 @@ Accounting Framework Block Devices ============= +.. kernel-doc:: include/linux/bio.h .. kernel-doc:: block/blk-core.c :export: @@ -303,9 +327,6 @@ Block Devices .. kernel-doc:: block/blk-settings.c :export: -.. kernel-doc:: block/blk-exec.c - :export: - .. kernel-doc:: block/blk-flush.c :export: @@ -324,6 +345,9 @@ Block Devices .. kernel-doc:: block/genhd.c :export: +.. kernel-doc:: block/bdev.c + :export: + Char devices ============ @@ -396,3 +420,15 @@ Read-Copy Update (RCU) .. kernel-doc:: include/linux/rcu_sync.h .. kernel-doc:: kernel/rcu/sync.c + +.. kernel-doc:: kernel/rcu/tasks.h + +.. kernel-doc:: kernel/rcu/tree_stall.h + +.. kernel-doc:: include/linux/rcupdate_trace.h + +.. kernel-doc:: include/linux/rcupdate_wait.h + +.. kernel-doc:: include/linux/rcuref.h + +.. kernel-doc:: include/linux/rcutree.h diff --git a/Documentation/core-api/kho/bindings/kho.yaml b/Documentation/core-api/kho/bindings/kho.yaml new file mode 100644 index 000000000000..11e8ab7b219d --- /dev/null +++ b/Documentation/core-api/kho/bindings/kho.yaml @@ -0,0 +1,43 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Kexec HandOver (KHO) root tree + +maintainers: + - Mike Rapoport <rppt@kernel.org> + - Changyuan Lyu <changyuanl@google.com> + +description: | + System memory preserved by KHO across kexec. + +properties: + compatible: + enum: + - kho-v1 + + preserved-memory-map: + description: | + physical address (u64) of an in-memory structure describing all preserved + folios and memory ranges. + +patternProperties: + "$[0-9a-f_]+^": + $ref: sub-fdt.yaml# + description: physical address of a KHO user's own FDT. + +required: + - compatible + - preserved-memory-map + +additionalProperties: false + +examples: + - | + kho { + compatible = "kho-v1"; + preserved-memory-map = <0xf0be16 0x1000000>; + + memblock { + fdt = <0x80cc16 0x1000000>; + }; + }; diff --git a/Documentation/core-api/kho/bindings/memblock/memblock.yaml b/Documentation/core-api/kho/bindings/memblock/memblock.yaml new file mode 100644 index 000000000000..d388c28eb91d --- /dev/null +++ b/Documentation/core-api/kho/bindings/memblock/memblock.yaml @@ -0,0 +1,39 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Memblock reserved memory + +maintainers: + - Mike Rapoport <rppt@kernel.org> + +description: | + Memblock can serialize its current memory reservations created with + reserve_mem command line option across kexec through KHO. + The post-KHO kernel can then consume these reservations and they are + guaranteed to have the same physical address. 
+ +properties: + compatible: + enum: + - reserve-mem-v1 + +patternProperties: + "$[0-9a-f_]+^": + $ref: reserve-mem.yaml# + description: reserved memory regions + +required: + - compatible + +additionalProperties: false + +examples: + - | + memblock { + compatible = "memblock-v1"; + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; + }; diff --git a/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml b/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml new file mode 100644 index 000000000000..10282d3d1bcd --- /dev/null +++ b/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml @@ -0,0 +1,40 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Memblock reserved memory regions + +maintainers: + - Mike Rapoport <rppt@kernel.org> + +description: | + Memblock can serialize its current memory reservations created with + reserve_mem command line option across kexec through KHO. + This object describes each such region. + +properties: + compatible: + enum: + - reserve-mem-v1 + + start: + description: | + physical address (u64) of the reserved memory region. + + size: + description: | + size (u64) of the reserved memory region. + +required: + - compatible + - start + - size + +additionalProperties: false + +examples: + - | + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; diff --git a/Documentation/core-api/kho/bindings/sub-fdt.yaml b/Documentation/core-api/kho/bindings/sub-fdt.yaml new file mode 100644 index 000000000000..b9a3d2d24850 --- /dev/null +++ b/Documentation/core-api/kho/bindings/sub-fdt.yaml @@ -0,0 +1,27 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: KHO users' FDT address + +maintainers: + - Mike Rapoport <rppt@kernel.org> + - Changyuan Lyu <changyuanl@google.com> + +description: | + Physical address of an FDT blob registered by a KHO user. + +properties: + fdt: + description: | + physical address (u64) of an FDT blob. + +required: + - fdt + +additionalProperties: false + +examples: + - | + memblock { + fdt = <0x80cc16 0x1000000>; + }; diff --git a/Documentation/core-api/kho/concepts.rst b/Documentation/core-api/kho/concepts.rst new file mode 100644 index 000000000000..36d5c05cfb30 --- /dev/null +++ b/Documentation/core-api/kho/concepts.rst @@ -0,0 +1,74 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later +.. _kho-concepts: + +======================= +Kexec Handover Concepts +======================= + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +It introduces multiple concepts: + +KHO FDT +======= + +Every KHO kexec carries a KHO specific flattened device tree (FDT) blob +that describes preserved memory regions. These regions contain either +serialized subsystem states, or in-memory data that shall not be touched +across kexec. After KHO, subsystems can retrieve and restore preserved +memory regions from KHO FDT. + +KHO only uses the FDT container format and libfdt library, but does not +adhere to the same property semantics that normal device trees do: Properties +are passed in native endianness and standardized properties like ``regs`` and +``ranges`` do not exist, hence there are no ``#...-cells`` properties. + +KHO is still under development. The FDT schema is unstable and would change +in the future. 
+ +Scratch Regions +=============== + +To boot into kexec, we need to have a physically contiguous memory range that +contains no handed over memory. Kexec then places the target kernel and initrd +into that region. The new kernel exclusively uses this region for memory +allocations before during boot up to the initialization of the page allocator. + +We guarantee that we always have such regions through the scratch regions: On +first boot KHO allocates several physically contiguous memory regions. Since +after kexec these regions will be used by early memory allocations, there is a +scratch region per NUMA node plus a scratch region to satisfy allocations +requests that do not require particular NUMA node assignment. +By default, size of the scratch region is calculated based on amount of memory +allocated during boot. The ``kho_scratch`` kernel command line option may be +used to explicitly define size of the scratch regions. +The scratch regions are declared as CMA when page allocator is initialized so +that their memory can be used during system lifetime. CMA gives us the +guarantee that no handover pages land in that region, because handover pages +must be at a static physical memory location and CMA enforces that only +movable pages can be located inside. + +After KHO kexec, we ignore the ``kho_scratch`` kernel command line option and +instead reuse the exact same region that was originally allocated. This allows +us to recursively execute any amount of KHO kexecs. Because we used this region +for boot memory allocations and as target memory for kexec blobs, some parts +of that memory region may be reserved. These reservations are irrelevant for +the next KHO, because kexec can overwrite even the original kernel. + +.. _kho-finalization-phase: + +KHO finalization phase +====================== + +To enable user space based kexec file loader, the kernel needs to be able to +provide the FDT that describes the current kernel's state before +performing the actual kexec. The process of generating that FDT is +called serialization. When the FDT is generated, some properties +of the system may become immutable because they are already written down +in the FDT. That state is called the KHO finalization phase. + +Public API +========== +.. kernel-doc:: kernel/kexec_handover.c + :export: diff --git a/Documentation/core-api/kho/fdt.rst b/Documentation/core-api/kho/fdt.rst new file mode 100644 index 000000000000..62505285d60d --- /dev/null +++ b/Documentation/core-api/kho/fdt.rst @@ -0,0 +1,80 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +======= +KHO FDT +======= + +KHO uses the flattened device tree (FDT) container format and libfdt +library to create and parse the data that is passed between the +kernels. The properties in KHO FDT are stored in native format. +It includes the physical address of an in-memory structure describing +all preserved memory regions, as well as physical addresses of KHO users' +own FDTs. Interpreting those sub FDTs is the responsibility of KHO users. + +KHO nodes and properties +======================== + +Property ``preserved-memory-map`` +--------------------------------- + +KHO saves a special property named ``preserved-memory-map`` under the root node. +This node contains the physical address of an in-memory structure for KHO to +preserve memory regions across kexec. + +Property ``compatible`` +----------------------- + +The ``compatible`` property determines compatibility between the kernel +that created the KHO FDT and the kernel that attempts to load it. 
+If the kernel that loads the KHO FDT is not compatible with it, the entire +KHO process will be bypassed. + +Property ``fdt`` +---------------- + +Generally, a KHO user serialize its state into its own FDT and instructs +KHO to preserve the underlying memory, such that after kexec, the new kernel +can recover its state from the preserved FDT. + +A KHO user thus can create a node in KHO root tree and save the physical address +of its own FDT in that node's property ``fdt`` . + +Examples +======== + +The following example demonstrates KHO FDT that preserves two memory +regions created with ``reserve_mem`` kernel command line parameter:: + + /dts-v1/; + + / { + compatible = "kho-v1"; + + preserved-memory-map = <0x40be16 0x1000000>; + + memblock { + fdt = <0x1517 0x1000000>; + }; + }; + +where the ``memblock`` node contains an FDT that is requested by the +subsystem memblock for preservation. The FDT contains the following +serialized data:: + + /dts-v1/; + + / { + compatible = "memblock-v1"; + + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; + + n2 { + compatible = "reserve-mem-v1"; + start = <0xc067 0x4000000>; + size = <0x04 0x00>; + }; + }; diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst new file mode 100644 index 000000000000..0c63b0c5c143 --- /dev/null +++ b/Documentation/core-api/kho/index.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +======================== +Kexec Handover Subsystem +======================== + +.. toctree:: + :maxdepth: 1 + + concepts + fdt + +.. only:: subproject and html diff --git a/Documentation/core-api/kobject.rst b/Documentation/core-api/kobject.rst index e93dc8cf52dd..7310247310a0 100644 --- a/Documentation/core-api/kobject.rst +++ b/Documentation/core-api/kobject.rst @@ -6,7 +6,7 @@ Everything you never wanted to know about kobjects, ksets, and ktypes :Last updated: December 19, 2007 Based on an original article by Jon Corbet for lwn.net written October 1, -2003 and located at http://lwn.net/Articles/51437/ +2003 and located at https://lwn.net/Articles/51437/ Part of the difficulty in understanding the driver model - and the kobject abstraction upon which it is built - is that there is no obvious starting @@ -118,7 +118,7 @@ Initialization of kobjects Code which creates a kobject must, of course, initialize that object. Some of the internal fields are setup with a (mandatory) call to kobject_init():: - void kobject_init(struct kobject *kobj, struct kobj_type *ktype); + void kobject_init(struct kobject *kobj, const struct kobj_type *ktype); The ktype is required for a kobject to be created properly, as every kobject must have an associated kobj_type. 
After calling kobject_init(), to @@ -156,7 +156,7 @@ kobject_name():: There is a helper function to both initialize and add the kobject to the kernel at the same time, called surprisingly enough kobject_init_and_add():: - int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype, + int kobject_init_and_add(struct kobject *kobj, const struct kobj_type *ktype, struct kobject *parent, const char *fmt, ...); The arguments are the same as the individual kobject_init() and @@ -299,7 +299,6 @@ kobj_type:: struct kobj_type { void (*release)(struct kobject *kobj); const struct sysfs_ops *sysfs_ops; - struct attribute **default_attrs; const struct attribute_group **default_groups; const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); const void *(*namespace)(struct kobject *kobj); @@ -313,10 +312,10 @@ call kobject_init() or kobject_init_and_add(). The release field in struct kobj_type is, of course, a pointer to the release() method for this type of kobject. The other two fields (sysfs_ops -and default_attrs) control how objects of this type are represented in +and default_groups) control how objects of this type are represented in sysfs; they are beyond the scope of this document. -The default_attrs pointer is a list of default attributes that will be +The default_groups pointer is a list of default attributes that will be automatically created for any kobject that is registered with this ktype. @@ -373,10 +372,9 @@ If a kset wishes to control the uevent operations of the kobjects associated with it, it can use the struct kset_uevent_ops to handle it:: struct kset_uevent_ops { - int (* const filter)(struct kset *kset, struct kobject *kobj); - const char *(* const name)(struct kset *kset, struct kobject *kobj); - int (* const uevent)(struct kset *kset, struct kobject *kobj, - struct kobj_uevent_env *env); + int (* const filter)(struct kobject *kobj); + const char *(* const name)(struct kobject *kobj); + int (* const uevent)(struct kobject *kobj, struct kobj_uevent_env *env); }; diff --git a/Documentation/core-api/kref.rst b/Documentation/core-api/kref.rst index c61eea6f1bf2..8db9ff03d952 100644 --- a/Documentation/core-api/kref.rst +++ b/Documentation/core-api/kref.rst @@ -3,7 +3,7 @@ Adding reference counters (krefs) to kernel objects =================================================== :Author: Corey Minyard <minyard@acm.org> -:Author: Thomas Hellstrom <thellstrom@vmware.com> +:Author: Thomas Hellström <thomas.hellstrom@linux.intel.com> A lot of this was lifted from Greg Kroah-Hartman's 2004 OLS paper and presentation on krefs, which can be found at: @@ -321,3 +321,8 @@ rcu grace period after release_entry_rcu was called. That can be accomplished by using kfree_rcu(entry, rhead) as done above, or by calling synchronize_rcu() before using kfree, but note that synchronize_rcu() may sleep for a substantial amount of time. + +Functions and structures +======================== + +.. 
kernel-doc:: include/linux/kref.h diff --git a/Documentation/core-api/local_ops.rst b/Documentation/core-api/local_ops.rst index 2ac3f9f29845..0b42ceaaf3c4 100644 --- a/Documentation/core-api/local_ops.rst +++ b/Documentation/core-api/local_ops.rst @@ -191,7 +191,7 @@ Here is a sample module which implements a basic per cpu counter using static void __exit test_exit(void) { - del_timer_sync(&test_timer); + timer_shutdown_sync(&test_timer); } module_init(test_init); diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst new file mode 100644 index 000000000000..ccdd1615cf97 --- /dev/null +++ b/Documentation/core-api/maple_tree.rst @@ -0,0 +1,221 @@ +.. SPDX-License-Identifier: GPL-2.0+ + + +========== +Maple Tree +========== + +:Author: Liam R. Howlett + +Overview +======== + +The Maple Tree is a B-Tree data type which is optimized for storing +non-overlapping ranges, including ranges of size 1. The tree was designed to +be simple to use and does not require a user written search method. It +supports iterating over a range of entries and going to the previous or next +entry in a cache-efficient manner. The tree can also be put into an RCU-safe +mode of operation which allows reading and writing concurrently. Writers must +synchronize on a lock, which can be the default spinlock, or the user can set +the lock to an external lock of a different type. + +The Maple Tree maintains a small memory footprint and was designed to use +modern processor cache efficiently. The majority of the users will be able to +use the normal API. An :ref:`maple-tree-advanced-api` exists for more complex +scenarios. The most important usage of the Maple Tree is the tracking of the +virtual memory areas. + +The Maple Tree can store values between ``0`` and ``ULONG_MAX``. The Maple +Tree reserves values with the bottom two bits set to '10' which are below 4096 +(ie 2, 6, 10 .. 4094) for internal use. If the entries may use reserved +entries then the users can convert the entries using xa_mk_value() and convert +them back by calling xa_to_value(). If the user needs to use a reserved +value, then the user can convert the value when using the +:ref:`maple-tree-advanced-api`, but are blocked by the normal API. + +The Maple Tree can also be configured to support searching for a gap of a given +size (or larger). + +Pre-allocating of nodes is also supported using the +:ref:`maple-tree-advanced-api`. This is useful for users who must guarantee a +successful store operation within a given +code segment when allocating cannot be done. Allocations of nodes are +relatively small at around 256 bytes. + +.. _maple-tree-normal-api: + +Normal API +========== + +Start by initialising a maple tree, either with DEFINE_MTREE() for statically +allocated maple trees or mt_init() for dynamically allocated ones. A +freshly-initialised maple tree contains a ``NULL`` pointer for the range ``0`` +- ``ULONG_MAX``. There are currently two types of maple trees supported: the +allocation tree and the regular tree. The regular tree has a higher branching +factor for internal nodes. The allocation tree has a lower branching factor +but allows the user to search for a gap of a given size or larger from either +``0`` upwards or ``ULONG_MAX`` down. An allocation tree can be used by +passing in the ``MT_FLAGS_ALLOC_RANGE`` flag when initialising the tree. + +You can then set entries using mtree_store() or mtree_store_range(). 
+mtree_store() will overwrite any entry with the new entry and return 0 on +success or an error code otherwise. mtree_store_range() works in the same way +but takes a range. mtree_load() is used to retrieve the entry stored at a +given index. You can use mtree_erase() to erase an entire range by only +knowing one value within that range, or mtree_store() call with an entry of +NULL may be used to partially erase a range or many ranges at once. + +If you want to only store a new entry to a range (or index) if that range is +currently ``NULL``, you can use mtree_insert_range() or mtree_insert() which +return -EEXIST if the range is not empty. + +You can search for an entry from an index upwards by using mt_find(). + +You can walk each entry within a range by calling mt_for_each(). You must +provide a temporary variable to store a cursor. If you want to walk each +element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If +the caller is going to hold the lock for the duration of the walk then it is +worth looking at the mas_for_each() API in the :ref:`maple-tree-advanced-api` +section. + +Sometimes it is necessary to ensure the next call to store to a maple tree does +not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case. + +You can use mtree_dup() to duplicate an entire maple tree. It is a more +efficient way than inserting all elements one by one into a new tree. + +Finally, you can remove all entries from a maple tree by calling +mtree_destroy(). If the maple tree entries are pointers, you may wish to free +the entries first. + +Allocating Nodes +---------------- + +The allocations are handled by the internal tree code. See +:ref:`maple-tree-advanced-alloc` for other options. + +Locking +------- + +You do not have to worry about locking. See :ref:`maple-tree-advanced-locks` +for other options. + +The Maple Tree uses RCU and an internal spinlock to synchronise access: + +Takes RCU read lock: + * mtree_load() + * mt_find() + * mt_for_each() + * mt_next() + * mt_prev() + +Takes ma_lock internally: + * mtree_store() + * mtree_store_range() + * mtree_insert() + * mtree_insert_range() + * mtree_erase() + * mtree_dup() + * mtree_destroy() + * mt_set_in_rcu() + * mt_clear_in_rcu() + +If you want to take advantage of the internal lock to protect the data +structures that you are storing in the Maple Tree, you can call mtree_lock() +before calling mtree_load(), then take a reference count on the object you +have found before calling mtree_unlock(). This will prevent stores from +removing the object from the tree between looking up the object and +incrementing the refcount. You can also use RCU to avoid dereferencing +freed memory, but an explanation of that is beyond the scope of this +document. + +.. _maple-tree-advanced-api: + +Advanced API +============ + +The advanced API offers more flexibility and better performance at the +cost of an interface which can be harder to use and has fewer safeguards. +You must take care of your own locking while using the advanced API. +You can use the ma_lock, RCU or an external lock for protection. +You can mix advanced and normal operations on the same array, as long +as the locking is compatible. The :ref:`maple-tree-normal-api` is implemented +in terms of the advanced API. + +The advanced API is based around the ma_state, this is where the 'mas' +prefix originates. The ma_state struct keeps track of tree operations to make +life easier for both internal and external tree users. 
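Before the individual calls are described, the following hedged sketch shows
the general shape of an advanced-API user working under the tree's internal
spinlock. The tree, entry and range names are purely illustrative, and error
handling is minimal::

    #include <linux/maple_tree.h>

    static DEFINE_MTREE(example_tree);      /* illustrative name */

    static int example_store_and_walk(void *entry,
                                      unsigned long first, unsigned long last)
    {
            MA_STATE(mas, &example_tree, first, last);
            void *found;
            int ret;

            mtree_lock(&example_tree);      /* advanced API: caller locks */

            /* May temporarily drop the lock to allocate nodes on -ENOMEM. */
            ret = mas_store_gfp(&mas, entry, GFP_KERNEL);
            if (!ret) {
                    mas_set(&mas, 0);       /* rewind the state to index 0 */
                    mas_for_each(&mas, found, ULONG_MAX)
                            pr_info("range [%lu, %lu] -> %p\n",
                                    mas.index, mas.last, found);
            }

            mtree_unlock(&example_tree);
            return ret;
    }

mas_store_gfp() is used in this sketch rather than mas_store() so that node
allocations are retried with the supplied GFP flags and any failure comes back
to the caller as an errno.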
+
+Initialising the maple tree is the same as in the :ref:`maple-tree-normal-api`.
+Please see above.
+
+The maple state keeps track of the range start and end in mas->index and
+mas->last, respectively.
+
+mas_walk() will walk the tree to the location of mas->index and set the
+mas->index and mas->last according to the range for the entry.
+
+You can set entries using mas_store(). mas_store() will overwrite any entry
+with the new entry and return the first existing entry that is overwritten.
+The range is passed in as members of the maple state: index and last.
+
+You can use mas_erase() to erase an entire range by setting index and
+last of the maple state to the desired range to erase. This will erase
+the first range that is found in that range, set the maple state index
+and last as the range that was erased and return the entry that existed
+at that location.
+
+You can walk each entry within a range by using mas_for_each(). If you want
+to walk each element of the tree then ``0`` and ``ULONG_MAX`` may be used as
+the range. If the lock needs to be periodically dropped, see mas_pause() in
+the locking section.
+
+Using a maple state allows mas_next() and mas_prev() to function as if the
+tree were a linked list. With such a high branching factor the amortized
+performance penalty is outweighed by cache optimization. mas_next() will
+return the next entry which occurs after the entry at index. mas_prev()
+will return the previous entry which occurs before the entry at index.
+
+mas_find() will find the first entry which exists at or above index on
+the first call, and the next entry on every subsequent call.
+
+mas_find_rev() will find the first entry which exists at or below the last on
+the first call, and the previous entry on every subsequent call.
+
+If the user needs to yield the lock during an operation, then the maple state
+must be paused using mas_pause().
+
+There are a few extra interfaces provided when using an allocation tree.
+If you wish to search for a gap within a range, then mas_empty_area()
+or mas_empty_area_rev() can be used. mas_empty_area() searches for a gap
+starting at the lowest index given up to the maximum of the range.
+mas_empty_area_rev() searches for a gap starting at the highest index given
+and continues downward to the lower bound of the range.
+
+.. _maple-tree-advanced-alloc:
+
+Advanced Allocating Nodes
+-------------------------
+
+Allocations are usually handled internally to the tree; however, if
+allocations need to occur before a write then calling mas_expected_entries()
+will allocate the worst-case number of needed nodes to insert the provided
+number of ranges. This also causes the tree to enter mass insertion mode.
+Once insertions are complete, calling mas_destroy() on the maple state will
+free the unused allocations.
+
+.. _maple-tree-advanced-locks:
+
+Advanced Locking
+----------------
+
+The maple tree uses a spinlock by default, but external locks can be used for
+tree updates as well. To use an external lock, the tree must be initialized
+with the ``MT_FLAGS_LOCK_EXTERN`` flag; this is usually done with the
+MTREE_INIT_EXT() #define, which takes an external lock as an argument.
+
+Functions and structures
+========================
+
+.. kernel-doc:: include/linux/maple_tree.h
+..
kernel-doc:: lib/maple_tree.c diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst index 4aa82ddd01b8..0f19dd524323 100644 --- a/Documentation/core-api/memory-allocation.rst +++ b/Documentation/core-api/memory-allocation.rst @@ -45,8 +45,9 @@ here we briefly outline their recommended usage: * If the allocation is performed from an atomic context, e.g interrupt handler, use ``GFP_NOWAIT``. This flag prevents direct reclaim and IO or filesystem operations. Consequently, under memory pressure - ``GFP_NOWAIT`` allocation is likely to fail. Allocations which - have a reasonable fallback should be using ``GFP_NOWARN``. + ``GFP_NOWAIT`` allocation is likely to fail. Users of this flag need + to provide a suitable fallback to cope with such failures where + appropriate. * If you think that accessing memory reserves is justified and the kernel will be stressed unless allocation succeeds, you may use ``GFP_ATOMIC``. * Untrusted allocations triggered from userspace should be a subject @@ -84,6 +85,50 @@ driver for a device with such restrictions, avoid using these flags. And even with hardware with restrictions it is preferable to use `dma_alloc*` APIs. +GFP flags and reclaim behavior +------------------------------ +Memory allocations may trigger direct or background reclaim and it is +useful to understand how hard the page allocator will try to satisfy that +or another request. + + * ``GFP_KERNEL & ~__GFP_RECLAIM`` - optimistic allocation without _any_ + attempt to free memory at all. The most light weight mode which even + doesn't kick the background reclaim. Should be used carefully because it + might deplete the memory and the next user might hit the more aggressive + reclaim. + + * ``GFP_KERNEL & ~__GFP_DIRECT_RECLAIM`` (or ``GFP_NOWAIT``)- optimistic + allocation without any attempt to free memory from the current + context but can wake kswapd to reclaim memory if the zone is below + the low watermark. Can be used from either atomic contexts or when + the request is a performance optimization and there is another + fallback for a slow path. + + * ``(GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM`` (aka ``GFP_ATOMIC``) - + non sleeping allocation with an expensive fallback so it can access + some portion of memory reserves. Usually used from interrupt/bottom-half + context with an expensive slow path fallback. + + * ``GFP_KERNEL`` - both background and direct reclaim are allowed and the + **default** page allocator behavior is used. That means that not costly + allocation requests are basically no-fail but there is no guarantee of + that behavior so failures have to be checked properly by callers + (e.g. OOM killer victim is allowed to fail currently). + + * ``GFP_KERNEL | __GFP_NORETRY`` - overrides the default allocator behavior + and all allocation requests fail early rather than cause disruptive + reclaim (one round of reclaim in this implementation). The OOM killer + is not invoked. + + * ``GFP_KERNEL | __GFP_RETRY_MAYFAIL`` - overrides the default allocator + behavior and all allocation requests try really hard. The request + will fail if the reclaim cannot make any progress. The OOM killer + won't be triggered. + + * ``GFP_KERNEL | __GFP_NOFAIL`` - overrides the default allocator behavior + and all allocation requests will loop endlessly until they succeed. + This might be really dangerous especially for larger orders. 
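To make the trade-offs in the list above concrete, a common pattern is an
opportunistic ``GFP_NOWAIT`` attempt on a fast path with a sleeping
``GFP_KERNEL`` fallback. This is only an illustrative sketch: ``struct foo``
and the function name are made up, and the pattern only applies when the
caller is allowed to sleep in the fallback path::

    #include <linux/slab.h>

    static struct foo *foo_alloc(void)
    {
            struct foo *f;

            /*
             * Fast path: never blocks, at most wakes kswapd. Failure is
             * expected under memory pressure, so suppress the warning.
             */
            f = kzalloc(sizeof(*f), GFP_NOWAIT | __GFP_NOWARN);
            if (f)
                    return f;

            /* Slow path: may sleep and enter direct reclaim. */
            return kzalloc(sizeof(*f), GFP_KERNEL);
    }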
+ Selecting memory allocator ========================== @@ -100,8 +145,14 @@ configuration, but it is a good practice to use `kmalloc` for objects smaller than page size. The address of a chunk allocated with `kmalloc` is aligned to at least -ARCH_KMALLOC_MINALIGN bytes. For sizes which are a power of two, the -alignment is also guaranteed to be at least the respective size. +ARCH_KMALLOC_MINALIGN bytes. For sizes which are a power of two, the +alignment is also guaranteed to be at least the respective size. For other +sizes, the alignment is guaranteed to be at least the largest power-of-two +divisor of the size. + +Chunks allocated with kmalloc() can be resized with krealloc(). Similarly +to kmalloc_array(): a helper for resizing arrays is provided in the form of +krealloc_array(). For large allocations you can use vmalloc() and vzalloc(), or directly request pages from the page allocator. The memory allocated by `vmalloc` @@ -122,7 +173,16 @@ should be used if a part of the cache might be copied to the userspace. After the cache is created kmem_cache_alloc() and its convenience wrappers can allocate memory from that cache. -When the allocated memory is no longer needed it must be freed. You can -use kvfree() for the memory allocated with `kmalloc`, `vmalloc` and -`kvmalloc`. The slab caches should be freed with kmem_cache_free(). And -don't forget to destroy the cache with kmem_cache_destroy(). +When the allocated memory is no longer needed it must be freed. + +Objects allocated by `kmalloc` can be freed by `kfree` or `kvfree`. Objects +allocated by `kmem_cache_alloc` can be freed with `kmem_cache_free`, `kfree` +or `kvfree`, where the latter two might be more convenient thanks to not +needing the kmem_cache pointer. + +The same rules apply to _bulk and _rcu flavors of freeing functions. + +Memory allocated by `vmalloc` can be freed with `vfree` or `kvfree`. +Memory allocated by `kvmalloc` can be freed with `kvfree`. +Caches created by `kmem_cache_create` should be freed with +`kmem_cache_destroy` only after freeing all the allocated objects first. diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst index de7467e48067..682259ee633a 100644 --- a/Documentation/core-api/memory-hotplug.rst +++ b/Documentation/core-api/memory-hotplug.rst @@ -57,7 +57,6 @@ The third argument (arg) passes a pointer of struct memory_notify:: unsigned long start_pfn; unsigned long nr_pages; int status_change_nid_normal; - int status_change_nid_high; int status_change_nid; } @@ -65,8 +64,6 @@ The third argument (arg) passes a pointer of struct memory_notify:: - nr_pages is # of pages of online/offline memory. - status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask is (will be) set/clear, if this is -1, then nodemask status is not changed. -- status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask - is (will be) set/clear, if this is -1, then nodemask status is not changed. - status_change_nid is set node id when N_MEMORY of nodemask is (will be) set/clear. It means a new(memoryless) node gets new memory by online and a node loses all memory. If this is -1, then nodemask status is not changed. diff --git a/Documentation/core-api/min_heap.rst b/Documentation/core-api/min_heap.rst new file mode 100644 index 000000000000..9f57766581df --- /dev/null +++ b/Documentation/core-api/min_heap.rst @@ -0,0 +1,302 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +============ +Min Heap API +============ + +:Author: Kuan-Wei Chiu <visitorckw@gmail.com> + +Introduction +============ + +The Min Heap API provides a set of functions and macros for managing min-heaps +in the Linux kernel. A min-heap is a binary tree structure where the value of +each node is less than or equal to the values of its children, ensuring that +the smallest element is always at the root. + +This document provides a guide to the Min Heap API, detailing how to define and +use min-heaps. Users should not directly call functions with **__min_heap_*()** +prefixes, but should instead use the provided macro wrappers. + +In addition to the standard version of the functions, the API also includes a +set of inline versions for performance-critical scenarios. These inline +functions have the same names as their non-inline counterparts but include an +**_inline** suffix. For example, **__min_heap_init_inline** and its +corresponding macro wrapper **min_heap_init_inline**. The inline versions allow +custom comparison and swap functions to be called directly, rather than through +indirect function calls. This can significantly reduce overhead, especially +when CONFIG_MITIGATION_RETPOLINE is enabled, as indirect function calls become +more expensive. As with the non-inline versions, it is important to use the +macro wrappers for inline functions instead of directly calling the functions +themselves. + +Data Structures +=============== + +Min-Heap Definition +------------------- + +The core data structure for representing a min-heap is defined using the +**MIN_HEAP_PREALLOCATED** and **DEFINE_MIN_HEAP** macros. These macros allow +you to define a min-heap with a preallocated buffer or dynamically allocated +memory. + +Example: + +.. code-block:: c + + #define MIN_HEAP_PREALLOCATED(_type, _name, _nr) + struct _name { + size_t nr; /* Number of elements in the heap */ + size_t size; /* Maximum number of elements that can be held */ + _type *data; /* Pointer to the heap data */ + _type preallocated[_nr]; /* Static preallocated array */ + } + + #define DEFINE_MIN_HEAP(_type, _name) MIN_HEAP_PREALLOCATED(_type, _name, 0) + +A typical heap structure will include a counter for the number of elements +(`nr`), the maximum capacity of the heap (`size`), and a pointer to an array of +elements (`data`). Optionally, you can specify a static array for preallocated +heap storage using **MIN_HEAP_PREALLOCATED**. + +Min Heap Callbacks +------------------ + +The **struct min_heap_callbacks** provides customization options for ordering +elements in the heap and swapping them. It contains two function pointers: + +.. code-block:: c + + struct min_heap_callbacks { + bool (*less)(const void *lhs, const void *rhs, void *args); + void (*swp)(void *lhs, void *rhs, void *args); + }; + +- **less** is the comparison function used to establish the order of elements. +- **swp** is a function for swapping elements in the heap. If swp is set to + NULL, the default swap function will be used, which swaps the elements based on their size + +Macro Wrappers +============== + +The following macro wrappers are provided for interacting with the heap in a +user-friendly manner. Each macro corresponds to a function that operates on the +heap, and they abstract away direct calls to internal functions. + +Each macro accepts various parameters that are detailed below. + +Heap Initialization +-------------------- + +.. 
code-block:: c + + min_heap_init(heap, data, size); + +- **heap**: A pointer to the min-heap structure to be initialized. +- **data**: A pointer to the buffer where the heap elements will be stored. If + `NULL`, the preallocated buffer within the heap structure will be used. +- **size**: The maximum number of elements the heap can hold. + +This macro initializes the heap, setting its initial state. If `data` is +`NULL`, the preallocated memory inside the heap structure will be used for +storage. Otherwise, the user-provided buffer is used. The operation is **O(1)**. + +**Inline Version:** min_heap_init_inline(heap, data, size) + +Accessing the Top Element +------------------------- + +.. code-block:: c + + element = min_heap_peek(heap); + +- **heap**: A pointer to the min-heap from which to retrieve the smallest + element. + +This macro returns a pointer to the smallest element (the root) of the heap, or +`NULL` if the heap is empty. The operation is **O(1)**. + +**Inline Version:** min_heap_peek_inline(heap) + +Heap Insertion +-------------- + +.. code-block:: c + + success = min_heap_push(heap, element, callbacks, args); + +- **heap**: A pointer to the min-heap into which the element should be inserted. +- **element**: A pointer to the element to be inserted into the heap. +- **callbacks**: A pointer to a `struct min_heap_callbacks` providing the + `less` and `swp` functions. +- **args**: Optional arguments passed to the `less` and `swp` functions. + +This macro inserts an element into the heap. It returns `true` if the insertion +was successful and `false` if the heap is full. The operation is **O(log n)**. + +**Inline Version:** min_heap_push_inline(heap, element, callbacks, args) + +Heap Removal +------------ + +.. code-block:: c + + success = min_heap_pop(heap, callbacks, args); + +- **heap**: A pointer to the min-heap from which to remove the smallest element. +- **callbacks**: A pointer to a `struct min_heap_callbacks` providing the + `less` and `swp` functions. +- **args**: Optional arguments passed to the `less` and `swp` functions. + +This macro removes the smallest element (the root) from the heap. It returns +`true` if the element was successfully removed, or `false` if the heap is +empty. The operation is **O(log n)**. + +**Inline Version:** min_heap_pop_inline(heap, callbacks, args) + +Heap Maintenance +---------------- + +You can use the following macros to maintain the heap's structure: + +.. code-block:: c + + min_heap_sift_down(heap, pos, callbacks, args); + +- **heap**: A pointer to the min-heap. +- **pos**: The index from which to start sifting down. +- **callbacks**: A pointer to a `struct min_heap_callbacks` providing the + `less` and `swp` functions. +- **args**: Optional arguments passed to the `less` and `swp` functions. + +This macro restores the heap property by moving the element at the specified +index (`pos`) down the heap until it is in the correct position. The operation +is **O(log n)**. + +**Inline Version:** min_heap_sift_down_inline(heap, pos, callbacks, args) + +.. code-block:: c + + min_heap_sift_up(heap, idx, callbacks, args); + +- **heap**: A pointer to the min-heap. +- **idx**: The index of the element to sift up. +- **callbacks**: A pointer to a `struct min_heap_callbacks` providing the + `less` and `swp` functions. +- **args**: Optional arguments passed to the `less` and `swp` functions. + +This macro restores the heap property by moving the element at the specified +index (`idx`) up the heap. The operation is **O(log n)**. 
+ +**Inline Version:** min_heap_sift_up_inline(heap, idx, callbacks, args) + +.. code-block:: c + + min_heapify_all(heap, callbacks, args); + +- **heap**: A pointer to the min-heap. +- **callbacks**: A pointer to a `struct min_heap_callbacks` providing the + `less` and `swp` functions. +- **args**: Optional arguments passed to the `less` and `swp` functions. + +This macro ensures that the entire heap satisfies the heap property. It is +called when the heap is built from scratch or after many modifications. The +operation is **O(n)**. + +**Inline Version:** min_heapify_all_inline(heap, callbacks, args) + +Removing Specific Elements +-------------------------- + +.. code-block:: c + + success = min_heap_del(heap, idx, callbacks, args); + +- **heap**: A pointer to the min-heap. +- **idx**: The index of the element to delete. +- **callbacks**: A pointer to a `struct min_heap_callbacks` providing the + `less` and `swp` functions. +- **args**: Optional arguments passed to the `less` and `swp` functions. + +This macro removes an element at the specified index (`idx`) from the heap and +restores the heap property. The operation is **O(log n)**. + +**Inline Version:** min_heap_del_inline(heap, idx, callbacks, args) + +Other Utilities +=============== + +- **min_heap_full(heap)**: Checks whether the heap is full. + Complexity: **O(1)**. + +.. code-block:: c + + bool full = min_heap_full(heap); + +- `heap`: A pointer to the min-heap to check. + +This macro returns `true` if the heap is full, otherwise `false`. + +**Inline Version:** min_heap_full_inline(heap) + +- **min_heap_empty(heap)**: Checks whether the heap is empty. + Complexity: **O(1)**. + +.. code-block:: c + + bool empty = min_heap_empty(heap); + +- `heap`: A pointer to the min-heap to check. + +This macro returns `true` if the heap is empty, otherwise `false`. + +**Inline Version:** min_heap_empty_inline(heap) + +Example Usage +============= + +An example usage of the min-heap API would involve defining a heap structure, +initializing it, and inserting and removing elements as needed. + +.. 
code-block:: c + + #include <linux/min_heap.h> + + static bool my_less_function(const void *lhs, const void *rhs, void *args) { + return *(const int *)lhs < *(const int *)rhs; + } + + struct min_heap_callbacks heap_cb = { + .less = my_less_function, /* Comparison function for heap order */ + .swp = NULL, /* Use default swap function */ + }; + + void example_usage(void) { + /* Pre-populate the buffer with elements */ + int buffer[5] = {5, 2, 8, 1, 3}; + /* Declare a min-heap type and an instance of it */ + DEFINE_MIN_HEAP(int, my_heap); + struct my_heap heap; + + /* Initialize the heap with the external buffer and its size */ + min_heap_init(&heap, buffer, 5); + + /* Build the heap using min_heapify_all */ + heap.nr = 5; /* Set the number of elements in the heap */ + min_heapify_all(&heap, &heap_cb, NULL); + + /* Peek at the top element (should be 1 in this case) */ + int *top = min_heap_peek(&heap); + pr_info("Top element: %d\n", *top); + + /* Pop the top element (1) and get the new top (2) */ + min_heap_pop(&heap, &heap_cb, NULL); + top = min_heap_peek(&heap); + pr_info("New top element: %d\n", *top); + + /* Insert a new element (0) and recheck the top */ + int new_element = 0; + min_heap_push(&heap, &new_element, &heap_cb, NULL); + top = min_heap_peek(&heap); + pr_info("Top element after insertion: %d\n", *top); + } diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst index 2adffb3f7914..af8151db88b2 100644 --- a/Documentation/core-api/mm-api.rst +++ b/Documentation/core-api/mm-api.rst @@ -19,22 +19,16 @@ User Space Memory Access Memory Allocation Controls ========================== -Functions which need to allocate memory often use GFP flags to express -how that memory should be allocated. The GFP acronym stands for "get -free pages", the underlying memory allocation function. Not every GFP -flag is allowed to every function which may allocate memory. Most -users will want to use a plain ``GFP_KERNEL``. - -.. kernel-doc:: include/linux/gfp.h +.. kernel-doc:: include/linux/gfp_types.h :doc: Page mobility and placement hints -.. kernel-doc:: include/linux/gfp.h +.. kernel-doc:: include/linux/gfp_types.h :doc: Watermark modifiers -.. kernel-doc:: include/linux/gfp.h +.. kernel-doc:: include/linux/gfp_types.h :doc: Reclaim modifiers -.. kernel-doc:: include/linux/gfp.h +.. kernel-doc:: include/linux/gfp_types.h :doc: Useful GFP flag combinations The Slab Cache @@ -43,7 +37,7 @@ The Slab Cache .. kernel-doc:: include/linux/slab.h :internal: -.. kernel-doc:: mm/slab.c +.. kernel-doc:: mm/slub.c :export: .. kernel-doc:: mm/slab_common.c @@ -61,15 +55,30 @@ Virtually Contiguous Mappings File Mapping and Page Cache =========================== -.. kernel-doc:: mm/readahead.c - :export: +Filemap +------- .. kernel-doc:: mm/filemap.c :export: +Readahead +--------- + +.. kernel-doc:: mm/readahead.c + :doc: Readahead Overview + +.. kernel-doc:: mm/readahead.c + :export: + +Writeback +--------- + .. kernel-doc:: mm/page-writeback.c :export: +Truncate +-------- + .. kernel-doc:: mm/truncate.c :export: @@ -95,3 +104,39 @@ More Memory Management Functions :export: .. kernel-doc:: mm/page_alloc.c +.. kernel-doc:: mm/mempolicy.c +.. kernel-doc:: include/linux/mm_types.h + :internal: +.. kernel-doc:: include/linux/mm_inline.h +.. kernel-doc:: include/linux/page-flags.h +.. kernel-doc:: include/linux/mm.h + :internal: +.. kernel-doc:: include/linux/page_ref.h +.. kernel-doc:: include/linux/mmzone.h +.. kernel-doc:: mm/util.c + :functions: folio_mapping + +.. kernel-doc:: mm/rmap.c +.. kernel-doc:: mm/migrate.c +..
kernel-doc:: mm/mmap.c +.. kernel-doc:: mm/kmemleak.c +.. #kernel-doc:: mm/hmm.c (build warnings) +.. kernel-doc:: mm/memremap.c +.. kernel-doc:: mm/hugetlb.c +.. kernel-doc:: mm/swap.c +.. kernel-doc:: mm/zpool.c +.. kernel-doc:: mm/memcontrol.c +.. #kernel-doc:: mm/memory-tiers.c (build warnings) +.. kernel-doc:: mm/shmem.c +.. kernel-doc:: mm/migrate_device.c +.. #kernel-doc:: mm/nommu.c (duplicates kernel-doc from other files) +.. kernel-doc:: mm/mapping_dirty_helpers.c +.. #kernel-doc:: mm/memory-failure.c (build warnings) +.. kernel-doc:: mm/percpu.c +.. kernel-doc:: mm/maccess.c +.. kernel-doc:: mm/vmscan.c +.. kernel-doc:: mm/memory_hotplug.c +.. kernel-doc:: mm/mmu_notifier.c +.. kernel-doc:: mm/balloon_compaction.c +.. kernel-doc:: mm/huge_memory.c +.. kernel-doc:: mm/io-mapping.c diff --git a/Documentation/core-api/netlink.rst b/Documentation/core-api/netlink.rst new file mode 100644 index 000000000000..9f692b02bfe6 --- /dev/null +++ b/Documentation/core-api/netlink.rst @@ -0,0 +1,102 @@ +.. SPDX-License-Identifier: BSD-3-Clause + +.. _kernel_netlink: + +=================================== +Netlink notes for kernel developers +=================================== + +General guidance +================ + +Attribute enums +--------------- + +Older families often define "null" attributes and commands with value +of ``0`` and named ``unspec``. This is supported (``type: unused``) +but should be avoided in new families. The ``unspec`` enum values are +not used in practice, so just set the value of the first attribute to ``1``. + +Message enums +------------- + +Use the same command IDs for requests and replies. This makes it easier +to match them up, and we have plenty of ID space. + +Use separate command IDs for notifications. This makes it easier to +sort the notifications from replies (and present them to the user +application via a different API than replies). + +Answer requests +--------------- + +Older families do not reply to all of the commands, especially NEW / ADD +commands. User only gets information whether the operation succeeded or +not via the ACK. Try to find useful data to return. Once the command is +added whether it replies with a full message or only an ACK is uAPI and +cannot be changed. It's better to err on the side of replying. + +Specifically NEW and ADD commands should reply with information identifying +the created object such as the allocated object's ID (without having to +resort to using ``NLM_F_ECHO``). + +NLM_F_ECHO +---------- + +Make sure to pass the request info to genl_notify() to allow ``NLM_F_ECHO`` +to take effect. This is useful for programs that need precise feedback +from the kernel (for example for logging purposes). + +Support dump consistency +------------------------ + +If iterating over objects during dump may skip over objects or repeat +them - make sure to report dump inconsistency with ``NLM_F_DUMP_INTR``. +This is usually implemented by maintaining a generation id for the +structure and recording it in the ``seq`` member of struct netlink_callback. + +Netlink specification +===================== + +Documentation of the Netlink specification parts which are only relevant +to the kernel space. + +Globals +------- + +kernel-policy +~~~~~~~~~~~~~ + +Defines whether the kernel validation policy is ``global`` i.e. the same for all +operations of the family, defined for each operation individually - ``per-op``, +or separately for each operation and operation type (do vs dump) - ``split``. 
+New families should use ``per-op`` (default) to be able to narrow down the +attributes accepted by a specific command. + +checks +------ + +Documentation for the ``checks`` sub-sections of attribute specs. + +unterminated-ok +~~~~~~~~~~~~~~~ + +Accept strings without the null-termination (for legacy families only). +Switches from the ``NLA_NUL_STRING`` to ``NLA_STRING`` policy type. + +max-len +~~~~~~~ + +Defines max length for a binary or string attribute (corresponding +to the ``len`` member of struct nla_policy). For string attributes terminating +null character is not counted towards ``max-len``. + +The field may either be a literal integer value or a name of a defined +constant. String types may reduce the constant by one +(i.e. specify ``max-len: CONST - 1``) to reserve space for the terminating +character so implementations should recognize such pattern. + +min-len +~~~~~~~ + +Similar to ``max-len`` but defines minimum length. diff --git a/Documentation/core-api/packing.rst b/Documentation/core-api/packing.rst index d8c341fe383e..0ce2078c8e13 100644 --- a/Documentation/core-api/packing.rst +++ b/Documentation/core-api/packing.rst @@ -151,16 +151,195 @@ the more significant 4-byte word. We always think of our offsets as if there were no quirk, and we translate them afterwards, before accessing the memory region. +Note on buffer lengths not multiple of 4 +---------------------------------------- + +To deal with memory layout quirks where groups of 4 bytes are laid out "little +endian" relative to each other, but "big endian" within the group itself, the +concept of groups of 4 bytes is intrinsic to the packing API (not to be +confused with the memory access, which is performed byte by byte, though). + +With buffer lengths not multiple of 4, this means one group will be incomplete. +Depending on the quirks, this may lead to discontinuities in the bit fields +accessible through the buffer. The packing API assumes discontinuities were not +the intention of the memory layout, so it avoids them by effectively logically +shortening the most significant group of 4 octets to the number of octets +actually available. + +Example with a 31 byte sized buffer given below. Physical buffer offsets are +implicit, and increase from left to right within a group, and from top to +bottom within a column. + +No quirks: + +:: + + 31 29 28 | Group 7 (most significant) + 27 26 25 24 | Group 6 + 23 22 21 20 | Group 5 + 19 18 17 16 | Group 4 + 15 14 13 12 | Group 3 + 11 10 9 8 | Group 2 + 7 6 5 4 | Group 1 + 3 2 1 0 | Group 0 (least significant) + +QUIRK_LSW32_IS_FIRST: + +:: + + 3 2 1 0 | Group 0 (least significant) + 7 6 5 4 | Group 1 + 11 10 9 8 | Group 2 + 15 14 13 12 | Group 3 + 19 18 17 16 | Group 4 + 23 22 21 20 | Group 5 + 27 26 25 24 | Group 6 + 30 29 28 | Group 7 (most significant) + +QUIRK_LITTLE_ENDIAN: + +:: + + 30 28 29 | Group 7 (most significant) + 24 25 26 27 | Group 6 + 20 21 22 23 | Group 5 + 16 17 18 19 | Group 4 + 12 13 14 15 | Group 3 + 8 9 10 11 | Group 2 + 4 5 6 7 | Group 1 + 0 1 2 3 | Group 0 (least significant) + +QUIRK_LITTLE_ENDIAN | QUIRK_LSW32_IS_FIRST: + +:: + + 0 1 2 3 | Group 0 (least significant) + 4 5 6 7 | Group 1 + 8 9 10 11 | Group 2 + 12 13 14 15 | Group 3 + 16 17 18 19 | Group 4 + 20 21 22 23 | Group 5 + 24 25 26 27 | Group 6 + 28 29 30 | Group 7 (most significant) + Intended use ------------ Drivers that opt to use this API first need to identify which of the above 3 quirk combinations (for a total of 8) match what the hardware documentation -describes. 
Then they should wrap the packing() function, creating a new -xxx_packing() that calls it using the proper QUIRK_* one-hot bits set. +describes. + +There are 3 supported usage patterns, detailed below. + +packing() +^^^^^^^^^ + +This API function is deprecated. The packing() function returns an int-encoded error code, which protects the programmer against incorrect API use. The errors are not expected to occur -durring runtime, therefore it is reasonable for xxx_packing() to return void -and simply swallow those errors. Optionally it can dump stack or print the -error description. +during runtime, therefore it is reasonable to wrap packing() into a custom +function which returns void and swallows those errors. Optionally it can +dump stack or print the error description. + +.. code-block:: c + + void my_packing(void *buf, u64 *val, int startbit, int endbit, + size_t len, enum packing_op op) + { + int err; + + /* Adjust quirks accordingly */ + err = packing(buf, val, startbit, endbit, len, op, QUIRK_LSW32_IS_FIRST); + if (likely(!err)) + return; + + if (err == -EINVAL) { + pr_err("Start bit (%d) expected to be larger than end (%d)\n", + startbit, endbit); + } else if (err == -ERANGE) { + if ((startbit - endbit + 1) > 64) + pr_err("Field %d-%d too large for 64 bits!\n", + startbit, endbit); + else + pr_err("Cannot store %llx inside bits %d-%d (would truncate)\n", + *val, startbit, endbit); + } + dump_stack(); + } + +pack() and unpack() +^^^^^^^^^^^^^^^^^^^ + +These are const-correct variants of packing(), and eliminate the last "enum +packing_op op" argument. + +Calling pack(...) is equivalent, and preferred, to calling packing(..., PACK). + +Calling unpack(...) is equivalent, and preferred, to calling packing(..., UNPACK). + +pack_fields() and unpack_fields() +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The library exposes optimized functions for the scenario where there are many +fields represented in a buffer, and it encourages consumer drivers to avoid +repetitive calls to pack() and unpack() for each field, but instead use +pack_fields() and unpack_fields(), which reduces the code footprint. + +These APIs use field definitions in arrays of ``struct packed_field_u8`` or +``struct packed_field_u16``, allowing consumer drivers to minimize the size +of these arrays according to their custom requirements. + +The pack_fields() and unpack_fields() API functions are actually macros which +automatically select the appropriate function at compile time, based on the +type of the fields array passed in. + +An additional benefit over pack() and unpack() is that sanity checks on the +field definitions are handled at compile time with ``BUILD_BUG_ON`` rather +than only when the offending code is executed. These functions return void and +wrapping them to handle unexpected errors is not necessary. + +It is recommended, but not required, that you wrap your packed buffer into a +structured type with a fixed size. This generally makes it easier for the +compiler to enforce that the correct size buffer is used. + +Here is an example of how to use the fields APIs: + +.. code-block:: c + + /* Ordering inside the unpacked structure is flexible and can be different + * from the packed buffer. Here, it is optimized to reduce padding. 
+ */ + struct data { + u64 field3; + u32 field4; + u16 field1; + u8 field2; + }; + + #define SIZE 13 + + typedef struct __packed { u8 buf[SIZE]; } packed_buf_t; + + static const struct packed_field_u8 fields[] = { + PACKED_FIELD(100, 90, struct data, field1), + PACKED_FIELD(90, 87, struct data, field2), + PACKED_FIELD(86, 30, struct data, field3), + PACKED_FIELD(29, 0, struct data, field4), + }; + + void unpack_your_data(const packed_buf_t *buf, struct data *unpacked) + { + BUILD_BUG_ON(sizeof(*buf) != SIZE); + + unpack_fields(buf, sizeof(*buf), unpacked, fields, + QUIRK_LITTLE_ENDIAN); + } + + void pack_your_data(const struct data *unpacked, packed_buf_t *buf) + { + BUILD_BUG_ON(sizeof(*buf) != SIZE); + + pack_fields(buf, sizeof(*buf), unpacked, fields, + QUIRK_LITTLE_ENDIAN); + } diff --git a/Documentation/core-api/padata.rst b/Documentation/core-api/padata.rst index 0830e5b0e821..05b73c6c105f 100644 --- a/Documentation/core-api/padata.rst +++ b/Documentation/core-api/padata.rst @@ -27,22 +27,11 @@ padata_instance structure for overall control of how jobs are to be run:: #include <linux/padata.h> - struct padata_instance *padata_alloc_possible(const char *name); + struct padata_instance *padata_alloc(const char *name); 'name' simply identifies the instance. -There are functions for enabling and disabling the instance:: - - int padata_start(struct padata_instance *pinst); - void padata_stop(struct padata_instance *pinst); - -These functions are setting or clearing the "PADATA_INIT" flag; if that flag is -not set, other functions will refuse to work. padata_start() returns zero on -success (flag set) or -EINVAL if the padata cpumask contains no active CPU -(flag not set). padata_stop() clears the flag and blocks until the padata -instance is unused. - -Finally, complete padata initialization by allocating a padata_shell:: +Then, complete padata initialization by allocating a padata_shell:: struct padata_shell *padata_alloc_shell(struct padata_instance *pinst); @@ -53,7 +42,7 @@ padata_shells associated with it, each allowing a separate series of jobs. Modifying cpumasks ------------------ -The CPUs used to run jobs can be changed in two ways, programatically with +The CPUs used to run jobs can be changed in two ways, programmatically with padata_set_cpumask() or via sysfs. The former is defined:: int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type, @@ -155,11 +144,10 @@ submitted. Destroying ---------- -Cleaning up a padata instance predictably involves calling the three free +Cleaning up a padata instance predictably involves calling the two free functions that correspond to the allocation in reverse:: void padata_free_shell(struct padata_shell *ps); - void padata_stop(struct padata_instance *pinst); void padata_free(struct padata_instance *pinst); It is the user's responsibility to ensure all outstanding jobs are complete diff --git a/Documentation/core-api/parser.rst b/Documentation/core-api/parser.rst new file mode 100644 index 000000000000..45750d04b895 --- /dev/null +++ b/Documentation/core-api/parser.rst @@ -0,0 +1,17 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +============== +Generic parser +============== + +Overview +======== + +The generic parser is a simple parser for parsing mount options, +filesystem options, driver options, subsystem options, etc. + +Parser API +========== + +..
kernel-doc:: lib/parser.c + :export: diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 6068266dd303..c16ca163b55e 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -33,7 +33,7 @@ all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so that's a natural dividing line, and a good point to make separate wrapper calls. In other words, use pin_user_pages*() for DMA-pinned pages, and -get_user_pages*() for other cases. There are four cases described later on in +get_user_pages*() for other cases. There are five cases described later on in this document, to further clarify that concept. FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However, @@ -55,18 +55,17 @@ flags the caller provides. The caller is required to pass in a non-null struct pages* array, and the function then pins pages by incrementing each by a special value: GUP_PIN_COUNTING_BIAS. -For huge pages (and in fact, any compound page of more than 2 pages), the -GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting -is achieved, by using the 3rd struct page in the compound page. A new struct -page field, hpage_pinned_refcount, has been added in order to support this. +For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead, +the extra space available in the struct folio is used to store the +pincount directly. -This approach for compound pages avoids the counting upper limit problems that -are discussed below. Those limitations would have been aggravated severely by -huge pages, because each tail page adds a refcount to the head page. And in -fact, testing revealed that, without a separate hpage_pinned_refcount field, -page overflows were seen in some huge page stress tests. +This approach for large folios avoids the counting upper limit problems +that are discussed below. Those limitations would have been aggravated +severely by huge pages, because each tail page adds a refcount to the +head page. And in fact, testing revealed that, without a separate pincount +field, refcount overflows were seen in some huge page stress tests. -This also means that huge pages and compound pages (of order > 1) do not suffer +This also means that huge pages and large folios do not suffer from the false positives problem that is mentioned below.:: Function @@ -113,6 +112,12 @@ pages: This also leads to limitations: there are only 31-10==21 bits available for a counter that increments 10 bits at a time. +* Because of that limitation, special handling is applied to the zero pages + when using FOLL_PIN. We only pretend to pin a zero page - we don't alter its + refcount or pincount at all (it is permanent, so there's no need). The + unpinning functions also don't do anything to a zero page. This is + transparent to the caller. + * Callers must specifically request "dma-pinned tracking of pages". In other words, just calling get_user_pages() will not suffice; a new set of functions, pin_user_page() and related, must be used. @@ -127,7 +132,7 @@ CASE 1: Direct IO (DIO) ----------------------- There are GUP references to pages that are serving as DIO buffers. These buffers are needed for a relatively short time (so they -are not "long term"). No special synchronization with page_mkclean() or +are not "long term"). No special synchronization with folio_mkclean() or munmap() is provided. 
Therefore, flags to set at the call site are: :: FOLL_PIN @@ -139,7 +144,7 @@ CASE 2: RDMA ------------ There are GUP references to pages that are serving as DMA buffers. These buffers are needed for a long time ("long term"). No special -synchronization with page_mkclean() or munmap() is provided. Therefore, flags +synchronization with folio_mkclean() or munmap() is provided. Therefore, flags to set at the call site are: :: FOLL_PIN | FOLL_LONGTERM @@ -148,6 +153,8 @@ NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's because DAX pages do not have a separate page cache, and so "pinning" implies locking down file system blocks, which is not (yet) supported in that way. +.. _mmu-notifier-registration-case: + CASE 3: MMU notifier registration, with or without page faulting hardware ------------------------------------------------------------------------- Device drivers can pin pages via get_user_pages*(), and register for mmu @@ -163,7 +170,7 @@ callback, simply remove the range from the device's page tables. Either way, as long as the driver unpins the pages upon mmu notifier callback, then there is proper synchronization with both filesystem and mm -(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set. +(folio_mkclean(), munmap(), etc). Therefore, neither flag needs to be set. CASE 4: Pinning for struct page manipulation only ------------------------------------------------- @@ -189,20 +196,20 @@ INCORRECT (uses FOLL_GET calls): write to the data within the pages put_page() -page_maybe_dma_pinned(): the whole point of pinning -=================================================== +folio_maybe_dma_pinned(): the whole point of pinning +==================================================== -The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able -to query, "is this page DMA-pinned?" That allows code such as page_mkclean() +The whole point of marking folios as "DMA-pinned" or "gup-pinned" is to be able +to query, "is this folio DMA-pinned?" That allows code such as folio_mkclean() (and file system writeback code in general) to make informed decisions about -what to do when a page cannot be unmapped due to such pins. +what to do when a folio cannot be unmapped due to such pins. What to do in those cases is the subject of a years-long series of discussions and debates (see the References at the end of this document). It's a TODO item here: fill in the details once that's worked out. Meanwhile, it's safe to say that having this available: :: - static inline bool page_maybe_dma_pinned(struct page *page) + static inline bool folio_maybe_dma_pinned(struct folio *folio) ...is a prerequisite to solving the long-running gup+DMA problem. @@ -221,12 +228,12 @@ Unit testing ============ This file:: - tools/testing/selftests/vm/gup_benchmark.c + tools/testing/selftests/mm/gup_test.c has the following new calls to exercise the new pin*() wrapper functions: -* PIN_FAST_BENCHMARK (./gup_benchmark -a) -* PIN_BENCHMARK (./gup_benchmark -b) +* PIN_FAST_BENCHMARK (./gup_test -a) +* PIN_BASIC_TEST (./gup_test -b) You can monitor how many total dma-pinned pages have been acquired and released since the system was booted, via two new /proc/vmstat entries: :: @@ -264,9 +271,9 @@ place.) Other diagnostics ================= -dump_page() has been enhanced slightly, to handle these new counting fields, and -to better report on compound pages in general. 
Specifically, for compound pages -with order > 1, the exact (hpage_pinned_refcount) pincount is reported. +dump_page() has been enhanced slightly to handle these new counting +fields, and to better report on large folios in general. Specifically, +for large folios, the exact pincount is reported. References ========== diff --git a/Documentation/core-api/printk-basics.rst b/Documentation/core-api/printk-basics.rst index 563a9ce5fe1d..2dde24ca7d9f 100644 --- a/Documentation/core-api/printk-basics.rst +++ b/Documentation/core-api/printk-basics.rst @@ -69,7 +69,7 @@ You can check the current *console_loglevel* with:: The result shows the *current*, *default*, *minimum* and *boot-time-default* log levels. -To change the current console_loglevel simply write the the desired level to +To change the current console_loglevel simply write the desired level to ``/proc/sys/kernel/printk``. For example, to print all messages to the console:: # echo 8 > /proc/sys/kernel/printk @@ -107,9 +107,6 @@ also ``CONFIG_DYNAMIC_DEBUG`` in the case of pr_debug()) is defined. Function reference ================== -.. kernel-doc:: kernel/printk/printk.c - :functions: printk - .. kernel-doc:: include/linux/printk.h - :functions: pr_emerg pr_alert pr_crit pr_err pr_warn pr_notice pr_info + :functions: printk pr_emerg pr_alert pr_crit pr_err pr_warn pr_notice pr_info pr_fmt pr_debug pr_devel pr_cont diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index 8c9aba262b1e..4b7f3646ec6c 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -15,9 +15,10 @@ Integer types If variable is of Type, use printk format specifier: ------------------------------------------------------------ - char %d or %x + signed char %d or %hhx unsigned char %u or %x - short int %d or %x + char %u or %x + short int %d or %hx unsigned short int %u or %x int %d or %x unsigned int %u or %x @@ -27,9 +28,9 @@ Integer types unsigned long long %llu or %llx size_t %zu or %zx ssize_t %zd or %zx - s8 %d or %x + s8 %d or %hhx u8 %u or %x - s16 %d or %x + s16 %d or %hx u16 %u or %x s32 %d or %x u32 %u or %x @@ -37,14 +38,13 @@ Integer types u64 %llu or %llx -If <type> is dependent on a config option for its size (e.g., sector_t, -blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a -format specifier of its largest possible type and explicitly cast to it. +If <type> is architecture-dependent for its size (e.g., cycles_t, tcflag_t) or +is dependent on a config option for its size (e.g., blk_status_t), use a format +specifier of its largest possible type and explicitly cast to it. Example:: - printk("test: sector number/total blocks: %llu/%llu\n", - (unsigned long long)sector, (unsigned long long)blockcount); + printk("test: latency: %llu cycles\n", (unsigned long long)time); Reminder: sizeof() returns type size_t. @@ -79,7 +79,19 @@ Pointers printed without a specifier extension (i.e unadorned %p) are hashed to prevent leaking information about the kernel memory layout. This has the added benefit of providing a unique identifier. On 64-bit machines the first 32 bits are zeroed. The kernel will print ``(ptrval)`` until it -gathers enough entropy. If you *really* want the address see %px below. +gathers enough entropy. + +When possible, use specialised modifiers such as %pS or %pB (described below) +to avoid the need of providing an unhashed address that has to be interpreted +post-hoc. 
If not possible, and the aim of printing the address is to provide +more information for debugging, use %p and boot the kernel with the +``no_hash_pointers`` parameter during debugging, which will print all %p +addresses unmodified. If you *really* always want the unmodified address, see +%px below. + +If (and only if) you are printing addresses as a content of a virtual file in +e.g. procfs or sysfs (using e.g. seq_printf(), not printk()) read by a +userspace process, use the %pK modifier described below instead of %p or %px. Error Pointers -------------- @@ -114,6 +126,18 @@ used when printing stack backtraces. The specifier takes into consideration the effect of compiler optimisations which may occur when tail-calls are used and marked with the noreturn GCC attribute. +If the pointer is within a module, the module name and optionally build ID is +printed after the symbol name with an extra ``b`` appended to the end of the +specifier. + +:: + + %pS versatile_init+0x0/0x110 [module_name] + %pSb versatile_init+0x0/0x110 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e] + %pSRb versatile_init+0x9/0x110 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e] + (with __builtin_extract_return_addr() translation) + %pBb prev_fn_of_versatile_init+0x88/0x88 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e] + Probed Pointers from BPF / tracing ---------------------------------- @@ -139,6 +163,11 @@ For printing kernel pointers which should be hidden from unprivileged users. The behaviour of %pK depends on the kptr_restrict sysctl - see Documentation/admin-guide/sysctl/kernel.rst for more details. +This modifier is *only* intended when producing content of a file read by +userspace from e.g. procfs or sysfs, not for dmesg. Please refer to the +section about %p above for discussion about how to manage hashing pointers +in printk(). + Unmodified Addresses -------------------- @@ -153,6 +182,13 @@ equivalent to %lx (or %lu). %px is preferred because it is more uniquely grep'able. If in the future we need to modify the way the kernel handles printing pointers we will be better equipped to find the call sites. +Before using %px, consider if using %p is sufficient together with enabling the +``no_hash_pointers`` kernel parameter during debugging sessions (see the %p +description above). One valid scenario for %px might be printing information +immediately before a panic, which prevents any sensitive information to be +exploited anyway, and with %px there would be no need to reproduce the panic +with no_hash_pointers. + Pointer Differences ------------------- @@ -173,12 +209,17 @@ Struct Resources :: %pr [mem 0x60000000-0x6fffffff flags 0x2200] or + [mem 0x60000000 flags 0x2200] or [mem 0x0000000060000000-0x000000006fffffff flags 0x2200] + [mem 0x0000000060000000 flags 0x2200] %pR [mem 0x60000000-0x6fffffff pref] or + [mem 0x60000000 pref] or [mem 0x0000000060000000-0x000000006fffffff pref] + [mem 0x0000000060000000 pref] For printing struct resources. The ``R`` and ``r`` specifiers result in a -printed resource with (R) or without (r) a decoded flags member. +printed resource with (R) or without (r) a decoded flags member. If start is +equal to end only print the start value. Passed by reference. @@ -195,6 +236,19 @@ width of the CPU data path. Passed by reference. +Struct Range +------------ + +:: + + %pra [range 0x0000000060000000-0x000000006fffffff] or + [range 0x0000000060000000] + +For printing struct range. struct range holds an arbitrary range of u64 +values. 
If start is equal to end only print the start value. + +Passed by reference. + DMA address types dma_addr_t ---------------------------- @@ -317,7 +371,7 @@ colon-separators. Leading zeros are always used. The additional ``c`` specifier can be used with the ``I`` specifier to print a compressed IPv6 address as described by -http://tools.ietf.org/html/rfc5952 +https://tools.ietf.org/html/rfc5952 Passed by reference. @@ -341,7 +395,7 @@ The additional ``p``, ``f``, and ``s`` specifiers are used to specify port flowinfo a ``/`` and scope a ``%``, each followed by the actual value. In case of an IPv6 address the compressed IPv6 address as described by -http://tools.ietf.org/html/rfc5952 is being used if the additional +https://tools.ietf.org/html/rfc5952 is being used if the additional specifier ``c`` is given. The IPv6 address is surrounded by ``[``, ``]`` in case of additional specifiers ``p``, ``f`` or ``s`` as suggested by https://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-07 @@ -490,18 +544,25 @@ Time and date :: %pt[RT] YYYY-mm-ddTHH:MM:SS + %pt[RT]s YYYY-mm-dd HH:MM:SS %pt[RT]d YYYY-mm-dd %pt[RT]t HH:MM:SS - %pt[RT][dt][r] + %pt[RT][dt][r][s] + +For printing date and time as represented by:: -For printing date and time as represented by R struct rtc_time structure T time64_t type + in human readable format. By default year will be incremented by 1900 and month by 1. Use %pt[RT]r (raw) to suppress this behaviour. +The %pt[RT]s (space) will override ISO 8601 separator by using ' ' (space) +instead of 'T' (Capital T) between date and time. It won't have any effect +when date or time is omitted. + Passed by reference. struct clk @@ -510,9 +571,8 @@ struct clk :: %pC pll1 - %pCn pll1 -For printing struct clk structures. %pC and %pCn print the name of the clock +For printing struct clk structures. %pC prints the name of the clock (Common Clock Framework) or a unique 32-bit ID (legacy clock framework). Passed by reference. @@ -529,22 +589,28 @@ For printing bitmap and its derivatives such as cpumask and nodemask, %*pb outputs the bitmap with field width as the number of bits and %*pbl output the bitmap as range list with field width as the number of bits. -Passed by reference. +The field width is passed by value, the bitmap is passed by reference. +Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease +printing cpumask and nodemask. -Flags bitfields such as page flags, gfp_flags ---------------------------------------------- +Flags bitfields such as page flags and gfp_flags +-------------------------------------------------------- :: - %pGp referenced|uptodate|lru|active|private + %pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff) %pGg GFP_USER|GFP_DMA32|GFP_NOWARN %pGv read|exec|mayread|maywrite|mayexec|denywrite For printing flags bitfields as a collection of symbolic constants that would construct the value. The type of flags is given by the third -character. Currently supported are [p]age flags, [v]ma_flags (both -expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag -names and print order depends on the particular type. +character. Currently supported are: + + - p - [p]age flags, expects value of type (``unsigned long *``) + - v - [v]ma_flags, expects value of type (``unsigned long *``) + - g - [g]fp_flags, expects value of type (``gfp_t *``) + +The flag names and print order depends on the particular type. 
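As a small, hedged sketch (the helper function below is hypothetical, not an existing kernel interface), the ``g`` modifier can be used to report which GFP flags were in effect when an allocation failed; note that the flags are passed by reference:

.. code-block:: c

    #include <linux/slab.h>

    static void *my_alloc_or_warn(size_t size, gfp_t gfp)
    {
            void *p = kmalloc(size, gfp | __GFP_NOWARN);

            if (!p)
                    pr_warn("allocation of %zu bytes failed, gfp=%pGg\n",
                            size, &gfp);
            return p;
    }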
Note that this format should not be used directly in the :c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags() @@ -563,10 +629,70 @@ For printing netdev_features_t. Passed by reference. +V4L2 and DRM FourCC code (pixel format) +--------------------------------------- + +:: + + %p4cc + +Print a FourCC code used by V4L2 or DRM, including format endianness and +its numerical value as hexadecimal. + +Passed by reference. + +Examples:: + + %p4cc BG12 little-endian (0x32314742) + %p4cc Y10 little-endian (0x20303159) + %p4cc NV12 big-endian (0xb231564e) + +Generic FourCC code +------------------- + +:: + %p4c[h[R]lb] gP00 (0x67503030) + +Print a generic FourCC code, as both ASCII characters and its numerical +value as hexadecimal. + +The generic FourCC code is always printed in the big-endian format, +the most significant byte first. This is the opposite of V4L/DRM FourCCs. + +The additional ``h``, ``hR``, ``l``, and ``b`` specifiers define what +endianness is used to load the stored bytes. The data might be interpreted +using the host, reversed host byte order, little-endian, or big-endian. + +Passed by reference. + +Examples for a little-endian machine, given &(u32)0x67503030:: + + %p4ch gP00 (0x67503030) + %p4chR 00Pg (0x30305067) + %p4cl gP00 (0x67503030) + %p4cb 00Pg (0x30305067) + +Examples for a big-endian machine, given &(u32)0x67503030:: + + %p4ch gP00 (0x67503030) + %p4chR 00Pg (0x30305067) + %p4cl 00Pg (0x30305067) + %p4cb gP00 (0x67503030) + +Rust +---- + +:: + + %pA + +Only intended to be used from Rust code to format ``core::fmt::Arguments``. +Do *not* use it from C. + Thanks ====== -If you add other %p extensions, please extend <lib/test_printf.c> with -one or more test cases, if at all feasible. +If you add other %p extensions, please extend <lib/tests/printf_kunit.c> +with one or more test cases, if at all feasible. Thank you for your cooperation and attention. diff --git a/Documentation/core-api/printk-index.rst b/Documentation/core-api/printk-index.rst new file mode 100644 index 000000000000..1979c5dd32fe --- /dev/null +++ b/Documentation/core-api/printk-index.rst @@ -0,0 +1,137 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +Printk Index +============ + +There are many ways to monitor the state of the system. One important +source of information is the system log. It provides a lot of information, +including more or less important warnings and error messages. + +There are monitoring tools that filter and take action based on messages +logged. + +The kernel messages are evolving together with the code. As a result, +particular kernel messages are not KABI and never will be! + +It is a huge challenge for maintaining the system log monitors. It requires +knowing what messages were updated in a particular kernel version and why. +Finding these changes in the sources would require non-trivial parsers. +Also it would require matching the sources with the binary kernel which +is not always trivial. Various changes might be backported. Various kernel +versions might be used on different monitored systems. + +This is where the printk index feature might become useful. It provides +a dump of printk formats used all over the source code used for the kernel +and modules on the running system. It is accessible at runtime via debugfs. + +The printk index helps to find changes in the message formats. Also it helps +to track the strings back to the kernel sources and the related commit. 
+ + User Interface ============== + +The index of printk formats is split into separate files. The files are +named according to the binaries where the printk formats are built-in. There +is always "vmlinux" and optionally also modules, for example:: + + /sys/kernel/debug/printk/index/vmlinux + /sys/kernel/debug/printk/index/ext4 + /sys/kernel/debug/printk/index/scsi_mod + +Note that only loaded modules are shown. Also printk formats from a module +might appear in "vmlinux" when the module is built-in. + +The content is inspired by the dynamic debug interface and looks like:: + + $> head -1 /sys/kernel/debug/printk/index/vmlinux; shuf -n 5 vmlinux + # <level[,flags]> filename:line function "format" + <5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n" + <4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n" + <6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n" + <6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n" + <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n" + +where the meaning is: + + - :level: log level value: 0-7 for particular severity, -1 as default, + 'c' as continuous line without an explicit log level + - :flags: optional flags: currently only 'c' for KERN_CONT + - :filename\:line: source filename and line number of the related + printk() call. Note that there are many wrappers, for example, + pr_warn(), pr_warn_once(), dev_warn(). + - :function: function name where the printk() call is used. + - :format: format string + +The extra information makes it a bit harder to find differences +between various kernels. In particular, the line number might change +very often. On the other hand, it helps a lot to confirm that +it is the same string or find the commit that is responsible +for eventual changes. + + +printk() Is Not a Stable KABI +============================= + +Several developers are afraid that exporting all these implementation +details into the user space will transform particular printk() calls +into KABI. + +But it is exactly the opposite. printk() calls must _not_ be KABI. +And the printk index helps user space tools to deal with this. + + +Subsystem specific printk wrappers +================================== + +The printk index is generated using extra metadata that are stored in +a dedicated .elf section ".printk_index". It is achieved using macro +wrappers that call __printk_index_emit() together with the real printk() +call. The same technique is also used for the metadata used by +the dynamic debug feature. + +The metadata are stored for a particular message only when it is printed +using these special wrappers. It is implemented for the commonly +used printk() calls, including, for example, pr_warn() or pr_once(). + +Additional changes are necessary for various subsystem specific wrappers +that call the original printk() via a common helper function. These need +their own wrappers that add __printk_index_emit(). + +Only a few subsystem specific wrappers have been updated so far, +for example, dev_printk(). As a result, the printk formats from +some subsystems can be missing in the printk index. + + +Subsystem specific prefix +========================= + +The pr_fmt() macro allows defining a prefix that is printed +before the string generated by the related printk() calls. + +Subsystem specific wrappers usually add even more complicated +prefixes.
+ +These prefixes can be stored into the printk index metadata +by an optional parameter of __printk_index_emit(). The debugfs +interface might then show the printk formats including these prefixes. +For example, drivers/acpi/osl.c contains:: + + #define pr_fmt(fmt) "ACPI: OSL: " fmt + + static int __init acpi_no_auto_serialize_setup(char *str) + { + acpi_gbl_auto_serialize_methods = FALSE; + pr_info("Auto-serialization disabled\n"); + + return 1; + } + +This results in the following printk index entry:: + + <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n" + +It helps matching messages from the real log with printk index. +Then the source file name, line number, and function name can +be used to match the string with the source code. diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst index ec575e72d0b2..7eb7c6023e09 100644 --- a/Documentation/core-api/protection-keys.rst +++ b/Documentation/core-api/protection-keys.rst @@ -4,31 +4,48 @@ Memory Protection Keys ====================== -Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature -which is found on Intel's Skylake (and later) "Scalable Processor" -Server CPUs. It will be available in future non-server Intel parts -and future AMD processors. - -For anyone wishing to test or use this feature, it is available in -Amazon's EC2 C5 instances and is known to work there using an Ubuntu -17.04 image. - -Memory Protection Keys provides a mechanism for enforcing page-based -protections, but without requiring modification of the page tables -when an application changes protection domains. It works by -dedicating 4 previously ignored bits in each page table entry to a -"protection key", giving 16 possible keys. - -There is also a new user-accessible register (PKRU) with two separate -bits (Access Disable and Write Disable) for each key. Being a CPU -register, PKRU is inherently thread-local, potentially giving each +Memory Protection Keys provide a mechanism for enforcing page-based +protections, but without requiring modification of the page tables when an +application changes protection domains. + +Pkeys Userspace (PKU) is a feature which can be found on: + * Intel server CPUs, Skylake and later + * Intel client CPUs, Tiger Lake (11th Gen Core) and later + * Future AMD CPUs + * arm64 CPUs implementing the Permission Overlay Extension (FEAT_S1POE) + +x86_64 +====== +Pkeys work by dedicating 4 previously Reserved bits in each page table entry to +a "protection key", giving 16 possible keys. + +Protections for each key are defined with a per-CPU user-accessible register +(PKRU). Each of these is a 32-bit register storing two bits (Access Disable +and Write Disable) for each of 16 keys. + +Being a CPU register, PKRU is inherently thread-local, potentially giving each thread a different set of protections from every other thread. -There are two new instructions (RDPKRU/WRPKRU) for reading and writing -to the new register. The feature is only available in 64-bit mode, -even though there is theoretically space in the PAE PTEs. These -permissions are enforced on data access only and have no effect on -instruction fetches. +There are two instructions (RDPKRU/WRPKRU) for reading and writing to the +register. The feature is only available in 64-bit mode, even though there is +theoretically space in the PAE PTEs. These permissions are enforced on data +access only and have no effect on instruction fetches. 
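To make the PKRU layout concrete, here is a minimal sketch (the helper is purely illustrative and not an existing kernel or libc API); the two ``PKEY_DISABLE_*`` values match the Linux uapi definitions:

.. code-block:: c

    #define PKEY_DISABLE_ACCESS 0x1     /* matches the uapi value */
    #define PKEY_DISABLE_WRITE  0x2     /* matches the uapi value */

    /*
     * Each protection key owns two adjacent bits in PKRU: Access Disable
     * at bit 2*pkey and Write Disable at bit 2*pkey + 1.
     */
    static inline unsigned int pkey_to_pkru_bits(int pkey, unsigned int rights)
    {
            return rights << (2 * pkey);
    }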
+ +arm64 +===== + +Pkeys use 3 bits in each page table entry, to encode a "protection key index", +giving 8 possible keys. + +Protections for each key are defined with a per-CPU user-writable system +register (POR_EL0). This is a 64-bit register encoding read, write and execute +overlay permissions for each protection key index. + +Being a CPU register, POR_EL0 is inherently thread-local, potentially giving +each thread a different set of protections from every other thread. + +Unlike x86_64, the protection key permissions also apply to instruction +fetches. Syscalls ======== @@ -40,11 +57,10 @@ There are 3 system calls which directly interact with pkeys:: int pkey_mprotect(unsigned long start, size_t len, unsigned long prot, int pkey); -Before a pkey can be used, it must first be allocated with -pkey_alloc(). An application calls the WRPKRU instruction -directly in order to change access permissions to memory covered -with a key. In this example WRPKRU is wrapped by a C function -called pkey_set(). +Before a pkey can be used, it must first be allocated with pkey_alloc(). An +application writes to the architecture specific CPU register directly in order +to change access permissions to memory covered with a key. In this example +this is wrapped by a C function called pkey_set(). :: int real_prot = PROT_READ|PROT_WRITE; @@ -66,9 +82,9 @@ is no longer in use:: munmap(ptr, PAGE_SIZE); pkey_free(pkey); -.. note:: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions. - An example implementation can be found in - tools/testing/selftests/x86/protection_keys.c. +.. note:: pkey_set() is a wrapper around writing to the CPU register. + Example implementations can be found in + tools/testing/selftests/mm/pkey-{arm64,powerpc,x86}.h Behavior ======== @@ -98,3 +114,7 @@ with a read():: The kernel will send a SIGSEGV in both cases, but si_code will be set to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when the plain mprotect() permissions are violated. + +Note that kernel accesses from a kthread (such as io_uring) will use a default +value for the protection key register and so will not be consistent with +userspace's value of the register or mprotect(). diff --git a/Documentation/core-api/rbtree.rst b/Documentation/core-api/rbtree.rst index 6b88837fbf82..ed1a9fbc779e 100644 --- a/Documentation/core-api/rbtree.rst +++ b/Documentation/core-api/rbtree.rst @@ -201,7 +201,7 @@ search trees, such as for traversals or users relying on a the particular order for their own logic. To this end, users can use 'struct rb_root_cached' to optimize O(logN) rb_first() calls to a simple pointer fetch avoiding potentially expensive tree iterations. This is done at negligible runtime -overhead for maintanence; albeit larger memory footprint. +overhead for maintenance; albeit larger memory footprint. 
Similar to the rb_root structure, cached rbtrees are initialized to be empty via:: diff --git a/Documentation/core-api/refcount-vs-atomic.rst b/Documentation/core-api/refcount-vs-atomic.rst index 79a009ce11df..94e628c1eb49 100644 --- a/Documentation/core-api/refcount-vs-atomic.rst +++ b/Documentation/core-api/refcount-vs-atomic.rst @@ -86,7 +86,19 @@ Memory ordering guarantee changes: * none (both fully unordered) -case 2) - increment-based ops that return no value +case 2) - non-"Read/Modify/Write" (RMW) ops with release ordering +----------------------------------------------------------------- + +Function changes: + + * atomic_set_release() --> refcount_set_release() + +Memory ordering guarantee changes: + + * none (both provide RELEASE ordering) + + +case 3) - increment-based ops that return no value -------------------------------------------------- Function changes: @@ -98,7 +110,7 @@ Memory ordering guarantee changes: * none (both fully unordered) -case 3) - decrement-based RMW ops that return no value +case 4) - decrement-based RMW ops that return no value ------------------------------------------------------ Function changes: @@ -110,7 +122,7 @@ Memory ordering guarantee changes: * fully unordered --> RELEASE ordering -case 4) - increment-based RMW ops that return a value +case 5) - increment-based RMW ops that return a value ----------------------------------------------------- Function changes: @@ -126,7 +138,20 @@ Memory ordering guarantees changes: result of obtaining pointer to the object! -case 5) - generic dec/sub decrement-based RMW ops that return a value +case 6) - increment-based RMW ops with acquire ordering that return a value +--------------------------------------------------------------------------- + +Function changes: + + * atomic_inc_not_zero() --> refcount_inc_not_zero_acquire() + * no atomic counterpart --> refcount_add_not_zero_acquire() + +Memory ordering guarantees changes: + + * fully ordered --> ACQUIRE ordering on success + + +case 7) - generic dec/sub decrement-based RMW ops that return a value --------------------------------------------------------------------- Function changes: @@ -139,7 +164,7 @@ Memory ordering guarantees changes: * fully ordered --> RELEASE ordering + ACQUIRE ordering on success -case 6) other decrement-based RMW ops that return a value +case 8) other decrement-based RMW ops that return a value --------------------------------------------------------- Function changes: @@ -154,7 +179,7 @@ Memory ordering guarantees changes: .. note:: atomic_add_unless() only provides full order on success. -case 7) - lock-based RMW +case 9) - lock-based RMW ------------------------ Function changes: diff --git a/Documentation/core-api/swiotlb.rst b/Documentation/core-api/swiotlb.rst new file mode 100644 index 000000000000..9e0fe027dd3b --- /dev/null +++ b/Documentation/core-api/swiotlb.rst @@ -0,0 +1,321 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +DMA and swiotlb +=============== + +swiotlb is a memory buffer allocator used by the Linux kernel DMA layer. It is +typically used when a device doing DMA can't directly access the target memory +buffer because of hardware limitations or other requirements. In such a case, +the DMA layer calls swiotlb to allocate a temporary memory buffer that conforms +to the limitations. The DMA is done to/from this temporary memory buffer, and +the CPU copies the data between the temporary buffer and the original target +memory buffer. 
This approach is generically called "bounce buffering", and the +temporary memory buffer is called a "bounce buffer". + +Device drivers don't interact directly with swiotlb. Instead, drivers inform +the DMA layer of the DMA attributes of the devices they are managing, and use +the normal DMA map, unmap, and sync APIs when programming a device to do DMA. +These APIs use the device DMA attributes and kernel-wide settings to determine +if bounce buffering is necessary. If so, the DMA layer manages the allocation, +freeing, and sync'ing of bounce buffers. Since the DMA attributes are per +device, some devices in a system may use bounce buffering while others do not. + +Because the CPU copies data between the bounce buffer and the original target +memory buffer, doing bounce buffering is slower than doing DMA directly to the +original memory buffer, and it consumes more CPU resources. So it is used only +when necessary for providing DMA functionality. + +Usage Scenarios +--------------- +swiotlb was originally created to handle DMA for devices with addressing +limitations. As physical memory sizes grew beyond 4 GiB, some devices could +only provide 32-bit DMA addresses. By allocating bounce buffer memory below +the 4 GiB line, these devices with addressing limitations could still work and +do DMA. + +More recently, Confidential Computing (CoCo) VMs have the guest VM's memory +encrypted by default, and the memory is not accessible by the host hypervisor +and VMM. For the host to do I/O on behalf of the guest, the I/O must be +directed to guest memory that is unencrypted. CoCo VMs set a kernel-wide option +to force all DMA I/O to use bounce buffers, and the bounce buffer memory is set +up as unencrypted. The host does DMA I/O to/from the bounce buffer memory, and +the Linux kernel DMA layer does "sync" operations to cause the CPU to copy the +data to/from the original target memory buffer. The CPU copying bridges between +the unencrypted and the encrypted memory. This use of bounce buffers allows +device drivers to "just work" in a CoCo VM, with no modifications +needed to handle the memory encryption complexity. + +Other edge case scenarios arise for bounce buffers. For example, when IOMMU +mappings are set up for a DMA operation to/from a device that is considered +"untrusted", the device should be given access only to the memory containing +the data being transferred. But if that memory occupies only part of an IOMMU +granule, other parts of the granule may contain unrelated kernel data. Since +IOMMU access control is per-granule, the untrusted device can gain access to +the unrelated kernel data. This problem is solved by bounce buffering the DMA +operation and ensuring that unused portions of the bounce buffers do not +contain any unrelated kernel data. + +Core Functionality +------------------ +The primary swiotlb APIs are swiotlb_tbl_map_single() and +swiotlb_tbl_unmap_single(). The "map" API allocates a bounce buffer of a +specified size in bytes and returns the physical address of the buffer. The +buffer memory is physically contiguous. The expectation is that the DMA layer +maps the physical memory address to a DMA address, and returns the DMA address +to the driver for programming into the device. If a DMA operation specifies +multiple memory buffer segments, a separate bounce buffer must be allocated for +each segment. swiotlb_tbl_map_single() always does a "sync" operation (i.e., a +CPU copy) to initialize the bounce buffer to match the contents of the original +buffer. 
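+
+Device drivers reach this machinery only indirectly, through the generic DMA
+API. A hedged driver-side sketch (the device, buffer, and length variables are
+made up) of a mapping that may transparently use a bounce buffer::
+
+	dma_addr_t dma;
+
+	/* may allocate a bounce buffer and copy the data in 'buf' into it */
+	dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
+	if (dma_mapping_error(dev, dma))
+		return -ENOMEM;
+
+	/* ... program 'dma' into the device and run the transfer ... */
+
+	/* frees the bounce buffer; a DMA_FROM_DEVICE unmap would copy back first */
+	dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);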
+
+swiotlb_tbl_unmap_single() does the reverse. If the DMA operation might have
+updated the bounce buffer memory and DMA_ATTR_SKIP_CPU_SYNC is not set, the
+unmap does a "sync" operation to cause a CPU copy of the data from the bounce
+buffer back to the original buffer. Then the bounce buffer memory is freed.
+
+swiotlb also provides "sync" APIs that correspond to the dma_sync_*() APIs that
+a driver may use when control of a buffer transitions between the CPU and the
+device. The swiotlb "sync" APIs cause a CPU copy of the data between the
+original buffer and the bounce buffer. Like the dma_sync_*() APIs, the swiotlb
+"sync" APIs support doing a partial sync, where only a subset of the bounce
+buffer is copied to/from the original buffer.
+
+Core Functionality Constraints
+------------------------------
+The swiotlb map/unmap/sync APIs must operate without blocking, as they are
+called by the corresponding DMA APIs which may run in contexts that cannot
+block. Hence the default memory pool for swiotlb allocations must be
+pre-allocated at boot time (but see Dynamic swiotlb below). Because swiotlb
+allocations must be physically contiguous, the entire default memory pool is
+allocated as a single contiguous block.
+
+The need to pre-allocate the default swiotlb pool creates a boot-time tradeoff.
+The pool should be large enough to ensure that bounce buffer requests can
+always be satisfied, as the non-blocking requirement means requests can't wait
+for space to become available. But a large pool potentially wastes memory, as
+this pre-allocated memory is not available for other uses in the system. The
+tradeoff is particularly acute in CoCo VMs that use bounce buffers for all DMA
+I/O. These VMs use a heuristic to set the default pool size to ~6% of memory,
+with a max of 1 GiB, which has the potential to be very wasteful of memory.
+Conversely, the heuristic might produce a size that is insufficient, depending
+on the I/O patterns of the workload in the VM. The dynamic swiotlb feature
+described below can help, but has limitations. Better management of the swiotlb
+default memory pool size remains an open issue.
+
+A single allocation from swiotlb is limited to IO_TLB_SIZE * IO_TLB_SEGSIZE
+bytes, which is 256 KiB with current definitions. When a device's DMA settings
+are such that the device might use swiotlb, the maximum size of a DMA segment
+must be limited to that 256 KiB. This value is communicated to higher-level
+kernel code via dma_max_mapping_size() and swiotlb_max_mapping_size(). If the
+higher-level code fails to account for this limit, it may make requests that
+are too large for swiotlb, and get a "swiotlb full" error.
+
+A key device DMA setting is "min_align_mask", which is a power of 2 minus 1
+so that some number of low order bits are set, or it may be zero. swiotlb
+allocations ensure these min_align_mask bits of the physical address of the
+bounce buffer match the same bits in the address of the original buffer. When
+min_align_mask is non-zero, it may produce an "alignment offset" in the address
+of the bounce buffer that slightly reduces the maximum size of an allocation.
+This potential alignment offset is reflected in the value returned by
+swiotlb_max_mapping_size(), which can show up in places like
+/sys/block/<device>/queue/max_sectors_kb. For example, if a device does not use
+swiotlb, max_sectors_kb might be 512 KiB or larger. If a device might use
+swiotlb, max_sectors_kb will be 256 KiB.
When min_align_mask is non-zero, +max_sectors_kb might be even smaller, such as 252 KiB. + +swiotlb_tbl_map_single() also takes an "alloc_align_mask" parameter. This +parameter specifies the allocation of bounce buffer space must start at a +physical address with the alloc_align_mask bits set to zero. But the actual +bounce buffer might start at a larger address if min_align_mask is non-zero. +Hence there may be pre-padding space that is allocated prior to the start of +the bounce buffer. Similarly, the end of the bounce buffer is rounded up to an +alloc_align_mask boundary, potentially resulting in post-padding space. Any +pre-padding or post-padding space is not initialized by swiotlb code. The +"alloc_align_mask" parameter is used by IOMMU code when mapping for untrusted +devices. It is set to the granule size - 1 so that the bounce buffer is +allocated entirely from granules that are not used for any other purpose. + +Data structures concepts +------------------------ +Memory used for swiotlb bounce buffers is allocated from overall system memory +as one or more "pools". The default pool is allocated during system boot with a +default size of 64 MiB. The default pool size may be modified with the +"swiotlb=" kernel boot line parameter. The default size may also be adjusted +due to other conditions, such as running in a CoCo VM, as described above. If +CONFIG_SWIOTLB_DYNAMIC is enabled, additional pools may be allocated later in +the life of the system. Each pool must be a contiguous range of physical +memory. The default pool is allocated below the 4 GiB physical address line so +it works for devices that can only address 32-bits of physical memory (unless +architecture-specific code provides the SWIOTLB_ANY flag). In a CoCo VM, the +pool memory must be decrypted before swiotlb is used. + +Each pool is divided into "slots" of size IO_TLB_SIZE, which is 2 KiB with +current definitions. IO_TLB_SEGSIZE contiguous slots (128 slots) constitute +what might be called a "slot set". When a bounce buffer is allocated, it +occupies one or more contiguous slots. A slot is never shared by multiple +bounce buffers. Furthermore, a bounce buffer must be allocated from a single +slot set, which leads to the maximum bounce buffer size being IO_TLB_SIZE * +IO_TLB_SEGSIZE. Multiple smaller bounce buffers may co-exist in a single slot +set if the alignment and size constraints can be met. + +Slots are also grouped into "areas", with the constraint that a slot set exists +entirely in a single area. Each area has its own spin lock that must be held to +manipulate the slots in that area. The division into areas avoids contending +for a single global spin lock when swiotlb is heavily used, such as in a CoCo +VM. The number of areas defaults to the number of CPUs in the system for +maximum parallelism, but since an area can't be smaller than IO_TLB_SEGSIZE +slots, it might be necessary to assign multiple CPUs to the same area. The +number of areas can also be set via the "swiotlb=" kernel boot parameter. + +When allocating a bounce buffer, if the area associated with the calling CPU +does not have enough free space, areas associated with other CPUs are tried +sequentially. For each area tried, the area's spin lock must be obtained before +trying an allocation, so contention may occur if swiotlb is relatively busy +overall. But an allocation request does not fail unless all areas do not have +enough free space. 
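+
+As a rough sketch (not the exact kernel code; ``pool->nareas`` simply stands
+for the pool's area count), the search can start at the area derived from the
+current CPU and wrap around, relying on the power-of-2 rounding described in
+the next paragraph::
+
+	unsigned int nareas = pool->nareas;	/* rounded up to a power of 2 */
+	unsigned int start = raw_smp_processor_id() & (nareas - 1);
+	unsigned int i = start;
+
+	do {
+		/* lock area 'i', scan its slots for a fit, unlock; stop on success */
+		i = (i + 1) & (nareas - 1);
+	} while (i != start);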
+ +IO_TLB_SIZE, IO_TLB_SEGSIZE, and the number of areas must all be powers of 2 as +the code uses shifting and bit masking to do many of the calculations. The +number of areas is rounded up to a power of 2 if necessary to meet this +requirement. + +The default pool is allocated with PAGE_SIZE alignment. If an alloc_align_mask +argument to swiotlb_tbl_map_single() specifies a larger alignment, one or more +initial slots in each slot set might not meet the alloc_align_mask criterium. +Because a bounce buffer allocation can't cross a slot set boundary, eliminating +those initial slots effectively reduces the max size of a bounce buffer. +Currently, there's no problem because alloc_align_mask is set based on IOMMU +granule size, and granules cannot be larger than PAGE_SIZE. But if that were to +change in the future, the initial pool allocation might need to be done with +alignment larger than PAGE_SIZE. + +Dynamic swiotlb +--------------- +When CONFIG_SWIOTLB_DYNAMIC is enabled, swiotlb can do on-demand expansion of +the amount of memory available for allocation as bounce buffers. If a bounce +buffer request fails due to lack of available space, an asynchronous background +task is kicked off to allocate memory from general system memory and turn it +into an swiotlb pool. Creating an additional pool must be done asynchronously +because the memory allocation may block, and as noted above, swiotlb requests +are not allowed to block. Once the background task is kicked off, the bounce +buffer request creates a "transient pool" to avoid returning an "swiotlb full" +error. A transient pool has the size of the bounce buffer request, and is +deleted when the bounce buffer is freed. Memory for this transient pool comes +from the general system memory atomic pool so that creation does not block. +Creating a transient pool has relatively high cost, particularly in a CoCo VM +where the memory must be decrypted, so it is done only as a stopgap until the +background task can add another non-transient pool. + +Adding a dynamic pool has limitations. Like with the default pool, the memory +must be physically contiguous, so the size is limited to MAX_PAGE_ORDER pages +(e.g., 4 MiB on a typical x86 system). Due to memory fragmentation, a max size +allocation may not be available. The dynamic pool allocator tries smaller sizes +until it succeeds, but with a minimum size of 1 MiB. Given sufficient system +memory fragmentation, dynamically adding a pool might not succeed at all. + +The number of areas in a dynamic pool may be different from the number of areas +in the default pool. Because the new pool size is typically a few MiB at most, +the number of areas will likely be smaller. For example, with a new pool size +of 4 MiB and the 256 KiB minimum area size, only 16 areas can be created. If +the system has more than 16 CPUs, multiple CPUs must share an area, creating +more lock contention. + +New pools added via dynamic swiotlb are linked together in a linear list. +swiotlb code frequently must search for the pool containing a particular +swiotlb physical address, so that search is linear and not performant with a +large number of dynamic pools. The data structures could be improved for +faster searches. + +Overall, dynamic swiotlb works best for small configurations with relatively +few CPUs. It allows the default swiotlb pool to be smaller so that memory is +not wasted, with dynamic pools making more space available if needed (as long +as fragmentation isn't an obstacle). It is less useful for large CoCo VMs. 
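+
+For example, the administrator of a small VM might shrink the default pool on
+the kernel command line and rely on dynamic growth. A hedged illustration (the
+numbers are arbitrary; the first value of "swiotlb=" is the slot count at
+2 KiB per slot, and the optional second value is the number of areas)::
+
+	swiotlb=16384,2
+
+which would request a 32 MiB default pool (16384 slots * 2 KiB) split into
+2 areas.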
+ +Data Structure Details +---------------------- +swiotlb is managed with four primary data structures: io_tlb_mem, io_tlb_pool, +io_tlb_area, and io_tlb_slot. io_tlb_mem describes a swiotlb memory allocator, +which includes the default memory pool and any dynamic or transient pools +linked to it. Limited statistics on swiotlb usage are kept per memory allocator +and are stored in this data structure. These statistics are available under +/sys/kernel/debug/swiotlb when CONFIG_DEBUG_FS is set. + +io_tlb_pool describes a memory pool, either the default pool, a dynamic pool, +or a transient pool. The description includes the start and end addresses of +the memory in the pool, a pointer to an array of io_tlb_area structures, and a +pointer to an array of io_tlb_slot structures that are associated with the pool. + +io_tlb_area describes an area. The primary field is the spin lock used to +serialize access to slots in the area. The io_tlb_area array for a pool has an +entry for each area, and is accessed using a 0-based area index derived from the +calling processor ID. Areas exist solely to allow parallel access to swiotlb +from multiple CPUs. + +io_tlb_slot describes an individual memory slot in the pool, with size +IO_TLB_SIZE (2 KiB currently). The io_tlb_slot array is indexed by the slot +index computed from the bounce buffer address relative to the starting memory +address of the pool. The size of struct io_tlb_slot is 24 bytes, so the +overhead is about 1% of the slot size. + +The io_tlb_slot array is designed to meet several requirements. First, the DMA +APIs and the corresponding swiotlb APIs use the bounce buffer address as the +identifier for a bounce buffer. This address is returned by +swiotlb_tbl_map_single(), and then passed as an argument to +swiotlb_tbl_unmap_single() and the swiotlb_sync_*() functions. The original +memory buffer address obviously must be passed as an argument to +swiotlb_tbl_map_single(), but it is not passed to the other APIs. Consequently, +swiotlb data structures must save the original memory buffer address so that it +can be used when doing sync operations. This original address is saved in the +io_tlb_slot array. + +Second, the io_tlb_slot array must handle partial sync requests. In such cases, +the argument to swiotlb_sync_*() is not the address of the start of the bounce +buffer but an address somewhere in the middle of the bounce buffer, and the +address of the start of the bounce buffer isn't known to swiotlb code. But +swiotlb code must be able to calculate the corresponding original memory buffer +address to do the CPU copy dictated by the "sync". So an adjusted original +memory buffer address is populated into the struct io_tlb_slot for each slot +occupied by the bounce buffer. An adjusted "alloc_size" of the bounce buffer is +also recorded in each struct io_tlb_slot so a sanity check can be performed on +the size of the "sync" operation. The "alloc_size" field is not used except for +the sanity check. + +Third, the io_tlb_slot array is used to track available slots. The "list" field +in struct io_tlb_slot records how many contiguous available slots exist starting +at that slot. A "0" indicates that the slot is occupied. A value of "1" +indicates only the current slot is available. A value of "2" indicates the +current slot and the next slot are available, etc. The maximum value is +IO_TLB_SEGSIZE, which can appear in the first slot in a slot set, and indicates +that the entire slot set is available. 
These values are used when searching for +available slots to use for a new bounce buffer. They are updated when allocating +a new bounce buffer and when freeing a bounce buffer. At pool creation time, the +"list" field is initialized to IO_TLB_SEGSIZE down to 1 for the slots in every +slot set. + +Fourth, the io_tlb_slot array keeps track of any "padding slots" allocated to +meet alloc_align_mask requirements described above. When +swiotlb_tbl_map_single() allocates bounce buffer space to meet alloc_align_mask +requirements, it may allocate pre-padding space across zero or more slots. But +when swiotlb_tbl_unmap_single() is called with the bounce buffer address, the +alloc_align_mask value that governed the allocation, and therefore the +allocation of any padding slots, is not known. The "pad_slots" field records +the number of padding slots so that swiotlb_tbl_unmap_single() can free them. +The "pad_slots" value is recorded only in the first non-padding slot allocated +to the bounce buffer. + +Restricted pools +---------------- +The swiotlb machinery is also used for "restricted pools", which are pools of +memory separate from the default swiotlb pool, and that are dedicated for DMA +use by a particular device. Restricted pools provide a level of DMA memory +protection on systems with limited hardware protection capabilities, such as +those lacking an IOMMU. Such usage is specified by DeviceTree entries and +requires that CONFIG_DMA_RESTRICTED_POOL is set. Each restricted pool is based +on its own io_tlb_mem data structure that is independent of the main swiotlb +io_tlb_mem. + +Restricted pools add swiotlb_alloc() and swiotlb_free() APIs, which are called +from the dma_alloc_*() and dma_free_*() APIs. The swiotlb_alloc/free() APIs +allocate/free slots from/to the restricted pool directly and do not go through +swiotlb_tbl_map/unmap_single(). diff --git a/Documentation/core-api/symbol-namespaces.rst b/Documentation/core-api/symbol-namespaces.rst index 9b76337f6756..32fc73dc5529 100644 --- a/Documentation/core-api/symbol-namespaces.rst +++ b/Documentation/core-api/symbol-namespaces.rst @@ -6,18 +6,8 @@ The following document describes how to use Symbol Namespaces to structure the export surface of in-kernel symbols exported through the family of EXPORT_SYMBOL() macros. -.. Table of Contents - - === 1 Introduction - === 2 How to define Symbol Namespaces - --- 2.1 Using the EXPORT_SYMBOL macros - --- 2.2 Using the DEFAULT_SYMBOL_NAMESPACE define - === 3 How to use Symbols exported in Namespaces - === 4 Loading Modules that use namespaced Symbols - === 5 Automatically creating MODULE_IMPORT_NS statements - -1. Introduction -=============== +Introduction +============ Symbol Namespaces have been introduced as a means to structure the export surface of the in-kernel API. It allows subsystem maintainers to partition @@ -28,34 +18,37 @@ kernel. As of today, modules that make use of symbols exported into namespaces, are required to import the namespace. Otherwise the kernel will, depending on its configuration, reject loading the module or warn about a missing import. -2. How to define Symbol Namespaces -================================== +Additionally, it is possible to put symbols into a module namespace, strictly +limiting which modules are allowed to use these symbols. + +How to define Symbol Namespaces +=============================== Symbols can be exported into namespace using different methods. 
All of them are changing the way EXPORT_SYMBOL and friends are instrumented to create ksymtab entries. -2.1 Using the EXPORT_SYMBOL macros -================================== +Using the EXPORT_SYMBOL macros +------------------------------ In addition to the macros EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL(), that allow exporting of kernel symbols to the kernel symbol table, variants of these are available to export symbols into a certain namespace: EXPORT_SYMBOL_NS() and -EXPORT_SYMBOL_NS_GPL(). They take one additional argument: the namespace. -Please note that due to macro expansion that argument needs to be a -preprocessor symbol. E.g. to export the symbol `usb_stor_suspend` into the -namespace `USB_STORAGE`, use:: +EXPORT_SYMBOL_NS_GPL(). They take one additional argument: the namespace as a +string constant. Note that this string must not contain whitespaces. +E.g. to export the symbol ``usb_stor_suspend`` into the +namespace ``USB_STORAGE``, use:: - EXPORT_SYMBOL_NS(usb_stor_suspend, USB_STORAGE); + EXPORT_SYMBOL_NS(usb_stor_suspend, "USB_STORAGE"); -The corresponding ksymtab entry struct `kernel_symbol` will have the member -`namespace` set accordingly. A symbol that is exported without a namespace will -refer to `NULL`. There is no default namespace if none is defined. `modpost` -and kernel/module.c make use the namespace at build time or module load time, -respectively. +The corresponding ksymtab entry struct ``kernel_symbol`` will have the member +``namespace`` set accordingly. A symbol that is exported without a namespace will +refer to ``NULL``. There is no default namespace if none is defined. ``modpost`` +and kernel/module/main.c make use the namespace at build time or module load +time, respectively. -2.2 Using the DEFAULT_SYMBOL_NAMESPACE define -============================================= +Using the DEFAULT_SYMBOL_NAMESPACE define +----------------------------------------- Defining namespaces for all symbols of a subsystem can be very verbose and may become hard to maintain. Therefore a default define (DEFAULT_SYMBOL_NAMESPACE) @@ -64,11 +57,11 @@ and EXPORT_SYMBOL_GPL() macro expansions that do not specify a namespace. There are multiple ways of specifying this define and it depends on the subsystem and the maintainer's preference, which one to use. The first option -is to define the default namespace in the `Makefile` of the subsystem. E.g. to +is to define the default namespace in the ``Makefile`` of the subsystem. E.g. to export all symbols defined in usb-common into the namespace USB_COMMON, add a line like this to drivers/usb/common/Makefile:: - ccflags-y += -DDEFAULT_SYMBOL_NAMESPACE=USB_COMMON + ccflags-y += -DDEFAULT_SYMBOL_NAMESPACE='"USB_COMMON"' That will affect all EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL() statements. A symbol exported with EXPORT_SYMBOL_NS() while this definition is present, will @@ -78,14 +71,29 @@ as this argument has preference over a default symbol namespace. A second option to define the default namespace is directly in the compilation unit as preprocessor statement. The above example would then read:: - #undef DEFAULT_SYMBOL_NAMESPACE - #define DEFAULT_SYMBOL_NAMESPACE USB_COMMON + #define DEFAULT_SYMBOL_NAMESPACE "USB_COMMON" + +within the corresponding compilation unit before the #include for +<linux/export.h>. Typically it's placed before the first #include statement. 
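+
+Putting the pieces together, a hypothetical compilation unit using the in-file
+default could look like this (the file and function names are made up for
+illustration)::
+
+	/* drivers/usb/common/example.c -- hypothetical */
+	#define DEFAULT_SYMBOL_NAMESPACE "USB_COMMON"
+
+	#include <linux/export.h>
+	#include <linux/module.h>
+
+	void usb_common_do_something(void)
+	{
+		/* ... */
+	}
+	/* ends up in the USB_COMMON namespace because of the define above */
+	EXPORT_SYMBOL_GPL(usb_common_do_something);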
+ +Using the EXPORT_SYMBOL_GPL_FOR_MODULES() macro +----------------------------------------------- -within the corresponding compilation unit before any EXPORT_SYMBOL macro is -used. +Symbols exported using this macro are put into a module namespace. This +namespace cannot be imported. -3. How to use Symbols exported in Namespaces -============================================ +The macro takes a comma separated list of module names, allowing only those +modules to access this symbol. Simple tail-globs are supported. + +For example:: + + EXPORT_SYMBOL_GPL_FOR_MODULES(preempt_notifier_inc, "kvm,kvm-*") + +will limit usage of this symbol to modules whoes name matches the given +patterns. + +How to use Symbols exported in Namespaces +========================================= In order to use symbols that are exported into namespaces, kernel modules need to explicitly import these namespaces. Otherwise the kernel might reject to @@ -94,9 +102,9 @@ for the namespaces it uses symbols from. E.g. a module using the usb_stor_suspend symbol from above, needs to import the namespace USB_STORAGE using a statement like:: - MODULE_IMPORT_NS(USB_STORAGE); + MODULE_IMPORT_NS("USB_STORAGE"); -This will create a `modinfo` tag in the module for each imported namespace. +This will create a ``modinfo`` tag in the module for each imported namespace. This has the side effect, that the imported namespaces of a module can be inspected with modinfo:: @@ -107,13 +115,12 @@ inspected with modinfo:: It is advisable to add the MODULE_IMPORT_NS() statement close to other module -metadata definitions like MODULE_AUTHOR() or MODULE_LICENSE(). Refer to section -5. for a way to create missing import statements automatically. +metadata definitions like MODULE_AUTHOR() or MODULE_LICENSE(). -4. Loading Modules that use namespaced Symbols -============================================== +Loading Modules that use namespaced Symbols +=========================================== -At module loading time (e.g. `insmod`), the kernel will check each symbol +At module loading time (e.g. ``insmod``), the kernel will check each symbol referenced from the module for its availability and whether the namespace it might be exported to has been imported by the module. The default behaviour of the kernel is to reject loading modules that don't specify sufficient imports. @@ -122,8 +129,8 @@ allow loading of modules that don't satisfy this precondition, a configuration option is available: Setting MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS=y will enable loading regardless, but will emit a warning. -5. Automatically creating MODULE_IMPORT_NS statements -===================================================== +Automatically creating MODULE_IMPORT_NS statements +================================================== Missing namespaces imports can easily be detected at build time. In fact, modpost will emit a warning if a module uses a symbol from a namespace @@ -138,20 +145,23 @@ missing imports. Fixing missing imports can be done with:: A typical scenario for module authors would be:: - write code that depends on a symbol from a not imported namespace - - `make` + - ``make`` - notice the warning of modpost telling about a missing import - - run `make nsdeps` to add the import to the correct code location + - run ``make nsdeps`` to add the import to the correct code location For subsystem maintainers introducing a namespace, the steps are very similar. 
-Again, `make nsdeps` will eventually add the missing namespace imports for +Again, ``make nsdeps`` will eventually add the missing namespace imports for in-tree modules:: - move or add symbols to a namespace (e.g. with EXPORT_SYMBOL_NS()) - - `make` (preferably with an allmodconfig to cover all in-kernel + - ``make`` (preferably with an allmodconfig to cover all in-kernel modules) - notice the warning of modpost telling about a missing import - - run `make nsdeps` to add the import to the correct code location + - run ``make nsdeps`` to add the import to the correct code location You can also run nsdeps for external module builds. A typical usage is:: $ make -C <path_to_kernel_src> M=$PWD nsdeps + +Note: it will happily generate an import statement for the module namespace; +which will not work and generates build and runtime failures. diff --git a/Documentation/core-api/this_cpu_ops.rst b/Documentation/core-api/this_cpu_ops.rst new file mode 100644 index 000000000000..533ac5dd5750 --- /dev/null +++ b/Documentation/core-api/this_cpu_ops.rst @@ -0,0 +1,347 @@ +=================== +this_cpu operations +=================== + +:Author: Christoph Lameter, August 4th, 2014 +:Author: Pranith Kumar, Aug 2nd, 2014 + +this_cpu operations are a way of optimizing access to per cpu +variables associated with the *currently* executing processor. This is +done through the use of segment registers (or a dedicated register where +the cpu permanently stored the beginning of the per cpu area for a +specific processor). + +this_cpu operations add a per cpu variable offset to the processor +specific per cpu base and encode that operation in the instruction +operating on the per cpu variable. + +This means that there are no atomicity issues between the calculation of +the offset and the operation on the data. Therefore it is not +necessary to disable preemption or interrupts to ensure that the +processor is not changed between the calculation of the address and +the operation on the data. + +Read-modify-write operations are of particular interest. Frequently +processors have special lower latency instructions that can operate +without the typical synchronization overhead, but still provide some +sort of relaxed atomicity guarantees. The x86, for example, can execute +RMW (Read Modify Write) instructions like inc/dec/cmpxchg without the +lock prefix and the associated latency penalty. + +Access to the variable without the lock prefix is not synchronized but +synchronization is not necessary since we are dealing with per cpu +data specific to the currently executing processor. Only the current +processor should be accessing that variable and therefore there are no +concurrency issues with other processors in the system. + +Please note that accesses by remote processors to a per cpu area are +exceptional situations and may impact performance and/or correctness +(remote write operations) of local RMW operations via this_cpu_*. + +The main use of the this_cpu operations has been to optimize counter +operations. + +The following this_cpu() operations with implied preemption protection +are defined. 
These operations can be used without worrying about +preemption and interrupts:: + + this_cpu_read(pcp) + this_cpu_write(pcp, val) + this_cpu_add(pcp, val) + this_cpu_and(pcp, val) + this_cpu_or(pcp, val) + this_cpu_add_return(pcp, val) + this_cpu_xchg(pcp, nval) + this_cpu_cmpxchg(pcp, oval, nval) + this_cpu_sub(pcp, val) + this_cpu_inc(pcp) + this_cpu_dec(pcp) + this_cpu_sub_return(pcp, val) + this_cpu_inc_return(pcp) + this_cpu_dec_return(pcp) + + +Inner working of this_cpu operations +------------------------------------ + +On x86 the fs: or the gs: segment registers contain the base of the +per cpu area. It is then possible to simply use the segment override +to relocate a per cpu relative address to the proper per cpu area for +the processor. So the relocation to the per cpu base is encoded in the +instruction via a segment register prefix. + +For example:: + + DEFINE_PER_CPU(int, x); + int z; + + z = this_cpu_read(x); + +results in a single instruction:: + + mov ax, gs:[x] + +instead of a sequence of calculation of the address and then a fetch +from that address which occurs with the per cpu operations. Before +this_cpu_ops such sequence also required preempt disable/enable to +prevent the kernel from moving the thread to a different processor +while the calculation is performed. + +Consider the following this_cpu operation:: + + this_cpu_inc(x) + +The above results in the following single instruction (no lock prefix!):: + + inc gs:[x] + +instead of the following operations required if there is no segment +register:: + + int *y; + int cpu; + + cpu = get_cpu(); + y = per_cpu_ptr(&x, cpu); + (*y)++; + put_cpu(); + +Note that these operations can only be used on per cpu data that is +reserved for a specific processor. Without disabling preemption in the +surrounding code this_cpu_inc() will only guarantee that one of the +per cpu counters is correctly incremented. However, there is no +guarantee that the OS will not move the process directly before or +after the this_cpu instruction is executed. In general this means that +the value of the individual counters for each processor are +meaningless. The sum of all the per cpu counters is the only value +that is of interest. + +Per cpu variables are used for performance reasons. Bouncing cache +lines can be avoided if multiple processors concurrently go through +the same code paths. Since each processor has its own per cpu +variables no concurrent cache line updates take place. The price that +has to be paid for this optimization is the need to add up the per cpu +counters when the value of a counter is needed. + + +Special operations +------------------ + +:: + + y = this_cpu_ptr(&x) + +Takes the offset of a per cpu variable (&x !) and returns the address +of the per cpu variable that belongs to the currently executing +processor. this_cpu_ptr avoids multiple steps that the common +get_cpu/put_cpu sequence requires. No processor number is +available. Instead, the offset of the local per cpu area is simply +added to the per cpu offset. + +Note that this operation can only be used in code segments where +smp_processor_id() may be used, for example, where preemption has been +disabled. The pointer is then used to access local per cpu data in a +critical section. When preemption is re-enabled this pointer is usually +no longer useful since it may no longer point to per cpu data of the +current processor. 
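+
+For example, a minimal sketch (the structure and variable names are made up)
+of using this_cpu_ptr() in a section with preemption disabled::
+
+	struct my_stats {
+		unsigned long packets;
+		unsigned long bytes;
+	};
+	static DEFINE_PER_CPU(struct my_stats, pcpu_stats);
+
+	static void account_packet(unsigned int len)
+	{
+		struct my_stats *s;
+
+		preempt_disable();
+		s = this_cpu_ptr(&pcpu_stats);	/* valid only while preemption is off */
+		s->packets++;
+		s->bytes += len;
+		preempt_enable();
+	}
+
+Updating a single counter on its own could instead use
+this_cpu_add(pcpu_stats.bytes, len), which needs no explicit preemption
+protection; the explicit pair above keeps both updates on the same CPU's copy.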
+ +The special cases where it makes sense to obtain a per-CPU pointer in +preemptible code are addressed by raw_cpu_ptr(), but such use cases need +to handle cases where two different CPUs are accessing the same per cpu +variable, which might well be that of a third CPU. These use cases are +typically performance optimizations. For example, SRCU implements a pair +of counters as a pair of per-CPU variables, and rcu_read_lock_nmisafe() +uses raw_cpu_ptr() to get a pointer to some CPU's counter, and uses +atomic_inc_long() to handle migration between the raw_cpu_ptr() and +the atomic_inc_long(). + +Per cpu variables and offsets +----------------------------- + +Per cpu variables have *offsets* to the beginning of the per cpu +area. They do not have addresses although they look like that in the +code. Offsets cannot be directly dereferenced. The offset must be +added to a base pointer of a per cpu area of a processor in order to +form a valid address. + +Therefore the use of x or &x outside of the context of per cpu +operations is invalid and will generally be treated like a NULL +pointer dereference. + +:: + + DEFINE_PER_CPU(int, x); + +In the context of per cpu operations the above implies that x is a per +cpu variable. Most this_cpu operations take a cpu variable. + +:: + + int __percpu *p = &x; + +&x and hence p is the *offset* of a per cpu variable. this_cpu_ptr() +takes the offset of a per cpu variable which makes this look a bit +strange. + + +Operations on a field of a per cpu structure +-------------------------------------------- + +Let's say we have a percpu structure:: + + struct s { + int n,m; + }; + + DEFINE_PER_CPU(struct s, p); + + +Operations on these fields are straightforward:: + + this_cpu_inc(p.m) + + z = this_cpu_cmpxchg(p.m, 0, 1); + + +If we have an offset to struct s:: + + struct s __percpu *ps = &p; + + this_cpu_dec(ps->m); + + z = this_cpu_inc_return(ps->n); + + +The calculation of the pointer may require the use of this_cpu_ptr() +if we do not make use of this_cpu ops later to manipulate fields:: + + struct s *pp; + + pp = this_cpu_ptr(&p); + + pp->m--; + + z = pp->n++; + + +Variants of this_cpu ops +------------------------ + +this_cpu ops are interrupt safe. Some architectures do not support +these per cpu local operations. In that case the operation must be +replaced by code that disables interrupts, then does the operations +that are guaranteed to be atomic and then re-enable interrupts. Doing +so is expensive. If there are other reasons why the scheduler cannot +change the processor we are executing on then there is no reason to +disable interrupts. For that purpose the following __this_cpu operations +are provided. + +These operations have no guarantee against concurrent interrupts or +preemption. If a per cpu variable is not used in an interrupt context +and the scheduler cannot preempt, then they are safe. 
If any interrupts +still occur while an operation is in progress and if the interrupt too +modifies the variable, then RMW actions can not be guaranteed to be +safe:: + + __this_cpu_read(pcp) + __this_cpu_write(pcp, val) + __this_cpu_add(pcp, val) + __this_cpu_and(pcp, val) + __this_cpu_or(pcp, val) + __this_cpu_add_return(pcp, val) + __this_cpu_xchg(pcp, nval) + __this_cpu_cmpxchg(pcp, oval, nval) + __this_cpu_sub(pcp, val) + __this_cpu_inc(pcp) + __this_cpu_dec(pcp) + __this_cpu_sub_return(pcp, val) + __this_cpu_inc_return(pcp) + __this_cpu_dec_return(pcp) + + +Will increment x and will not fall-back to code that disables +interrupts on platforms that cannot accomplish atomicity through +address relocation and a Read-Modify-Write operation in the same +instruction. + + +&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n) +-------------------------------------------- + +The first operation takes the offset and forms an address and then +adds the offset of the n field. This may result in two add +instructions emitted by the compiler. + +The second one first adds the two offsets and then does the +relocation. IMHO the second form looks cleaner and has an easier time +with (). The second form also is consistent with the way +this_cpu_read() and friends are used. + + +Remote access to per cpu data +------------------------------ + +Per cpu data structures are designed to be used by one cpu exclusively. +If you use the variables as intended, this_cpu_ops() are guaranteed to +be "atomic" as no other CPU has access to these data structures. + +There are special cases where you might need to access per cpu data +structures remotely. It is usually safe to do a remote read access +and that is frequently done to summarize counters. Remote write access +something which could be problematic because this_cpu ops do not +have lock semantics. A remote write may interfere with a this_cpu +RMW operation. + +Remote write accesses to percpu data structures are highly discouraged +unless absolutely necessary. Please consider using an IPI to wake up +the remote CPU and perform the update to its per cpu area. + +To access per-cpu data structure remotely, typically the per_cpu_ptr() +function is used:: + + + DEFINE_PER_CPU(struct data, datap); + + struct data *p = per_cpu_ptr(&datap, cpu); + +This makes it explicit that we are getting ready to access a percpu +area remotely. + +You can also do the following to convert the datap offset to an address:: + + struct data *p = this_cpu_ptr(&datap); + +but, passing of pointers calculated via this_cpu_ptr to other cpus is +unusual and should be avoided. + +Remote access are typically only for reading the status of another cpus +per cpu data. Write accesses can cause unique problems due to the +relaxed synchronization requirements for this_cpu operations. + +One example that illustrates some concerns with write operations is +the following scenario that occurs because two per cpu variables +share a cache-line but the relaxed synchronization is applied to +only one process updating the cache-line. + +Consider the following example:: + + + struct test { + atomic_t a; + int b; + }; + + DEFINE_PER_CPU(struct test, onecacheline); + +There is some concern about what would happen if the field 'a' is updated +remotely from one processor and the local processor would use this_cpu ops +to update field b. Care should be taken that such simultaneous accesses to +data within the same cache line are avoided. Also costly synchronization +may be necessary. 
IPIs are generally recommended in such scenarios instead +of a remote write to the per cpu area of another processor. + +Even in cases where the remote writes are rare, please bear in +mind that a remote write will evict the cache line from the processor +that most likely will access it. If the processor wakes up and finds a +missing local cache line of a per cpu area, its performance and hence +the wake up times will be affected. diff --git a/Documentation/core-api/timekeeping.rst b/Documentation/core-api/timekeeping.rst index 729e24864fe7..22ec68f24421 100644 --- a/Documentation/core-api/timekeeping.rst +++ b/Documentation/core-api/timekeeping.rst @@ -132,6 +132,7 @@ Some additional variants exist for more specialized cases: .. c:function:: u64 ktime_get_mono_fast_ns( void ) u64 ktime_get_raw_fast_ns( void ) u64 ktime_get_boot_fast_ns( void ) + u64 ktime_get_tai_fast_ns( void ) u64 ktime_get_real_fast_ns( void ) These variants are safe to call from any context, including from diff --git a/Documentation/core-api/unaligned-memory-access.rst b/Documentation/core-api/unaligned-memory-access.rst new file mode 100644 index 000000000000..5ceeb80eb539 --- /dev/null +++ b/Documentation/core-api/unaligned-memory-access.rst @@ -0,0 +1,265 @@ +========================= +Unaligned Memory Accesses +========================= + +:Author: Daniel Drake <dsd@gentoo.org>, +:Author: Johannes Berg <johannes@sipsolutions.net> + +:With help from: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt, + Kyle McMartin, Kyle Moffett, Randy Dunlap, Robert Hancock, Uli Kunitz, + Vadim Lobanov + + +Linux runs on a wide variety of architectures which have varying behaviour +when it comes to memory access. This document presents some details about +unaligned accesses, why you need to write code that doesn't cause them, +and how to write such code! + + +The definition of an unaligned access +===================================== + +Unaligned memory accesses occur when you try to read N bytes of data starting +from an address that is not evenly divisible by N (i.e. addr % N != 0). +For example, reading 4 bytes of data from address 0x10004 is fine, but +reading 4 bytes of data from address 0x10005 would be an unaligned memory +access. + +The above may seem a little vague, as memory access can happen in different +ways. The context here is at the machine code level: certain instructions read +or write a number of bytes to or from memory (e.g. movb, movw, movl in x86 +assembly). As will become clear, it is relatively easy to spot C statements +which will compile to multiple-byte memory access instructions, namely when +dealing with types such as u16, u32 and u64. + + +Natural alignment +================= + +The rule mentioned above forms what we refer to as natural alignment: +When accessing N bytes of memory, the base memory address must be evenly +divisible by N, i.e. addr % N == 0. + +When writing code, assume the target architecture has natural alignment +requirements. + +In reality, only a few architectures require natural alignment on all sizes +of memory access. However, we must consider ALL supported architectures; +writing code that satisfies natural alignment requirements is the easiest way +to achieve full portability. + + +Why unaligned access is bad +=========================== + +The effects of performing an unaligned memory access vary from architecture +to architecture. 
It would be easy to write a whole document on the differences +here; a summary of the common scenarios is presented below: + + - Some architectures are able to perform unaligned memory accesses + transparently, but there is usually a significant performance cost. + - Some architectures raise processor exceptions when unaligned accesses + happen. The exception handler is able to correct the unaligned access, + at significant cost to performance. + - Some architectures raise processor exceptions when unaligned accesses + happen, but the exceptions do not contain enough information for the + unaligned access to be corrected. + - Some architectures are not capable of unaligned memory access, but will + silently perform a different memory access to the one that was requested, + resulting in a subtle code bug that is hard to detect! + +It should be obvious from the above that if your code causes unaligned +memory accesses to happen, your code will not work correctly on certain +platforms and will cause performance problems on others. + + +Code that does not cause unaligned access +========================================= + +At first, the concepts above may seem a little hard to relate to actual +coding practice. After all, you don't have a great deal of control over +memory addresses of certain variables, etc. + +Fortunately things are not too complex, as in most cases, the compiler +ensures that things will work for you. For example, take the following +structure:: + + struct foo { + u16 field1; + u32 field2; + u8 field3; + }; + +Let us assume that an instance of the above structure resides in memory +starting at address 0x10000. With a basic level of understanding, it would +not be unreasonable to expect that accessing field2 would cause an unaligned +access. You'd be expecting field2 to be located at offset 2 bytes into the +structure, i.e. address 0x10002, but that address is not evenly divisible +by 4 (remember, we're reading a 4 byte value here). + +Fortunately, the compiler understands the alignment constraints, so in the +above case it would insert 2 bytes of padding in between field1 and field2. +Therefore, for standard structure types you can always rely on the compiler +to pad structures so that accesses to fields are suitably aligned (assuming +you do not cast the field to a type of different length). + +Similarly, you can also rely on the compiler to align variables and function +parameters to a naturally aligned scheme, based on the size of the type of +the variable. + +At this point, it should be clear that accessing a single byte (u8 or char) +will never cause an unaligned access, because all memory addresses are evenly +divisible by one. + +On a related topic, with the above considerations in mind you may observe +that you could reorder the fields in the structure in order to place fields +where padding would otherwise be inserted, and hence reduce the overall +resident memory size of structure instances. The optimal layout of the +above example is:: + + struct foo { + u32 field2; + u16 field1; + u8 field3; + }; + +For a natural alignment scheme, the compiler would only have to add a single +byte of padding at the end of the structure. This padding is added in order +to satisfy alignment constraints for arrays of these structures. + +Another point worth mentioning is the use of __attribute__((packed)) on a +structure type. 
This GCC-specific attribute tells the compiler never to +insert any padding within structures, useful when you want to use a C struct +to represent some data that comes in a fixed arrangement 'off the wire'. + +You might be inclined to believe that usage of this attribute can easily +lead to unaligned accesses when accessing fields that do not satisfy +architectural alignment requirements. However, again, the compiler is aware +of the alignment constraints and will generate extra instructions to perform +the memory access in a way that does not cause unaligned access. Of course, +the extra instructions obviously cause a loss in performance compared to the +non-packed case, so the packed attribute should only be used when avoiding +structure padding is of importance. + + +Code that causes unaligned access +================================= + +With the above in mind, let's move onto a real life example of a function +that can cause an unaligned memory access. The following function taken +from include/linux/etherdevice.h is an optimized routine to compare two +ethernet MAC addresses for equality:: + + bool ether_addr_equal(const u8 *addr1, const u8 *addr2) + { + #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + u32 fold = ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) | + ((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4))); + + return fold == 0; + #else + const u16 *a = (const u16 *)addr1; + const u16 *b = (const u16 *)addr2; + return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) == 0; + #endif + } + +In the above function, when the hardware has efficient unaligned access +capability, there is no issue with this code. But when the hardware isn't +able to access memory on arbitrary boundaries, the reference to a[0] causes +2 bytes (16 bits) to be read from memory starting at address addr1. + +Think about what would happen if addr1 was an odd address such as 0x10003. +(Hint: it'd be an unaligned access.) + +Despite the potential unaligned access problems with the above function, it +is included in the kernel anyway but is understood to only work normally on +16-bit-aligned addresses. It is up to the caller to ensure this alignment or +not use this function at all. This alignment-unsafe function is still useful +as it is a decent optimization for the cases when you can ensure alignment, +which is true almost all of the time in ethernet networking context. + + +Here is another example of some code that could cause unaligned accesses:: + + void myfunc(u8 *data, u32 value) + { + [...] + *((u32 *) data) = cpu_to_le32(value); + [...] + } + +This code will cause unaligned accesses every time the data parameter points +to an address that is not evenly divisible by 4. + +In summary, the 2 main scenarios where you may run into unaligned access +problems involve: + + 1. Casting variables to types of different lengths + 2. Pointer arithmetic followed by access to at least 2 bytes of data + + +Avoiding unaligned accesses +=========================== + +The easiest way to avoid unaligned access is to use the get_unaligned() and +put_unaligned() macros provided by the <linux/unaligned.h> header file. + +Going back to an earlier example of code that potentially causes unaligned +access:: + + void myfunc(u8 *data, u32 value) + { + [...] + *((u32 *) data) = cpu_to_le32(value); + [...] + } + +To avoid the unaligned memory access, you would rewrite it as follows:: + + void myfunc(u8 *data, u32 value) + { + [...] + value = cpu_to_le32(value); + put_unaligned(value, (u32 *) data); + [...] 
+ } + +The get_unaligned() macro works similarly. Assuming 'data' is a pointer to +memory and you wish to avoid unaligned access, its usage is as follows:: + + u32 value = get_unaligned((u32 *) data); + +These macros work for memory accesses of any length (not just 32 bits as +in the examples above). Be aware that when compared to standard access of +aligned memory, using these macros to access unaligned memory can be costly in +terms of performance. + +If use of such macros is not convenient, another option is to use memcpy(), +where the source or destination (or both) are of type u8* or unsigned char*. +Due to the byte-wise nature of this operation, unaligned accesses are avoided. + + +Alignment vs. Networking +======================== + +On architectures that require aligned loads, networking requires that the IP +header is aligned on a four-byte boundary to optimise the IP stack. For +regular ethernet hardware, the constant NET_IP_ALIGN is used. On most +architectures this constant has the value 2 because the normal ethernet +header is 14 bytes long, so in order to get proper alignment one needs to +DMA to an address which can be expressed as 4*n + 2. One notable exception +here is powerpc which defines NET_IP_ALIGN to 0 because DMA to unaligned +addresses can be very expensive and dwarf the cost of unaligned loads. + +For some ethernet hardware that cannot DMA to unaligned addresses like +4*n+2 or non-ethernet hardware, this can be a problem, and it is then +required to copy the incoming frame into an aligned buffer. Because this is +unnecessary on architectures that can do unaligned accesses, the code can be +made dependent on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS like so:: + + #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + skb = original skb + #else + skb = copy skb + #endif diff --git a/Documentation/core-api/union_find.rst b/Documentation/core-api/union_find.rst new file mode 100644 index 000000000000..6df8b94fdb5a --- /dev/null +++ b/Documentation/core-api/union_find.rst @@ -0,0 +1,106 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Union-Find in Linux +==================== + + +:Date: June 21, 2024 +:Author: Xavier <xavier_qy@163.com> + +What is union-find, and what is it used for? +------------------------------------------------ + +Union-find is a data structure used to handle the merging and querying +of disjoint sets. The primary operations supported by union-find are: + + Initialization: Resetting each element as an individual set, with + each set's initial parent node pointing to itself. + + Find: Determine which set a particular element belongs to, usually by + returning a “representative element” of that set. This operation + is used to check if two elements are in the same set. + + Union: Merge two sets into one. + +As a data structure used to maintain sets (groups), union-find is commonly +utilized to solve problems related to offline queries, dynamic connectivity, +and graph theory. It is also a key component in Kruskal's algorithm for +computing the minimum spanning tree, which is crucial in scenarios like +network routing. Consequently, union-find is widely referenced. Additionally, +union-find has applications in symbolic computation, register allocation, +and more. + +Space Complexity: O(n), where n is the number of nodes. + +Time Complexity: Using path compression can reduce the time complexity of +the find operation, and using union by rank can reduce the time complexity +of the union operation. 
These optimizations reduce the average time +complexity of each find and union operation to O(α(n)), where α(n) is the +inverse Ackermann function. This can be roughly considered a constant time +complexity for practical purposes. + +This document covers use of the Linux union-find implementation. For more +information on the nature and implementation of union-find, see: + + Wikipedia entry on union-find + https://en.wikipedia.org/wiki/Disjoint-set_data_structure + +Linux implementation of union-find +----------------------------------- + +Linux's union-find implementation resides in the file "lib/union_find.c". +To use it, "#include <linux/union_find.h>". + +The union-find data structure is defined as follows:: + + struct uf_node { + struct uf_node *parent; + unsigned int rank; + }; + +In this structure, parent points to the parent node of the current node. +The rank field represents the height of the current tree. During a union +operation, the tree with the smaller rank is attached under the tree with the +larger rank to maintain balance. + +Initializing union-find +----------------------- + +You can complete the initialization using either static or initialization +interface. Initialize the parent pointer to point to itself and set the rank +to 0. +Example:: + + struct uf_node my_node = UF_INIT_NODE(my_node); + +or + + uf_node_init(&my_node); + +Find the Root Node of union-find +-------------------------------- + +This operation is mainly used to determine whether two nodes belong to the same +set in the union-find. If they have the same root, they are in the same set. +During the find operation, path compression is performed to improve the +efficiency of subsequent find operations. +Example:: + + int connected; + struct uf_node *root1 = uf_find(&node_1); + struct uf_node *root2 = uf_find(&node_2); + if (root1 == root2) + connected = 1; + else + connected = 0; + +Union Two Sets in union-find +---------------------------- + +To union two sets in the union-find, you first find their respective root nodes +and then link the smaller node to the larger node based on the rank of the root +nodes. +Example:: + + uf_union(&node_1, &node_2); diff --git a/Documentation/core-api/watch_queue.rst b/Documentation/core-api/watch_queue.rst new file mode 100644 index 000000000000..54f13ad5fc17 --- /dev/null +++ b/Documentation/core-api/watch_queue.rst @@ -0,0 +1,343 @@ +============================== +General notification mechanism +============================== + +The general notification mechanism is built on top of the standard pipe driver +whereby it effectively splices notification messages from the kernel into pipes +opened by userspace. This can be used in conjunction with:: + + * Key/keyring notifications + + +The notifications buffers can be enabled by: + + "General setup"/"General notification queue" + (CONFIG_WATCH_QUEUE) + +This document has the following sections: + +.. contents:: :local: + + +Overview +======== + +This facility appears as a pipe that is opened in a special mode. The pipe's +internal ring buffer is used to hold messages that are generated by the kernel. +These messages are then read out by read(). Splice and similar are disabled on +such pipes due to them wanting to, under some circumstances, revert their +additions to the ring - which might end up interleaved with notification +messages. + +The owner of the pipe has to tell the kernel which sources it would like to +watch through that pipe. Only sources that have been connected to a pipe will +insert messages into it. 
Note that a source may be bound to multiple pipes and insert messages into
all of them simultaneously.

Filters may also be emplaced on a pipe so that certain source types and
subevents can be ignored if they're not of interest.

A message will be discarded if there isn't a slot available in the ring or if
no preallocated message buffer is available. In both of these cases, read()
will insert a WATCH_META_LOSS_NOTIFICATION message into the output buffer after
the last message currently in the buffer has been read.

Note that when producing a notification, the kernel does not wait for the
consumers to collect it, but rather just continues on. This means that
notifications can be generated whilst spinlocks are held and also protects the
kernel from being held up indefinitely by a userspace malfunction.


Message Structure
=================

Notification messages begin with a short header::

    struct watch_notification {
        __u32 type:24;
        __u32 subtype:8;
        __u32 info;
    };

"type" indicates the source of the notification record and "subtype" indicates
the type of record from that source (see the Watch Sources section below). The
type may also be "WATCH_TYPE_META". This is a special record type generated
internally by the watch queue itself. There are two subtypes:

  * WATCH_META_REMOVAL_NOTIFICATION
  * WATCH_META_LOSS_NOTIFICATION

The first indicates that an object on which a watch was installed was removed
or destroyed and the second indicates that some messages have been lost.

"info" indicates a bunch of things, including:

  * The length of the message in bytes, including the header (mask with
    WATCH_INFO_LENGTH and shift by WATCH_INFO_LENGTH__SHIFT). This indicates
    the size of the record, which may be between 8 and 127 bytes.

  * The watch ID (mask with WATCH_INFO_ID and shift by WATCH_INFO_ID__SHIFT).
    This indicates the caller's ID of the watch, which may be between 0
    and 255. Multiple watches may share a queue, and this provides a means to
    distinguish them.

  * A type-specific field (WATCH_INFO_TYPE_INFO). This is set by the
    notification producer to indicate some meaning specific to the type and
    subtype.

Everything in info apart from the length can be used for filtering.

The header can be followed by supplementary information. The format of this
is defined by the type and subtype.


Watch List (Notification Source) API
====================================

A "watch list" is a list of watchers that are subscribed to a source of
notifications. A list may be attached to an object (say a key or a superblock)
or may be global (say for device events). From a userspace perspective, a
non-global watch list is typically referred to by reference to the object it
belongs to (such as using KEYCTL_NOTIFY and giving it a key serial number to
watch that specific key).

To manage a watch list, the following functions are provided (a usage sketch
follows the list):

  * ::

        void init_watch_list(struct watch_list *wlist,
                             void (*release_watch)(struct watch *wlist));

    Initialise a watch list. If ``release_watch`` is not NULL, then this
    indicates a function that should be called when the watch_list object is
    destroyed to discard any references the watch list holds on the watched
    object.

  * ``void remove_watch_list(struct watch_list *wlist);``

    This removes all of the watches subscribed to a watch_list and frees them
    and then destroys the watch_list object itself.
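As a rough illustration, a driver that owns a watchable object might wire
these two functions up as follows. This is only a sketch: the watch_list
helpers are the ones described above, while ``struct foo_object`` and the
``foo_*`` functions are hypothetical names invented for the example::

    struct foo_object {
        struct watch_list watchers;
        /* ... the rest of the object's state ... */
    };

    /* Called for each watch when the watch list is torn down; drop any
     * reference on the watched object that was taken when the watch was
     * set up. */
    static void foo_release_watch(struct watch *watch)
    {
        /* e.g. foo_put(watch->private); */
    }

    static void foo_init(struct foo_object *foo)
    {
        init_watch_list(&foo->watchers, foo_release_watch);
    }

    static void foo_destroy(struct foo_object *foo)
    {
        /* Unsubscribes and frees all remaining watches, then tears down
         * the watch_list itself. */
        remove_watch_list(&foo->watchers);
    }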
+ + +Watch Queue (Notification Output) API +===================================== + +A "watch queue" is the buffer allocated by an application that notification +records will be written into. The workings of this are hidden entirely inside +of the pipe device driver, but it is necessary to gain a reference to it to set +a watch. These can be managed with: + + * ``struct watch_queue *get_watch_queue(int fd);`` + + Since watch queues are indicated to the kernel by the fd of the pipe that + implements the buffer, userspace must hand that fd through a system call. + This can be used to look up an opaque pointer to the watch queue from the + system call. + + * ``void put_watch_queue(struct watch_queue *wqueue);`` + + This discards the reference obtained from ``get_watch_queue()``. + + +Watch Subscription API +====================== + +A "watch" is a subscription on a watch list, indicating the watch queue, and +thus the buffer, into which notification records should be written. The watch +queue object may also carry filtering rules for that object, as set by +userspace. Some parts of the watch struct can be set by the driver:: + + struct watch { + union { + u32 info_id; /* ID to be OR'd in to info field */ + ... + }; + void *private; /* Private data for the watched object */ + u64 id; /* Internal identifier */ + ... + }; + +The ``info_id`` value should be an 8-bit number obtained from userspace and +shifted by WATCH_INFO_ID__SHIFT. This is OR'd into the WATCH_INFO_ID field of +struct watch_notification::info when and if the notification is written into +the associated watch queue buffer. + +The ``private`` field is the driver's data associated with the watch_list and +is cleaned up by the ``watch_list::release_watch()`` method. + +The ``id`` field is the source's ID. Notifications that are posted with a +different ID are ignored. + +The following functions are provided to manage watches: + + * ``void init_watch(struct watch *watch, struct watch_queue *wqueue);`` + + Initialise a watch object, setting its pointer to the watch queue, using + appropriate barriering to avoid lockdep complaints. + + * ``int add_watch_to_object(struct watch *watch, struct watch_list *wlist);`` + + Subscribe a watch to a watch list (notification source). The + driver-settable fields in the watch struct must have been set before this + is called. + + * :: + + int remove_watch_from_object(struct watch_list *wlist, + struct watch_queue *wqueue, + u64 id, false); + + Remove a watch from a watch list, where the watch must match the specified + watch queue (``wqueue``) and object identifier (``id``). A notification + (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue to + indicate that the watch got removed. + + * ``int remove_watch_from_object(struct watch_list *wlist, NULL, 0, true);`` + + Remove all the watches from a watch list. It is expected that this will be + called preparatory to destruction and that the watch list will be + inaccessible to new watches by this point. A notification + (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue of each + subscribed watch to indicate that the watch got removed. 
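Putting the pieces above together, a driver's "watch this object" operation
might look roughly like the sketch below. It is untested and simplified:
``foo_object``, ``foo->watchers``, ``foo->id`` and ``foo_watch()`` are
hypothetical, and the sketch assumes get_watch_queue() reports failure with an
ERR_PTR; only the watch and watch-queue functions named in this and the
previous sections are part of the real API::

    static long foo_watch(struct foo_object *foo, int watch_fd, u32 watch_id)
    {
        struct watch_queue *wqueue;
        struct watch *watch;
        long ret;

        if (watch_id > 0xff)
            return -EINVAL;

        wqueue = get_watch_queue(watch_fd);
        if (IS_ERR(wqueue))
            return PTR_ERR(wqueue);

        ret = -ENOMEM;
        watch = kzalloc(sizeof(*watch), GFP_KERNEL);
        if (!watch)
            goto out_put_queue;

        init_watch(watch, wqueue);
        watch->info_id = watch_id << WATCH_INFO_ID__SHIFT;
        watch->id      = foo->id;   /* must match the id used when posting */
        watch->private = foo;

        ret = add_watch_to_object(watch, &foo->watchers);
        if (ret < 0)
            kfree(watch);

    out_put_queue:
        put_watch_queue(wqueue);
        return ret;
    }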
+ + +Notification Posting API +======================== + +To post a notification to watch list so that the subscribed watches can see it, +the following function should be used:: + + void post_watch_notification(struct watch_list *wlist, + struct watch_notification *n, + const struct cred *cred, + u64 id); + +The notification should be preformatted and a pointer to the header (``n``) +should be passed in. The notification may be larger than this and the size in +units of buffer slots is noted in ``n->info & WATCH_INFO_LENGTH``. + +The ``cred`` struct indicates the credentials of the source (subject) and is +passed to the LSMs, such as SELinux, to allow or suppress the recording of the +note in each individual queue according to the credentials of that queue +(object). + +The ``id`` is the ID of the source object (such as the serial number on a key). +Only watches that have the same ID set in them will see this notification. + + +Watch Sources +============= + +Any particular buffer can be fed from multiple sources. Sources include: + + * WATCH_TYPE_KEY_NOTIFY + + Notifications of this type indicate changes to keys and keyrings, including + the changes of keyring contents or the attributes of keys. + + See Documentation/security/keys/core.rst for more information. + + +Event Filtering +=============== + +Once a watch queue has been created, a set of filters can be applied to limit +the events that are received using:: + + struct watch_notification_filter filter = { + ... + }; + ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter) + +The filter description is a variable of type:: + + struct watch_notification_filter { + __u32 nr_filters; + __u32 __reserved; + struct watch_notification_type_filter filters[]; + }; + +Where "nr_filters" is the number of filters in filters[] and "__reserved" +should be 0. The "filters" array has elements of the following type:: + + struct watch_notification_type_filter { + __u32 type; + __u32 info_filter; + __u32 info_mask; + __u32 subtype_filter[8]; + }; + +Where: + + * ``type`` is the event type to filter for and should be something like + "WATCH_TYPE_KEY_NOTIFY" + + * ``info_filter`` and ``info_mask`` act as a filter on the info field of the + notification record. The notification is only written into the buffer if:: + + (watch.info & info_mask) == info_filter + + This could be used, for example, to ignore events that are not exactly on + the watched point in a mount tree. + + * ``subtype_filter`` is a bitmask indicating the subtypes that are of + interest. Bit 0 of subtype_filter[0] corresponds to subtype 0, bit 1 to + subtype 1, and so on. + +If the argument to the ioctl() is NULL, then the filters will be removed and +all events from the watched sources will come through. 
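For example, a userspace program that only cares about key/keyring events with
subtypes 0-2 might install a filter like the following sketch. It is untested;
the choice of subtype bits and the use of ``fds[1]`` as the pipe fd are
illustrative, and statically initialising the flexible ``filters[]`` array
relies on a GCC extension::

    static struct watch_notification_filter filter = {
        .nr_filters = 1,
        .filters = {
            [0] = {
                .type              = WATCH_TYPE_KEY_NOTIFY,
                /* Pass only subtypes 0, 1 and 2. */
                .subtype_filter[0] = 0x07,
            },
        },
    };

    if (ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter) == -1)
        perror("IOC_WATCH_QUEUE_SET_FILTER");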
Userspace Code Example
======================

A buffer is created with something like the following::

    pipe2(fds, O_TMPFILE);
    ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, 256);

It can then be set to receive keyring change notifications::

    keyctl(KEYCTL_WATCH_KEY, KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);

The notifications can then be consumed by something like the following::

    static void consumer(int rfd, struct watch_queue_buffer *buf)
    {
        unsigned char buffer[128];
        ssize_t buf_len;

        while (buf_len = read(rfd, buffer, sizeof(buffer)),
               buf_len > 0
               ) {
            void *p = buffer;
            void *end = buffer + buf_len;
            while (p < end) {
                union {
                    struct watch_notification n;
                    unsigned char buf1[128];
                } n;
                size_t largest, len;

                largest = end - p;
                if (largest > 128)
                    largest = 128;
                memcpy(&n, p, largest);

                /* The record length is in bytes, including the header. */
                len = (n.n.info & WATCH_INFO_LENGTH) >>
                    WATCH_INFO_LENGTH__SHIFT;
                if (len == 0 || len > largest)
                    return;

                switch (n.n.type) {
                case WATCH_TYPE_META:
                    got_meta(&n.n);
                    break;
                case WATCH_TYPE_KEY_NOTIFY:
                    saw_key_change(&n.n);
                    break;
                }

                p += len;
            }
        }
    }
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst index 00a5ba51e63f..e295835fc116 100644 --- a/Documentation/core-api/workqueue.rst +++ b/Documentation/core-api/workqueue.rst @@ -1,6 +1,6 @@ -==================================== -Concurrency Managed Workqueue (cmwq) -==================================== +========= +Workqueue +========= :Date: September, 2010 :Author: Tejun Heo <tj@kernel.org> @@ -25,8 +25,8 @@ there is no work item left on the workqueue the worker becomes idle. When a new work item gets queued, the worker begins executing again. -Why cmwq? -========= +Why Concurrency Managed Workqueue? +================================== In the original wq implementation, a multi threaded (MT) wq had one worker thread per CPU and a single threaded (ST) wq had one worker @@ -77,10 +77,12 @@ wants a function to be executed asynchronously it has to set up a work item pointing to that function and queue that work item on a workqueue. -Special purpose threads, called worker threads, execute the functions -off of the queue, one after the other. If no work is queued, the -worker threads become idle. These worker threads are managed in so -called worker-pools. +A work item can be executed in either a thread or the BH (softirq) context. + +For threaded workqueues, special purpose threads, called [k]workers, execute +the functions off of the queue, one after the other. If no work is queued, +the worker threads become idle. These worker threads are managed in +worker-pools. The cmwq design differentiates between the user-facing workqueues that subsystems and drivers queue work items on and the backend mechanism @@ -91,6 +93,12 @@ for high priority ones, for each possible CPU and some extra worker-pools to serve work items queued on unbound workqueues - the number of these backing pools is dynamic. +BH workqueues use the same framework. However, as there can only be one +concurrent execution context, there's no need to worry about concurrency. +Each per-CPU BH worker pool contains only one pseudo worker which represents +the BH execution context. A BH workqueue can be considered a convenience +interface to softirq. + Subsystems and drivers can create and queue work items through special workqueue API functions as they see fit.
They can influence some aspects of the way the work items are executed by setting flags on the @@ -106,7 +114,7 @@ unless specifically overridden, a work item of a bound workqueue will be queued on the worklist of either normal or highpri worker-pool that is associated to the CPU the issuer is running on. -For any worker pool implementation, managing the concurrency level +For any thread pool implementation, managing the concurrency level (how many execution contexts are active) is an important issue. cmwq tries to keep the concurrency at a minimal but sufficient level. Minimal to save resources and sufficient in that the system is used at @@ -164,6 +172,17 @@ resources, scheduled and executed. ``flags`` --------- +``WQ_BH`` + BH workqueues can be considered a convenience interface to softirq. BH + workqueues are always per-CPU and all BH work items are executed in the + queueing CPU's softirq context in the queueing order. + + All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the + only allowed additional flag. + + BH work items cannot sleep. All other features such as delayed queueing, + flushing and canceling are supported. + ``WQ_UNBOUND`` Work items queued to an unbound wq are served by the special worker-pools which host workers which are not bound to any @@ -216,25 +235,20 @@ resources, scheduled and executed. This flag is meaningless for unbound wq. -Note that the flag ``WQ_NON_REENTRANT`` no longer exists as all -workqueues are now non-reentrant - any work item is guaranteed to be -executed by at most one worker system-wide at any given time. - ``max_active`` -------------- -``@max_active`` determines the maximum number of execution contexts -per CPU which can be assigned to the work items of a wq. For example, -with ``@max_active`` of 16, at most 16 work items of the wq can be -executing at the same time per CPU. +``@max_active`` determines the maximum number of execution contexts per +CPU which can be assigned to the work items of a wq. For example, with +``@max_active`` of 16, at most 16 work items of the wq can be executing +at the same time per CPU. This is always a per-CPU attribute, even for +unbound workqueues. -Currently, for a bound wq, the maximum limit for ``@max_active`` is -512 and the default value used when 0 is specified is 256. For an -unbound wq, the limit is higher of 512 and 4 * -``num_possible_cpus()``. These values are chosen sufficiently high -such that they are not the limiting factor while providing protection -in runaway cases. +The maximum limit for ``@max_active`` is 2048 and the default value used +when 0 is specified is 1024. These values are chosen sufficiently high +such that they are not the limiting factor while providing protection in +runaway cases. The number of active work items of a wq is usually regulated by the users of the wq, more specifically, by how many work items the users @@ -242,15 +256,11 @@ may queue at the same time. Unless there is a specific need for throttling the number of active work items, specifying '0' is recommended. -Some users depend on the strict execution ordering of ST wq. The -combination of ``@max_active`` of 1 and ``WQ_UNBOUND`` used to -achieve this behavior. Work items on such wq were always queued to the -unbound worker-pools and only one work item could be active at any given -time thus achieving the same ordering property as ST wq. - -In the current implementation the above configuration only guarantees -ST behavior within a given NUMA node. 
Instead ``alloc_ordered_queue()`` should -be used to achieve system-wide ST behavior. +Some users depend on strict execution ordering where only one work item +is in flight at any given time and the work items are processed in +queueing order. While the combination of ``@max_active`` of 1 and +``WQ_UNBOUND`` used to achieve this behavior, this is no longer the +case. Use alloc_ordered_workqueue() instead. Example Execution Scenarios @@ -347,11 +357,366 @@ Guidelines difference in execution characteristics between using a dedicated wq and a system wq. + Note: If something may generate more than @max_active outstanding + work items (do stress test your producers), it may saturate a system + wq and potentially lead to deadlock. It should utilize its own + dedicated workqueue rather than the system wq. + * Unless work items are expected to consume a huge amount of CPU cycles, using a bound wq is usually beneficial due to the increased level of locality in wq operations and work item execution. +Affinity Scopes +=============== + +An unbound workqueue groups CPUs according to its affinity scope to improve +cache locality. For example, if a workqueue is using the default affinity +scope of "cache", it will group CPUs according to last level cache +boundaries. A work item queued on the workqueue will be assigned to a worker +on one of the CPUs which share the last level cache with the issuing CPU. +Once started, the worker may or may not be allowed to move outside the scope +depending on the ``affinity_strict`` setting of the scope. + +Workqueue currently supports the following affinity scopes. + +``default`` + Use the scope in module parameter ``workqueue.default_affinity_scope`` + which is always set to one of the scopes below. + +``cpu`` + CPUs are not grouped. A work item issued on one CPU is processed by a + worker on the same CPU. This makes unbound workqueues behave as per-cpu + workqueues without concurrency management. + +``smt`` + CPUs are grouped according to SMT boundaries. This usually means that the + logical threads of each physical CPU core are grouped together. + +``cache`` + CPUs are grouped according to cache boundaries. Which specific cache + boundary is used is determined by the arch code. L3 is used in a lot of + cases. This is the default affinity scope. + +``numa`` + CPUs are grouped according to NUMA boundaries. + +``system`` + All CPUs are put in the same group. Workqueue makes no effort to process a + work item on a CPU close to the issuing CPU. + +The default affinity scope can be changed with the module parameter +``workqueue.default_affinity_scope`` and a specific workqueue's affinity +scope can be changed using ``apply_workqueue_attrs()``. + +If ``WQ_SYSFS`` is set, the workqueue will have the following affinity scope +related interface files under its ``/sys/devices/virtual/workqueue/WQ_NAME/`` +directory. + +``affinity_scope`` + Read to see the current affinity scope. Write to change. + + When default is the current scope, reading this file will also show the + current effective scope in parentheses, for example, ``default (cache)``. + +``affinity_strict`` + 0 by default indicating that affinity scopes are not strict. When a work + item starts execution, workqueue makes a best-effort attempt to ensure + that the worker is inside its affinity scope, which is called + repatriation. Once started, the scheduler is free to move the worker + anywhere in the system as it sees fit. 
This enables benefiting from scope + locality while still being able to utilize other CPUs if necessary and + available. + + If set to 1, all workers of the scope are guaranteed always to be in the + scope. This may be useful when crossing affinity scopes has other + implications, for example, in terms of power consumption or workload + isolation. Strict NUMA scope can also be used to match the workqueue + behavior of older kernels. + + +Affinity Scopes and Performance +=============================== + +It'd be ideal if an unbound workqueue's behavior is optimal for vast +majority of use cases without further tuning. Unfortunately, in the current +kernel, there exists a pronounced trade-off between locality and utilization +necessitating explicit configurations when workqueues are heavily used. + +Higher locality leads to higher efficiency where more work is performed for +the same number of consumed CPU cycles. However, higher locality may also +cause lower overall system utilization if the work items are not spread +enough across the affinity scopes by the issuers. The following performance +testing with dm-crypt clearly illustrates this trade-off. + +The tests are run on a CPU with 12-cores/24-threads split across four L3 +caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency. +``/dev/dm-0`` is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and +opened with ``cryptsetup`` with default settings. + + +Scenario 1: Enough issuers and work spread across the machine +------------------------------------------------------------- + +The command used: :: + + $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \ + --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \ + --name=iops-test-job --verify=sha512 + +There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512`` +makes ``fio`` generate and read back the content each time which makes +execution locality matter between the issuer and ``kcryptd``. The following +are the read bandwidths and CPU utilizations depending on different affinity +scope settings on ``kcryptd`` measured over five runs. Bandwidths are in +MiBps, and CPU util in percents. + +.. list-table:: + :widths: 16 20 20 + :header-rows: 1 + + * - Affinity + - Bandwidth (MiBps) + - CPU util (%) + + * - system + - 1159.40 ±1.34 + - 99.31 ±0.02 + + * - cache + - 1166.40 ±0.89 + - 99.34 ±0.01 + + * - cache (strict) + - 1166.00 ±0.71 + - 99.35 ±0.01 + +With enough issuers spread across the system, there is no downside to +"cache", strict or otherwise. All three configurations saturate the whole +machine but the cache-affine ones outperform by 0.6% thanks to improved +locality. + + +Scenario 2: Fewer issuers, enough work for saturation +----------------------------------------------------- + +The command used: :: + + $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \ + --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \ + --time_based --group_reporting --name=iops-test-job --verify=sha512 + +The only difference from the previous scenario is ``--numjobs=8``. There are +a third of the issuers but is still enough total work to saturate the +system. + +.. list-table:: + :widths: 16 20 20 + :header-rows: 1 + + * - Affinity + - Bandwidth (MiBps) + - CPU util (%) + + * - system + - 1155.40 ±0.89 + - 97.41 ±0.05 + + * - cache + - 1154.40 ±1.14 + - 96.15 ±0.09 + + * - cache (strict) + - 1112.00 ±4.64 + - 93.26 ±0.35 + +This is more than enough work to saturate the system. 
Both "system" and "cache" are nearly saturating the machine but not fully.
"cache" is using less CPU but the better efficiency puts it at the same
bandwidth as "system".

Eight issuers moving around over four L3 cache scopes still allow "cache
(strict)" to mostly saturate the machine but the loss of work conservation is
now starting to hurt with 3.7% bandwidth loss.


Scenario 3: Even fewer issuers, not enough work to saturate
-----------------------------------------------------------

The command used: ::

  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
    --time_based --group_reporting --name=iops-test-job --verify=sha512

Again, the only difference is ``--numjobs=4``. With the number of issuers
reduced to four, there now isn't enough work to saturate the whole system
and the bandwidth becomes dependent on completion latencies.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 993.60 ±1.82
     - 75.49 ±0.06

   * - cache
     - 973.40 ±1.52
     - 74.90 ±0.07

   * - cache (strict)
     - 828.20 ±4.49
     - 66.84 ±0.29

Now, the tradeoff between locality and utilization is clearer. "cache" shows
2% bandwidth loss compared to "system" and "cache (strict)" a whopping 20%.


Conclusion and Recommendations
------------------------------

In the above experiments, the efficiency advantage of the "cache" affinity
scope over "system" is, while consistent and noticeable, small. However, the
impact is dependent on the distances between the scopes and may be more
pronounced in processors with more complex topologies.

While the loss of work-conservation in certain scenarios hurts, it is a lot
better than "cache (strict)" and maximizing workqueue utilization is
unlikely to be the common case anyway. As such, "cache" is the default
affinity scope for unbound pools.

* As there is no one option which is great for most cases, workqueue usages
  that may consume a significant amount of CPU are recommended to configure
  the workqueues using ``apply_workqueue_attrs()`` and/or enable
  ``WQ_SYSFS``.

* An unbound workqueue with strict "cpu" affinity scope behaves the same as
  a ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advantage to the
  latter and an unbound workqueue provides a lot more flexibility.

* Affinity scopes were introduced in Linux v6.5. To emulate the previous
  behavior, use strict "numa" affinity scope, as in the sketch after this
  list.

* The loss of work-conservation in non-strict affinity scopes is likely
  originating from the scheduler. There is no theoretical reason why the
  kernel wouldn't be able to do the right thing and maintain
  work-conservation in most cases. As such, it is possible that future
  scheduler improvements may make most of these tunables unnecessary.
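For instance, a driver that knows its unbound workqueue is CPU-heavy could
pin down the scope programmatically. The sketch below is untested and makes
assumptions beyond the text above: only ``apply_workqueue_attrs()`` is named
here, while ``alloc_workqueue_attrs()``/``free_workqueue_attrs()``, the
``affn_scope``/``affn_strict`` fields and ``WQ_AFFN_NUMA`` are assumed to be
the attrs interface provided by kernels that support affinity scopes::

    /* Emulate the pre-v6.5 behavior: strict "numa" affinity scope. */
    static int foo_tune_wq(struct workqueue_struct *wq)
    {
        struct workqueue_attrs *attrs;
        int ret;

        attrs = alloc_workqueue_attrs();
        if (!attrs)
            return -ENOMEM;

        attrs->affn_scope = WQ_AFFN_NUMA;   /* group workers by NUMA node */
        attrs->affn_strict = true;          /* keep workers inside the node */

        ret = apply_workqueue_attrs(wq, attrs);
        free_workqueue_attrs(attrs);
        return ret;
    }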
+ + +Examining Configuration +======================= + +Use tools/workqueue/wq_dump.py to examine unbound CPU affinity +configuration, worker pools and how workqueues map to the pools: :: + + $ tools/workqueue/wq_dump.py + Affinity Scopes + =============== + wq_unbound_cpumask=0000000f + + CPU + nr_pods 4 + pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 + pod_node [0]=0 [1]=0 [2]=1 [3]=1 + cpu_pod [0]=0 [1]=1 [2]=2 [3]=3 + + SMT + nr_pods 4 + pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 + pod_node [0]=0 [1]=0 [2]=1 [3]=1 + cpu_pod [0]=0 [1]=1 [2]=2 [3]=3 + + CACHE (default) + nr_pods 2 + pod_cpus [0]=00000003 [1]=0000000c + pod_node [0]=0 [1]=1 + cpu_pod [0]=0 [1]=0 [2]=1 [3]=1 + + NUMA + nr_pods 2 + pod_cpus [0]=00000003 [1]=0000000c + pod_node [0]=0 [1]=1 + cpu_pod [0]=0 [1]=0 [2]=1 [3]=1 + + SYSTEM + nr_pods 1 + pod_cpus [0]=0000000f + pod_node [0]=-1 + cpu_pod [0]=0 [1]=0 [2]=0 [3]=0 + + Worker Pools + ============ + pool[00] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 0 + pool[01] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 0 + pool[02] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 1 + pool[03] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 1 + pool[04] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 2 + pool[05] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 2 + pool[06] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 3 + pool[07] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 3 + pool[08] ref=42 nice= 0 idle/workers= 6/ 6 cpus=0000000f + pool[09] ref=28 nice= 0 idle/workers= 3/ 3 cpus=00000003 + pool[10] ref=28 nice= 0 idle/workers= 17/ 17 cpus=0000000c + pool[11] ref= 1 nice=-20 idle/workers= 1/ 1 cpus=0000000f + pool[12] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=00000003 + pool[13] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=0000000c + + Workqueue CPU -> pool + ===================== + [ workqueue \ CPU 0 1 2 3 dfl] + events percpu 0 2 4 6 + events_highpri percpu 1 3 5 7 + events_long percpu 0 2 4 6 + events_unbound unbound 9 9 10 10 8 + events_freezable percpu 0 2 4 6 + events_power_efficient percpu 0 2 4 6 + events_freezable_pwr_ef percpu 0 2 4 6 + rcu_gp percpu 0 2 4 6 + rcu_par_gp percpu 0 2 4 6 + slub_flushwq percpu 0 2 4 6 + netns ordered 8 8 8 8 8 + ... + +See the command's help message for more info. + + +Monitoring +========== + +Use tools/workqueue/wq_monitor.py to monitor workqueue operations: :: + + $ tools/workqueue/wq_monitor.py events + total infl CPUtime CPUhog CMW/RPR mayday rescued + events 18545 0 6.1 0 5 - - + events_highpri 8 0 0.0 0 0 - - + events_long 3 0 0.0 0 0 - - + events_unbound 38306 0 0.1 - 7 - - + events_freezable 0 0 0.0 0 0 - - + events_power_efficient 29598 0 0.2 0 0 - - + events_freezable_pwr_ef 10 0 0.0 0 0 - - + sock_diag_events 0 0 0.0 0 0 - - + + total infl CPUtime CPUhog CMW/RPR mayday rescued + events 18548 0 6.1 0 5 - - + events_highpri 8 0 0.0 0 0 - - + events_long 3 0 0.0 0 0 - - + events_unbound 38322 0 0.1 - 7 - - + events_freezable 0 0 0.0 0 0 - - + events_power_efficient 29603 0 0.2 0 0 - - + events_freezable_pwr_ef 10 0 0.0 0 0 - - + sock_diag_events 0 0 0.0 0 0 - - + + ... + +See the command's help message for more info. 
+ + Debugging ========= @@ -374,8 +739,8 @@ of possible problems: The first one can be tracked using tracing: :: - $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace_pipe > out.txt + $ echo workqueue:workqueue_queue_work > /sys/kernel/tracing/set_event + $ cat /sys/kernel/tracing/trace_pipe > out.txt (wait a few secs) ^C @@ -392,7 +757,27 @@ The work item's function should be trivially visible in the stack trace. +Non-reentrance Conditions +========================= + +Workqueue guarantees that a work item cannot be re-entrant if the following +conditions hold after a work item gets queued: + + 1. The work function hasn't been changed. + 2. No one queues the work item to another workqueue. + 3. The work item hasn't been reinitiated. + +In other words, if the above conditions hold, the work item is guaranteed to be +executed by at most one worker system-wide at any given time. + +Note that requeuing the work item (to the same queue) in the self function +doesn't break these conditions, so it's safe to do. Otherwise, caution is +required when breaking the conditions inside a work function. + + Kernel Inline Documentations Reference ====================================== .. kernel-doc:: include/linux/workqueue.h + +.. kernel-doc:: kernel/workqueue.c diff --git a/Documentation/core-api/wrappers/atomic_bitops.rst b/Documentation/core-api/wrappers/atomic_bitops.rst new file mode 100644 index 000000000000..bf24e4081a8f --- /dev/null +++ b/Documentation/core-api/wrappers/atomic_bitops.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + This is a simple wrapper to bring atomic_bitops.txt into the RST world + until such a time as that file can be converted directly. + +============= +Atomic bitops +============= + +.. raw:: latex + + \footnotesize + +.. include:: ../../atomic_bitops.txt + :literal: + +.. raw:: latex + + \normalsize diff --git a/Documentation/core-api/wrappers/atomic_t.rst b/Documentation/core-api/wrappers/atomic_t.rst new file mode 100644 index 000000000000..ed109a964c77 --- /dev/null +++ b/Documentation/core-api/wrappers/atomic_t.rst @@ -0,0 +1,19 @@ +.. SPDX-License-Identifier: GPL-2.0 + This is a simple wrapper to bring atomic_t.txt into the RST world + until such a time as that file can be converted directly. + +============ +Atomic types +============ + +.. raw:: latex + + \footnotesize + +.. include:: ../../atomic_t.txt + :literal: + +.. raw:: latex + + \normalsize + diff --git a/Documentation/core-api/wrappers/memory-barriers.rst b/Documentation/core-api/wrappers/memory-barriers.rst new file mode 100644 index 000000000000..532460b5e3eb --- /dev/null +++ b/Documentation/core-api/wrappers/memory-barriers.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + This is a simple wrapper to bring memory-barriers.txt into the RST world + until such a time as that file can be converted directly. + +============================ +Linux kernel memory barriers +============================ + +.. raw:: latex + + \footnotesize + +.. include:: ../../memory-barriers.txt + :literal: + +.. raw:: latex + + \normalsize diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst index 640934b6f7b4..c6c91cbd0c3c 100644 --- a/Documentation/core-api/xarray.rst +++ b/Documentation/core-api/xarray.rst @@ -42,8 +42,8 @@ call xa_tag_pointer() to create an entry with a tag, xa_untag_pointer() to turn a tagged entry back into an untagged pointer and xa_pointer_tag() to retrieve the tag of an entry. 
Tagged pointers use the same bits that are used to distinguish value entries from normal pointers, so you must -decide whether they want to store value entries or tagged pointers in -any particular XArray. +decide whether you want to store value entries or tagged pointers in any +particular XArray. The XArray does not support storing IS_ERR() pointers as some conflict with value entries or internal entries. @@ -52,8 +52,9 @@ An unusual feature of the XArray is the ability to create entries which occupy a range of indices. Once stored to, looking up any index in the range will return the same entry as looking up any other index in the range. Storing to any index will store to all of them. Multi-index -entries can be explicitly split into smaller entries, or storing ``NULL`` -into any entry will cause the XArray to forget about the range. +entries can be explicitly split into smaller entries. Unsetting (using +xa_erase() or xa_store() with ``NULL``) any entry will cause the XArray +to forget about the range. Normal API ========== @@ -63,13 +64,14 @@ for statically allocated XArrays or xa_init() for dynamically allocated ones. A freshly-initialised XArray contains a ``NULL`` pointer at every index. -You can then set entries using xa_store() and get entries -using xa_load(). xa_store will overwrite any entry with the -new entry and return the previous entry stored at that index. You can -use xa_erase() instead of calling xa_store() with a -``NULL`` entry. There is no difference between an entry that has never -been stored to, one that has been erased and one that has most recently -had ``NULL`` stored to it. +You can then set entries using xa_store() and get entries using +xa_load(). xa_store() will overwrite any entry with the new entry and +return the previous entry stored at that index. You can unset entries +using xa_erase() or by setting the entry to ``NULL`` using xa_store(). +There is no difference between an entry that has never been stored to +and one that has been erased with xa_erase(); an entry that has most +recently had ``NULL`` stored to it is also equivalent except if the +XArray was initialized with ``XA_FLAGS_ALLOC``. You can conditionally replace an entry at an index by using xa_cmpxchg(). Like cmpxchg(), it will only succeed if @@ -315,11 +317,15 @@ indeed the normal API is implemented in terms of the advanced API. The advanced API is only available to modules with a GPL-compatible license. The advanced API is based around the xa_state. This is an opaque data -structure which you declare on the stack using the XA_STATE() -macro. This macro initialises the xa_state ready to start walking -around the XArray. It is used as a cursor to maintain the position -in the XArray and let you compose various operations together without -having to restart from the top every time. +structure which you declare on the stack using the XA_STATE() macro. +This macro initialises the xa_state ready to start walking around the +XArray. It is used as a cursor to maintain the position in the XArray +and let you compose various operations together without having to restart +from the top every time. The contents of the xa_state are protected by +the rcu_read_lock() or the xas_lock(). If you need to drop whichever of +those locks is protecting your state and tree, you must call xas_pause() +so that future calls do not rely on the parts of the state which were +left unprotected. The xa_state is also used to store errors. You can call xas_error() to retrieve the error. 
All operations check whether @@ -475,13 +481,27 @@ or iterations will move the index to the first index in the range. Each entry will only be returned once, no matter how many indices it occupies. -Using xas_next() or xas_prev() with a multi-index xa_state is not supported. Using either of these functions on a multi-index entry will reveal sibling entries; these should be skipped over by the caller. - -Storing ``NULL`` into any index of a multi-index entry will set the entry at every index to ``NULL`` and dissolve the tie. Splitting a multi-index entry into entries occupying smaller ranges is not yet supported. +Using xas_next() or xas_prev() with a multi-index xa_state is not +supported. Using either of these functions on a multi-index entry will +reveal sibling entries; these should be skipped over by the caller. + +Storing ``NULL`` into any index of a multi-index entry will set the +entry at every index to ``NULL`` and dissolve the tie. A multi-index +entry can be split into entries occupying smaller ranges by calling +xas_split_alloc() without the xa_lock held, followed by taking the lock +and calling xas_split(), or by calling xas_try_split() with the xa_lock held. The +difference between xas_split_alloc() + xas_split() and xas_try_split() is +that xas_split_alloc() + xas_split() split the entry from the original +order to the new order in one shot uniformly, whereas xas_try_split() +iteratively splits the entry containing the index non-uniformly. +For example, to split an order-9 entry, which takes 2^(9-6)=8 slots, +assuming ``XA_CHUNK_SHIFT`` is 6, xas_split_alloc() + xas_split() need +8 xa_nodes. xas_try_split() splits the order-9 entry into +2 order-8 entries, then splits one order-8 entry, based on the given index, +to 2 order-7 entries, ..., and splits one order-1 entry to 2 order-0 entries. +When splitting the order-6 entry and a new xa_node is needed, xas_try_split() +will try to allocate one if possible. As a result, xas_try_split() would only +need 1 xa_node instead of 8.

Functions and structures
========================