Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r-- Documentation/admin-guide/README.rst | 121
-rw-r--r-- Documentation/admin-guide/acpi/dsdt-override.rst | 13
-rw-r--r-- Documentation/admin-guide/acpi/fan_performance_states.rst | 28
-rw-r--r-- Documentation/admin-guide/acpi/index.rst | 1
-rw-r--r-- Documentation/admin-guide/blockdev/index.rst | 6
-rw-r--r-- Documentation/admin-guide/blockdev/zram.rst | 25
-rw-r--r-- Documentation/admin-guide/bootconfig.rst | 31
-rw-r--r-- Documentation/admin-guide/cgroup-v1/hugetlb.rst | 4
-rw-r--r-- Documentation/admin-guide/cgroup-v1/memcg_test.rst | 2
-rw-r--r-- Documentation/admin-guide/cgroup-v1/memory.rst | 6
-rw-r--r-- Documentation/admin-guide/cgroup-v2.rst | 290
-rw-r--r-- Documentation/admin-guide/cputopology.rst | 25
-rw-r--r-- Documentation/admin-guide/device-mapper/verity.rst | 4
-rw-r--r-- Documentation/admin-guide/device-mapper/writecache.rst | 18
-rw-r--r-- Documentation/admin-guide/devices.rst | 7
-rw-r--r-- Documentation/admin-guide/devices.txt | 2
-rw-r--r-- Documentation/admin-guide/dynamic-debug-howto.rst | 246
-rw-r--r-- Documentation/admin-guide/efi-stub.rst | 4
-rw-r--r-- Documentation/admin-guide/gpio/index.rst | 1
-rw-r--r-- Documentation/admin-guide/hw-vuln/index.rst | 1
-rw-r--r-- Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst | 260
-rw-r--r-- Documentation/admin-guide/hw-vuln/spectre.rst | 59
-rw-r--r-- Documentation/admin-guide/index.rst | 1
-rw-r--r-- Documentation/admin-guide/iostats.rst | 6
-rw-r--r-- Documentation/admin-guide/kdump/kdump.rst | 10
-rw-r--r-- Documentation/admin-guide/kdump/vmcoreinfo.rst | 10
-rw-r--r-- Documentation/admin-guide/kernel-parameters.rst | 11
-rw-r--r-- Documentation/admin-guide/kernel-parameters.txt | 1010
-rw-r--r-- Documentation/admin-guide/laptops/lg-laptop.rst | 2
-rw-r--r-- Documentation/admin-guide/media/fimc.rst | 2
-rw-r--r-- Documentation/admin-guide/media/i2c-cardlist.rst | 2
-rw-r--r-- Documentation/admin-guide/media/imx7.rst | 2
-rw-r--r-- Documentation/admin-guide/media/omap3isp.rst | 2
-rw-r--r-- Documentation/admin-guide/media/omap4_camera.rst | 2
-rw-r--r-- Documentation/admin-guide/media/vimc.dot | 18
-rw-r--r-- Documentation/admin-guide/media/vimc.rst | 32
-rw-r--r-- Documentation/admin-guide/media/vivid.rst | 16
-rw-r--r-- Documentation/admin-guide/mm/cma_debugfs.rst | 10
-rw-r--r-- Documentation/admin-guide/mm/concepts.rst | 2
-rw-r--r-- Documentation/admin-guide/mm/damon/index.rst | 9
-rw-r--r-- Documentation/admin-guide/mm/damon/lru_sort.rst | 294
-rw-r--r-- Documentation/admin-guide/mm/damon/reclaim.rst | 44
-rw-r--r-- Documentation/admin-guide/mm/damon/start.rst | 13
-rw-r--r-- Documentation/admin-guide/mm/damon/usage.rst | 619
-rw-r--r-- Documentation/admin-guide/mm/hugetlbpage.rst | 6
-rw-r--r-- Documentation/admin-guide/mm/index.rst | 2
-rw-r--r-- Documentation/admin-guide/mm/ksm.rst | 54
-rw-r--r-- Documentation/admin-guide/mm/memory-hotplug.rst | 4
-rw-r--r-- Documentation/admin-guide/mm/multigen_lru.rst | 162
-rw-r--r-- Documentation/admin-guide/mm/numa_memory_policy.rst | 16
-rw-r--r-- Documentation/admin-guide/mm/pagemap.rst | 2
-rw-r--r-- Documentation/admin-guide/mm/shrinker_debugfs.rst | 135
-rw-r--r-- Documentation/admin-guide/mm/transhuge.rst | 16
-rw-r--r-- Documentation/admin-guide/mm/userfaultfd.rst | 41
-rw-r--r-- Documentation/admin-guide/mm/zswap.rst | 22
-rw-r--r-- Documentation/admin-guide/nfs/nfs-client.rst | 15
-rw-r--r-- Documentation/admin-guide/perf/alibaba_pmu.rst | 100
-rw-r--r-- Documentation/admin-guide/perf/hns3-pmu.rst | 136
-rw-r--r-- Documentation/admin-guide/perf/index.rst | 3
-rw-r--r-- Documentation/admin-guide/pm/amd-pstate.rst | 237
-rw-r--r-- Documentation/admin-guide/pm/cpuidle.rst | 15
-rw-r--r-- Documentation/admin-guide/pm/intel-speed-select.rst | 22
-rw-r--r-- Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst | 60
-rw-r--r-- Documentation/admin-guide/pm/working-state.rst | 1
-rw-r--r-- Documentation/admin-guide/reporting-issues.rst | 73
-rw-r--r-- Documentation/admin-guide/reporting-regressions.rst | 451
-rw-r--r-- Documentation/admin-guide/sysctl/kernel.rst | 131
-rw-r--r-- Documentation/admin-guide/sysctl/net.rst | 62
-rw-r--r-- Documentation/admin-guide/sysctl/vm.rst | 53
-rw-r--r-- Documentation/admin-guide/tainted-kernels.rst | 7
70 files changed, 4108 insertions, 989 deletions
diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst
index caa3c09a5c3f..9a969c0157f1 100644
--- a/Documentation/admin-guide/README.rst
+++ b/Documentation/admin-guide/README.rst
@@ -1,9 +1,9 @@
.. _readme:
-Linux kernel release 5.x <http://kernel.org/>
+Linux kernel release 6.x <http://kernel.org/>
=============================================
-These are the release notes for Linux version 5. Read them carefully,
+These are the release notes for Linux version 6. Read them carefully,
as they tell you what this is all about, explain how to install the
kernel, and what to do if something goes wrong.
@@ -63,7 +63,7 @@ Installing the kernel source
directory where you have permissions (e.g. your home directory) and
unpack it::
- xz -cd linux-5.x.tar.xz | tar xvf -
+ xz -cd linux-6.x.tar.xz | tar xvf -
Replace "X" with the version number of the latest kernel.
@@ -72,12 +72,12 @@ Installing the kernel source
files. They should match the library, and not get messed up by
whatever the kernel-du-jour happens to be.
- - You can also upgrade between 5.x releases by patching. Patches are
+ - You can also upgrade between 6.x releases by patching. Patches are
distributed in the xz format. To install by patching, get all the
newer patch files, enter the top level directory of the kernel source
- (linux-5.x) and execute::
+ (linux-6.x) and execute::
- xz -cd ../patch-5.x.xz | patch -p1
+ xz -cd ../patch-6.x.xz | patch -p1
Replace "x" for all versions bigger than the version "x" of your current
source tree, **in_order**, and you should be ok. You may want to remove
@@ -85,13 +85,13 @@ Installing the kernel source
that there are no failed patches (some-file-name# or some-file-name.rej).
If there are, either you or I have made a mistake.
- Unlike patches for the 5.x kernels, patches for the 5.x.y kernels
+ Unlike patches for the 6.x kernels, patches for the 6.x.y kernels
(also known as the -stable kernels) are not incremental but instead apply
- directly to the base 5.x kernel. For example, if your base kernel is 5.0
- and you want to apply the 5.0.3 patch, you must not first apply the 5.0.1
- and 5.0.2 patches. Similarly, if you are running kernel version 5.0.2 and
- want to jump to 5.0.3, you must first reverse the 5.0.2 patch (that is,
- patch -R) **before** applying the 5.0.3 patch. You can read more on this in
+ directly to the base 6.x kernel. For example, if your base kernel is 6.0
+ and you want to apply the 6.0.3 patch, you must not first apply the 6.0.1
+ and 6.0.2 patches. Similarly, if you are running kernel version 6.0.2 and
+ want to jump to 6.0.3, you must first reverse the 6.0.2 patch (that is,
+ patch -R) **before** applying the 6.0.3 patch. You can read more on this in
:ref:`Documentation/process/applying-patches.rst <applying_patches>`.
Alternatively, the script patch-kernel can be used to automate this
@@ -114,7 +114,7 @@ Installing the kernel source
Software requirements
---------------------
- Compiling and running the 5.x kernels requires up-to-date
+ Compiling and running the 6.x kernels requires up-to-date
versions of various software packages. Consult
:ref:`Documentation/process/changes.rst <changes>` for the minimum version numbers
required and how to get updates for these packages. Beware that using
@@ -132,12 +132,12 @@ Build directory for the kernel
place for the output files (including .config).
Example::
- kernel source code: /usr/src/linux-5.x
+ kernel source code: /usr/src/linux-6.x
build directory: /home/name/build/kernel
To configure and build the kernel, use::
- cd /usr/src/linux-5.x
+ cd /usr/src/linux-6.x
make O=/home/name/build/kernel menuconfig
make O=/home/name/build/kernel
sudo make O=/home/name/build/kernel modules_install install
@@ -262,8 +262,6 @@ Compiling the kernel
- Make sure you have at least gcc 5.1 available.
For more information, refer to :ref:`Documentation/process/changes.rst <changes>`.
- Please note that you can still run a.out user programs with this kernel.
-
- Do a ``make`` to create a compressed kernel image. It is also
possible to do ``make install`` if you have lilo installed to suit the
kernel makefiles, but you may want to check your particular lilo setup first.
@@ -332,85 +330,10 @@ Compiling the kernel
If something goes wrong
-----------------------
- - If you have problems that seem to be due to kernel bugs, please check
- the file MAINTAINERS to see if there is a particular person associated
- with the part of the kernel that you are having trouble with. If there
- isn't anyone listed there, then the second best thing is to mail
- them to me (torvalds@linux-foundation.org), and possibly to any other
- relevant mailing-list or to the newsgroup.
-
- - In all bug-reports, *please* tell what kernel you are talking about,
- how to duplicate the problem, and what your setup is (use your common
- sense). If the problem is new, tell me so, and if the problem is
- old, please try to tell me when you first noticed it.
-
- - If the bug results in a message like::
-
- unable to handle kernel paging request at address C0000010
- Oops: 0002
- EIP: 0010:XXXXXXXX
- eax: xxxxxxxx ebx: xxxxxxxx ecx: xxxxxxxx edx: xxxxxxxx
- esi: xxxxxxxx edi: xxxxxxxx ebp: xxxxxxxx
- ds: xxxx es: xxxx fs: xxxx gs: xxxx
- Pid: xx, process nr: xx
- xx xx xx xx xx xx xx xx xx xx
-
- or similar kernel debugging information on your screen or in your
- system log, please duplicate it *exactly*. The dump may look
- incomprehensible to you, but it does contain information that may
- help debugging the problem. The text above the dump is also
- important: it tells something about why the kernel dumped code (in
- the above example, it's due to a bad kernel pointer). More information
- on making sense of the dump is in Documentation/admin-guide/bug-hunting.rst
-
- - If you compiled the kernel with CONFIG_KALLSYMS you can send the dump
- as is, otherwise you will have to use the ``ksymoops`` program to make
- sense of the dump (but compiling with CONFIG_KALLSYMS is usually preferred).
- This utility can be downloaded from
- https://www.kernel.org/pub/linux/utils/kernel/ksymoops/ .
- Alternatively, you can do the dump lookup by hand:
-
- - In debugging dumps like the above, it helps enormously if you can
- look up what the EIP value means. The hex value as such doesn't help
- me or anybody else very much: it will depend on your particular
- kernel setup. What you should do is take the hex value from the EIP
- line (ignore the ``0010:``), and look it up in the kernel namelist to
- see which kernel function contains the offending address.
-
- To find out the kernel function name, you'll need to find the system
- binary associated with the kernel that exhibited the symptom. This is
- the file 'linux/vmlinux'. To extract the namelist and match it against
- the EIP from the kernel crash, do::
-
- nm vmlinux | sort | less
-
- This will give you a list of kernel addresses sorted in ascending
- order, from which it is simple to find the function that contains the
- offending address. Note that the address given by the kernel
- debugging messages will not necessarily match exactly with the
- function addresses (in fact, that is very unlikely), so you can't
- just 'grep' the list: the list will, however, give you the starting
- point of each kernel function, so by looking for the function that
- has a starting address lower than the one you are searching for but
- is followed by a function with a higher address you will find the one
- you want. In fact, it may be a good idea to include a bit of
- "context" in your problem report, giving a few lines around the
- interesting one.
-
- If you for some reason cannot do the above (you have a pre-compiled
- kernel image or similar), telling me as much about your setup as
- possible will help. Please read
- 'Documentation/admin-guide/reporting-issues.rst' for details.
-
- - Alternatively, you can use gdb on a running kernel. (read-only; i.e. you
- cannot change values or set break points.) To do this, first compile the
- kernel with -g; edit arch/x86/Makefile appropriately, then do a ``make
- clean``. You'll also need to enable CONFIG_PROC_FS (via ``make config``).
-
- After you've rebooted with the new kernel, do ``gdb vmlinux /proc/kcore``.
- You can now use all the usual gdb commands. The command to look up the
- point where your system crashed is ``l *0xXXXXXXXX``. (Replace the XXXes
- with the EIP value.)
-
- gdb'ing a non-running kernel currently fails because ``gdb`` (wrongly)
- disregards the starting offset for which the kernel is compiled.
+If you have problems that seem to be due to kernel bugs, please follow the
+instructions at 'Documentation/admin-guide/reporting-issues.rst'.
+
+Hints on understanding kernel bug reports are in
+'Documentation/admin-guide/bug-hunting.rst'. More on debugging the kernel
+with gdb is in 'Documentation/dev-tools/gdb-kernel-debugging.rst' and
+'Documentation/dev-tools/kgdb.rst'.
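
As a quick sketch of the stable-patch workflow described above (all version
numbers are illustrative), moving a tree already patched to 6.0.2 onto 6.0.3
looks like::

    cd linux-6.0
    # back out the 6.0.2 stable patch first, since 6.x.y patches are not incremental
    xz -cd ../patch-6.0.2.xz | patch -R -p1
    # then apply 6.0.3 against the clean 6.0 base
    xz -cd ../patch-6.0.3.xz | patch -p1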
diff --git a/Documentation/admin-guide/acpi/dsdt-override.rst b/Documentation/admin-guide/acpi/dsdt-override.rst
deleted file mode 100644
index 50bd7f194bf4..000000000000
--- a/Documentation/admin-guide/acpi/dsdt-override.rst
+++ /dev/null
@@ -1,13 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===============
-Overriding DSDT
-===============
-
-Linux supports a method of overriding the BIOS DSDT:
-
-CONFIG_ACPI_CUSTOM_DSDT - builds the image into the kernel.
-
-When to use this method is described in detail on the
-Linux/ACPI home page:
-https://01.org/linux-acpi/documentation/overriding-dsdt
diff --git a/Documentation/admin-guide/acpi/fan_performance_states.rst b/Documentation/admin-guide/acpi/fan_performance_states.rst
index 98fe5c333121..b9e4b4d146c1 100644
--- a/Documentation/admin-guide/acpi/fan_performance_states.rst
+++ b/Documentation/admin-guide/acpi/fan_performance_states.rst
@@ -60,3 +60,31 @@ For example::
When a given field is not populated or its value provided by the platform
firmware is invalid, the "not-defined" string is shown instead of the value.
+
+ACPI Fan Fine Grain Control
+=============================
+
+When the _FIF object specifies support for fine grain control, the fan speed
+can be set from 0 to 100% with the recommended minimum "step size" via the
+_FSL object. The user can adjust the fan speed via the thermal sysfs cooling device.
+
+Here the user can look at the fan performance states for a reference speed
+(speed_rpm) and set it by changing the cooling device cur_state. If fine grain
+control is supported, the user can also set speeds that are not defined
+in the performance states.
+
+Support for fine grain control is exposed via the sysfs attribute
+"fine_grain_control". If fine grain control is supported, this attribute
+will show "1", otherwise "0".
+
+This sysfs attribute is presented in the same directory as performance states.
+
+ACPI Fan Performance Feedback
+=============================
+
+The optional _FST object provides status information for the fan device.
+This includes a field that reports the current fan speed, in revolutions
+per minute.
+
+This speed is presented in sysfs via the attribute "fan_speed_rpm",
+in the same directory as the performance states.
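
For illustration, a minimal session using the new attributes could look as
follows; the ACPI device path and cooling device index are platform specific
and only examples::

    # check whether fine grain control is supported (1) or not (0)
    cat /sys/bus/acpi/devices/INT3404:00/fine_grain_control
    # read the current fan speed reported through _FST
    cat /sys/bus/acpi/devices/INT3404:00/fan_speed_rpm
    # request a different fan state via the thermal cooling device
    echo 3 > /sys/class/thermal/cooling_device0/cur_state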
diff --git a/Documentation/admin-guide/acpi/index.rst b/Documentation/admin-guide/acpi/index.rst
index 71277689ad97..b078fdb8f4c9 100644
--- a/Documentation/admin-guide/acpi/index.rst
+++ b/Documentation/admin-guide/acpi/index.rst
@@ -9,7 +9,6 @@ the Linux ACPI support.
:maxdepth: 1
initrd_table_override
- dsdt-override
ssdt-overlays
cppc_sysfs
fan_performance_states
diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst
index b903cf152091..957ccf617797 100644
--- a/Documentation/admin-guide/blockdev/index.rst
+++ b/Documentation/admin-guide/blockdev/index.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-===========================
-The Linux RapidIO Subsystem
-===========================
+=============
+Block Devices
+=============
.. toctree::
:maxdepth: 1
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index 3e11926a4df9..c73b16930449 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -315,8 +315,8 @@ To use the feature, admin should set up backing device via::
echo /dev/sda5 > /sys/block/zramX/backing_dev
-before disksize setting. It supports only partition at this moment.
-If admin wants to use incompressible page writeback, they could do via::
+before disksize setting. It supports only partitions at this moment.
+If admin wants to use incompressible page writeback, they could do it via::
echo huge > /sys/block/zramX/writeback
@@ -341,9 +341,14 @@ Admin can request writeback of those idle pages at right timing via::
echo idle > /sys/block/zramX/writeback
-With the command, zram writeback idle pages from memory to the storage.
+With the command, zram will write back idle pages from memory to the storage.
-If admin want to write a specific page in zram device to backing device,
+Additionally, if a user chooses to write back only huge and idle pages
+this can be accomplished with::
+
+ echo huge_idle > /sys/block/zramX/writeback
+
+If an admin wants to write a specific page in zram device to the backing device,
they could write a page index into the interface.
echo "page_index=1251" > /sys/block/zramX/writeback
@@ -354,7 +359,7 @@ to guarantee storage health for entire product life.
To overcome the concern, zram supports "writeback_limit" feature.
The "writeback_limit_enable"'s default value is 0 so that it doesn't limit
-any writeback. IOW, if admin wants to apply writeback budget, he should
+any writeback. IOW, if admin wants to apply writeback budget, they should
enable writeback_limit_enable via::
$ echo 1 > /sys/block/zramX/writeback_limit_enable
@@ -365,7 +370,7 @@ until admin sets the budget via /sys/block/zramX/writeback_limit.
(If admin doesn't enable writeback_limit_enable, writeback_limit's value
assigned via /sys/block/zramX/writeback_limit is meaningless.)
-If admin want to limit writeback as per-day 400M, he could do it
+If admin wants to limit writeback to 400M per day, they could do it
like below::
$ MB_SHIFT=20
@@ -375,16 +380,16 @@ like below::
$ echo 1 > /sys/block/zram0/writeback_limit_enable
If admins want to allow further write again once the budget is exhausted,
-he could do it like below::
+they could do it like below::
$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit
-If admin wants to see remaining writeback budget since last set::
+If an admin wants to see the remaining writeback budget since last set::
$ cat /sys/block/zramX/writeback_limit
-If admin want to disable writeback limit, he could do::
+If an admin wants to disable writeback limit, they could do::
$ echo 0 > /sys/block/zramX/writeback_limit_enable
@@ -393,7 +398,7 @@ system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of
writeback happened until you reset the zram to allocate extra writeback
budget in next setting is user's job.
-If admin wants to measure writeback count in a certain period, he could
+If admin wants to measure writeback count in a certain period, they could
know it via /sys/block/zram0/bd_stat's 3rd column.
memory tracking
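
A hedged end-to-end sketch of the writeback flow described above; the device
index and backing partition are placeholders::

    # configure the backing partition before setting disksize
    echo /dev/sda5 > /sys/block/zram0/backing_dev
    echo 1G > /sys/block/zram0/disksize

    # later: mark all current pages idle, then write back pages that
    # are both huge and idle to the backing device
    echo all > /sys/block/zram0/idle
    echo huge_idle > /sys/block/zram0/writeback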
diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst
index a1860fc0ca88..d99994345d41 100644
--- a/Documentation/admin-guide/bootconfig.rst
+++ b/Documentation/admin-guide/bootconfig.rst
@@ -158,9 +158,15 @@ Each key-value pair is shown in each line with following style::
Boot Kernel With a Boot Config
==============================
-Since the boot configuration file is loaded with initrd, it will be added
-to the end of the initrd (initramfs) image file with padding, size,
-checksum and 12-byte magic word as below.
+There are two options to boot the kernel with bootconfig: attaching the
+bootconfig to the initrd image or embedding it in the kernel itself.
+
+Attaching a Boot Config to Initrd
+---------------------------------
+
+Since the boot configuration file is loaded with initrd by default,
+it will be added to the end of the initrd (initramfs) image file with
+padding, size, checksum and 12-byte magic word as below.
[initrd][bootconfig][padding][size(le32)][checksum(le32)][#BOOTCONFIG\n]
@@ -196,6 +202,25 @@ To remove the config from the image, you can use -d option as below::
Then add "bootconfig" on the normal kernel command line to tell the
kernel to look for the bootconfig at the end of the initrd file.
+Embedding a Boot Config into Kernel
+-----------------------------------
+
+If you cannot use an initrd, you can also embed the bootconfig file in the
+kernel via Kconfig options. In this case, you need to recompile the kernel
+with the following configs::
+
+ CONFIG_BOOT_CONFIG_EMBED=y
+ CONFIG_BOOT_CONFIG_EMBED_FILE="/PATH/TO/BOOTCONFIG/FILE"
+
+``CONFIG_BOOT_CONFIG_EMBED_FILE`` requires an absolute path to the bootconfig
+file, or a path relative to the source tree or object tree.
+The kernel will embed it as the default bootconfig.
+
+Just as when attaching the bootconfig to the initrd, you need the ``bootconfig``
+option on the kernel command line to enable the embedded bootconfig.
+
+Note that even if you set this option, you can still override the embedded
+bootconfig with another bootconfig attached to the initrd.
Kernel parameters via Boot Config
=================================
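
A sketch of the embedded-bootconfig flow; the file path below is hypothetical,
and ``CONFIG_BOOT_CONFIG`` (the base option the embed options depend on) is
added for completeness::

    # .config fragment used when rebuilding the kernel
    CONFIG_BOOT_CONFIG=y
    CONFIG_BOOT_CONFIG_EMBED=y
    CONFIG_BOOT_CONFIG_EMBED_FILE="kernel/configs/example-bootconfig.conf"

    # then boot with "bootconfig" appended to the kernel command line
    # to activate the embedded configuration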
diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index 338f2c7d7a1c..0fa724d82abb 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -29,12 +29,14 @@ Brief summary of control files::
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB usage limit
+ hugetlb.<hugepagesize>.numa_stat # show the numa information of the hugetlb memory charged to this cgroup
For a system supporting three hugepage sizes (64k, 32M and 1G), the control
files include::
hugetlb.1GB.limit_in_bytes
hugetlb.1GB.max_usage_in_bytes
+ hugetlb.1GB.numa_stat
hugetlb.1GB.usage_in_bytes
hugetlb.1GB.failcnt
hugetlb.1GB.rsvd.limit_in_bytes
@@ -43,6 +45,7 @@ files include::
hugetlb.1GB.rsvd.failcnt
hugetlb.64KB.limit_in_bytes
hugetlb.64KB.max_usage_in_bytes
+ hugetlb.64KB.numa_stat
hugetlb.64KB.usage_in_bytes
hugetlb.64KB.failcnt
hugetlb.64KB.rsvd.limit_in_bytes
@@ -51,6 +54,7 @@ files include::
hugetlb.64KB.rsvd.failcnt
hugetlb.32MB.limit_in_bytes
hugetlb.32MB.max_usage_in_bytes
+ hugetlb.32MB.numa_stat
hugetlb.32MB.usage_in_bytes
hugetlb.32MB.failcnt
hugetlb.32MB.rsvd.limit_in_bytes
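
For example, reading the new per-node file for 1GB pages in a v1 hugetlb
cgroup might look like this; the mount point, cgroup name and output shown
are purely illustrative::

    cat /sys/fs/cgroup/hugetlb/mygroup/hugetlb.1GB.numa_stat
    total=2147483648 N0=2147483648 N1=0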
diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 45b94f7b3beb..a402359abb99 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -97,7 +97,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
=============
Page Cache is charged at
- - add_to_page_cache_locked().
+ - filemap_add_folio().
The logic is very clear. (About migration, see below)
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index faac50149a22..5b86245450bd 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -64,6 +64,7 @@ Brief summary of control files.
threads
cgroup.procs show list of processes
cgroup.event_control an interface for event_fd()
+ This knob is not available on CONFIG_PREEMPT_RT systems.
memory.usage_in_bytes show current usage for memory
(See 5.5 for details)
memory.memsw.usage_in_bytes show current usage for memory+Swap
@@ -75,6 +76,7 @@ Brief summary of control files.
memory.max_usage_in_bytes show max memory usage recorded
memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded
memory.soft_limit_in_bytes set/show soft limit of memory usage
+ This knob is not available on CONFIG_PREEMPT_RT systems.
memory.stat show various statistics
memory.use_hierarchy set/show hierarchical account enabled
This knob is deprecated and shouldn't be
@@ -297,7 +299,7 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; PG_lru bit of page->flags is cleared before
isolating a page from its LRU under lruvec->lru_lock.
-2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
+2.7 Kernel Memory Extension
-----------------------------------------------
With the Kernel memory extension, the Memory Controller is able to limit
@@ -384,8 +386,6 @@ U != 0, K >= U:
a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
-c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
-d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 2aeb7ae8b393..dc254a3cb956 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -184,6 +184,14 @@ cgroup v2 currently supports the following mount options.
ignored on non-init namespace mounts. Please refer to the
Delegation section for details.
+ favordynmods
+ Reduce the latencies of dynamic cgroup modifications such as
+ task migrations and controller on/offs at the cost of making
+ hot path operations such as forks and exits more expensive.
+ The static usage pattern of creating a cgroup, enabling
+ controllers, and then seeding it with CLONE_INTO_CGROUP is
+ not affected by this option.
+
memory_localevents
Only populate memory.events with data for the current cgroup,
and not any subtrees. This is legacy behaviour, the default
@@ -968,6 +976,29 @@ All cgroup core files are prefixed with "cgroup."
killing cgroups is a process directed operation, i.e. it affects
the whole thread-group.
+ cgroup.pressure
+ A read-write single value file whose allowed values are "0" and "1".
+ The default is "1".
+
+ Writing "0" to the file will disable the cgroup PSI accounting.
+ Writing "1" to the file will re-enable the cgroup PSI accounting.
+
+ This control attribute is not hierarchical, so disabling or enabling PSI
+ accounting in a cgroup does not affect PSI accounting in its descendants
+ and does not require enablement to be passed down from the root.
+
+ The reason this control attribute exists is that PSI accounts stalls for
+ each cgroup separately and aggregates it at each level of the hierarchy.
+ This may cause non-negligible overhead for some workloads when the
+ hierarchy is deep, in which case this control attribute can
+ be used to disable PSI accounting in the non-leaf cgroups.
+
+ irq.pressure
+ A read-write nested-keyed file.
+
+ Shows pressure stall information for IRQ/SOFTIRQ. See
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
+
Controllers
===========
@@ -1208,6 +1239,41 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
+ memory.reclaim
+ A write-only nested-keyed file which exists for all cgroups.
+
+ This is a simple interface to trigger memory reclaim in the
+ target cgroup.
+
+ This file accepts a single key, the number of bytes to reclaim.
+ No nested keys are currently supported.
+
+ Example::
+
+ echo "1G" > memory.reclaim
+
+ The interface can be later extended with nested keys to
+ configure the reclaim behavior. For example, specify the
+ type of memory to reclaim from (anon, file, ..).
+
+ Please note that the kernel can over- or under-reclaim from
+ the target cgroup. If fewer bytes are reclaimed than the
+ specified amount, -EAGAIN is returned.
+
+ Please note that the proactive reclaim (triggered by this
+ interface) is not meant to indicate memory pressure on the
+ memory cgroup. Therefore socket memory balancing triggered by
+ the memory reclaim normally is not exercised in this case.
+ This means that the networking layer will not adapt based on
+ reclaim induced by memory.reclaim.
+
+ memory.peak
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The max memory usage recorded for the cgroup and its
+ descendants since the creation of the cgroup.
+
memory.oom.group
A read-write single value file which exists on non-root
cgroups. The default value is "0".
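
A small usage sketch for the proactive reclaim interface described above; the
cgroup path is a placeholder, and the write fails with EAGAIN when less than
the requested amount could be reclaimed::

    # ask the kernel to reclaim roughly 100M from this cgroup
    echo "100M" > /sys/fs/cgroup/workload/memory.reclaim

    # compare peak and current usage afterwards
    cat /sys/fs/cgroup/workload/memory.peak
    cat /sys/fs/cgroup/workload/memory.current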
@@ -1268,6 +1334,9 @@ PAGE_SIZE multiple when read back.
The number of processes belonging to this cgroup
killed by any kind of OOM killer.
+ oom_group_kill
+ The number of times a group OOM has occurred.
+
memory.events.local
Similar to memory.events but the fields in the file are local
to the cgroup i.e. not hierarchical. The file modified event
@@ -1298,12 +1367,22 @@ PAGE_SIZE multiple when read back.
Amount of memory used to cache filesystem data,
including tmpfs and shared memory.
+ kernel (npn)
+ Amount of total kernel memory, including
+ (kernel_stack, pagetables, percpu, vmalloc, slab) in
+ addition to other kernel memory use cases.
+
kernel_stack
Amount of memory allocated to kernel stacks.
pagetables
Amount of memory allocated for page tables.
+ sec_pagetables
+ Amount of memory allocated for secondary page tables,
+ this currently includes KVM mmu allocations on x86
+ and arm64.
+
percpu (npn)
Amount of memory used for storing per-cpu kernel
data structures.
@@ -1311,10 +1390,19 @@ PAGE_SIZE multiple when read back.
sock (npn)
Amount of memory used in network transmission buffers
+ vmalloc (npn)
+ Amount of memory used for vmap backed memory.
+
shmem
Amount of cached filesystem data that is swap-backed,
such as tmpfs, shm segments, shared anonymous mmap()s
+ zswap
+ Amount of memory consumed by the zswap compression backend.
+
+ zswapped
+ Amount of application memory swapped out to zswap.
+
file_mapped
Amount of cached filesystem data mapped with mmap()
@@ -1388,6 +1476,24 @@ PAGE_SIZE multiple when read back.
workingset_nodereclaim
Number of times a shadow node has been reclaimed
+ pgscan (npn)
+ Amount of scanned pages (in an inactive LRU list)
+
+ pgsteal (npn)
+ Amount of reclaimed pages
+
+ pgscan_kswapd (npn)
+ Amount of scanned pages by kswapd (in an inactive LRU list)
+
+ pgscan_direct (npn)
+ Amount of scanned pages directly (in an inactive LRU list)
+
+ pgsteal_kswapd (npn)
+ Amount of reclaimed pages by kswapd
+
+ pgsteal_direct (npn)
+ Amount of reclaimed pages directly
+
pgfault (npn)
Total number of page faults incurred
@@ -1397,12 +1503,6 @@ PAGE_SIZE multiple when read back.
pgrefill (npn)
Amount of scanned pages (in an active LRU list)
- pgscan (npn)
- Amount of scanned pages (in an inactive LRU list)
-
- pgsteal (npn)
- Amount of reclaimed pages
-
pgactivate (npn)
Amount of pages moved to the active LRU list
@@ -1505,6 +1605,21 @@ PAGE_SIZE multiple when read back.
higher than the limit for an extended period of time. This
reduces the impact on the workload and memory management.
+ memory.zswap.current
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The total amount of memory consumed by the zswap compression
+ backend.
+
+ memory.zswap.max
+ A read-write single value file which exists on non-root
+ cgroups. The default is "max".
+
+ Zswap usage hard limit. If a cgroup's zswap pool reaches this
+ limit, it will refuse to take any more stores before existing
+ entries fault back in or are written out to disk.
+
memory.pressure
A read-only nested-keyed file.
@@ -1870,7 +1985,7 @@ IO Latency Interface Files
io.latency
This takes a similar format as the other controllers.
- "MAJOR:MINOR target=<target time in microseconds"
+ "MAJOR:MINOR target=<target time in microseconds>"
io.stat
If the controller is enabled you will see extra stats in io.stat in
@@ -2098,75 +2213,93 @@ Cpuset Interface Files
It accepts only the following input values when written to.
- ======== ================================
- "root" a partition root
- "member" a non-root member of a partition
- ======== ================================
-
- When set to be a partition root, the current cgroup is the
- root of a new partition or scheduling domain that comprises
- itself and all its descendants except those that are separate
- partition roots themselves and their descendants. The root
- cgroup is always a partition root.
-
- There are constraints on where a partition root can be set.
- It can only be set in a cgroup if all the following conditions
- are true.
-
- 1) The "cpuset.cpus" is not empty and the list of CPUs are
- exclusive, i.e. they are not shared by any of its siblings.
- 2) The parent cgroup is a partition root.
- 3) The "cpuset.cpus" is also a proper subset of the parent's
- "cpuset.cpus.effective".
- 4) There is no child cgroups with cpuset enabled. This is for
- eliminating corner cases that have to be handled if such a
- condition is allowed.
-
- Setting it to partition root will take the CPUs away from the
- effective CPUs of the parent cgroup. Once it is set, this
- file cannot be reverted back to "member" if there are any child
- cgroups with cpuset enabled.
-
- A parent partition cannot distribute all its CPUs to its
- child partitions. There must be at least one cpu left in the
- parent partition.
-
- Once becoming a partition root, changes to "cpuset.cpus" is
- generally allowed as long as the first condition above is true,
- the change will not take away all the CPUs from the parent
- partition and the new "cpuset.cpus" value is a superset of its
- children's "cpuset.cpus" values.
-
- Sometimes, external factors like changes to ancestors'
- "cpuset.cpus" or cpu hotplug can cause the state of the partition
- root to change. On read, the "cpuset.sched.partition" file
- can show the following values.
-
- ============== ==============================
- "member" Non-root member of a partition
- "root" Partition root
- "root invalid" Invalid partition root
- ============== ==============================
-
- It is a partition root if the first 2 partition root conditions
- above are true and at least one CPU from "cpuset.cpus" is
- granted by the parent cgroup.
-
- A partition root can become invalid if none of CPUs requested
- in "cpuset.cpus" can be granted by the parent cgroup or the
- parent cgroup is no longer a partition root itself. In this
- case, it is not a real partition even though the restriction
- of the first partition root condition above will still apply.
- The cpu affinity of all the tasks in the cgroup will then be
- associated with CPUs in the nearest ancestor partition.
-
- An invalid partition root can be transitioned back to a
- real partition root if at least one of the requested CPUs
- can now be granted by its parent. In this case, the cpu
- affinity of all the tasks in the formerly invalid partition
- will be associated to the CPUs of the newly formed partition.
- Changing the partition state of an invalid partition root to
- "member" is always allowed even if child cpusets are present.
+ ========== =====================================
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
+ ========== =====================================
+
+ The root cgroup is always a partition root and its state
+ cannot be changed. All other non-root cgroups start out as
+ "member".
+
+ When set to "root", the current cgroup is the root of a new
+ partition or scheduling domain that comprises itself and all
+ its descendants except those that are separate partition roots
+ themselves and their descendants.
+
+ When set to "isolated", the CPUs in that partition root will
+ be in an isolated state without any load balancing from the
+ scheduler. Tasks placed in such a partition with multiple
+ CPUs should be carefully distributed and bound to each of the
+ individual CPUs for optimal performance.
+
+ The value shown in "cpuset.cpus.effective" of a partition root
+ is the CPUs that the partition root can dedicate to a potential
+ new child partition root. The new child subtracts available
+ CPUs from its parent "cpuset.cpus.effective".
+
+ A partition root ("root" or "isolated") can be in one of the
+ two possible states - valid or invalid. An invalid partition
+ root is in a degraded state where some state information may
+ be retained, but behaves more like a "member".
+
+ All possible state transitions among "member", "root" and
+ "isolated" are allowed.
+
+ On read, the "cpuset.cpus.partition" file can show the following
+ values.
+
+ ============================= =====================================
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
+ "root invalid (<reason>)" Invalid partition root
+ "isolated invalid (<reason>)" Invalid isolated partition root
+ ============================= =====================================
+
+ In the case of an invalid partition root, a descriptive string on
+ why the partition is invalid is included within parentheses.
+
+ For a partition root to become valid, the following conditions
+ must be met.
+
+ 1) The "cpuset.cpus" is exclusive with its siblings , i.e. they
+ are not shared by any of its siblings (exclusivity rule).
+ 2) The parent cgroup is a valid partition root.
+ 3) The "cpuset.cpus" is not empty and must contain at least
+ one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
+ 4) The "cpuset.cpus.effective" cannot be empty unless there is
+ no task associated with this partition.
+
+ External events like hotplug or changes to "cpuset.cpus" can
+ cause a valid partition root to become invalid and vice versa.
+ Note that a task cannot be moved to a cgroup with empty
+ "cpuset.cpus.effective".
+
+ For a valid partition root with the sibling cpu exclusivity
+ rule enabled, changes made to "cpuset.cpus" that violate the
+ exclusivity rule will invalidate the partition as well as its
+ sibling partitions with conflicting cpuset.cpus values. So
+ care must be taken when changing "cpuset.cpus".
+
+ A valid non-root parent partition may distribute out all its CPUs
+ to its child partitions when there is no task associated with it.
+
+ Care must be taken when changing a valid partition root to
+ "member", as all its child partitions, if present, will become
+ invalid, causing disruption to tasks running in those child
+ partitions. These inactivated partitions could be recovered if
+ their parent is switched back to a partition root with a proper
+ set of "cpuset.cpus".
+
+ Poll and inotify events are triggered whenever the state of
+ "cpuset.cpus.partition" changes. That includes changes caused
+ by write to "cpuset.cpus.partition", cpu hotplug or other
+ changes that modify the validity status of the partition.
+ This will allow user space agents to monitor unexpected changes
+ to "cpuset.cpus.partition" without the need to do continuous
+ polling.
Device controller
@@ -2260,6 +2393,11 @@ HugeTLB Interface Files
are local to the cgroup i.e. not hierarchical. The file modified event
generated on this file reflects only the local events.
+ hugetlb.<hugepagesize>.numa_stat
+ Similar to memory.numa_stat, it shows the numa information of the
+ hugetlb pages of <hugepagesize> in this cgroup. Only actively
+ in-use hugetlb pages are included. The per-node values are in bytes.
+
Misc
----
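
To make the partition semantics above concrete, a minimal (hypothetical)
isolated-partition setup could look like this, assuming CPUs 2-3 are not
claimed by any sibling cgroup::

    cd /sys/fs/cgroup
    echo +cpuset > cgroup.subtree_control
    mkdir rt-part
    echo 2-3 > rt-part/cpuset.cpus
    echo isolated > rt-part/cpuset.cpus.partition
    # reads back "isolated" if the partition is valid, or
    # "isolated invalid (<reason>)" otherwise
    cat rt-part/cpuset.cpus.partition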
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index 6b62e182baf4..d29cacc9b3c3 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -8,11 +8,9 @@ to /proc/cpuinfo output of some architectures. They reside in
Documentation/ABI/stable/sysfs-devices-system-cpu.
Architecture-neutral, drivers/base/topology.c, exports these attributes.
-However, the book and drawer related sysfs files will only be created if
-CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are selected, respectively.
-
-CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are currently only used on s390,
-where they reflect the cpu and cache hierarchy.
+However, the die, cluster, book, and drawer hierarchy-related sysfs files will
+only be created if an architecture provides the related macros as described
+below.
For an architecture to support this feature, it must define some of
these macros in include/asm-XXX/topology.h::
@@ -43,15 +41,14 @@ not defined by include/asm-XXX/topology.h:
2) topology_die_id: -1
3) topology_cluster_id: -1
4) topology_core_id: 0
-5) topology_sibling_cpumask: just the given CPU
-6) topology_core_cpumask: just the given CPU
-7) topology_cluster_cpumask: just the given CPU
-8) topology_die_cpumask: just the given CPU
-
-For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
-default definitions for topology_book_id() and topology_book_cpumask().
-For architectures that don't support drawers (CONFIG_SCHED_DRAWER) there are
-no default definitions for topology_drawer_id() and topology_drawer_cpumask().
+5) topology_book_id: -1
+6) topology_drawer_id: -1
+7) topology_sibling_cpumask: just the given CPU
+8) topology_core_cpumask: just the given CPU
+9) topology_cluster_cpumask: just the given CPU
+10) topology_die_cpumask: just the given CPU
+11) topology_book_cpumask: just the given CPU
+12) topology_drawer_cpumask: just the given CPU
Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files. The internal
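
For instance, all per-CPU topology attributes exported under that directory
can be dumped in one go (cpu0 chosen arbitrarily)::

    grep . /sys/devices/system/cpu/cpu0/topology/*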
diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst
index 1a6b91368e59..a65c1602cb23 100644
--- a/Documentation/admin-guide/device-mapper/verity.rst
+++ b/Documentation/admin-guide/device-mapper/verity.rst
@@ -141,6 +141,10 @@ root_hash_sig_key_desc <key_description>
also gain new certificates at run time if they are signed by a certificate
already in the secondary trusted keyring.
+try_verify_in_tasklet
+ If verity hashes are in cache, verify data blocks in a kernel tasklet instead
+ of a workqueue. This option can reduce IO latency.
+
Theory of operation
===================
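
As an illustration of passing the new optional argument, a verity table could
be built roughly as below; device names, sizes, root hash and salt are
placeholders, and the positional parameters follow the constructor layout
documented earlier in this file::

    dmsetup create vroot --readonly --table \
      "0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 \
       <root_hash_hex> <salt_hex> 1 try_verify_in_tasklet"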
diff --git a/Documentation/admin-guide/device-mapper/writecache.rst b/Documentation/admin-guide/device-mapper/writecache.rst
index 10429779a91a..60c16b7fd5ac 100644
--- a/Documentation/admin-guide/device-mapper/writecache.rst
+++ b/Documentation/admin-guide/device-mapper/writecache.rst
@@ -20,6 +20,7 @@ Constructor parameters:
size)
5. the number of optional parameters (the parameters with an argument
count as two)
+
start_sector n (default: 0)
offset from the start of cache device in 512-byte sectors
high_watermark n (default: 50)
@@ -74,20 +75,21 @@ Constructor parameters:
the origin volume in the last n milliseconds
Status:
+
1. error indicator - 0 if there was no error, otherwise error number
2. the number of blocks
3. the number of free blocks
4. the number of blocks under writeback
-5. the number of read requests
-6. the number of read requests that hit the cache
-7. the number of write requests
-8. the number of write requests that hit uncommitted block
-9. the number of write requests that hit committed block
-10. the number of write requests that bypass the cache
-11. the number of write requests that are allocated in the cache
+5. the number of read blocks
+6. the number of read blocks that hit the cache
+7. the number of write blocks
+8. the number of write blocks that hit uncommitted block
+9. the number of write blocks that hit committed block
+10. the number of write blocks that bypass the cache
+11. the number of write blocks that are allocated in the cache
12. the number of write requests that are blocked on the freelist
13. the number of flush requests
-14. the number of discard requests
+14. the number of discarded blocks
Messages:
flush
diff --git a/Documentation/admin-guide/devices.rst b/Documentation/admin-guide/devices.rst
index 035275fedbdd..e3776d77374b 100644
--- a/Documentation/admin-guide/devices.rst
+++ b/Documentation/admin-guide/devices.rst
@@ -7,10 +7,9 @@ This list is the Linux Device List, the official registry of allocated
device numbers and ``/dev`` directory nodes for the Linux operating
system.
-The LaTeX version of this document is no longer maintained, nor is
-the document that used to reside at lanana.org. This version in the
-mainline Linux kernel is the master document. Updates shall be sent
-as patches to the kernel maintainers (see the
+The version of this document at lanana.org is no longer maintained. This
+version in the mainline Linux kernel is the master document. Updates
+shall be sent as patches to the kernel maintainers (see the
:ref:`Documentation/process/submitting-patches.rst <submittingpatches>` document).
Specifically explore the sections titled "CHAR and MISC DRIVERS", and
"BLOCK LAYER" in the MAINTAINERS file to find the right maintainers
diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt
index c07dc0ee860e..9764d6edb189 100644
--- a/Documentation/admin-guide/devices.txt
+++ b/Documentation/admin-guide/devices.txt
@@ -1933,7 +1933,7 @@
...
255= /dev/umem/d15p15 15th partition of 16th board.
- 117 char COSA/SRP synchronous serial card
+ 117 char [REMOVED] COSA/SRP synchronous serial card
0 = /dev/cosa0c0 1st board, 1st channel
1 = /dev/cosa0c1 1st board, 2nd channel
...
diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst
index a89cfa083155..faa22f77847a 100644
--- a/Documentation/admin-guide/dynamic-debug-howto.rst
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -5,143 +5,115 @@ Dynamic debug
Introduction
============
-This document describes how to use the dynamic debug (dyndbg) feature.
+Dynamic debug allows you to dynamically enable/disable kernel
+debug-print code to obtain additional kernel information.
-Dynamic debug is designed to allow you to dynamically enable/disable
-kernel code to obtain additional kernel information. Currently, if
-``CONFIG_DYNAMIC_DEBUG`` is set, then all ``pr_debug()``/``dev_dbg()`` and
-``print_hex_dump_debug()``/``print_hex_dump_bytes()`` calls can be dynamically
-enabled per-callsite.
+If ``/proc/dynamic_debug/control`` exists, your kernel has dynamic
+debug. You'll need root access (sudo su) to use this.
-If you do not want to enable dynamic debug globally (i.e. in some embedded
-system), you may set ``CONFIG_DYNAMIC_DEBUG_CORE`` as basic support of dynamic
-debug and add ``ccflags := -DDYNAMIC_DEBUG_MODULE`` into the Makefile of any
-modules which you'd like to dynamically debug later.
-
-If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is just
-shortcut for ``print_hex_dump(KERN_DEBUG)``.
-
-For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, format string is
-its ``prefix_str`` argument, if it is constant string; or ``hexdump``
-in case ``prefix_str`` is built dynamically.
+Dynamic debug provides:
-Dynamic debug has even more useful features:
+ * a Catalog of all *prdbgs* in your kernel.
+ ``cat /proc/dynamic_debug/control`` to see them.
- * Simple query language allows turning on and off debugging
- statements by matching any combination of 0 or 1 of:
+ * a Simple query/command language to alter *prdbgs* by selecting on
+ any combination of 0 or 1 of:
- source filename
- function name
- line number (including ranges of line numbers)
- module name
- format string
-
- * Provides a debugfs control file: ``<debugfs>/dynamic_debug/control``
- which can be read to display the complete list of known debug
- statements, to help guide you
-
-Controlling dynamic debug Behaviour
-===================================
-
-The behaviour of ``pr_debug()``/``dev_dbg()`` are controlled via writing to a
-control file in the 'debugfs' filesystem. Thus, you must first mount
-the debugfs filesystem, in order to make use of this feature.
-Subsequently, we refer to the control file as:
-``<debugfs>/dynamic_debug/control``. For example, if you want to enable
-printing from source file ``svcsock.c``, line 1603 you simply do::
-
- nullarbor:~ # echo 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
-
-If you make a mistake with the syntax, the write will fail thus::
-
- nullarbor:~ # echo 'file svcsock.c wtf 1 +p' >
- <debugfs>/dynamic_debug/control
- -bash: echo: write error: Invalid argument
-
-Note, for systems without 'debugfs' enabled, the control file can be
-found in ``/proc/dynamic_debug/control``.
+ - class name (as known/declared by each module)
Viewing Dynamic Debug Behaviour
===============================
-You can view the currently configured behaviour of all the debug
-statements via::
+You can view the currently configured behaviour in the *prdbg* catalog::
- nullarbor:~ # cat <debugfs>/dynamic_debug/control
+ :#> head -n7 /proc/dynamic_debug/control
# filename:lineno [module]function flags format
- net/sunrpc/svc_rdma.c:323 [svcxprt_rdma]svc_rdma_cleanup =_ "SVCRDMA Module Removed, deregister RPC RDMA transport\012"
- net/sunrpc/svc_rdma.c:341 [svcxprt_rdma]svc_rdma_init =_ "\011max_inline : %d\012"
- net/sunrpc/svc_rdma.c:340 [svcxprt_rdma]svc_rdma_init =_ "\011sq_depth : %d\012"
- net/sunrpc/svc_rdma.c:338 [svcxprt_rdma]svc_rdma_init =_ "\011max_requests : %d\012"
- ...
+ init/main.c:1179 [main]initcall_blacklist =_ "blacklisting initcall %s\012"
+ init/main.c:1218 [main]initcall_blacklisted =_ "initcall %s blacklisted\012"
+ init/main.c:1424 [main]run_init_process =_ " with arguments:\012"
+ init/main.c:1426 [main]run_init_process =_ " %s\012"
+ init/main.c:1427 [main]run_init_process =_ " with environment:\012"
+ init/main.c:1429 [main]run_init_process =_ " %s\012"
+The 3rd space-delimited column shows the current flags, preceded by
+a ``=`` for easy use with grep/cut. ``=p`` shows enabled callsites.
-You can also apply standard Unix text manipulation filters to this
-data, e.g.::
+Controlling dynamic debug Behaviour
+===================================
- nullarbor:~ # grep -i rdma <debugfs>/dynamic_debug/control | wc -l
- 62
+The behaviour of *prdbg* sites is controlled by writing
+query/commands to the control file. Example::
- nullarbor:~ # grep -i tcp <debugfs>/dynamic_debug/control | wc -l
- 42
+ # grease the interface
+ :#> alias ddcmd='echo $* > /proc/dynamic_debug/control'
-The third column shows the currently enabled flags for each debug
-statement callsite (see below for definitions of the flags). The
-default value, with no flags enabled, is ``=_``. So you can view all
-the debug statement callsites with any non-default flags::
+ :#> ddcmd '-p; module main func run* +p'
+ :#> grep =p /proc/dynamic_debug/control
+ init/main.c:1424 [main]run_init_process =p " with arguments:\012"
+ init/main.c:1426 [main]run_init_process =p " %s\012"
+ init/main.c:1427 [main]run_init_process =p " with environment:\012"
+ init/main.c:1429 [main]run_init_process =p " %s\012"
- nullarbor:~ # awk '$3 != "=_"' <debugfs>/dynamic_debug/control
- # filename:lineno [module]function flags format
- net/sunrpc/svcsock.c:1603 [sunrpc]svc_send p "svc_process: st_sendto returned %d\012"
+Error messages go to console/syslog::
+
+ :#> ddcmd mode foo +p
+ dyndbg: unknown keyword "mode"
+ dyndbg: query parse failed
+ bash: echo: write error: Invalid argument
+
+If debugfs is also enabled and mounted, ``dynamic_debug/control`` is
+also under the mount-dir, typically ``/sys/kernel/debug/``.
Command Language Reference
==========================
-At the lexical level, a command comprises a sequence of words separated
+At the basic lexical level, a command is a sequence of words separated
by spaces or tabs. So these are all equivalent::
- nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
- nullarbor:~ # echo -n ' file svcsock.c line 1603 +p ' >
- <debugfs>/dynamic_debug/control
- nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd file svcsock.c line 1603 +p
+ :#> ddcmd "file svcsock.c line 1603 +p"
+ :#> ddcmd ' file svcsock.c line 1603 +p '
Command submissions are bounded by a write() system call.
Multiple commands can be written together, separated by ``;`` or ``\n``::
- ~# echo "func pnpacpi_get_resources +p; func pnp_assign_mem +p" \
- > <debugfs>/dynamic_debug/control
+ :#> ddcmd "func pnpacpi_get_resources +p; func pnp_assign_mem +p"
+ :#> ddcmd <<"EOC"
+ func pnpacpi_get_resources +p
+ func pnp_assign_mem +p
+ EOC
+ :#> cat query-batch-file > /proc/dynamic_debug/control
-If your query set is big, you can batch them too::
+You can also use wildcards in each query term. The match rule supports
+``*`` (matches zero or more characters) and ``?`` (matches exactly one
+character). For example, you can match all usb drivers::
- ~# cat query-batch-file > <debugfs>/dynamic_debug/control
+ :#> ddcmd file "drivers/usb/*" +p # "" to suppress shell expansion
-Another way is to use wildcards. The match rule supports ``*`` (matches
-zero or more characters) and ``?`` (matches exactly one character). For
-example, you can match all usb drivers::
-
- ~# echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control
-
-At the syntactical level, a command comprises a sequence of match
-specifications, followed by a flags change specification::
+Syntactically, a command is pairs of keyword values, followed by a
+flags change or setting::
command ::= match-spec* flags-spec
-The match-spec's are used to choose a subset of the known pr_debug()
-callsites to which to apply the flags-spec. Think of them as a query
-with implicit ANDs between each pair. Note that an empty list of
-match-specs will select all debug statement callsites.
+The match-specs select *prdbgs* from the catalog, upon which to apply
+the flags-spec; all constraints are ANDed together. An absent keyword
+is the same as keyword "*".
+
-A match specification comprises a keyword, which controls the
-attribute of the callsite to be compared, and a value to compare
-against. Possible keywords are:::
+A match specification is a keyword, which selects the attribute of
+the callsite to be compared, and a value to compare against. Possible
+keywords are:::
match-spec ::= 'func' string |
'file' string |
'module' string |
'format' string |
+ 'class' string |
'line' line-range
line-range ::= lineno |
@@ -203,6 +175,16 @@ format
format "nfsd: SETATTR" // a neater way to match a format with whitespace
format 'nfsd: SETATTR' // yet another way to match a format with whitespace
+class
+ The given class_name is validated against each module, which may
+ have declared a list of known class_names. If the class_name is
+ found for a module, callsite & class matching and adjustment
+ proceeds. Examples::
+
+ class DRM_UT_KMS # a DRM.debug category
+ class JUNK # silent non-match
+ // class TLD_* # NOTICE: no wildcard in class names
+
line
The given line number or range of line numbers is compared
against the line number of each ``pr_debug()`` callsite. A single
@@ -228,17 +210,16 @@ of the characters::
The flags are::
p enables the pr_debug() callsite.
- f Include the function name in the printed message
- l Include line number in the printed message
- m Include module name in the printed message
- t Include thread ID in messages not generated from interrupt context
- _ No flags are set. (Or'd with others on input)
+ _ enables no flags.
-For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only ``p`` flag
-have meaning, other flags ignored.
+ Decorator flags add to the message-prefix, in order:
+ t Include thread ID, or <intr>
+ m Include module name
+ f Include the function name
+ l Include line number
-For display, the flags are preceded by ``=``
-(mnemonic: what the flags are currently equal to).
+For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only
+the ``p`` flag has meaning, other flags are ignored.
Note the regexp ``^[-+=][flmpt_]+$`` matches a flags specification.
To clear all flags at once, use ``=_`` or ``-flmpt``.
@@ -313,7 +294,7 @@ For ``CONFIG_DYNAMIC_DEBUG`` kernels, any settings given at boot-time (or
enabled by ``-DDEBUG`` flag during compilation) can be disabled later via
the debugfs interface if the debug messages are no longer needed::
- echo "module module_name -p" > <debugfs>/dynamic_debug/control
+ echo "module module_name -p" > /proc/dynamic_debug/control
Examples
========
@@ -321,37 +302,31 @@ Examples
::
// enable the message at line 1603 of file svcsock.c
- nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'file svcsock.c line 1603 +p'
// enable all the messages in file svcsock.c
- nullarbor:~ # echo -n 'file svcsock.c +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'file svcsock.c +p'
// enable all the messages in the NFS server module
- nullarbor:~ # echo -n 'module nfsd +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'module nfsd +p'
// enable all 12 messages in the function svc_process()
- nullarbor:~ # echo -n 'func svc_process +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'func svc_process +p'
// disable all 12 messages in the function svc_process()
- nullarbor:~ # echo -n 'func svc_process -p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'func svc_process -p'
// enable messages for NFS calls READ, READLINK, READDIR and READDIR+.
- nullarbor:~ # echo -n 'format "nfsd: READ" +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'format "nfsd: READ" +p'
// enable messages in files of which the paths include string "usb"
- nullarbor:~ # echo -n 'file *usb* +p' > <debugfs>/dynamic_debug/control
+ :#> ddcmd 'file *usb* +p'
// enable all messages
- nullarbor:~ # echo -n '+p' > <debugfs>/dynamic_debug/control
+ :#> ddcmd '+p'
// add module, function to all enabled messages
- nullarbor:~ # echo -n '+mf' > <debugfs>/dynamic_debug/control
+ :#> ddcmd '+mf'
// boot-args example, with newlines and comments for readability
Kernel command line: ...
@@ -364,3 +339,38 @@ Examples
dyndbg="file init/* +p #cmt ; func parse_one +p"
// enable pr_debugs in 2 functions in a module loaded later
pc87360.dyndbg="func pc87360_init_device +p; func pc87360_find +p"
+
+Kernel Configuration
+====================
+
+Dynamic Debug is enabled via kernel config items::
+
+ CONFIG_DYNAMIC_DEBUG=y # build catalog, enables CORE
+ CONFIG_DYNAMIC_DEBUG_CORE=y # enable mechanics only, skip catalog
+
+If you do not want to enable dynamic debug globally (e.g. in some embedded
+systems), you may set ``CONFIG_DYNAMIC_DEBUG_CORE`` for basic support of
+dynamic debug and add ``ccflags := -DDYNAMIC_DEBUG_MODULE`` to the Makefile of
+any modules which you'd like to dynamically debug later.
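+
+A minimal sketch of such a module Makefile (the module name ``foo`` and
+its source file are hypothetical; ``ccflags-y`` is the usual kbuild
+variable for compiler flags)::
+
+    # Opt this module into dynamic debug even when only
+    # CONFIG_DYNAMIC_DEBUG_CORE=y is set in the kernel config.
+    ccflags-y += -DDYNAMIC_DEBUG_MODULE
+    obj-m += foo.o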
+
+
+Kernel *prdbg* API
+==================
+
+The following functions are cataloged and controllable when dynamic
+debug is enabled::
+
+ pr_debug()
+ dev_dbg()
+ print_hex_dump_debug()
+ print_hex_dump_bytes()
+
+Otherwise, they are off by default; ``ccflags += -DDEBUG`` or
+``#define DEBUG`` in a source file will enable them appropriately.
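+
+For instance (a sketch; the file name is hypothetical), the define must
+come before the includes so that file's pr_debug()s are built in::
+
+    /* first lines of foo.c */
+    #define DEBUG
+    #include <linux/module.h>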
+
+If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is
+just a shortcut for ``print_hex_dump(KERN_DEBUG)``.
+
+For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, the format string
+is its ``prefix_str`` argument, if it is a constant string, or ``hexdump``
+when ``prefix_str`` is built dynamically.
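+
+For example (a sketch; ``"raw_data: "`` stands for a hypothetical constant
+``prefix_str`` used by some driver), such callsites could be enabled with::
+
+    :#> ddcmd 'format "raw_data: " +p'    // constant prefix_str
+    :#> ddcmd 'format hexdump +p'         // dynamically built prefix_str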
diff --git a/Documentation/admin-guide/efi-stub.rst b/Documentation/admin-guide/efi-stub.rst
index 833edb0d0bc4..b24e7c40d832 100644
--- a/Documentation/admin-guide/efi-stub.rst
+++ b/Documentation/admin-guide/efi-stub.rst
@@ -7,10 +7,10 @@ as a PE/COFF image, thereby convincing EFI firmware loaders to load
it as an EFI executable. The code that modifies the bzImage header,
along with the EFI-specific entry point that the firmware loader
jumps to are collectively known as the "EFI boot stub", and live in
-arch/x86/boot/header.S and arch/x86/boot/compressed/eboot.c,
+arch/x86/boot/header.S and drivers/firmware/efi/libstub/x86-stub.c,
respectively. For ARM the EFI stub is implemented in
arch/arm/boot/compressed/efi-header.S and
-arch/arm/boot/compressed/efi-stub.c. EFI stub code that is shared
+drivers/firmware/efi/libstub/arm32-stub.c. EFI stub code that is shared
between architectures is in drivers/firmware/efi/libstub.
For arm64, there is no compressed kernel support, so the Image itself
diff --git a/Documentation/admin-guide/gpio/index.rst b/Documentation/admin-guide/gpio/index.rst
index 7db367572f30..f6861ca16ffe 100644
--- a/Documentation/admin-guide/gpio/index.rst
+++ b/Documentation/admin-guide/gpio/index.rst
@@ -10,6 +10,7 @@ gpio
gpio-aggregator
sysfs
gpio-mockup
+ gpio-sim
.. only:: subproject and html
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 8cbc711cda93..4df436e7c417 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -17,3 +17,4 @@ are configurable at compile, boot or run time.
special-register-buffer-data-sampling.rst
core-scheduling.rst
l1d_flush.rst
+ processor_mmio_stale_data.rst
diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
new file mode 100644
index 000000000000..c98fd11907cc
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
@@ -0,0 +1,260 @@
+=========================================
+Processor MMIO Stale Data Vulnerabilities
+=========================================
+
+Processor MMIO Stale Data Vulnerabilities are a class of memory-mapped I/O
+(MMIO) vulnerabilities that can expose data. The sequences of operations for
+exposing data range from simple to very complex. Because most of the
+vulnerabilities require the attacker to have access to MMIO, many environments
+are not affected. System environments using virtualization where MMIO access is
+provided to untrusted guests may need mitigation. These vulnerabilities are
+not transient execution attacks. However, these vulnerabilities may propagate
+stale data into core fill buffers where the data can subsequently be inferred
+by an unmitigated transient execution attack. Mitigation for these
+vulnerabilities includes a combination of microcode update and software
+changes, depending on the platform and usage model. Some of these mitigations
+are similar to those used to mitigate Microarchitectural Data Sampling (MDS) or
+those used to mitigate Special Register Buffer Data Sampling (SRBDS).
+
+Data Propagators
+================
+Propagators are operations that result in stale data being copied or moved from
+one microarchitectural buffer or register to another. Processor MMIO Stale Data
+Vulnerabilities are operations that may result in stale data being directly
+read into an architectural, software-visible state or sampled from a buffer or
+register.
+
+Fill Buffer Stale Data Propagator (FBSDP)
+-----------------------------------------
+Stale data may propagate from fill buffers (FB) into the non-coherent portion
+of the uncore on some non-coherent writes. Fill buffer propagation by itself
+does not make stale data architecturally visible. Stale data must be propagated
+to a location where it is subject to reading or sampling.
+
+Sideband Stale Data Propagator (SSDP)
+-------------------------------------
+The sideband stale data propagator (SSDP) is limited to the client (including
+Intel Xeon server E3) uncore implementation. The sideband response buffer is
+shared by all client cores. For non-coherent reads that go to sideband
+destinations, the uncore logic returns 64 bytes of data to the core, including
+both requested data and unrequested stale data, from a transaction buffer and
+the sideband response buffer. As a result, stale data from the sideband
+response and transaction buffers may now reside in a core fill buffer.
+
+Primary Stale Data Propagator (PSDP)
+------------------------------------
+The primary stale data propagator (PSDP) is limited to the client (including
+Intel Xeon server E3) uncore implementation. Similar to the sideband response
+buffer, the primary response buffer is shared by all client cores. For some
+processors, MMIO primary reads will return 64 bytes of data to the core fill
+buffer including both requested data and unrequested stale data. This is
+similar to the sideband stale data propagator.
+
+Vulnerabilities
+===============
+Device Register Partial Write (DRPW) (CVE-2022-21166)
+-----------------------------------------------------
+Some endpoint MMIO registers incorrectly handle writes that are smaller than
+the register size. Instead of aborting the write or only copying the correct
+subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than
+specified by the write transaction may be written to the register. On
+processors affected by FBSDP, this may expose stale data from the fill buffers
+of the core that created the write transaction.
+
+Shared Buffers Data Sampling (SBDS) (CVE-2022-21125)
+----------------------------------------------------
+After propagators may have moved data around the uncore and copied stale data
+into client core fill buffers, processors affected by MFBDS can leak data from
+the fill buffer. It is limited to the client (including Intel Xeon server E3)
+uncore implementation.
+
+Shared Buffers Data Read (SBDR) (CVE-2022-21123)
+------------------------------------------------
+It is similar to Shared Buffer Data Sampling (SBDS) except that the data is
+directly read into the architectural software-visible state. It is limited to
+the client (including Intel Xeon server E3) uncore implementation.
+
+Affected Processors
+===================
+Not all the CPUs are affected by all the variants. For instance, most
+processors for the server market (excluding Intel Xeon E3 processors) are
+impacted by only Device Register Partial Write (DRPW).
+
+Below is the list of affected Intel processors [#f1]_:
+
+ =================== ============ =========
+ Common name Family_Model Steppings
+ =================== ============ =========
+ HASWELL_X 06_3FH 2,4
+ SKYLAKE_L 06_4EH 3
+ BROADWELL_X 06_4FH All
+ SKYLAKE_X 06_55H 3,4,6,7,11
+ BROADWELL_D 06_56H 3,4,5
+ SKYLAKE 06_5EH 3
+ ICELAKE_X 06_6AH 4,5,6
+ ICELAKE_D 06_6CH 1
+ ICELAKE_L 06_7EH 5
+ ATOM_TREMONT_D 06_86H All
+ LAKEFIELD 06_8AH 1
+ KABYLAKE_L 06_8EH 9 to 12
+ ATOM_TREMONT 06_96H 1
+ ATOM_TREMONT_L 06_9CH 0
+ KABYLAKE 06_9EH 9 to 13
+ COMETLAKE 06_A5H 2,3,5
+ COMETLAKE_L 06_A6H 0,1
+ ROCKETLAKE 06_A7H 1
+ =================== ============ =========
+
+If a CPU is in the affected processor list, but not affected by a variant, it
+is indicated by new bits in MSR IA32_ARCH_CAPABILITIES. As described in a later
+section, mitigation largely remains the same for all the variants, i.e. to
+clear the CPU fill buffers via VERW instruction.
+
+New bits in MSRs
+================
+Newer processors and microcode update on existing affected processors added new
+bits to IA32_ARCH_CAPABILITIES MSR. These bits can be used to enumerate
+specific variants of Processor MMIO Stale Data vulnerabilities and mitigation
+capability.
+
+MSR IA32_ARCH_CAPABILITIES
+--------------------------
+Bit 13 - SBDR_SSDP_NO - When set, processor is not affected by either the
+ Shared Buffers Data Read (SBDR) vulnerability or the sideband stale
+ data propagator (SSDP).
+Bit 14 - FBSDP_NO - When set, processor is not affected by the Fill Buffer
+ Stale Data Propagator (FBSDP).
+Bit 15 - PSDP_NO - When set, processor is not affected by Primary Stale Data
+ Propagator (PSDP).
+Bit 17 - FB_CLEAR - When set, VERW instruction will overwrite CPU fill buffer
+ values as part of MD_CLEAR operations. Processors that do not
+ enumerate MDS_NO (meaning they are affected by MDS) but that do
+ enumerate support for both L1D_FLUSH and MD_CLEAR implicitly enumerate
+ FB_CLEAR as part of their MD_CLEAR support.
+Bit 18 - FB_CLEAR_CTRL - Processor supports read and write to MSR
+ IA32_MCU_OPT_CTRL[FB_CLEAR_DIS]. On such processors, the FB_CLEAR_DIS
+ bit can be set to cause the VERW instruction to not perform the
+ FB_CLEAR action. Not all processors that support FB_CLEAR will support
+ FB_CLEAR_CTRL.
+
+MSR IA32_MCU_OPT_CTRL
+---------------------
+Bit 3 - FB_CLEAR_DIS - When set, VERW instruction does not perform the FB_CLEAR
+action. This may be useful to reduce the performance impact of FB_CLEAR in
+cases where system software deems it warranted (for example, when performance
+is more critical, or the untrusted software has no MMIO access). Note that
+FB_CLEAR_DIS has no impact on enumeration (for example, it does not change
+FB_CLEAR or MD_CLEAR enumeration) and it may not be supported on all processors
+that enumerate FB_CLEAR.
+
+Mitigation
+==========
+Like MDS, all variants of Processor MMIO Stale Data vulnerabilities have the
+same mitigation strategy to force the CPU to clear the affected buffers before
+an attacker can extract the secrets.
+
+This is achieved by using the otherwise unused and obsolete VERW instruction in
+combination with a microcode update. The microcode clears the affected CPU
+buffers when the VERW instruction is executed.
+
+The kernel reuses the MDS function to invoke the buffer clearing:
+
+ mds_clear_cpu_buffers()
+
+On MDS affected CPUs, the kernel already invokes CPU buffer clear on
+kernel/userspace, hypervisor/guest and C-state (idle) transitions. No
+additional mitigation is needed on such CPUs.
+
+For CPUs not affected by MDS or TAA, mitigation is needed only against an
+attacker with MMIO capability. Therefore, VERW is not required for
+kernel/userspace transitions. For the virtualization case, VERW is only needed
+at VMENTER for a guest with MMIO capability.
+
+Mitigation points
+-----------------
+Return to user space
+^^^^^^^^^^^^^^^^^^^^
+Same mitigation as MDS when affected by MDS/TAA, otherwise no mitigation
+needed.
+
+C-State transition
+^^^^^^^^^^^^^^^^^^
+Control register writes by CPU during C-state transition can propagate data
+from fill buffer to uncore buffers. Execute VERW before C-state transition to
+clear CPU fill buffers.
+
+Guest entry point
+^^^^^^^^^^^^^^^^^
+Same mitigation as MDS when processor is also affected by MDS/TAA, otherwise
+execute VERW at VMENTER only for MMIO capable guests. On CPUs not affected by
+MDS/TAA, guest without MMIO access cannot extract secrets using Processor MMIO
+Stale Data vulnerabilities, so there is no need to execute VERW for such guests.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+The kernel command line allows controlling the Processor MMIO Stale Data
+mitigations at boot time with the option "mmio_stale_data=". The valid
+arguments for this option are:
+
+ ========== =================================================================
+ full If the CPU is vulnerable, enable mitigation; CPU buffer clearing
+ on exit to userspace and when entering a VM. Idle transitions are
+ protected as well. It does not automatically disable SMT.
+ full,nosmt Same as full, with SMT disabled on vulnerable CPUs. This is the
+ complete mitigation.
+ off Disables mitigation completely.
+ ========== =================================================================
+
+If the CPU is affected and mmio_stale_data=off is not supplied on the kernel
+command line, then the kernel selects the appropriate mitigation.
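+
+For example (a sketch), to request the complete mitigation, including
+disabling SMT on vulnerable CPUs, boot with::
+
+    mmio_stale_data=full,nosmt
+
+and confirm the resulting state via the sysfs file described below.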
+
+Mitigation status information
+-----------------------------
+The Linux kernel provides a sysfs interface to enumerate the current
+vulnerability status of the system: whether the system is vulnerable, and
+which mitigations are active. The relevant sysfs file is:
+
+ /sys/devices/system/cpu/vulnerabilities/mmio_stale_data
+
+The possible values in this file are:
+
+ .. list-table::
+
+ * - 'Not affected'
+ - The processor is not vulnerable
+ * - 'Vulnerable'
+ - The processor is vulnerable, but no mitigation enabled
+ * - 'Vulnerable: Clear CPU buffers attempted, no microcode'
+ - The processor is vulnerable, but microcode is not updated. The
+ mitigation is enabled on a best effort basis.
+ * - 'Mitigation: Clear CPU buffers'
+ - The processor is vulnerable and the CPU buffer clearing mitigation is
+ enabled.
+ * - 'Unknown: No mitigations'
+ - The processor vulnerability status is unknown because it is
+ out of its Servicing period. Mitigation is not attempted.
+
+Definitions:
+------------
+
+Servicing period: The process of providing functional and security updates to
+Intel processors or platforms, utilizing the Intel Platform Update (IPU)
+process or other similar mechanisms.
+
+End of Servicing Updates (ESU): ESU is the date at which Intel will no
+longer provide Servicing, such as through IPU or other similar update
+processes. ESU dates will typically be aligned to end of quarter.
+
+If the processor is vulnerable then the following information is appended to
+the above information:
+
+ ======================== ===========================================
+ 'SMT vulnerable' SMT is enabled
+ 'SMT disabled' SMT is disabled
+ 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown
+ ======================== ===========================================
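+
+For example (illustrative output only; the exact string depends on the CPU
+and the kernel configuration)::
+
+    $ cat /sys/devices/system/cpu/vulnerabilities/mmio_stale_data
+    Mitigation: Clear CPU buffers; SMT vulnerable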
+
+References
+----------
+.. [#f1] Affected Processors
+ https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html
diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst
index a2b22d5640ec..c4dcdb3d0d45 100644
--- a/Documentation/admin-guide/hw-vuln/spectre.rst
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -60,8 +60,8 @@ privileged data touched during the speculative execution.
Spectre variant 1 attacks take advantage of speculative execution of
conditional branches, while Spectre variant 2 attacks use speculative
execution of indirect branches to leak privileged memory.
-See :ref:`[1] <spec_ref1>` :ref:`[5] <spec_ref5>` :ref:`[7] <spec_ref7>`
-:ref:`[10] <spec_ref10>` :ref:`[11] <spec_ref11>`.
+See :ref:`[1] <spec_ref1>` :ref:`[5] <spec_ref5>` :ref:`[6] <spec_ref6>`
+:ref:`[7] <spec_ref7>` :ref:`[10] <spec_ref10>` :ref:`[11] <spec_ref11>`.
Spectre variant 1 (Bounds Check Bypass)
---------------------------------------
@@ -131,6 +131,19 @@ steer its indirect branch speculations to gadget code, and measure the
speculative execution's side effects left in level 1 cache to infer the
victim's data.
+Yet another variant 2 attack vector is for the attacker to poison the
+Branch History Buffer (BHB) to speculatively steer an indirect branch
+to a specific Branch Target Buffer (BTB) entry, even if the entry isn't
+associated with the source address of the indirect branch. Specifically,
+the BHB might be shared across privilege levels even in the presence of
+Enhanced IBRS.
+
+Currently the only known real-world BHB attack vector is via
+unprivileged eBPF. Therefore, it's highly recommended to not enable
+unprivileged eBPF, especially when eIBRS is used (without retpolines).
+For a full mitigation against BHB attacks, it's recommended to use
+retpolines (or eIBRS combined with retpolines).
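+
+As a sketch, the standard sysctl can be used to check and disable
+unprivileged eBPF at runtime (a non-zero value means it is already
+disabled; writing 1 disables it for the remainder of the boot)::
+
+    # sysctl kernel.unprivileged_bpf_disabled
+    # sysctl -w kernel.unprivileged_bpf_disabled=1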
+
Attack scenarios
----------------
@@ -364,13 +377,15 @@ The possible values in this file are:
- Kernel status:
- ==================================== =================================
- 'Not affected' The processor is not vulnerable
- 'Vulnerable' Vulnerable, no mitigation
- 'Mitigation: Full generic retpoline' Software-focused mitigation
- 'Mitigation: Full AMD retpoline' AMD-specific software mitigation
- 'Mitigation: Enhanced IBRS' Hardware-focused mitigation
- ==================================== =================================
+ ======================================== =================================
+ 'Not affected' The processor is not vulnerable
+ 'Mitigation: None' Vulnerable, no mitigation
+ 'Mitigation: Retpolines' Use Retpoline thunks
+ 'Mitigation: LFENCE' Use LFENCE instructions
+ 'Mitigation: Enhanced IBRS' Hardware-focused mitigation
+ 'Mitigation: Enhanced IBRS + Retpolines' Hardware-focused + Retpolines
+ 'Mitigation: Enhanced IBRS + LFENCE' Hardware-focused + LFENCE
+ ======================================== =================================
- Firmware status: Show if Indirect Branch Restricted Speculation (IBRS) is
used to protect against Spectre variant 2 attacks when calling firmware (x86 only).
@@ -407,6 +422,14 @@ The possible values in this file are:
'RSB filling' Protection of RSB on context switch enabled
============= ===========================================
+ - EIBRS Post-barrier Return Stack Buffer (PBRSB) protection status:
+
+ =========================== =======================================================
+ 'PBRSB-eIBRS: SW sequence' CPU is affected and protection of RSB on VMEXIT enabled
+ 'PBRSB-eIBRS: Vulnerable' CPU is vulnerable
+ 'PBRSB-eIBRS: Not affected' CPU is not affected by PBRSB
+ =========================== =======================================================
+
Full mitigation might require a microcode update from the CPU
vendor. When the necessary microcode is not available, the kernel will
report vulnerability.
@@ -583,12 +606,14 @@ kernel command line.
Specific mitigations can also be selected manually:
- retpoline
- replace indirect branches
- retpoline,generic
- google's original retpoline
- retpoline,amd
- AMD-specific minimal thunk
+ retpoline auto pick between generic,lfence
+ retpoline,generic Retpolines
+ retpoline,lfence LFENCE; indirect branch
+ retpoline,amd alias for retpoline,lfence
+ eibrs enhanced IBRS
+ eibrs,retpoline enhanced IBRS + Retpolines
+ eibrs,lfence enhanced IBRS + LFENCE
+ ibrs use IBRS to protect kernel
Not specifying this option is equivalent to
spectre_v2=auto.
@@ -599,7 +624,7 @@ kernel command line.
spectre_v2=off. Spectre variant 1 mitigations
cannot be disabled.
-For spectre_v2_user see :doc:`/admin-guide/kernel-parameters`.
+For spectre_v2_user see Documentation/admin-guide/kernel-parameters.txt
Mitigation selection guide
--------------------------
@@ -681,7 +706,7 @@ AMD white papers:
.. _spec_ref6:
-[6] `Software techniques for managing speculation on AMD processors <https://developer.amd.com/wp-content/resources/90343-B_SoftwareTechniquesforManagingSpeculation_WP_7-18Update_FNL.pdf>`_.
+[6] `Software techniques for managing speculation on AMD processors <https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf>`_.
ARM white papers:
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 1bedab498104..5bfafcbb9562 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -35,6 +35,7 @@ problems and bugs in particular.
:maxdepth: 1
reporting-issues
+ reporting-regressions
security-bugs
bug-hunting
bug-bisect
diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
index 9b14b0c2c9c4..609a3201fd4e 100644
--- a/Documentation/admin-guide/iostats.rst
+++ b/Documentation/admin-guide/iostats.rst
@@ -76,7 +76,7 @@ Field 3 -- # of sectors read (unsigned long)
Field 4 -- # of milliseconds spent reading (unsigned int)
This is the total number of milliseconds spent by all reads (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 5 -- # of writes completed (unsigned long)
This is the total number of writes completed successfully.
@@ -89,7 +89,7 @@ Field 7 -- # of sectors written (unsigned long)
Field 8 -- # of milliseconds spent writing (unsigned int)
This is the total number of milliseconds spent by all writes (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 9 -- # of I/Os currently in progress (unsigned int)
The only field that should go to zero. Incremented as requests are
@@ -120,7 +120,7 @@ Field 14 -- # of sectors discarded (unsigned long)
Field 15 -- # of milliseconds spent discarding (unsigned int)
This is the total number of milliseconds spent by all discards (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 16 -- # of flush requests completed
This is the total number of flush requests completed successfully.
diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index cb30ca3df27c..a748e7eb4429 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -146,9 +146,9 @@ System kernel config options
CONFIG_SYSFS=y
Note that "sysfs file system support" might not appear in the "Pseudo
- filesystems" menu if "Configure standard kernel features (for small
- systems)" is not enabled in "General Setup." In this case, check the
- .config file itself to ensure that sysfs is turned on, as follows::
+ filesystems" menu if "Configure standard kernel features (expert users)"
+ is not enabled in "General Setup." In this case, check the .config file
+ itself to ensure that sysfs is turned on, as follows::
grep 'CONFIG_SYSFS' .config
@@ -533,6 +533,10 @@ the following command::
cp /proc/vmcore <dump-file>
+or use scp to write out the dump file between hosts on a network, e.g::
+
+ scp /proc/vmcore remote_username@remote_ip:<dump-file>
+
You can also use makedumpfile utility to write out the dump file
with specified options to filter out unwanted contents, e.g::
diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 3861a25faae1..6726f439958c 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -200,7 +200,7 @@ prb
A pointer to the printk ringbuffer (struct printk_ringbuffer). This
may be pointing to the static boot ringbuffer or the dynamically
-allocated ringbuffer, depending on when the the core dump occurred.
+allocated ringbuffer, depending on when the core dump occurred.
Used by user-space tools to read the active kernel log buffer.
printk_rb_static
@@ -494,6 +494,14 @@ architecture which is used to lookup the page-tables for the Virtual
addresses in the higher VA range (refer to ARMv8 ARM document for
more details).
+MODULES_VADDR|MODULES_END|VMALLOC_START|VMALLOC_END|VMEMMAP_START|VMEMMAP_END
+-----------------------------------------------------------------------------
+
+Used to get the correct ranges:
+ MODULES_VADDR ~ MODULES_END-1 : Kernel module space.
+ VMALLOC_START ~ VMALLOC_END-1 : vmalloc() / ioremap() space.
+ VMEMMAP_START ~ VMEMMAP_END-1 : vmemmap region, used for struct page array.
+
arm
===
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index 01ba293a2d70..959f73a32712 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -99,6 +99,7 @@ parameter is applicable::
ALSA ALSA sound support is enabled.
APIC APIC support is enabled.
APM Advanced Power Management support is enabled.
+ APPARMOR AppArmor support is enabled.
ARM ARM architecture is enabled.
ARM64 ARM64 architecture is enabled.
AX25 Appropriate AX.25 support is enabled.
@@ -108,15 +109,15 @@ parameter is applicable::
DYNAMIC_DEBUG Build in debug messages and enable them at runtime
EDD BIOS Enhanced Disk Drive Services (EDD) is enabled
EFI EFI Partitioning (GPT) is enabled
- EIDE EIDE/ATAPI support is enabled.
EVM Extended Verification Module
FB The frame buffer device is enabled.
FTRACE Function tracing enabled.
GCOV GCOV profiling is enabled.
+ HIBERNATION HIBERNATION is enabled.
HW Appropriate hardware is enabled.
+ HYPER_V HYPERV support is enabled.
IA-64 IA-64 architecture is enabled.
IMA Integrity measurement architecture is enabled.
- IOSCHED More than one I/O scheduler is enabled.
IP_PNP IP DHCP, BOOTP, or RARP is enabled.
IPV6 IPv6 support is enabled.
ISAPNP ISA PnP code is enabled.
@@ -140,7 +141,6 @@ parameter is applicable::
NUMA NUMA support is enabled.
NFS Appropriate NFS support is enabled.
OF Devicetree is enabled.
- OSS OSS sound support is enabled.
PV_OPS A paravirtualized kernel is enabled.
PARIDE The ParIDE (parallel port IDE) subsystem is enabled.
PARISC The PA-RISC architecture is enabled.
@@ -160,7 +160,6 @@ parameter is applicable::
the Documentation/scsi/ sub-directory.
SECURITY Different security models are enabled.
SELINUX SELinux support is enabled.
- APPARMOR AppArmor support is enabled.
SERIAL Serial support is enabled.
SH SuperH architecture is enabled.
SMP The kernel is an SMP kernel.
@@ -168,7 +167,6 @@ parameter is applicable::
SWSUSP Software suspend (hibernation) is enabled.
SUSPEND System suspend states are enabled.
TPM TPM drivers are enabled.
- TS Appropriate touchscreen support is enabled.
UMS USB Mass Storage support is enabled.
USB USB support is enabled.
USBHID USB Human Interface Device support is enabled.
@@ -177,7 +175,6 @@ parameter is applicable::
VGA The VGA console has been enabled.
VT Virtual terminal support is enabled.
WDT Watchdog support is enabled.
- XT IBM PC/XT MFM hard disk support is enabled.
X86-32 X86-32, aka i386 architecture is enabled.
X86-64 X86-64 architecture is enabled.
More X86-64 boot options can be found in
@@ -211,7 +208,7 @@ The number of kernel parameters is not limited, but the length of the
complete command line (parameters including spaces etc.) is limited to
a fixed number of characters. This limit depends on the architecture
and is between 256 and 4096 characters. It is defined in the file
-./include/asm/setup.h as COMMAND_LINE_SIZE.
+./include/uapi/asm-generic/setup.h as COMMAND_LINE_SIZE.
Finally, the [KMG] suffix is commonly described after a number of kernel
parameter values. These 'K', 'M', and 'G' letters represent the _binary_
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 685645e75706..a465d5242774 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -321,6 +321,8 @@
force_enable - Force enable the IOMMU on platforms known
to be buggy with IOMMU enabled. Use this
option with care.
+ pgtbl_v1 - Use v1 page table for DMA-API (Default).
+ pgtbl_v2 - Use v2 page table for DMA-API.
amd_iommu_dump= [HW,X86-64]
Enable AMD IOMMU driver option to dump the ACPI table
@@ -400,6 +402,12 @@
arm64.nomte [ARM64] Unconditionally disable Memory Tagging Extension
support
+ arm64.nosve [ARM64] Unconditionally disable Scalable Vector
+ Extension support
+
+ arm64.nosme [ARM64] Unconditionally disable Scalable Matrix
+ Extension support
+
ataflop= [HW,M68k]
atarimouse= [HW,MOUSE] Atari Mouse
@@ -461,6 +469,12 @@
Format: <io>,<irq>,<mode>
See header of drivers/net/hamradio/baycom_ser_hdx.c.
+ bert_disable [ACPI]
+ Disable BERT OS support on buggy BIOSes.
+
+ bgrt_disable [ACPI][X86]
+ Disable BGRT to avoid flickering OEM logo.
+
blkdevparts= Manual partition parsing of block device(s) for
embedded devices based on command line input.
See Documentation/block/cmdline-partition.rst
@@ -476,12 +490,6 @@
See Documentation/admin-guide/bootconfig.rst
- bert_disable [ACPI]
- Disable BERT OS support on buggy BIOSes.
-
- bgrt_disable [ACPI][X86]
- Disable BGRT to avoid flickering OEM logo.
-
bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards)
bttv.radio= Most important insmod options are available as
kernel args too.
@@ -550,7 +558,7 @@
nosocket -- Disable socket memory accounting.
nokmem -- Disable kernel memory accounting.
- checkreqprot [SELINUX] Set initial checkreqprot flag value.
+ checkreqprot= [SELINUX] Set initial checkreqprot flag value.
Format: { "0" | "1" }
See security/selinux/Kconfig help text.
0 -- check protection applied by kernel (includes
@@ -563,6 +571,25 @@
cio_ignore= [S390]
See Documentation/s390/common_io.rst for details.
+
+ clearcpuid=X[,X...] [X86]
+ Disable CPUID feature X for the kernel. See
+ arch/x86/include/asm/cpufeatures.h for the valid bit
+ numbers X. Note the Linux-specific bits are not necessarily
+ stable across kernel versions, but the vendor-specific
+ ones should be.
+ X can also be a string as appearing in the flags: line
+ in /proc/cpuinfo which does not have the above
+ instability issue. However, not all features have names
+ in /proc/cpuinfo.
+ Note that using this option will taint your kernel.
+ Also note that user programs calling CPUID directly
+ or using the feature without checking anything
+ will still see it. This just prevents it from
+ being used by the kernel or shown in /proc/cpuinfo.
+ Also note the kernel might malfunction if you disable
+ some critical bits.
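+ A quick illustrative example ("avx" is just one
+ possible name from the flags: line):
+ clearcpuid=avx
+ boots with the AVX feature bit hidden from the
+ kernel and from /proc/cpuinfo.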
+
clk_ignore_unused
[CLK]
Prevents the clock framework from automatically gating
@@ -612,8 +639,8 @@
clocksource.max_cswd_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
- unstable. Defaults to three retries, that is,
- four attempts to read the clock under test.
+ unstable. Defaults to two retries, that is,
+ three attempts to read the clock under test.
clocksource.verify_n_cpus= [KNL]
Limit the number of CPUs checked for clocksources
@@ -631,19 +658,6 @@
Defaults to zero when built as a module and to
10 seconds when built into the kernel.
- clearcpuid=BITNUM[,BITNUM...] [X86]
- Disable CPUID feature X for the kernel. See
- arch/x86/include/asm/cpufeatures.h for the valid bit
- numbers. Note the Linux specific bits are not necessarily
- stable over kernel options, but the vendor specific
- ones should be.
- Also note that user programs calling CPUID directly
- or using the feature without checking anything
- will still see it. This just prevents it from
- being used by the kernel or shown in /proc/cpuinfo.
- Also note the kernel might malfunction if you disable
- some critical bits.
-
cma=nn[MG]@[start[MG][-end[MG]]]
[KNL,CMA]
Sets the size of kernel global memory area for
@@ -724,6 +738,12 @@
hvc<n> Use the hypervisor console device <n>. This is for
both Xen and PowerPC hypervisors.
+ { null | "" }
+ Use to disable console output, i.e., to have kernel
+ console messages discarded.
+ This must be the only console= parameter used on the
+ kernel command line.
+
If the device connected to the port is not a TTY but a braille
device, prepend "brl," before the device type, for instance
console=brl,ttyS0
@@ -759,6 +779,24 @@
0: default value, disable debugging
1: enable debugging at boot time
+ cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
+ Format:
+ <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
+
+ cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
+ CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
+ Some features depend on CPU0. Known dependencies are:
+ 1. Resume from suspend/hibernate depends on CPU0.
+ Suspend/hibernate will fail if CPU0 is offline and you
+ need to online CPU0 before suspend/hibernate.
+ 2. PIC interrupts also depend on CPU0. CPU0 can't be
+ removed if a PIC interrupt is detected.
+ Poweroff/reboot is also said to depend on CPU0 on some
+ machines, although no such issues have been seen so far
+ on a few tested machines with CPU0 offline.
+ If the dependencies are under your control, you can
+ turn on cpu0_hotplug.
+
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system
@@ -779,9 +817,13 @@
on every CPU online, such as boot, and resume from suspend.
Default: 10000
- cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
- Format:
- <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
+ crash_kexec_post_notifiers
+ Run kdump after running panic-notifiers and dumping
+ kmsg. This is only for users who doubt that kdump always
+ succeeds in any situation.
+ Note that this also increases the risk of kdump failure,
+ because some panic notifiers can make the crashed
+ kernel more unstable.
crashkernel=size[KMG][@offset[KMG]]
[KNL] Using kexec, Linux can switch to a 'crash kernel'
@@ -802,7 +844,7 @@
Documentation/admin-guide/kdump/kdump.rst for an example.
crashkernel=size[KMG],high
- [KNL, X86-64] range could be above 4G. Allow kernel
+ [KNL, X86-64, ARM64] range could be above 4G. Allow kernel
to allocate physical memory region from top, so could
be above 4G if system have more than 4G ram installed.
Otherwise memory region will be allocated below 4G, if
@@ -815,14 +857,20 @@
that require some amount of low memory, e.g. swiotlb
requires at least 64M+32K low memory, also enough extra
low memory is needed to make sure DMA buffers for 32-bit
- devices won't run out. Kernel would try to allocate at
+ devices won't run out. Kernel would try to allocate
at least 256M below 4G automatically.
- This one let user to specify own low range under 4G
+ This one lets the user specify their own low range under 4G
for second kernel instead.
0: to disable low allocation.
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
+ [KNL, ARM64] range in low memory.
+ This one lets the user specify a low range in the
+ DMA zone for the crash dump kernel.
+ It will be ignored when crashkernel=X,high is not used
+ or memory reserved is located in the DMA zones.
+
cryptomgr.notests
[KNL] Disable crypto self-tests
@@ -920,10 +968,6 @@
debugpat [X86] Enable PAT debugging
- decnet.addr= [HW,NET]
- Format: <area>[,<node>]
- See also Documentation/networking/decnet.rst.
-
default_hugepagesz=
[HW] The size of the default HugeTLB page. This is
the size represented by the legacy /proc/ hugepages
@@ -939,11 +983,39 @@
[KNL] Debugging option to set a timeout in seconds for
deferred probe to give up waiting on dependencies to
probe. Only specific dependencies (subsystems or
- drivers) that have opted in will be ignored. A timeout of 0
- will timeout at the end of initcalls. This option will also
+ drivers) that have opted in will be ignored. A timeout
+ of 0 expires at the end of initcalls. If the timeout
+ hasn't expired, it is restarted by each
+ successful driver registration. This option will also
dump out devices still on the deferred probe list after
retrying.
+ delayacct [KNL] Enable per-task delay accounting
+
+ dell_smm_hwmon.ignore_dmi=
+ [HW] Continue probing hardware even if DMI data
+ indicates that the driver is running on unsupported
+ hardware.
+
+ dell_smm_hwmon.force=
+ [HW] Activate driver even if SMM BIOS signature does
+ not match list of supported models and enable otherwise
+ blacklisted features.
+
+ dell_smm_hwmon.power_status=
+ [HW] Report power status in /proc/i8k
+ (disabled by default).
+
+ dell_smm_hwmon.restricted=
+ [HW] Allow controlling fans only if SYS_ADMIN
+ capability is set.
+
+ dell_smm_hwmon.fan_mult=
+ [HW] Factor to multiply fan speed with.
+
+ dell_smm_hwmon.fan_max=
+ [HW] Maximum configurable fan speed.
+
dfltcc= [HW,S390]
Format: { on | off | def_only | inf_only | always }
on: s390 zlib hardware support for compression on
@@ -973,17 +1045,6 @@
disable= [IPV6]
See Documentation/networking/ipv6.rst.
- hardened_usercopy=
- [KNL] Under CONFIG_HARDENED_USERCOPY, whether
- hardening is enabled for this boot. Hardened
- usercopy checking is used to protect the kernel
- from reading or writing beyond known memory
- allocation boundaries as a proactive defense
- against bounds-checking flaws in the kernel's
- copy_to_user()/copy_from_user() interface.
- on Perform hardened usercopy checks (default).
- off Disable hardened usercopy checks.
-
disable_radix [PPC]
Disable RADIX MMU mode on POWER9
@@ -1046,7 +1107,10 @@
driver later using sysfs.
driver_async_probe= [KNL]
- List of driver names to be probed asynchronously.
+ List of driver names to be probed asynchronously. *
+ matches all driver names. If * is specified, the
+ rest of the listed driver names are those that will NOT
+ match the * (i.e. they are probed synchronously).
Format: <driver_name1>,<driver_name2>...
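+ A sketch (the driver names are only examples):
+ driver_async_probe=nvme,mlx4_core
+ probes just those two drivers asynchronously, while
+ driver_async_probe=*,ahci
+ probes every driver asynchronously except ahci.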
drm.edid_firmware=[<connector>:]<file>[,[<connector>:]<file>]
@@ -1092,8 +1156,12 @@
nopku [X86] Disable Memory Protection Keys CPU feature found
in some Intel CPUs.
- <module>.async_probe [KNL]
- Enable asynchronous probe on this module.
+ <module>.async_probe[=<bool>] [KNL]
+ If no <bool> value is specified or if the value
+ specified is not a valid <bool>, enable asynchronous
+ probe on this module. Otherwise, enable/disable
+ asynchronous probe on this module as indicated by the
+ <bool> value. See also: module.async_probe
early_ioremap_debug [KNL]
Enable debug messages in early_ioremap support. This
@@ -1252,7 +1320,7 @@
Append ",keep" to not disable it when the real console
takes over.
- Only one of vga, efi, serial, or usb debug port can
+ Only one of vga, serial, or usb debug port can
be used at a time.
Currently only ttyS0 and ttyS1 may be specified by
@@ -1267,7 +1335,7 @@
Interaction with the standard serial driver is not
very good.
- The VGA and EFI output is eventually overwritten by
+ The VGA output is eventually overwritten by
the real console.
The xen option can only be used in Xen domains.
@@ -1286,17 +1354,6 @@
force: enforce the use of EDAC to report H/W event.
default: on.
- ekgdboc= [X86,KGDB] Allow early kernel console debugging
- ekgdboc=kbd
-
- This is designed to be used in conjunction with
- the boot argument: earlyprintk=vga
-
- This parameter works in place of the kgdboc parameter
- but can only be used if the backing tty is available
- very early in the boot process. For early debugging
- via a serial port see kgdboc_earlycon instead.
-
edd= [EDD]
Format: {"off" | "on" | "skip[mbr]"}
@@ -1358,6 +1415,17 @@
eisa_irq_edge= [PARISC,HW]
See header of drivers/parisc/eisa.c.
+ ekgdboc= [X86,KGDB] Allow early kernel console debugging
+ Format: ekgdboc=kbd
+
+ This is designed to be used in conjunction with
+ the boot argument: earlyprintk=vga
+
+ This parameter works in place of the kgdboc parameter
+ but can only be used if the backing tty is available
+ very early in the boot process. For early debugging
+ via a serial port see kgdboc_earlycon instead.
+
elanfreq= [X86-32]
See comment before function elanfreq_setup() in
arch/x86/kernel/cpu/cpufreq/elanfreq.c.
@@ -1379,7 +1447,7 @@
(in particular on some ATI chipsets).
The kernel tries to set a reasonable default.
- enforcing [SELINUX] Set initial enforcing status.
+ enforcing= [SELINUX] Set initial enforcing status.
Format: {"0" | "1"}
See security/selinux/Kconfig help text.
0 -- permissive (log only, no denials).
@@ -1401,6 +1469,14 @@
Permit 'security.evm' to be updated regardless of
current integrity status.
+ early_page_ext [KNL] Enforces page_ext initialization to earlier
+ stages so that more early boot allocations are covered.
+ Please note that as a side effect some optimizations
+ might be disabled to achieve that (e.g. parallelized
+ memory initialization is disabled) so the boot process
+ might take longer, especially on systems with a lot of
+ memory. Available with CONFIG_PAGE_EXTENSION=y.
+
failslab=
fail_usercopy=
fail_page_alloc=
@@ -1435,6 +1511,14 @@
as early as possible in order to facilitate early
boot debugging.
+ ftrace_boot_snapshot
+ [FTRACE] On boot up, a snapshot will be taken of the
+ ftrace ring buffer that can be read at:
+ /sys/kernel/tracing/snapshot.
+ This is useful if you need tracing information from kernel
+ boot up that is likely to be overridden by user space
+ start up functionality.
+
ftrace_dump_on_oops[=orig_cpu]
[FTRACE] will dump the trace buffers on oops.
If no parameter is passed, ftrace will dump
@@ -1548,6 +1632,17 @@
Format: <unsigned int> such that (rxsize & ~0x1fffc0) == 0.
Default: 1024
+ hardened_usercopy=
+ [KNL] Under CONFIG_HARDENED_USERCOPY, whether
+ hardening is enabled for this boot. Hardened
+ usercopy checking is used to protect the kernel
+ from reading or writing beyond known memory
+ allocation boundaries as a proactive defense
+ against bounds-checking flaws in the kernel's
+ copy_to_user()/copy_from_user() interface.
+ on Perform hardened usercopy checks (default).
+ off Disable hardened usercopy checks.
+
hardlockup_all_cpu_backtrace=
[KNL] Should the hard-lockup detector generate
backtraces on all cpus.
@@ -1568,6 +1663,15 @@
corresponding firmware-first mode error processing
logic will be disabled.
+ hibernate= [HIBERNATION]
+ noresume Don't check if there's a hibernation image
+ present during boot.
+ nocompress Don't compress/decompress hibernation images.
+ no Disable hibernation and resume.
+ protect_image Turn on image protection during restoration
+ (that will set all pages holding image data
+ during restoration read-only).
+
highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact
size of <nn>. This works even on boxes that have no
highmem otherwise. This also works to reduce highmem
@@ -1579,6 +1683,19 @@
hlt [BUGS=ARM,SH]
+ hostname= [KNL] Set the hostname (aka UTS nodename).
+ Format: <string>
+ This allows setting the system's hostname during early
+ startup. This sets the name returned by gethostname.
+ Using this parameter to set the hostname makes it
+ possible to ensure the hostname is correctly set before
+ any userspace processes run, avoiding the possibility
+ that a process may call gethostname before the hostname
+ has been explicitly set, resulting in the calling
+ process getting an incorrect result. The string must
+ not exceed the maximum allowed hostname length (usually
+ 64 characters) and will be truncated otherwise.
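+ A sketch (the name is arbitrary):
+ hostname=buildhost-42
+ makes gethostname() return "buildhost-42" from the
+ first userspace process onward.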
+
hpet= [X86-32,HPET] option to control HPET usage
Format: { enable (default) | disable | force |
verbose }
@@ -1590,16 +1707,6 @@
hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET
registers. Default set by CONFIG_HPET_MMAP_DEFAULT.
- hugetlb_cma= [HW,CMA] The size of a CMA area used for allocation
- of gigantic hugepages. Or using node format, the size
- of a CMA area per node can be specified.
- Format: nn[KMGTPE] or (node format)
- <node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
-
- Reserve a CMA area of given size and allocate gigantic
- hugepages using the CMA allocator. If enabled, the
- boot-time allocation of gigantic hugepages is skipped.
-
hugepages= [HW] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated.
@@ -1621,22 +1728,35 @@
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
+ hugetlb_cma= [HW,CMA] The size of a CMA area used for allocation
+ of gigantic hugepages. Or using node format, the size
+ of a CMA area per node can be specified.
+ Format: nn[KMGTPE] or (node format)
+ <node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
+
+ Reserve a CMA area of given size and allocate gigantic
+ hugepages using the CMA allocator. If enabled, the
+ boot-time allocation of gigantic hugepages is skipped.
+
hugetlb_free_vmemmap=
- [KNL] Reguires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+ [KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
enabled.
+ Control if HugeTLB Vmemmap Optimization (HVO) is enabled.
Allows heavy hugetlb users to free up some more
- memory (6 * PAGE_SIZE for each 2MB hugetlb page).
+ memory (7 * PAGE_SIZE for each 2MB hugetlb page).
Format: { on | off (default) }
- on: enable the feature
- off: disable the feature
+ on: enable HVO
+ off: disable HVO
- Built with CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON=y,
+ Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
the default is on.
- This is not compatible with memory_hotplug.memmap_on_memory.
- If both parameters are enabled, hugetlb_free_vmemmap takes
- precedence over memory_hotplug.memmap_on_memory.
+ Note that the vmemmap pages may be allocated from the added
+ memory block itself when memory_hotplug.memmap_on_memory is
+ enabled; those vmemmap pages cannot be optimized even if this
+ feature is enabled. Other vmemmap pages not allocated from
+ the added memory block itself are not affected.
hung_task_panic=
[KNL] Should the hung task detector generate panics.
@@ -1703,17 +1823,6 @@
i810= [HW,DRM]
- i8k.ignore_dmi [HW] Continue probing hardware even if DMI data
- indicates that the driver is running on unsupported
- hardware.
- i8k.force [HW] Activate i8k driver even if SMM BIOS signature
- does not match list of supported models.
- i8k.power_status
- [HW] Report power status in /proc/i8k
- (disabled by default)
- i8k.restricted [HW] Allow controlling fans only if SYS_ADMIN
- capability is set.
-
i915.invert_brightness=
[DRM] Invert the sense of the variable that is used to
set the brightness of the panel backlight. Normally a
@@ -1731,26 +1840,6 @@
icn= [HW,ISDN]
Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]]
- ide-core.nodma= [HW] (E)IDE subsystem
- Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc
- .vlb_clock .pci_clock .noflush .nohpa .noprobe .nowerr
- .cdrom .chs .ignore_cable are additional options
- See Documentation/ide/ide.rst.
-
- ide-generic.probe-mask= [HW] (E)IDE subsystem
- Format: <int>
- Probe mask for legacy ISA IDE ports. Depending on
- platform up to 6 ports are supported, enabled by
- setting corresponding bits in the mask to 1. The
- default value is 0x0, which has a special meaning.
- On systems that have PCI, it triggers scanning the
- PCI bus for the first and the second port, which
- are then probed. On systems without PCI the value
- of 0x0 enables probing the two first ports as if it
- was 0x3.
-
- ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem
- Claim all unknown PCI IDE storage controllers.
idle= [X86]
Format: idle=poll, idle=halt, idle=nomwait
@@ -1876,7 +1965,8 @@
ima_template= [IMA]
Select one of defined IMA measurements template formats.
- Formats: { "ima" | "ima-ng" | "ima-sig" }
+ Formats: { "ima" | "ima-ng" | "ima-ngv2" | "ima-sig" |
+ "ima-sigv2" }
Default: "ima-ng"
ima_template_fmt=
@@ -2208,23 +2298,39 @@
ivrs_ioapic [HW,X86-64]
Provide an override to the IOAPIC-ID<->DEVICE-ID
- mapping provided in the IVRS ACPI table. For
- example, to map IOAPIC-ID decimal 10 to
- PCI device 00:14.0 write the parameter as:
+ mapping provided in the IVRS ACPI table.
+ By default, PCI segment is 0, and can be omitted.
+ For example:
+ * To map IOAPIC-ID decimal 10 to PCI device 00:14.0
+ write the parameter as:
ivrs_ioapic[10]=00:14.0
+ * To map IOAPIC-ID decimal 10 to PCI segment 0x1 and
+ PCI device 00:14.0 write the parameter as:
+ ivrs_ioapic[10]=0001:00:14.0
ivrs_hpet [HW,X86-64]
Provide an override to the HPET-ID<->DEVICE-ID
- mapping provided in the IVRS ACPI table. For
- example, to map HPET-ID decimal 0 to
- PCI device 00:14.0 write the parameter as:
+ mapping provided in the IVRS ACPI table.
+ By default, PCI segment is 0, and can be omitted.
+ For example:
+ * To map HPET-ID decimal 0 to PCI device 00:14.0
+ write the parameter as:
ivrs_hpet[0]=00:14.0
+ * To map HPET-ID decimal 10 to PCI segment 0x1 and
+ PCI device 00:14.0 write the parameter as:
+ ivrs_hpet[10]=0001:00:14.0
ivrs_acpihid [HW,X86-64]
Provide an override to the ACPI-HID:UID<->DEVICE-ID
- mapping provided in the IVRS ACPI table. For
- example, to map UART-HID:UID AMD0020:0 to
- PCI device 00:14.5 write the parameter as:
+ mapping provided in the IVRS ACPI table.
+
+ For example, to map UART-HID:UID AMD0020:0 to
+ PCI segment 0x1 and PCI device ID 00:14.5,
+ write the parameter as:
+ ivrs_acpihid[0001:00:14.5]=AMD0020:0
+
+ By default, PCI segment is 0, and can be omitted.
+ For example, for PCI device 00:14.5 in segment 0, write the parameter as:
ivrs_acpihid[00:14.5]=AMD0020:0
js= [HW,JOY] Analog joystick
@@ -2336,16 +2442,43 @@
0: force disabled
1: force enabled
+ kunit.enable= [KUNIT] Enable executing KUnit tests. Requires
+ CONFIG_KUNIT to be set to be fully enabled. The
+ default value can be overridden via
+ KUNIT_DEFAULT_ENABLED.
+ Default is 1 (enabled)
+
kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
Default is 0 (don't ignore, but inject #GP)
+ kvm.eager_page_split=
+ [KVM,X86] Controls whether or not KVM will try to
+ proactively split all huge pages during dirty logging.
+ Eager page splitting reduces interruptions to vCPU
+ execution by eliminating the write-protection faults
+ and MMU lock contention that would otherwise be
+ required to split huge pages lazily.
+
+ VM workloads that rarely perform writes or that write
+ only to a small region of VM memory may benefit from
+ disabling eager page splitting to allow huge pages to
+ still be used for reads.
+
+ The behavior of eager page splitting depends on whether
+ KVM_DIRTY_LOG_INITIALLY_SET is enabled or disabled. If
+ disabled, all huge pages in a memslot will be eagerly
+ split when dirty logging is enabled on that memslot. If
+ enabled, eager page splitting will be performed during
+ the KVM_CLEAR_DIRTY_LOG ioctl, and only for the pages being
+ cleared.
+
+ Eager page splitting is only supported when kvm.tdp_mmu=Y.
+
+ Default is Y (on).
+
kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
Default is false (don't support).
- kvm.mmu_audit= [KVM] This is a R/W parameter which allows audit
- KVM MMU at runtime.
- Default is 0 (off)
-
kvm.nx_huge_pages=
[KVM] Controls the software workaround for the
X86_BUG_ITLB_MULTIHIT bug.
@@ -2389,7 +2522,6 @@
protected: nVHE-based mode with support for guests whose
state is kept private from the host.
- Not valid if the kernel is running in EL2.
Defaults to VHE/nVHE based on hardware support. Setting
mode to "protected" will disable kexec and hibernation
@@ -2573,14 +2705,14 @@
when set.
Format: <int>
- libata.force= [LIBATA] Force configurations. The format is comma-
- separated list of "[ID:]VAL" where ID is
- PORT[.DEVICE]. PORT and DEVICE are decimal numbers
- matching port, link or device. Basically, it matches
- the ATA ID string printed on console by libata. If
- the whole ID part is omitted, the last PORT and DEVICE
- values are used. If ID hasn't been specified yet, the
- configuration applies to all ports, links and devices.
+ libata.force= [LIBATA] Force configurations. The format is a comma-
+ separated list of "[ID:]VAL" where ID is PORT[.DEVICE].
+ PORT and DEVICE are decimal numbers matching port, link
+ or device. Basically, it matches the ATA ID string
+ printed on console by libata. If the whole ID part is
+ omitted, the last PORT and DEVICE values are used. If
+ ID hasn't been specified yet, the configuration applies
+ to all ports, links and devices.
If only DEVICE is omitted, the parameter applies to
the port and all links and devices behind it. DEVICE
@@ -2590,7 +2722,7 @@
host link and device attached to it.
The VAL specifies the configuration to force. As long
- as there's no ambiguity shortcut notation is allowed.
+ as there is no ambiguity, shortcut notation is allowed.
For example, both 1.5 and 1.5G would work for 1.5Gbps.
The following configurations can be forced.
@@ -2603,27 +2735,64 @@
udma[/][16,25,33,44,66,100,133] notation is also
allowed.
+ * nohrst, nosrst, norst: suppress hard, soft and both
+ resets.
+
+ * rstonce: only attempt one reset during hot-unplug
+ link recovery.
+
+ * [no]dbdelay: Enable or disable the extra 200ms delay
+ before debouncing a link PHY and device presence
+ detection.
+
* [no]ncq: Turn on or off NCQ.
- * [no]ncqtrim: Turn off queued DSM TRIM.
+ * [no]ncqtrim: Enable or disable queued DSM TRIM.
+
+ * [no]ncqati: Enable or disable NCQ trim on ATI chipset.
+
+ * [no]trim: Enable or disable (unqueued) TRIM.
+
+ * trim_zero: Indicate that TRIM command zeroes data.
- * nohrst, nosrst, norst: suppress hard, soft
- and both resets.
+ * max_trim_128m: Set 128M maximum trim size limit.
- * rstonce: only attempt one reset during
- hot-unplug link recovery
+ * [no]dma: Turn on or off DMA transfers.
- * dump_id: dump IDENTIFY data.
+ * atapi_dmadir: Enable ATAPI DMADIR bridge support.
- * atapi_dmadir: Enable ATAPI DMADIR bridge support
+ * atapi_mod16_dma: Enable the use of ATAPI DMA for
+ commands that are not a multiple of 16 bytes.
+
+ * [no]dmalog: Enable or disable the use of the
+ READ LOG DMA EXT command to access logs.
+
+ * [no]iddevlog: Enable or disable access to the
+ identify device data log.
+
+ * [no]logdir: Enable or disable access to the general
+ purpose log directory.
+
+ * max_sec_128: Set transfer size limit to 128 sectors.
+
+ * max_sec_1024: Set or clear transfer size limit to
+ 1024 sectors.
+
+ * max_sec_lba48: Set or clear transfer size limit to
+ 65535 sectors.
+
+ * [no]lpm: Enable or disable link power management.
+
+ * [no]setxfer: Indicate if transfer speed mode setting
+ should be skipped.
+
+ * dump_id: Dump IDENTIFY data.
* disable: Disable this device.
If there are multiple matching configurations changing
the same attribute, the last one is used.
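+ An illustrative combination (port/device numbers are
+ arbitrary): libata.force=3.0G,1.00:noncq first limits
+ all ports, links and devices to 3.0Gbps (no ID given
+ yet), then disables NCQ on device 0 of port 1.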
- memblock=debug [KNL] Enable memblock debug messages.
-
load_ramdisk= [RAM] [Deprecated]
lockd.nlm_grace_period=P [NFS] Assign grace period.
@@ -2765,7 +2934,7 @@
different yeeloong laptops.
Example: machtype=lemote-yeeloong-2f-7inch
- max_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory greater
+ max_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory greater
than or equal to this physical address is ignored.
maxcpus= [SMP] Maximum number of processors that an SMP kernel
@@ -2827,6 +2996,9 @@
For details see: Documentation/admin-guide/hw-vuln/mds.rst
+ mem=nn[KMG] [HEXAGON] Set the memory size.
+ Must be specified, otherwise memory size will be 0.
+
mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory
Amount of memory to be used in cases as follows:
@@ -2834,6 +3006,13 @@
2 when the kernel is not able to see the whole system memory;
3 memory that lies after 'mem=' boundary is excluded from
the hypervisor, then assigned to KVM guests.
+ 4 to limit the memory available for kdump kernel.
+
+ [ARC,MICROBLAZE] - the limit applies only to low memory,
+ high memory is not affected.
+
+ [ARM64] - only limits memory covered by the linear
+ mapping. The NOMAP regions are not affected.
[X86] Work as limiting max address. Use together
with memmap= to avoid physical address space collisions.
@@ -2844,9 +3023,19 @@
in above case 3, memory may need be hot added after boot
if system memory of hypervisor is not sufficient.
+ mem=nn[KMG]@ss[KMG]
+ [ARM,MIPS] - override the memory layout reported by
+ firmware.
+ Define a memory region of size nn[KMG] starting at
+ ss[KMG].
+ Multiple different regions can be specified with
+ multiple mem= parameters on the command line.
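+
+			For example (illustrative addresses and sizes):
+			mem=256M@0x20000000 mem=256M@0x30000000
+			defines two 256MB regions starting at 0x20000000
+			and 0x30000000 respectively.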
+
mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel
memory.
+ memblock=debug [KNL] Enable memblock debug messages.
+
memchunk=nn[KMG]
[KNL,SH] Allow user to override the default size for
per-device physically contiguous DMA buffers.
@@ -2932,10 +3121,12 @@
[KNL,X86,ARM] Boolean flag to enable this feature.
Format: {on | off (default)}
When enabled, runtime hotplugged memory will
- allocate its internal metadata (struct pages)
- from the hotadded memory which will allow to
- hotadd a lot of memory without requiring
- additional memory to do so.
+			allocate its internal metadata (struct pages;
+			those vmemmap pages cannot be optimized even
+			if hugetlb_free_vmemmap is enabled) from the
+			hotadded memory, which allows hotadding a
+			lot of memory without requiring additional
+			memory to do so.
This feature is disabled by default because it
has some implication on large (e.g. GB)
allocations in some configurations (e.g. small
@@ -2945,10 +3136,6 @@
Note that even when enabled, there are a few cases where
the feature is not effective.
- This is not compatible with hugetlb_free_vmemmap. If
- both parameters are enabled, hugetlb_free_vmemmap takes
- precedence over memory_hotplug.memmap_on_memory.
-
memtest= [KNL,X86,ARM,M68K,PPC,RISCV] Enable memtest
Format: <integer>
default : 0 <disable>
@@ -2967,7 +3154,7 @@
mem_encrypt=on: Activate SME
mem_encrypt=off: Do not activate SME
- Refer to Documentation/virt/kvm/amd-memory-encryption.rst
+ Refer to Documentation/virt/kvm/x86/amd-memory-encryption.rst
for details on when memory encryption can be activated.
mem_sleep_default= [SUSPEND] Default system suspend mode:
@@ -2990,7 +3177,7 @@
mga= [HW,DRM]
- min_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory below this
+ min_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory below this
physical address is ignored.
mini2440= [ARM,HW,KNL]
@@ -3025,19 +3212,23 @@
improves system performance, but it may also
expose users to several CPU vulnerabilities.
Equivalent to: nopti [X86,PPC]
- kpti=0 [ARM64]
+ if nokaslr then kpti=0 [ARM64]
nospectre_v1 [X86,PPC]
nobp=0 [S390]
nospectre_v2 [X86,PPC,S390,ARM64]
spectre_v2_user=off [X86]
spec_store_bypass_disable=off [X86,PPC]
ssbd=force-off [ARM64]
+ nospectre_bhb [ARM64]
l1tf=off [X86]
mds=off [X86]
tsx_async_abort=off [X86]
kvm.nx_huge_pages=off [X86]
+ srbds=off [X86,INTEL]
no_entry_flush [PPC]
no_uaccess_flush [PPC]
+ mmio_stale_data=off [X86]
+ retbleed=off [X86]
Exceptions:
This does not have any effect on
@@ -3059,6 +3250,8 @@
Equivalent to: l1tf=flush,nosmt [X86]
mds=full,nosmt [X86]
tsx_async_abort=full,nosmt [X86]
+ mmio_stale_data=full,nosmt [X86]
+ retbleed=auto,nosmt [X86]
mminit_loglevel=
[KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
@@ -3068,6 +3261,49 @@
log everything. Information is printed at KERN_DEBUG
so loglevel=8 may also need to be specified.
+ mmio_stale_data=
+ [X86,INTEL] Control mitigation for the Processor
+ MMIO Stale Data vulnerabilities.
+
+ Processor MMIO Stale Data is a class of
+ vulnerabilities that may expose data after an MMIO
+ operation. Exposed data could originate or end in
+ the same CPU buffers as affected by MDS and TAA.
+ Therefore, similar to MDS and TAA, the mitigation
+ is to clear the affected CPU buffers.
+
+ This parameter controls the mitigation. The
+ options are:
+
+ full - Enable mitigation on vulnerable CPUs
+
+ full,nosmt - Enable mitigation and disable SMT on
+ vulnerable CPUs.
+
+ off - Unconditionally disable mitigation
+
+			On machines affected by MDS or TAA, specifying
+			mmio_stale_data=off alone may have no effect,
+			because these vulnerabilities are mitigated with
+			the same mechanism. To disable this mitigation on
+			such machines, mds=off and tsx_async_abort=off
+			need to be specified as well.
+
+ Not specifying this option is equivalent to
+ mmio_stale_data=full.
+
+ For details see:
+ Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
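+
+			For example, on a machine also affected by MDS and
+			TAA, the following (illustrative) combination fully
+			disables the shared buffer-clearing mitigation:
+			mds=off tsx_async_abort=off mmio_stale_data=off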
+
+ module.async_probe=<bool>
+ [KNL] When set to true, modules will use async probing
+ by default. To enable/disable async probing for a
+ specific module, use the module specific control that
+ is documented under <module>.async_probe. When both
+ module.async_probe and <module>.async_probe are
+ specified, <module>.async_probe takes precedence for
+ the specific module.
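+
+			For example, assuming a hypothetical module named
+			"mydrv", "module.async_probe=1 mydrv.async_probe=0"
+			enables async probing for all modules except mydrv.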
+
module.sig_enforce
[KNL] When CONFIG_MODULE_SIG is set, this means that
modules without (valid) signatures will fail to load.
@@ -3114,20 +3350,6 @@
mtdparts= [MTD]
See drivers/mtd/parsers/cmdlinepart.c
- multitce=off [PPC] This parameter disables the use of the pSeries
- firmware feature for updating multiple TCE entries
- at a time.
-
- onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration
-
- Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]
-
- boundary - index of last SLC block on Flex-OneNAND.
- The remaining blocks are configured as MLC blocks.
- lock - Configure if Flex-OneNAND boundary should be locked.
- Once locked, the boundary cannot be changed.
- 1 indicates lock status, 0 indicates unlock status.
-
mtdset= [ARM]
ARM/S3C2412 JIVE boot control
@@ -3154,6 +3376,10 @@
Used for mtrr cleanup. It is spare mtrr entries number.
Set to 2 or more if your graphical card needs more.
+ multitce=off [PPC] This parameter disables the use of the pSeries
+ firmware feature for updating multiple TCE entries
+ at a time.
+
n2= [NET] SDL Inc. RISCom/N2 synchronous serial card
netdev= [NET] Network devices parameters
@@ -3163,6 +3389,11 @@
This usage is only documented in each driver source
file if at all.
+ netpoll.carrier_timeout=
+ [NET] Specifies amount of time (in seconds) that
+ netpoll should wait for a carrier. By default netpoll
+ waits 4 seconds.
+
nf_conntrack.acct=
[NETFILTER] Enable connection tracking flow accounting
0 to disable accounting
@@ -3313,11 +3544,6 @@
These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls.
- netpoll.carrier_timeout=
- [NET] Specifies amount of time (in seconds) that
- netpoll should wait for a carrier. By default netpoll
- waits 4 seconds.
-
no387 [BUGS=X86-32] Tells the kernel to use the 387 maths
emulation library even if a 387 maths coprocessor
is present.
@@ -3367,15 +3593,8 @@
noautogroup Disable scheduler automatic task group creation.
- nobats [PPC] Do not use BATs for mapping kernel lowmem
- on "Classic" PPC cores.
-
nocache [ARM]
- noclflush [BUGS=X86] Don't use the CLFLUSH instruction
-
- delayacct [KNL] Enable per-task delay accounting
-
nodsp [SH] Disable hardware DSP at boot time.
noefi Disable EFI runtime services support.
@@ -3384,16 +3603,11 @@
noexec [IA-64]
- noexec [X86]
- On X86-32 available only on PAE configured kernels.
- noexec=on: enable non-executable mappings (default)
- noexec=off: disable non-executable mappings
-
- nosmap [X86,PPC]
+ nosmap [PPC]
Disable SMAP (Supervisor Mode Access Prevention)
even if it is supported by processor.
- nosmep [X86,PPC]
+ nosmep [PPC64s]
Disable SMEP (Supervisor Mode Execution Prevention)
even if it is supported by processor.
@@ -3412,7 +3626,7 @@
nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings.
- nohugevmalloc [PPC] Disable kernel huge vmalloc mappings.
+ nohugevmalloc [KNL,X86,PPC,ARM64] Disable kernel huge vmalloc mappings.
nosmt [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
@@ -3425,11 +3639,15 @@
(bounds check bypass). With this option data leaks are
possible in the system.
- nospectre_v2 [X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for
+ nospectre_v2 [X86,PPC_E500,ARM64] Disable all mitigations for
the Spectre variant 2 (indirect branch prediction)
vulnerability. System may allow data leaks with this
option.
+ nospectre_bhb [ARM64] Disable all mitigations for Spectre-BHB (branch
+ history injection) vulnerability. System may allow data leaks
+ with this option.
+
nospec_store_bypass_disable
[HW] Disable all mitigations for the Speculative Store Bypass vulnerability
@@ -3485,8 +3703,7 @@
difficult since unequal pointers can no longer be
compared. However, if this command-line option is
specified, then all normal pointers will have their true
- value printed. Pointers printed via %pK may still be
- hashed. This option should only be specified when
+ value printed. This option should only be specified when
debugging the kernel. Please do not use on production
kernels.
@@ -3506,6 +3723,9 @@
just as if they had also been called out in the
rcu_nocbs= boot parameter.
+ Note that this argument takes precedence over
+ the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
+
noiotrap [SH] Disables trapped I/O port accesses.
noirqdebug [X86-32] Disables the code which attempts to detect and
@@ -3538,17 +3758,14 @@
[X86,PV_OPS] Disable paravirtualized VMware scheduler
clock and use the default one.
- no-steal-acc [X86,PV_OPS,ARM64] Disable paravirtualized steal time
- accounting. steal time is computed, but won't
- influence scheduler behaviour
+ no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES] Disable paravirtualized
+ steal time accounting. steal time is computed, but
+ won't influence scheduler behaviour
nolapic [X86-32,APIC] Do not enable or use the local APIC.
nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
- noltlbs [PPC] Do not use large page/tlb entries for kernel
- lowmem mapping on PPC40x and PPC8xx
-
nomca [IA-64] Disable machine check abort handling
nomce [X86-32] Disable Machine Check Exception
@@ -3580,11 +3797,6 @@
noreplace-smp [X86-32,SMP] Don't replace SMP instructions
with UP alternatives
- nordrand [X86] Disable kernel use of the RDRAND and
- RDSEED instructions even if they are supported
- by the processor. RDRAND and RDSEED are still
- available to user space applications.
-
noresume [SWSUSP] Disables resume and restores original swap
space.
@@ -3594,8 +3806,6 @@
nosbagart [IA-64]
- nosep [BUGS=X86-32] Disables x86 SYSENTER/SYSEXIT support.
-
nosgx [X86-64,SGX] Disables Intel SGX kernel support.
nosmp [SMP] Tells an SMP kernel to act as a UP kernel,
@@ -3612,19 +3822,9 @@
nox2apic [X86-64,APIC] Do not enable x2APIC mode.
- cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
- CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
- Some features depend on CPU0. Known dependencies are:
- 1. Resume from suspend/hibernate depends on CPU0.
- Suspend/hibernate will fail if CPU0 is offline and you
- need to online CPU0 before suspend/hibernate.
- 2. PIC interrupts also depend on CPU0. CPU0 can't be
- removed if a PIC interrupt is detected.
- It's said poweroff/reboot may depend on CPU0 on some
- machines although I haven't seen such issues so far
- after CPU0 is offline on a few tested machines.
- If the dependencies are under your control, you can
- turn on cpu0_hotplug.
+ NOTE: this parameter will be ignored on systems with the
+ LEGACY_XAPIC_DISABLED bit set in the
+ IA32_XAPIC_DISABLE_STATUS MSR.
nps_mtm_hs_ctr= [KNL,ARC]
This parameter sets the maximum duration, in
@@ -3678,6 +3878,16 @@
For example, to override I2C bus2:
omap_mux=i2c2_scl.i2c2_scl=0x100,i2c2_sda.i2c2_sda=0x100
+ onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration
+
+ Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]
+
+ boundary - index of last SLC block on Flex-OneNAND.
+ The remaining blocks are configured as MLC blocks.
+ lock - Configure if Flex-OneNAND boundary should be locked.
+ Once locked, the boundary cannot be changed.
+ 1 indicates lock status, 0 indicates unlock status.
+
oops=panic Always panic on oopses. Default is to just kill the
process, but there is a small probability of
deadlocking the machine.
@@ -3726,6 +3936,11 @@
bit 3: print locks info if CONFIG_LOCKDEP is on
bit 4: print ftrace buffer
bit 5: print all printk messages in buffer
+ bit 6: print all CPUs backtrace (if available in the arch)
+			*Be aware* that this option may print a _lot_ of lines,
+			so there is a risk of losing older messages in the log.
+			Use this option carefully; it may be worth setting up a
+			bigger log buffer with "log_buf_len" along with this.
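+			For example (illustrative values), bit 6 alone is
+			panic_print=0x40; combining it with a larger buffer
+			could look like: panic_print=0x40 log_buf_len=16M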
panic_on_taint= Bitmask for conditionally calling panic() in add_taint()
Format: <hex>[,nousertaint]
@@ -3743,14 +3958,6 @@
panic_on_warn panic() instead of WARN(). Useful to cause kdump
on a WARN().
- crash_kexec_post_notifiers
- Run kdump after running panic-notifiers and dumping
- kmsg. This only for the users who doubt kdump always
- succeeds in any situation.
- Note that this also increases risks of kdump failure,
- because some panic notifiers can make the crashed
- kernel more unstable.
-
parkbd.port= [HW] Parallel port number the keyboard adapter is
connected to, default is 0.
Format: <parport#>
@@ -3995,6 +4202,15 @@
please report a bug.
nocrs [X86] Ignore PCI host bridge windows from ACPI.
If you need to use this, please report a bug.
+ use_e820 [X86] Use E820 reservations to exclude parts of
+ PCI host bridge windows. This is a workaround
+ for BIOS defects in host bridge _CRS methods.
+ If you need to use this, please report a bug to
+ <linux-pci@vger.kernel.org>.
+ no_e820 [X86] Ignore E820 reservations for PCI host
+ bridge windows. This is the default on modern
+ hardware. If you need to use this, please report
+ a bug to <linux-pci@vger.kernel.org>.
routeirq Do IRQ routing for all PCI devices.
This is normally done in pci_enable_device(),
so this option is a temporary workaround
@@ -4166,6 +4382,14 @@
Override pmtimer IOPort with a hex value.
e.g. pmtmr=0x508
+ pmu_override= [PPC] Override the PMU.
+ This option takes over the PMU facility, so it is no
+ longer usable by perf. Setting this option starts the
+ PMU counters by setting MMCR0 to 0 (the FC bit is
+ cleared). If a number is given, then MMCR1 is set to
+ that number, otherwise (e.g., 'pmu_override=on'), MMCR1
+ remains 0.
+
pm_debug_messages [SUSPEND,KNL]
Enable suspend/resume debug messages during boot up.
@@ -4348,6 +4572,12 @@
fully seed the kernel's CRNG. Default is controlled
by CONFIG_RANDOM_TRUST_CPU.
+ random.trust_bootloader={on,off}
+ [KNL] Enable or disable trusting the use of a
+ seed passed by the bootloader (if available) to
+ fully seed the kernel's CRNG. Default is controlled
+ by CONFIG_RANDOM_TRUST_BOOTLOADER.
+
randomize_kstack_offset=
[KNL] Enable or disable kernel stack offset
randomization, which provides roughly 5 bits of
@@ -4390,6 +4620,9 @@
no-callback mode from boot but the mode may be
toggled at runtime via cpusets.
+ Note that this argument takes precedence over
+ the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
+
rcu_nocb_poll [KNL]
Rather than requiring that offloaded CPUs
(specified by rcu_nocbs= above) explicitly
@@ -4496,6 +4729,36 @@
(the least-favored priority). Otherwise, when
RCU_BOOST is not set, valid values are 0-99 and
the default is zero (non-realtime operation).
+ When RCU_NOCB_CPU is set, also adjust the
+ priority of NOCB callback kthreads.
+
+ rcutree.rcu_divisor= [KNL]
+ Set the shift-right count to use to compute
+ the callback-invocation batch limit bl from
+ the number of callbacks queued on this CPU.
+ The result will be bounded below by the value of
+ the rcutree.blimit kernel parameter. Every bl
+ callbacks, the softirq handler will exit in
+ order to allow the CPU to do other work.
+
+ Please note that this callback-invocation batch
+ limit applies only to non-offloaded callback
+ invocation. Offloaded callbacks are instead
+			invoked in the context of an rcuoc kthread, which the
+			scheduler will preempt as it does any other task.
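+
+			As a worked example (illustrative numbers), with
+			rcutree.rcu_divisor=7 and 4096 callbacks queued,
+			bl = 4096 >> 7 = 32; with rcutree.blimit=100, bl
+			would instead be raised to 100.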
+
+ rcutree.nocb_nobypass_lim_per_jiffy= [KNL]
+ On callback-offloaded (rcu_nocbs) CPUs,
+ RCU reduces the lock contention that would
+ otherwise be caused by callback floods through
+ use of the ->nocb_bypass list. However, in the
+ common non-flooded case, RCU queues directly to
+ the main ->cblist in order to avoid the extra
+ overhead of the ->nocb_bypass list and its lock.
+ But if there are too many callbacks queued during
+ a single jiffy, RCU pre-queues the callbacks into
+ the ->nocb_bypass queue. The definition of "too
+ many" is supplied by this kernel boot parameter.
rcutree.rcu_nocb_gp_stride= [KNL]
Set the number of NOCB callback kthreads in
@@ -4806,6 +5069,18 @@
rcupdate.rcu_cpu_stall_timeout= [KNL]
Set timeout for RCU CPU stall warning messages.
+ The value is in seconds and the maximum allowed
+ value is 300 seconds.
+
+ rcupdate.rcu_exp_cpu_stall_timeout= [KNL]
+ Set timeout for expedited RCU CPU stall warning
+ messages. The value is in milliseconds
+ and the maximum allowed value is 21000
+ milliseconds. Please note that this value is
+ adjusted to an arch timer tick resolution.
+ Setting this to zero causes the value from
+ rcupdate.rcu_cpu_stall_timeout to be used (after
+ conversion from seconds to milliseconds).
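+
+			For example (illustrative values):
+			rcupdate.rcu_cpu_stall_timeout=60
+			rcupdate.rcu_exp_cpu_stall_timeout=5000
+			sets normal stall warnings to 60 seconds and
+			expedited stall warnings to 5000 milliseconds.
+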
rcupdate.rcu_expedited= [KNL]
Use expedited grace-period primitives, for
@@ -4868,10 +5143,34 @@
number avoids disturbing real-time workloads,
but lengthens grace periods.
+ rcupdate.rcu_task_stall_info= [KNL]
+ Set initial timeout in jiffies for RCU task stall
+ informational messages, which give some indication
+ of the problem for those not patient enough to
+ wait for ten minutes. Informational messages are
+ only printed prior to the stall-warning message
+ for a given grace period. Disable with a value
+ less than or equal to zero. Defaults to ten
+ seconds. A change in value does not take effect
+ until the beginning of the next grace period.
+
+ rcupdate.rcu_task_stall_info_mult= [KNL]
+ Multiplier for time interval between successive
+ RCU task stall informational messages for a given
+ RCU tasks grace period. This value is clamped
+ to one through ten, inclusive. It defaults to
+ the value three, so that the first informational
+ message is printed 10 seconds into the grace
+ period, the second at 40 seconds, the third at
+ 160 seconds, and then the stall warning at 600
+ seconds would prevent a fourth at 640 seconds.
+
rcupdate.rcu_task_stall_timeout= [KNL]
- Set timeout in jiffies for RCU task stall warning
- messages. Disable with a value less than or equal
- to zero.
+ Set timeout in jiffies for RCU task stall
+ warning messages. Disable with a value less
+ than or equal to zero. Defaults to ten minutes.
+ A change in value does not take effect until
+ the beginning of the next grace period.
rcupdate.rcu_self_test= [KNL]
Run the RCU early boot self tests
@@ -4990,17 +5289,45 @@
Useful for devices that are detected asynchronously
(e.g. USB and MMC devices).
- hibernate= [HIBERNATION]
- noresume Don't check if there's a hibernation image
- present during boot.
- nocompress Don't compress/decompress hibernation images.
- no Disable hibernation and resume.
- protect_image Turn on image protection during restoration
- (that will set all pages holding image data
- during restoration read-only).
-
retain_initrd [RAM] Keep initrd memory after extraction
+ retbleed= [X86] Control mitigation of RETBleed (Arbitrary
+ Speculative Code Execution with Return Instructions)
+ vulnerability.
+
+ AMD-based UNRET and IBPB mitigations alone do not stop
+ sibling threads from influencing the predictions of other
+			sibling threads. For that reason, STIBP is used on
+			processors that support it; on processors that don't,
+			SMT is mitigated instead.
+
+ off - no mitigation
+			auto          - automatically select a mitigation
+ auto,nosmt - automatically select a mitigation,
+ disabling SMT if necessary for
+ the full mitigation (only on Zen1
+ and older without STIBP).
+ ibpb - On AMD, mitigate short speculation
+ windows on basic block boundaries too.
+ Safe, highest perf impact. It also
+ enables STIBP if present. Not suitable
+ on Intel.
+ ibpb,nosmt - Like "ibpb" above but will disable SMT
+ when STIBP is not available. This is
+ the alternative for systems which do not
+ have STIBP.
+ unret - Force enable untrained return thunks,
+ only effective on AMD f15h-f17h based
+ systems.
+ unret,nosmt - Like unret, but will disable SMT when STIBP
+ is not available. This is the alternative for
+ systems which do not have STIBP.
+
+ Selecting 'auto' will choose a mitigation method at run
+ time according to the CPU.
+
+ Not specifying this option is equivalent to retbleed=auto.
+
rfkill.default_state=
0 "airplane mode". All wifi, bluetooth, wimax, gps, fm,
etc. communication is blocked by default.
@@ -5025,6 +5352,8 @@
rodata= [KNL]
on Mark read-only kernel memory as read-only (default).
off Leave read-only kernel memory writable for debugging.
+ full Mark read-only kernel memory and aliases as read-only
+ [arm64]
rockchip.usb_uart
Enable the uart passthrough on the designated usb port
@@ -5221,6 +5550,8 @@
serialnumber [BUGS=X86-32]
+ sev=option[,option...] [X86-64] See Documentation/x86/x86_64/boot-options.rst
+
shapers= [NET]
Maximal number of shapers.
@@ -5244,7 +5575,7 @@
cache (risks via metadata attacks are mostly
unchanged). Debug options disable merging on their
own.
- For more information see Documentation/vm/slub.rst.
+ For more information see Documentation/mm/slub.rst.
slab_max_order= [MM, SLAB]
Determines the maximum allowed order for slabs.
@@ -5258,13 +5589,13 @@
slub_debug can create guard zones around objects and
may poison objects when not in use. Also tracks the
last alloc / free. For more information see
- Documentation/vm/slub.rst.
+ Documentation/mm/slub.rst.
slub_max_order= [MM, SLUB]
Determines the maximum allowed order for slabs.
A high setting may cause OOMs due to memory
fragmentation. For more information see
- Documentation/vm/slub.rst.
+ Documentation/mm/slub.rst.
slub_min_objects= [MM, SLUB]
The minimum number of objects per slab. SLUB will
@@ -5273,12 +5604,12 @@
the number of objects indicated. The higher the number
of objects the smaller the overhead of tracking slabs
and the less frequently locks need to be acquired.
- For more information see Documentation/vm/slub.rst.
+ For more information see Documentation/mm/slub.rst.
slub_min_order= [MM, SLUB]
Determines the minimum page order for slabs. Must be
lower than slub_max_order.
- For more information see Documentation/vm/slub.rst.
+ For more information see Documentation/mm/slub.rst.
slub_merge [MM, SLUB]
Same with slab_merge.
@@ -5290,6 +5621,17 @@
smart2= [HW]
Format: <io1>[,<io2>[,...,<io8>]]
+ smp.csd_lock_timeout= [KNL]
+ Specify the period of time in milliseconds
+ that smp_call_function() and friends will wait
+ for a CPU to release the CSD lock. This is
+ useful when diagnosing bugs involving CPUs
+ disabling interrupts for extended periods
+ of time. Defaults to 5,000 milliseconds, and
+ setting a value of zero disables this feature.
+ This feature may be more efficiently disabled
+ using the csdlock_debug- kernel parameter.
+
smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices
smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port
smsc-ircc2.ircc_sir= [HW] SIR base I/O port
@@ -5301,7 +5643,7 @@
1: Fast pin select (default)
2: ATC IRMode
- smt [KNL,S390] Set the maximum number of threads (logical
+ smt= [KNL,S390] Set the maximum number of threads (logical
CPUs) to use per physical CPU on systems capable of
symmetric multithreading (SMT). Will be capped to the
actual hardware limit.
@@ -5353,8 +5695,13 @@
Specific mitigations can also be selected manually:
retpoline - replace indirect branches
- retpoline,generic - google's original retpoline
- retpoline,amd - AMD-specific minimal thunk
+ retpoline,generic - Retpolines
+ retpoline,lfence - LFENCE; indirect branch
+ retpoline,amd - alias for retpoline,lfence
+ eibrs - enhanced IBRS
+ eibrs,retpoline - enhanced IBRS + Retpolines
+ eibrs,lfence - enhanced IBRS + LFENCE
+ ibrs - use IBRS to protect kernel
Not specifying this option is equivalent to
spectre_v2=auto.
@@ -5517,6 +5864,30 @@
off: Disable mitigation and remove
performance impact to RDRAND and RDSEED
+ srcutree.big_cpu_lim [KNL]
+ Specifies the number of CPUs constituting a
+ large system, such that srcu_struct structures
+ should immediately allocate an srcu_node array.
+ This kernel-boot parameter defaults to 128,
+			but takes effect only when the low-order four
+			bits of srcutree.convert_to_big are equal to 3
+			(decide at boot).
+
+ srcutree.convert_to_big [KNL]
+ Specifies under what conditions an SRCU tree
+ srcu_struct structure will be converted to big
+ form, that is, with an rcu_node tree:
+
+ 0: Never.
+ 1: At init_srcu_struct() time.
+ 2: When rcutorture decides to.
+ 3: Decide at boot time (default).
+ 0x1X: Above plus if high contention.
+
+ Either way, the srcu_node tree will be sized based
+ on the actual runtime number of CPUs (nr_cpu_ids)
+ instead of the compile-time CONFIG_NR_CPUS.
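+
+			For example, srcutree.convert_to_big=0x13 keeps the
+			default boot-time decision and additionally converts
+			an srcu_struct to big form on high contention.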
+
srcutree.counter_wrap_check [KNL]
Specifies how frequently to check for
grace-period sequence counter wrap for the
@@ -5534,6 +5905,32 @@
expediting. Set to zero to disable automatic
expediting.
+ srcutree.srcu_max_nodelay [KNL]
+ Specifies the number of no-delay instances
+ per jiffy for which the SRCU grace period
+ worker thread will be rescheduled with zero
+			delay. Beyond this limit, the worker thread will
+			be rescheduled with a sleep delay of one jiffy.
+
+ srcutree.srcu_max_nodelay_phase [KNL]
+			Specifies the number of non-sleeping polls of
+			readers per grace-period phase. Beyond this limit,
+			the grace-period worker thread will be rescheduled
+			with a sleep delay of one jiffy between each
+			rescan of the readers for that grace-period phase.
+
+ srcutree.srcu_retry_check_delay [KNL]
+ Specifies number of microseconds of non-sleeping
+ delay between each non-sleeping poll of readers.
+
+ srcutree.small_contention_lim [KNL]
+			Specifies the number of update-side contention
+			events per jiffy that will be tolerated before
+ initiating a conversion of an srcu_struct
+ structure to big form. Note that the value of
+ srcutree.convert_to_big must have the 0x10 bit
+ set for contention-based conversions to occur.
+
ssbd= [ARM64,HW]
Speculative Store Bypass Disable control
@@ -5652,14 +6049,12 @@
This parameter controls use of the Protected
Execution Facility on pSeries.
- swapaccount=[0|1]
- [KNL] Enable accounting of swap in memory resource
- controller if no parameter or 1 is given or disable
- it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
-
swiotlb= [ARM,IA-64,PPC,MIPS,X86]
- Format: { <int> | force | noforce }
+ Format: { <int> [,<int>] | force | noforce }
<int> -- Number of I/O TLB slabs
+ <int> -- Second integer after comma. Number of swiotlb
+ areas with their own lock. Will be rounded up
+ to a power of 2.
force -- force using of bounce buffers even if they
wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)
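
			For example (illustrative values), swiotlb=65536,4
			requests 65536 I/O TLB slabs split into 4 areas,
			each with its own lock.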
@@ -5699,7 +6094,8 @@
tdfx= [HW,DRM]
- test_suspend= [SUSPEND][,N]
+ test_suspend= [SUSPEND]
+ Format: { "mem" | "standby" | "freeze" }[,N]
Specify "mem" (for Suspend-to-RAM) or "standby" (for
standby suspend) or "freeze" (for suspend type freeze)
as the system sleep state during system startup with
@@ -5783,32 +6179,7 @@
This will guarantee that all the other pcrs
are saved.
- trace_buf_size=nn[KMG]
- [FTRACE] will set tracing buffer size on each cpu.
-
- trace_event=[event-list]
- [FTRACE] Set and start specified trace events in order
- to facilitate early boot debugging. The event-list is a
- comma-separated list of trace events to enable. See
- also Documentation/trace/events.rst
-
- trace_options=[option-list]
- [FTRACE] Enable or disable tracer options at boot.
- The option-list is a comma delimited list of options
- that can be enabled or disabled just as if you were
- to echo the option name into
-
- /sys/kernel/debug/tracing/trace_options
-
- For example, to enable stacktrace option (to dump the
- stack trace of each event), add to the command line:
-
- trace_options=stacktrace
-
- See also Documentation/trace/ftrace.rst "trace options"
- section.
-
- tp_printk[FTRACE]
+ tp_printk [FTRACE]
Have the tracepoints sent to printk as well as the
tracing ring buffer. This is useful for early boot up
where the system hangs or reboots and does not give the
@@ -5830,7 +6201,7 @@
frequency tracepoints such as irq or sched, can cause
the system to live lock.
- tp_printk_stop_on_boot[FTRACE]
+ tp_printk_stop_on_boot [FTRACE]
When tp_printk (above) is set, it can cause a lot of noise
on the console. It may be useful to only include the
printing of events during boot up, as user space may
@@ -5839,6 +6210,53 @@
This command line option will stop the printing of events
to console at the late_initcall_sync() time frame.
+ trace_buf_size=nn[KMG]
+ [FTRACE] will set tracing buffer size on each cpu.
+
+ trace_clock= [FTRACE] Set the clock used for tracing events
+ at boot up.
+ local - Use the per CPU time stamp counter
+ (converted into nanoseconds). Fast, but
+ depending on the architecture, may not be
+ in sync between CPUs.
+			global - Event time stamps are synchronized across
+ CPUs. May be slower than the local clock,
+ but better for some race conditions.
+ counter - Simple counting of events (1, 2, ..)
+ note, some counts may be skipped due to the
+ infrastructure grabbing the clock more than
+ once per event.
+ uptime - Use jiffies as the time stamp.
+ perf - Use the same clock that perf uses.
+ mono - Use ktime_get_mono_fast_ns() for time stamps.
+ mono_raw - Use ktime_get_raw_fast_ns() for time
+ stamps.
+ boot - Use ktime_get_boot_fast_ns() for time stamps.
+ Architectures may add more clocks. See
+ Documentation/trace/ftrace.rst for more details.
+
+ trace_event=[event-list]
+ [FTRACE] Set and start specified trace events in order
+ to facilitate early boot debugging. The event-list is a
+ comma-separated list of trace events to enable. See
+ also Documentation/trace/events.rst
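+
+			For example, to enable the (illustrative) scheduler
+			events at boot:
+			trace_event=sched:sched_switch,sched:sched_wakeup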
+
+ trace_options=[option-list]
+ [FTRACE] Enable or disable tracer options at boot.
+ The option-list is a comma delimited list of options
+ that can be enabled or disabled just as if you were
+ to echo the option name into
+
+ /sys/kernel/debug/tracing/trace_options
+
+ For example, to enable stacktrace option (to dump the
+ stack trace of each event), add to the command line:
+
+ trace_options=stacktrace
+
+ See also Documentation/trace/ftrace.rst "trace options"
+ section.
+
traceoff_on_warning
[FTRACE] enable this option to disable tracing when a
warning is hit. This turns off "tracing_on". Tracing can
@@ -5867,11 +6285,22 @@
sources:
- "tpm"
- "tee"
+ - "caam"
If not specified then it defaults to iterating through
the trust source list starting with TPM and assigns the
first trust source as a backend which is initialized
successfully during iteration.
+ trusted.rng= [KEYS]
+ Format: <string>
+ The RNG used to generate key material for trusted keys.
+ Can be one of:
+ - "kernel"
+ - the same value as trusted.source: "tpm" or "tee"
+ - "default"
+ If not specified, "default" is used. In this case,
+ the RNG's choice is left to each individual trust source.
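+
+			For example, to use the TPM as the trust source but
+			the kernel RNG for key material (illustrative
+			combination): trusted.source=tpm trusted.rng=kernel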
+
tsc= Disable clocksource stability checks for TSC.
Format: <string>
[x86] reliable: mark tsc clocksource as reliable, this
@@ -6179,7 +6608,7 @@
HIGHMEM regardless of setting
of CONFIG_HIGHPTE.
- vdso= [X86,SH]
+ vdso= [X86,SH,SPARC]
On X86_32, this is an alias for vdso32=. Otherwise:
vdso=1: enable VDSO (the default)
@@ -6205,11 +6634,12 @@
video= [FB] Frame buffer configuration
See Documentation/fb/modedb.rst.
- video.brightness_switch_enabled= [0,1]
+ video.brightness_switch_enabled= [ACPI]
+ Format: [0|1]
If set to 1, on receiving an ACPI notify event
generated by hotkey, video driver will adjust brightness
level and then send out the event to user space through
- the allocated input device; If set to 0, video driver
+ the allocated input device. If set to 0, video driver
will only send out the event without touching backlight
brightness level.
default: 1
@@ -6421,6 +6851,12 @@
Crash from Xen panic notifier, without executing late
panic() code such as dumping handler.
+ xen_msr_safe= [X86,XEN]
+ Format: <bool>
+ Select whether to always use non-faulting (safe) MSR
+ access functions when running as Xen PV guest. The
+ default value is controlled by CONFIG_XEN_PV_MSR_SAFE.
+
xen_nopvspin [X86,XEN]
Disables the qspinlock slowpath using Xen PV optimizations.
This parameter is obsoleted by "nopvspin" parameter, which
@@ -6494,6 +6930,12 @@
controller on both pseries and powernv
platforms. Only useful on POWER9 and above.
+ xive.store-eoi=off [PPC]
+ By default on POWER10 and above, the kernel will use
+ stores for EOI handling when the XIVE interrupt mode
+ is active. This option allows the XIVE driver to use
+ loads instead, as on POWER9.
+
xhci-hcd.quirks [USB,KNL]
A hex value specifying bitmask with supplemental xhci
host controller quirks. Meaning of each bit can be
diff --git a/Documentation/admin-guide/laptops/lg-laptop.rst b/Documentation/admin-guide/laptops/lg-laptop.rst
index 6fbe165dcd27..67fd6932cef4 100644
--- a/Documentation/admin-guide/laptops/lg-laptop.rst
+++ b/Documentation/admin-guide/laptops/lg-laptop.rst
@@ -38,7 +38,7 @@ FN lock.
Battery care limit
------------------
-Writing 80/100 to /sys/devices/platform/lg-laptop/battery_care_limit
+Writing 80/100 to /sys/class/power_supply/CMB0/charge_control_end_threshold
sets the maximum capacity to charge the battery. Limiting the charge
reduces battery capacity loss over time.
diff --git a/Documentation/admin-guide/media/fimc.rst b/Documentation/admin-guide/media/fimc.rst
index 56b149d9a527..267ef52fe387 100644
--- a/Documentation/admin-guide/media/fimc.rst
+++ b/Documentation/admin-guide/media/fimc.rst
@@ -14,7 +14,7 @@ data from LCD controller (FIMD) through the SoC internal writeback data
path. There are multiple FIMC instances in the SoCs (up to 4), having
slightly different capabilities, like pixel alignment constraints, rotator
availability, LCD writeback support, etc. The driver is located at
-drivers/media/platform/exynos4-is directory.
+drivers/media/platform/samsung/exynos4-is directory.
Supported SoCs
--------------
diff --git a/Documentation/admin-guide/media/i2c-cardlist.rst b/Documentation/admin-guide/media/i2c-cardlist.rst
index db17f39b56cf..ef3b5fff3b01 100644
--- a/Documentation/admin-guide/media/i2c-cardlist.rst
+++ b/Documentation/admin-guide/media/i2c-cardlist.rst
@@ -284,7 +284,7 @@ tda9887 TDA 9885/6/7 analog IF demodulator
tea5761 TEA 5761 radio tuner
tea5767 TEA 5767 radio tuner
tua9001 Infineon TUA9001 silicon tuner
-tuner-xc2028 XCeive xc2028/xc3028 tuners
+xc2028 XCeive xc2028/xc3028 tuners
xc4000 Xceive XC4000 silicon tuner
xc5000 Xceive XC5000 silicon tuner
============ ==================================================
diff --git a/Documentation/admin-guide/media/imx7.rst b/Documentation/admin-guide/media/imx7.rst
index 4785ae8ac978..2fa27718f52a 100644
--- a/Documentation/admin-guide/media/imx7.rst
+++ b/Documentation/admin-guide/media/imx7.rst
@@ -33,7 +33,7 @@ reference manual [#f1]_.
Entities
--------
-imx7-mipi-csi2
+imx-mipi-csi2
--------------
This is the MIPI CSI-2 receiver entity. It has one sink pad to receive the pixel
diff --git a/Documentation/admin-guide/media/omap3isp.rst b/Documentation/admin-guide/media/omap3isp.rst
index bc447bbec7ce..f32e7375a1a2 100644
--- a/Documentation/admin-guide/media/omap3isp.rst
+++ b/Documentation/admin-guide/media/omap3isp.rst
@@ -17,7 +17,7 @@ Introduction
------------
This file documents the Texas Instruments OMAP 3 Image Signal Processor (ISP)
-driver located under drivers/media/platform/omap3isp. The original driver was
+driver located under drivers/media/platform/ti/omap3isp. The original driver was
written by Texas Instruments but since that it has been rewritten (twice) at
Nokia.
diff --git a/Documentation/admin-guide/media/omap4_camera.rst b/Documentation/admin-guide/media/omap4_camera.rst
index 24db4222d36d..2ada9b1e6897 100644
--- a/Documentation/admin-guide/media/omap4_camera.rst
+++ b/Documentation/admin-guide/media/omap4_camera.rst
@@ -25,7 +25,7 @@ As of Revision AB, the ISS is described in detail in section 8.
This driver is supporting **only** the CSI2-A/B interfaces for now.
It makes use of the Media Controller framework [#f2]_, and inherited most of the
-code from OMAP3 ISP driver (found under drivers/media/platform/omap3isp/\*),
+code from OMAP3 ISP driver (found under drivers/media/platform/ti/omap3isp/\*),
except that it doesn't need an IOMMU now for ISS buffers memory mapping.
Supports usage of MMAP buffers only (for now).
diff --git a/Documentation/admin-guide/media/vimc.dot b/Documentation/admin-guide/media/vimc.dot
index 57863a13fa39..92a5bb631235 100644
--- a/Documentation/admin-guide/media/vimc.dot
+++ b/Documentation/admin-guide/media/vimc.dot
@@ -5,18 +5,22 @@ digraph board {
n00000001 [label="{{} | Sensor A\n/dev/v4l-subdev0 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
n00000001:port0 -> n00000005:port0 [style=bold]
n00000001:port0 -> n0000000b [style=bold]
+ n00000001 -> n00000002
+ n00000002 [label="{{} | Lens A\n/dev/v4l-subdev5 | {<port0>}}", shape=Mrecord, style=filled, fillcolor=green]
n00000003 [label="{{} | Sensor B\n/dev/v4l-subdev1 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
n00000003:port0 -> n00000008:port0 [style=bold]
n00000003:port0 -> n0000000f [style=bold]
+ n00000003 -> n00000004
+ n00000004 [label="{{} | Lens B\n/dev/v4l-subdev6 | {<port0>}}", shape=Mrecord, style=filled, fillcolor=green]
n00000005 [label="{{<port0> 0} | Debayer A\n/dev/v4l-subdev2 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
- n00000005:port1 -> n00000017:port0
+ n00000005:port1 -> n00000015:port0
n00000008 [label="{{<port0> 0} | Debayer B\n/dev/v4l-subdev3 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
- n00000008:port1 -> n00000017:port0 [style=dashed]
+ n00000008:port1 -> n00000015:port0 [style=dashed]
n0000000b [label="Raw Capture 0\n/dev/video0", shape=box, style=filled, fillcolor=yellow]
n0000000f [label="Raw Capture 1\n/dev/video1", shape=box, style=filled, fillcolor=yellow]
- n00000013 [label="RGB/YUV Input\n/dev/video2", shape=box, style=filled, fillcolor=yellow]
- n00000013 -> n00000017:port0 [style=dashed]
- n00000017 [label="{{<port0> 0} | Scaler\n/dev/v4l-subdev4 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
- n00000017:port1 -> n0000001a [style=bold]
- n0000001a [label="RGB/YUV Capture\n/dev/video3", shape=box, style=filled, fillcolor=yellow]
+ n00000013 [label="{{} | RGB/YUV Input\n/dev/v4l-subdev4 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000013:port0 -> n00000015:port0 [style=dashed]
+ n00000015 [label="{{<port0> 0} | Scaler\n/dev/v4l-subdev5 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000015:port1 -> n00000018 [style=bold]
+ n00000018 [label="RGB/YUV Capture\n/dev/video2", shape=box, style=filled, fillcolor=yellow]
}
diff --git a/Documentation/admin-guide/media/vimc.rst b/Documentation/admin-guide/media/vimc.rst
index 180507d455f2..3b4d2b36b4f3 100644
--- a/Documentation/admin-guide/media/vimc.rst
+++ b/Documentation/admin-guide/media/vimc.rst
@@ -53,6 +53,25 @@ vimc-sensor:
* 1 Pad source
+vimc-lens:
+ Ancillary lens for a sensor. Supports auto focus control. Linked to
+ a vimc-sensor using an ancillary link. The lens supports FOCUS_ABSOLUTE
+ control.
+
+.. code-block:: bash
+
+ media-ctl -p
+ ...
+ - entity 28: Lens A (0 pad, 0 link)
+ type V4L2 subdev subtype Lens flags 0
+ device node name /dev/v4l-subdev6
+ - entity 29: Lens B (0 pad, 0 link)
+ type V4L2 subdev subtype Lens flags 0
+ device node name /dev/v4l-subdev7
+ v4l2-ctl -d /dev/v4l-subdev7 -C focus_absolute
+ focus_absolute: 0
+
+
vimc-debayer:
Transforms images in bayer format into a non-bayer format.
Exposes:
@@ -76,3 +95,16 @@ vimc-capture:
* 1 Pad sink
* 1 Pad source
+
+Module options
+--------------
+
+Vimc has a module parameter to configure the driver.
+
+* ``allocator=<unsigned int>``
+
+ memory allocator selection, default is 0. It specifies the way buffers
+ will be allocated.
+
+ - 0: vmalloc
+ - 1: dma-contig
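+
+  For example, to load the driver with the dma-contig allocator (assuming
+  vimc is built as a module):
+
+  .. code-block:: bash
+
+     modprobe vimc allocator=1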
diff --git a/Documentation/admin-guide/media/vivid.rst b/Documentation/admin-guide/media/vivid.rst
index 6d7175f96f74..abd90ed31090 100644
--- a/Documentation/admin-guide/media/vivid.rst
+++ b/Documentation/admin-guide/media/vivid.rst
@@ -714,6 +714,20 @@ The Test Pattern Controls are all specific to video capture.
does the same for the EAV (End of Active Video) code.
+- Insert Video Guard Band
+
+ adds 4 columns of pixels with the HDMI Video Guard Band code at the
+ left hand side of the image. This only works with 3 or 4 byte RGB pixel
+ formats. The RGB pixel value 0xab/0x55/0xab turns out to be equivalent
+ to the HDMI Video Guard Band code that precedes each active video line
+ (see section 5.2.2.1 in the HDMI 1.3 Specification). To test if a video
+ receiver has correct HDMI Video Guard Band processing, enable this
+ control and then move the image to the left hand side of the screen.
+ That will result in video lines that start with multiple pixels that
+ have the same value as the Video Guard Band that precedes them.
+  Receivers that just keep skipping Video Guard Band values will now
+  fail and either lose sync or shift these video lines.
+
Capture Feature Selection Controls
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -1304,7 +1318,7 @@ instance. This setup would require the following commands:
$ v4l2-ctl -d2 -i2
$ v4l2-ctl -d2 -c horizontal_movement=4
$ v4l2-ctl -d1 --overlay=1
- $ v4l2-ctl -d1 -c loop_video=1
+ $ v4l2-ctl -d0 -c loop_video=1
$ v4l2-ctl -d2 --stream-mmap --overlay=1
And from another console:
diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst
index 4e06ffabd78a..7367e6294ef6 100644
--- a/Documentation/admin-guide/mm/cma_debugfs.rst
+++ b/Documentation/admin-guide/mm/cma_debugfs.rst
@@ -5,10 +5,10 @@ CMA Debugfs Interface
The CMA debugfs interface is useful to retrieve basic information out of the
different CMA areas and to test allocation/release in each of the areas.
-Each CMA zone represents a directory under <debugfs>/cma/, indexed by the
-kernel's CMA index. So the first CMA zone would be:
+Each CMA area represents a directory under <debugfs>/cma/, represented by
+its CMA name like below:
- <debugfs>/cma/cma-0
+ <debugfs>/cma/<cma_name>
The structure of the files created under that directory is as follows:
@@ -18,8 +18,8 @@ The structure of the files created under that directory is as follows:
- [RO] bitmap: The bitmap of page states in the zone.
- [WO] alloc: Allocate N pages from that CMA area. For example::
- echo 5 > <debugfs>/cma/cma-2/alloc
+ echo 5 > <debugfs>/cma/<cma_name>/alloc
-would try to allocate 5 pages from the cma-2 area.
+would try to allocate 5 pages from the 'cma_name' area.
- [WO] free: Free N pages from that CMA area, similar to the above.
diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst
index b966fcff993b..c79f1e336222 100644
--- a/Documentation/admin-guide/mm/concepts.rst
+++ b/Documentation/admin-guide/mm/concepts.rst
@@ -125,7 +125,7 @@ processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
-:ref:`Documentation/vm/numa.rst <numa>` and in
+:ref:`Documentation/mm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
Page cache
diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst
index 61aff88347f3..33d37bb2fb4e 100644
--- a/Documentation/admin-guide/mm/damon/index.rst
+++ b/Documentation/admin-guide/mm/damon/index.rst
@@ -1,10 +1,10 @@
.. SPDX-License-Identifier: GPL-2.0
-========================
-Monitoring Data Accesses
-========================
+==========================
+DAMON: Data Access MONitor
+==========================
-:doc:`DAMON </vm/damon/index>` allows light-weight data access monitoring.
+:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring.
Using DAMON, users can analyze the memory access patterns of their systems and
optimize those.
@@ -14,3 +14,4 @@ optimize those.
start
usage
reclaim
+ lru_sort
diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst
new file mode 100644
index 000000000000..c09cace80651
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/lru_sort.rst
@@ -0,0 +1,294 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+DAMON-based LRU-lists Sorting
+=============================
+
+DAMON-based LRU-lists Sorting (DAMON_LRU_SORT) is a static kernel module
+aimed at proactive and lightweight data access pattern based
+(de)prioritization of pages on their LRU-lists, for making LRU-lists a more
+trustworthy data access pattern source.
+
+Where Is Proactive LRU-lists Sorting Required?
+==============================================
+
+As page-granularity access checking overhead could be significant on huge
+systems, LRU lists are normally not proactively sorted but partially and
+reactively sorted for special events including specific user requests, system
+calls and memory pressure.  As a result, LRU lists are sometimes not well
+prepared to serve as a trustworthy access pattern source for some situations,
+including selection of reclamation target pages under sudden memory
+pressure.
+
+Because DAMON can identify access patterns of best-effort accuracy while
+inducing only a user-specified range of overhead, proactively running
+DAMON_LRU_SORT could be helpful for making LRU lists a more trustworthy access
+pattern source with low and controlled overhead.
+
+How It Works?
+=============
+
+DAMON_LRU_SORT finds hot pages (pages of memory regions that show access
+rates higher than a user-specified threshold) and cold pages (pages of
+memory regions that show no access for longer than a user-specified
+threshold) using DAMON, and prioritizes hot pages while deprioritizing cold
+pages on their LRU-lists.  To avoid consuming too much CPU for the
+prioritizations, a CPU time usage limit can be configured.  Under the limit,
+it prioritizes and deprioritizes more hot and cold pages first,
+respectively.  System administrators can also configure under what
+situations this scheme should be automatically activated and deactivated
+with three memory pressure watermarks.
+
+Its default parameters for hotness/coldness thresholds and CPU quota limit are
+conservatively chosen.  That is, the module under its default parameters can
+be widely used without harm in common situations, while providing some
+benefit for systems having clear hot/cold access patterns under memory
+pressure, at the cost of only a small, limited portion of CPU time.
+
+Interface: Module Parameters
+============================
+
+To use this feature, you should first ensure your system is running on a kernel
+that is built with ``CONFIG_DAMON_LRU_SORT=y``.
+
+To let sysadmins enable or disable it and tune for the given system,
+DAMON_LRU_SORT utilizes module parameters. That is, you can put
+``damon_lru_sort.<parameter>=<value>`` on the kernel boot command line or write
+proper values to ``/sys/module/damon_lru_sort/parameters/<parameter>`` files.
+
+Below are the descriptions of each parameter.
+
+enabled
+-------
+
+Enable or disable DAMON_LRU_SORT.
+
+You can enable DAMON_LRU_SORT by setting the value of this parameter as ``Y``.
+Setting it as ``N`` disables DAMON_LRU_SORT.  Note that, due to the
+watermarks-based activation condition, DAMON_LRU_SORT could do no real
+monitoring and LRU-lists sorting even when enabled.  Refer to the descriptions
+of the watermark parameters below for this.
+
+commit_inputs
+-------------
+
+Make DAMON_LRU_SORT read the input parameters again, except ``enabled``.
+
+Input parameters that are updated while DAMON_LRU_SORT is running are not
+applied by default.  Once this parameter is set as ``Y``, DAMON_LRU_SORT reads
+values of parameters except ``enabled`` again.  Once the re-reading is done,
+this parameter is set as ``N``.  If invalid parameters are found during the
+re-reading, DAMON_LRU_SORT will be disabled.
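+
+As an illustrative sketch (the parameter name and value below are just an
+example), new values can be applied at runtime like this::
+
+    # echo 20 > /sys/module/damon_lru_sort/parameters/quota_ms
+    # echo Y > /sys/module/damon_lru_sort/parameters/commit_inputs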
+
+hot_thres_access_freq
+---------------------
+
+Access frequency threshold for hot memory regions identification in permil.
+
+If a memory region is accessed at this frequency or higher, DAMON_LRU_SORT
+identifies the region as hot and marks it as accessed on the LRU list, so that
+it could not be reclaimed under memory pressure.  50% by default.
+
+cold_min_age
+------------
+
+Time threshold for cold memory regions identification in microseconds.
+
+If a memory region is not accessed for this time or longer, DAMON_LRU_SORT
+identifies the region as cold and marks it as unaccessed on the LRU list, so
+that it could be reclaimed first under memory pressure.  120 seconds by
+default.
+
+quota_ms
+--------
+
+Limit of time for trying the LRU lists sorting in milliseconds.
+
+DAMON_LRU_SORT tries to use only up to this time within a time window
+(quota_reset_interval_ms) for trying LRU lists sorting. This can be used
+for limiting CPU consumption of DAMON_LRU_SORT. If the value is zero, the
+limit is disabled.
+
+10 ms by default.
+
+quota_reset_interval_ms
+-----------------------
+
+The time quota charge reset interval in milliseconds.
+
+The charge reset interval for the quota of time (quota_ms). That is,
+DAMON_LRU_SORT does not try LRU-lists sorting for more than quota_ms
+milliseconds or quota_sz bytes within quota_reset_interval_ms milliseconds.
+
+1 second by default.
+
+wmarks_interval
+---------------
+
+The watermarks check time interval in microseconds.
+
+Minimal time to wait before checking the watermarks, when DAMON_LRU_SORT is
+enabled but inactive due to its watermarks rule. 5 seconds by default.
+
+wmarks_high
+-----------
+
+Free memory rate (per thousand) for the high watermark.
+
+If free memory of the system in bytes per thousand bytes is higher than this,
+DAMON_LRU_SORT becomes inactive, so it does nothing but periodically check the
+watermarks.  200 (20%) by default.
+
+wmarks_mid
+----------
+
+Free memory rate (per thousand) for the middle watermark.
+
+If free memory of the system in bytes per thousand bytes is between this and
+the low watermark, DAMON_LRU_SORT becomes active, so it starts the monitoring
+and the LRU-lists sorting.  150 (15%) by default.
+
+wmarks_low
+----------
+
+Free memory rate (per thousand) for the low watermark.
+
+If free memory of the system in bytes per thousand bytes is lower than this,
+DAMON_LRU_SORT becomes inactive, so it does nothing but periodically check the
+watermarks.  50 (5%) by default.
+
+sample_interval
+---------------
+
+Sampling interval for the monitoring in microseconds.
+
+The sampling interval of DAMON for the cold memory monitoring. Please refer to
+the DAMON documentation (:doc:`usage`) for more detail. 5ms by default.
+
+aggr_interval
+-------------
+
+Aggregation interval for the monitoring in microseconds.
+
+The aggregation interval of DAMON for the cold memory monitoring. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail. 100ms by
+default.
+
+min_nr_regions
+--------------
+
+Minimum number of monitoring regions.
+
+The minimal number of monitoring regions of DAMON for the cold memory
+monitoring.  This can be used to set a lower bound on the monitoring quality.
+However, setting this too high could result in increased monitoring overhead.
+Please refer to the DAMON documentation (:doc:`usage`) for more detail. 10 by
+default.
+
+max_nr_regions
+--------------
+
+Maximum number of monitoring regions.
+
+The maximum number of monitoring regions of DAMON for the cold memory
+monitoring.  This can be used to set an upper bound on the monitoring
+overhead.  However, setting this too low could result in bad monitoring
+quality.  Please refer to the DAMON documentation (:doc:`usage`) for more
+detail.  1000 by default.
+
+monitor_region_start
+--------------------
+
+Start of target memory region in physical address.
+
+The start physical address of the memory region that DAMON_LRU_SORT will work
+on.  By default, the biggest System RAM region is used.
+
+monitor_region_end
+------------------
+
+End of target memory region in physical address.
+
+The end physical address of the memory region that DAMON_LRU_SORT will work
+on.  By default, the biggest System RAM region is used.
+
+kdamond_pid
+-----------
+
+PID of the DAMON thread.
+
+If DAMON_LRU_SORT is enabled, this becomes the PID of the worker thread. Else,
+-1.
+
+nr_lru_sort_tried_hot_regions
+-----------------------------
+
+Number of hot memory regions that tried to be LRU-sorted.
+
+bytes_lru_sort_tried_hot_regions
+--------------------------------
+
+Total bytes of hot memory regions that tried to be LRU-sorted.
+
+nr_lru_sorted_hot_regions
+-------------------------
+
+Number of hot memory regions that were successfully LRU-sorted.
+
+bytes_lru_sorted_hot_regions
+----------------------------
+
+Total bytes of hot memory regions that were successfully LRU-sorted.
+
+nr_hot_quota_exceeds
+--------------------
+
+Number of times that the time quota limit for hot regions has been exceeded.
+
+nr_lru_sort_tried_cold_regions
+------------------------------
+
+Number of cold memory regions that tried to be LRU-sorted.
+
+bytes_lru_sort_tried_cold_regions
+---------------------------------
+
+Total bytes of cold memory regions that tried to be LRU-sorted.
+
+nr_lru_sorted_cold_regions
+--------------------------
+
+Number of cold memory regions that were successfully LRU-sorted.
+
+bytes_lru_sorted_cold_regions
+-----------------------------
+
+Total bytes of cold memory regions that were successfully LRU-sorted.
+
+nr_cold_quota_exceeds
+---------------------
+
+Number of times that the time quota limit for cold regions has been exceeded.
+
+Example
+=======
+
+The runtime example commands below make DAMON_LRU_SORT find memory regions
+having >=50% access frequency and LRU-prioritize those, while
+LRU-deprioritizing memory regions that are not accessed for 120 seconds.  The
+prioritization and deprioritization are limited to using only up to 1% of CPU
+time, to avoid DAMON_LRU_SORT consuming too much CPU time for the
+(de)prioritization.  They also ask DAMON_LRU_SORT to do nothing if the
+system's free memory rate is more than 50%, but to start the real work if it
+becomes lower than 40%.  If DAMON_LRU_SORT doesn't make progress and
+therefore the free memory rate becomes lower than 20%, they ask
+DAMON_LRU_SORT to do nothing again, so that we can fall back to the LRU-lists
+based page granularity reclamation. ::
+
+    # cd /sys/module/damon_lru_sort/parameters
+ # echo 500 > hot_thres_access_freq
+ # echo 120000000 > cold_min_age
+ # echo 10 > quota_ms
+ # echo 1000 > quota_reset_interval_ms
+ # echo 500 > wmarks_high
+ # echo 400 > wmarks_mid
+ # echo 200 > wmarks_low
+ # echo Y > enabled
diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst
index fb9def3a7355..4f1479a11e63 100644
--- a/Documentation/admin-guide/mm/damon/reclaim.rst
+++ b/Documentation/admin-guide/mm/damon/reclaim.rst
@@ -48,12 +48,6 @@ DAMON_RECLAIM utilizes module parameters. That is, you can put
``damon_reclaim.<parameter>=<value>`` on the kernel boot command line or write
proper values to ``/sys/modules/damon_reclaim/parameters/<parameter>`` files.
-Note that the parameter values except ``enabled`` are applied only when
-DAMON_RECLAIM starts. Therefore, if you want to apply new parameter values in
-runtime and DAMON_RECLAIM is already enabled, you should disable and re-enable
-it via ``enabled`` parameter file. Writing of the new values to proper
-parameter values should be done before the re-enablement.
-
Below are the description of each parameter.
enabled
@@ -66,6 +60,17 @@ Setting it as ``N`` disables DAMON_RECLAIM. Note that DAMON_RECLAIM could do
no real monitoring and reclamation due to the watermarks-based activation
condition. Refer to below descriptions for the watermarks parameter for this.
+commit_inputs
+-------------
+
+Make DAMON_RECLAIM read the input parameters again, except ``enabled``.
+
+Input parameters that are updated while DAMON_RECLAIM is running are not
+applied by default. Once this parameter is set as ``Y``, DAMON_RECLAIM reads
+the values of the parameters except ``enabled`` again. Once the re-reading is
+done, this parameter is set as ``N``. If invalid parameters are found during
+the re-reading, DAMON_RECLAIM will be disabled.
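+
+For example, the below commands, run from the module's parameters directory,
+would lower ``min_age`` to 30 seconds (an illustrative value) at runtime and
+then ask DAMON_RECLAIM to re-read the parameters::
+
+    # echo 30000000 > min_age
+    # echo Y > commit_inputs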
+
min_age
-------
@@ -208,6 +213,31 @@ PID of the DAMON thread.
If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread. Else,
-1.
+nr_reclaim_tried_regions
+------------------------
+
+Number of memory regions that DAMON_RECLAIM has tried to reclaim.
+
+bytes_reclaim_tried_regions
+---------------------------
+
+Total bytes of memory regions that DAMON_RECLAIM has tried to reclaim.
+
+nr_reclaimed_regions
+--------------------
+
+Number of memory regions that were successfully reclaimed by DAMON_RECLAIM.
+
+bytes_reclaimed_regions
+-----------------------
+
+Total bytes of memory regions that were successfully reclaimed by DAMON_RECLAIM.
+
+nr_quota_exceeds
+----------------
+
+Number of times that the time/space quota limits have been exceeded.
+
Example
=======
@@ -232,4 +262,4 @@ granularity reclamation. ::
.. [1] https://research.google/pubs/pub48551/
.. [2] https://lwn.net/Articles/787611/
-.. [3] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html
+.. [3] https://www.kernel.org/doc/html/latest/mm/free_page_reporting.html
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst
index 4d5ca2c46288..9f88afc734da 100644
--- a/Documentation/admin-guide/mm/damon/start.rst
+++ b/Documentation/admin-guide/mm/damon/start.rst
@@ -29,16 +29,9 @@ called DAMON Operator (DAMO). It is available at
https://github.com/awslabs/damo. The examples below assume that ``damo`` is on
your ``$PATH``. It's not mandatory, though.
-Because DAMO is using the debugfs interface (refer to :doc:`usage` for the
-detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as
-below::
-
- # mount -t debugfs none /sys/kernel/debug/
-
-or append the following line to your ``/etc/fstab`` file so that your system
-can automatically mount debugfs upon booting::
-
- debugfs /sys/kernel/debug debugfs defaults 0 0
+Because DAMO is using the sysfs interface of DAMON (refer to :doc:`usage` for
+details), you should ensure :doc:`sysfs </filesystems/sysfs>` is mounted.
Recording Data Access Patterns
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index ed96bbf0daff..b47b0cbbd491 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -4,49 +4,412 @@
Detailed Usages
===============
-DAMON provides below three interfaces for different users.
+DAMON provides the below interfaces for different users.
- *DAMON user space tool.*
- This is for privileged people such as system administrators who want a
- just-working human-friendly interface. Using this, users can use the DAMON’s
- major features in a human-friendly way. It may not be highly tuned for
- special cases, though. It supports both virtual and physical address spaces
- monitoring.
+ `This <https://github.com/awslabs/damo>`_ is for privileged people such as
+ system administrators who want a just-working human-friendly interface.
+  Using this, users can use DAMON’s major features in a human-friendly way.
+ It may not be highly tuned for special cases, though. It supports both
+ virtual and physical address spaces monitoring. For more detail, please
+ refer to its `usage document
+ <https://github.com/awslabs/damo/blob/next/USAGE.md>`_.
+- *sysfs interface.*
+ :ref:`This <sysfs_interface>` is for privileged user space programmers who
+ want more optimized use of DAMON. Using this, users can use DAMON’s major
+ features by reading from and writing to special sysfs files. Therefore,
+  you can write and use your personalized DAMON sysfs wrapper programs that
+  read/write the sysfs files on your behalf. The `DAMON user space tool
+ <https://github.com/awslabs/damo>`_ is one example of such programs. It
+ supports both virtual and physical address spaces monitoring. Note that this
+ interface provides only simple :ref:`statistics <damos_stats>` for the
+ monitoring results. For detailed monitoring results, DAMON provides a
+ :ref:`tracepoint <tracepoint>`.
- *debugfs interface.*
- This is for privileged user space programmers who want more optimized use of
- DAMON. Using this, users can use DAMON’s major features by reading
- from and writing to special debugfs files. Therefore, you can write and use
- your personalized DAMON debugfs wrapper programs that reads/writes the
- debugfs files instead of you. The DAMON user space tool is also a reference
- implementation of such programs. It supports both virtual and physical
- address spaces monitoring.
+ :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
+  <sysfs_interface>`. This will be removed after the next LTS kernel is released,
+ so users should move to the :ref:`sysfs interface <sysfs_interface>`.
- *Kernel Space Programming Interface.*
- This is for kernel space programmers. Using this, users can utilize every
- feature of DAMON most flexibly and efficiently by writing kernel space
- DAMON application programs for you. You can even extend DAMON for various
- address spaces.
+ :doc:`This </mm/damon/api>` is for kernel space programmers. Using this,
+ users can utilize every feature of DAMON most flexibly and efficiently by
+  writing kernel space DAMON application programs. You can even extend
+  DAMON for various address spaces. For details, please refer to the interface
+ :doc:`document </mm/damon/api>`.
-Nevertheless, you could write your own user space tool using the debugfs
-interface. A reference implementation is available at
-https://github.com/awslabs/damo. If you are a kernel programmer, you could
-refer to :doc:`/vm/damon/api` for the kernel space programming interface. For
-the reason, this document describes only the debugfs interface
+.. _sysfs_interface:
+
+sysfs Interface
+===============
+
+DAMON sysfs interface is built when ``CONFIG_DAMON_SYSFS`` is defined. It
+creates multiple directories and files under its sysfs directory,
+``<sysfs>/kernel/mm/damon/``. You can control DAMON by writing to and reading
+from the files under the directory.
+
+For a short example, users can monitor the virtual address space of a given
+workload as below. ::
+
+ # cd /sys/kernel/mm/damon/admin/
+ # echo 1 > kdamonds/nr_kdamonds && echo 1 > kdamonds/0/contexts/nr_contexts
+ # echo vaddr > kdamonds/0/contexts/0/operations
+ # echo 1 > kdamonds/0/contexts/0/targets/nr_targets
+ # echo $(pidof <workload>) > kdamonds/0/contexts/0/targets/0/pid_target
+ # echo on > kdamonds/0/state
+
+Files Hierarchy
+---------------
+
+The files hierarchy of the DAMON sysfs interface is shown below. In the below
+figure, parents-children relations are represented with indentations, each
+directory has a ``/`` suffix, and files in each directory are separated by
+commas (","). ::
+
+ /sys/kernel/mm/damon/admin
+ │ kdamonds/nr_kdamonds
+ │ │ 0/state,pid
+ │ │ │ contexts/nr_contexts
+ │ │ │ │ 0/avail_operations,operations
+ │ │ │ │ │ monitoring_attrs/
+ │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
+ │ │ │ │ │ │ nr_regions/min,max
+ │ │ │ │ │ targets/nr_targets
+ │ │ │ │ │ │ 0/pid_target
+ │ │ │ │ │ │ │ regions/nr_regions
+ │ │ │ │ │ │ │ │ 0/start,end
+ │ │ │ │ │ │ │ │ ...
+ │ │ │ │ │ │ ...
+ │ │ │ │ │ schemes/nr_schemes
+ │ │ │ │ │ │ 0/action
+ │ │ │ │ │ │ │ access_pattern/
+ │ │ │ │ │ │ │ │ sz/min,max
+ │ │ │ │ │ │ │ │ nr_accesses/min,max
+ │ │ │ │ │ │ │ │ age/min,max
+ │ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms
+ │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
+ │ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
+ │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
+ │ │ │ │ │ │ ...
+ │ │ │ │ ...
+ │ │ ...
+
+Root
+----
+
+The root of the DAMON sysfs interface is ``<sysfs>/kernel/mm/damon/``, and it
+has one directory named ``admin``. The directory contains the files for
+privileged user space programs' control of DAMON. User space tools or daemons
+having root permission could use this directory.
+
+kdamonds/
+---------
+
+The monitoring-related information including request specifications and results
+is called a DAMON context. DAMON executes each context with a kernel thread
+called kdamond, and multiple kdamonds could run in parallel.
+
+Under the ``admin`` directory, one directory, ``kdamonds``, which has files for
+controlling the kdamonds, exists. In the beginning, this directory has only
+one file, ``nr_kdamonds``. Writing a number (``N``) to the file creates ``N``
+child directories named ``0`` to ``N-1``. Each directory represents one
+kdamond.
+
+kdamonds/<N>/
+-------------
+
+In each kdamond directory, two files (``state`` and ``pid``) and one directory
+(``contexts``) exist.
+
+Reading ``state`` returns ``on`` if the kdamond is currently running, or
+``off`` if it is not running. Writing ``on`` or ``off`` makes the kdamond
+enter that state. Writing ``commit`` to the ``state`` file makes the kdamond
+read the user inputs in the sysfs files, except the ``state`` file, again.
+Writing ``update_schemes_stats`` to the ``state`` file updates the contents of
+the stats files for each DAMON-based operation scheme of the kdamond. For
+details of the stats, please refer to the
+:ref:`stats section <sysfs_schemes_stats>`.
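+
+For example, the below commands, assuming a kdamond indexed ``0`` is already
+running, would update its sampling interval (the value is only illustrative)
+and commit the change without disabling it::
+
+    # cd /sys/kernel/mm/damon/admin/kdamonds/0
+    # echo 10000 > contexts/0/monitoring_attrs/intervals/sample_us
+    # echo commit > state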
+
+If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
+
+``contexts`` directory contains files for controlling the monitoring contexts
+that this kdamond will execute.
+
+kdamonds/<N>/contexts/
+----------------------
+
+In the beginning, this directory has only one file, ``nr_contexts``. Writing a
+number (``N``) to the file creates ``N`` child directories named
+``0`` to ``N-1``. Each directory represents one monitoring context. At the
+moment, only one context per kdamond is supported, so only ``0`` or ``1`` can
+be written to the file.
+
+contexts/<N>/
+-------------
+
+In each context directory, two files (``avail_operations`` and ``operations``)
+and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
+exist.
+
+DAMON supports multiple types of monitoring operations, including those for
+the virtual address space and the physical address space. You can get the
+list of the available monitoring operations sets on the currently running
+kernel by reading the ``avail_operations`` file. Based on the kernel
+configuration, the file will list some or all of the below keywords.
+
+ - vaddr: Monitor virtual address spaces of specific processes
+ - fvaddr: Monitor fixed virtual address ranges
+ - paddr: Monitor the physical address space of the system
+
+Please refer to :ref:`regions sysfs directory <sysfs_regions>` for detailed
+differences between the operations sets in terms of the monitoring target
+regions.
+
+You can set and get what type of monitoring operations DAMON will use for the
+context by writing one of the keywords listed in the ``avail_operations``
+file to, and reading from, the ``operations`` file.
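+
+For example, the below commands, assuming a kernel built with all the
+monitoring operations sets, would check the available sets and configure the
+context to monitor the physical address space (the listed output is only
+illustrative)::
+
+    # cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0
+    # cat avail_operations
+    vaddr
+    fvaddr
+    paddr
+    # echo paddr > operations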
+
+contexts/<N>/monitoring_attrs/
+------------------------------
+
+Files for specifying attributes of the monitoring including required quality
+and efficiency of the monitoring are in ``monitoring_attrs`` directory.
+Specifically, two directories, ``intervals`` and ``nr_regions`` exist in this
+directory.
+
+Under ``intervals`` directory, three files for DAMON's sampling interval
+(``sample_us``), aggregation interval (``aggr_us``), and update interval
+(``update_us``) exist. You can set and get the values in micro-seconds by
+writing to and reading from the files.
+
+Under the ``nr_regions`` directory, two files for the lower-bound and
+upper-bound of DAMON's monitoring regions (``min`` and ``max``, respectively),
+which control the monitoring overhead, exist. You can set and get the values
+by writing to and reading from the files.
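+
+For example, the below commands would set the sampling, aggregation, and
+update intervals to 5ms, 100ms, and 1s, and allow DAMON to use 10 to 1000
+monitoring regions (the values are only illustrative)::
+
+    # cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/monitoring_attrs
+    # echo 5000 > intervals/sample_us
+    # echo 100000 > intervals/aggr_us
+    # echo 1000000 > intervals/update_us
+    # echo 10 > nr_regions/min
+    # echo 1000 > nr_regions/max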
+
+For more details about the intervals and monitoring regions range, please refer
+to the Design document (:doc:`/mm/damon/design`).
+
+contexts/<N>/targets/
+---------------------
+
+In the beginning, this directory has only one file, ``nr_targets``. Writing a
+number (``N``) to the file creates ``N`` child directories named ``0``
+to ``N-1``. Each directory represents one monitoring target.
+
+targets/<N>/
+------------
+
+In each target directory, one file (``pid_target``) and one directory
+(``regions``) exist.
+
+If you wrote ``vaddr`` to the ``contexts/<N>/operations``, each target should
+be a process. You can specify the process to DAMON by writing the pid of the
+process to the ``pid_target`` file.
+
+.. _sysfs_regions:
+
+targets/<N>/regions
+-------------------
+
+When ``vaddr`` monitoring operations set is being used (``vaddr`` is written to
+the ``contexts/<N>/operations`` file), DAMON automatically sets and updates the
+monitoring target regions so that entire memory mappings of target processes
+can be covered. However, users could want to set the initial monitoring region
+to specific address ranges.
+
+In contrast, DAMON does not automatically set and update the monitoring
+target regions when the ``fvaddr`` or ``paddr`` monitoring operations sets are
+being used (``fvaddr`` or ``paddr`` has been written to
+``contexts/<N>/operations``). Therefore, users should set the monitoring
+target regions by themselves in these cases.
+
+For such cases, users can explicitly set the initial monitoring target regions
+as they want, by writing proper values to the files under this directory.
+
+In the beginning, this directory has only one file, ``nr_regions``. Writing a
+number (``N``) to the file creates ``N`` child directories named ``0``
+to ``N-1``. Each directory represents one initial monitoring target region.
+
+regions/<N>/
+------------
+
+In each region directory, you will find two files (``start`` and ``end``). You
+can set and get the start and end addresses of the initial monitoring target
+region by writing to and reading from the files, respectively.
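+
+For example, the below commands, assuming a target indexed ``0`` exists, would
+set a single initial monitoring region covering the (purely illustrative)
+address range of 4096-8192 of the target::
+
+    # cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/targets/0/regions
+    # echo 1 > nr_regions
+    # echo 4096 > 0/start
+    # echo 8192 > 0/end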
+
+contexts/<N>/schemes/
+---------------------
+
+For usual DAMON-based data access aware memory management optimizations, users
+would normally want the system to apply a memory management action to a memory
+region of a specific access pattern. DAMON receives such formalized operation
+schemes from the user and applies those to the target memory regions. Users
+can get and set the schemes by reading from and writing to files under this
+directory.
+
+In the beginning, this directory has only one file, ``nr_schemes``. Writing a
+number (``N``) to the file creates ``N`` child directories named ``0``
+to ``N-1``. Each directory represents one DAMON-based operation scheme.
+
+schemes/<N>/
+------------
+
+In each scheme directory, four directories (``access_pattern``, ``quotas``,
+``watermarks``, and ``stats``) and one file (``action``) exist.
+
+The ``action`` file is for setting and getting what action you want to apply to
+memory regions having a specific access pattern of interest. The keywords
+that can be written to and read from the file and their meanings are as below.
+
+ - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``
+ - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``
+ - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``
+ - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``
+ - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
+ - ``lru_prio``: Prioritize the region on its LRU lists.
+ - ``lru_deprio``: Deprioritize the region on its LRU lists.
+ - ``stat``: Do nothing but count the statistics
+
+schemes/<N>/access_pattern/
+---------------------------
+
+The target access pattern of each DAMON-based operation scheme is constructed
+with three ranges including the size of the region in bytes, number of
+monitored accesses per aggregate interval, and number of aggregate intervals
+for the age of the region.
+
+Under the ``access_pattern`` directory, three directories (``sz``,
+``nr_accesses``, and ``age``) each having two files (``min`` and ``max``)
+exist. You can set and get the access pattern for the given scheme by writing
+to and reading from the ``min`` and ``max`` files under ``sz``,
+``nr_accesses``, and ``age`` directories, respectively.
+
+schemes/<N>/quotas/
+-------------------
+
+The optimal ``target access pattern`` for each ``action`` is workload
+dependent, so it is not easy to find. Worse yet, setting a scheme's action too
+aggressively can cause severe overhead. To avoid such overhead, users can
+limit the time and size quota for each scheme. In detail, users can ask DAMON
+to try to use only up to a specific amount of time (``time quota``) for
+applying the action, and to apply the action to only up to a specific amount
+(``size quota``) of memory regions having the target access pattern within a
+given time interval (``reset interval``).
+
+When the quota limit is expected to be exceeded, DAMON prioritizes found memory
+regions of the ``target access pattern`` based on their size, access frequency,
+and age. For personalized prioritization, users can set the weights for the
+three properties.
+
+Under ``quotas`` directory, three files (``ms``, ``bytes``,
+``reset_interval_ms``) and one directory (``weights``) having three files
+(``sz_permil``, ``nr_accesses_permil``, and ``age_permil``) in it exist.
+
+You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
+``reset interval`` in milliseconds by writing the values to the three files,
+respectively. You can also set the prioritization weights for size, access
+frequency, and age in per-thousand unit by writing the values to the three
+files under the ``weights`` directory.
+
+schemes/<N>/watermarks/
+-----------------------
+
+To allow easy activation and deactivation of each scheme based on system
+status, DAMON provides a feature called watermarks. The feature receives five
+values called ``metric``, ``interval``, ``high``, ``mid``, and ``low``. The
+``metric`` is a measurable system metric, such as the free memory ratio. If
+the metric value of the system is higher than the value in ``high`` or lower
+than ``low`` at the moment, the scheme is deactivated. If the value is lower
+than ``mid``, the scheme is activated.
+
+Under the watermarks directory, five files (``metric``, ``interval_us``,
+``high``, ``mid``, and ``low``) for setting each value exist. You can set and
+get the five values by writing to and reading from the files, respectively.
+
+Keywords and meanings of those that can be written to the ``metric`` file are
+as below.
+
+ - none: Ignore the watermarks
+ - free_mem_rate: System's free memory rate (per thousand)
+
+The ``interval`` should be written in microseconds.
+
+.. _sysfs_schemes_stats:
+
+schemes/<N>/stats/
+------------------
+
+DAMON counts the total number and bytes of regions that each scheme has tried
+to be applied to, the two numbers for the regions that each scheme has
+successfully been applied to, and the total number of times the quota limits
+have been exceeded. These statistics can be used for online analysis or
+tuning of the schemes.
+
+The statistics can be retrieved by reading the files under the ``stats``
+directory (``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, and
+``qt_exceeds``), respectively. The files are not updated in real time, so you
+should ask the DAMON sysfs interface to update the content of the files for
+the stats by writing a special keyword, ``update_schemes_stats``, to the
+relevant ``kdamonds/<N>/state`` file.
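+
+For example, the below commands, assuming the kdamond and scheme indexed ``0``
+used elsewhere in this document, would update the stats files and check how
+many regions the scheme has been tried on (the output is only illustrative)::
+
+    # cd /sys/kernel/mm/damon/admin/kdamonds/0
+    # echo update_schemes_stats > state
+    # cat contexts/0/schemes/0/stats/nr_tried
+    42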
+
+Example
+~~~~~~~
+
+Below commands apply a scheme saying "If a memory region of size in [4KiB,
+8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
+interval in [10, 20], page out the region. For the paging out, use only up to
+10ms per second, and also don't page out more than 1GiB per second. Under the
+limitation, page out memory regions having longer age first. Also, check the
+free memory rate of the system every 5 seconds, start the monitoring and paging
+out when the free memory rate becomes lower than 50%, but stop it if the free
+memory rate becomes larger than 60%, or lower than 30%". ::
+
+ # cd <sysfs>/kernel/mm/damon/admin
+ # # populate directories
+ # echo 1 > kdamonds/nr_kdamonds; echo 1 > kdamonds/0/contexts/nr_contexts;
+ # echo 1 > kdamonds/0/contexts/0/schemes/nr_schemes
+ # cd kdamonds/0/contexts/0/schemes/0
+ # # set the basic access pattern and the action
+ # echo 4096 > access_pattern/sz/min
+ # echo 8192 > access_pattern/sz/max
+ # echo 0 > access_pattern/nr_accesses/min
+ # echo 5 > access_pattern/nr_accesses/max
+ # echo 10 > access_pattern/age/min
+ # echo 20 > access_pattern/age/max
+ # echo pageout > action
+ # # set quotas
+ # echo 10 > quotas/ms
+ # echo $((1024*1024*1024)) > quotas/bytes
+ # echo 1000 > quotas/reset_interval_ms
+ # # set watermark
+ # echo free_mem_rate > watermarks/metric
+ # echo 5000000 > watermarks/interval_us
+ # echo 600 > watermarks/high
+ # echo 500 > watermarks/mid
+ # echo 300 > watermarks/low
+
+Please note that it's highly recommended to use user space tools like `damo
+<https://github.com/awslabs/damo>`_ rather than manually reading and writing
+the files as above. The above is only an example.
+
+.. _debugfs_interface:
debugfs Interface
=================
-DAMON exports five files, ``attrs``, ``target_ids``, ``init_regions``,
-``schemes`` and ``monitor_on`` under its debugfs directory,
-``<debugfs>/damon/``.
+.. note::
+
+ DAMON debugfs interface will be removed after next LTS kernel is released, so
+ users should move to the :ref:`sysfs interface <sysfs_interface>`.
+
+DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
+``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
+``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.
Attributes
----------
Users can get and set the ``sampling interval``, ``aggregation interval``,
-``regions update interval``, and min/max number of monitoring target regions by
+``update interval``, and min/max number of monitoring target regions by
reading from and writing to the ``attrs`` file. To know about the monitoring
-attributes in detail, please refer to the :doc:`/vm/damon/design`. For
+attributes in detail, please refer to the :doc:`/mm/damon/design`. For
example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and
1000, and then check it again::
@@ -105,24 +468,28 @@ In such cases, users can explicitly set the initial monitoring target regions
as they want, by writing proper values to the ``init_regions`` file. Each line
of the input should represent one region in below form.::
- <target id> <start address> <end address>
+ <target idx> <start address> <end address>
-The ``target id`` should already in ``target_ids`` file, and the regions should
-be passed in address order. For example, below commands will set a couple of
-address ranges, ``1-100`` and ``100-200`` as the initial monitoring target
-region of process 42, and another couple of address ranges, ``20-40`` and
-``50-100`` as that of process 4242.::
+The ``target idx`` should be the index of the target in ``target_ids`` file,
+starting from ``0``, and the regions should be passed in address order. For
+example, below commands will set a couple of address ranges, ``1-100`` and
+``100-200`` as the initial monitoring target region of pid 42, which is the
+first one (index ``0``) in ``target_ids``, and another couple of address
+ranges, ``20-40`` and ``50-100`` as that of pid 4242, which is the second one
+(index ``1``) in ``target_ids``.::
# cd <debugfs>/damon
- # echo "42 1 100
- 42 100 200
- 4242 20 40
- 4242 50 100" > init_regions
+ # cat target_ids
+ 42 4242
+ # echo "0 1 100
+ 0 100 200
+ 1 20 40
+ 1 50 100" > init_regions
Note that this sets the initial monitoring target regions only. In case of
virtual memory monitoring, DAMON will automatically updates the boundary of the
-regions after one ``regions update interval``. Therefore, users should set the
-``regions update interval`` large enough in this case, if they don't want the
+regions after one ``update interval``. Therefore, users should set the
+``update interval`` large enough in this case, if they don't want the
update.
@@ -131,24 +498,38 @@ Schemes
For usual DAMON-based data access aware memory management optimizations, users
would simply want the system to apply a memory management action to a memory
-region of a specific size having a specific access frequency for a specific
-time. DAMON receives such formalized operation schemes from the user and
-applies those to the target processes. It also counts the total number and
-size of regions that each scheme is applied. This statistics can be used for
-online analysis or tuning of the schemes.
+region of a specific access pattern. DAMON receives such formalized operation
+schemes from the user and applies those to the target processes.
Users can get and set the schemes by reading from and writing to ``schemes``
debugfs file. Reading the file also shows the statistics of each scheme. To
-the file, each of the schemes should be represented in each line in below form:
+the file, each of the schemes should be represented in each line in below
+form::
+
+ <target access pattern> <action> <quota> <watermarks>
+
+You can disable schemes by simply writing an empty string to the file.
- min-size max-size min-acc max-acc min-age max-age action
+Target Access Pattern
+~~~~~~~~~~~~~~~~~~~~~
-Note that the ranges are closed interval. Bytes for the size of regions
-(``min-size`` and ``max-size``), number of monitored accesses per aggregate
-interval for access frequency (``min-acc`` and ``max-acc``), number of
-aggregate intervals for the age of regions (``min-age`` and ``max-age``), and a
-predefined integer for memory management actions should be used. The supported
-numbers and their meanings are as below.
+The ``<target access pattern>`` is constructed with three ranges in below
+form::
+
+ min-size max-size min-acc max-acc min-age max-age
+
+Specifically, bytes for the size of regions (``min-size`` and ``max-size``),
+number of monitored accesses per aggregate interval for access frequency
+(``min-acc`` and ``max-acc``), number of aggregate intervals for the age of
+regions (``min-age`` and ``max-age``) are specified. Note that the ranges are
+closed intervals.
+
+Action
+~~~~~~
+
+The ``<action>`` is a predefined integer for memory management actions, which
+DAMON will apply to the regions having the target access pattern. The
+supported numbers and their meanings are as below.
- 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``
- 1: Call ``madvise()`` for the region with ``MADV_COLD``
@@ -157,20 +538,82 @@ numbers and their meanings are as below.
- 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
- 5: Do nothing but count the statistics
-You can disable schemes by simply writing an empty string to the file. For
-example, below commands applies a scheme saying "If a memory region of size in
-[4KiB, 8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
-interval in [10, 20], page out the region", check the entered scheme again, and
-finally remove the scheme. ::
+Quota
+~~~~~
- # cd <debugfs>/damon
- # echo "4096 8192 0 5 10 20 2" > schemes
- # cat schemes
- 4096 8192 0 5 10 20 2 0 0
- # echo > schemes
+The optimal ``target access pattern`` for each ``action`` is workload
+dependent, so it is not easy to find. Worse yet, setting a scheme's action too
+aggressively can cause severe overhead. To avoid such overhead, users can
+limit the time and size quota for the scheme via the ``<quota>`` in the below
+form::
+
+ <ms> <sz> <reset interval> <priority weights>
+
+This makes DAMON try to use only up to ``<ms>`` milliseconds for applying
+the action to memory regions of the ``target access pattern`` within the
+``<reset interval>`` milliseconds, and to apply the action to only up to
+``<sz>`` bytes of memory regions within the ``<reset interval>``. Setting both
+``<ms>`` and ``<sz>`` zero disables the quota limits.
+
+When the quota limit is expected to be exceeded, DAMON prioritizes found memory
+regions of the ``target access pattern`` based on their size, access frequency,
+and age. For personalized prioritization, users can set the weights for the
+three properties in ``<priority weights>`` in below form::
-The last two integers in the 4th line of above example is the total number and
-the total size of the regions that the scheme is applied.
+ <size weight> <access frequency weight> <age weight>
+
+Watermarks
+~~~~~~~~~~
+
+Some schemes would need to run based on the current value of specific system
+metrics like the free memory ratio. For such cases, users can specify
+watermarks for the condition.::
+
+ <metric> <check interval> <high mark> <middle mark> <low mark>
+
+``<metric>`` is a predefined integer for the metric to be checked. The
+supported numbers and their meanings are as below.
+
+ - 0: Ignore the watermarks
+ - 1: System's free memory rate (per thousand)
+
+The value of the metric is checked every ``<check interval>`` microseconds.
+
+If the value is higher than ``<high mark>`` or lower than ``<low mark>``, the
+scheme is deactivated. If the value is lower than ``<mid mark>``, the scheme
+is activated.
+
+.. _damos_stats:
+
+Statistics
+~~~~~~~~~~
+
+DAMON also counts the total number and bytes of regions that each scheme has
+tried to be applied to, the two numbers for the regions that each scheme has
+successfully been applied to, and the total number of times the quota limits
+have been exceeded. These statistics can be used for online analysis or
+tuning of the schemes.
+
+The statistics can be shown by reading the ``schemes`` file. Reading the file
+will show each scheme you entered in each line, and the five numbers for the
+statistics will be added at the end of each line.
+
+Example
+~~~~~~~
+
+Below commands apply a scheme saying "If a memory region of size in [4KiB,
+8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
+interval in [10, 20], page out the region. For the paging out, use only up to
+10ms per second, and also don't page out more than 1GiB per second. Under the
+limitation, page out memory regions having longer age first. Also, check the
+free memory rate of the system every 5 seconds, start the monitoring and paging
+out when the free memory rate becomes lower than 50%, but stop it if the free
+memory rate becomes larger than 60%, or lower than 30%".::
+
+ # cd <debugfs>/damon
+ # scheme="4096 8192 0 5 10 20 2" # target access pattern and action
+ # scheme+=" 10 $((1024*1024*1024)) 1000" # quotas
+ # scheme+=" 0 0 100" # prioritization weights
+ # scheme+=" 1 5000000 600 500 300" # watermarks
+ # echo "$scheme" > schemes
Turning On/Off
@@ -195,6 +638,54 @@ the monitoring is turned on. If you write to the files while DAMON is running,
an error code such as ``-EBUSY`` will be returned.
+Monitoring Thread PID
+---------------------
+
+DAMON does requested monitoring with a kernel thread called ``kdamond``. You
+can get the pid of the thread by reading the ``kdamond_pid`` file. When the
+monitoring is turned off, reading the file returns ``none``. ::
+
+ # cd <debugfs>/damon
+ # cat monitor_on
+ off
+ # cat kdamond_pid
+ none
+ # echo on > monitor_on
+ # cat kdamond_pid
+ 18594
+
+
+Using Multiple Monitoring Threads
+---------------------------------
+
+One ``kdamond`` thread is created for each monitoring context. You can create
+and remove monitoring contexts for use cases requiring multiple ``kdamond``
+threads using the ``mk_contexts`` and ``rm_contexts`` files.
+
+Writing the name of the new context to the ``mk_contexts`` file creates a
+directory of the name on the DAMON debugfs directory. The directory will have
+DAMON debugfs files for the context. ::
+
+ # cd <debugfs>/damon
+ # ls foo
+ # ls: cannot access 'foo': No such file or directory
+ # echo foo > mk_contexts
+ # ls foo
+ # attrs init_regions kdamond_pid schemes target_ids
+
+If the context is not needed anymore, you can remove it and the corresponding
+directory by writing the name of the context to the ``rm_contexts`` file. ::
+
+ # echo foo > rm_contexts
+ # ls foo
+ # ls: cannot access 'foo': No such file or directory
+
+Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on`` files are in the
+root directory only.
+
+
+.. _tracepoint:
+
Tracepoint for Monitoring Results
=================================
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 0166f9de3428..19f27c0d92e0 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -65,7 +65,7 @@ HugePages_Surp
may be temporarily larger than the maximum number of surplus huge
pages when the system is under memory pressure.
Hugepagesize
- is the default hugepage size (in Kb).
+ is the default hugepage size (in kB).
Hugetlb
is the total amount of memory (in kB), consumed by huge
pages of all sizes.
@@ -164,8 +164,8 @@ default_hugepagesz
will all result in 256 2M huge pages being allocated. Valid default
huge page size is architecture dependent.
hugetlb_free_vmemmap
- When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this enables freeing
- unused vmemmap pages associated with each HugeTLB page.
+ When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HugeTLB
+ Vmemmap Optimization (HVO).
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index c21b5823f126..d1064e0ba34a 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -32,10 +32,12 @@ the Linux memory management.
idle_page_tracking
ksm
memory-hotplug
+ multigen_lru
nommu-mmap
numa_memory_policy
numaperf
pagemap
+ shrinker_debugfs
soft-dirty
swap_numa
transhuge
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 97d816791aca..fb6ba2002a4b 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -184,6 +184,60 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.
+Monitoring KSM profit
+=====================
+
+KSM can save memory by merging identical pages, but also can consume
+additional memory, because it needs to generate a number of rmap_items to
+save each scanned page's brief rmap information. Some of these pages may
+be merged, but some may not be able to be merged even after being checked
+several times, which is unprofitable memory consumption.
+
+1) How to determine whether KSM saves memory or consumes memory system-wide?
+   Here is a simple approximate calculation for reference::
+
+ general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
+ sizeof(rmap_item);
+
+ where all_rmap_items can be easily obtained by summing ``pages_sharing``,
+ ``pages_shared``, ``pages_unshared`` and ``pages_volatile``.
+
+2) The KSM profit within a single process can be similarly obtained by the
+ following approximate calculation::
+
+ process_profit =~ ksm_merging_pages * sizeof(page) -
+ ksm_rmap_items * sizeof(rmap_item).
+
+ where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
+ and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``.
+
+From the perspective of an application, a high ratio of ``ksm_rmap_items`` to
+``ksm_merging_pages`` means a badly applied madvise policy, so developers or
+administrators have to rethink how to change the madvise policy. For example,
+a page's size is usually 4K, and the rmap_item's size is 32B on 32-bit CPU
+architectures and 64B on 64-bit CPU architectures. So if the
+``ksm_rmap_items/ksm_merging_pages`` ratio exceeds 64 on a 64-bit CPU or
+exceeds 128 on a 32-bit CPU, then the app's madvise policy should be dropped,
+because the KSM profit is approximately zero or negative.
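+
+For example, on a 64-bit system with 4K pages, the rough shell sketch below
+approximates the system-wide general profit from the KSM sysfs counters,
+using the 64B rmap_item size mentioned above::
+
+    # cd /sys/kernel/mm/ksm
+    # ri=$(($(cat pages_sharing) + $(cat pages_shared)))
+    # ri=$((ri + $(cat pages_unshared) + $(cat pages_volatile)))
+    # echo $(($(cat pages_sharing) * 4096 - ri * 64))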
+
+Monitoring KSM events
+=====================
+
+There are some counters in /proc/vmstat that may be used to monitor KSM events.
+KSM might help save memory, but it's a tradeoff that may cause delays on KSM
+COW or on swapping in copies. Those events could help users evaluate whether
+or how to use KSM. For example, if cow_ksm increases too fast, users may
+decrease the range of madvise(, , MADV_MERGEABLE).
+
+cow_ksm
+        is incremented every time a KSM page triggers copy on write (COW).
+        When users try to write to a KSM page, we have to make a copy.
+
+ksm_swpin_copy
+        is incremented every time a KSM page is copied when swapping in.
+        Note that a KSM page might be copied when swapping in because
+        do_swap_page() cannot do all the locking needed to reconstitute a
+        cross-anon_vma KSM page.
+
--
Izik Eidus,
Hugh Dickins, 17 Nov 2009
diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index 0f56ecd8ac05..a3c9e8ad8fa0 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -653,8 +653,8 @@ block might fail:
- Concurrent activity that operates on the same physical memory area, such as
allocating gigantic pages, can result in temporary offlining failures.
-- Out of memory when dissolving huge pages, especially when freeing unused
- vmemmap pages associated with each hugetlb page is enabled.
+- Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap
+ Optimization (HVO) is enabled.
Offlining code may be able to migrate huge page contents, but may not be able
to dissolve the source huge page because it fails allocating (unmovable) pages
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
new file mode 100644
index 000000000000..33e068830497
--- /dev/null
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Quick start
+===========
+Build the kernel with the following configurations.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+
+All set!
+
+Runtime options
+===============
+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
+following subsections.
+
+Kill switch
+-----------
+``enabled`` accepts different values to enable or disable the
+following components. Its default value depends on
+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
+unless some of them have unforeseen side effects. Writing to
+``enabled`` has no effect when a component is not supported by the
+hardware, and valid values will be accepted even when the main switch
+is off.
+
+====== ===============================================================
+Values Components
+====== ===============================================================
+0x0001 The main switch for the multi-gen LRU.
+0x0002 Clearing the accessed bit in leaf page table entries in large
+ batches, when MMU sets it (e.g., on x86). This behavior can
+ theoretically worsen lock contention (mmap_lock). If it is
+ disabled, the multi-gen LRU will suffer a minor performance
+ degradation for workloads that contiguously map hot pages,
+ whose accessed bits can be otherwise cleared by fewer larger
+ batches.
+0x0004 Clearing the accessed bit in non-leaf page table entries as
+ well, when MMU sets it (e.g., on x86). This behavior was not
+ verified on x86 varieties other than Intel and AMD. If it is
+ disabled, the multi-gen LRU will suffer a negligible
+ performance degradation.
+[yYnN] Apply to all the components above.
+====== ===============================================================
+
+E.g.,
+::
+
+ echo y >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0007
+ echo 5 >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0005
+
+Thrashing prevention
+--------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multi-gen LRU offers thrashing prevention to the
+majority of laptop and desktop users who do not have ``oomd``.
+
+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
+``N`` milliseconds from getting evicted. The OOM killer is triggered
+if this working set cannot be kept in memory. In other words, this
+option works as an adjustable pressure relief valve, and when open, it
+terminates applications that are hopefully not being used.
+
+Based on the average human detectable lag (~100ms), ``N=1000`` usually
+eliminates intolerable janks due to thrashing. Larger values like
+``N=3000`` make janks less noticeable at the risk of premature OOM
+kills.
+
+The default value ``0`` means disabled.
+
+Experimental features
+=====================
+``/sys/kernel/debug/lru_gen`` accepts commands described in the
+following subsections. Multiple command lines are supported, so does
+concatenation with delimiters ``,`` and ``;``.
+
+``/sys/kernel/debug/lru_gen_full`` provides additional stats for
+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
+evicted generations in this file.
+
+Working set estimation
+----------------------
+Working set estimation measures how much memory an application needs
+in a given time interval, and it is usually done with little impact on
+the performance of the application. E.g., data centers want to
+optimize job scheduling (bin packing) to improve memory utilization.
+When a new job comes in, the job scheduler needs to find out whether
+each server it manages can allocate a certain amount of memory for
+this new job before it can pick a candidate. To do so, the job
+scheduler needs to estimate the working sets of the existing jobs.
+
+When it is read, ``lru_gen`` returns a histogram of numbers of pages
+accessed over different time intervals for each memcg and node.
+``MAX_NR_GENS`` decides the number of bins for each histogram. The
+histograms are noncumulative.
+::
+
+ memcg memcg_id memcg_path
+ node node_id
+ min_gen_nr age_in_ms nr_anon_pages nr_file_pages
+ ...
+ max_gen_nr age_in_ms nr_anon_pages nr_file_pages
+
+Each bin contains an estimated number of pages that have been accessed
+within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
+and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
+the former is the largest and that of the latter is the smallest.
+
+Users can write the following command to ``lru_gen`` to create a new
+generation ``max_gen_nr+1``:
+
+ ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``
+
+``can_swap`` defaults to the swap setting and, if it is set to ``1``,
+it forces the scan of anon pages when swap is off, and vice versa.
+``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
+employs heuristics to reduce the overhead, which is likely to reduce
+the coverage as well.
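+
+For example, the hypothetical command below, assuming memcg ``29``, node ``0``,
+and a current ``max_gen_nr`` of ``614``, would create a new generation while
+forcing the scan of anon pages even if swap is off::
+
+    echo '+ 29 0 614 1' >/sys/kernel/debug/lru_gen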
+
+A typical use case is that a job scheduler runs this command at a
+certain time interval to create new generations, and it ranks the
+servers it manages based on the sizes of their cold pages defined by
+this time interval.
+
+Proactive reclaim
+-----------------
+Proactive reclaim induces page reclaim when there is no memory
+pressure. It usually targets cold pages only. E.g., when a new job
+comes in, the job scheduler wants to proactively reclaim cold pages on
+the server it selected, to improve the chance of successfully landing
+this new job.
+
+Users can write the following command to ``lru_gen`` to evict
+generations less than or equal to ``min_gen_nr``.
+
+ ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``
+
+``min_gen_nr`` should be less than ``max_gen_nr-1``, since
+``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
+the active list) and therefore cannot be evicted. ``swappiness``
+overrides the default value in ``/proc/sys/vm/swappiness``.
+``nr_to_reclaim`` limits the number of pages to evict.
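+
+For example, the hypothetical command below would evict the generations up to
+``612`` for memcg ``29`` on node ``0``, with a swappiness of ``20`` and at
+most ``1048576`` pages reclaimed::
+
+    echo '- 29 0 612 20 1048576' >/sys/kernel/debug/lru_gen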
+
+A typical use case is that a job scheduler runs this command before it
+tries to land a new job on a server. If it fails to materialize enough
+cold pages because of the overestimation, it retries on the next
+server according to the ranking result obtained from the working set
+estimation step. This less forceful approach limits the impacts on the
+existing jobs.
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 64fd0ba0d057..5a6afecbb0d0 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -408,7 +408,7 @@ follows:
Memory Policy APIs
==================
-Linux supports 3 system calls for controlling memory policy. These APIS
+Linux supports 4 system calls for controlling memory policy. These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.
@@ -460,6 +460,20 @@ requested via the 'flags' argument.
See the mbind(2) man page for more details.
+Set home node for a Range of Task's Address Space::
+
+ long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
+ unsigned long home_node,
+ unsigned long flags);
+
+sys_set_mempolicy_home_node sets the home node for a VMA policy present in the
+task's address range. The system call updates the home node only for the
+existing mempolicy range. Other address ranges are ignored. A home node is the
+NUMA node from which page allocations will preferentially be made. Specifying
+the home node overrides the default allocation policy of allocating memory
+close to the local node of the executing CPU.
+
+
Memory Policy Command Line Interface
====================================
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index bfc28704856c..6e2e416af783 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -23,7 +23,7 @@ There are four components to pagemap:
* Bit 56 page exclusively mapped (since 4.2)
* Bit 57 pte is uffd-wp write-protected (since 5.13) (see
:ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
- * Bits 57-60 zero
+ * Bits 58-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5)
* Bit 62 page swapped
* Bit 63 page present
diff --git a/Documentation/admin-guide/mm/shrinker_debugfs.rst b/Documentation/admin-guide/mm/shrinker_debugfs.rst
new file mode 100644
index 000000000000..3887f0b294fe
--- /dev/null
+++ b/Documentation/admin-guide/mm/shrinker_debugfs.rst
@@ -0,0 +1,135 @@
+.. _shrinker_debugfs:
+
+==========================
+Shrinker Debugfs Interface
+==========================
+
+The shrinker debugfs interface provides visibility into the kernel memory
+shrinkers subsystem and allows getting information about individual shrinkers
+and interacting with them.
+
+For each shrinker registered in the system, a directory in
+**<debugfs>/shrinker/** is created. The directory's name is composed of the
+shrinker's name and a unique id: e.g. *kfree_rcu-0* or *sb-xfs:vda1-36*.
+
+Each shrinker directory contains **count** and **scan** files, which allow
+triggering the *count_objects()* and *scan_objects()* callbacks for each memcg
+and numa node (if applicable).
+
+Usage:
+------
+
+1. *List registered shrinkers*
+
+ ::
+
+ $ cd /sys/kernel/debug/shrinker/
+ $ ls
+ dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
+ mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
+ mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
+ rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
+ sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
+ sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
+ sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
+ sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
+ sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
+ sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
+ sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
+ sb-dax-11 sb-proc-45 sb-tmpfs-35
+ sb-debugfs-7 sb-proc-46 sb-tmpfs-40
+
+2. *Get information about a specific shrinker*
+
+ ::
+
+ $ cd sb-btrfs\:vda2-24/
+ $ ls
+ count scan
+
+3. *Count objects*
+
+ Each line in the output has the following format::
+
+ <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1> ...
+ <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1> ...
+ ...
+
+   If a cgroup has no objects on any numa node, its line is omitted. If there
+   are no objects at all, the output might be empty.
+
+ If the shrinker is not memcg-aware or CONFIG_MEMCG is off, 0 is printed
+ as cgroup inode id. If the shrinker is not numa-aware, 0's are printed
+ for all nodes except the first one.
+ ::
+
+ $ cat count
+ 1 224 2
+ 21 98 0
+ 55 818 10
+ 2367 2 0
+ 2401 30 0
+ 225 13 0
+ 599 35 0
+ 939 124 0
+ 1041 3 0
+ 1075 1 0
+ 1109 1 0
+ 1279 60 0
+ 1313 7 0
+ 1347 39 0
+ 1381 3 0
+ 1449 14 0
+ 1483 63 0
+ 1517 53 0
+ 1551 6 0
+ 1585 1 0
+ 1619 6 0
+ 1653 40 0
+ 1687 11 0
+ 1721 8 0
+ 1755 4 0
+ 1789 52 0
+ 1823 888 0
+ 1857 1 0
+ 1925 2 0
+ 1959 32 0
+ 2027 22 0
+ 2061 9 0
+ 2469 799 0
+ 2537 861 0
+ 2639 1 0
+ 2707 70 0
+ 2775 4 0
+ 2877 84 0
+ 293 1 0
+ 735 8 0
+
+4. *Scan objects*
+
+ The expected input format::
+
+ <cgroup inode id> <numa id> <number of objects to scan>
+
+   For a non-memcg-aware shrinker or on a system with no memory
+   cgroups, **0** should be passed as the cgroup id.
+ ::
+
+ $ cd /sys/kernel/debug/shrinker/
+ $ cd sb-btrfs\:vda2-24/
+
+ $ cat count | head -n 5
+ 1 212 0
+ 21 97 0
+ 55 802 5
+ 2367 2 0
+ 225 13 0
+
+ $ echo "55 0 200" > scan
+
+ $ cat count | head -n 5
+ 1 212 0
+ 21 96 0
+ 55 752 5
+ 2367 2 0
+ 225 13 0
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index c9c37f16eef8..8ee78ec232eb 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -191,7 +191,14 @@ allocation failure to throttle the next allocation attempt::
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
-The khugepaged progress can be seen in the number of pages collapsed::
+The khugepaged progress can be seen in the number of pages collapsed (note
+that this counter may not be an exact count of the number of pages
+collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
+being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
+one 2M hugepage. Each may happen independently, or together, depending on
+the type of memory and the failures that occur. As such, this value should
+be interpreted roughly as a sign of progress, and counters in /proc/vmstat
+consulted for more accurate accounting)::
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
@@ -366,10 +373,9 @@ thp_split_pmd
page table entry.
thp_zero_page_alloc
- is incremented every time a huge zero page is
- successfully allocated. It includes allocations which where
- dropped due race with other allocation. Note, it doesn't count
- every map of the huge zero page, only its allocation.
+ is incremented every time a huge zero page used for thp is
+ successfully allocated. Note, it doesn't count every map of
+ the huge zero page, only its allocation.
thp_zero_page_alloc_failed
is incremented if kernel fails to allocate
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 6528036093e1..83f31919ebb3 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
Design
======
-Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
+Userspace creates a new userfaultfd, initializes it, and registers one or more
+regions of virtual memory with it. Then, any page faults which occur within the
+region(s) result in a message being delivered to the userfaultfd, notifying
+userspace of the fault.
The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:
@@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory
management of mremap/mprotect is that the userfaults in all their
operations never involve heavyweight structures like vmas (in fact the
``userfaultfd`` runtime load never takes the mmap_lock for writing).
-
Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that.
-The ``userfaultfd`` once opened by invoking the syscall, can also be
+The ``userfaultfd``, once created, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on
@@ -50,6 +52,39 @@ is a corner case that would currently return ``-EBUSY``).
API
===
+Creating a userfaultfd
+----------------------
+
+There are two ways to create a new userfaultfd, each of which provides ways to
+restrict access to this functionality (since historically userfaultfds which
+handle kernel page faults have been a useful tool for exploiting the kernel).
+
+The first way, supported since userfaultfd was introduced, is the
+userfaultfd(2) syscall. Access to this is controlled in several ways:
+
+- Any user can always create a userfaultfd which traps userspace page faults
+ only. Such a userfaultfd can be created using the userfaultfd(2) syscall
+ with the flag UFFD_USER_MODE_ONLY.
+
+- In order to also trap kernel page faults for the address space, either the
+ process needs the CAP_SYS_PTRACE capability, or the system must have
+ vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd
+ is set to 0.
+
+The second way, added to the kernel more recently, is by opening
+/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method
+yields equivalent userfaultfds to the userfaultfd(2) syscall.
+
+Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal
+filesystem permissions (user/group/mode), which gives fine-grained access to
+userfaultfd specifically, without also granting other unrelated privileges at
+the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access
+to /dev/userfaultfd can always create userfaultfds that trap kernel page faults;
+vm.unprivileged_userfaultfd is not considered.
+
+Initializing a userfaultfd
+--------------------------
+
When first opened the ``userfaultfd`` must be enabled invoking the
``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
a later API version) which will specify the ``read/POLLIN`` protocol
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 8edb8d578caf..6e6f7b0d6562 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -130,9 +130,25 @@ attribute, e.g.::
echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled
When zswap same-filled page identification is disabled at runtime, it will stop
-checking for the same-value filled pages during store operation. However, the
-existing pages which are marked as same-value filled pages remain stored
-unchanged in zswap until they are either loaded or invalidated.
+checking for the same-value filled pages during store operation.
+In other words, every page will then be considered non-same-value filled.
+However, the existing pages which are marked as same-value filled pages remain
+stored unchanged in zswap until they are either loaded or invalidated.
+
+In some circumstances it might be advantageous to make use of just the zswap
+ability to efficiently store same-filled pages without enabling the whole
+compressed page storage.
+In this case the handling of non-same-value pages by zswap (enabled by default)
+can be disabled by setting the ``non_same_filled_pages_enabled`` attribute
+to 0, e.g. ``zswap.non_same_filled_pages_enabled=0``.
+It can also be enabled and disabled at runtime using the sysfs
+``non_same_filled_pages_enabled`` attribute, e.g.::
+
+ echo 1 > /sys/module/zswap/parameters/non_same_filled_pages_enabled
+
+Disabling both ``zswap.same_filled_pages_enabled`` and
+``zswap.non_same_filled_pages_enabled`` effectively disables accepting any new
+pages by zswap.
To prevent zswap from shrinking pool when zswap is full and there's a high
pressure on swap (this will result in flipping pages in and out zswap pool
diff --git a/Documentation/admin-guide/nfs/nfs-client.rst b/Documentation/admin-guide/nfs/nfs-client.rst
index 6adb6457bc69..36760685dd34 100644
--- a/Documentation/admin-guide/nfs/nfs-client.rst
+++ b/Documentation/admin-guide/nfs/nfs-client.rst
@@ -36,10 +36,9 @@ administrative requirements that require particular behavior that does not
work well as part of an nfs_client_id4 string.
The nfs.nfs4_unique_id boot parameter specifies a unique string that can be
-used instead of a system's node name when an NFS client identifies itself to
-a server. Thus, if the system's node name is not unique, or it changes, its
-nfs.nfs4_unique_id stays the same, preventing collision with other clients
-or loss of state during NFS reboot recovery or transparent state migration.
+used together with a system's node name when an NFS client identifies itself to
+a server. Thus, if the system's node name is not unique, its
+nfs.nfs4_unique_id can help prevent collisions with other clients.
The nfs.nfs4_unique_id string is typically a UUID, though it can contain
anything that is believed to be unique across all NFS clients. An
@@ -53,8 +52,12 @@ outstanding NFSv4 state has expired, to prevent loss of NFSv4 state.
This string can be stored in an NFS client's grub.conf, or it can be provided
via a net boot facility such as PXE. It may also be specified as an nfs.ko
-module parameter. Specifying a uniquifier string is not support for NFS
-clients running in containers.
+module parameter.
+
+This uniquifier string will be the same for all NFS clients running in
+containers unless it is overridden by a value written to
+/sys/fs/nfs/net/nfs_client/identifier, which will be local to the network
+namespace of the process that writes it.
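+
+For example, a container's init could set its own uniquifier from within its
+network namespace before mounting anything over NFS (an illustrative sketch)::
+
+  # uuidgen > /sys/fs/nfs/net/nfs_client/identifier
+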
The DNS resolver
diff --git a/Documentation/admin-guide/perf/alibaba_pmu.rst b/Documentation/admin-guide/perf/alibaba_pmu.rst
new file mode 100644
index 000000000000..11de998bb480
--- /dev/null
+++ b/Documentation/admin-guide/perf/alibaba_pmu.rst
@@ -0,0 +1,100 @@
+=============================================================
+Alibaba's T-Head SoC Uncore Performance Monitoring Unit (PMU)
+=============================================================
+
+The Yitian 710, custom-built by Alibaba Group's chip development business,
+T-Head, implements an uncore PMU for performance and functional debugging to
+facilitate system maintenance.
+
+DDR Sub-System Driveway (DRW) PMU Driver
+=========================================
+
+Yitian 710 employs eight DDR5/4 channels, four on each die. Each DDR5 channel
+is independent of the others in servicing system memory requests, and each
+DDR5 channel is split into two independent sub-channels. The DDR Sub-System
+Driveway implements separate PMUs for each sub-channel to monitor various
+performance metrics.
+
+The Driveway PMU devices are named ali_drw_<sys_base_addr> in perf.
+For example, ali_drw_21000 and ali_drw_21080 are two PMU devices for the two
+sub-channels of the same channel in die 0. The PMU devices of die 1 are
+prefixed with ali_drw_400XXXXX, e.g. ali_drw_40021000.
+
+Each sub-channel has 36 PMU counters in total, which are classified into
+four groups:
+
+- Group 0: PMU Cycle Counter. This group has one pair of counters,
+  pmu_cycle_cnt_low and pmu_cycle_cnt_high, which is used as the cycle count
+  based on the DDRC core clock.
+
+- Group 1: PMU Bandwidth Counters. This group has 8 counters that are used
+  to count the total number of accesses, either of the eight bank groups in a
+  selected rank, or of the four ranks separately using the first 4 counters.
+  The base transfer unit is 64B.
+
+- Group 2: PMU Retry Counters. This group has 10 counters that count the
+  total number of retries for each type of uncorrectable error.
+
+- Group 3: PMU Common Counters. This group has 16 counters that are used
+  to count common events.
+
+For now, the Driveway PMU driver only uses counters in group 0 and group 3.
+
+The DDR Controller (DDRCTL) and DDR PHY combine to create a complete solution
+for connecting an SoC application bus to DDR memory devices. The DDRCTL
+receives transactions over the Host Interface (HIF), which is custom-defined by Synopsys.
+These transactions are queued internally and scheduled for access while
+satisfying the SDRAM protocol timing requirements, transaction priorities, and
+dependencies between the transactions. The DDRCTL in turn issues commands on
+the DDR PHY Interface (DFI) to the PHY module, which launches and captures data
+to and from the SDRAM. The driveway PMUs have hardware logic to gather
+statistics and performance logging signals on HIF, DFI, etc.
+
+By counting the READ, WRITE and RMW commands sent to the DDRC through the HIF
+interface, we can calculate the bandwidth. Example usage of counting memory
+data bandwidth::
+
+ perf stat \
+ -e ali_drw_21000/hif_wr/ \
+ -e ali_drw_21000/hif_rd/ \
+ -e ali_drw_21000/hif_rmw/ \
+ -e ali_drw_21000/cycle/ \
+ -e ali_drw_21080/hif_wr/ \
+ -e ali_drw_21080/hif_rd/ \
+ -e ali_drw_21080/hif_rmw/ \
+ -e ali_drw_21080/cycle/ \
+ -e ali_drw_23000/hif_wr/ \
+ -e ali_drw_23000/hif_rd/ \
+ -e ali_drw_23000/hif_rmw/ \
+ -e ali_drw_23000/cycle/ \
+ -e ali_drw_23080/hif_wr/ \
+ -e ali_drw_23080/hif_rd/ \
+ -e ali_drw_23080/hif_rmw/ \
+ -e ali_drw_23080/cycle/ \
+ -e ali_drw_25000/hif_wr/ \
+ -e ali_drw_25000/hif_rd/ \
+ -e ali_drw_25000/hif_rmw/ \
+ -e ali_drw_25000/cycle/ \
+ -e ali_drw_25080/hif_wr/ \
+ -e ali_drw_25080/hif_rd/ \
+ -e ali_drw_25080/hif_rmw/ \
+ -e ali_drw_25080/cycle/ \
+ -e ali_drw_27000/hif_wr/ \
+ -e ali_drw_27000/hif_rd/ \
+ -e ali_drw_27000/hif_rmw/ \
+ -e ali_drw_27000/cycle/ \
+ -e ali_drw_27080/hif_wr/ \
+ -e ali_drw_27080/hif_rd/ \
+ -e ali_drw_27080/hif_rmw/ \
+ -e ali_drw_27080/cycle/ -- sleep 10
+
+The average DRAM bandwidth can be calculated as follows:
+
+- Read Bandwidth = perf_hif_rd * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle
+- Write Bandwidth = (perf_hif_wr + perf_hif_rmw) * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle
+
+Here, DDRC_WIDTH = 64 bytes.
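+
+As a worked example, assume (purely illustratively) that a measurement window
+yields hif_rd = 1000000 and a cycle count (DDRC_Cycle) of 2000000000, with a
+DDRC clock (DDRC_Freq) of 1.6 GHz; the read bandwidth then works out to
+roughly::
+
+  $ echo "1000000 * 64 * 1600000000 / 2000000000" | bc
+  51200000
+
+i.e. about 51 MB/s. The counter values and the clock frequency above are
+assumptions for illustration only.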
+
+The current driver does not support sampling, so "perf record" is
+unsupported. Attaching to a task is also unsupported, as the events are all
+uncore.
diff --git a/Documentation/admin-guide/perf/hns3-pmu.rst b/Documentation/admin-guide/perf/hns3-pmu.rst
new file mode 100644
index 000000000000..578407e487d6
--- /dev/null
+++ b/Documentation/admin-guide/perf/hns3-pmu.rst
@@ -0,0 +1,136 @@
+======================================
+HNS3 Performance Monitoring Unit (PMU)
+======================================
+
+HNS3 (HiSilicon Network System 3) Performance Monitoring Unit (PMU) is an
+End Point device that collects performance statistics of the HiSilicon SoC NIC.
+On Hip09, each SICL (Super I/O Cluster) has one PMU device.
+
+HNS3 PMU supports collection of performance statistics such as bandwidth,
+latency, packet rate and interrupt rate.
+
+Each HNS3 PMU supports 8 hardware events.
+
+HNS3 PMU driver
+===============
+
+The HNS3 PMU driver registers a perf PMU with the name of its SICL ID::
+
+ /sys/devices/hns3_pmu_sicl_<sicl_id>
+
+The PMU driver provides descriptions of the available events, filter modes,
+formats, identifier and cpumask in sysfs.
+
+The "events" directory describes the event code of all supported events
+shown in perf list.
+
+The "filtermode" directory describes the supported filter modes of each
+event.
+
+The "format" directory describes all formats of the config (events) and
+config1 (filter options) fields of the perf_event_attr structure.
+
+The "identifier" file shows version of PMU hardware device.
+
+The "bdf_min" and "bdf_max" files show the supported bdf range of each
+pmu device.
+
+The "hw_clk_freq" file shows the hardware clock frequency of each pmu
+device.
+
+Example usage of checking event code and subevent code::
+
+ $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_time
+ config=0x00204
+ $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_packet_num
+ config=0x10204
+
+Each performance statistic has a pair of events that provide two values used
+to calculate the real performance data in userspace.
+
+Bits 0~15 of config (here 0x0204) are the actual hardware event code. If
+two events have the same value in bits 0~15 of config, they form an event
+pair. Bit 16 of config selects counter 0 or counter 1 of the hardware event.
+
+After reading the two values of an event pair in userspace, the real
+performance data is calculated as::
+
+ counter 0 / counter 1
+
+Example usage of checking supported filter mode::
+
+ $# cat /sys/devices/hns3_pmu_sicl_0/filtermode/bw_ssu_rpu_byte_num
+ filter mode supported: global/port/port-tc/func/func-queue/
+
+Example usage of perf::
+
+ $# perf list
+ hns3_pmu_sicl_0/bw_ssu_rpu_byte_num/ [kernel PMU event]
+ hns3_pmu_sicl_0/bw_ssu_rpu_time/ [kernel PMU event]
+ ------------------------------------------
+
+ $# perf stat -g -e hns3_pmu_sicl_0/bw_ssu_rpu_byte_num,global=1/ -e hns3_pmu_sicl_0/bw_ssu_rpu_time,global=1/ -I 1000
+ or
+ $# perf stat -g -e hns3_pmu_sicl_0/config=0x00002,global=1/ -e hns3_pmu_sicl_0/config=0x10002,global=1/ -I 1000
+
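+Since each statistic comes from an event pair (counter 0 / counter 1), the two
+values typically need to be divided in userspace. A hypothetical
+post-processing sketch using perf's CSV output (the awk expression and the CSV
+field layout are assumptions, not part of the driver)::
+
+  $# perf stat -a -x, -e hns3_pmu_sicl_0/bw_ssu_rpu_byte_num,global=1/ \
+       -e hns3_pmu_sicl_0/bw_ssu_rpu_time,global=1/ -- sleep 10 2>&1 | \
+       awk -F, '/byte_num/ {b=$1} /rpu_time/ {t=$1} END {if (t) print b / t}'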
+
+Filter modes
+--------------
+
+1. global mode
+The PMU collects performance statistics for all HNS3 PCIe functions of the IO die.
+Setting the "global" filter option to 1 enables this mode.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,global=1/ -I 1000
+
+2. port mode
+The PMU collects performance statistics of one whole physical port. The port id
+is the same as the mac id. The "tc" filter option must be set to 0xF in this
+mode; here tc stands for traffic class.
+
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,port=0,tc=0xF/ -I 1000
+
+3. port-tc mode
+The PMU collects performance statistics of one tc of a physical port. The port
+id is the same as the mac id. The "tc" filter option must be set to 0 ~ 7 in
+this mode.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,port=0,tc=0/ -I 1000
+
+4. func mode
+The PMU collects performance statistics of one PF/VF. The function id is the
+BDF of the PF/VF; its conversion formula is::
+
+ func = (bus << 8) + (device << 3) + (function)
+
+for example::
+
+  BDF        func
+  35:00.0    0x3500
+  35:00.1    0x3501
+  35:01.0    0x3508
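+
+The same conversion can be sketched on the shell, e.g. for BDF 35:00.0 (an
+illustrative one-liner)::
+
+  $# printf '0x%x\n' $(( (0x35 << 8) + (0x00 << 3) + 0x0 ))
+  0x3500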
+
+In this mode, the "queue" filter option must be set to 0xFFFF.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,bdf=0x3500,queue=0xFFFF/ -I 1000
+
+5. func-queue mode
+The PMU collects performance statistics of one queue of a PF/VF. The function
+id is the BDF of the PF/VF, and the "queue" filter option must be set to the
+exact queue id of the function.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,bdf=0x3500,queue=0/ -I 1000
+
+6. func-intr mode
+The PMU collects performance statistics of one interrupt of a PF/VF. The
+function id is the BDF of the PF/VF, and the "intr" filter option must be set
+to the exact interrupt id of the function.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x00301,bdf=0x3500,intr=0/ -I 1000
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index 5a8f2529a033..793e1970bc05 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -8,6 +8,8 @@ Performance monitor support
:maxdepth: 1
hisi-pmu
+ hisi-pcie-pmu
+ hns3-pmu
imx-ddr
qcom_l2_pmu
qcom_l3_pmu
@@ -16,3 +18,4 @@ Performance monitor support
xgene-pmu
arm_dsu_pmu
thunderx2-pmu
+ alibaba_pmu
diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst
index 2f066df4ee9c..8f3d30c5a0d8 100644
--- a/Documentation/admin-guide/pm/amd-pstate.rst
+++ b/Documentation/admin-guide/pm/amd-pstate.rst
@@ -19,7 +19,7 @@ Linux kernel. The new mechanism is based on Collaborative Processor
Performance Control (CPPC) which provides finer grain frequency management
than legacy ACPI hardware P-States. Current AMD CPU/APU platforms are using
the ACPI P-states driver to manage CPU frequency and clocks with switching
-only in 3 P-states. CPPC replaces the ACPI P-states controls, allows a
+only in 3 P-states. CPPC replaces the ACPI P-states controls and allows a
flexible, low-latency interface for the Linux kernel to directly
communicate the performance hints to hardware.
@@ -27,7 +27,7 @@ communicate the performance hints to hardware.
``ondemand``, etc. to manage the performance hints which are provided by
CPPC hardware functionality that internally follows the hardware
specification (for details refer to AMD64 Architecture Programmer's Manual
-Volume 2: System Programming [1]_). Currently ``amd-pstate`` supports basic
+Volume 2: System Programming [1]_). Currently, ``amd-pstate`` supports basic
frequency control function according to kernel governors on some of the
Zen2 and Zen3 processors, and we will implement more AMD specific functions
in future after we verify them on the hardware and SBIOS.
@@ -41,9 +41,9 @@ continuous, abstract, and unit-less performance value in a scale that is
not tied to a specific performance state / frequency. This is an ACPI
standard [2]_ which software can specify application performance goals and
hints as a relative target to the infrastructure limits. AMD processors
-provides the low latency register model (MSR) instead of AML code
+provide the low latency register model (MSR) instead of an AML code
interpreter for performance adjustments. ``amd-pstate`` will initialize a
-``struct cpufreq_driver`` instance ``amd_pstate_driver`` with the callbacks
+``struct cpufreq_driver`` instance, ``amd_pstate_driver``, with the callbacks
to manage each performance update behavior. ::
Highest Perf ------>+-----------------------+ +-----------------------+
@@ -91,26 +91,26 @@ AMD CPPC Performance Capability
Highest Performance (RO)
.........................
-It is the absolute maximum performance an individual processor may reach,
+This is the absolute maximum performance an individual processor may reach,
assuming ideal conditions. This performance level may not be sustainable
for long durations and may only be achievable if other platform components
-are in a specific state; for example, it may require other processors be in
+are in a specific state; for example, it may require other processors to be in
an idle state. This would be equivalent to the highest frequencies
supported by the processor.
Nominal (Guaranteed) Performance (RO)
......................................
-It is the maximum sustained performance level of the processor, assuming
-ideal operating conditions. In absence of an external constraint (power,
-thermal, etc.) this is the performance level the processor is expected to
+This is the maximum sustained performance level of the processor, assuming
+ideal operating conditions. In the absence of an external constraint (power,
+thermal, etc.), this is the performance level the processor is expected to
be able to maintain continuously. All cores/processors are expected to be
able to sustain their nominal performance state simultaneously.
Lowest non-linear Performance (RO)
...................................
-It is the lowest performance level at which nonlinear power savings are
+This is the lowest performance level at which nonlinear power savings are
achieved, for example, due to the combined effects of voltage and frequency
scaling. Above this threshold, lower performance levels should be generally
more energy efficient than higher performance levels. This register
@@ -119,7 +119,7 @@ effectively conveys the most efficient performance level to ``amd-pstate``.
Lowest Performance (RO)
........................
-It is the absolute lowest performance level of the processor. Selecting a
+This is the absolute lowest performance level of the processor. Selecting a
performance level lower than the lowest nonlinear performance level may
cause an efficiency penalty but should reduce the instantaneous power
consumption of the processor.
@@ -149,14 +149,14 @@ a relative number. This can be expressed as percentage of nominal
performance (infrastructure max). Below the nominal sustained performance
level, desired performance expresses the average performance level of the
processor subject to hardware. Above the nominal performance level,
-processor must provide at least nominal performance requested and go higher
+the processor must provide at least nominal performance requested and go higher
if current operating conditions allow.
Energy Performance Preference (EPP) (RW)
.........................................
-Provides a hint to the hardware if software wants to bias toward performance
-(0x0) or energy efficiency (0xff).
+This attribute provides a hint to the hardware if software wants to bias
+toward performance (0x0) or energy efficiency (0xff).
Key Governors Support
@@ -173,35 +173,35 @@ operating frequencies supported by the hardware. Users can check the
``amd-pstate`` mainly supports ``schedutil`` and ``ondemand`` for dynamic
frequency control. It is to fine tune the processor configuration on
``amd-pstate`` to the ``schedutil`` with CPU CFS scheduler. ``amd-pstate``
-registers adjust_perf callback to implement the CPPC similar performance
-update behavior. It is initialized by ``sugov_start`` and then populate the
-CPU's update_util_data pointer to assign ``sugov_update_single_perf`` as
-the utilization update callback function in CPU scheduler. CPU scheduler
-will call ``cpufreq_update_util`` and assign the target performance
-according to the ``struct sugov_cpu`` that utilization update belongs to.
-Then ``amd-pstate`` updates the desired performance according to the CPU
+registers the adjust_perf callback to implement performance update behavior
+similar to CPPC. It is initialized by ``sugov_start`` and then populates the
+CPU's update_util_data pointer to assign ``sugov_update_single_perf`` as the
+utilization update callback function in the CPU scheduler. The CPU scheduler
+will call ``cpufreq_update_util`` and assign the target performance according
+to the ``struct sugov_cpu`` that the utilization update belongs to.
+Then, ``amd-pstate`` updates the desired performance according to what the CPU
scheduler assigned.
+.. _processor_support:
Processor Support
=======================
-The ``amd-pstate`` initialization will fail if the _CPC in ACPI SBIOS is
-not existed at the detected processor, and it uses ``acpi_cpc_valid`` to
-check the _CPC existence. All Zen based processors support legacy ACPI
-hardware P-States function, so while the ``amd-pstate`` fails to be
-initialized, the kernel will fall back to initialize ``acpi-cpufreq``
-driver.
+The ``amd-pstate`` initialization will fail if the ``_CPC`` entry in the ACPI
+SBIOS does not exist for the detected processor. It uses ``acpi_cpc_valid``
+to check the existence of ``_CPC``. All Zen based processors support the legacy
+ACPI hardware P-States function, so when ``amd-pstate`` fails to initialize,
+the kernel will fall back to initializing the ``acpi-cpufreq`` driver.
There are two types of hardware implementations for ``amd-pstate``: one is
`Full MSR Support <perf_cap_>`_ and another is `Shared Memory Support
-<perf_cap_>`_. It can use :c:macro:`X86_FEATURE_CPPC` feature flag (for
-details refer to Processor Programming Reference (PPR) for AMD Family
-19h Model 51h, Revision A1 Processors [3]_) to indicate the different
-types. ``amd-pstate`` is to register different ``static_call`` instances
-for different hardware implementations.
+<perf_cap_>`_. It can use the :c:macro:`X86_FEATURE_CPPC` feature flag to
+indicate the different types. (For details, refer to the Processor Programming
+Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors [3]_.)
+``amd-pstate`` registers different ``static_call`` instances for the different
+hardware implementations.
-Currently, some of Zen2 and Zen3 processors support ``amd-pstate``. In the
+Currently, some of the Zen2 and Zen3 processors support ``amd-pstate``. In the
future, it will be supported on more and more AMD processors.
Full MSR Support
@@ -210,18 +210,18 @@ Full MSR Support
Some new Zen3 processors such as Cezanne provide the MSR registers directly
while the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is set.
``amd-pstate`` can handle the MSR register to implement the fast switch
-function in ``CPUFreq`` that can shrink latency of frequency control on the
-interrupt context. The functions with ``pstate_xxx`` prefix represent the
-operations of MSR registers.
+function in ``CPUFreq`` that can reduce the latency of frequency control in
+interrupt context. The functions with a ``pstate_xxx`` prefix represent the
+operations on MSR registers.
Shared Memory Support
----------------------
-If :c:macro:`X86_FEATURE_CPPC` CPU feature flag is not set, that means the
-processor supports shared memory solution. In this case, ``amd-pstate``
+If the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is not set, the
+processor supports the shared memory solution. In this case, ``amd-pstate``
uses the ``cppc_acpi`` helper methods to implement the callback functions
-that defined on ``static_call``. The functions with ``cppc_xxx`` prefix
-represent the operations of acpi cppc helpers for shared memory solution.
+that are defined on ``static_call``. The functions with the ``cppc_xxx`` prefix
+represent the operations of ACPI CPPC helpers for the shared memory solution.
AMD P-States and ACPI hardware P-States always can be supported in one
@@ -234,7 +234,7 @@ User Space Interface in ``sysfs``
==================================
``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
-control its functionality at the system level. They located in the
+control its functionality at the system level. They are located in the
``/sys/devices/system/cpu/cpufreq/policyX/`` directory and affect all CPUs. ::
root@hr-test1:/home/ray# ls /sys/devices/system/cpu/cpufreq/policy0/*amd*
@@ -246,66 +246,68 @@ control its functionality at the system level. They located in the
``amd_pstate_highest_perf / amd_pstate_max_freq``
Maximum CPPC performance and CPU frequency that the driver is allowed to
-set in percent of the maximum supported CPPC performance level (the highest
+set, in percent of the maximum supported CPPC performance level (the highest
performance supported in `AMD CPPC Performance Capability <perf_cap_>`_).
-In some of ASICs, the highest CPPC performance is not the one in the _CPC
-table, so we need to expose it to sysfs. If boost is not active but
-supported, this maximum frequency will be larger than the one in
+In some ASICs, the highest CPPC performance is not the one in the ``_CPC``
+table, so we need to expose it to sysfs. If boost is not active, but
+still supported, this maximum frequency will be larger than the one in
``cpuinfo``.
This attribute is read-only.
``amd_pstate_lowest_nonlinear_freq``
-The lowest non-linear CPPC CPU frequency that the driver is allowed to set
-in percent of the maximum supported CPPC performance level (Please see the
+The lowest non-linear CPPC CPU frequency that the driver is allowed to set,
+in percent of the maximum supported CPPC performance level. (Please see the
lowest non-linear performance in `AMD CPPC Performance Capability
-<perf_cap_>`_).
+<perf_cap_>`_.)
This attribute is read-only.
-For other performance and frequency values, we can read them back from
+Other performance and frequency values can be read back from
``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`.
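+
+For example, the values exposed by the driver can simply be read back with
+``cat`` (the output below is illustrative; actual numbers depend on the
+processor)::
+
+  $ cat /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_highest_perf
+  166
+  $ cat /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_lowest_nonlinear_freq
+  1600000
+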
``amd-pstate`` vs ``acpi-cpufreq``
======================================
-On majority of AMD platforms supported by ``acpi-cpufreq``, the ACPI tables
-provided by the platform firmware used for CPU performance scaling, but
-only provides 3 P-states on AMD processors.
-However, on modern AMD APU and CPU series, it provides the collaborative
-processor performance control according to ACPI protocol and customize this
-for AMD platforms. That is fine-grain and continuous frequency range
+On the majority of AMD platforms supported by ``acpi-cpufreq``, the ACPI tables
+provided by the platform firmware are used for CPU performance scaling, but
+only provide 3 P-states on AMD processors.
+However, on modern AMD APU and CPU series, the hardware provides Collaborative
+Processor Performance Control according to the ACPI protocol, customized
+for AMD platforms. That is, a fine-grained and continuous frequency range
instead of the legacy hardware P-states. ``amd-pstate`` is the kernel
-module which supports the new AMD P-States mechanism on most of future AMD
-platforms. The AMD P-States mechanism will be the more performance and energy
+module which supports the new AMD P-States mechanism on most of the future AMD
+platforms. The AMD P-States mechanism is the more performance and energy
efficiency frequency management method on AMD processors.
Kernel Module Options for ``amd-pstate``
=========================================
+.. _shared_mem:
+
``shared_mem``
Use a module param (shared_mem) to enable related processors manually with
**amd_pstate.shared_mem=1**.
Due to the performance issue on the processors with `Shared Memory Support
-<perf_cap_>`_, so we disable it for the moment and will enable this by default
-once we address performance issue on this solution.
+<perf_cap_>`_, we disable it for now and will enable it by default
+once we address the performance issue with this solution.
-The way to check whether current processor is `Full MSR Support <perf_cap_>`_
+To check whether the current processor is using `Full MSR Support <perf_cap_>`_
or `Shared Memory Support <perf_cap_>`_ : ::
ray@hr-test1:~$ lscpu | grep cppc
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
-If CPU Flags have cppc, then this processor supports `Full MSR Support
-<perf_cap_>`_. Otherwise it supports `Shared Memory Support <perf_cap_>`_.
+If the CPU flags have ``cppc``, then this processor supports `Full MSR Support
+<perf_cap_>`_. Otherwise, it supports `Shared Memory Support <perf_cap_>`_.
``cpupower`` tool support for ``amd-pstate``
===============================================
-``amd-pstate`` is supported on ``cpupower`` tool that can be used to dump the frequency
-information. And it is in progress to support more and more operations for new
-``amd-pstate`` module with this tool. ::
+``amd-pstate`` is supported by the ``cpupower`` tool, which can be used to dump
+frequency information. Development is in progress to support more and more
+operations for the new ``amd-pstate`` module with this tool. ::
root@hr-test1:/home/ray# cpupower frequency-info
analyzing CPU 0:
@@ -336,10 +338,10 @@ Trace Events
--------------
There are two static trace events that can be used for ``amd-pstate``
-diagnostics. One of them is the cpu_frequency trace event generally used
+diagnostics. One of them is the ``cpu_frequency`` trace event generally used
by ``CPUFreq``, and the other one is the ``amd_pstate_perf`` trace event
specific to ``amd-pstate``. The following sequence of shell commands can
-be used to enable them and see their output (if the kernel is generally
+be used to enable them and see their output (if the kernel is
configured to support event tracing). ::
root@hr-test1:/home/ray# cd /sys/kernel/tracing/
@@ -364,11 +366,107 @@ configured to support event tracing). ::
<idle>-0 [003] d.s.. 4995.980971: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=3 changed=false fast_switch=true
<idle>-0 [011] d.s.. 4995.980996: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=11 changed=false fast_switch=true
-The cpu_frequency trace event will be triggered either by the ``schedutil`` scaling
+The ``cpu_frequency`` trace event will be triggered either by the ``schedutil`` scaling
governor (for the policies it is attached to), or by the ``CPUFreq`` core (for the
policies with other scaling governors).
+Tracer Tool
+-------------
+
+``amd_pstate_tracer.py`` can record and parse the ``amd-pstate`` trace log and
+then generate performance plots. This utility can be used to debug and tune the
+performance of the ``amd-pstate`` driver. The tracer tool needs to import the
+intel pstate tracer.
+
+The tracer tool is located in ``linux/tools/power/x86/amd_pstate_tracer``. It
+can be used in two ways. If a trace file is available, it can be parsed
+directly with the command ::
+
+ ./amd_pstate_trace.py [-c cpus] -t <trace_file> -n <test_name>
+
+Or generate a trace file with root privileges, then parse and plot it with the command ::
+
+ sudo ./amd_pstate_trace.py [-c cpus] -n <test_name> -i <interval> [-m kbytes]
+
+The test results can be found in ``results/test_name``. The following is an
+example of part of the output. ::
+
+ common_cpu common_secs common_usecs min_perf des_perf max_perf freq mperf apef tsc load duration_ms sample_num elapsed_time common_comm
+ CPU_005 712 116384 39 49 166 0.7565 9645075 2214891 38431470 25.1 11.646 469 2.496 kworker/5:0-40
+ CPU_006 712 116408 39 49 166 0.6769 8950227 1839034 37192089 24.06 11.272 470 2.496 kworker/6:0-1264
+
+Unit Tests for amd-pstate
+-------------------------
+
+``amd-pstate-ut`` is a test module for testing the ``amd-pstate`` driver.
+
+ * It can help all users verify their processor support (SBIOS/Firmware or Hardware).
+
+ * The kernel can run a basic functional test to avoid regressions during updates.
+
+ * More functional or performance tests can be introduced to compare results, which will benefit power and performance scaling optimization.
+
+1. Test case descriptions
+
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | Index | Functions | Description |
+ +=========+================================+====================================================================================+
+ | 0 | amd_pstate_ut_acpi_cpc_valid || Check whether the _CPC object is present in SBIOS. |
+ | | || |
+ | | || The detail refer to `Processor Support <processor_support_>`_. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | 1 | amd_pstate_ut_check_enabled || Check whether AMD P-State is enabled. |
+ | | || |
+ | | || AMD P-States and ACPI hardware P-States always can be supported in one processor. |
+ | | | But AMD P-States has the higher priority and if it is enabled with |
+ | | | :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, it will respond to the |
+ | | | request from AMD P-States. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | 2 | amd_pstate_ut_check_perf || Check if the each performance values are reasonable. |
+ | | || highest_perf >= nominal_perf > lowest_nonlinear_perf > lowest_perf > 0. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | 3 | amd_pstate_ut_check_freq || Check if the each frequency values and max freq when set support boost mode |
+ | | | are reasonable. |
+ | | || max_freq >= nominal_freq > lowest_nonlinear_freq > min_freq > 0 |
+ | | || If boost is not active but supported, this maximum frequency will be larger than |
+ | | | the one in ``cpuinfo``. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+
+#. How to execute the tests
+
+ The tests are implemented as a test module within the kselftest framework.
+ The amd-pstate-ut module is created and tied into kselftest (for
+ details refer to Linux Kernel Selftests [4]_).
+
+ 1. Build
+
+ + enable the :c:macro:`CONFIG_X86_AMD_PSTATE` configuration option.
+ + set the :c:macro:`CONFIG_X86_AMD_PSTATE_UT` configuration option to M.
+ + build the kernel
+ + build the selftests ::
+
+ $ cd linux
+ $ make -C tools/testing/selftests
+
+ #. Installation & Steps ::
+
+ $ make -C tools/testing/selftests install INSTALL_PATH=~/kselftest
+ $ sudo ./kselftest/run_kselftest.sh -c amd-pstate
+ TAP version 13
+ 1..1
+ # selftests: amd-pstate: amd-pstate-ut.sh
+ # amd-pstate-ut: ok
+ ok 1 selftests: amd-pstate: amd-pstate-ut.sh
+
+ #. Results ::
+
+ $ dmesg | grep "amd_pstate_ut" | tee log.txt
+ [12977.570663] amd_pstate_ut: 1 amd_pstate_ut_acpi_cpc_valid success!
+ [12977.570673] amd_pstate_ut: 2 amd_pstate_ut_check_enabled success!
+ [12977.571207] amd_pstate_ut: 3 amd_pstate_ut_check_perf success!
+ [12977.571212] amd_pstate_ut: 4 amd_pstate_ut_check_freq success!
+
Reference
===========
@@ -380,3 +478,6 @@ Reference
.. [3] Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors
https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip
+
+.. [4] Linux Kernel Selftests,
+ https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst
index aec2cd2aaea7..19754beb5a4e 100644
--- a/Documentation/admin-guide/pm/cpuidle.rst
+++ b/Documentation/admin-guide/pm/cpuidle.rst
@@ -612,8 +612,8 @@ the ``menu`` governor to be used on the systems that use the ``ladder`` governor
by default this way, for example.
The other kernel command line parameters controlling CPU idle time management
-described below are only relevant for the *x86* architecture and some of
-them affect Intel processors only.
+described below are only relevant for the *x86* architecture and references
+to ``intel_idle`` affect Intel processors only.
The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
@@ -635,10 +635,13 @@ idle, so it very well may hurt single-thread computations performance as well as
energy-efficiency. Thus using it for performance reasons may not be a good idea
at all.]
-The ``idle=nomwait`` option disables the ``intel_idle`` driver and causes
-``acpi_idle`` to be used (as long as all of the information needed by it is
-there in the system's ACPI tables), but it is not allowed to use the
-``MWAIT`` instruction of the CPUs to ask the hardware to enter idle states.
+The ``idle=nomwait`` option prevents the use of the ``MWAIT`` instruction of
+the CPU to enter idle states. When this option is used, the ``acpi_idle``
+driver will use the ``HLT`` instruction instead of ``MWAIT``. On systems
+running Intel processors, this option disables the ``intel_idle`` driver
+and forces the use of the ``acpi_idle`` driver instead. Note that in either
+case, the ``acpi_idle`` driver will function only if all the information needed
+by it is in the system's ACPI tables.
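+
+For example, after booting with ``idle=nomwait`` on a system with an Intel
+processor, the driver actually in use can be verified via sysfs (illustrative
+output)::
+
+  $ cat /sys/devices/system/cpu/cpuidle/current_driver
+  acpi_idle
+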
In addition to the architecture-level kernel command line options affecting CPU
idle time management, there are parameters affecting individual ``CPUIdle``
diff --git a/Documentation/admin-guide/pm/intel-speed-select.rst b/Documentation/admin-guide/pm/intel-speed-select.rst
index 0a1fbdb54bfe..a2bfb971654f 100644
--- a/Documentation/admin-guide/pm/intel-speed-select.rst
+++ b/Documentation/admin-guide/pm/intel-speed-select.rst
@@ -262,6 +262,28 @@ Which shows that the base frequency now increased from 2600 MHz at performance
level 0 to 2800 MHz at performance level 4. As a result, any workload, which can
use fewer CPUs, can see a boost of 200 MHz compared to performance level 0.
+Changing performance level via BMC Interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is possible to change the SST-PP level using an out of band (OOB) agent (via
+some remote management console, through the BMC "Baseboard Management
+Controller" interface). This mode is supported from the Sapphire Rapids
+processor generation onwards. The kernel and tool changes to support this mode
+were added in Linux kernel version 5.18. To enable this feature, the kernel
+config option "CONFIG_INTEL_HFI_THERMAL" is required. The minimum tool version
+that supports this feature is "v1.12", which is part of Linux kernel version 5.18.
+
+To support such a configuration, this tool can be used as a daemon. Add
+the command line option --oob::
+
+ # intel-speed-select --oob
+ Intel(R) Speed Select Technology
+ Executing on CPU model:143[0x8f]
+ OOB mode is enabled and will run as daemon
+
+In this mode the tool will online/offline CPUs based on the new performance
+level.
+
Check presence of other Intel(R) SST features
---------------------------------------------
diff --git a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
new file mode 100644
index 000000000000..09169d935835
--- /dev/null
+++ b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
@@ -0,0 +1,60 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+==============================
+Intel Uncore Frequency Scaling
+==============================
+
+:Copyright: |copy| 2022 Intel Corporation
+
+:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
+
+Introduction
+------------
+
+The uncore can consume a significant amount of power in Intel's Xeon servers,
+depending on the workload characteristics. To optimize the total power and improve overall
+performance, SoCs have internal algorithms for scaling uncore frequency. These
+algorithms monitor workload usage of uncore and set a desirable frequency.
+
+It is possible that users have different expectations of uncore performance and
+want to have control over it. The objective is similar to allowing users to set
+the scaling min/max frequencies via cpufreq sysfs to improve CPU performance.
+Users may have some latency sensitive workloads where they do not want any
+change to uncore frequency. Also, users may have workloads which require
+different core and uncore performance at distinct phases and they may want to
+use both cpufreq and the uncore scaling interface to distribute power and
+improve overall performance.
+
+Sysfs Interface
+---------------
+
+To control uncore frequency, a sysfs interface is provided in the directory:
+`/sys/devices/system/cpu/intel_uncore_frequency/`.
+
+There is one directory for each package and die combination, as the scope of
+uncore scaling control is per die in multi-die/package SoCs, or per
+package for single-die-per-package SoCs. The name represents the
+scope of control. For example: 'package_00_die_00' is for package id 0 and
+die 0.
+
+Each package_*_die_* contains the following attributes:
+
+``initial_max_freq_khz``
+ Out of reset, this attribute represents the maximum possible frequency.
+ This is a read-only attribute. If users adjust max_freq_khz,
+ they can always go back to maximum using the value from this attribute.
+
+``initial_min_freq_khz``
+ Out of reset, this attribute represents the minimum possible frequency.
+ This is a read-only attribute. If users adjust min_freq_khz,
+ they can always go back to minimum using the value from this attribute.
+
+``max_freq_khz``
+ This attribute is used to set the maximum uncore frequency.
+
+``min_freq_khz``
+ This attribute is used to set the minimum uncore frequency.
+
+``current_freq_khz``
+ This attribute is used to get the current uncore frequency.
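+
+For example, the scaling limits can be inspected and lowered as follows (a
+minimal sketch; the frequency values shown are illustrative)::
+
+  # cd /sys/devices/system/cpu/intel_uncore_frequency/package_00_die_00
+  # cat initial_max_freq_khz
+  2400000
+  # echo 2000000 > max_freq_khz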
diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst
index 5d2757e2de65..ee45887811ff 100644
--- a/Documentation/admin-guide/pm/working-state.rst
+++ b/Documentation/admin-guide/pm/working-state.rst
@@ -15,3 +15,4 @@ Working-State Power Management
cpufreq_drivers
intel_epb
intel-speed-select
+ intel_uncore_frequency_scaling
diff --git a/Documentation/admin-guide/reporting-issues.rst b/Documentation/admin-guide/reporting-issues.rst
index d7ac13f789cc..ec62151fe672 100644
--- a/Documentation/admin-guide/reporting-issues.rst
+++ b/Documentation/admin-guide/reporting-issues.rst
@@ -1,14 +1,5 @@
.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
-..
- If you want to distribute this text under CC-BY-4.0 only, please use 'The
- Linux kernel developers' for author attribution and link this as source:
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-issues.rst
-..
- Note: Only the content of this RST file as found in the Linux kernel sources
- is available under CC-BY-4.0, as versions of this text that were processed
- (for example by the kernel's build system) might contain content taken from
- files which use a more restrictive license.
-
+.. See the bottom of this file for additional redistribution information.
Reporting issues
++++++++++++++++
@@ -395,22 +386,16 @@ fixed as soon as possible, hence there are 'issues of high priority' that get
handled slightly differently in the reporting process. Three type of cases
qualify: regressions, security issues, and really severe problems.
-You deal with a 'regression' if something that worked with an older version of
-the Linux kernel does not work with a newer one or somehow works worse with it.
-It thus is a regression when a WiFi driver that did a fine job with Linux 5.7
-somehow misbehaves with 5.8 or doesn't work at all. It's also a regression if
-an application shows erratic behavior with a newer kernel, which might happen
-due to incompatible changes in the interface between the kernel and the
-userland (like procfs and sysfs). Significantly reduced performance or
-increased power consumption also qualify as regression. But keep in mind: the
-new kernel needs to be built with a configuration that is similar to the one
-from the old kernel (see below how to achieve that). That's because the kernel
-developers sometimes can not avoid incompatibilities when implementing new
-features; but to avoid regressions such features have to be enabled explicitly
-during build time configuration.
+You deal with a regression if some application or practical use case running
+fine with one Linux kernel works worse or not at all with a newer version
+compiled using a similar configuration. The document
+Documentation/admin-guide/reporting-regressions.rst explains this in more
+detail. It also provides a good deal of other information about regressions you
+might want to be aware of; it for example explains how to add your issue to the
+list of tracked regressions, to ensure it won't fall through the cracks.
What qualifies as security issue is left to your judgment. Consider reading
-'Documentation/admin-guide/security-bugs.rst' before proceeding, as it
+Documentation/admin-guide/security-bugs.rst before proceeding, as it
provides additional details how to best handle security issues.
An issue is a 'really severe problem' when something totally unacceptably bad
@@ -517,7 +502,7 @@ line starting with 'CPU:'. It should end with 'Not tainted' if the kernel was
not tainted when it noticed the problem; it was tainted if you see 'Tainted:'
followed by a few spaces and some letters.
-If your kernel is tainted, study 'Documentation/admin-guide/tainted-kernels.rst'
+If your kernel is tainted, study Documentation/admin-guide/tainted-kernels.rst
to find out why. Try to eliminate the reason. Often it's caused by one these
three things:
@@ -1043,7 +1028,7 @@ down the culprit, as maintainers often won't have the time or setup at hand to
reproduce it themselves.
To find the change there is a process called 'bisection' which the document
-'Documentation/admin-guide/bug-bisect.rst' describes in detail. That process
+Documentation/admin-guide/bug-bisect.rst describes in detail. That process
will often require you to build about ten to twenty kernel images, trying to
reproduce the issue with each of them before building the next. Yes, that takes
some time, but don't worry, it works a lot quicker than most people assume.
@@ -1073,10 +1058,11 @@ When dealing with regressions make sure the issue you face is really caused by
the kernel and not by something else, as outlined above already.
In the whole process keep in mind: an issue only qualifies as regression if the
-older and the newer kernel got built with a similar configuration. The best way
-to archive this: copy the configuration file (``.config``) from the old working
-kernel freshly to each newer kernel version you try. Afterwards run ``make
-olddefconfig`` to adjust it for the needs of the new version.
+older and the newer kernel got built with a similar configuration. This can be
+achieved by using ``make olddefconfig``, as explained in more detail by
+Documentation/admin-guide/reporting-regressions.rst; that document also
+provides a good deal of other information about regressions you might want to be
+aware of.
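+
+For example (the paths are illustrative; start from the ``.config`` of your
+old, working kernel)::
+
+  # cp /boot/config-$(uname -r) ~/linux/.config
+  # cd ~/linux && make olddefconfig
+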
Write and send the report
@@ -1283,7 +1269,7 @@ them when sending the report by mail. If you filed it in a bug tracker, forward
the report's text to these addresses; but on top of it put a small note where
you mention that you filed it with a link to the ticket.
-See 'Documentation/admin-guide/security-bugs.rst' for more information.
+See Documentation/admin-guide/security-bugs.rst for more information.
Duties after the report went out
@@ -1571,7 +1557,7 @@ Once your report is out your might get asked to do a proper one, as it allows to
pinpoint the exact change that causes the issue (which then can easily get
reverted to fix the issue quickly). Hence consider to do a proper bisection
right away if time permits. See the section 'Special care for regressions' and
-the document 'Documentation/admin-guide/bug-bisect.rst' for details how to
+the document Documentation/admin-guide/bug-bisect.rst for details how to
perform one. In case of a successful bisection add the author of the culprit to
the recipients; also CC everyone in the signed-off-by chain, which you find at
the end of its commit message.
@@ -1594,7 +1580,7 @@ Some fixes are too complex
Even small and seemingly obvious code-changes sometimes introduce new and
totally unexpected problems. The maintainers of the stable and longterm kernels
are very aware of that and thus only apply changes to these kernels that are
-within rules outlined in 'Documentation/process/stable-kernel-rules.rst'.
+within rules outlined in Documentation/process/stable-kernel-rules.rst.
Complex or risky changes for example do not qualify and thus only get applied
to mainline. Other fixes are easy to get backported to the newest stable and
@@ -1756,10 +1742,23 @@ art will lay some groundwork to improve the situation over time.
..
- This text is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If you
- spot a typo or small mistake, feel free to let him know directly and he'll
- fix it. You are free to do the same in a mostly informal way if you want
- to contribute changes to the text, but for copyright reasons please CC
+ end-of-content
+..
+ This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If
+ you spot a typo or small mistake, feel free to let him know directly and
+ he'll fix it. You are free to do the same in a mostly informal way if you
+ want to contribute changes to the text, but for copyright reasons please CC
linux-doc@vger.kernel.org and "sign-off" your contribution as
Documentation/process/submitting-patches.rst outlines in the section "Sign
your work - the Developer's Certificate of Origin".
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use "The Linux kernel developers" for author attribution and link
+ this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-issues.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/admin-guide/reporting-regressions.rst b/Documentation/admin-guide/reporting-regressions.rst
new file mode 100644
index 000000000000..d8adccdae23f
--- /dev/null
+++ b/Documentation/admin-guide/reporting-regressions.rst
@@ -0,0 +1,451 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+.. [see the bottom of this file for redistribution information]
+
+Reporting regressions
++++++++++++++++++++++
+
+"*We don't cause regressions*" is the first rule of Linux kernel development;
+Linux founder and lead developer Linus Torvalds established it himself and
+ensures it's obeyed.
+
+This document describes what the rule means for users and how the Linux kernel's
+development model ensures that all reported regressions are addressed; aspects
+relevant for kernel developers are left to Documentation/process/handling-regressions.rst.
+
+
+The important bits (aka "TL;DR")
+================================
+
+#. It's a regression if something running fine with one Linux kernel works worse
+ or not at all with a newer version. Note, the newer kernel has to be compiled
+ using a similar configuration; the detailed explanations below describes this
+ and other fine print in more detail.
+
+#. Report your issue as outlined in Documentation/admin-guide/reporting-issues.rst,
+ it already covers all aspects important for regressions, which are repeated
+ below for convenience. Two of them are important: start your report's subject
+ with "[REGRESSION]" and CC or forward it to `the regression mailing list
+ <https://lore.kernel.org/regressions/>`_ (regressions@lists.linux.dev).
+
+#. Optional, but recommended: when sending or forwarding your report, make the
+ Linux kernel regression tracking bot "regzbot" track the issue by specifying
+ when the regression started like this::
+
+ #regzbot introduced v5.13..v5.14-rc1
+
+
+All the details on Linux kernel regressions relevant for users
+==============================================================
+
+
+The important basics
+--------------------
+
+
+What is a "regression" and what is the "no regressions rule"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's a regression if some application or practical use case running fine with
+one Linux kernel works worse or not at all with a newer version compiled using a
+similar configuration. The "no regressions rule" forbids this to take place; if
+it happens by accident, developers that caused it are expected to quickly fix
+the issue.
+
+It thus is a regression when a WiFi driver from Linux 5.13 works fine, but with
+5.14 doesn't work at all, works significantly slower, or misbehaves somehow.
+It's also a regression if a perfectly working application suddenly shows erratic
+behavior with a newer kernel version; such issues can be caused by changes in
+procfs, sysfs, or one of the many other interfaces Linux provides to userland
+software. But keep in mind, as mentioned earlier: 5.14 in this example needs to
+be built from a configuration similar to the one from 5.13. This can be achieved
+using ``make olddefconfig``, as explained in more detail below.
+
+Note the "practical use case" in the first sentence of this section: developers
+despite the "no regressions" rule are free to change any aspect of the kernel
+and even APIs or ABIs to userland, as long as no existing application or use
+case breaks.
+
+Also be aware the "no regressions" rule covers only interfaces the kernel
+provides to the userland. It thus does not apply to kernel-internal interfaces
+like the module API, which some externally developed drivers use to hook into
+the kernel.
+
+How do I report a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Just report the issue as outlined in
+Documentation/admin-guide/reporting-issues.rst; it already describes the
+important points. The following aspects outlined there are especially relevant
+for regressions:
+
+ * When checking for existing reports to join, also search the `archives of the
+ Linux regressions mailing list <https://lore.kernel.org/regressions/>`_ and
+ `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
+
+ * Start your report's subject with "[REGRESSION]".
+
+ * In your report, clearly mention the last kernel version that worked fine and
+ the first broken one. Ideally try to find the exact change causing the
+ regression using a bisection, as explained below in more detail.
+
+ * Remember to let the Linux regressions mailing list
+ (regressions@lists.linux.dev) know about your report:
+
+ * If you report the regression by mail, CC the regressions list.
+
+ * If you report your regression to some bug tracker, forward the submitted
+ report by mail to the regressions list while CCing the maintainer and the
+ mailing list for the subsystem in question.
+
+ If it's a regression within a stable or longterm series (e.g.
+ v5.15.3..v5.15.5), remember to CC the `Linux stable mailing list
+ <https://lore.kernel.org/stable/>`_ (stable@vger.kernel.org).
+
+ In case you performed a successful bisection, add everyone to the CC the
+ culprit's commit message mentions in lines starting with "Signed-off-by:".
+
+When CCing or forwarding your report to the list, consider directly telling the
+aforementioned Linux kernel regression tracking bot about your report. To do
+that, include a paragraph like this in your mail::
+
+ #regzbot introduced: v5.13..v5.14-rc1
+
+Regzbot will then consider your mail a report for a regression introduced in the
+specified version range. In the above case, Linux v5.13 still worked fine and Linux
+v5.14-rc1 was the first version where you encountered the issue. If you
+performed a bisection to find the commit that caused the regression, specify the
+culprit's commit-id instead::
+
+ #regzbot introduced: 1f2e3d4c5d
+
+Placing such a "regzbot command" is in your interest, as it will ensure the
+report won't fall through the cracks unnoticed. If you omit this, the Linux
+kernel's regressions tracker will take care of telling regzbot about your
+regression, as long as you send a copy to the regressions mailing list. But the
+regression tracker is just one human who sometimes has to rest or occasionally
+might even enjoy some time away from computers (as crazy as that might sound).
+Relying on this person will thus result in an unnecessary delay before the
+regression becomes mentioned `on the list of tracked and unresolved Linux
+kernel regressions <https://linux-regtracking.leemhuis.info/regzbot/>`_ and the
+weekly regression reports sent by regzbot. Such delays can result in Linus
+Torvalds being unaware of important regressions when deciding between "continue
+development or call this finished and release the final?".
+
+Are really all regressions fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Nearly all of them are, as long as the change causing the regression (the
+"culprit commit") is reliably identified. Some regressions can be fixed without
+this, but often it's required.
+
+Who needs to find the root cause of a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Developers of the affected code area should try to locate the culprit on their
+own. But for them that's often impossible to do with reasonable effort, as quite
+a lot of issues only occur in a particular environment outside the developer's
+reach -- for example, a specific hardware platform, firmware, Linux distro,
+system's configuration, or application. That's why in the end it's often up to
+the reporter to locate the culprit commit; sometimes users might even need to
+run additional tests afterwards to pinpoint the exact root cause. Developers
+should offer advice and reasonably help where they can, to make this process
+relatively easy and achievable for typical users.
+
+How can I find the culprit?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Perform a bisection, as roughly outlined in
+Documentation/admin-guide/reporting-issues.rst and described in more detail by
+Documentation/admin-guide/bug-bisect.rst. It might sound like a lot of work, but
+in many cases finds the culprit relatively quickly. If it's hard or
+time-consuming to reliably reproduce the issue, consider teaming up with other
+affected users to narrow down the search range together.
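+
+As a rough sketch of what such a bisection looks like in practice (the version
+numbers below are just placeholders, use the ones matching your situation)::
+
+   git bisect start
+   git bisect good v5.13
+   git bisect bad v5.14-rc1
+   # build, install, and boot the kernel git just checked out, test it, then
+   # tell git the result; repeat until git prints the culprit commit:
+   git bisect good    # this kernel works fine
+   git bisect bad     # this kernel shows the regression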
+
+Who can I ask for advice when it comes to regressions?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
+CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
+issue might better be dealt with in private, feel free to omit the list.
+
+
+Additional details about regressions
+------------------------------------
+
+
+What is the goal of the "no regressions rule"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Users should feel safe when updating kernel versions and not have to worry
+something might break. It is in the kernel developers' interest to make
+updating attractive: they don't want users to stay on stable or longterm Linux
+series that are either abandoned or more than one and a half years old. That's
+in everybody's interest, as `those series might have known bugs, security
+issues, or other problematic aspects already fixed in later versions
+<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
+Additionally, the kernel developers want to make it simple and appealing for
+users to test the latest pre-release or regular release. That's also in
+everybody's interest, as it's a lot easier to track down and fix problems, if
+they are reported shortly after being introduced.
+
+Is the "no regressions" rule really adhered in practice?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's taken really seriously, as can be seen by many mailing list posts from
+Linux creator and lead developer Linus Torvalds, some of which are quoted in
+Documentation/process/handling-regressions.rst.
+
+Exceptions to this rule are extremely rare; in the past developers almost always
+turned out to be wrong when they assumed a particular situation warranted
+an exception.
+
+Who ensures the "no regressions" rule is actually followed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The subsystem maintainers should take care of that; they are watched and
+supported by the tree maintainers -- e.g. Linus Torvalds for mainline and
+Greg Kroah-Hartman et al. for various stable/longterm series.
+
+All of them are helped by people trying to ensure no regression report falls
+through the cracks. One of them is Thorsten Leemhuis, who's currently acting as
+the Linux kernel's "regressions tracker"; to facilitate this work he relies on
+regzbot, the Linux kernel regression tracking bot. That's why you want to put
+your report on the radar of these people by CCing or forwarding each report to
+the regressions mailing list, ideally with a "regzbot command" in your mail to
+get it tracked immediately.
+
+How quickly are regressions normally fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Developers should fix any reported regression as quickly as possible, to provide
+affected users with a solution in a timely manner and prevent more users from
+running into the issue; nevertheless developers need to take enough time and
+care to ensure regression fixes do not cause additional damage.
+
+The answer thus depends on various factors like the impact of a regression, its
+age, or the Linux series in which it occurs. In the end though, most regressions
+should be fixed within two weeks.
+
+Is it a regression, if the issue can be avoided by updating some software?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Almost always: yes. If a developer tells you otherwise, ask the regression
+tracker for advice as outlined above.
+
+Is it a regression, if a newer kernel works slower or consumes more energy?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Yes, but the difference has to be significant. A five percent slow-down in a
+micro-benchmark thus is unlikely to qualify as a regression, unless it also
+influences the results of a broad benchmark by more than one percent. If in
+doubt, ask for advice.
+
+Is it a regression, if an external kernel module breaks when updating Linux?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+No, as the "no regression" rule is about interfaces and services the Linux
+kernel provides to the userland. It thus does not cover building or running
+externally developed kernel modules, as they run in kernel-space and hook into
+the kernel using internal interfaces that occasionally change.
+
+How are regressions handled that are caused by security fixes?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In extremely rare situations security issues can't be fixed without causing
+regressions; such fixes are applied anyway, as they are the lesser evil in the
+end. Luckily this dilemma can almost always be avoided, as key developers for
+the affected area and often Linus Torvalds himself try very hard to fix security
+issues without causing regressions.
+
+If you nevertheless face such a case, check the mailing list archives if people
+tried their best to avoid the regression. If not, report it; if in doubt, ask
+for advice as outlined above.
+
+What happens if fixing a regression is impossible without causing another?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Sadly these things happen, but luckily not very often; if they occur, expert
+developers of the affected code area should look into the issue to find a fix
+that avoids the regression or at least limits its impact. If you run into such
+a situation, do what was outlined already for regressions caused by security
+fixes: check earlier discussions if people already tried their best and ask for
+advice if in doubt.
+
+A quick note while we are at it: such situations could be avoided if people
+regularly gave mainline pre-releases (say v5.15-rc1 or -rc3) from each
+development cycle a test run. This is best explained by imagining a change
+integrated between Linux v5.14 and v5.15-rc1 which causes a regression, but at
+the same time is a hard requirement for some other improvement applied for
+5.15-rc1. All these changes often can simply be reverted and the regression thus
+solved, if someone finds and reports it before 5.15 is released. A few days or
+weeks later this solution can become impossible, as some software might have
+started to rely on aspects introduced by one of the follow-up changes: reverting
+all changes would then cause a regression for users of said software and thus is
+out of the question.
+
+Is it a regression, if some feature I relied on was removed months ago?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is, but often it's hard to fix such regressions due to the aspects outlined
+in the previous section. It hence needs to be dealt with on a case-by-case
+basis. This is another reason why it's in everybody's interest to regularly test
+mainline pre-releases.
+
+Does the "no regression" rule apply if I seem to be the only affected person?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It does, but only for practical usage: the Linux developers want to be free to
+remove support for hardware that nowadays is only found in attics and museums.
+
+Note, sometimes regressions can't be avoided to make progress -- and the latter
+is needed to prevent Linux from stagnation. Hence, if only very few users seem
+to be affected by a regression, it might for the greater good be in their and
+everyone else's interest to let things pass. Especially if there is an
+easy way to circumvent the regression somehow, for example by updating some
+software or using a kernel parameter created just for this purpose.
+
+Does the regression rule apply for code in the staging tree as well?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Not according to the `help text for the configuration option covering all
+staging code <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
+which since its early days states::
+
+ Please note that these drivers are under heavy development, may or
+ may not work, and may contain userspace interfaces that most likely
+ will be changed in the near future.
+
+The staging developers nevertheless often adhere to the "no regressions" rule,
+but sometimes bend it to make progress. That's for example why some users had to
+deal with (often negligible) regressions when a WiFi driver from the staging
+tree was replaced by a totally different one written from scratch.
+
+Why do later versions have to be "compiled with a similar configuration"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because the Linux kernel developers sometimes integrate changes known to cause
+regressions, but make them optional and disable them in the kernel's default
+configuration. This trick allows progress, as the "no regressions" rule
+otherwise would lead to stagnation.
+
+Consider for example a new security feature blocking access to some kernel
+interfaces often abused by malware, which at the same time are required to run a
+few rarely used applications. The outlined approach makes both camps happy:
+people using these applications can leave the new security feature off, while
+everyone else can enable it without running into trouble.
+
+How to create a configuration similar to the one of an older kernel?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Start your machine with a known-good kernel and configure the newer Linux
+version with ``make olddefconfig``. This makes the kernel's build scripts pick
+up the configuration file (the ".config" file) from the running kernel as the
+base for the new one you are about to compile; afterwards they set all new
+configuration options to their default value, which should disable new features
+that might cause regressions.
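+
+A minimal sketch of the procedure; the path of the old configuration file is an
+assumption, as distributions store it in different places (some also expose it
+as /proc/config.gz)::
+
+   # run this in your kernel source tree while the known-good kernel is running
+   cp /boot/config-"$(uname -r)" .config
+   make olddefconfig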
+
+Can I report a regression I found with pre-compiled vanilla kernels?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You need to ensure the newer kernel was compiled with a similar configuration
+file as the older one (see above), as those who built them might have enabled
+some known-to-be-incompatible feature for the newer kernel. If in doubt, report
+the matter to the kernel's provider and ask for advice.
+
+
+More about regression tracking with "regzbot"
+---------------------------------------------
+
+What is regression tracking and why should I care about it?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Rules like "no regressions" need someone to ensure they are followed, otherwise
+they are broken either accidentally or on purpose. History has shown this to be
+true for Linux kernel development as well. That's why Thorsten Leemhuis, the
+Linux kernel's regression tracker, and a few other people try to ensure all
+regressions are fixed by keeping an eye on them until they are resolved. None of
+them are paid for this; that's why the work is done on a best-effort basis.
+
+Why and how are Linux kernel regressions tracked using a bot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Tracking regressions completely manually has proven to be quite hard due to the
+distributed and loosely structured nature of the Linux kernel development
+process. That's why the Linux kernel's regression tracker developed regzbot to
+facilitate the work, with the long-term goal of automating regression tracking
+as much as
+possible for everyone involved.
+
+Regzbot works by watching for replies to reports of tracked regressions.
+Additionally, it looks out for posted or committed patches referencing such
+reports with "Link:" tags; replies to such patch postings are tracked as well.
+Combined, this data provides good insights into the current state of the fixing
+process.
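+
+A purely hypothetical patch trailer referencing a tracked report might thus
+look like this (the commit id, names, and message id below are made up)::
+
+   Fixes: 1f2e3d4c5d ("driver: do something smart")
+   Reported-by: Some Reporter <reporter@example.org>
+   Link: https://lore.kernel.org/r/some-report-message-id@example.org/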
+
+How to see which regressions regzbot tracks currently?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Check out `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
+
+What kind of issues are supposed to be tracked by regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bot is meant to track regressions, hence please don't involve regzbot for
+regular issues. But it's okay for the Linux kernel's regression tracker if you
+involve regzbot to track severe issues, like reports about hangs, corrupted
+data, or internal errors (Panic, Oops, BUG(), warning, ...).
+
+How to change aspects of a tracked regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By using a 'regzbot command' in a direct or indirect reply to the mail with the
+report. The easiest way to do that: find the report in your "Sent" folder or the
+mailing list archive and reply to it using your mailer's "Reply-all" function.
+In that mail, use one of the following commands in a stand-alone paragraph (IOW:
+use blank lines to separate one or multiple of these commands from the rest of
+the mail's text).
+
+ * Update when the regression started to happen, for example after performing a
+ bisection::
+
+ #regzbot introduced: 1f2e3d4c5d
+
+ * Set or update the title::
+
+ #regzbot title: foo
+
+ * Monitor a discussion or bugzilla.kernel.org ticket where additional aspects of
+ the issue or a fix are discussed::
+
+ #regzbot monitor: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
+ #regzbot monitor: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
+
+ * Point to a place with further details of interest, like a mailing list post
+ or a ticket in a bug tracker that is slightly related, but about a different
+ topic::
+
+ #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
+
+ * Mark a regression as invalid::
+
+ #regzbot invalid: wasn't a regression, problem has always existed
+
+Regzbot supports a few other commands primarily used by developers or people
+tracking regressions. They and more details about the aforementioned regzbot
+commands can be found in the `getting started guide
+<https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_ and
+the `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
+for regzbot.
+
+..
+ end-of-content
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use "The Linux kernel developers" for author attribution and link
+ this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-regressions.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index d359bcfadd39..98d1b198b2b4 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -38,8 +38,8 @@ acct
If BSD-style process accounting is enabled these values control
its behaviour. If free space on filesystem where the log lives
-goes below ``lowwater``% accounting suspends. If free space gets
-above ``highwater``% accounting resumes. ``frequency`` determines
+goes below ``lowwater``\ % accounting suspends. If free space gets
+above ``highwater``\ % accounting resumes. ``frequency`` determines
how often do we check the amount of free space (value is in
seconds). Default:
@@ -65,6 +65,11 @@ combining the following values:
4 s3_beep
= =======
+arch
+====
+
+The machine hardware name, the same output as ``uname -m``
+(e.g. ``x86_64`` or ``aarch64``).
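+
+For example (the output obviously depends on the machine)::
+
+   $ cat /proc/sys/kernel/arch
+   x86_64
+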
auto_msgmni
===========
@@ -592,68 +597,59 @@ to the guest kernel command line (see
Documentation/admin-guide/kernel-parameters.rst).
-numa_balancing
-==============
+nmi_wd_lpm_factor (PPC only)
+============================
-Enables/disables automatic page fault based NUMA memory
-balancing. Memory is moved automatically to nodes
-that access it often.
+Factor to apply to the NMI watchdog timeout (only when ``nmi_watchdog`` is
+set to 1). This factor represents the percentage added to
+``watchdog_thresh`` when calculating the NMI watchdog timeout during an
+LPM (Live Partition Migration). The soft lockup timeout is not impacted.
-Enables/disables automatic NUMA memory balancing. On NUMA machines, there
-is a performance penalty if remote memory is accessed by a CPU. When this
-feature is enabled the kernel samples what task thread is accessing memory
-by periodically unmapping pages and later trapping a page fault. At the
-time of the page fault, it is determined if the data being accessed should
-be migrated to a local memory node.
+A value of 0 means no change. The default value is 200 meaning the NMI
+watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10).
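+
+Put differently, the timeout appears to follow
+``watchdog_thresh + watchdog_thresh * factor / 100``; with the defaults that is
+10s + 10s * 200 / 100 = 30s.
+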
-The unmapping of pages and trapping faults incur additional overhead that
-ideally is offset by improved memory locality but there is no universal
-guarantee. If the target workload is already bound to NUMA nodes then this
-feature should be disabled. Otherwise, if the system overhead from the
-feature is too high then the rate the kernel samples for NUMA hinting
-faults may be controlled by the `numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
-numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls.
-
-
-numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
-===============================================================================================================================
+numa_balancing
+==============
-Automatic NUMA balancing scans tasks address space and unmaps pages to
-detect if pages are properly placed or if the data should be migrated to a
-memory node local to where the task is running. Every "scan delay" the task
-scans the next "scan size" number of pages in its address space. When the
-end of the address space is reached the scanner restarts from the beginning.
+Enables/disables and configures automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes that access it often.
+The value to set can be the result of ORing the following:
-In combination, the "scan delay" and "scan size" determine the scan rate.
-When "scan delay" decreases, the scan rate increases. The scan delay and
-hence the scan rate of every task is adaptive and depends on historical
-behaviour. If pages are properly placed then the scan delay increases,
-otherwise the scan delay decreases. The "scan size" is not adaptive but
-the higher the "scan size", the higher the scan rate.
+= =================================
+0 NUMA_BALANCING_DISABLED
+1 NUMA_BALANCING_NORMAL
+2 NUMA_BALANCING_MEMORY_TIERING
+= =================================
-Higher scan rates incur higher system overhead as page faults must be
-trapped and potentially data must be migrated. However, the higher the scan
-rate, the more quickly a tasks memory is migrated to a local node if the
-workload pattern changes and minimises performance impact due to remote
-memory accesses. These sysctls control the thresholds for scan delays and
-the number of pages scanned.
+Or NUMA_BALANCING_NORMAL to optimize page placement among different
+NUMA nodes to reduce remote accessing. On NUMA machines, there is a
+performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing
+memory by periodically unmapping pages and later trapping a page
+fault. At the time of the page fault, it is determined if the data
+being accessed should be migrated to a local memory node.
-``numa_balancing_scan_period_min_ms`` is the minimum time in milliseconds to
-scan a tasks virtual memory. It effectively controls the maximum scanning
-rate for each task.
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled.
-``numa_balancing_scan_delay_ms`` is the starting "scan delay" used for a task
-when it initially forks.
+Or NUMA_BALANCING_MEMORY_TIERING to optimize page placement among
+different types of memory (represented as different NUMA nodes) to
+place the hot pages in the fast memory. This is also implemented based on
+unmapping and page faults.
-``numa_balancing_scan_period_max_ms`` is the maximum time in milliseconds to
-scan a tasks virtual memory. It effectively controls the minimum scanning
-rate for each task.
+numa_balancing_promote_rate_limit_MBps
+======================================
-``numa_balancing_scan_size_mb`` is how many megabytes worth of pages are
-scanned for a given scan.
+Too high promotion/demotion throughput between different memory types
+may hurt application latency. This can be used to rate limit the
+promotion throughput. The per-node max promotion throughput in MB/s
+will be limited to be no more than the set value.
+A rule of thumb is to set this to less than 1/10 of the PMEM node
+write bandwidth.
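+
+For example, for a PMEM node with a write bandwidth of roughly 2000 MB/s (a
+number purely for illustration), the rule of thumb above would suggest
+something like::
+
+   sysctl -w kernel.numa_balancing_promote_rate_limit_MBps=200
+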
oops_all_cpu_backtrace
======================
@@ -795,6 +791,8 @@ bit 1 print system memory info
bit 2 print timer info
bit 3 print locks info if ``CONFIG_LOCKDEP`` is on
bit 4 print ftrace buffer
+bit 5 print all printk messages in buffer
+bit 6 print all CPUs backtrace (if available in the arch)
===== ============================================
So for example to print tasks and memory info on panic, user can::
@@ -813,6 +811,13 @@ is useful to define the root cause of RCU stalls using a vmcore.
1 panic() after printing RCU stall messages.
= ============================================================
+max_rcu_stall_to_panic
+======================
+
+When ``panic_on_rcu_stall`` is set to 1, this value determines the
+number of times that RCU can stall before panic() is called.
+
+When ``panic_on_rcu_stall`` is set to 0, this value has no effect.
perf_cpu_time_max_percent
=========================
@@ -1024,28 +1029,22 @@ This is a directory, with the following entries:
* ``boot_id``: a UUID generated the first time this is retrieved, and
unvarying after that;
+* ``uuid``: a UUID generated every time this is retrieved (this can
+ thus be used to generate UUIDs at will);
+
* ``entropy_avail``: the pool's entropy count, in bits;
* ``poolsize``: the entropy pool size, in bits;
* ``urandom_min_reseed_secs``: obsolete (used to determine the minimum
- number of seconds between urandom pool reseeding).
-
-* ``uuid``: a UUID generated every time this is retrieved (this can
- thus be used to generate UUIDs at will);
+ number of seconds between urandom pool reseeding). This file is
+ writable for compatibility purposes, but writing to it has no effect
+ on any RNG behavior;
* ``write_wakeup_threshold``: when the entropy count drops below this
(as a number of bits), processes waiting to write to ``/dev/random``
- are woken up.
-
-If ``drivers/char/random.c`` is built with ``ADD_INTERRUPT_BENCH``
-defined, these additional entries are present:
-
-* ``add_interrupt_avg_cycles``: the average number of cycles between
- interrupts used to feed the pool;
-
-* ``add_interrupt_avg_deviation``: the standard deviation seen on the
- number of cycles between interrupts used to feed the pool.
+ are woken up. This file is writable for compatibility purposes, but
+ writing to it has no effect on any RNG behavior.
randomize_va_space
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 4150f74c521a..6394f5dc2303 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -31,17 +31,18 @@ see only some of them, depending on your kernel's configuration.
Table : Subdirectories in /proc/sys/net
- ========= =================== = ========== ==================
+ ========= =================== = ========== ===================
Directory Content Directory Content
- ========= =================== = ========== ==================
- core General parameter appletalk Appletalk protocol
- unix Unix domain sockets netrom NET/ROM
- 802 E802 protocol ax25 AX25
- ethernet Ethernet protocol rose X.25 PLP layer
+ ========= =================== = ========== ===================
+ 802 E802 protocol mptcp Multipath TCP
+ appletalk Appletalk protocol netfilter Network Filter
+ ax25 AX25 netrom NET/ROM
+ bridge Bridging rose X.25 PLP layer
+ core General parameter tipc TIPC
+ ethernet Ethernet protocol unix Unix domain sockets
ipv4 IP version 4 x25 X.25 protocol
- bridge Bridging decnet DEC net
- ipv6 IP version 6 tipc TIPC
- ========= =================== = ========== ==================
+ ipv6 IP version 6
+ ========= =================== = ========== ===================
1. /proc/sys/net/core - Network core options
============================================
@@ -101,6 +102,9 @@ Values:
- 1 - enable JIT hardening for unprivileged users only
- 2 - enable JIT hardening for all users
+where "privileged user" in this context means a process having
+CAP_BPF or CAP_SYS_ADMIN in the root user name space.
+
bpf_jit_kallsyms
----------------
@@ -271,7 +275,7 @@ poll cycle or the number of packets processed reaches netdev_budget.
netdev_max_backlog
------------------
-Maximum number of packets, queued on the INPUT side, when the interface
+Maximum number of packets, queued on the INPUT side, when the interface
receives packets faster than kernel can process them.
netdev_rss_key
@@ -322,6 +326,14 @@ a leaked reference faster. A larger value may be useful to prevent false
warnings on slow/loaded systems.
Default value is 10, minimum 1, maximum 3600.
+skb_defer_max
+-------------
+
+Max size (in skbs) of the per-cpu list of skbs being freed
+by the CPU that allocated them. Used by the TCP stack so far.
+
+Default: 64
+
optmem_max
----------
@@ -365,6 +377,36 @@ new netns has been created.
Default : 0 (for compatibility reasons)
+txrehash
+--------
+
+Controls the default hash rethink behaviour on a listening socket when the
+SO_TXREHASH option is set to SOCK_TXREHASH_DEFAULT (i.e. not overridden by
+setsockopt).
+
+If set to 1 (default), hash rethink is performed on the listening socket.
+If set to 0, hash rethink is not performed.
+
+gro_normal_batch
+----------------
+
+Maximum number of segments to batch up on the output of GRO. When a packet
+exits GRO, either as a coalesced superframe or as an original packet which
+GRO has decided not to coalesce, it is placed on a per-NAPI list. This
+list is then passed to the stack when the number of segments reaches the
+gro_normal_batch limit.
+
+high_order_alloc_disable
+------------------------
+
+By default the allocator for page frags tries to use high order pages (order-3
+on x86). While the default behavior gives good results in most cases, some users
+might have hit contention in page allocations/freeing. This was especially
+true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
+lists. This allows opting in to order-0 allocations instead, but is now mostly of
+historical importance.
+
+Default: 0
+
2. /proc/sys/net/unix - Parameters for Unix domain sockets
----------------------------------------------------------
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 5e795202111f..988f6a4c8084 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -62,6 +62,7 @@ Currently, these files are in /proc/sys/vm:
- overcommit_memory
- overcommit_ratio
- page-cluster
+- page_lock_unfairness
- panic_on_oom
- percpu_pagelist_high_fraction
- stat_interval
@@ -561,6 +562,43 @@ Change the minimum size of the hugepage pool.
See Documentation/admin-guide/mm/hugetlbpage.rst
+hugetlb_optimize_vmemmap
+========================
+
+This knob is not available when the size of 'struct page' (a structure defined
+in include/linux/mm_types.h) is not a power of two (an unusual system config
+could result in this).
+
+Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO).
+
+Once enabled, the vmemmap pages of HugeTLB pages subsequently allocated from
+the buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095
+pages per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not
+be optimized. When those optimized HugeTLB pages are freed from the HugeTLB
+pool to the buddy allocator, the vmemmap pages representing that range need to
+be remapped again and the vmemmap pages discarded earlier need to be
+reallocated again. If your use case is that HugeTLB pages are allocated 'on
+the fly' (e.g. never explicitly allocating HugeTLB pages with 'nr_hugepages'
+but only setting 'nr_overcommit_hugepages', so those overcommitted HugeTLB
+pages are allocated 'on the fly') instead of being pulled from the HugeTLB
+pool, you should weigh the benefits of the memory savings against the
+increased overhead (~2x slower than before) of allocating or freeing HugeTLB
+pages between the HugeTLB pool and the buddy allocator. Another behavior to
+note is that heavy memory pressure could prevent the user from freeing HugeTLB
+pages from the HugeTLB pool to the buddy allocator, since the allocation of
+vmemmap pages could fail; you have to retry later if your system encounters
+this situation.
+
+Once disabled, the vmemmap pages of HugeTLB pages subsequently allocated from
+the buddy allocator will not be optimized, meaning the extra overhead at
+allocation time from the buddy allocator disappears, whereas already optimized
+HugeTLB pages will not be affected. If you want to make sure there are no
+optimized HugeTLB pages, you can set "nr_hugepages" to 0 first and then disable
+this. Note that
+writing 0 to nr_hugepages will make any "in use" HugeTLB pages become surplus
+pages. So, those surplus pages are still optimized until they are no longer
+in use. You would need to wait for those surplus pages to be released before
+there are no optimized pages in the system.
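+
+As a rough sketch, the disabling procedure described above could look like
+this::
+
+   sysctl -w vm.nr_hugepages=0              # empty the pool; in-use pages become surplus
+   # wait for the surplus pages to be released ...
+   sysctl -w vm.hugetlb_optimize_vmemmap=0  # pages allocated from now on stay unoptimized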
+
+
nr_hugepages_mempolicy
======================
@@ -720,7 +758,7 @@ and don't use much of it.
The default value is 0.
-See Documentation/vm/overcommit-accounting.rst and
+See Documentation/mm/overcommit-accounting.rst and
mm/util.c::__vm_enough_memory() for more information.
@@ -754,6 +792,14 @@ extra faults and I/O delays for following faults if they would have been part of
that consecutive pages readahead would have brought in.
+page_lock_unfairness
+====================
+
+This value determines the number of times that the page lock can be
+stolen from under a waiter. After the lock is stolen the number of times
+specified in this file (default is 5), the "fair lock handoff" semantics
+will apply, and the waiter will only be awakened if the lock can be taken.
+
panic_on_oom
============
@@ -880,6 +926,9 @@ calls without any restrictions.
The default value is 0.
+Another way to control permissions for userfaultfd is to use
+/dev/userfaultfd instead of userfaultfd(2). See
+Documentation/admin-guide/mm/userfaultfd.rst.
user_reserve_kbytes
===================
@@ -948,7 +997,7 @@ how much memory needs to be free before kswapd goes back to sleep.
The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
-node/system. The maximum value is 1000, or 10% of memory.
+node/system. The maximum value is 3000, or 30% of memory.
A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index ceeed7b0798d..92a8a07f5c43 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -100,6 +100,7 @@ Bit Log Number Reason that got the kernel tainted
15 _/K 32768 kernel has been live patched
16 _/X 65536 auxiliary taint, defined for and used by distros
17 _/T 131072 kernel was built with the struct randomization plugin
+ 18 _/N 262144 an in-kernel test has been run
=== === ====== ========================================================
Note: The character ``_`` is representing a blank in this table to make reading
@@ -133,6 +134,12 @@ More detailed explanation for tainting
scsi/snic on something else than x86_64, scsi/ips on non
x86/x86_64/itanium, have broken firmware settings for the
irqchip/irq-gic on arm64 ...).
+ - x86/x86_64: Microcode late loading is dangerous and will result in
+ tainting the kernel. It requires that all CPUs rendezvous to make sure
+ the update happens when the system is as quiescent as possible. However,
+ a higher priority MCE/SMI/NMI can move control flow away from that
+ rendezvous and interrupt the update, which can be detrimental to the
+ machine.
3) ``R`` if a module was force unloaded by ``rmmod -f``, ``' '`` if all
modules were unloaded normally.