Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r--  Documentation/admin-guide/LSM/LoadPin.rst | 6
-rw-r--r--  Documentation/admin-guide/LSM/SafeSetID.rst | 29
-rw-r--r--  Documentation/admin-guide/LSM/Yama.rst | 7
-rw-r--r--  Documentation/admin-guide/RAS/address-translation.rst | 24
-rw-r--r--  Documentation/admin-guide/RAS/error-decoding.rst | 21
-rw-r--r--  Documentation/admin-guide/RAS/index.rst | 7
-rw-r--r--  Documentation/admin-guide/RAS/main.rst (renamed from Documentation/admin-guide/ras.rst) | 12
-rw-r--r--  Documentation/admin-guide/README.rst | 199
-rw-r--r--  Documentation/admin-guide/abi-obsolete.rst | 11
-rw-r--r--  Documentation/admin-guide/abi-removed.rst | 5
-rw-r--r--  Documentation/admin-guide/abi-stable.rst | 14
-rw-r--r--  Documentation/admin-guide/abi-testing.rst | 20
-rw-r--r--  Documentation/admin-guide/abi.rst | 11
-rw-r--r--  Documentation/admin-guide/acpi/cppc_sysfs.rst | 8
-rw-r--r--  Documentation/admin-guide/acpi/dsdt-override.rst | 13
-rw-r--r--  Documentation/admin-guide/acpi/fan_performance_states.rst | 28
-rw-r--r--  Documentation/admin-guide/acpi/index.rst | 1
-rw-r--r--  Documentation/admin-guide/acpi/ssdt-overlays.rst | 49
-rw-r--r--  Documentation/admin-guide/auxdisplay/cfag12864b.rst | 2
-rw-r--r--  Documentation/admin-guide/auxdisplay/ks0108.rst | 2
-rw-r--r--  Documentation/admin-guide/bcache.rst | 36
-rw-r--r--  Documentation/admin-guide/binderfs.rst | 15
-rw-r--r--  Documentation/admin-guide/binfmt-misc.rst | 4
-rw-r--r--  Documentation/admin-guide/blockdev/drbd/figures.rst | 4
-rw-r--r--  Documentation/admin-guide/blockdev/drbd/index.rst | 2
-rw-r--r--  Documentation/admin-guide/blockdev/drbd/peer-states-8.dot (renamed from Documentation/admin-guide/blockdev/drbd/node-states-8.dot) | 5
-rw-r--r--  Documentation/admin-guide/blockdev/floppy.rst | 6
-rw-r--r--  Documentation/admin-guide/blockdev/index.rst | 6
-rw-r--r--  Documentation/admin-guide/blockdev/nbd.rst | 2
-rw-r--r--  Documentation/admin-guide/blockdev/paride.rst | 388
-rw-r--r--  Documentation/admin-guide/blockdev/ramdisk.rst | 66
-rw-r--r--  Documentation/admin-guide/blockdev/zram.rst | 137
-rw-r--r--  Documentation/admin-guide/bootconfig.rst | 129
-rw-r--r--  Documentation/admin-guide/bug-bisect.rst | 2
-rw-r--r--  Documentation/admin-guide/bug-hunting.rst | 2
-rw-r--r--  Documentation/admin-guide/cgroup-v1/blkio-controller.rst | 155
-rw-r--r--  Documentation/admin-guide/cgroup-v1/cgroups.rst | 2
-rw-r--r--  Documentation/admin-guide/cgroup-v1/cpusets.rst | 6
-rw-r--r--  Documentation/admin-guide/cgroup-v1/hugetlb.rst | 24
-rw-r--r--  Documentation/admin-guide/cgroup-v1/index.rst | 1
-rw-r--r--  Documentation/admin-guide/cgroup-v1/memcg_test.rst | 27
-rw-r--r--  Documentation/admin-guide/cgroup-v1/memory.rst | 401
-rw-r--r--  Documentation/admin-guide/cgroup-v1/misc.rst | 4
-rw-r--r--  Documentation/admin-guide/cgroup-v1/rdma.rst | 2
-rw-r--r--  Documentation/admin-guide/cgroup-v2.rst | 816
-rw-r--r--  Documentation/admin-guide/cifs/authors.rst | 6
-rw-r--r--  Documentation/admin-guide/cifs/changes.rst | 5
-rw-r--r--  Documentation/admin-guide/cifs/introduction.rst | 30
-rw-r--r--  Documentation/admin-guide/cifs/todo.rst | 66
-rw-r--r--  Documentation/admin-guide/cifs/usage.rst | 47
-rwxr-xr-x  Documentation/admin-guide/cifs/winucase_convert.pl | 2
-rw-r--r--  Documentation/admin-guide/cpu-load.rst | 65
-rw-r--r--  Documentation/admin-guide/cputopology.rst | 124
-rw-r--r--  Documentation/admin-guide/dell_rbu.rst | 2
-rw-r--r--  Documentation/admin-guide/device-mapper/cache-policies.rst | 2
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-crypt.rst | 14
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-dust.rst | 32
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-ebs.rst | 2
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-flakey.rst | 14
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-ima.rst | 715
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-init.rst | 8
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-integrity.rst | 78
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-raid.rst | 4
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-zoned.rst | 10
-rw-r--r--  Documentation/admin-guide/device-mapper/index.rst | 4
-rw-r--r--  Documentation/admin-guide/device-mapper/unstriped.rst | 10
-rw-r--r--  Documentation/admin-guide/device-mapper/vdo-design.rst | 633
-rw-r--r--  Documentation/admin-guide/device-mapper/vdo.rst | 406
-rw-r--r--  Documentation/admin-guide/device-mapper/verity.rst | 17
-rw-r--r--  Documentation/admin-guide/device-mapper/writecache.rst | 45
-rw-r--r--  Documentation/admin-guide/devices.rst | 7
-rw-r--r--  Documentation/admin-guide/devices.txt | 65
-rw-r--r--  Documentation/admin-guide/dynamic-debug-howto.rst | 288
-rw-r--r--  Documentation/admin-guide/edid.rst | 35
-rw-r--r--  Documentation/admin-guide/efi-stub.rst | 6
-rw-r--r--  Documentation/admin-guide/ext4.rst | 33
-rw-r--r--  Documentation/admin-guide/features.rst | 3
-rw-r--r--  Documentation/admin-guide/filesystem-monitoring.rst | 78
-rw-r--r--  Documentation/admin-guide/gpio/gpio-mockup.rst | 59
-rw-r--r--  Documentation/admin-guide/gpio/gpio-sim.rst | 134
-rw-r--r--  Documentation/admin-guide/gpio/index.rst | 6
-rw-r--r--  Documentation/admin-guide/gpio/obsolete.rst | 13
-rw-r--r--  Documentation/admin-guide/gpio/sysfs.rst | 167
-rw-r--r--  Documentation/admin-guide/hw-vuln/core-scheduling.rst | 226
-rw-r--r--  Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst | 91
-rw-r--r--  Documentation/admin-guide/hw-vuln/gather_data_sampling.rst | 109
-rw-r--r--  Documentation/admin-guide/hw-vuln/index.rst | 11
-rw-r--r--  Documentation/admin-guide/hw-vuln/l1d_flush.rst | 69
-rw-r--r--  Documentation/admin-guide/hw-vuln/mds.rst | 40
-rw-r--r--  Documentation/admin-guide/hw-vuln/multihit.rst | 4
-rw-r--r--  Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst | 271
-rw-r--r--  Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst | 104
-rw-r--r--  Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst | 9
-rw-r--r--  Documentation/admin-guide/hw-vuln/spectre.rst | 152
-rw-r--r--  Documentation/admin-guide/hw-vuln/srso.rst | 160
-rw-r--r--  Documentation/admin-guide/hw-vuln/tsx_async_abort.rst | 37
-rw-r--r--  Documentation/admin-guide/hw_random.rst | 11
-rw-r--r--  Documentation/admin-guide/index.rst | 29
-rw-r--r--  Documentation/admin-guide/iostats.rst | 6
-rw-r--r--  Documentation/admin-guide/kdump/gdbmacros.txt | 159
-rw-r--r--  Documentation/admin-guide/kdump/kdump.rst | 223
-rw-r--r--  Documentation/admin-guide/kdump/vmcoreinfo.rst | 256
-rw-r--r--  Documentation/admin-guide/kernel-parameters.rst | 68
-rw-r--r--  Documentation/admin-guide/kernel-parameters.txt | 3672
-rw-r--r--  Documentation/admin-guide/kernel-per-CPU-kthreads.rst | 42
-rw-r--r--  Documentation/admin-guide/laptops/disk-shock-protection.rst | 2
-rw-r--r--  Documentation/admin-guide/laptops/laptop-mode.rst | 11
-rw-r--r--  Documentation/admin-guide/laptops/lg-laptop.rst | 6
-rw-r--r--  Documentation/admin-guide/laptops/sonypi.rst | 2
-rw-r--r--  Documentation/admin-guide/laptops/thinkpad-acpi.rst | 122
-rw-r--r--  Documentation/admin-guide/lockup-watchdogs.rst | 4
-rw-r--r--  Documentation/admin-guide/md.rst | 8
-rw-r--r--  Documentation/admin-guide/media/bt8xx.rst | 15
-rw-r--r--  Documentation/admin-guide/media/bttv.rst | 23
-rw-r--r--  Documentation/admin-guide/media/building.rst | 6
-rw-r--r--  Documentation/admin-guide/media/cec-drivers.rst | 10
-rw-r--r--  Documentation/admin-guide/media/cec.rst | 380
-rw-r--r--  Documentation/admin-guide/media/cpia2.rst | 145
-rw-r--r--  Documentation/admin-guide/media/davinci-vpbe.rst | 65
-rw-r--r--  Documentation/admin-guide/media/dvb-drivers.rst | 1
-rw-r--r--  Documentation/admin-guide/media/dvb-usb-dvbsky-cardlist.rst | 8
-rw-r--r--  Documentation/admin-guide/media/dvb-usb-dw2102-cardlist.rst | 4
-rw-r--r--  Documentation/admin-guide/media/dvb_references.rst | 2
-rw-r--r--  Documentation/admin-guide/media/em28xx-cardlist.rst | 4
-rw-r--r--  Documentation/admin-guide/media/fimc.rst | 8
-rw-r--r--  Documentation/admin-guide/media/frontend-cardlist.rst | 4
-rw-r--r--  Documentation/admin-guide/media/gspca-cardlist.rst | 2
-rw-r--r--  Documentation/admin-guide/media/i2c-cardlist.rst | 18
-rw-r--r--  Documentation/admin-guide/media/imx7.rst | 62
-rw-r--r--  Documentation/admin-guide/media/index.rst | 21
-rw-r--r--  Documentation/admin-guide/media/ipu3.rst | 197
-rw-r--r--  Documentation/admin-guide/media/ivtv.rst | 2
-rw-r--r--  Documentation/admin-guide/media/meye.rst | 93
-rw-r--r--  Documentation/admin-guide/media/mgb4.rst | 374
-rw-r--r--  Documentation/admin-guide/media/omap3isp.rst | 2
-rw-r--r--  Documentation/admin-guide/media/omap4_camera.rst | 2
-rw-r--r--  Documentation/admin-guide/media/other-usb-cardlist.rst | 14
-rw-r--r--  Documentation/admin-guide/media/pci-cardlist.rst | 4
-rw-r--r--  Documentation/admin-guide/media/platform-cardlist.rst | 3
-rw-r--r--  Documentation/admin-guide/media/pulse8-cec.rst | 13
-rw-r--r--  Documentation/admin-guide/media/qcom_camss.rst | 6
-rw-r--r--  Documentation/admin-guide/media/remote-controller.rst | 2
-rw-r--r--  Documentation/admin-guide/media/rkisp1.dot | 18
-rw-r--r--  Documentation/admin-guide/media/rkisp1.rst | 197
-rw-r--r--  Documentation/admin-guide/media/saa7134.rst | 3
-rw-r--r--  Documentation/admin-guide/media/si476x.rst | 2
-rw-r--r--  Documentation/admin-guide/media/siano-cardlist.rst | 2
-rw-r--r--  Documentation/admin-guide/media/starfive_camss.rst | 72
-rw-r--r--  Documentation/admin-guide/media/starfive_camss_graph.dot | 12
-rw-r--r--  Documentation/admin-guide/media/tm6000-cardlist.rst | 83
-rw-r--r--  Documentation/admin-guide/media/usb-cardlist.rst | 8
-rw-r--r--  Documentation/admin-guide/media/usbvision-cardlist.rst | 283
-rw-r--r--  Documentation/admin-guide/media/v4l-drivers.rst | 7
-rw-r--r--  Documentation/admin-guide/media/vimc.dot | 18
-rw-r--r--  Documentation/admin-guide/media/vimc.rst | 46
-rw-r--r--  Documentation/admin-guide/media/visl.rst | 185
-rw-r--r--  Documentation/admin-guide/media/vivid.rst | 31
-rw-r--r--  Documentation/admin-guide/media/zoran-cardlist.rst | 51
-rw-r--r--  Documentation/admin-guide/media/zr364xx.rst | 102
-rw-r--r--  Documentation/admin-guide/mm/cma_debugfs.rst | 10
-rw-r--r--  Documentation/admin-guide/mm/concepts.rst | 17
-rw-r--r--  Documentation/admin-guide/mm/damon/index.rst | 17
-rw-r--r--  Documentation/admin-guide/mm/damon/lru_sort.rst | 294
-rw-r--r--  Documentation/admin-guide/mm/damon/reclaim.rst | 301
-rw-r--r--  Documentation/admin-guide/mm/damon/start.rst | 127
-rw-r--r--  Documentation/admin-guide/mm/damon/usage.rst | 901
-rw-r--r--  Documentation/admin-guide/mm/hugetlbpage.rst | 88
-rw-r--r--  Documentation/admin-guide/mm/idle_page_tracking.rst | 9
-rw-r--r--  Documentation/admin-guide/mm/index.rst | 13
-rw-r--r--  Documentation/admin-guide/mm/ksm.rst | 144
-rw-r--r--  Documentation/admin-guide/mm/memory-hotplug.rst | 909
-rw-r--r--  Documentation/admin-guide/mm/multigen_lru.rst | 162
-rw-r--r--  Documentation/admin-guide/mm/nommu-mmap.rst | 283
-rw-r--r--  Documentation/admin-guide/mm/numa_memory_policy.rst | 46
-rw-r--r--  Documentation/admin-guide/mm/numaperf.rst | 22
-rw-r--r--  Documentation/admin-guide/mm/pagemap.rst | 186
-rw-r--r--  Documentation/admin-guide/mm/shrinker_debugfs.rst | 133
-rw-r--r--  Documentation/admin-guide/mm/soft-dirty.rst | 2
-rw-r--r--  Documentation/admin-guide/mm/swap_numa.rst | 78
-rw-r--r--  Documentation/admin-guide/mm/transhuge.rst | 135
-rw-r--r--  Documentation/admin-guide/mm/userfaultfd.rst | 229
-rw-r--r--  Documentation/admin-guide/mm/zswap.rst | 178
-rw-r--r--  Documentation/admin-guide/module-signing.rst | 21
-rw-r--r--  Documentation/admin-guide/nfs/fault_injection.rst | 70
-rw-r--r--  Documentation/admin-guide/nfs/index.rst | 1
-rw-r--r--  Documentation/admin-guide/nfs/nfs-client.rst | 19
-rw-r--r--  Documentation/admin-guide/nfs/nfs-rdma.rst | 2
-rw-r--r--  Documentation/admin-guide/nfs/nfsroot.rst | 6
-rw-r--r--  Documentation/admin-guide/nfs/pnfs-block-server.rst | 2
-rw-r--r--  Documentation/admin-guide/nfs/pnfs-scsi-server.rst | 2
-rw-r--r--  Documentation/admin-guide/perf-security.rst | 83
-rw-r--r--  Documentation/admin-guide/perf/alibaba_pmu.rst | 105
-rw-r--r--  Documentation/admin-guide/perf/ampere_cspmu.rst | 29
-rw-r--r--  Documentation/admin-guide/perf/arm-ccn.rst | 2
-rw-r--r--  Documentation/admin-guide/perf/arm-cmn.rst | 65
-rw-r--r--  Documentation/admin-guide/perf/cxl.rst | 68
-rw-r--r--  Documentation/admin-guide/perf/dwc_pcie_pmu.rst | 94
-rw-r--r--  Documentation/admin-guide/perf/hisi-pcie-pmu.rst | 146
-rw-r--r--  Documentation/admin-guide/perf/hisi-pmu.rst | 66
-rw-r--r--  Documentation/admin-guide/perf/hns3-pmu.rst | 136
-rw-r--r--  Documentation/admin-guide/perf/imx-ddr.rst | 47
-rw-r--r--  Documentation/admin-guide/perf/index.rst | 10
-rw-r--r--  Documentation/admin-guide/perf/meson-ddr-pmu.rst | 70
-rw-r--r--  Documentation/admin-guide/perf/nvidia-pmu.rst | 299
-rw-r--r--  Documentation/admin-guide/perf/starfive_starlink_pmu.rst | 46
-rw-r--r--  Documentation/admin-guide/pm/amd-pstate.rst | 775
-rw-r--r--  Documentation/admin-guide/pm/cpufreq.rst | 17
-rw-r--r--  Documentation/admin-guide/pm/cpuidle.rst | 107
-rw-r--r--  Documentation/admin-guide/pm/intel-speed-select.rst | 30
-rw-r--r--  Documentation/admin-guide/pm/intel_idle.rst | 33
-rw-r--r--  Documentation/admin-guide/pm/intel_pstate.rst | 115
-rw-r--r--  Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst | 115
-rw-r--r--  Documentation/admin-guide/pm/working-state.rst | 2
-rw-r--r--  Documentation/admin-guide/pmf.rst | 24
-rw-r--r--  Documentation/admin-guide/pnp.rst | 4
-rw-r--r--  Documentation/admin-guide/pstore-blk.rst | 37
-rw-r--r--  Documentation/admin-guide/quickly-build-trimmed-linux.rst | 1097
-rw-r--r--  Documentation/admin-guide/ramoops.rst | 8
-rw-r--r--  Documentation/admin-guide/reporting-bugs.rst | 182
-rw-r--r--  Documentation/admin-guide/reporting-issues.rst | 1764
-rw-r--r--  Documentation/admin-guide/reporting-regressions.rst | 451
-rw-r--r--  Documentation/admin-guide/security-bugs.rst | 89
-rw-r--r--  Documentation/admin-guide/serial-console.rst | 36
-rw-r--r--  Documentation/admin-guide/spkguide.txt | 1625
-rw-r--r--  Documentation/admin-guide/svga.rst | 7
-rw-r--r--  Documentation/admin-guide/syscall-user-dispatch.rst | 94
-rw-r--r--  Documentation/admin-guide/sysctl/abi.rst | 73
-rw-r--r--  Documentation/admin-guide/sysctl/fs.rst | 254
-rw-r--r--  Documentation/admin-guide/sysctl/kernel.rst | 497
-rw-r--r--  Documentation/admin-guide/sysctl/net.rst | 111
-rw-r--r--  Documentation/admin-guide/sysctl/vm.rst | 160
-rw-r--r--  Documentation/admin-guide/sysrq.rst | 31
-rw-r--r--  Documentation/admin-guide/tainted-kernels.rst | 40
-rw-r--r--  Documentation/admin-guide/thermal/index.rst | 8
-rw-r--r--  Documentation/admin-guide/thermal/intel_powerclamp.rst | 345
-rw-r--r--  Documentation/admin-guide/thunderbolt.rst | 63
-rw-r--r--  Documentation/admin-guide/unicode.rst | 9
-rw-r--r--  Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst | 1985
-rw-r--r--  Documentation/admin-guide/wimax/i2400m.rst | 283
-rw-r--r--  Documentation/admin-guide/wimax/index.rst | 19
-rw-r--r--  Documentation/admin-guide/wimax/wimax.rst | 89
-rw-r--r--  Documentation/admin-guide/workload-tracing.rst | 606
-rw-r--r--  Documentation/admin-guide/xfs.rst | 91
243 files changed, 25314 insertions, 6095 deletions
diff --git a/Documentation/admin-guide/LSM/LoadPin.rst b/Documentation/admin-guide/LSM/LoadPin.rst
index 716ad9b23c9a..dd3ca68b5df1 100644
--- a/Documentation/admin-guide/LSM/LoadPin.rst
+++ b/Documentation/admin-guide/LSM/LoadPin.rst
@@ -11,8 +11,8 @@ restrictions without needing to sign the files individually.
The LSM is selectable at build-time with ``CONFIG_SECURITY_LOADPIN``, and
can be controlled at boot-time with the kernel command line option
-"``loadpin.enabled``". By default, it is enabled, but can be disabled at
-boot ("``loadpin.enabled=0``").
+"``loadpin.enforce``". By default, it is enabled, but can be disabled at
+boot ("``loadpin.enforce=0``").
LoadPin starts pinning when it sees the first file loaded. If the
block device backing the filesystem is not read-only, a sysctl is
@@ -28,4 +28,4 @@ different mechanisms such as ``CONFIG_MODULE_SIG`` and
``CONFIG_KEXEC_VERIFY_SIG`` to verify kernel module and kernel image while
still use LoadPin to protect the integrity of other files kernel loads. The
full list of valid file types can be found in ``kernel_read_file_str``
-defined in ``include/linux/fs.h``.
+defined in ``include/linux/kernel_read_file.h``.
diff --git a/Documentation/admin-guide/LSM/SafeSetID.rst b/Documentation/admin-guide/LSM/SafeSetID.rst
index 7bff07ce4fdd..0ec34863c674 100644
--- a/Documentation/admin-guide/LSM/SafeSetID.rst
+++ b/Documentation/admin-guide/LSM/SafeSetID.rst
@@ -3,9 +3,9 @@ SafeSetID
=========
SafeSetID is an LSM module that gates the setid family of syscalls to restrict
UID/GID transitions from a given UID/GID to only those approved by a
-system-wide whitelist. These restrictions also prohibit the given UIDs/GIDs
+system-wide allowlist. These restrictions also prohibit the given UIDs/GIDs
from obtaining auxiliary privileges associated with CAP_SET{U/G}ID, such as
-allowing a user to set up user namespace UID mappings.
+allowing a user to set up user namespace UID/GID mappings.
Background
@@ -98,10 +98,21 @@ Directions for use
==================
This LSM hooks the setid syscalls to make sure transitions are allowed if an
applicable restriction policy is in place. Policies are configured through
-securityfs by writing to the safesetid/add_whitelist_policy and
-safesetid/flush_whitelist_policies files at the location where securityfs is
-mounted. The format for adding a policy is '<UID>:<UID>', using literal
-numbers, such as '123:456'. To flush the policies, any write to the file is
-sufficient. Again, configuring a policy for a UID will prevent that UID from
-obtaining auxiliary setid privileges, such as allowing a user to set up user
-namespace UID mappings.
+securityfs by writing to the safesetid/uid_allowlist_policy and
+safesetid/gid_allowlist_policy files at the location where securityfs is
+mounted. The format for adding a policy is '<UID>:<UID>' or '<GID>:<GID>',
+using literal numbers, and ending with a newline character such as '123:456\n'.
+Writing an empty string "" will flush the policy. Again, configuring a policy
+for a UID/GID will prevent that UID/GID from obtaining auxiliary setid
+privileges, such as allowing a user to set up user namespace UID/GID mappings.
+
+Note on GID policies and setgroups()
+====================================
+In v5.9 we are adding support for limiting CAP_SETGID privileges as was done
+previously for CAP_SETUID. However, for compatibility with common sandboxing
+related code conventions in userspace, we currently allow arbitrary
+setgroups() calls for processes with CAP_SETGID restrictions. Until we add
+support in a future release for restricting setgroups() calls, these GID
+policies add no meaningful security. setgroups() restrictions will be enforced
+once we have the policy checking code in place, which will rely on GID policy
+configuration code added in v5.9.
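The securityfs files named in the hunk above can be driven from a shell. A
minimal sketch, assuming securityfs is mounted at the conventional
/sys/kernel/security path and using the illustrative IDs 123 and 456 from the
text::

    # Allow UID 123 to transition to UID 456 (note the trailing newline).
    printf '123:456\n' > /sys/kernel/security/safesetid/uid_allowlist_policy

    # GID transitions use the analogous file.
    printf '123:456\n' > /sys/kernel/security/safesetid/gid_allowlist_policy

    # Per the text above, writing an empty string flushes the configured policy.
    printf '' > /sys/kernel/security/safesetid/uid_allowlist_policy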
diff --git a/Documentation/admin-guide/LSM/Yama.rst b/Documentation/admin-guide/LSM/Yama.rst
index d0a060de3973..d9cd937ebd2d 100644
--- a/Documentation/admin-guide/LSM/Yama.rst
+++ b/Documentation/admin-guide/LSM/Yama.rst
@@ -19,9 +19,10 @@ attach to other running processes (e.g. Firefox, SSH sessions, GPG agent,
etc) to extract additional credentials and continue to expand the scope
of their attack without resorting to user-assisted phishing.
-This is not a theoretical problem. SSH session hijacking
-(http://www.storm.net.nz/projects/7) and arbitrary code injection
-(http://c-skills.blogspot.com/2007/05/injectso.html) attacks already
+This is not a theoretical problem. `SSH session hijacking
+<https://www.blackhat.com/presentations/bh-usa-05/bh-us-05-boileau.pdf>`_
+and `arbitrary code injection
+<https://c-skills.blogspot.com/2007/05/injectso.html>`_ attacks already
exist and remain possible if ptrace is allowed to operate as before.
Since ptrace is not commonly used by non-developers and non-admins, system
builders should be allowed the option to disable this debugging system.
diff --git a/Documentation/admin-guide/RAS/address-translation.rst b/Documentation/admin-guide/RAS/address-translation.rst
new file mode 100644
index 000000000000..f0ca17b43cd3
--- /dev/null
+++ b/Documentation/admin-guide/RAS/address-translation.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Address translation
+===================
+
+x86 AMD
+-------
+
+Zen-based AMD systems include a Data Fabric that manages the layout of
+physical memory. Devices attached to the Fabric, like memory controllers,
+I/O, etc., may not have a complete view of the system physical memory map.
+These devices may provide a "normalized", i.e. device physical, address
+when reporting memory errors. Normalized addresses must be translated to
+a system physical address for the kernel to action on the memory.
+
+AMD Address Translation Library (CONFIG_AMD_ATL) provides translation for
+this case.
+
+Glossary of acronyms used in address translation for Zen-based systems
+
+* CCM = Cache Coherent Moderator
+* COD = Cluster-on-Die
+* COH_ST = Coherent Station
+* DF = Data Fabric
diff --git a/Documentation/admin-guide/RAS/error-decoding.rst b/Documentation/admin-guide/RAS/error-decoding.rst
new file mode 100644
index 000000000000..26a72f3fe5de
--- /dev/null
+++ b/Documentation/admin-guide/RAS/error-decoding.rst
@@ -0,0 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Error decoding
+==============
+
+x86
+---
+
+Error decoding on AMD systems should be done using the rasdaemon tool:
+https://github.com/mchehab/rasdaemon/
+
+While the daemon is running, it would automatically log and decode
+errors. If not, one can still decode such errors by supplying the
+hardware information from the error::
+
+ $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca
+
+Also, the user can pass particular family and model to decode the error
+string::
+
+ $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM>
diff --git a/Documentation/admin-guide/RAS/index.rst b/Documentation/admin-guide/RAS/index.rst
new file mode 100644
index 000000000000..f4087040a7c0
--- /dev/null
+++ b/Documentation/admin-guide/RAS/index.rst
@@ -0,0 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. toctree::
+ :maxdepth: 2
+
+ main
+ error-decoding
+ address-translation
diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/RAS/main.rst
index 7b481b2a368e..7ac1d4ccc509 100644
--- a/Documentation/admin-guide/ras.rst
+++ b/Documentation/admin-guide/RAS/main.rst
@@ -1,8 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
-============================================
-Reliability, Availability and Serviceability
-============================================
+==================================================
+Reliability, Availability and Serviceability (RAS)
+==================================================
+
+This documents different aspects of the RAS functionality present in the
+kernel.
RAS concepts
************
@@ -199,7 +203,7 @@ Architecture (MCA)\ [#f3]_.
mode).
.. [#f3] For more details about the Machine Check Architecture (MCA),
- please read Documentation/x86/x86_64/machinecheck.rst at the Kernel tree.
+ please read Documentation/arch/x86/x86_64/machinecheck.rst at the Kernel tree.
EDAC - Error Detection And Correction
*************************************
diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst
index 5fb526900023..f2bebff6a733 100644
--- a/Documentation/admin-guide/README.rst
+++ b/Documentation/admin-guide/README.rst
@@ -1,9 +1,9 @@
.. _readme:
-Linux kernel release 5.x <http://kernel.org/>
+Linux kernel release 6.x <http://kernel.org/>
=============================================
-These are the release notes for Linux version 5. Read them carefully,
+These are the release notes for Linux version 6. Read them carefully,
as they tell you what this is all about, explain how to install the
kernel, and what to do if something goes wrong.
@@ -63,7 +63,7 @@ Installing the kernel source
directory where you have permissions (e.g. your home directory) and
unpack it::
- xz -cd linux-5.x.tar.xz | tar xvf -
+ xz -cd linux-6.x.tar.xz | tar xvf -
Replace "X" with the version number of the latest kernel.
@@ -72,12 +72,12 @@ Installing the kernel source
files. They should match the library, and not get messed up by
whatever the kernel-du-jour happens to be.
- - You can also upgrade between 5.x releases by patching. Patches are
+ - You can also upgrade between 6.x releases by patching. Patches are
distributed in the xz format. To install by patching, get all the
newer patch files, enter the top level directory of the kernel source
- (linux-5.x) and execute::
+ (linux-6.x) and execute::
- xz -cd ../patch-5.x.xz | patch -p1
+ xz -cd ../patch-6.x.xz | patch -p1
Replace "x" for all versions bigger than the version "x" of your current
source tree, **in_order**, and you should be ok. You may want to remove
@@ -85,13 +85,13 @@ Installing the kernel source
that there are no failed patches (some-file-name# or some-file-name.rej).
If there are, either you or I have made a mistake.
- Unlike patches for the 5.x kernels, patches for the 5.x.y kernels
+ Unlike patches for the 6.x kernels, patches for the 6.x.y kernels
(also known as the -stable kernels) are not incremental but instead apply
- directly to the base 5.x kernel. For example, if your base kernel is 5.0
- and you want to apply the 5.0.3 patch, you must not first apply the 5.0.1
- and 5.0.2 patches. Similarly, if you are running kernel version 5.0.2 and
- want to jump to 5.0.3, you must first reverse the 5.0.2 patch (that is,
- patch -R) **before** applying the 5.0.3 patch. You can read more on this in
+ directly to the base 6.x kernel. For example, if your base kernel is 6.0
+ and you want to apply the 6.0.3 patch, you must not first apply the 6.0.1
+ and 6.0.2 patches. Similarly, if you are running kernel version 6.0.2 and
+ want to jump to 6.0.3, you must first reverse the 6.0.2 patch (that is,
+ patch -R) **before** applying the 6.0.3 patch. You can read more on this in
:ref:`Documentation/process/applying-patches.rst <applying_patches>`.
Alternatively, the script patch-kernel can be used to automate this
@@ -114,7 +114,7 @@ Installing the kernel source
Software requirements
---------------------
- Compiling and running the 5.x kernels requires up-to-date
+ Compiling and running the 6.x kernels requires up-to-date
versions of various software packages. Consult
:ref:`Documentation/process/changes.rst <changes>` for the minimum version numbers
required and how to get updates for these packages. Beware that using
@@ -132,12 +132,12 @@ Build directory for the kernel
place for the output files (including .config).
Example::
- kernel source code: /usr/src/linux-5.x
+ kernel source code: /usr/src/linux-6.x
build directory: /home/name/build/kernel
To configure and build the kernel, use::
- cd /usr/src/linux-5.x
+ cd /usr/src/linux-6.x
make O=/home/name/build/kernel menuconfig
make O=/home/name/build/kernel
sudo make O=/home/name/build/kernel modules_install install
@@ -226,10 +226,11 @@ Configuring the kernel
all module options to built in (=y) options. You can
also preserve modules by LMC_KEEP.
- "make kvmconfig" Enable additional options for kvm guest kernel support.
+ "make kvm_guest.config" Enable additional options for kvm guest kernel
+ support.
- "make xenconfig" Enable additional options for xen dom0 guest kernel
- support.
+ "make xen.config" Enable additional options for xen dom0 guest kernel
+ support.
"make tinyconfig" Configure the tiniest possible kernel.
@@ -258,14 +259,14 @@ Configuring the kernel
Compiling the kernel
--------------------
- - Make sure you have at least gcc 4.6 available.
+ - Make sure you have at least gcc 5.1 available.
For more information, refer to :ref:`Documentation/process/changes.rst <changes>`.
- Please note that you can still run a.out user programs with this kernel.
-
- - Do a ``make`` to create a compressed kernel image. It is also
- possible to do ``make install`` if you have lilo installed to suit the
- kernel makefiles, but you may want to check your particular lilo setup first.
+ - Do a ``make`` to create a compressed kernel image. It is also possible to do
+ ``make install`` if you have lilo installed or if your distribution has an
+ install script recognised by the kernel's installer. Most popular
+ distributions will have a recognized install script. You may want to
+ check your distribution's setup first.
To do the actual install, you have to be root, but none of the normal
build should require that. Don't take the name of root in vain.
@@ -302,114 +303,58 @@ Compiling the kernel
image (e.g. .../linux/arch/x86/boot/bzImage after compilation)
to the place where your regular bootable kernel is found.
- - Booting a kernel directly from a floppy without the assistance of a
- bootloader such as LILO, is no longer supported.
-
- If you boot Linux from the hard drive, chances are you use LILO, which
- uses the kernel image as specified in the file /etc/lilo.conf. The
- kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or
- /boot/bzImage. To use the new kernel, save a copy of the old image
- and copy the new image over the old one. Then, you MUST RERUN LILO
- to update the loading map! If you don't, you won't be able to boot
- the new kernel image.
-
- Reinstalling LILO is usually a matter of running /sbin/lilo.
- You may wish to edit /etc/lilo.conf to specify an entry for your
- old kernel image (say, /vmlinux.old) in case the new one does not
- work. See the LILO docs for more information.
-
- After reinstalling LILO, you should be all set. Shutdown the system,
+ - Booting a kernel directly from a storage device without the assistance
+ of a bootloader such as LILO or GRUB, is no longer supported in BIOS
+ (non-EFI systems). On UEFI/EFI systems, however, you can use EFISTUB
+ which allows the motherboard to boot directly to the kernel.
+ On modern workstations and desktops, it's generally recommended to use a
+ bootloader as difficulties can arise with multiple kernels and secure boot.
+ For more details on EFISTUB,
+ see "Documentation/admin-guide/efi-stub.rst".
+
+ - It's important to note that as of 2016 LILO (LInux LOader) is no longer in
+ active development, though as it was extremely popular, it often comes up
+ in documentation. Popular alternatives include GRUB2, rEFInd, Syslinux,
+ systemd-boot, or EFISTUB. For various reasons, it's not recommended to use
+ software that's no longer in active development.
+
+ - Chances are your distribution includes an install script and running
+ ``make install`` will be all that's needed. Should that not be the case
+ you'll have to identify your bootloader and reference its documentation or
+ configure your EFI.
+
+Legacy LILO Instructions
+------------------------
+
+
+ - If you use LILO the kernel images are specified in the file /etc/lilo.conf.
+ The kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or
+ /boot/bzImage. To use the new kernel, save a copy of the old image and copy
+ the new image over the old one. Then, you MUST RERUN LILO to update the
+ loading map! If you don't, you won't be able to boot the new kernel image.
+
+ - Reinstalling LILO is usually a matter of running /sbin/lilo. You may wish
+ to edit /etc/lilo.conf to specify an entry for your old kernel image
+ (say, /vmlinux.old) in case the new one does not work. See the LILO docs
+ for more information.
+
+ - After reinstalling LILO, you should be all set. Shutdown the system,
reboot, and enjoy!
- If you ever need to change the default root device, video mode,
- ramdisk size, etc. in the kernel image, use the ``rdev`` program (or
- alternatively the LILO boot options when appropriate). No need to
- recompile the kernel to change these parameters.
+ - If you ever need to change the default root device, video mode, etc. in the
+ kernel image, use your bootloader's boot options where appropriate. No need
+ to recompile the kernel to change these parameters.
- Reboot with the new kernel and enjoy.
+
If something goes wrong
-----------------------
- - If you have problems that seem to be due to kernel bugs, please check
- the file MAINTAINERS to see if there is a particular person associated
- with the part of the kernel that you are having trouble with. If there
- isn't anyone listed there, then the second best thing is to mail
- them to me (torvalds@linux-foundation.org), and possibly to any other
- relevant mailing-list or to the newsgroup.
-
- - In all bug-reports, *please* tell what kernel you are talking about,
- how to duplicate the problem, and what your setup is (use your common
- sense). If the problem is new, tell me so, and if the problem is
- old, please try to tell me when you first noticed it.
-
- - If the bug results in a message like::
-
- unable to handle kernel paging request at address C0000010
- Oops: 0002
- EIP: 0010:XXXXXXXX
- eax: xxxxxxxx ebx: xxxxxxxx ecx: xxxxxxxx edx: xxxxxxxx
- esi: xxxxxxxx edi: xxxxxxxx ebp: xxxxxxxx
- ds: xxxx es: xxxx fs: xxxx gs: xxxx
- Pid: xx, process nr: xx
- xx xx xx xx xx xx xx xx xx xx
-
- or similar kernel debugging information on your screen or in your
- system log, please duplicate it *exactly*. The dump may look
- incomprehensible to you, but it does contain information that may
- help debugging the problem. The text above the dump is also
- important: it tells something about why the kernel dumped code (in
- the above example, it's due to a bad kernel pointer). More information
- on making sense of the dump is in Documentation/admin-guide/bug-hunting.rst
-
- - If you compiled the kernel with CONFIG_KALLSYMS you can send the dump
- as is, otherwise you will have to use the ``ksymoops`` program to make
- sense of the dump (but compiling with CONFIG_KALLSYMS is usually preferred).
- This utility can be downloaded from
- https://www.kernel.org/pub/linux/utils/kernel/ksymoops/ .
- Alternatively, you can do the dump lookup by hand:
-
- - In debugging dumps like the above, it helps enormously if you can
- look up what the EIP value means. The hex value as such doesn't help
- me or anybody else very much: it will depend on your particular
- kernel setup. What you should do is take the hex value from the EIP
- line (ignore the ``0010:``), and look it up in the kernel namelist to
- see which kernel function contains the offending address.
-
- To find out the kernel function name, you'll need to find the system
- binary associated with the kernel that exhibited the symptom. This is
- the file 'linux/vmlinux'. To extract the namelist and match it against
- the EIP from the kernel crash, do::
-
- nm vmlinux | sort | less
-
- This will give you a list of kernel addresses sorted in ascending
- order, from which it is simple to find the function that contains the
- offending address. Note that the address given by the kernel
- debugging messages will not necessarily match exactly with the
- function addresses (in fact, that is very unlikely), so you can't
- just 'grep' the list: the list will, however, give you the starting
- point of each kernel function, so by looking for the function that
- has a starting address lower than the one you are searching for but
- is followed by a function with a higher address you will find the one
- you want. In fact, it may be a good idea to include a bit of
- "context" in your problem report, giving a few lines around the
- interesting one.
-
- If you for some reason cannot do the above (you have a pre-compiled
- kernel image or similar), telling me as much about your setup as
- possible will help. Please read the :ref:`admin-guide/reporting-bugs.rst <reportingbugs>`
- document for details.
-
- - Alternatively, you can use gdb on a running kernel. (read-only; i.e. you
- cannot change values or set break points.) To do this, first compile the
- kernel with -g; edit arch/x86/Makefile appropriately, then do a ``make
- clean``. You'll also need to enable CONFIG_PROC_FS (via ``make config``).
-
- After you've rebooted with the new kernel, do ``gdb vmlinux /proc/kcore``.
- You can now use all the usual gdb commands. The command to look up the
- point where your system crashed is ``l *0xXXXXXXXX``. (Replace the XXXes
- with the EIP value.)
-
- gdb'ing a non-running kernel currently fails because ``gdb`` (wrongly)
- disregards the starting offset for which the kernel is compiled.
+If you have problems that seem to be due to kernel bugs, please follow the
+instructions at 'Documentation/admin-guide/reporting-issues.rst'.
+
+Hints on understanding kernel bug reports are in
+'Documentation/admin-guide/bug-hunting.rst'. More on debugging the kernel
+with gdb is in 'Documentation/dev-tools/gdb-kernel-debugging.rst' and
+'Documentation/dev-tools/kgdb.rst'.
diff --git a/Documentation/admin-guide/abi-obsolete.rst b/Documentation/admin-guide/abi-obsolete.rst
new file mode 100644
index 000000000000..594e697aa1b2
--- /dev/null
+++ b/Documentation/admin-guide/abi-obsolete.rst
@@ -0,0 +1,11 @@
+ABI obsolete symbols
+====================
+
+Documents interfaces that are still remaining in the kernel, but are
+marked to be removed at some later point in time.
+
+The description of the interface will document the reason why it is
+obsolete and when it can be expected to be removed.
+
+.. kernel-abi:: ABI/obsolete
+ :rst:
diff --git a/Documentation/admin-guide/abi-removed.rst b/Documentation/admin-guide/abi-removed.rst
new file mode 100644
index 000000000000..f9e000c81828
--- /dev/null
+++ b/Documentation/admin-guide/abi-removed.rst
@@ -0,0 +1,5 @@
+ABI removed symbols
+===================
+
+.. kernel-abi:: ABI/removed
+ :rst:
diff --git a/Documentation/admin-guide/abi-stable.rst b/Documentation/admin-guide/abi-stable.rst
new file mode 100644
index 000000000000..fc3361d847b1
--- /dev/null
+++ b/Documentation/admin-guide/abi-stable.rst
@@ -0,0 +1,14 @@
+ABI stable symbols
+==================
+
+Documents the interfaces that the developer has defined to be stable.
+
+Userspace programs are free to use these interfaces with no
+restrictions, and backward compatibility for them will be guaranteed
+for at least 2 years.
+
+Most interfaces (like syscalls) are expected to never change and always
+be available.
+
+.. kernel-abi:: ABI/stable
+ :rst:
diff --git a/Documentation/admin-guide/abi-testing.rst b/Documentation/admin-guide/abi-testing.rst
new file mode 100644
index 000000000000..19767926b344
--- /dev/null
+++ b/Documentation/admin-guide/abi-testing.rst
@@ -0,0 +1,20 @@
+ABI testing symbols
+===================
+
+Documents interfaces that are felt to be stable,
+as the main development of this interface has been completed.
+
+The interface can be changed to add new features, but the
+current interface will not break by doing this, unless grave
+errors or security problems are found in them.
+
+Userspace programs can start to rely on these interfaces, but they must
+be aware of changes that can occur before these interfaces move to
+be marked stable.
+
+Programs that use these interfaces are strongly encouraged to add their
+name to the description of these interfaces, so that the kernel
+developers can easily notify them if any changes occur.
+
+.. kernel-abi:: ABI/testing
+ :rst:
diff --git a/Documentation/admin-guide/abi.rst b/Documentation/admin-guide/abi.rst
new file mode 100644
index 000000000000..bcab3ef2597c
--- /dev/null
+++ b/Documentation/admin-guide/abi.rst
@@ -0,0 +1,11 @@
+=====================
+Linux ABI description
+=====================
+
+.. toctree::
+ :maxdepth: 2
+
+ abi-stable
+ abi-testing
+ abi-obsolete
+ abi-removed
diff --git a/Documentation/admin-guide/acpi/cppc_sysfs.rst b/Documentation/admin-guide/acpi/cppc_sysfs.rst
index a4b99afbe331..36981c667823 100644
--- a/Documentation/admin-guide/acpi/cppc_sysfs.rst
+++ b/Documentation/admin-guide/acpi/cppc_sysfs.rst
@@ -4,11 +4,13 @@
Collaborative Processor Performance Control (CPPC)
==================================================
+.. _cppc_sysfs:
+
CPPC
====
CPPC defined in the ACPI spec describes a mechanism for the OS to manage the
-performance of a logical processor on a contigious and abstract performance
+performance of a logical processor on a contiguous and abstract performance
scale. CPPC exposes a set of registers to describe abstract performance scale,
to request performance levels and to measure per-cpu delivered performance.
@@ -45,7 +47,7 @@ for each cpu X::
* lowest_freq : CPU frequency corresponding to lowest_perf (in MHz).
* nominal_freq : CPU frequency corresponding to nominal_perf (in MHz).
The above frequencies should only be used to report processor performance in
- freqency instead of abstract scale. These values should not be used for any
+ frequency instead of abstract scale. These values should not be used for any
functional decisions.
* feedback_ctrs : Includes both Reference and delivered performance counter.
@@ -73,4 +75,4 @@ taking two different snapshots of feedback counters at time T1 and T2.
delivered_counter_delta = fbc_t2[del] - fbc_t1[del]
reference_counter_delta = fbc_t2[ref] - fbc_t1[ref]
- delivered_perf = (refernce_perf x delivered_counter_delta) / reference_counter_delta
+ delivered_perf = (reference_perf x delivered_counter_delta) / reference_counter_delta
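The delivered_perf formula above can be computed from the sysfs files listed
earlier. A minimal bash sketch, assuming cpu0 exposes an acpi_cppc directory
with a reference_perf attribute and that feedback_ctrs prints the counters as
"ref:<value> del:<value>" (verify the exact format on your kernel first)::

    #!/bin/bash
    # Estimate delivered performance of CPU0 over a one-second window.
    cpu=/sys/devices/system/cpu/cpu0/acpi_cppc

    read ref1 del1 < <(awk -F'[: ]+' '{print $2, $4}' "$cpu/feedback_ctrs")
    sleep 1
    read ref2 del2 < <(awk -F'[: ]+' '{print $2, $4}' "$cpu/feedback_ctrs")

    reference_perf=$(cat "$cpu/reference_perf")
    echo "delivered_perf: $(( reference_perf * (del2 - del1) / (ref2 - ref1) ))"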
diff --git a/Documentation/admin-guide/acpi/dsdt-override.rst b/Documentation/admin-guide/acpi/dsdt-override.rst
deleted file mode 100644
index 50bd7f194bf4..000000000000
--- a/Documentation/admin-guide/acpi/dsdt-override.rst
+++ /dev/null
@@ -1,13 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===============
-Overriding DSDT
-===============
-
-Linux supports a method of overriding the BIOS DSDT:
-
-CONFIG_ACPI_CUSTOM_DSDT - builds the image into the kernel.
-
-When to use this method is described in detail on the
-Linux/ACPI home page:
-https://01.org/linux-acpi/documentation/overriding-dsdt
diff --git a/Documentation/admin-guide/acpi/fan_performance_states.rst b/Documentation/admin-guide/acpi/fan_performance_states.rst
index 98fe5c333121..b9e4b4d146c1 100644
--- a/Documentation/admin-guide/acpi/fan_performance_states.rst
+++ b/Documentation/admin-guide/acpi/fan_performance_states.rst
@@ -60,3 +60,31 @@ For example::
When a given field is not populated or its value provided by the platform
firmware is invalid, the "not-defined" string is shown instead of the value.
+
+ACPI Fan Fine Grain Control
+=============================
+
+When the _FIF object specifies support for fine grain control, the fan speed
+can be set from 0 to 100% with the recommended minimum "step size" via the
+_FSL object. The user can adjust the fan speed using the thermal sysfs
+cooling device.
+
+Here the user can look at fan performance states for a reference speed
+(speed_rpm) and set it by changing the cooling device cur_state. If fine
+grain control is supported, the user can also adjust to other speeds that
+are not defined in the performance states.
+
+The support of fine grain control is presented via sysfs attribute
+"fine_grain_control". If fine grain control is present, this attribute
+will show "1" otherwise "0".
+
+This sysfs attribute is presented in the same directory as performance states.
+
+ACPI Fan Performance Feedback
+=============================
+
+The optional _FST object provides status information for the fan device.
+This includes a field providing the current speed, in revolutions per minute,
+at which the fan is rotating.
+
+This speed is presented in the sysfs using the attribute "fan_speed_rpm",
+in the same directory as performance states.
diff --git a/Documentation/admin-guide/acpi/index.rst b/Documentation/admin-guide/acpi/index.rst
index 71277689ad97..b078fdb8f4c9 100644
--- a/Documentation/admin-guide/acpi/index.rst
+++ b/Documentation/admin-guide/acpi/index.rst
@@ -9,7 +9,6 @@ the Linux ACPI support.
:maxdepth: 1
initrd_table_override
- dsdt-override
ssdt-overlays
cppc_sysfs
fan_performance_states
diff --git a/Documentation/admin-guide/acpi/ssdt-overlays.rst b/Documentation/admin-guide/acpi/ssdt-overlays.rst
index 5d7e25988085..5ea9f4a3b76e 100644
--- a/Documentation/admin-guide/acpi/ssdt-overlays.rst
+++ b/Documentation/admin-guide/acpi/ssdt-overlays.rst
@@ -30,22 +30,21 @@ following ASL code can be used::
{
Device (STAC)
{
- Name (_ADR, Zero)
Name (_HID, "BMA222E")
+ Name (RBUF, ResourceTemplate ()
+ {
+ I2cSerialBus (0x0018, ControllerInitiated, 0x00061A80,
+ AddressingMode7Bit, "\\_SB.I2C6", 0x00,
+ ResourceConsumer, ,)
+ GpioInt (Edge, ActiveHigh, Exclusive, PullDown, 0x0000,
+ "\\_SB.GPO2", 0x00, ResourceConsumer, , )
+ { // Pin list
+ 0
+ }
+ })
Method (_CRS, 0, Serialized)
{
- Name (RBUF, ResourceTemplate ()
- {
- I2cSerialBus (0x0018, ControllerInitiated, 0x00061A80,
- AddressingMode7Bit, "\\_SB.I2C6", 0x00,
- ResourceConsumer, ,)
- GpioInt (Edge, ActiveHigh, Exclusive, PullDown, 0x0000,
- "\\_SB.GPO2", 0x00, ResourceConsumer, , )
- { // Pin list
- 0
- }
- })
Return (RBUF)
}
}
@@ -75,7 +74,7 @@ This option allows loading of user defined SSDTs from initrd and it is useful
when the system does not support EFI or when there is not enough EFI storage.
It works in a similar way with initrd based ACPI tables override/upgrade: SSDT
-aml code must be placed in the first, uncompressed, initrd under the
+AML code must be placed in the first, uncompressed, initrd under the
"kernel/firmware/acpi" path. Multiple files can be used and this will translate
in loading multiple tables. Only SSDT and OEM tables are allowed. See
initrd_table_override.txt for more details.
@@ -103,12 +102,14 @@ This is the preferred method, when EFI is supported on the platform, because it
allows a persistent, OS independent way of storing the user defined SSDTs. There
is also work underway to implement EFI support for loading user defined SSDTs
and using this method will make it easier to convert to the EFI loading
-mechanism when that will arrive.
+mechanism when it arrives. To enable it, CONFIG_EFI_CUSTOM_SSDT_OVERLAYS
+should be set to y.
-In order to load SSDTs from an EFI variable the efivar_ssdt kernel command line
-parameter can be used. The argument for the option is the variable name to
-use. If there are multiple variables with the same name but with different
-vendor GUIDs, all of them will be loaded.
+In order to load SSDTs from an EFI variable the ``"efivar_ssdt=..."`` kernel
+command line parameter can be used (the name has a limitation of 16 characters).
+The argument for the option is the variable name to use. If there are multiple
+variables with the same name but with different vendor GUIDs, all of them will
+be loaded.
In order to store the AML code in an EFI variable the efivarfs filesystem can be
used. It is enabled and mounted by default in /sys/firmware/efi/efivars in all
@@ -127,7 +128,7 @@ variable with the content from a given file::
#!/bin/sh -e
- while ! [ -z "$1" ]; do
+ while [ -n "$1" ]; do
case "$1" in
"-f") filename="$2"; shift;;
"-g") guid="$2"; shift;;
@@ -167,14 +168,14 @@ variable with the content from a given file::
Loading ACPI SSDTs from configfs
================================
-This option allows loading of user defined SSDTs from userspace via the configfs
+This option allows loading of user defined SSDTs from user space via the configfs
interface. The CONFIG_ACPI_CONFIGFS option must be select and configfs must be
mounted. In the following examples, we assume that configfs has been mounted in
-/config.
+/sys/kernel/config.
-New tables can be loading by creating new directories in /config/acpi/table/ and
-writing the SSDT aml code in the aml attribute::
+New tables can be loaded by creating new directories in /sys/kernel/config/acpi/table
+and writing the SSDT AML code in the aml attribute::
- cd /config/acpi/table
+ cd /sys/kernel/config/acpi/table
mkdir my_ssdt
cat ~/ssdt.aml > my_ssdt/aml
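The initrd-based method described earlier in this hunk amounts to prepending
an uncompressed cpio archive to the existing initrd. A minimal sketch, with
illustrative image names and assuming the bootloader entry currently points
at /boot/initrd.img::

    mkdir -p kernel/firmware/acpi
    cp ssdt.aml kernel/firmware/acpi/

    # Build an uncompressed cpio archive and prepend it to the regular initrd.
    find kernel | cpio -H newc --create > ssdt_initrd.img
    cat ssdt_initrd.img /boot/initrd.img > /boot/initrd-with-ssdt.img

    # Then point the bootloader entry at /boot/initrd-with-ssdt.img and reboot.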
diff --git a/Documentation/admin-guide/auxdisplay/cfag12864b.rst b/Documentation/admin-guide/auxdisplay/cfag12864b.rst
index 18c2865bd322..da385d851acc 100644
--- a/Documentation/admin-guide/auxdisplay/cfag12864b.rst
+++ b/Documentation/admin-guide/auxdisplay/cfag12864b.rst
@@ -3,7 +3,7 @@ cfag12864b LCD Driver Documentation
===================================
:License: GPLv2
-:Author & Maintainer: Miguel Ojeda Sandonis
+:Author & Maintainer: Miguel Ojeda <ojeda@kernel.org>
:Date: 2006-10-27
diff --git a/Documentation/admin-guide/auxdisplay/ks0108.rst b/Documentation/admin-guide/auxdisplay/ks0108.rst
index c0b7faf73136..a7d3fe509373 100644
--- a/Documentation/admin-guide/auxdisplay/ks0108.rst
+++ b/Documentation/admin-guide/auxdisplay/ks0108.rst
@@ -3,7 +3,7 @@ ks0108 LCD Controller Driver Documentation
==========================================
:License: GPLv2
-:Author & Maintainer: Miguel Ojeda Sandonis
+:Author & Maintainer: Miguel Ojeda <ojeda@kernel.org>
:Date: 2006-10-27
diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst
index 1eccf952876d..6fdb495ac466 100644
--- a/Documentation/admin-guide/bcache.rst
+++ b/Documentation/admin-guide/bcache.rst
@@ -5,11 +5,14 @@ A block layer cache (bcache)
Say you've got a big slow raid 6, and an ssd or three. Wouldn't it be
nice if you could use them as cache... Hence bcache.
-Wiki and git repositories are at:
+The bcache wiki can be found at:
+ https://bcache.evilpiepirate.org
- - https://bcache.evilpiepirate.org
- - http://evilpiepirate.org/git/linux-bcache.git
- - https://evilpiepirate.org/git/bcache-tools.git
+This is the git repository of bcache-tools:
+ https://git.kernel.org/pub/scm/linux/kernel/git/colyli/bcache-tools.git/
+
+The latest bcache kernel code can be found from mainline Linux kernel:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
It's designed around the performance characteristics of SSDs - it only allocates
in erase block sized buckets, and it uses a hybrid btree/log to track cached
@@ -41,17 +44,21 @@ in the cache it first disables writeback caching and waits for all dirty data
to be flushed.
Getting started:
-You'll need make-bcache from the bcache-tools repository. Both the cache device
+You'll need bcache util from the bcache-tools repository. Both the cache device
and backing device must be formatted before use::
- make-bcache -B /dev/sdb
- make-bcache -C /dev/sdc
+ bcache make -B /dev/sdb
+ bcache make -C /dev/sdc
-make-bcache has the ability to format multiple devices at the same time - if
+`bcache make` has the ability to format multiple devices at the same time - if
you format your backing devices and cache device at the same time, you won't
have to manually attach::
- make-bcache -B /dev/sda /dev/sdb -C /dev/sdc
+ bcache make -B /dev/sda /dev/sdb -C /dev/sdc
+
+If your bcache-tools is not updated to the latest version and does not have
+the unified `bcache` utility, you may use the legacy `make-bcache` utility to
+format the bcache device with the same -B and -C parameters.
bcache-tools now ships udev rules, and bcache devices are known to the kernel
immediately. Without udev, you can manually register devices like this::
@@ -188,7 +195,7 @@ D) Recovering data without bcache:
If bcache is not available in the kernel, a filesystem on the backing
device is still available at an 8KiB offset. So either via a loopdev
of the backing device created with --offset 8K, or any value defined by
---data-offset when you originally formatted bcache with `make-bcache`.
+--data-offset when you originally formatted bcache with `bcache make`.
For example::
@@ -197,7 +204,7 @@ For example::
This should present your unmodified backing device data in /dev/loop0
If your cache is in writethrough mode, then you can safely discard the
-cache device without loosing data.
+cache device without losing data.
E) Wiping a cache device
@@ -210,7 +217,7 @@ E) Wiping a cache device
After you boot back with bcache enabled, you recreate the cache and attach it::
- host:~# make-bcache -C /dev/sdh2
+ host:~# bcache make -C /dev/sdh2
UUID: 7be7e175-8f4c-4f99-94b2-9c904d227045
Set UUID: 5bc072a8-ab17-446d-9744-e247949913c1
version: 0
@@ -318,7 +325,7 @@ want for getting the best possible numbers when benchmarking.
The default metadata size in bcache is 8k. If your backing device is
RAID based, then be sure to align this by a multiple of your stride
- width using `make-bcache --data-offset`. If you intend to expand your
+ width using `bcache make --data-offset`. If you intend to expand your
disk array in the future, then multiply a series of primes by your
raid stripe size to get the disk multiples that you would like.
@@ -501,9 +508,6 @@ cache_miss_collisions
cache miss, but raced with a write and data was already present (usually 0
since the synchronization for cache misses was rewritten)
-cache_readaheads
- Count of times readahead occurred.
-
Sysfs - cache set
~~~~~~~~~~~~~~~~~
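Putting the renamed `bcache` utility from the hunks above to work end to end,
a minimal sketch, assuming /dev/sdb is the backing device and /dev/sdc the
cache device as in the examples, and using ext4 purely as an illustration::

    bcache make -B /dev/sdb -C /dev/sdc

    # Without udev, register the devices manually so the kernel attaches them.
    echo /dev/sdb > /sys/fs/bcache/register
    echo /dev/sdc > /sys/fs/bcache/register

    # The cached device then shows up as /dev/bcache0.
    mkfs.ext4 /dev/bcache0
    mount /dev/bcache0 /mnt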
diff --git a/Documentation/admin-guide/binderfs.rst b/Documentation/admin-guide/binderfs.rst
index 8243af9b3510..41a4db00df8d 100644
--- a/Documentation/admin-guide/binderfs.rst
+++ b/Documentation/admin-guide/binderfs.rst
@@ -70,5 +70,18 @@ Deleting binder Devices
Binderfs binder devices can be deleted via `unlink() <unlink_>`_. This means
that the `rm() <rm_>`_ tool can be used to delete them. Note that the
``binder-control`` device cannot be deleted since this would make the binderfs
-instance unuseable. The ``binder-control`` device will be deleted when the
+instance unusable. The ``binder-control`` device will be deleted when the
binderfs instance is unmounted and all references to it have been dropped.
+
+Binder features
+---------------
+
+Assuming an instance of binderfs has been mounted at ``/dev/binderfs``, the
+features supported by the binder driver can be located under
+``/dev/binderfs/features/``. The presence of individual files can be tested
+to determine whether a particular feature is supported by the driver.
+
+Example::
+
+ cat /dev/binderfs/features/oneway_spam_detection
+ 1
diff --git a/Documentation/admin-guide/binfmt-misc.rst b/Documentation/admin-guide/binfmt-misc.rst
index 7a864131e5ea..59cd902e3549 100644
--- a/Documentation/admin-guide/binfmt-misc.rst
+++ b/Documentation/admin-guide/binfmt-misc.rst
@@ -23,7 +23,7 @@ Here is what the fields mean:
- ``name``
is an identifier string. A new /proc file will be created with this
- ``name below /proc/sys/fs/binfmt_misc``; cannot contain slashes ``/`` for
+ name below ``/proc/sys/fs/binfmt_misc``; cannot contain slashes ``/`` for
obvious reasons.
- ``type``
is the type of recognition. Give ``M`` for magic and ``E`` for extension.
@@ -83,7 +83,7 @@ Here is what the fields mean:
``F`` - fix binary
The usual behaviour of binfmt_misc is to spawn the
binary lazily when the misc format file is invoked. However,
- this doesn``t work very well in the face of mount namespaces and
+ this doesn't work very well in the face of mount namespaces and
changeroots, so the ``F`` mode opens the binary as soon as the
emulation is installed and uses the opened image to spawn the
emulator, meaning it is always available once installed,
diff --git a/Documentation/admin-guide/blockdev/drbd/figures.rst b/Documentation/admin-guide/blockdev/drbd/figures.rst
index bd9a4901fe46..9f73253ea353 100644
--- a/Documentation/admin-guide/blockdev/drbd/figures.rst
+++ b/Documentation/admin-guide/blockdev/drbd/figures.rst
@@ -25,6 +25,6 @@ Sub graphs of DRBD's state transitions
:alt: disk-states-8.dot
:align: center
-.. kernel-figure:: node-states-8.dot
- :alt: node-states-8.dot
+.. kernel-figure:: peer-states-8.dot
+ :alt: peer-states-8.dot
:align: center
diff --git a/Documentation/admin-guide/blockdev/drbd/index.rst b/Documentation/admin-guide/blockdev/drbd/index.rst
index 68ecd5c113e9..561fd1e35917 100644
--- a/Documentation/admin-guide/blockdev/drbd/index.rst
+++ b/Documentation/admin-guide/blockdev/drbd/index.rst
@@ -10,7 +10,7 @@ Description
clusters and in this context, is a "drop-in" replacement for shared
storage. Simplistically, you could see it as a network RAID 1.
- Please visit http://www.drbd.org to find out more.
+ Please visit https://www.drbd.org to find out more.
.. toctree::
:maxdepth: 1
diff --git a/Documentation/admin-guide/blockdev/drbd/node-states-8.dot b/Documentation/admin-guide/blockdev/drbd/peer-states-8.dot
index bfa54e1f8016..6dc3954954d6 100644
--- a/Documentation/admin-guide/blockdev/drbd/node-states-8.dot
+++ b/Documentation/admin-guide/blockdev/drbd/peer-states-8.dot
@@ -1,8 +1,3 @@
-digraph node_states {
- Secondary -> Primary [ label = "ioctl_set_state()" ]
- Primary -> Secondary [ label = "ioctl_set_state()" ]
-}
-
digraph peer_states {
Secondary -> Primary [ label = "recv state packet" ]
Primary -> Secondary [ label = "recv state packet" ]
diff --git a/Documentation/admin-guide/blockdev/floppy.rst b/Documentation/admin-guide/blockdev/floppy.rst
index 4a8f31cf4139..0328438ebe2c 100644
--- a/Documentation/admin-guide/blockdev/floppy.rst
+++ b/Documentation/admin-guide/blockdev/floppy.rst
@@ -6,7 +6,7 @@ FAQ list:
=========
A FAQ list may be found in the fdutils package (see below), and also
-at <http://fdutils.linux.lu/faq.html>.
+at <https://fdutils.linux.lu/faq.html>.
LILO configuration options (Thinkpad users, read this)
@@ -220,11 +220,11 @@ It also contains additional documentation about the floppy driver.
The latest version can be found at fdutils homepage:
- http://fdutils.linux.lu
+ https://fdutils.linux.lu
The fdutils releases can be found at:
- http://fdutils.linux.lu/download.html
+ https://fdutils.linux.lu/download.html
http://www.tux.org/pub/knaff/fdutils/
diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst
index b903cf152091..957ccf617797 100644
--- a/Documentation/admin-guide/blockdev/index.rst
+++ b/Documentation/admin-guide/blockdev/index.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-===========================
-The Linux RapidIO Subsystem
-===========================
+=============
+Block Devices
+=============
.. toctree::
:maxdepth: 1
diff --git a/Documentation/admin-guide/blockdev/nbd.rst b/Documentation/admin-guide/blockdev/nbd.rst
index d78dfe559dcf..faf2ac4b1509 100644
--- a/Documentation/admin-guide/blockdev/nbd.rst
+++ b/Documentation/admin-guide/blockdev/nbd.rst
@@ -14,7 +14,7 @@ to borrow disk space from another computer.
Unlike NFS, it is possible to put any filesystem on it, etc.
For more information, or to download the nbd-client and nbd-server
-tools, go to http://nbd.sf.net/.
+tools, go to https://github.com/NetworkBlockDevice/nbd.
The nbd kernel module need only be installed on the client
system, as the nbd-server is completely in userspace. In fact,
diff --git a/Documentation/admin-guide/blockdev/paride.rst b/Documentation/admin-guide/blockdev/paride.rst
index 87b4278bf314..e85ad37cc0e5 100644
--- a/Documentation/admin-guide/blockdev/paride.rst
+++ b/Documentation/admin-guide/blockdev/paride.rst
@@ -3,6 +3,7 @@ Linux and parallel port IDE devices
===================================
PARIDE v1.03 (c) 1997-8 Grant Guenther <grant@torque.net>
+PATA_PARPORT (c) 2023 Ondrej Zary
1. Introduction
===============
@@ -51,27 +52,15 @@ parallel port IDE subsystem, including:
as well as most of the clone and no-name products on the market.
-To support such a wide range of devices, PARIDE, the parallel port IDE
-subsystem, is actually structured in three parts. There is a base
-paride module which provides a registry and some common methods for
-accessing the parallel ports. The second component is a set of
-high-level drivers for each of the different types of supported devices:
+To support such a wide range of devices, pata_parport is actually structured
+in two parts. There is a base pata_parport module which provides an interface
+to the kernel libata subsystem, a registry and some common methods for
+accessing the parallel ports.
- === =============
- pd IDE disk
- pcd ATAPI CD-ROM
- pf ATAPI disk
- pt ATAPI tape
- pg ATAPI generic
- === =============
-
-(Currently, the pg driver is only used with CD-R drives).
-
-The high-level drivers function according to the relevant standards.
-The third component of PARIDE is a set of low-level protocol drivers
-for each of the parallel port IDE adapter chips. Thanks to the interest
-and encouragement of Linux users from many parts of the world,
-support is available for almost all known adapter protocols:
+The second component is a set of low-level protocol drivers for each of the
+parallel port IDE adapter chips. Thanks to the interest and encouragement of
+Linux users from many parts of the world, support is available for almost all
+known adapter protocols:
==== ====================================== ====
aten ATEN EH-100 (HK)
@@ -91,251 +80,87 @@ support is available for almost all known adapter protocols:
==== ====================================== ====
-2. Using the PARIDE subsystem
-=============================
+2. Using pata_parport subsystem
+===============================
While configuring the Linux kernel, you may choose either to build
-the PARIDE drivers into your kernel, or to build them as modules.
+the pata_parport drivers into your kernel, or to build them as modules.
In either case, you will need to select "Parallel port IDE device support"
-as well as at least one of the high-level drivers and at least one
-of the parallel port communication protocols. If you do not know
-what kind of parallel port adapter is used in your drive, you could
-begin by checking the file names and any text files on your DOS
+and at least one of the parallel port communication protocols.
+If you do not know what kind of parallel port adapter is used in your drive,
+you could begin by checking the file names and any text files on your DOS
installation floppy. Alternatively, you can look at the markings on
the adapter chip itself. That's usually sufficient to identify the
correct device.
-You can actually select all the protocol modules, and allow the PARIDE
+You can actually select all the protocol modules, and allow the pata_parport
subsystem to try them all for you.
For the "brand-name" products listed above, here are the protocol
and high-level drivers that you would use:
- ================ ============ ====== ========
- Manufacturer Model Driver Protocol
- ================ ============ ====== ========
- MicroSolutions CD-ROM pcd bpck
- MicroSolutions PD drive pf bpck
- MicroSolutions hard-drive pd bpck
- MicroSolutions 8000t tape pt bpck
- SyQuest EZ, SparQ pd epat
- Imation Superdisk pf epat
- Maxell Superdisk pf friq
- Avatar Shark pd epat
- FreeCom CD-ROM pcd frpw
- Hewlett-Packard 5GB Tape pt epat
- Hewlett-Packard 7200e (CD) pcd epat
- Hewlett-Packard 7200e (CD-R) pg epat
- ================ ============ ====== ========
-
-2.1 Configuring built-in drivers
----------------------------------
-
-We recommend that you get to know how the drivers work and how to
-configure them as loadable modules, before attempting to compile a
-kernel with the drivers built-in.
-
-If you built all of your PARIDE support directly into your kernel,
-and you have just a single parallel port IDE device, your kernel should
-locate it automatically for you. If you have more than one device,
-you may need to give some command line options to your bootloader
-(eg: LILO), how to do that is beyond the scope of this document.
-
-The high-level drivers accept a number of command line parameters, all
-of which are documented in the source files in linux/drivers/block/paride.
-By default, each driver will automatically try all parallel ports it
-can find, and all protocol types that have been installed, until it finds
-a parallel port IDE adapter. Once it finds one, the probe stops. So,
-if you have more than one device, you will need to tell the drivers
-how to identify them. This requires specifying the port address, the
-protocol identification number and, for some devices, the drive's
-chain ID. While your system is booting, a number of messages are
-displayed on the console. Like all such messages, they can be
-reviewed with the 'dmesg' command. Among those messages will be
-some lines like::
-
- paride: bpck registered as protocol 0
- paride: epat registered as protocol 1
-
-The numbers will always be the same until you build a new kernel with
-different protocol selections. You should note these numbers as you
-will need them to identify the devices.
+ ================ ============ ========
+ Manufacturer Model Protocol
+ ================ ============ ========
+ MicroSolutions CD-ROM bpck
+ MicroSolutions PD drive bpck
+ MicroSolutions hard-drive bpck
+ MicroSolutions 8000t tape bpck
+ SyQuest EZ, SparQ epat
+ Imation Superdisk epat
+ Maxell Superdisk friq
+ Avatar Shark epat
+ FreeCom CD-ROM frpw
+ Hewlett-Packard 5GB Tape epat
+ Hewlett-Packard 7200e (CD) epat
+ Hewlett-Packard 7200e (CD-R) epat
+ ================ ============ ========
+
+All parports and all protocol drivers are probed automatically unless the
+probe=0 parameter is used. So just "modprobe epat" is enough for an Imation
+SuperDisk drive to work.
+
+Manual device creation::
+
+ # echo "port protocol mode unit delay" >/sys/bus/pata_parport/new_device
+
+where:
+
+ ======== ================================================
+ port parport name (or "auto" for all parports)
+ protocol protocol name (or "auto" for all protocols)
+ mode mode number (protocol-specific) or -1 for probe
+ unit unit number (for backpack only, see below)
+ delay I/O delay (see troubleshooting section below)
+ ======== ================================================
If you happen to be using a MicroSolutions backpack device, you will
also need to know the unit ID number for each drive. This is usually
the last two digits of the drive's serial number (but read MicroSolutions'
documentation about this).
-As an example, let's assume that you have a MicroSolutions PD/CD drive
-with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
-EZ-135 connected to the chained port on the PD/CD drive and also an
-Imation Superdisk connected to port 0x278. You could give the following
-options on your boot command::
-
- pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36
-
-In the last option, pf.drive1 configures device /dev/pf1, the 0x378
-is the parallel port base address, the 0 is the protocol registration
-number and 36 is the chain ID.
-
-Please note: while PARIDE will work both with and without the
-PARPORT parallel port sharing system that is included by the
-"Parallel port support" option, PARPORT must be included and enabled
-if you want to use chains of devices on the same parallel port.
-
-2.2 Loading and configuring PARIDE as modules
-----------------------------------------------
-
-It is much faster and simpler to get to understand the PARIDE drivers
-if you use them as loadable kernel modules.
-
-Note 1:
- using these drivers with the "kerneld" automatic module loading
- system is not recommended for beginners, and is not documented here.
-
-Note 2:
- if you build PARPORT support as a loadable module, PARIDE must
- also be built as loadable modules, and PARPORT must be loaded before
- the PARIDE modules.
-
-To use PARIDE, you must begin by::
-
- insmod paride
-
-this loads a base module which provides a registry for the protocols,
-among other tasks.
-
-Then, load as many of the protocol modules as you think you might need.
-As you load each module, it will register the protocols that it supports,
-and print a log message to your kernel log file and your console. For
-example::
-
- # insmod epat
- paride: epat registered as protocol 0
- # insmod kbic
- paride: k951 registered as protocol 1
- paride: k971 registered as protocol 2
-
-Finally, you can load high-level drivers for each kind of device that
-you have connected. By default, each driver will autoprobe for a single
-device, but you can support up to four similar devices by giving their
-individual co-ordinates when you load the driver.
-
-For example, if you had two no-name CD-ROM drives both using the
-KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc
-you could give the following command::
-
- # insmod pcd drive0=0x378,1 drive1=0x3bc,1
-
-For most adapters, giving a port address and protocol number is sufficient,
-but check the source files in linux/drivers/block/paride for more
-information. (Hopefully someone will write some man pages one day !).
-
-As another example, here's what happens when PARPORT is installed, and
-a SyQuest EZ-135 is attached to port 0x378::
-
- # insmod paride
- paride: version 1.0 installed
- # insmod epat
- paride: epat registered as protocol 0
- # insmod pd
- pd: pd version 1.0, major 45, cluster 64, nice 0
- pda: Sharing parport1 at 0x378
- pda: epat 1.0, Shuttle EPAT chip c3 at 0x378, mode 5 (EPP-32), delay 1
- pda: SyQuest EZ135A, 262144 blocks [128M], (512/16/32), removable media
- pda: pda1
-
-Note that the last line is the output from the generic partition table
-scanner - in this case it reports that it has found a disk with one partition.
-
-2.3 Using a PARIDE device
---------------------------
-
-Once the drivers have been loaded, you can access PARIDE devices in the
-same way as their traditional counterparts. You will probably need to
-create the device "special files". Here is a simple script that you can
-cut to a file and execute::
-
- #!/bin/bash
- #
- # mkd -- a script to create the device special files for the PARIDE subsystem
- #
- function mkdev {
- mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
- }
- #
- function pd {
- D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) )
- mkdev pd$D b 45 $[ $1 * 16 ]
- for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
- do mkdev pd$D$P b 45 $[ $1 * 16 + $P ]
- done
- }
- #
- cd /dev
- #
- for u in 0 1 2 3 ; do pd $u ; done
- for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
- for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
- for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
- for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done
- for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
- #
- # end of mkd
-
-With the device files and drivers in place, you can access PARIDE devices
-like any other Linux device. For example, to mount a CD-ROM in pcd0, use::
-
- mount /dev/pcd0 /cdrom
-
-If you have a fresh Avatar Shark cartridge, and the drive is pda, you
-might do something like::
-
- fdisk /dev/pda -- make a new partition table with
- partition 1 of type 83
-
- mke2fs /dev/pda1 -- to build the file system
-
- mkdir /shark -- make a place to mount the disk
-
- mount /dev/pda1 /shark
-
-Devices like the Imation superdisk work in the same way, except that
-they do not have a partition table. For example to make a 120MB
-floppy that you could share with a DOS system::
-
- mkdosfs /dev/pf0
- mount /dev/pf0 /mnt
-
-
-2.4 The pf driver
-------------------
-
-The pf driver is intended for use with parallel port ATAPI disk
-devices. The most common devices in this category are PD drives
-and LS-120 drives. Traditionally, media for these devices are not
-partitioned. Consequently, the pf driver does not support partitioned
-media. This may be changed in a future version of the driver.
-
-2.5 Using the pt driver
-------------------------
-
-The pt driver for parallel port ATAPI tape drives is a minimal driver.
-It does not yet support many of the standard tape ioctl operations.
-For best performance, a block size of 32KB should be used. You will
-probably want to set the parallel port delay to 0, if you can.
-
-2.6 Using the pg driver
-------------------------
-
-The pg driver can be used in conjunction with the cdrecord program
-to create CD-ROMs. Please get cdrecord version 1.6.1 or later
-from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
-your parallel port should ideally be set to EPP mode, and the "port delay"
-should be set to 0. With those settings it is possible to record at 2x
-speed without any buffer underruns. If you cannot get the driver to work
-in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.
+If you omit the parameters from the end, defaults will be used, e.g.:
+
+Probe all parports with all protocols::
+
+ # echo auto >/sys/bus/pata_parport/new_device
+
+Probe parport0 using protocol epat and mode 4 (EPP-16)::
+
+ # echo "parport0 epat 4" >/sys/bus/pata_parport/new_device
+
+Probe parport0 using all protocols::
+
+ # echo "parport0 auto" >/sys/bus/pata_parport/new_device
+
+Probe all parports using protocol epat::
+
+ # echo "auto epat" >/sys/bus/pata_parport/new_device
+
+Deleting devices::
+
+ # echo pata_parport.0 >/sys/bus/pata_parport/delete_device
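
For orientation, here is a minimal shell sketch tying the sysfs interface
above together; it assumes an EPAT-based drive is attached to parport0 and
uses only the module, attribute and device names already shown in this
document::

  # load the protocol driver; the base pata_parport module is pulled in as a dependency
  modprobe epat
  # create the device explicitly on parport0, probing all epat modes
  echo "parport0 epat" >/sys/bus/pata_parport/new_device
  # watch the kernel log for the newly attached libata/sd device
  dmesg | tail
  # tear the device down again when done
  echo pata_parport.0 >/sys/bus/pata_parport/delete_device
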
3. Troubleshooting
@@ -344,9 +169,9 @@ in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.
3.1 Use EPP mode if you can
----------------------------
-The most common problems that people report with the PARIDE drivers
+The most common problems that people report with the pata_parport drivers
concern the parallel port CMOS settings. At this time, none of the
-PARIDE protocol modules support ECP mode, or any ECP combination modes.
+protocol modules support ECP mode, or any ECP combination modes.
If you are able to do so, please set your parallel port into EPP mode
using your CMOS setup procedure.
@@ -354,17 +179,14 @@ using your CMOS setup procedure.
-------------------------
Some parallel ports cannot reliably transfer data at full speed. To
-offset the errors, the PARIDE protocol modules introduce a "port
+offset the errors, the protocol modules introduce a "port
delay" between each access to the i/o ports. Each protocol sets
a default value for this delay. In most cases, the user can override
the default and set it to 0 - resulting in somewhat higher transfer
rates. In some rare cases (especially with older 486 systems) the
default delays are not long enough. if you experience corrupt data
transfers, or unexpected failures, you may wish to increase the
-port delay. The delay can be programmed using the "driveN" parameters
-to each of the high-level drivers. Please see the notes above, or
-read the comments at the beginning of the driver source files in
-linux/drivers/block/paride.
+port delay.
3.3 Some drives need a printer reset
-------------------------------------
@@ -374,66 +196,12 @@ that do not always power up correctly. We have noticed this with some
drives based on OnSpec and older Freecom adapters. In these rare cases,
the adapter can often be reinitialised by issuing a "printer reset" on
the parallel port. As the reset operation is potentially disruptive in
-multiple device environments, the PARIDE drivers will not do it
+multiple device environments, the pata_parport drivers will not do it
automatically. You can however, force a printer reset by doing::
insmod lp reset=1
rmmod lp
If you have one of these marginal cases, you should probably build
-your paride drivers as modules, and arrange to do the printer reset
-before loading the PARIDE drivers.
-
-3.4 Use the verbose option and dmesg if you need help
-------------------------------------------------------
-
-While a lot of testing has gone into these drivers to make them work
-as smoothly as possible, problems will arise. If you do have problems,
-please check all the obvious things first: does the drive work in
-DOS with the manufacturer's drivers ? If that doesn't yield any useful
-clues, then please make sure that only one drive is hooked to your system,
-and that either (a) PARPORT is enabled or (b) no other device driver
-is using your parallel port (check in /proc/ioports). Then, load the
-appropriate drivers (you can load several protocol modules if you want)
-as in::
-
- # insmod paride
- # insmod epat
- # insmod bpck
- # insmod kbic
- ...
- # insmod pd verbose=1
-
-(using the correct driver for the type of device you have, of course).
-The verbose=1 parameter will cause the drivers to log a trace of their
-activity as they attempt to locate your drive.
-
-Use 'dmesg' to capture a log of all the PARIDE messages (any messages
-beginning with paride:, a protocol module's name or a driver's name) and
-include that with your bug report. You can submit a bug report in one
-of two ways. Either send it directly to the author of the PARIDE suite,
-by e-mail to grant@torque.net, or join the linux-parport mailing list
-and post your report there.
-
-3.5 For more information or help
----------------------------------
-
-You can join the linux-parport mailing list by sending a mail message
-to:
-
- linux-parport-request@torque.net
-
-with the single word::
-
- subscribe
-
-in the body of the mail message (not in the subject line). Please be
-sure that your mail program is correctly set up when you do this, as
-the list manager is a robot that will subscribe you using the reply
-address in your mail headers. REMOVE any anti-spam gimmicks you may
-have in your mail headers, when sending mail to the list server.
-
-You might also find some useful information on the linux-parport
-web pages (although they are not always up to date) at
-
- http://web.archive.org/web/%2E/http://www.torque.net/parport/
+your pata_parport drivers as modules, and arrange to do the printer reset
+before loading the pata_parport drivers.
diff --git a/Documentation/admin-guide/blockdev/ramdisk.rst b/Documentation/admin-guide/blockdev/ramdisk.rst
index b7c2268f8dec..9ce6101e8dd9 100644
--- a/Documentation/admin-guide/blockdev/ramdisk.rst
+++ b/Documentation/admin-guide/blockdev/ramdisk.rst
@@ -6,7 +6,7 @@ Using the RAM disk block device with Linux
1) Overview
2) Kernel Command Line Parameters
- 3) Using "rdev -r"
+ 3) Using "rdev"
4) An Example of Creating a Compressed RAM Disk
@@ -59,51 +59,27 @@ default is 4096 (4 MB).
rd_size
See ramdisk_size.
-3) Using "rdev -r"
-------------------
+3) Using "rdev"
+---------------
-The usage of the word (two bytes) that "rdev -r" sets in the kernel image is
-as follows. The low 11 bits (0 -> 10) specify an offset (in 1 k blocks) of up
-to 2 MB (2^11) of where to find the RAM disk (this used to be the size). Bit
-14 indicates that a RAM disk is to be loaded, and bit 15 indicates whether a
-prompt/wait sequence is to be given before trying to read the RAM disk. Since
-the RAM disk dynamically grows as data is being written into it, a size field
-is not required. Bits 11 to 13 are not currently used and may as well be zero.
-These numbers are no magical secrets, as seen below::
+"rdev" is an obsolete, deprecated, antiquated utility that could be used
+to set the boot device in a Linux kernel image.
- ./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
- ./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000
- ./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000
+Instead of using rdev, just place the boot device information on the
+kernel command line and pass it to the kernel from the bootloader.
-Consider a typical two floppy disk setup, where you will have the
-kernel on disk one, and have already put a RAM disk image onto disk #2.
+You can also pass arguments to the kernel by setting FDARGS in
+arch/x86/boot/Makefile and specify an initrd image by setting FDINITRD in
+arch/x86/boot/Makefile.
-Hence you want to set bits 0 to 13 as 0, meaning that your RAM disk
-starts at an offset of 0 kB from the beginning of the floppy.
-The command line equivalent is: "ramdisk_start=0"
+Some of the kernel command line boot options that may apply here are::
-You want bit 14 as one, indicating that a RAM disk is to be loaded.
-The command line equivalent is: "load_ramdisk=1"
-
-You want bit 15 as one, indicating that you want a prompt/keypress
-sequence so that you have a chance to switch floppy disks.
-The command line equivalent is: "prompt_ramdisk=1"
-
-Putting that together gives 2^15 + 2^14 + 0 = 49152 for an rdev word.
-So to create disk one of the set, you would do::
-
- /usr/src/linux# cat arch/x86/boot/zImage > /dev/fd0
- /usr/src/linux# rdev /dev/fd0 /dev/fd0
- /usr/src/linux# rdev -r /dev/fd0 49152
+ ramdisk_start=N
+ ramdisk_size=M
If you make a boot disk that has LILO, then for the above, you would use::
- append = "ramdisk_start=0 load_ramdisk=1 prompt_ramdisk=1"
-
-Since the default start = 0 and the default prompt = 1, you could use::
-
- append = "load_ramdisk=1"
-
+ append = "ramdisk_start=N ramdisk_size=M"
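
As a purely illustrative sketch of that advice, a complete LILO stanza (the
image path and root device here are assumptions, not values taken from this
document) would carry the same options in its append line::

  image  = /boot/zImage
  root   = /dev/ram0
  append = "ramdisk_start=N ramdisk_size=M"
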
4) An Example of Creating a Compressed RAM Disk
-----------------------------------------------
@@ -151,12 +127,9 @@ f) Put the RAM disk image onto the floppy, after the kernel. Use an offset
dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400
-g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
- For prompt_ramdisk=1, load_ramdisk=1, ramdisk_start=400, one would
- have 2^15 + 2^14 + 400 = 49552::
-
- rdev /dev/fd0 /dev/fd0
- rdev -r /dev/fd0 49552
+g) Make sure that you have already specified the boot information in
+ FDARGS and FDINITRD or that you use a bootloader to pass kernel
+ command line boot options to the kernel.
That is it. You now have your boot/root compressed RAM disk floppy. Some
users may wish to combine steps (d) and (f) by using a pipe.
@@ -167,11 +140,14 @@ users may wish to combine steps (d) and (f) by using a pipe.
Changelog:
----------
+SEPT-2020 :
+
+ Removed usage of "rdev"
+
10-22-04 :
Updated to reflect changes in command line options, remove
obsolete references, general cleanup.
James Nelson (james4765@gmail.com)
-
12-95 :
Original Document
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index a6fd1f9b5faf..ee2b0030d416 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -266,6 +266,7 @@ line of text and contains the following stats separated by whitespace:
No memory is allocated for such pages.
pages_compacted the number of pages freed during compaction
huge_pages the number of incompressible pages
+ huge_pages_since the number of incompressible pages since zram was set up
================ =============================================================
File /sys/block/zram<id>/bd_stat
@@ -314,8 +315,8 @@ To use the feature, admin should set up backing device via::
echo /dev/sda5 > /sys/block/zramX/backing_dev
-before disksize setting. It supports only partition at this moment.
-If admin wants to use incompressible page writeback, they could do via::
+before disksize setting. It supports only partitions at this moment.
+If admin wants to use incompressible page writeback, they could do it via::
echo huge > /sys/block/zramX/writeback
@@ -327,12 +328,35 @@ as idle::
From now on, any pages on zram are idle pages. The idle mark
will be removed until someone requests access of the block.
IOW, unless there is access request, those pages are still idle pages.
+Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be
+marked as idle based on how long (in seconds) it's been since they were
+last accessed::
+
+ echo 86400 > /sys/block/zramX/idle
+
+In this example all pages which haven't been accessed in more than 86400
+seconds (one day) will be marked idle.
Admin can request writeback of those idle pages at right timing via::
echo idle > /sys/block/zramX/writeback
-With the command, zram writeback idle pages from memory to the storage.
+With the command, zram will write back idle pages from memory to the storage.
+
+Additionally, if a user chooses to write back only huge and idle pages,
+this can be accomplished with::
+
+ echo huge_idle > /sys/block/zramX/writeback
+
+If a user chooses to write back only incompressible pages (pages that none
+of the algorithms can compress), this can be accomplished with::
+
+ echo incompressible > /sys/block/zramX/writeback
+
+If an admin wants to write a specific page in zram device to the backing device,
+they could write a page index into the interface::
+
+ echo "page_index=1251" > /sys/block/zramX/writeback
If there are lots of write IO with flash device, potentially, it has
flash wearout problem so that admin needs to design write limitation
@@ -340,7 +364,7 @@ to guarantee storage health for entire product life.
To overcome the concern, zram supports "writeback_limit" feature.
The "writeback_limit_enable"'s default value is 0 so that it doesn't limit
-any writeback. IOW, if admin wants to apply writeback budget, he should
+any writeback. IOW, if admin wants to apply writeback budget, they should
enable writeback_limit_enable via::
$ echo 1 > /sys/block/zramX/writeback_limit_enable
@@ -351,7 +375,7 @@ until admin sets the budget via /sys/block/zramX/writeback_limit.
(If admin doesn't enable writeback_limit_enable, writeback_limit's value
assigned via /sys/block/zramX/writeback_limit is meaningless.)
-If admin want to limit writeback as per-day 400M, he could do it
+If admin wants to limit writeback to 400M per day, they could do it
like below::
$ MB_SHIFT=20
@@ -360,17 +384,17 @@ like below::
/sys/block/zram0/writeback_limit.
$ echo 1 > /sys/block/zram0/writeback_limit_enable
-If admins want to allow further write again once the bugdet is exhausted,
-he could do it like below::
+If admins want to allow further writes again once the budget is exhausted,
+they could do it like below::
$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit
-If admin wants to see remaining writeback budget since last set::
+If an admin wants to see the remaining writeback budget since last set::
$ cat /sys/block/zramX/writeback_limit
-If admin want to disable writeback limit, he could do::
+If an admin wants to disable writeback limit, they could do::
$ echo 0 > /sys/block/zramX/writeback_limit_enable
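
For clarity, the 400M per-day budget arithmetic above can be restated with
valid shell variable names (a sketch; as the example implies, the limit is
counted in 4KiB page units)::

  MB_SHIFT=20
  PAGE_SHIFT=12        # the "4K_SHIFT" used in the example above
  echo $(( 400 << MB_SHIFT >> PAGE_SHIFT )) > /sys/block/zram0/writeback_limit   # 102400 pages
  echo 1 > /sys/block/zram0/writeback_limit_enable
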
@@ -379,9 +403,90 @@ system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of
writeback happened until you reset the zram to allocate extra writeback
budget in next setting is user's job.
-If admin wants to measure writeback count in a certain period, he could
+If admin wants to measure writeback count in a certain period, they could
know it via /sys/block/zram0/bd_stat's 3rd column.
+recompression
+-------------
+
+With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative
+(secondary) compression algorithms. The basic idea is that an alternative
+compression algorithm can provide a better compression ratio at the price of
+(potentially) slower compression/decompression speeds. An alternative
+compression algorithm can, for example, be more successful at compressing
+huge pages (those that the default algorithm failed to compress). Another
+application is idle page recompression - pages that are cold and sit in
+memory can be recompressed using a more effective algorithm and, hence,
+reduce zsmalloc memory usage.
+
+With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms:
+one primary and up to 3 secondary ones. The primary zram compressor is
+explained in "3) Select compression algorithm", while secondary algorithms
+are configured using the recomp_algorithm device attribute.
+
+Example::
+
+ #show supported recompression algorithms
+ cat /sys/block/zramX/recomp_algorithm
+ #1: lzo lzo-rle lz4 lz4hc [zstd]
+ #2: lzo lzo-rle lz4 [lz4hc] zstd
+
+Alternative compression algorithms are sorted by priority. In the example
+above, zstd is used as the first alternative algorithm, which has a priority
+of 1, while lz4hc is configured as a compression algorithm with priority 2.
+An alternative compression algorithm's priority is provided during algorithm
+configuration::
+
+ #select zstd recompression algorithm, priority 1
+ echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
+
+ #select deflate recompression algorithm, priority 2
+ echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm
+
+Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress,
+which controls recompression.
+
+Examples::
+
+ #IDLE pages recompression is activated by `idle` mode
+ echo "type=idle" > /sys/block/zramX/recompress
+
+ #HUGE pages recompression is activated by `huge` mode
+ echo "type=huge" > /sys/block/zram0/recompress
+
+ #HUGE_IDLE pages recompression is activated by `huge_idle` mode
+ echo "type=huge_idle" > /sys/block/zramX/recompress
+
+The number of idle pages can be significant, so user-space can pass a size
+threshold (in bytes) to the recompress knob: zram will recompress only pages
+of equal or greater size::
+
+ #recompress all pages larger than 3000 bytes
+ echo "threshold=3000" > /sys/block/zramX/recompress
+
+ #recompress idle pages larger than 2000 bytes
+ echo "type=idle threshold=2000" > /sys/block/zramX/recompress
+
+Recompression of idle pages requires memory tracking.
+
+During re-compression of every page that matches the re-compression criteria,
+ZRAM iterates the list of registered alternative compression algorithms in
+order of their priorities. ZRAM stops either when re-compression was
+successful (the re-compressed object is smaller in size than the original
+one) and matches the re-compression criteria (e.g. size threshold), or when
+there are no secondary algorithms left to try. If none of the secondary
+algorithms can successfully re-compress the page, such a page is marked as
+incompressible, so ZRAM will not attempt to re-compress it in the future.
+
+This re-compression behaviour, iterating through the list of registered
+compression algorithms, increases the chances of finding an algorithm that
+successfully compresses a particular page. Sometimes, however, it is
+convenient (and sometimes even necessary) to limit recompression to only one
+particular algorithm so that it will not try any other algorithms.
+This can be achieved by providing an algo=NAME parameter::
+
+ #use zstd algorithm only (if registered)
+ echo "type=huge algo=zstd" > /sys/block/zramX/recompress
+
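
A combined sketch of the attributes above, assuming zram0 is already set up
and that the type=, threshold= and algo= parameters can be combined as the
individual examples suggest::

  # register zstd as the highest-priority secondary algorithm
  echo "algo=zstd priority=1" > /sys/block/zram0/recomp_algorithm
  # mark everything currently stored as idle, then recompress the larger cold pages with zstd
  echo all > /sys/block/zram0/idle
  echo "type=idle threshold=3000 algo=zstd" > /sys/block/zram0/recompress
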
memory tracking
===============
@@ -392,9 +497,11 @@ pages of the process with*pagemap.
If you enable the feature, you could see block state via
/sys/kernel/debug/zram/zram0/block_state". The output is as follows::
- 300 75.033841 .wh.
- 301 63.806904 s...
- 302 63.806919 ..hi
+ 300 75.033841 .wh...
+ 301 63.806904 s.....
+ 302 63.806919 ..hi..
+ 303 62.801919 ....r.
+ 304 146.781902 ..hi.n
First column
zram's block index.
@@ -411,6 +518,10 @@ Third column
huge page
i:
idle page
+ r:
+ recompressed page (secondary compression algorithm)
+ n:
+ none of the algorithms (including secondary ones) could compress it
First line of above example says 300th block is accessed at 75.033841sec
and the block's state is huge so it is written back to the backing
diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst
index d6b3b77a4129..91339efdcb54 100644
--- a/Documentation/admin-guide/bootconfig.rst
+++ b/Documentation/admin-guide/bootconfig.rst
@@ -71,6 +71,16 @@ For example,::
foo = bar, baz
foo = qux # !ERROR! we can not re-define same key
+If you want to update the value, you must use the override operator
+``:=`` explicitly. For example::
+
+ foo = bar, baz
+ foo := qux
+
+then ``qux`` is assigned to the ``foo`` key. This is useful for
+overriding the default value by adding (partial) custom bootconfigs
+without parsing the default bootconfig.
+
If you want to append the value to existing key as an array member,
you can use ``+=`` operator. For example::
@@ -79,12 +89,35 @@ you can use ``+=`` operator. For example::
In this case, the key ``foo`` has ``bar``, ``baz`` and ``qux``.
-However, a sub-key and a value can not co-exist under a parent key.
-For example, following config is NOT allowed.::
+Moreover, sub-keys and a value can coexist under a parent key.
+For example, the following config is allowed::
foo = value1
- foo.bar = value2 # !ERROR! subkey "bar" and value "value1" can NOT co-exist
+ foo.bar = value2
+ foo := value3 # This will update foo's value.
+
+Note that, since there is no syntax to put a raw value directly under a
+structured key, you have to define it outside of the brace. For example::
+
+ foo {
+ bar = value1
+ bar {
+ baz = value2
+ qux = value3
+ }
+ }
+
+Also, the order of the value node under a key is fixed. If there
+are a value and subkeys, the value is always the first child node
+of the key. Thus if the user specifies subkeys first, e.g.::
+
+ foo.bar = value1
+ foo = value2
+
+then, in the program (and /proc/bootconfig), it will be shown as below::
+
+ foo = value2
+ foo.bar = value1
Comments
--------
@@ -125,18 +158,33 @@ Each key-value pair is shown in each line with following style::
Boot Kernel With a Boot Config
==============================
-Since the boot configuration file is loaded with initrd, it will be added
-to the end of the initrd (initramfs) image file with size, checksum and
-12-byte magic word as below.
+There are two options to boot the kernel with bootconfig: attaching the
+bootconfig to the initrd image or embedding it in the kernel itself.
-[initrd][bootconfig][size(u32)][checksum(u32)][#BOOTCONFIG\n]
+Attaching a Boot Config to Initrd
+---------------------------------
+
+Since the boot configuration file is loaded with initrd by default,
+it will be added to the end of the initrd (initramfs) image file with
+padding, size, checksum and a 12-byte magic word as below.
+
+[initrd][bootconfig][padding][size(le32)][checksum(le32)][#BOOTCONFIG\n]
+
+The size and checksum fields are unsigned 32-bit little endian values.
+
+When the boot configuration is added to the initrd image, the total
+file size is aligned to 4 bytes. To fill the gap, null characters
+(``\0``) will be added. Thus the ``size`` is the length of the bootconfig
+file + padding bytes.
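
As a quick sanity check of that layout, the trailing fields of a prepared
initrd can be inspected from the shell (the file name is an assumption)::

  tail -c 12 initrd.img                      # prints "#BOOTCONFIG" plus a newline (the 12-byte magic word)
  tail -c 20 initrd.img | head -c 8 | xxd    # size and checksum, two little-endian u32 values
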
The Linux kernel decodes the last part of the initrd image in memory to
get the boot configuration data.
Because of this "piggyback" method, there is no need to change or
-update the boot loader and the kernel image itself.
+update the boot loader and the kernel image itself as long as the boot
+loader passes the correct initrd file size. If, by any chance, the boot
+loader passes a longer size, the kernel fails to find the bootconfig data.
-To do this operation, Linux kernel provides "bootconfig" command under
+To do this operation, the Linux kernel provides the ``bootconfig`` command under
tools/bootconfig, which allows admin to apply or delete the config file
to/from initrd image. You can build it by the following command::
@@ -153,6 +201,66 @@ To remove the config from the image, you can use -d option as below::
Then add "bootconfig" on the normal kernel command line to tell the
kernel to look for the bootconfig at the end of the initrd file.
+Alternatively, build your kernel with the ``CONFIG_BOOT_CONFIG_FORCE``
+Kconfig option selected.
+
+Embedding a Boot Config into Kernel
+-----------------------------------
+
+If you cannot use initrd, you can also embed the bootconfig file in the
+kernel via Kconfig options. In this case, you need to recompile the kernel
+with the following configs::
+
+ CONFIG_BOOT_CONFIG_EMBED=y
+ CONFIG_BOOT_CONFIG_EMBED_FILE="/PATH/TO/BOOTCONFIG/FILE"
+
+``CONFIG_BOOT_CONFIG_EMBED_FILE`` requires an absolute path or a relative
+path to the bootconfig file from the source tree or object tree.
+The kernel will embed it as the default bootconfig.
+
+Just as when attaching the bootconfig to the initrd, you need the ``bootconfig``
+option on the kernel command line to enable the embedded bootconfig, or,
+alternatively, build your kernel with the ``CONFIG_BOOT_CONFIG_FORCE``
+Kconfig option selected.
+
+Note that even if you set this option, you can override the embedded
+bootconfig with another bootconfig attached to the initrd.
+
+Kernel parameters via Boot Config
+=================================
+
+In addition to the kernel command line, the boot config can be used for
+passing kernel parameters. All the key-value pairs under the ``kernel``
+key will be passed to the kernel cmdline directly. Moreover, the key-value
+pairs under ``init`` will be passed to the init process via the cmdline.
+The parameters are concatenated with the user-given kernel cmdline string
+in the following order, so that the command line parameters can override
+bootconfig parameters (this depends on how the subsystem handles parameters,
+but in general, an earlier parameter will be overwritten by a later one)::
+
+ [bootconfig params][cmdline params] -- [bootconfig init params][cmdline init params]
+
+Here is an example of a bootconfig file for kernel/init parameters::
+
+ kernel {
+ root = 01234567-89ab-cdef-0123-456789abcd
+ }
+ init {
+ splash
+ }
+
+This will be copied into the kernel cmdline string as the following::
+
+ root="01234567-89ab-cdef-0123-456789abcd" -- splash
+
+If the user gives some other command line like::
+
+ ro bootconfig -- quiet
+
+The final kernel cmdline will be the following::
+
+ root="01234567-89ab-cdef-0123-456789abcd" ro bootconfig -- splash quiet
+
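
One simple way to check the result after boot is to look at what the kernel
actually parsed; /proc/bootconfig (mentioned above) shows the accepted
key-value pairs, and /proc/cmdline shows the command line as seen at runtime::

  cat /proc/bootconfig
  cat /proc/cmdline
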
Config File Limitation
======================
@@ -165,7 +273,8 @@ up to 512 key-value pairs. If keys contains 3 words in average, it can
contain 256 key-value pairs. In most cases, the number of config items
will be under 100 entries and smaller than 8KB, so it would be enough.
If the node number exceeds 1024, parser returns an error even if the file
-size is smaller than 32KB.
+size is smaller than 32KB. (Note that this maximum size does not include
+the padding null characters.)
Anyway, since bootconfig command verifies it when appending a boot config
to initrd image, user can notice it before boot.
diff --git a/Documentation/admin-guide/bug-bisect.rst b/Documentation/admin-guide/bug-bisect.rst
index 59567da344e8..325c5d0ed34a 100644
--- a/Documentation/admin-guide/bug-bisect.rst
+++ b/Documentation/admin-guide/bug-bisect.rst
@@ -15,7 +15,7 @@ give up. Report as much as you have found to the relevant maintainer. See
MAINTAINERS for who that is for the subsystem you have worked on.
Before you submit a bug report read
-:ref:`Documentation/admin-guide/reporting-bugs.rst <reportingbugs>`.
+'Documentation/admin-guide/reporting-issues.rst'.
Devices not appearing
=====================
diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst
index f7c80f4649fc..95299b08c405 100644
--- a/Documentation/admin-guide/bug-hunting.rst
+++ b/Documentation/admin-guide/bug-hunting.rst
@@ -263,7 +263,7 @@ Please notice that it will point to:
- The last developers that touched the source code (if this is done inside
a git tree). On the above example, Tejun and Bhaktipriya (in this
- specific case, none really envolved on the development of this file);
+ specific case, none really involved on the development of this file);
- The driver maintainer (Hans Verkuil);
- The subsystem maintainer (Mauro Carvalho Chehab);
- The driver and/or subsystem mailing list (linux-media@vger.kernel.org);
diff --git a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst
index 36d43ae7dc13..dabb80cdd25a 100644
--- a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst
+++ b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst
@@ -17,36 +17,37 @@ level logical devices like device mapper.
HOWTO
=====
+
Throttling/Upper Limit policy
-----------------------------
-- Enable Block IO controller::
+Enable Block IO controller::
CONFIG_BLK_CGROUP=y
-- Enable throttling in block layer::
+Enable throttling in block layer::
CONFIG_BLK_DEV_THROTTLING=y
-- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::
+Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
-- Specify a bandwidth rate on particular device for root group. The format
- for policy is "<major>:<minor> <bytes_per_second>"::
+Specify a bandwidth rate on a particular device for the root group. The format
+for policy is "<major>:<minor> <bytes_per_second>"::
echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
- Above will put a limit of 1MB/second on reads happening for root group
- on device having major/minor number 8:16.
+This will put a limit of 1MB/second on reads happening for the root group
+on the device having major/minor number 8:16.
-- Run dd to read a file and see if rate is throttled to 1MB/s or not::
+Run dd to read a file and see if the rate is throttled to 1MB/s or not::
# dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
- Limits for writes can be put using blkio.throttle.write_bps_device file.
+Limits for writes can be put using blkio.throttle.write_bps_device file.
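
Analogously, a sketch of a write limit; the 2MB/s figure and the test file
path are arbitrary examples, not values from this document::

  echo "8:16 2097152" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
  dd oflag=direct if=/dev/zero of=/mnt/common/testfile bs=4K count=1024
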
Hierarchical Cgroups
====================
@@ -79,85 +80,89 @@ following::
Various user visible config options
===================================
-CONFIG_BLK_CGROUP
- - Block IO controller.
-CONFIG_BFQ_CGROUP_DEBUG
- - Debug help. Right now some additional stats file show up in cgroup
+ CONFIG_BLK_CGROUP
+ Block IO controller.
+
+ CONFIG_BFQ_CGROUP_DEBUG
+ Debug help. Right now some additional stats file show up in cgroup
if this option is enabled.
-CONFIG_BLK_DEV_THROTTLING
- - Enable block device throttling support in block layer.
+ CONFIG_BLK_DEV_THROTTLING
+ Enable block device throttling support in block layer.
Details of cgroup files
=======================
+
Proportional weight policy files
--------------------------------
-- blkio.weight
- - Specifies per cgroup weight. This is default weight of the group
- on all the devices until and unless overridden by per device rule.
- (See blkio.weight_device).
- Currently allowed range of weights is from 10 to 1000.
-- blkio.weight_device
- - One can specify per cgroup per device rules using this interface.
- These rules override the default value of group weight as specified
- by blkio.weight.
+ blkio.bfq.weight
+ Specifies per cgroup weight. This is default weight of the group
+ on all the devices until and unless overridden by per device rule
+ (see `blkio.bfq.weight_device` below).
+
+ Currently allowed range of weights is from 1 to 1000. For more details,
+ see Documentation/block/bfq-iosched.rst.
+
+ blkio.bfq.weight_device
+ Specifies per cgroup per device weights, overriding the default group
+ weight. For more details, see Documentation/block/bfq-iosched.rst.
Following is the format::
- # echo dev_maj:dev_minor weight > blkio.weight_device
+ # echo dev_maj:dev_minor weight > blkio.bfq.weight_device
Configure weight=300 on /dev/sdb (8:16) in this cgroup::
- # echo 8:16 300 > blkio.weight_device
- # cat blkio.weight_device
+ # echo 8:16 300 > blkio.bfq.weight_device
+ # cat blkio.bfq.weight_device
dev weight
8:16 300
Configure weight=500 on /dev/sda (8:0) in this cgroup::
- # echo 8:0 500 > blkio.weight_device
- # cat blkio.weight_device
+ # echo 8:0 500 > blkio.bfq.weight_device
+ # cat blkio.bfq.weight_device
dev weight
8:0 500
8:16 300
Remove specific weight for /dev/sda in this cgroup::
- # echo 8:0 0 > blkio.weight_device
- # cat blkio.weight_device
+ # echo 8:0 0 > blkio.bfq.weight_device
+ # cat blkio.bfq.weight_device
dev weight
8:16 300
-- blkio.time
- - disk time allocated to cgroup per device in milliseconds. First
+ blkio.time
+ Disk time allocated to cgroup per device in milliseconds. First
two fields specify the major and minor number of the device and
third field specifies the disk time allocated to group in
milliseconds.
-- blkio.sectors
- - number of sectors transferred to/from disk by the group. First
+ blkio.sectors
+ Number of sectors transferred to/from disk by the group. First
two fields specify the major and minor number of the device and
third field specifies the number of sectors transferred by the
group to/from the device.
-- blkio.io_service_bytes
- - Number of bytes transferred to/from the disk by the group. These
+ blkio.io_service_bytes
+ Number of bytes transferred to/from the disk by the group. These
are further divided by the type of operation - read or write, sync
or async. First two fields specify the major and minor number of the
device, third field specifies the operation type and the fourth field
specifies the number of bytes.
-- blkio.io_serviced
- - Number of IOs (bio) issued to the disk by the group. These
+ blkio.io_serviced
+ Number of IOs (bio) issued to the disk by the group. These
are further divided by the type of operation - read or write, sync
or async. First two fields specify the major and minor number of the
device, third field specifies the operation type and the fourth field
specifies the number of IOs.
-- blkio.io_service_time
- - Total amount of time between request dispatch and request completion
+ blkio.io_service_time
+ Total amount of time between request dispatch and request completion
for the IOs done by this cgroup. This is in nanoseconds to make it
meaningful for flash devices too. For devices with queue depth of 1,
this time represents the actual service time. When queue_depth > 1,
@@ -170,8 +175,8 @@ Proportional weight policy files
specifies the operation type and the fourth field specifies the
io_service_time in ns.
-- blkio.io_wait_time
- - Total amount of time the IOs for this cgroup spent waiting in the
+ blkio.io_wait_time
+ Total amount of time the IOs for this cgroup spent waiting in the
scheduler queues for service. This can be greater than the total time
elapsed since it is cumulative io_wait_time for all IOs. It is not a
measure of total time the cgroup spent waiting but rather a measure of
@@ -185,24 +190,24 @@ Proportional weight policy files
minor number of the device, third field specifies the operation type
and the fourth field specifies the io_wait_time in ns.
-- blkio.io_merged
- - Total number of bios/requests merged into requests belonging to this
+ blkio.io_merged
+ Total number of bios/requests merged into requests belonging to this
cgroup. This is further divided by the type of operation - read or
write, sync or async.
-- blkio.io_queued
- - Total number of requests queued up at any given instant for this
+ blkio.io_queued
+ Total number of requests queued up at any given instant for this
cgroup. This is further divided by the type of operation - read or
write, sync or async.
-- blkio.avg_queue_size
- - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
+ blkio.avg_queue_size
+ Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
The average queue size for this cgroup over the entire time of this
cgroup's existence. Queue size samples are taken each time one of the
queues of this cgroup gets a timeslice.
-- blkio.group_wait_time
- - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
+ blkio.group_wait_time
+ Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time the cgroup had to wait since it became busy
(i.e., went from 0 to 1 request queued) to get a timeslice for one of
its queues. This is different from the io_wait_time which is the
@@ -212,8 +217,8 @@ Proportional weight policy files
will only report the group_wait_time accumulated till the last time it
got a timeslice and will not include the current delta.
-- blkio.empty_time
- - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
+ blkio.empty_time
+ Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time a cgroup spends without any pending
requests when not being served, i.e., it does not include any time
spent idling for one of the queues of the cgroup. This is in
@@ -221,8 +226,8 @@ Proportional weight policy files
the stat will only report the empty_time accumulated till the last
time it had a pending request and will not include the current delta.
-- blkio.idle_time
- - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
+ blkio.idle_time
+ Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time spent by the IO scheduler idling for a
given cgroup in anticipation of a better request than the existing ones
from other queues/cgroups. This is in nanoseconds. If this is read
@@ -230,60 +235,60 @@ Proportional weight policy files
idle_time accumulated till the last idle period and will not include
the current delta.
-- blkio.dequeue
- - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
+ blkio.dequeue
+ Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
gives the statistics about how many a times a group was dequeued
from service tree of the device. First two fields specify the major
and minor number of the device and third field specifies the number
of times a group was dequeued from a particular device.
-- blkio.*_recursive
- - Recursive version of various stats. These files show the
+ blkio.*_recursive
+ Recursive version of various stats. These files show the
same information as their non-recursive counterparts but
include stats from all the descendant cgroups.
Throttling/Upper limit policy files
-----------------------------------
-- blkio.throttle.read_bps_device
- - Specifies upper limit on READ rate from the device. IO rate is
+ blkio.throttle.read_bps_device
+ Specifies upper limit on READ rate from the device. IO rate is
specified in bytes per second. Rules are per device. Following is
the format::
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
-- blkio.throttle.write_bps_device
- - Specifies upper limit on WRITE rate to the device. IO rate is
+ blkio.throttle.write_bps_device
+ Specifies upper limit on WRITE rate to the device. IO rate is
specified in bytes per second. Rules are per device. Following is
the format::
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
-- blkio.throttle.read_iops_device
- - Specifies upper limit on READ rate from the device. IO rate is
+ blkio.throttle.read_iops_device
+ Specifies upper limit on READ rate from the device. IO rate is
specified in IO per second. Rules are per device. Following is
the format::
echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
-- blkio.throttle.write_iops_device
- - Specifies upper limit on WRITE rate to the device. IO rate is
+ blkio.throttle.write_iops_device
+ Specifies upper limit on WRITE rate to the device. IO rate is
specified in io per second. Rules are per device. Following is
the format::
echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
-Note: If both BW and IOPS rules are specified for a device, then IO is
- subjected to both the constraints.
+ Note: If both BW and IOPS rules are specified for a device, then IO is
+ subjected to both the constraints.
-- blkio.throttle.io_serviced
- - Number of IOs (bio) issued to the disk by the group. These
+ blkio.throttle.io_serviced
+ Number of IOs (bio) issued to the disk by the group. These
are further divided by the type of operation - read or write, sync
or async. First two fields specify the major and minor number of the
device, third field specifies the operation type and the fourth field
specifies the number of IOs.
-- blkio.throttle.io_service_bytes
- - Number of bytes transferred to/from the disk by the group. These
+ blkio.throttle.io_service_bytes
+ Number of bytes transferred to/from the disk by the group. These
are further divided by the type of operation - read or write, sync
or async. First two fields specify the major and minor number of the
device, third field specifies the operation type and the fourth field
@@ -291,6 +296,6 @@ Note: If both BW and IOPS rules are specified for a device, then IO is
Common files among various policies
-----------------------------------
-- blkio.reset_stats
- - Writing an int to this file will result in resetting all the stats
+ blkio.reset_stats
+ Writing an int to this file will result in resetting all the stats
for that cgroup.
diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst
index b0688011ed06..9343148ee993 100644
--- a/Documentation/admin-guide/cgroup-v1/cgroups.rst
+++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst
@@ -80,6 +80,8 @@ access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rs
you to associate a set of CPUs and a set of memory nodes with the
tasks in each cgroup.
+.. _cgroups-why-needed:
+
1.2 Why are cgroups needed ?
----------------------------
diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 7ade3abd342a..7d3415eea05d 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -1,3 +1,5 @@
+.. _cpusets:
+
=======
CPUSETS
=======
@@ -177,7 +179,7 @@ files describing that cpuset:
- cpuset.mem_hardwall flag: is memory allocation hardwalled
- cpuset.memory_pressure: measure of how much paging pressure in cpuset
- cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
- - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
+ - cpuset.memory_spread_slab flag: OBSOLETE. Doesn't have any function.
- cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
- cpuset.sched_relax_domain_level: the searching range when migrating tasks
@@ -717,7 +719,7 @@ There are ways to query or modify cpusets:
cat, rmdir commands from the shell, or their equivalent from C.
- via the C library libcpuset.
- via the C library libcgroup.
- (http://sourceforge.net/projects/libcg/)
+ (https://github.com/libcgroup/libcgroup/)
- via the python application cset.
(http://code.google.com/p/cpuset/)
diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index 338f2c7d7a1c..493a8e386700 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -29,12 +29,14 @@ Brief summary of control files::
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB usage limit
+ hugetlb.<hugepagesize>.numa_stat # show the numa information of the hugetlb memory charged to this cgroup
For a system supporting three hugepage sizes (64k, 32M and 1G), the control
files include::
hugetlb.1GB.limit_in_bytes
hugetlb.1GB.max_usage_in_bytes
+ hugetlb.1GB.numa_stat
hugetlb.1GB.usage_in_bytes
hugetlb.1GB.failcnt
hugetlb.1GB.rsvd.limit_in_bytes
@@ -43,6 +45,7 @@ files include::
hugetlb.1GB.rsvd.failcnt
hugetlb.64KB.limit_in_bytes
hugetlb.64KB.max_usage_in_bytes
+ hugetlb.64KB.numa_stat
hugetlb.64KB.usage_in_bytes
hugetlb.64KB.failcnt
hugetlb.64KB.rsvd.limit_in_bytes
@@ -51,6 +54,7 @@ files include::
hugetlb.64KB.rsvd.failcnt
hugetlb.32MB.limit_in_bytes
hugetlb.32MB.max_usage_in_bytes
+ hugetlb.32MB.numa_stat
hugetlb.32MB.usage_in_bytes
hugetlb.32MB.failcnt
hugetlb.32MB.rsvd.limit_in_bytes
@@ -61,10 +65,12 @@ files include::
1. Page fault accounting
-hugetlb.<hugepagesize>.limit_in_bytes
-hugetlb.<hugepagesize>.max_usage_in_bytes
-hugetlb.<hugepagesize>.usage_in_bytes
-hugetlb.<hugepagesize>.failcnt
+::
+
+ hugetlb.<hugepagesize>.limit_in_bytes
+ hugetlb.<hugepagesize>.max_usage_in_bytes
+ hugetlb.<hugepagesize>.usage_in_bytes
+ hugetlb.<hugepagesize>.failcnt
The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per
control group and enforces the limit during page fault. Since HugeTLB
@@ -78,10 +84,12 @@ getting SIGBUS.
2. Reservation accounting
-hugetlb.<hugepagesize>.rsvd.limit_in_bytes
-hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
-hugetlb.<hugepagesize>.rsvd.usage_in_bytes
-hugetlb.<hugepagesize>.rsvd.failcnt
+::
+
+ hugetlb.<hugepagesize>.rsvd.limit_in_bytes
+ hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
+ hugetlb.<hugepagesize>.rsvd.usage_in_bytes
+ hugetlb.<hugepagesize>.rsvd.failcnt
The HugeTLB controller allows to limit the HugeTLB reservations per control
group and enforces the controller limit at reservation time and at the fault of
diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst
index 226f64473e8e..99fbc8a64ba9 100644
--- a/Documentation/admin-guide/cgroup-v1/index.rst
+++ b/Documentation/admin-guide/cgroup-v1/index.rst
@@ -17,6 +17,7 @@ Control Groups version 1
hugetlb
memcg_test
memory
+ misc
net_cls
net_prio
pids
diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 3f7115e07b5d..1f128458ddea 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -62,7 +62,7 @@ Please note that implementation details can be changed.
At cancel(), simply usage -= PAGE_SIZE.
-Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
+Under below explanation, we assume CONFIG_SWAP=y.
4. Anonymous
============
@@ -97,7 +97,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
=============
Page Cache is charged at
- - add_to_page_cache_locked().
+ - filemap_add_folio().
The logic is very clear. (About migration, see below)
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
8. LRU
======
- Each memcg has its own private LRU. Now, its handling is under global
- VM's control (means that it's handled under global pgdat->lru_lock).
- Almost all routines around memcg's LRU is called by global LRU's
- list management functions under pgdat->lru_lock.
-
- A special function is mem_cgroup_isolate_pages(). This scans
- memcg's private LRU and call __isolate_lru_page() to extract a page
- from LRU.
-
- (By __isolate_lru_page(), the page is removed from both of global and
- private LRU.)
-
+ Each memcg has its own vector of LRUs (inactive anon, active anon,
+ inactive file, active file, unevictable) of pages from each node,
+ each LRU handled under a single lru_lock for that memcg and node.
9. Typical Tests.
=================
@@ -219,13 +210,11 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
This is an easy way to test page migration, too.
-9.5 mkdir/rmdir
----------------
+9.5 nested cgroups
+------------------
- When using hierarchy, mkdir/rmdir test should be done.
- Use tests like the following::
+ Use tests like the following for testing nested cgroups::
- echo 1 >/opt/cgroup/01/memory/use_hierarchy
mkdir /opt/cgroup/01/child_a
mkdir /opt/cgroup/01/child_b
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 12757e63b26c..ca7d9402f6be 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -2,18 +2,18 @@
Memory Resource Controller
==========================
-NOTE:
+.. caution::
This document is hopelessly outdated and it asks for a complete
rewrite. It still contains a useful information so we are keeping it
here but make sure to check the current code if you need a deeper
understanding.
-NOTE:
+.. note::
The Memory Resource Controller has generically been referred to as the
memory controller in this document. Do not confuse memory controller
used here with the memory controller that is used in hardware.
-(For editors) In this document:
+.. hint::
When we mention a cgroup (cgroupfs's directory) with memory controller,
we call it "memory cgroup". When you see git-log and source code, you'll
see patch's title and function names tend to use "memcg".
@@ -23,7 +23,7 @@ Benefits and Purpose of the memory controller
=============================================
The memory controller isolates the memory behaviour of a group of tasks
-from the rest of the system. The article on LWN [12] mentions some probable
+from the rest of the system. The article on LWN [12]_ mentions some probable
uses of the memory controller. The memory controller can be used to
a. Isolate an application or a group of applications
@@ -55,7 +55,8 @@ Features:
- Root cgroup has no limit controls.
Kernel memory support is a work in progress, and the current version provides
- basically functionality. (See Section 2.7)
+ basic functionality. (See :ref:`section 2.7
+ <cgroup-v1-memory-kernel-extension>`)
Brief summary of control files.
@@ -64,6 +65,7 @@ Brief summary of control files.
threads
cgroup.procs show list of processes
cgroup.event_control an interface for event_fd()
+ This knob is not available on CONFIG_PREEMPT_RT systems.
memory.usage_in_bytes show current usage for memory
(See 5.5 for details)
memory.memsw.usage_in_bytes show current usage for memory+Swap
@@ -75,20 +77,28 @@ Brief summary of control files.
memory.max_usage_in_bytes show max memory usage recorded
memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded
memory.soft_limit_in_bytes set/show soft limit of memory usage
+ This knob is not available on CONFIG_PREEMPT_RT systems.
memory.stat show various statistics
memory.use_hierarchy set/show hierarchical account enabled
+ This knob is deprecated and shouldn't be
+ used.
memory.force_empty trigger forced page reclaim
memory.pressure_level set memory pressure notifications
memory.swappiness set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate set/show controls of moving charges
+ This knob is deprecated and shouldn't be
+ used.
memory.oom_control set/show oom controls.
memory.numa_stat show the number of memory usage per numa
node
- memory.kmem.limit_in_bytes set/show hard limit for kernel memory
- This knob is deprecated and shouldn't be
- used. It is planned that this be removed in
- the foreseeable future.
+ memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel
+ memory hard limit. Kernel hard limit is not
+ supported since 5.16. Writing any value to
+ this file will have no effect, same as if the
+ nokmem kernel parameter was specified.
+ Kernel memory is still charged and reported
+ by memory.kmem.usage_in_bytes.
memory.kmem.usage_in_bytes show current kernel memory allocation
memory.kmem.failcnt show the number of kernel memory usage
hits limits
@@ -105,16 +115,16 @@ Brief summary of control files.
==========
The memory controller has a long history. A request for comments for the memory
-controller was posted by Balbir Singh [1]. At the time the RFC was posted
+controller was posted by Balbir Singh [1]_. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
-for memory control. The first RSS controller was posted by Balbir Singh[2]
-in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
-RSS controller. At OLS, at the resource management BoF, everyone suggested
-that we handle both page cache and RSS together. Another request was raised
-to allow user space handling of OOM. The current memory controller is
+for memory control. The first RSS controller was posted by Balbir Singh [2]_
+in Feb 2007. Pavel Emelianov [3]_ [4]_ [5]_ has since posted three versions
+of the RSS controller. At OLS, at the resource management BoF, everyone
+suggested that we handle both page cache and RSS together. Another request was
+raised to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
-Cache Control [11].
+Cache Control [11]_.
2. Memory Control
=================
@@ -145,7 +155,8 @@ specific data structure (mem_cgroup) associated with it.
2.2. Accounting
---------------
-::
+.. code-block::
+ :caption: Figure 1: Hierarchy of Accounting
+--------------------+
| mem_cgroup |
@@ -165,7 +176,6 @@ specific data structure (mem_cgroup) associated with it.
| | | |
+---------------+ +---------------+
- (Figure 1: Hierarchy of Accounting)
Figure 1 shows the important aspects of the controller
@@ -192,11 +202,11 @@ are not accounted. We just account pages under usual VM management.
RSS pages are accounted at page_fault unless they've already been accounted
for earlier. A file page will be accounted for as Page Cache when it's
-inserted into inode (radix-tree). While it's mapped into the page tables of
+inserted into inode (xarray). While it's mapped into the page tables of
processes, duplicate accounting is carefully avoided.
An RSS page is unaccounted when it's fully unmapped. A PageCache page is
-unaccounted when it's removed from radix-tree. Even if RSS pages are fully
+unaccounted when it's removed from xarray. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is accounted after adding into swapcache.
@@ -219,8 +229,9 @@ behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).
-But see section 8.2: when moving a task to another cgroup, its pages may
-be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
+But see :ref:`section 8.2 <cgroup-v1-memory-movable-charges>`: when moving a
+task to another cgroup, its pages may be recharged to the new cgroup, if
+move_charge_at_immigrate has been chosen.
2.4 Swap Extension
--------------------------------------
@@ -242,7 +253,8 @@ In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.
-**why 'memory+swap' rather than swap**
+2.4.1 why 'memory+swap' rather than swap
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
@@ -250,7 +262,8 @@ memory+swap. In other words, when we want to limit the usage of swap without
affecting global LRU, memory+swap limit is better than just limiting swap from
an OS point of view.
-**What happens when a cgroup hits memory.memsw.limit_in_bytes**
+2.4.2. What happens when a cgroup hits memory.memsw.limit_in_bytes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by cgroup routine and file
@@ -266,41 +279,40 @@ global VM. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
-cgroup. (See 10. OOM Control below.)
+cgroup. (See :ref:`10. OOM Control <cgroup-v1-memory-oom-control>` below.)
The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.
-NOTE:
- Reclaim does not work for the root cgroup, since we cannot set any
- limits on the root cgroup.
+.. note::
+ Reclaim does not work for the root cgroup, since we cannot set any
+ limits on the root cgroup.
-Note2:
- When panic_on_oom is set to "2", the whole system will panic.
+.. note::
+ When panic_on_oom is set to "2", the whole system will panic.
When oom event notifier is registered, event will be delivered.
-(See oom_control section)
+(See :ref:`oom_control <cgroup-v1-memory-oom-control>` section)
2.6 Locking
-----------
- lock_page_cgroup()/unlock_page_cgroup() should not be called under
- the i_pages lock.
-
- Other lock order is following:
+Lock order is as follows::
- PG_locked.
- mm->page_table_lock
- pgdat->lru_lock
- lock_page_cgroup.
+ Page lock (PG_locked bit of page->flags)
+ mm->page_table_lock or split pte_lock
+ folio_memcg_lock (memcg->move_lock)
+ mapping->i_pages lock
+ lruvec->lru_lock.
- In many cases, just lock_page_cgroup() is called.
+Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
+lruvec->lru_lock; PG_lru bit of page->flags is cleared before
+isolating a page from its LRU under lruvec->lru_lock.
- per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
- pgdat->lru_lock, it has no lock of its own.
+.. _cgroup-v1-memory-kernel-extension:
-2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
+2.7 Kernel Memory Extension
-----------------------------------------------
With the Kernel memory extension, the Memory Controller is able to limit
@@ -361,17 +373,17 @@ U != 0, K = unlimited:
U != 0, K < U:
Kernel memory is a subset of the user memory. This setup is useful in
- deployments where the total amount of memory per-cgroup is overcommited.
- Overcommiting kernel memory limits is definitely not recommended, since the
+ deployments where the total amount of memory per-cgroup is overcommitted.
+ Overcommitting kernel memory limits is definitely not recommended, since the
box can still run out of non-reclaimable memory.
In this case, the admin could set up K so that the sum of all groups is
never greater than the total memory, and freely set U at the cost of his
QoS.
-WARNING:
- In the current implementation, memory reclaim will NOT be
- triggered for a cgroup when it hits K while staying below U, which makes
- this setup impractical.
+ .. warning::
+ In the current implementation, memory reclaim will NOT be triggered for
+ a cgroup when it hits K while staying below U, which makes this setup
+ impractical.
U != 0, K >= U:
Since kmem charges will also be fed to the user counter and reclaim will be
@@ -382,47 +394,41 @@ U != 0, K >= U:
3. User Interface
=================
-3.0. Configuration
-------------------
-
-a. Enable CONFIG_CGROUPS
-b. Enable CONFIG_MEMCG
-c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
-d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
+To use the user interface:
-3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
--------------------------------------------------------------------
-
-::
+1. Enable CONFIG_CGROUPS and CONFIG_MEMCG options
+2. Prepare the cgroups (see :ref:`Why are cgroups needed?
+ <cgroups-why-needed>` for the background information)::
# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory
-3.2. Make the new group and move bash into it::
+3. Make the new group and move bash into it::
# mkdir /sys/fs/cgroup/memory/0
# echo $$ > /sys/fs/cgroup/memory/0/tasks
-Since now we're in the 0 cgroup, we can alter the memory limit::
+4. Now that we're in the 0 cgroup, we can alter the memory limit::
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
-NOTE:
- We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
- mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
- Gibibytes.)
+ The limit can now be queried::
+
+ # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
+ 4194304
-NOTE:
- We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``.
+.. note::
+ We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
+ mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
+ Gibibytes.)
-NOTE:
- We cannot set limits on the root cgroup any more.
+.. note::
+ We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``.
-::
+.. note::
+ We cannot set limits on the root cgroup any more.
- # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
- 4194304
We can check the usage::
@@ -461,6 +467,8 @@ test because it has noise of shared objects/status.
But the above two are testing extreme situations.
Trying usual test under memory controller is always helpful.
+.. _cgroup-v1-memory-test-troubleshoot:
+
4.1 Troubleshooting
-------------------
@@ -473,8 +481,11 @@ terminated by the OOM killer. There are several causes for this:
A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
some of the pages cached in the cgroup (page cache pages).
-To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
-seeing what happens will be helpful.
+To know what happens, disabling OOM_Kill as per :ref:`"10. OOM Control"
+<cgroup-v1-memory-oom-control>` (below) and seeing what happens will be
+helpful.
+
+.. _cgroup-v1-memory-test-task-migration:
4.2 Task migration
------------------
@@ -485,26 +496,24 @@ remain charged to it, the charge is dropped when the page is freed or
reclaimed.
You can move charges of a task along with task migration.
-See 8. "Move charges at task migration"
+See :ref:`8. "Move charges at task migration" <cgroup-v1-memory-move-charges>`
4.3 Removing a cgroup
---------------------
-A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
-cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. (because we charge against pages, not
-against tasks.)
+A cgroup can be removed by rmdir, but as discussed in :ref:`sections 4.1
+<cgroup-v1-memory-test-troubleshoot>` and :ref:`4.2
+<cgroup-v1-memory-test-task-migration>`, a cgroup might have some charge
+associated with it, even though all tasks have migrated away from it. (because
+we charge against pages, not against tasks.)
-We move the stats to root (if use_hierarchy==0) or parent (if
-use_hierarchy==1), and no change on the charge except uncharging
+We move the stats to the parent, and nothing changes in the charge except uncharging
from the child.
Charges recorded in swap information is not updated at removal of cgroup.
Recorded information is discarded and a cgroup which uses swap (swapcache)
will be charged as a new owner of it.
-About use_hierarchy, see Section 6.
-
5. Misc. interfaces
===================
@@ -522,76 +531,70 @@ About use_hierarchy, see Section 6.
charged file caches. Some out-of-use page caches may keep charged until
memory pressure happens. If you want to avoid that, force_empty will be useful.
- Also, note that when memory.kmem.limit_in_bytes is set the charges due to
- kernel pages will still be seen. This is not considered a failure and the
- write will still return success. In this case, it is expected that
- memory.kmem.usage_in_bytes == memory.usage_in_bytes.
-
- About use_hierarchy, see Section 6.
-
5.2 stat file
-------------
-memory.stat file includes following statistics
-
-per-memory cgroup local status
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-=============== ===============================================================
-cache # of bytes of page cache memory.
-rss # of bytes of anonymous and swap cache memory (includes
- transparent hugepages).
-rss_huge # of bytes of anonymous transparent hugepages.
-mapped_file # of bytes of mapped file (includes tmpfs/shmem)
-pgpgin # of charging events to the memory cgroup. The charging
- event happens each time a page is accounted as either mapped
- anon page(RSS) or cache page(Page Cache) to the cgroup.
-pgpgout # of uncharging events to the memory cgroup. The uncharging
- event happens each time a page is unaccounted from the cgroup.
-swap # of bytes of swap usage
-dirty # of bytes that are waiting to get written back to the disk.
-writeback # of bytes of file/anon cache that are queued for syncing to
- disk.
-inactive_anon # of bytes of anonymous and swap cache memory on inactive
- LRU list.
-active_anon # of bytes of anonymous and swap cache memory on active
- LRU list.
-inactive_file # of bytes of file-backed memory on inactive LRU list.
-active_file # of bytes of file-backed memory on active LRU list.
-unevictable # of bytes of memory that cannot be reclaimed (mlocked etc).
-=============== ===============================================================
-
-status considering hierarchy (see memory.use_hierarchy settings)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-========================= ===================================================
-hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy
- under which the memory cgroup is
-hierarchical_memsw_limit # of bytes of memory+swap limit with regard to
- hierarchy under which memory cgroup is.
-
-total_<counter> # hierarchical version of <counter>, which in
- addition to the cgroup's own value includes the
- sum of all hierarchical children's values of
- <counter>, i.e. total_cache
-========================= ===================================================
-
-The following additional stats are dependent on CONFIG_DEBUG_VM
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-========================= ========================================
-recent_rotated_anon VM internal parameter. (see mm/vmscan.c)
-recent_rotated_file VM internal parameter. (see mm/vmscan.c)
-recent_scanned_anon VM internal parameter. (see mm/vmscan.c)
-recent_scanned_file VM internal parameter. (see mm/vmscan.c)
-========================= ========================================
-
-Memo:
+memory.stat file includes following statistics:
+
+ * per-memory cgroup local status
+
+ =============== ===============================================================
+ cache # of bytes of page cache memory.
+ rss # of bytes of anonymous and swap cache memory (includes
+ transparent hugepages).
+ rss_huge # of bytes of anonymous transparent hugepages.
+ mapped_file # of bytes of mapped file (includes tmpfs/shmem)
+ pgpgin # of charging events to the memory cgroup. The charging
+ event happens each time a page is accounted as either mapped
+ anon page(RSS) or cache page(Page Cache) to the cgroup.
+ pgpgout # of uncharging events to the memory cgroup. The uncharging
+ event happens each time a page is unaccounted from the
+ cgroup.
+ swap # of bytes of swap usage
+ swapcached # of bytes of swap cached in memory
+ dirty # of bytes that are waiting to get written back to the disk.
+ writeback # of bytes of file/anon cache that are queued for syncing to
+ disk.
+ inactive_anon # of bytes of anonymous and swap cache memory on inactive
+ LRU list.
+ active_anon # of bytes of anonymous and swap cache memory on active
+ LRU list.
+ inactive_file # of bytes of file-backed memory and MADV_FREE anonymous
+ memory (LazyFree pages) on inactive LRU list.
+ active_file # of bytes of file-backed memory on active LRU list.
+ unevictable # of bytes of memory that cannot be reclaimed (mlocked etc).
+ =============== ===============================================================
+
+ * status considering hierarchy (see memory.use_hierarchy settings):
+
+ ========================= ===================================================
+ hierarchical_memory_limit # of bytes of memory limit with regard to
+ hierarchy under which the memory cgroup is
+ hierarchical_memsw_limit # of bytes of memory+swap limit with regard to
+ hierarchy under which memory cgroup is.
+
+ total_<counter> # hierarchical version of <counter>, which in
+ addition to the cgroup's own value includes the
+ sum of all hierarchical children's values of
+ <counter>, i.e. total_cache
+ ========================= ===================================================
+
+ * additional vm parameters (depends on CONFIG_DEBUG_VM):
+
+ ========================= ========================================
+ recent_rotated_anon VM internal parameter. (see mm/vmscan.c)
+ recent_rotated_file VM internal parameter. (see mm/vmscan.c)
+ recent_scanned_anon VM internal parameter. (see mm/vmscan.c)
+ recent_scanned_file VM internal parameter. (see mm/vmscan.c)
+ ========================= ========================================
+
+.. hint::
recent_rotated means recent frequency of LRU rotation.
recent_scanned means recent # of scans to LRU.
showing for better debug please see the code for meanings.
-Note:
+.. note::
Only anonymous and swap cache memory is listed as part of 'rss' stat.
This should not be confused with the true 'resident set size' or the
amount of physical memory used by the cgroup.
@@ -675,31 +678,20 @@ hierarchy::
d e
In the diagram above, with hierarchical accounting enabled, all memory
-usage of e, is accounted to its ancestors up until the root (i.e, c and root),
-that has memory.use_hierarchy enabled. If one of the ancestors goes over its
-limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
-children of the ancestor.
-
-6.1 Enabling hierarchical accounting and reclaim
-------------------------------------------------
+usage of e, is accounted to its ancestors up until the root (i.e, c and root).
+If one of the ancestors goes over its limit, the reclaim algorithm reclaims
+from the tasks in the ancestor and the children of the ancestor.
-A memory cgroup by default disables the hierarchy feature. Support
-can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup::
+6.1 Hierarchical accounting and reclaim
+---------------------------------------
- # echo 1 > memory.use_hierarchy
+Hierarchical accounting is enabled by default. Disabling the hierarchical
+accounting is deprecated. An attempt to do it will result in a failure
+and a warning printed to dmesg.
-The feature can be disabled by::
+For compatibility reasons writing 1 to memory.use_hierarchy will always pass::
- # echo 0 > memory.use_hierarchy
-
-NOTE1:
- Enabling/disabling will fail if either the cgroup already has other
- cgroups created below it, or if the parent cgroup has use_hierarchy
- enabled.
-
-NOTE2:
- When panic_on_oom is set to "2", the whole system will panic in
- case of an OOM event in any cgroup.
+ # echo 1 > memory.use_hierarchy
7. Soft limits
==============
@@ -733,15 +725,25 @@ If we want to change this to 1G, we can at any time use::
# echo 1G > memory.soft_limit_in_bytes
-NOTE1:
+.. note::
Soft limits take effect over a long period of time, since they involve
reclaiming memory for balancing between memory cgroups
-NOTE2:
+
+.. note::
It is recommended to set the soft limit always below the hard limit,
otherwise the hard limit will take precedence.
-8. Move charges at task migration
-=================================
+.. _cgroup-v1-memory-move-charges:
+
+8. Move charges at task migration (DEPRECATED!)
+===============================================
+
+THIS IS DEPRECATED!
+
+It's expensive and unreliable! It's better practice to launch workload
+tasks directly from inside their target cgroup. Use dedicated workload
+cgroups to allow fine-grained policy adjustments without having to
+move physical pages between control domains.
Users can move charges associated with a task along with task migration, that
is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
@@ -758,23 +760,29 @@ If you want to enable it::
# echo (some positive value) > memory.move_charge_at_immigrate
-Note:
+.. note::
Each bits of move_charge_at_immigrate has its own meaning about what type
- of charges should be moved. See 8.2 for details.
-Note:
+ of charges should be moved. See :ref:`section 8.2
+ <cgroup-v1-memory-movable-charges>` for details.
+
+.. note::
Charges are moved only when you move mm->owner, in other words,
a leader of a thread group.
-Note:
+
+.. note::
If we cannot find enough space for the task in the destination cgroup, we
try to make space by reclaiming memory. Task migration may fail if we
cannot make enough space.
-Note:
+
+.. note::
It can take several seconds if you move charges much.
And if you want disable it again::
# echo 0 > memory.move_charge_at_immigrate
+.. _cgroup-v1-memory-movable-charges:
+
8.2 Type of charges which can be moved
--------------------------------------
@@ -824,6 +832,8 @@ threshold in any direction.
It's applicable for root and non-root cgroup.
+.. _cgroup-v1-memory-oom-control:
+
10. OOM Control
===============
@@ -868,6 +878,9 @@ At reading, current status of OOM is shown.
(if 1, oom-killer is disabled)
- under_oom 0 or 1
(if 1, the memory cgroup is under OOM, tasks may be stopped.)
+ - oom_kill integer counter
+ The number of processes belonging to this cgroup killed by any
+ kind of OOM killer.
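+
+For example, reading the file might show (values are illustrative)::
+
+ # cat memory.oom_control
+ oom_kill_disable 0
+ under_oom 0
+ oom_kill 0
+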
11. Memory Pressure
===================
@@ -902,7 +915,7 @@ experiences some pressure. In this situation, only group C will receive the
notification, i.e. groups A and B will not receive it. This is done to avoid
excessive "broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. Group B, will receive
-notification only if there are no event listers for group C.
+notification only if there are no event listeners for group C.
There are three optional modes that specify different propagation behavior:
@@ -976,25 +989,27 @@ commented and discussed quite extensively in the community.
References
==========
-1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
-2. Singh, Balbir. Memory Controller (RSS Control),
+.. [1] Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
+.. [2] Singh, Balbir. Memory Controller (RSS Control),
http://lwn.net/Articles/222762/
-3. Emelianov, Pavel. Resource controllers based on process cgroups
- http://lkml.org/lkml/2007/3/6/198
-4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
- http://lkml.org/lkml/2007/4/9/78
-5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
- http://lkml.org/lkml/2007/5/30/244
+.. [3] Emelianov, Pavel. Resource controllers based on process cgroups
+ https://lore.kernel.org/r/45ED7DEC.7010403@sw.ru
+.. [4] Emelianov, Pavel. RSS controller based on process cgroups (v2)
+ https://lore.kernel.org/r/461A3010.90403@sw.ru
+.. [5] Emelianov, Pavel. RSS controller based on process cgroups (v3)
+ https://lore.kernel.org/r/465D9739.8070209@openvz.org
+
6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control
subsystem (v3), http://lwn.net/Articles/235534/
8. Singh, Balbir. RSS controller v2 test results (lmbench),
- http://lkml.org/lkml/2007/5/17/232
+ https://lore.kernel.org/r/464C95D4.7070806@linux.vnet.ibm.com
9. Singh, Balbir. RSS controller v2 AIM9 results
- http://lkml.org/lkml/2007/5/18/1
+ https://lore.kernel.org/r/464D267A.50107@linux.vnet.ibm.com
10. Singh, Balbir. Memory controller v6 test results,
- http://lkml.org/lkml/2007/8/19/36
-11. Singh, Balbir. Memory controller introduction (v6),
- http://lkml.org/lkml/2007/8/17/69
-12. Corbet, Jonathan, Controlling memory use in cgroups,
- http://lwn.net/Articles/243795/
+ https://lore.kernel.org/r/20070819094658.654.84837.sendpatchset@balbir-laptop
+
+.. [11] Singh, Balbir. Memory controller introduction (v6),
+ https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop
+.. [12] Corbet, Jonathan, Controlling memory use in cgroups,
+ http://lwn.net/Articles/243795/
diff --git a/Documentation/admin-guide/cgroup-v1/misc.rst b/Documentation/admin-guide/cgroup-v1/misc.rst
new file mode 100644
index 000000000000..661614c24df3
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/misc.rst
@@ -0,0 +1,4 @@
+===============
+Misc controller
+===============
+Please refer "Misc" documentation in Documentation/admin-guide/cgroup-v2.rst
diff --git a/Documentation/admin-guide/cgroup-v1/rdma.rst b/Documentation/admin-guide/cgroup-v1/rdma.rst
index 2fcb0a9bf790..e69369b7252e 100644
--- a/Documentation/admin-guide/cgroup-v1/rdma.rst
+++ b/Documentation/admin-guide/cgroup-v1/rdma.rst
@@ -114,4 +114,4 @@ Following resources can be accounted by rdma controller.
(d) Delete resource limit::
- echo echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
+ echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index ce3e05e41724..17e6e9565156 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1,3 +1,5 @@
+.. _cgroup-v2:
+
================
Control Group v2
================
@@ -54,6 +56,7 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
5-3-3. IO Latency
5-3-3-1. How IO Latency Throttling Works
5-3-3-2. IO Latency Interface Files
+ 5-3-4. IO Priority
5-4. PID
5-4-1. PID Interface Files
5-5. Cpuset
@@ -63,8 +66,11 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
5-7-1. RDMA Interface Files
5-8. HugeTLB
5.8-1. HugeTLB Interface Files
- 5-8. Misc
- 5-8-1. perf_event
+ 5-9. Misc
+ 5.9-1 Miscellaneous cgroup Interface Files
+ 5.9-2 Migration and Ownership
+ 5-10. Others
+ 5-10-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -172,15 +178,21 @@ disabling controllers in v1 and make them always available in v2.
cgroup v2 currently supports the following mount options.
nsdelegate
-
Consider cgroup namespaces as delegation boundaries. This
option is system wide and can only be set on mount or modified
through remount from the init namespace. The mount option is
ignored on non-init namespace mounts. Please refer to the
Delegation section for details.
- memory_localevents
+ favordynmods
+ Reduce the latencies of dynamic cgroup modifications such as
+ task migrations and controller on/offs at the cost of making
+ hot path operations such as forks and exits more expensive.
+ The static usage pattern of creating a cgroup, enabling
+ controllers, and then seeding it with CLONE_INTO_CGROUP is
+ not affected by this option.
+ memory_localevents
Only populate memory.events with data for the current cgroup,
and not any subtrees. This is legacy behaviour, the default
behaviour without this option is to include subtree counts.
@@ -189,7 +201,6 @@ cgroup v2 currently supports the following mount options.
option is ignored on non-init namespace mounts.
memory_recursiveprot
-
Recursively apply memory.min and memory.low protection to
entire subtrees, without requiring explicit downward
propagation into leaf cgroups. This allows protecting entire
@@ -199,6 +210,35 @@ cgroup v2 currently supports the following mount options.
relying on the original semantics (e.g. specifying bogusly
high 'bypass' protection values at higher tree levels).
+ memory_hugetlb_accounting
+ Count HugeTLB memory usage towards the cgroup's overall
+ memory usage for the memory controller (for the purpose of
+ statistics reporting and memory protection). This is a new
+ behavior that could regress existing setups, so it must be
+ explicitly opted in with this mount option (see the example
+ mount command after the caveats below).
+
+ A few caveats to keep in mind:
+
+ * There is no HugeTLB pool management involved in the memory
+ controller. The pre-allocated pool does not belong to anyone.
+ Specifically, when a new HugeTLB folio is allocated to
+ the pool, it is not accounted for from the perspective of the
+ memory controller. It is only charged to a cgroup when it is
+ actually used (e.g. at page fault time). Host memory
+ overcommit management has to consider this when configuring
+ hard limits. In general, HugeTLB pool management should be
+ done via other mechanisms (such as the HugeTLB controller).
+ * Failure to charge a HugeTLB folio to the memory controller
+ results in SIGBUS. This could happen even if the HugeTLB pool
+ still has pages available (but the cgroup limit is hit and
+ reclaim attempt fails).
+ * Charging HugeTLB memory towards the memory controller affects
+ memory protection and reclaim dynamics. Any userspace tuning
+ (e.g. of the low and min limits) needs to take this into account.
+ * HugeTLB pages utilized while this option is not selected
+ will not be tracked by the memory controller (even if cgroup
+ v2 is remounted later on).
+
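+As an illustrative example (the option combination is arbitrary), mount
+options are passed when mounting or remounting the cgroup v2 hierarchy::
+
+  # mount -t cgroup2 -o nsdelegate,memory_hugetlb_accounting none /sys/fs/cgroup
+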
Organizing Processes and Threads
--------------------------------
@@ -353,6 +393,13 @@ constraint, a threaded controller must be able to handle competition
between threads in a non-leaf cgroup and its child cgroups. Each
threaded controller defines how such competitions are handled.
+Currently, the following controllers are threaded and can be enabled
+in a threaded cgroup::
+
+- cpu
+- cpuset
+- perf_event
+- pids
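+
+A cgroup is switched to the threaded mode through its cgroup.type file;
+a minimal sketch (the path is illustrative)::
+
+  # echo threaded > /sys/fs/cgroup/a/b/cgroup.type
+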
[Un]populated Notification
--------------------------
@@ -608,10 +655,12 @@ process migrations.
and is an example of this type.
+.. _cgroupv2-limits-distributor:
+
Limits
------
-A child can only consume upto the configured amount of the resource.
+A child can only consume up to the configured amount of the resource.
Limits can be over-committed - the sum of the limits of children can
exceed the amount of resource available to the parent.
@@ -624,15 +673,16 @@ process migrations.
"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
on an IO device and is an example of this type.
+.. _cgroupv2-protections-distributor:
Protections
-----------
-A cgroup is protected upto the configured amount of the resource
+A cgroup is protected up to the configured amount of the resource
as long as the usages of all its ancestors are under their
protected levels. Protections can be hard guarantees or best effort
soft boundaries. Protections can also be over-committed in which case
-only upto the amount available to the parent is protected among
+only up to the amount available to the parent is protected among
children.
Protections are in the range [0, max] and defaults to 0, which is
@@ -786,7 +836,6 @@ Core Interface Files
All cgroup core files are prefixed with "cgroup."
cgroup.type
-
A read-write single value file which exists on non-root
cgroups.
@@ -951,9 +1000,49 @@ All cgroup core files are prefixed with "cgroup."
it's possible to delete a frozen (and empty) cgroup, as well as
create new sub-cgroups.
+ cgroup.kill
+ A write-only single value file which exists in non-root cgroups.
+ The only allowed value is "1".
+
+ Writing "1" to the file causes the cgroup and all descendant cgroups to
+ be killed. This means that all processes located in the affected cgroup
+ tree will be killed via SIGKILL.
+
+ Killing a cgroup tree will deal with concurrent forks appropriately and
+ is protected against migrations.
+
+ In a threaded cgroup, writing this file fails with EOPNOTSUPP as
+ killing cgroups is a process directed operation, i.e. it affects
+ the whole thread-group.
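+
+ A minimal usage sketch (the cgroup path is illustrative)::
+
+ # echo 1 > /sys/fs/cgroup/workload/cgroup.kill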
+
+ cgroup.pressure
+ A read-write single value file whose allowed values are "0" and "1".
+ The default is "1".
+
+ Writing "0" to the file will disable the cgroup PSI accounting.
+ Writing "1" to the file will re-enable the cgroup PSI accounting.
+
+ This control attribute is not hierarchical, so disabling or enabling
+ PSI accounting in a cgroup does not affect PSI accounting in its
+ descendants and does not require passing enablement down from the
+ root via ancestors.
+
+ The reason this control attribute exists is that PSI accounts stalls
+ for each cgroup separately and aggregates them at each level of the
+ hierarchy. This may cause non-negligible overhead for some workloads
+ deep in the hierarchy, in which case this control attribute can be
+ used to disable PSI accounting in the non-leaf cgroups.
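+
+ For example, PSI accounting can be turned off in an intermediate
+ cgroup (the path is illustrative)::
+
+ # echo 0 > /sys/fs/cgroup/a/b/cgroup.pressure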
+
+ irq.pressure
+ A read-write nested-keyed file.
+
+ Shows pressure stall information for IRQ/SOFTIRQ. See
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
+
Controllers
===========
+.. _cgroup-v2-cpu:
+
CPU
---
@@ -992,17 +1081,23 @@ All time durations are in microseconds.
- user_usec
- system_usec
- and the following three when the controller is enabled:
+ and the following five when the controller is enabled:
- nr_periods
- nr_throttled
- throttled_usec
+ - nr_bursts
+ - burst_usec
cpu.weight
A read-write single value file which exists on non-root
cgroups. The default is "100".
- The weight in the range [1, 10000].
+ For non-idle groups (cpu.idle = 0), the weight is in the
+ range [1, 10000].
+
+ If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1),
+ then the weight will show as 0.
cpu.weight.nice
A read-write single value file which exists on non-root
@@ -1024,12 +1119,18 @@ All time durations are in microseconds.
$MAX $PERIOD
- which indicates that the group may consume upto $MAX in each
+ which indicates that the group may consume up to $MAX in each
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
+ cpu.max.burst
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ The burst in the range [0, $MAX].
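+
+ For example, assuming the illustrative quota below, the group may
+ accumulate up to 50ms of unused quota and burst with it in a later
+ period::
+
+ # echo "100000 1000000" > cpu.max
+ # echo 50000 > cpu.max.burst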
+
cpu.pressure
- A read-only nested-key file which exists on non-root cgroups.
+ A read-write nested-keyed file.
Shows pressure stall information for CPU. See
:ref:`Documentation/accounting/psi.rst <psi>` for details.
@@ -1060,6 +1161,16 @@ All time durations are in microseconds.
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.
+ cpu.idle
+ A read-write single value file which exists on non-root cgroups.
+ The default is 0.
+
+ This is the cgroup analog of the per-task SCHED_IDLE sched policy.
+ Setting this value to 1 will make the scheduling policy of the
+ cgroup SCHED_IDLE. The threads inside the cgroup will retain their
+ own relative priorities, but the cgroup itself will be treated as
+ very low priority relative to its peers.
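+
+ A minimal sketch marking a background cgroup as SCHED_IDLE (the path
+ is illustrative)::
+
+ # echo 1 > /sys/fs/cgroup/background/cpu.idle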
+
Memory
@@ -1152,23 +1263,25 @@ PAGE_SIZE multiple when read back.
A read-write single value file which exists on non-root
cgroups. The default is "max".
- Memory usage throttle limit. This is the main mechanism to
- control memory usage of a cgroup. If a cgroup's usage goes
+ Memory usage throttle limit. If a cgroup's usage goes
over the high boundary, the processes of the cgroup are
throttled and put under heavy reclaim pressure.
Going over the high limit never invokes the OOM killer and
- under extreme conditions the limit may be breached.
+ under extreme conditions the limit may be breached. The high
+ limit should be used in scenarios where an external process
+ monitors the limited cgroup to alleviate heavy reclaim
+ pressure.
memory.max
A read-write single value file which exists on non-root
cgroups. The default is "max".
- Memory usage hard limit. This is the final protection
- mechanism. If a cgroup's memory usage reaches this limit and
- can't be reduced, the OOM killer is invoked in the cgroup.
- Under certain circumstances, the usage may go over the limit
- temporarily.
+ Memory usage hard limit. This is the main mechanism to limit
+ memory usage of a cgroup. If a cgroup's memory usage reaches
+ this limit and can't be reduced, the OOM killer is invoked in
+ the cgroup. Under certain circumstances, the usage may go
+ over the limit temporarily.
In default configuration regular 0-order allocations always
succeed unless OOM killer chooses current task as a victim.
@@ -1177,9 +1290,40 @@ PAGE_SIZE multiple when read back.
Caller could retry them differently, return into userspace
as -ENOMEM or silently ignore in cases like disk readahead.
- This is the ultimate protection mechanism. As long as the
- high limit is used and monitored properly, this limit's
- utility is limited to providing the final safety net.
+ memory.reclaim
+ A write-only nested-keyed file which exists for all cgroups.
+
+ This is a simple interface to trigger memory reclaim in the
+ target cgroup.
+
+ This file accepts a single key, the number of bytes to reclaim.
+ No nested keys are currently supported.
+
+ Example::
+
+ echo "1G" > memory.reclaim
+
+ The interface can be later extended with nested keys to
+ configure the reclaim behavior. For example, specify the
+ type of memory to reclaim from (anon, file, ..).
+
+ Please note that the kernel can over- or under-reclaim from
+ the target cgroup. If fewer bytes are reclaimed than the
+ specified amount, -EAGAIN is returned.
+
+ Please note that the proactive reclaim (triggered by this
+ interface) is not meant to indicate memory pressure on the
+ memory cgroup. Therefore, socket memory balancing triggered by
+ memory reclaim is normally not exercised in this case.
+ This means that the networking layer will not adapt based on
+ reclaim induced by memory.reclaim.
+
+ memory.peak
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The max memory usage recorded for the cgroup and its
+ descendants since the creation of the cgroup.
memory.oom.group
A read-write single value file which exists on non-root
@@ -1207,7 +1351,7 @@ PAGE_SIZE multiple when read back.
Note that all fields in this file are hierarchical and the
file modified event can be generated due to an event down the
- hierarchy. For for the local events at the cgroup level see
+ hierarchy. For the local events at the cgroup level see
memory.events.local.
low
@@ -1241,6 +1385,9 @@ PAGE_SIZE multiple when read back.
The number of processes belonging to this cgroup
killed by any kind of OOM killer.
+ oom_group_kill
+ The number of times a group OOM has occurred.
+
memory.events.local
Similar to memory.events but the fields in the file are local
to the cgroup i.e. not hierarchical. The file modified event
@@ -1259,6 +1406,10 @@ PAGE_SIZE multiple when read back.
can show up in the middle. Don't rely on items remaining in a
fixed position; use the keys to look up specific values!
+ Entries that have no per-node counter (and therefore are not
+ shown in memory.numa_stat) are marked with the 'npn'
+ (non-per-node) tag.
+
anon
Amount of memory used in anonymous mappings such as
brk(), sbrk(), and mmap(MAP_ANONYMOUS)
@@ -1267,20 +1418,42 @@ PAGE_SIZE multiple when read back.
Amount of memory used to cache filesystem data,
including tmpfs and shared memory.
+ kernel (npn)
+ Amount of total kernel memory, including
+ (kernel_stack, pagetables, percpu, vmalloc, slab) in
+ addition to other kernel memory use cases.
+
kernel_stack
Amount of memory allocated to kernel stacks.
- slab
- Amount of memory used for storing in-kernel data
- structures.
+ pagetables
+ Amount of memory allocated for page tables.
- sock
+ sec_pagetables
+ Amount of memory allocated for secondary page tables;
+ this currently includes KVM mmu allocations on x86
+ and arm64.
+
+ percpu (npn)
+ Amount of memory used for storing per-cpu kernel
+ data structures.
+
+ sock (npn)
Amount of memory used in network transmission buffers
+ vmalloc (npn)
+ Amount of memory used for vmap backed memory.
+
shmem
Amount of cached filesystem data that is swap-backed,
such as tmpfs, shm segments, shared anonymous mmap()s
+ zswap
+ Amount of memory consumed by the zswap compression backend.
+
+ zswapped
+ Amount of application memory swapped out to zswap.
+
file_mapped
Amount of cached filesystem data mapped with mmap()
@@ -1292,10 +1465,22 @@ PAGE_SIZE multiple when read back.
Amount of cached filesystem data that was modified and
is currently being written back to disk
+ swapcached
+ Amount of swap cached in memory. The swapcache is accounted
+ against both memory and swap usage.
+
anon_thp
Amount of memory used in anonymous mappings backed by
transparent hugepages
+ file_thp
+ Amount of cached filesystem data backed by transparent
+ hugepages
+
+ shmem_thp
+ Amount of shm, tmpfs, shared anonymous mmap()s backed by
+ transparent hugepages
+
inactive_anon, active_anon, inactive_file, active_file, unevictable
Amount of memory, swap-backed and filesystem-backed,
on the internal memory management lists used by the
@@ -1314,56 +1499,123 @@ PAGE_SIZE multiple when read back.
Part of "slab" that cannot be reclaimed on memory
pressure.
- pgfault
- Total number of page faults incurred
+ slab (npn)
+ Amount of memory used for storing in-kernel data
+ structures.
- pgmajfault
- Number of major page faults incurred
+ workingset_refault_anon
+ Number of refaults of previously evicted anonymous pages.
+
+ workingset_refault_file
+ Number of refaults of previously evicted file pages.
- workingset_refault
- Number of refaults of previously evicted pages
+ workingset_activate_anon
+ Number of refaulted anonymous pages that were immediately
+ activated.
- workingset_activate
- Number of refaulted pages that were immediately activated
+ workingset_activate_file
+ Number of refaulted file pages that were immediately activated.
- workingset_restore
- Number of restored pages which have been detected as an active
- workingset before they got reclaimed.
+ workingset_restore_anon
+ Number of restored anonymous pages which have been detected as
+ an active workingset before they got reclaimed.
+
+ workingset_restore_file
+ Number of restored file pages which have been detected as an
+ active workingset before they got reclaimed.
workingset_nodereclaim
Number of times a shadow node has been reclaimed
- pgrefill
- Amount of scanned pages (in an active LRU list)
-
- pgscan
+ pgscan (npn)
Amount of scanned pages (in an inactive LRU list)
- pgsteal
+ pgsteal (npn)
Amount of reclaimed pages
- pgactivate
+ pgscan_kswapd (npn)
+ Amount of scanned pages by kswapd (in an inactive LRU list)
+
+ pgscan_direct (npn)
+ Amount of scanned pages directly (in an inactive LRU list)
+
+ pgscan_khugepaged (npn)
+ Amount of scanned pages by khugepaged (in an inactive LRU list)
+
+ pgsteal_kswapd (npn)
+ Amount of reclaimed pages by kswapd
+
+ pgsteal_direct (npn)
+ Amount of reclaimed pages directly
+
+ pgsteal_khugepaged (npn)
+ Amount of reclaimed pages by khugepaged
+
+ pgfault (npn)
+ Total number of page faults incurred
+
+ pgmajfault (npn)
+ Number of major page faults incurred
+
+ pgrefill (npn)
+ Amount of scanned pages (in an active LRU list)
+
+ pgactivate (npn)
Amount of pages moved to the active LRU list
- pgdeactivate
+ pgdeactivate (npn)
Amount of pages moved to the inactive LRU list
- pglazyfree
+ pglazyfree (npn)
Amount of pages postponed to be freed under memory pressure
- pglazyfreed
+ pglazyfreed (npn)
Amount of reclaimed lazyfree pages
- thp_fault_alloc
+ thp_fault_alloc (npn)
Number of transparent hugepages which were allocated to satisfy
- a page fault, including COW faults. This counter is not present
- when CONFIG_TRANSPARENT_HUGEPAGE is not set.
+ a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE
+ is not set.
- thp_collapse_alloc
+ thp_collapse_alloc (npn)
Number of transparent hugepages which were allocated to allow
collapsing an existing range of pages. This counter is not
present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
+ thp_swpout (npn)
+ Number of transparent hugepages which are swapout in one piece
+ without splitting.
+
+ thp_swpout_fallback (npn)
+ Number of transparent hugepages which were split before swapout.
+ Usually because of a failure to allocate contiguous swap space
+ for the huge page.
+
+ memory.numa_stat
+ A read-only nested-keyed file which exists on non-root cgroups.
+
+ This breaks down the cgroup's memory footprint into different
+ types of memory, type-specific details, and other information
+ per node on the state of the memory management system.
+
+ This is useful for providing visibility into the NUMA locality
+ information within a memcg since the pages are allowed to be
+ allocated from any physical node. One of the use case is evaluating
+ application performance by combining this information with the
+ application's CPU allocation.
+
+ All memory amounts are in bytes.
+
+ The output format of memory.numa_stat is::
+
+ type N0=<bytes in node 0> N1=<bytes in node 1> ...
+
+ The entries are ordered to be human readable, and new entries
+ can show up in the middle. Don't rely on items remaining in a
+ fixed position; use the keys to look up specific values!
+
+ The entries correspond to those in memory.stat; refer there for
+ their meanings.
+
memory.swap.current
A read-only single value file which exists on non-root
cgroups.
@@ -1387,6 +1639,13 @@ PAGE_SIZE multiple when read back.
Healthy workloads are not expected to reach this limit.
+ memory.swap.peak
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The max swap usage recorded for the cgroup and its
+ descendants since the creation of the cgroup.
+
memory.swap.max
A read-write single value file which exists on non-root
cgroups. The default is "max".
@@ -1419,8 +1678,38 @@ PAGE_SIZE multiple when read back.
higher than the limit for an extended period of time. This
reduces the impact on the workload and memory management.
+ memory.zswap.current
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The total amount of memory consumed by the zswap compression
+ backend.
+
+ memory.zswap.max
+ A read-write single value file which exists on non-root
+ cgroups. The default is "max".
+
+ Zswap usage hard limit. If a cgroup's zswap pool reaches this
+ limit, it will refuse to take any more stores before existing
+ entries fault back in or are written out to disk.
+
+ memory.zswap.writeback
+ A read-write single value file. The default value is "1". The
+ initial value of the root cgroup is 1, and when a new cgroup is
+ created, it inherits the current value of its parent.
+
+ When this is set to 0, all attempts to swap out to swap devices
+ are disabled. This includes both zswap writebacks and swapping due
+ to zswap store failures. If zswap store failures are recurring
+ (e.g. if the pages are incompressible), users can observe
+ reclaim inefficiency after disabling writeback (because the same
+ pages might be rejected again and again).
+
+ Note that this is subtly different from setting memory.swap.max to
+ 0, as it still allows for pages to be written to the zswap pool.
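+
+ For example, writeback to backing swap devices can be disabled for a
+ cgroup as follows (the path is illustrative)::
+
+ # echo 0 > /sys/fs/cgroup/workload/memory.zswap.writeback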
+
memory.pressure
- A read-only nested-key file which exists on non-root cgroups.
+ A read-only nested-keyed file.
Shows pressure stall information for memory. See
:ref:`Documentation/accounting/psi.rst <psi>` for details.
@@ -1483,8 +1772,7 @@ IO Interface Files
~~~~~~~~~~~~~~~~~~
io.stat
- A read-only nested-keyed file which exists on non-root
- cgroups.
+ A read-only nested-keyed file.
Lines are keyed by $MAJ:$MIN device numbers and not ordered.
The following nested keys are defined.
@@ -1504,7 +1792,7 @@ IO Interface Files
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
io.cost.qos
- A read-write nested-keyed file with exists only on the root
+ A read-write nested-keyed file which exists only on the root
cgroup.
This file configures the Quality of Service of the IO cost
@@ -1559,7 +1847,7 @@ IO Interface Files
automatic mode can be restored by setting "ctrl" to "auto".
io.cost.model
- A read-write nested-keyed file with exists only on the root
+ A read-write nested-keyed file which exists only on the root
cgroup.
This file configures the cost model of the IO cost model based
@@ -1660,7 +1948,7 @@ IO Interface Files
8:16 rbps=2097152 wbps=max riops=max wiops=max
io.pressure
- A read-only nested-key file which exists on non-root cgroups.
+ A read-only nested-keyed file.
Shows pressure stall information for IO. See
:ref:`Documentation/accounting/psi.rst <psi>` for details.
@@ -1684,9 +1972,9 @@ per-cgroup dirty memory states are examined and the more restrictive
of the two is enforced.
cgroup writeback requires explicit support from the underlying
-filesystem. Currently, cgroup writeback is implemented on ext2, ext4
-and btrfs. On other filesystems, all writeback IOs are attributed to
-the root cgroup.
+filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
+btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
+attributed to the root cgroup.
There are inherent differences in memory and writeback management
which affects how cgroup ownership is tracked. Memory is tracked per
@@ -1785,7 +2073,7 @@ IO Latency Interface Files
io.latency
This takes a similar format as the other controllers.
- "MAJOR:MINOR target=<target time in microseconds"
+ "MAJOR:MINOR target=<target time in microseconds>"
io.stat
If the controller is enabled you will see extra stats in io.stat in
@@ -1805,6 +2093,68 @@ IO Latency Interface Files
duration of time between evaluation events. Windows only elapse
with IO activity. Idle periods extend the most recent window.
+IO Priority
+~~~~~~~~~~~
+
+A single attribute controls the behavior of the I/O priority cgroup policy,
+namely the io.prio.class attribute. The following values are accepted for
+that attribute:
+
+ no-change
+ Do not modify the I/O priority class.
+
+ promote-to-rt
+ For requests that have a non-RT I/O priority class, change it into RT.
+ Also change the priority level of these requests to 4. Do not modify
+ the I/O priority of requests that have priority class RT.
+
+ restrict-to-be
+ For requests that do not have an I/O priority class or that have I/O
+ priority class RT, change it into BE. Also change the priority level
+ of these requests to 0. Do not modify the I/O priority class of
+ requests that have priority class IDLE.
+
+ idle
+ Change the I/O priority class of all requests into IDLE, the lowest
+ I/O priority class.
+
+ none-to-rt
+ Deprecated. Just an alias for promote-to-rt.
+
+The following numerical values are associated with the I/O priority policies:
+
++----------------+---+
+| no-change | 0 |
++----------------+---+
+| promote-to-rt | 1 |
++----------------+---+
+| restrict-to-be | 2 |
++----------------+---+
+| idle | 3 |
++----------------+---+
+
+The numerical value that corresponds to each I/O priority class is as follows:
+
++-------------------------------+---+
+| IOPRIO_CLASS_NONE | 0 |
++-------------------------------+---+
+| IOPRIO_CLASS_RT (real-time) | 1 |
++-------------------------------+---+
+| IOPRIO_CLASS_BE (best effort) | 2 |
++-------------------------------+---+
+| IOPRIO_CLASS_IDLE | 3 |
++-------------------------------+---+
+
+The algorithm to set the I/O priority class for a request is as follows:
+
+- If I/O priority class policy is promote-to-rt, change the request I/O
+ priority class to IOPRIO_CLASS_RT and change the request I/O priority
+ level to 4.
+- If I/O priority class policy is not promote-to-rt, translate the I/O priority
+ class policy into a number, then change the request I/O priority class
+ into the maximum of the I/O priority class policy number and the numerical
+ I/O priority class.
+
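+As a minimal sketch (the "low-prio" cgroup name and a cgroup2 mount at
+/sys/fs/cgroup are only illustrative), a policy is selected by writing one
+of the values above to the cgroup's io.prio.class attribute::
+
+  # echo restrict-to-be > /sys/fs/cgroup/low-prio/io.prio.class
+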
PID
---
@@ -1925,6 +2275,17 @@ Cpuset Interface Files
The value of "cpuset.mems" stays constant until the next update
and won't be affected by any memory nodes hotplug events.
+ Setting a non-empty value to "cpuset.mems" causes memory of
+ tasks within the cgroup to be migrated to the designated nodes if
+ they are currently using memory outside of the designated nodes.
+
+ There is a cost for this memory migration. The migration
+ may not be complete and some memory pages may be left behind.
+	It is therefore recommended that "cpuset.mems" be set properly
+	before spawning new tasks into the cpuset. Even if "cpuset.mems"
+	has to be changed with active tasks, it shouldn't be done
+	frequently.
+
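+	For example, confining a cgroup to memory node 0, and migrating
+	any of its tasks' memory currently residing on other nodes, can
+	be done with (the node number is only illustrative)::
+
+	  # echo 0 > cpuset.mems
+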
cpuset.mems.effective
A read-only multiple values file which exists on all
cpuset-enabled cgroups.
@@ -1941,78 +2302,166 @@ Cpuset Interface Files
Its value will be affected by memory nodes hotplug events.
+ cpuset.cpus.exclusive
+ A read-write multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists all the exclusive CPUs that are allowed to be used
+ to create a new cpuset partition. Its value is not used
+ unless the cgroup becomes a valid partition root. See the
+ "cpuset.cpus.partition" section below for a description of what
+ a cpuset partition is.
+
+ When the cgroup becomes a partition root, the actual exclusive
+ CPUs that are allocated to that partition are listed in
+ "cpuset.cpus.exclusive.effective" which may be different
+ from "cpuset.cpus.exclusive". If "cpuset.cpus.exclusive"
+ has previously been set, "cpuset.cpus.exclusive.effective"
+ is always a subset of it.
+
+ Users can manually set it to a value that is different from
+ "cpuset.cpus". The only constraint in setting it is that the
+ list of CPUs must be exclusive with respect to its sibling.
+
+	For a parent cgroup, any one of its exclusive CPUs can be
+	distributed to at most one of its child cgroups. Having an
+ exclusive CPU appearing in two or more of its child cgroups is
+ not allowed (the exclusivity rule). A value that violates the
+ exclusivity rule will be rejected with a write error.
+
+ The root cgroup is a partition root and all its available CPUs
+ are in its exclusive CPU set.
+
+ cpuset.cpus.exclusive.effective
+ A read-only multiple values file which exists on all non-root
+ cpuset-enabled cgroups.
+
+ This file shows the effective set of exclusive CPUs that
+ can be used to create a partition root. The content of this
+ file will always be a subset of "cpuset.cpus" and its parent's
+ "cpuset.cpus.exclusive.effective" if its parent is not the root
+ cgroup. It will also be a subset of "cpuset.cpus.exclusive"
+ if it is set. If "cpuset.cpus.exclusive" is not set, it is
+	treated as having an implicit value of "cpuset.cpus" when
+	forming a local partition.
+
+ cpuset.cpus.isolated
+	A read-only multiple values file which exists only on the
+	root cgroup.
+
+ This file shows the set of all isolated CPUs used in existing
+ isolated partitions. It will be empty if no isolated partition
+ is created.
+
cpuset.cpus.partition
A read-write single value file which exists on non-root
cpuset-enabled cgroups. This flag is owned by the parent cgroup
and is not delegatable.
- It accepts only the following input values when written to.
-
- "root" - a partition root
- "member" - a non-root member of a partition
-
- When set to be a partition root, the current cgroup is the
- root of a new partition or scheduling domain that comprises
- itself and all its descendants except those that are separate
- partition roots themselves and their descendants. The root
- cgroup is always a partition root.
-
- There are constraints on where a partition root can be set.
- It can only be set in a cgroup if all the following conditions
- are true.
-
- 1) The "cpuset.cpus" is not empty and the list of CPUs are
- exclusive, i.e. they are not shared by any of its siblings.
- 2) The parent cgroup is a partition root.
- 3) The "cpuset.cpus" is also a proper subset of the parent's
- "cpuset.cpus.effective".
- 4) There is no child cgroups with cpuset enabled. This is for
- eliminating corner cases that have to be handled if such a
- condition is allowed.
-
- Setting it to partition root will take the CPUs away from the
- effective CPUs of the parent cgroup. Once it is set, this
- file cannot be reverted back to "member" if there are any child
- cgroups with cpuset enabled.
-
- A parent partition cannot distribute all its CPUs to its
- child partitions. There must be at least one cpu left in the
- parent partition.
-
- Once becoming a partition root, changes to "cpuset.cpus" is
- generally allowed as long as the first condition above is true,
- the change will not take away all the CPUs from the parent
- partition and the new "cpuset.cpus" value is a superset of its
- children's "cpuset.cpus" values.
-
- Sometimes, external factors like changes to ancestors'
- "cpuset.cpus" or cpu hotplug can cause the state of the partition
- root to change. On read, the "cpuset.sched.partition" file
- can show the following values.
-
- "member" Non-root member of a partition
- "root" Partition root
- "root invalid" Invalid partition root
-
- It is a partition root if the first 2 partition root conditions
- above are true and at least one CPU from "cpuset.cpus" is
- granted by the parent cgroup.
-
- A partition root can become invalid if none of CPUs requested
- in "cpuset.cpus" can be granted by the parent cgroup or the
- parent cgroup is no longer a partition root itself. In this
- case, it is not a real partition even though the restriction
- of the first partition root condition above will still apply.
- The cpu affinity of all the tasks in the cgroup will then be
- associated with CPUs in the nearest ancestor partition.
-
- An invalid partition root can be transitioned back to a
- real partition root if at least one of the requested CPUs
- can now be granted by its parent. In this case, the cpu
- affinity of all the tasks in the formerly invalid partition
- will be associated to the CPUs of the newly formed partition.
- Changing the partition state of an invalid partition root to
- "member" is always allowed even if child cpusets are present.
+ It accepts only the following input values when written to.
+
+ ========== =====================================
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
+ ========== =====================================
+
+	A cpuset partition is a collection of cpuset-enabled cgroups
+	consisting of a partition root at the top of the hierarchy and
+	its descendants, excluding those that are themselves separate
+	partition roots and their descendants. A partition has exclusive
+	access to the
+ set of exclusive CPUs allocated to it. Other cgroups outside
+ of that partition cannot use any CPUs in that set.
+
+ There are two types of partitions - local and remote. A local
+ partition is one whose parent cgroup is also a valid partition
+ root. A remote partition is one whose parent cgroup is not a
+ valid partition root itself. Writing to "cpuset.cpus.exclusive"
+ is optional for the creation of a local partition as its
+ "cpuset.cpus.exclusive" file will assume an implicit value that
+ is the same as "cpuset.cpus" if it is not set. Writing the
+ proper "cpuset.cpus.exclusive" values down the cgroup hierarchy
+ before the target partition root is mandatory for the creation
+ of a remote partition.
+
+ Currently, a remote partition cannot be created under a local
+	partition. None of the ancestors of a remote partition root,
+	other than the root cgroup, can be a partition root.
+
+ The root cgroup is always a partition root and its state cannot
+ be changed. All other non-root cgroups start out as "member".
+
+ When set to "root", the current cgroup is the root of a new
+ partition or scheduling domain. The set of exclusive CPUs is
+ determined by the value of its "cpuset.cpus.exclusive.effective".
+
+ When set to "isolated", the CPUs in that partition will be in
+ an isolated state without any load balancing from the scheduler
+ and excluded from the unbound workqueues. Tasks placed in such
+ a partition with multiple CPUs should be carefully distributed
+ and bound to each of the individual CPUs for optimal performance.
+
+ A partition root ("root" or "isolated") can be in one of the
+ two possible states - valid or invalid. An invalid partition
+ root is in a degraded state where some state information may
+ be retained, but behaves more like a "member".
+
+ All possible state transitions among "member", "root" and
+ "isolated" are allowed.
+
+ On read, the "cpuset.cpus.partition" file can show the following
+ values.
+
+ ============================= =====================================
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
+ "root invalid (<reason>)" Invalid partition root
+ "isolated invalid (<reason>)" Invalid isolated partition root
+ ============================= =====================================
+
+ In the case of an invalid partition root, a descriptive string on
+ why the partition is invalid is included within parentheses.
+
+ For a local partition root to be valid, the following conditions
+ must be met.
+
+ 1) The parent cgroup is a valid partition root.
+ 2) The "cpuset.cpus.exclusive.effective" file cannot be empty,
+ though it may contain offline CPUs.
+ 3) The "cpuset.cpus.effective" cannot be empty unless there is
+ no task associated with this partition.
+
+ For a remote partition root to be valid, all the above conditions
+ except the first one must be met.
+
+ External events like hotplug or changes to "cpuset.cpus" or
+ "cpuset.cpus.exclusive" can cause a valid partition root to
+ become invalid and vice versa. Note that a task cannot be
+ moved to a cgroup with empty "cpuset.cpus.effective".
+
+	A valid non-root parent partition may distribute all its CPUs
+ to its child local partitions when there is no task associated
+ with it.
+
+	Care must be taken when changing a valid partition root to
+	"member", as all its child local partitions, if present, will
+	become invalid, causing disruption to tasks running in those
+	child partitions. These inactivated partitions can be recovered if
+ their parent is switched back to a partition root with a proper
+ value in "cpuset.cpus" or "cpuset.cpus.exclusive".
+
+ Poll and inotify events are triggered whenever the state of
+ "cpuset.cpus.partition" changes. That includes changes caused
+	by writes to "cpuset.cpus.partition", cpu hotplug or other
+ changes that modify the validity status of the partition.
+ This will allow user space agents to monitor unexpected changes
+ to "cpuset.cpus.partition" without the need to do continuous
+ polling.
+
+ A user can pre-configure certain CPUs to an isolated state
+ with load balancing disabled at boot time with the "isolcpus"
+ kernel boot command line option. If those CPUs are to be put
+ into a partition, they have to be used in an isolated partition.
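+
+	As a sketch (the "rt-part" cgroup name and the CPU numbers are
+	only illustrative), an isolated partition can be created directly
+	under the cgroup root with::
+
+	  # cd /sys/fs/cgroup
+	  # echo +cpuset > cgroup.subtree_control
+	  # mkdir rt-part
+	  # echo 2-3 > rt-part/cpuset.cpus
+	  # echo isolated > rt-part/cpuset.cpus.partition
+	  # cat rt-part/cpuset.cpus.partition
+	  isolated
+
+	If the requested CPUs cannot be granted, the final read would
+	instead show "isolated invalid (<reason>)" as described above.
+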
Device controller
@@ -2024,26 +2473,26 @@ existing device files.
Cgroup v2 device controller has no interface files and is implemented
on top of cgroup BPF. To control access to device files, a user may
-create bpf programs of the BPF_CGROUP_DEVICE type and attach them
-to cgroups. On an attempt to access a device file, corresponding
-BPF programs will be executed, and depending on the return value
-the attempt will succeed or fail with -EPERM.
+create bpf programs of type BPF_PROG_TYPE_CGROUP_DEVICE and attach
+them to cgroups with BPF_CGROUP_DEVICE flag. On an attempt to access a
+device file, corresponding BPF programs will be executed, and depending
+on the return value the attempt will succeed or fail with -EPERM.
-A BPF_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx
-structure, which describes the device access attempt: access type
-(mknod/read/write) and device (type, major and minor numbers).
-If the program returns 0, the attempt fails with -EPERM, otherwise
-it succeeds.
+A BPF_PROG_TYPE_CGROUP_DEVICE program takes a pointer to the
+bpf_cgroup_dev_ctx structure, which describes the device access attempt:
+access type (mknod/read/write) and device (type, major and minor numbers).
+If the program returns 0, the attempt fails with -EPERM, otherwise it
+succeeds.
-An example of BPF_CGROUP_DEVICE program may be found in the kernel
-source tree in the tools/testing/selftests/bpf/dev_cgroup.c file.
+An example of BPF_PROG_TYPE_CGROUP_DEVICE program may be found in
+tools/testing/selftests/bpf/progs/dev_cgroup.c in the kernel source tree.
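+
+As an illustrative sketch (the object file name, pin path and cgroup path
+below are assumptions), such a program can be loaded and attached with
+bpftool::
+
+  # bpftool prog load dev_cgroup.bpf.o /sys/fs/bpf/dev_cgroup_prog
+  # bpftool cgroup attach /sys/fs/cgroup/sandbox device pinned /sys/fs/bpf/dev_cgroup_prog
+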
RDMA
----
The "rdma" controller regulates the distribution and accounting of
-of RDMA resources.
+RDMA resources.
RDMA Interface Files
~~~~~~~~~~~~~~~~~~~~
@@ -2106,9 +2555,90 @@ HugeTLB Interface Files
are local to the cgroup i.e. not hierarchical. The file modified event
generated on this file reflects only the local events.
+ hugetlb.<hugepagesize>.numa_stat
+ Similar to memory.numa_stat, it shows the numa information of the
+	hugetlb pages of <hugepagesize> in this cgroup. Only actively
+	in-use hugetlb pages are included. The per-node values are in bytes.
+
Misc
----
+The Miscellaneous cgroup provides the resource limiting and tracking
+mechanism for the scalar resources which cannot be abstracted like the other
+cgroup resources. The controller is enabled by the CONFIG_CGROUP_MISC
+config option.
+
+A resource can be added to the controller via enum misc_res_type{} in the
+include/linux/misc_cgroup.h file and the corresponding name via misc_res_name[]
+in the kernel/cgroup/misc.c file. The provider of the resource must set its
+capacity prior to using the resource by calling misc_cg_set_capacity().
+
+Once a capacity is set, the resource usage can be updated using the charge
+and uncharge APIs. All of the APIs to interact with the misc controller are in
+include/linux/misc_cgroup.h.
+
+Misc Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+The miscellaneous controller provides the following interface files. If two misc resources (res_a and res_b) are registered, then:
+
+ misc.capacity
+ A read-only flat-keyed file shown only in the root cgroup. It shows
+ miscellaneous scalar resources available on the platform along with
+ their quantities::
+
+ $ cat misc.capacity
+ res_a 50
+ res_b 10
+
+ misc.current
+	A read-only flat-keyed file shown in all cgroups. It shows
+	the current usage of the resources in the cgroup and its children::
+
+ $ cat misc.current
+ res_a 3
+ res_b 0
+
+ misc.max
+	A read-write flat-keyed file shown in the non-root cgroups. Allowed
+	maximum usage of the resources in the cgroup and its children::
+
+ $ cat misc.max
+ res_a max
+ res_b 4
+
+ Limit can be set by::
+
+ # echo res_a 1 > misc.max
+
+ Limit can be set to max by::
+
+ # echo res_a max > misc.max
+
+ Limits can be set higher than the capacity value in the misc.capacity
+ file.
+
+ misc.events
+ A read-only flat-keyed file which exists on non-root cgroups. The
+ following entries are defined. Unless specified otherwise, a value
+ change in this file generates a file modified event. All fields in
+ this file are hierarchical.
+
+ max
+ The number of times the cgroup's resource usage was
+ about to go over the max boundary.
+
+Migration and Ownership
+~~~~~~~~~~~~~~~~~~~~~~~
+
+A miscellaneous scalar resource is charged to the cgroup in which it is used
+first, and stays charged to that cgroup until that resource is freed. Migrating
+a process to a different cgroup does not move the charge to the destination
+cgroup where the process has moved.
+
+Others
+------
+
perf_event
~~~~~~~~~~
@@ -2165,7 +2695,7 @@ Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
complete path of the cgroup of a process. In a container setup where
a set of cgroups and namespaces are intended to isolate processes the
"/proc/$PID/cgroup" file may leak potential system level information
-to the isolated processes. For Example::
+to the isolated processes. For example::
# cat /proc/self/cgroup
0::/batchjobs/container_id1
diff --git a/Documentation/admin-guide/cifs/authors.rst b/Documentation/admin-guide/cifs/authors.rst
index b02d6dd6c070..5c1d2f0fa7d1 100644
--- a/Documentation/admin-guide/cifs/authors.rst
+++ b/Documentation/admin-guide/cifs/authors.rst
@@ -5,10 +5,10 @@ Authors
Original Author
---------------
-Steve French (sfrench@samba.org)
+Steve French (smfrench@gmail.com, sfrench@samba.org)
The author wishes to express his appreciation and thanks to:
-Andrew Tridgell (Samba team) for his early suggestions about smb/cifs VFS
+Andrew Tridgell (Samba team) for his early suggestions about SMB/CIFS VFS
improvements. Thanks to IBM for allowing me time and test resources to pursue
this project, to Jim McDonough from IBM (and the Samba Team) for his help, to
the IBM Linux JFS team for explaining many esoteric Linux filesystem features.
@@ -51,7 +51,7 @@ Patch Contributors
- Ronnie Sahlberg (for SMB3 xattr work, bug fixes, and lots of great work on compounding)
- Shirish Pargaonkar (for many ACL patches over the years)
- Sachin Prabhu (many bug fixes, including for reconnect, copy offload and security)
-- Paulo Alcantara
+- Paulo Alcantara (for some excellent work in DFS, and in booting from SMB3)
- Long Li (some great work on RDMA, SMB Direct)
diff --git a/Documentation/admin-guide/cifs/changes.rst b/Documentation/admin-guide/cifs/changes.rst
index 71f2ecb62299..8c42c4de510b 100644
--- a/Documentation/admin-guide/cifs/changes.rst
+++ b/Documentation/admin-guide/cifs/changes.rst
@@ -3,6 +3,7 @@ Changes
=======
See https://wiki.samba.org/index.php/LinuxCIFSKernel for summary
-information (that may be easier to read than parsing the output of
-"git log fs/cifs") about fixes/improvements to CIFS/SMB2/SMB3 support (changes
+information about fixes/improvements to CIFS/SMB2/SMB3 support (changes
to cifs.ko module) by kernel version (and cifs internal module version).
+This may be easier to read than parsing the output of
+"git log fs/smb/client" by release.
diff --git a/Documentation/admin-guide/cifs/introduction.rst b/Documentation/admin-guide/cifs/introduction.rst
index 0b98f672d36f..ffc6e2564dd5 100644
--- a/Documentation/admin-guide/cifs/introduction.rst
+++ b/Documentation/admin-guide/cifs/introduction.rst
@@ -7,19 +7,19 @@ Introduction
protocol which was the successor to the Server Message Block
(SMB) protocol, the native file sharing mechanism for most early
PC operating systems. New and improved versions of CIFS are now
- called SMB2 and SMB3. Use of SMB3 (and later, including SMB3.1.1)
- is strongly preferred over using older dialects like CIFS due to
- security reaasons. All modern dialects, including the most recent,
- SMB3.1.1 are supported by the CIFS VFS module. The SMB3 protocol
- is implemented and supported by all major file servers
- such as all modern versions of Windows (including Windows 2016
- Server), as well as by Samba (which provides excellent
- CIFS/SMB2/SMB3 server support and tools for Linux and many other
- operating systems). Apple systems also support SMB3 well, as
- do most Network Attached Storage vendors, so this network
- filesystem client can mount to a wide variety of systems.
- It also supports mounting to the cloud (for example
- Microsoft Azure), including the necessary security features.
+  called SMB2 and SMB3. Use of SMB3 (and later, including SMB3.1.1,
+  the most current dialect) is strongly preferred over using older
+ dialects like CIFS due to security reasons. All modern dialects,
+ including the most recent, SMB3.1.1, are supported by the CIFS VFS
+ module. The SMB3 protocol is implemented and supported by all major
+ file servers such as Windows (including Windows 2019 Server), as
+ well as by Samba (which provides excellent CIFS/SMB2/SMB3 server
+ support and tools for Linux and many other operating systems).
+ Apple systems also support SMB3 well, as do most Network Attached
+ Storage vendors, so this network filesystem client can mount to a
+ wide variety of systems. It also supports mounting to the cloud
+ (for example Microsoft Azure), including the necessary security
+ features.
The intent of this module is to provide the most advanced network
file system function for SMB3 compliant servers, including advanced
@@ -27,8 +27,8 @@ Introduction
POSIX compliance, secure per-user session establishment, encryption,
high performance safe distributed caching (leases/oplocks), optional packet
signing, large files, Unicode support and other internationalization
- improvements. Since both Samba server and this filesystem client support
- the CIFS Unix extensions (and in the future SMB3 POSIX extensions),
+ improvements. Since both Samba server and this filesystem client support the
+ CIFS Unix extensions, and the Linux client also supports SMB3 POSIX extensions,
the combination can provide a reasonable alternative to other network and
cluster file systems for fileserving in some Linux to Linux environments,
not just in Linux to Windows (or Linux to Mac) environments.
diff --git a/Documentation/admin-guide/cifs/todo.rst b/Documentation/admin-guide/cifs/todo.rst
index 084c25f92dcb..9a65c670774e 100644
--- a/Documentation/admin-guide/cifs/todo.rst
+++ b/Documentation/admin-guide/cifs/todo.rst
@@ -2,7 +2,8 @@
TODO
====
-Version 2.14 December 21, 2018
+As of the 6.7 kernel. See https://wiki.samba.org/index.php/LinuxCIFSKernel
+for a list of features added by release.
A Partial List of Missing Features
==================================
@@ -12,25 +13,27 @@ for visible, important contributions to this module. Here
is a partial list of the known problems and missing features:
a) SMB3 (and SMB3.1.1) missing optional features:
+ multichannel performance optimizations, algorithmic channel selection,
+ directory leases optimizations,
+ support for faster packet signing (GMAC),
+ support for compression over the network,
+ T10 copy offload, i.e. "ODX" (copy chunk, and "Duplicate Extents" ioctl
+ are currently the only two server side copy mechanisms supported)
- - multichannel (started), integration with RDMA
- - directory leases (improved metadata caching), started (root dir only)
- - T10 copy offload ie "ODX" (copy chunk, and "Duplicate Extents" ioctl
- currently the only two server side copy mechanisms supported)
+b) Better optimized compounding and error handling for sparse file support,
+ perhaps addition of new optional SMB3.1.1 fsctls to make collapse range
+ and insert range more atomic
-b) improved sparse file support (fiemap and SEEK_HOLE are implemented
- but additional features would be supportable by the protocol).
+c) Support for SMB3.1.1 over QUIC (and perhaps other socket based protocols
+ like SCTP)
-c) Directory entry caching relies on a 1 second timer, rather than
- using Directory Leases, currently only the root file handle is cached longer
-
-d) quota support (needs minor kernel change since quota calls
- to make it to network filesystems or deviceless filesystems)
+d) quota support (needs minor kernel change since quota calls otherwise
+ won't make it to network filesystems or deviceless filesystems).
e) Additional use cases can be optimized to use "compounding" (e.g.
open/query/close and open/setinfo/close) to reduce the number of
roundtrips to the server and improve performance. Various cases
- (stat, statfs, create, unlink, mkdir) already have been improved by
+ (stat, statfs, create, unlink, mkdir, xattrs) already have been improved by
using compounding but more can be done. In addition we could
significantly reduce redundant opens by using deferred close (with
handle caching leases) and better using reference counters on file
@@ -60,7 +63,9 @@ k) Add tools to take advantage of more smb3 specific ioctls and features
metadata attributes easier from tools (e.g. extending what was done
in smb-info tool).
-l) encrypted file support
+l) encrypted file support (currently the attribute showing the file is
+ encrypted on the server is reported, but changing the attribute is not
+ supported).
m) improved stats gathering tools (perhaps integration with nfsometer?)
to extend and make easier to use what is currently in /proc/fs/cifs/Stats
@@ -69,14 +74,13 @@ n) Add support for claims based ACLs ("DAC")
o) mount helper GUI (to simplify the various configuration options on mount)
-p) Add support for witness protocol (perhaps ioctl to cifs.ko from user space
- tool listening on witness protocol RPC) to allow for notification of share
- move, server failover, and server adapter changes. And also improve other
- failover scenarios, e.g. when client knows multiple DFS entries point to
- different servers, and the server we are connected to has gone down.
+p) Expand support for witness protocol to allow for notification of share
+ move, and server network adapter changes. Currently only notifications by
+ the witness protocol for server move are supported by the Linux client.
q) Allow mount.cifs to be more verbose in reporting errors with dialect
- or unsupported feature errors.
+ or unsupported feature errors. This would now be easier due to the
+ implementation of the new mount API.
r) updating cifs documentation, and user guide.
@@ -87,26 +91,22 @@ t) split cifs and smb3 support into separate modules so legacy (and less
secure) CIFS dialect can be disabled in environments that don't need it
and simplify the code.
-v) POSIX Extensions for SMB3.1.1 (started, create and mkdir support added
- so far).
+v) Additional testing of POSIX Extensions for SMB3.1.1
+
+w) Support for the Mac SMB3.1.1 extensions to improve interop with Apple servers
-w) Add support for additional strong encryption types, and additional spnego
- authentication mechanisms (see MS-SMB2)
+x) Support for additional authentication options (e.g. IAKERB, peer-to-peer
+ Kerberos, SCRAM and others supported by existing servers)
-x) Finish support for SMB3.1.1 compression
+y) Improved tracing, more eBPF trace points, better scripts for performance
+ analysis
Known Bugs
==========
-See http://bugzilla.samba.org - search on product "CifsVFS" for
+See https://bugzilla.samba.org - search on product "CifsVFS" for
current bug list. Also check http://bugzilla.kernel.org (Product = File System, Component = CIFS)
-
-1) existing symbolic links (Windows reparse points) are recognized but
- can not be created remotely. They are implemented for Samba and those that
- support the CIFS Unix extensions, although earlier versions of Samba
- overly restrict the pathnames.
-2) follow_link and readdir code does not follow dfs junctions
- but recognizes them
+and xfstest results e.g. https://wiki.samba.org/index.php/Xfstest-results-smb3
Misc testing to do
==================
diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst
index d3fb67b8a976..aa8290a29dc8 100644
--- a/Documentation/admin-guide/cifs/usage.rst
+++ b/Documentation/admin-guide/cifs/usage.rst
@@ -16,8 +16,7 @@ standard for interoperating between Macs and Windows and major NAS appliances.
Please see
MS-SMB2 (for detailed SMB2/SMB3/SMB3.1.1 protocol specification)
-http://protocolfreedom.org/ and
-http://samba.org/samba/PFIF/
+or https://samba.org/samba/PFIF/
for more details.
@@ -32,7 +31,7 @@ Build instructions
For Linux:
-1) Download the kernel (e.g. from http://www.kernel.org)
+1) Download the kernel (e.g. from https://www.kernel.org)
and change directory into the top of the kernel directory tree
(e.g. /usr/src/linux-2.5.73)
2) make menuconfig (or make xconfig)
@@ -46,7 +45,7 @@ Installation instructions
If you have built the CIFS vfs as module (successfully) simply
type ``make modules_install`` (or if you prefer, manually copy the file to
-the modules directory e.g. /lib/modules/2.4.10-4GB/kernel/fs/cifs/cifs.ko).
+the modules directory e.g. /lib/modules/6.3.0-060300-generic/kernel/fs/smb/client/cifs.ko).
If you have built the CIFS vfs into the kernel itself, follow the instructions
for your distribution on how to install a new kernel (usually you
@@ -67,24 +66,24 @@ If cifs is built as a module, then the size and number of network buffers
and maximum number of simultaneous requests to one server can be configured.
Changing these from their defaults is not recommended. By executing modinfo::
- modinfo kernel/fs/cifs/cifs.ko
+ modinfo <path to cifs.ko>
-on kernel/fs/cifs/cifs.ko the list of configuration changes that can be made
+on kernel/fs/smb/client/cifs.ko the list of configuration changes that can be made
at module initialization time (by running insmod cifs.ko) can be seen.
Recommendations
===============
-To improve security the SMB2.1 dialect or later (usually will get SMB3) is now
+To improve security the SMB2.1 dialect or later (usually will get SMB3.1.1) is now
the new default. To use old dialects (e.g. to mount Windows XP) use "vers=1.0"
on mount (or vers=2.0 for Windows Vista). Note that the CIFS (vers=1.0) is
much older and less secure than the default dialect SMB3 which includes
many advanced security features such as downgrade attack detection
and encrypted shares and stronger signing and authentication algorithms.
There are additional mount options that may be helpful for SMB3 to get
-improved POSIX behavior (NB: can use vers=3.0 to force only SMB3, never 2.1):
+improved POSIX behavior (NB: can use vers=3 to force SMB3 or later, never 2.1):
- ``mfsymlinks`` and ``cifsacl`` and ``idsfromsid``
+ ``mfsymlinks`` and either ``cifsacl`` or ``modefromsid`` (usually with ``idsfromsid``)
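+
+As a hypothetical example (server, share, mount point and user name are
+placeholders), such a mount might look like::
+
+  mount -t cifs //server/share /mnt/share -o vers=3.1.1,mfsymlinks,modefromsid,idsfromsid,username=testuser
+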
Allowing User Mounts
====================
@@ -116,7 +115,7 @@ later source tree in docs/manpages/mount.cifs.8
Allowing User Unmounts
======================
-To permit users to ummount directories that they have user mounted (see above),
+To permit users to unmount directories that they have user mounted (see above),
the utility umount.cifs may be used. It may be invoked directly, or if
umount.cifs is placed in /sbin, umount can invoke the cifs umount helper
(at least for most versions of the umount utility) for umount of cifs
@@ -198,7 +197,7 @@ that is ignored by local server applications and non-cifs clients and that will
not be traversed by the Samba server). This is opaque to the Linux client
application using the cifs vfs. Absolute symlinks will work to Samba 3.0.5 or
later, but only for remote clients using the CIFS Unix extensions, and will
-be invisbile to Windows clients and typically will not affect local
+be invisible to Windows clients and typically will not affect local
applications running on the same server as Samba.
Use instructions
@@ -268,7 +267,7 @@ would be forbidden for Windows/CIFS semantics) as long as the server is
configured for Unix Extensions (and the client has not disabled
/proc/fs/cifs/LinuxExtensionsEnabled). In addition the mount option
``mapposix`` can be used on CIFS (vers=1.0) to force the mapping of
-illegal Windows/NTFS/SMB characters to a remap range (this mount parm
+illegal Windows/NTFS/SMB characters to a remap range (this mount parameter
is the default for SMB3). This remap (``mapposix``) range is also
compatible with Mac (and "Services for Mac" on some older Windows).
@@ -400,7 +399,7 @@ A partial list of the supported mount options follows:
sep
if first mount option (after the -o), overrides
the comma as the separator between the mount
- parms. e.g.::
+ parameters. e.g.::
-o user=myname,password=mypassword,domain=mydom
@@ -715,6 +714,8 @@ DebugData Displays information about active CIFS sessions and
version.
Stats Lists summary resource usage information as well as per
share statistics.
+open_files List all the open file handles on all active SMB sessions.
+mount_params List of all mount parameters available for the module
======================= =======================================================
Configuration pseudo-files:
@@ -734,10 +735,9 @@ SecurityFlags Flags which control security negotiation and
using weaker password hashes is 0x37037 (lanman,
plaintext, ntlm, ntlmv2, signing allowed). Some
SecurityFlags require the corresponding menuconfig
- options to be enabled (lanman and plaintext require
- CONFIG_CIFS_WEAK_PW_HASH for example). Enabling
- plaintext authentication currently requires also
- enabling lanman authentication in the security flags
+ options to be enabled. Enabling plaintext
+ authentication currently requires also enabling
+ lanman authentication in the security flags
because the cifs module only supports sending
plaintext passwords using the older lanman dialect
form of the session setup SMB. (e.g. for authentication
@@ -766,7 +766,7 @@ cifsFYI If set to non-zero value, additional debug information
Some debugging statements are not compiled into the
cifs kernel unless CONFIG_CIFS_DEBUG2 is enabled in the
kernel configuration. cifsFYI may be set to one or
- nore of the following flags (7 sets them all)::
+ more of the following flags (7 sets them all)::
+-----------------------------------------------+------+
| log cifs informational messages | 0x01 |
@@ -795,6 +795,8 @@ LinuxExtensionsEnabled If set to one then the client will attempt to
support and want to map the uid and gid fields
to values supplied at mount (rather than the
actual values, then set this to zero. (default 1)
+dfscache List the content of the DFS cache.
+ If set to 0, the client will clear the cache.
======================= =======================================================
These experimental features and tracing can be enabled by changing flags in
@@ -831,7 +833,7 @@ the active sessions and the shares that are mounted.
Enabling Kerberos (extended security) works but requires version 1.2 or later
of the helper program cifs.upcall to be present and to be configured in the
/etc/request-key.conf file. The cifs.upcall helper program is from the Samba
-project(http://www.samba.org). NTLM and NTLMv2 and LANMAN support do not
+project(https://www.samba.org). NTLM and NTLMv2 and LANMAN support do not
require this helper. Note that NTLMv2 security (which does not require the
cifs.upcall helper program), instead of using Kerberos, is sufficient for
some use cases.
@@ -857,12 +859,17 @@ CIFS kernel module parameters
These module parameters can be specified or modified either during the time of
module loading or during the runtime by using the interface::
- /proc/module/cifs/parameters/<param>
+ /sys/module/cifs/parameters/<param>
i.e.::
echo "value" > /sys/module/cifs/parameters/<param>
+More detailed descriptions of the available module parameters and their values
+can be seen by doing:
+
+ modinfo cifs (or modinfo smb3)
+
================= ==========================================================
1. enable_oplocks Enable or disable oplocks. Oplocks are enabled by default.
[Y/y/1]. To disable use any of [N/n/0].
diff --git a/Documentation/admin-guide/cifs/winucase_convert.pl b/Documentation/admin-guide/cifs/winucase_convert.pl
index 322a9c833f23..993186beea20 100755
--- a/Documentation/admin-guide/cifs/winucase_convert.pl
+++ b/Documentation/admin-guide/cifs/winucase_convert.pl
@@ -16,7 +16,7 @@
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
-# along with this program. If not, see <http://www.gnu.org/licenses/>.
+# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
while(<>) {
diff --git a/Documentation/admin-guide/cpu-load.rst b/Documentation/admin-guide/cpu-load.rst
index ebdecf864080..21a984337080 100644
--- a/Documentation/admin-guide/cpu-load.rst
+++ b/Documentation/admin-guide/cpu-load.rst
@@ -61,50 +61,53 @@ will lead to quite erratic information inside ``/proc/stat``::
static volatile sig_atomic_t stop;
- static void sighandler (int signr)
+ static void sighandler(int signr)
{
- (void) signr;
- stop = 1;
+ (void) signr;
+ stop = 1;
}
+
static unsigned long hog (unsigned long niters)
{
- stop = 0;
- while (!stop && --niters);
- return niters;
+ stop = 0;
+ while (!stop && --niters);
+ return niters;
}
+
int main (void)
{
- int i;
- struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
- .it_value = { .tv_sec = 0, .tv_usec = 1 } };
- sigset_t set;
- unsigned long v[HIST];
- double tmp = 0.0;
- unsigned long n;
- signal (SIGALRM, &sighandler);
- setitimer (ITIMER_REAL, &it, NULL);
-
- hog (ULONG_MAX);
- for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
- for (i = 0; i < HIST; ++i) tmp += v[i];
- tmp /= HIST;
- n = tmp - (tmp / 3.0);
-
- sigemptyset (&set);
- sigaddset (&set, SIGALRM);
-
- for (;;) {
- hog (n);
- sigwait (&set, &i);
- }
- return 0;
+ int i;
+ struct itimerval it = {
+ .it_interval = { .tv_sec = 0, .tv_usec = 1 },
+ .it_value = { .tv_sec = 0, .tv_usec = 1 } };
+ sigset_t set;
+ unsigned long v[HIST];
+ double tmp = 0.0;
+ unsigned long n;
+ signal(SIGALRM, &sighandler);
+ setitimer(ITIMER_REAL, &it, NULL);
+
+ hog (ULONG_MAX);
+ for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog(ULONG_MAX);
+ for (i = 0; i < HIST; ++i) tmp += v[i];
+ tmp /= HIST;
+ n = tmp - (tmp / 3.0);
+
+ sigemptyset(&set);
+ sigaddset(&set, SIGALRM);
+
+ for (;;) {
+ hog(n);
+ sigwait(&set, &i);
+ }
+ return 0;
}
References
----------
-- http://lkml.org/lkml/2007/2/12/6
+- https://lore.kernel.org/r/loom.20070212T063225-663@post.gmane.org
- Documentation/filesystems/proc.rst (1.8)
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index b90dafcc8237..d29cacc9b3c3 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -2,105 +2,28 @@
How CPU topology info is exported via sysfs
===========================================
-Export CPU topology info via sysfs. Items (attributes) are similar
-to /proc/cpuinfo output of some architectures. They reside in
-/sys/devices/system/cpu/cpuX/topology/:
-
-physical_package_id:
-
- physical package id of cpuX. Typically corresponds to a physical
- socket number, but the actual value is architecture and platform
- dependent.
-
-die_id:
-
- the CPU die ID of cpuX. Typically it is the hardware platform's
- identifier (rather than the kernel's). The actual value is
- architecture and platform dependent.
-
-core_id:
-
- the CPU core ID of cpuX. Typically it is the hardware platform's
- identifier (rather than the kernel's). The actual value is
- architecture and platform dependent.
-
-book_id:
-
- the book ID of cpuX. Typically it is the hardware platform's
- identifier (rather than the kernel's). The actual value is
- architecture and platform dependent.
-
-drawer_id:
-
- the drawer ID of cpuX. Typically it is the hardware platform's
- identifier (rather than the kernel's). The actual value is
- architecture and platform dependent.
-
-core_cpus:
-
- internal kernel map of CPUs within the same core.
- (deprecated name: "thread_siblings")
-
-core_cpus_list:
-
- human-readable list of CPUs within the same core.
- (deprecated name: "thread_siblings_list");
-
-package_cpus:
-
- internal kernel map of the CPUs sharing the same physical_package_id.
- (deprecated name: "core_siblings")
-
-package_cpus_list:
-
- human-readable list of CPUs sharing the same physical_package_id.
- (deprecated name: "core_siblings_list")
-
-die_cpus:
-
- internal kernel map of CPUs within the same die.
-
-die_cpus_list:
-
- human-readable list of CPUs within the same die.
-
-book_siblings:
-
- internal kernel map of cpuX's hardware threads within the same
- book_id.
-
-book_siblings_list:
-
- human-readable list of cpuX's hardware threads within the same
- book_id.
-
-drawer_siblings:
-
- internal kernel map of cpuX's hardware threads within the same
- drawer_id.
-
-drawer_siblings_list:
-
- human-readable list of cpuX's hardware threads within the same
- drawer_id.
+CPU topology info is exported via sysfs. Items (attributes) are similar
+to /proc/cpuinfo output of some architectures. They reside in
+/sys/devices/system/cpu/cpuX/topology/. Please refer to the ABI file:
+Documentation/ABI/stable/sysfs-devices-system-cpu.
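+
+For example, the package and core placement of CPU 0 can be read directly
+from sysfs (the values shown are machine dependent)::
+
+  $ cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
+  0
+  $ cat /sys/devices/system/cpu/cpu0/topology/core_cpus_list
+  0,4
+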
Architecture-neutral, drivers/base/topology.c, exports these attributes.
-However, the book and drawer related sysfs files will only be created if
-CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are selected, respectively.
-
-CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are currently only used on s390,
-where they reflect the cpu and cache hierarchy.
+However, the die, cluster, book, and drawer hierarchy related sysfs files will
+only be created if an architecture provides the related macros as described
+below.
For an architecture to support this feature, it must define some of
these macros in include/asm-XXX/topology.h::
#define topology_physical_package_id(cpu)
#define topology_die_id(cpu)
+ #define topology_cluster_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_drawer_id(cpu)
#define topology_sibling_cpumask(cpu)
#define topology_core_cpumask(cpu)
+ #define topology_cluster_cpumask(cpu)
#define topology_die_cpumask(cpu)
#define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
@@ -116,15 +39,16 @@ not defined by include/asm-XXX/topology.h:
1) topology_physical_package_id: -1
2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
-
-For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
-default definitions for topology_book_id() and topology_book_cpumask().
-For architectures that don't support drawers (CONFIG_SCHED_DRAWER) there are
-no default definitions for topology_drawer_id() and topology_drawer_cpumask().
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_book_id: -1
+6) topology_drawer_id: -1
+7) topology_sibling_cpumask: just the given CPU
+8) topology_core_cpumask: just the given CPU
+9) topology_cluster_cpumask: just the given CPU
+10) topology_die_cpumask: just the given CPU
+11) topology_book_cpumask: just the given CPU
+12) topology_drawer_cpumask: just the given CPU
Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files. The internal
@@ -135,9 +59,9 @@ source for the output is in brackets ("[]").
[NR_CPUS-1]
offline: CPUs that are not online because they have been
- HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit
- of CPUs allowed by the kernel configuration (kernel_max
- above). [~cpu_online_mask + cpus >= NR_CPUS]
+ HOTPLUGGED off or exceed the limit of CPUs allowed by the
+ kernel configuration (kernel_max above).
+ [~cpu_online_mask + cpus >= NR_CPUS]
online: CPUs that are online and being scheduled [cpu_online_mask]
@@ -173,5 +97,5 @@ online.)::
possible: 0-127
present: 0-3
-See cpu-hotplug.txt for the possible_cpus=NUM kernel start parameter
-as well as more information on the various cpumasks.
+See Documentation/core-api/cpu_hotplug.rst for the possible_cpus=NUM
+kernel start parameter as well as more information on the various cpumasks.
diff --git a/Documentation/admin-guide/dell_rbu.rst b/Documentation/admin-guide/dell_rbu.rst
index 8d70e1fc9f9d..2196caf1b939 100644
--- a/Documentation/admin-guide/dell_rbu.rst
+++ b/Documentation/admin-guide/dell_rbu.rst
@@ -26,7 +26,7 @@ Please go to http://support.dell.com register and you can find info on
OpenManage and Dell Update packages (DUP).
Libsmbios can also be used to update BIOS on Dell systems go to
-http://linux.dell.com/libsmbios/ for details.
+https://linux.dell.com/libsmbios/ for details.
Dell_RBU driver supports BIOS update using the monolithic image and packetized
image methods. In case of monolithic the driver allocates a contiguous chunk
diff --git a/Documentation/admin-guide/device-mapper/cache-policies.rst b/Documentation/admin-guide/device-mapper/cache-policies.rst
index b17fe352fc41..13da4d831d46 100644
--- a/Documentation/admin-guide/device-mapper/cache-policies.rst
+++ b/Documentation/admin-guide/device-mapper/cache-policies.rst
@@ -70,7 +70,7 @@ the entries (each hotspot block covers a larger area than a single
cache block).
All this means smq uses ~25bytes per cache block. Still a lot of
-memory, but a substantial improvement nontheless.
+memory, but a substantial improvement nonetheless.
Level balancing
^^^^^^^^^^^^^^^
diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst
index 8f4a3f889d43..aa2d04d95df6 100644
--- a/Documentation/admin-guide/device-mapper/dm-crypt.rst
+++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst
@@ -46,7 +46,7 @@ Parameters::
capi:authenc(hmac(sha256),xts(aes))-random
capi:rfc7539(chacha20,poly1305)-random
- The /proc/crypto contains a list of curently loaded crypto modes.
+ The /proc/crypto contains a list of currently loaded crypto modes.
<key>
Key used for encryption. It is encoded either as a hexadecimal number
@@ -67,7 +67,7 @@ Parameters::
the value passed in <key_size>.
<key_type>
- Either 'logon' or 'user' kernel key type.
+ Either 'logon', 'user', 'encrypted' or 'trusted' kernel key type.
<key_description>
The kernel keyring key description crypt target should look for
@@ -92,7 +92,7 @@ Parameters::
<#opt_params>
Number of optional parameters. If there are no optional parameters,
- the optional paramaters section can be skipped or #opt_params can be zero.
+ the optional parameters section can be skipped or #opt_params can be zero.
Otherwise #opt_params is the number of following arguments.
Example of optional parameters section:
@@ -121,6 +121,14 @@ submit_from_crypt_cpus
thread because it benefits CFQ to have writes submitted using the
same context.
+no_read_workqueue
+ Bypass dm-crypt internal workqueue and process read requests synchronously.
+
+no_write_workqueue
+ Bypass dm-crypt internal workqueue and process write requests synchronously.
+ This option is automatically enabled for host-managed zoned block devices
+ (e.g. host-managed SMR hard-disks).
+
integrity:<bytes>:<type>
The device requires additional <bytes> metadata per-sector stored
in per-bio integrity structure. This metadata must by provided
diff --git a/Documentation/admin-guide/device-mapper/dm-dust.rst b/Documentation/admin-guide/device-mapper/dm-dust.rst
index b6e7e7ead831..e35ec8cd2f88 100644
--- a/Documentation/admin-guide/device-mapper/dm-dust.rst
+++ b/Documentation/admin-guide/device-mapper/dm-dust.rst
@@ -69,10 +69,11 @@ Create the dm-dust device:
$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 4096'
Check the status of the read behavior ("bypass" indicates that all I/O
-will be passed through to the underlying device)::
+will be passed through to the underlying device; "verbose" indicates that
+bad block additions, removals, and remaps will be verbosely logged)::
$ sudo dmsetup status dust1
- 0 33552384 dust 252:17 bypass
+ 0 33552384 dust 252:17 bypass verbose
$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=128 iflag=direct
128+0 records in
@@ -164,7 +165,7 @@ following message command::
A message will print with the number of bad blocks currently
configured on the device::
- kernel: device-mapper: dust: countbadblocks: 895 badblock(s) found
+ countbadblocks: 895 badblock(s) found
Querying for specific bad blocks
--------------------------------
@@ -176,11 +177,11 @@ following message command::
The following message will print if the block is in the list::
- device-mapper: dust: queryblock: block 72 found in badblocklist
+ dust_query_block: block 72 found in badblocklist
The following message will print if the block is not in the list::
- device-mapper: dust: queryblock: block 72 not found in badblocklist
+ dust_query_block: block 72 not found in badblocklist
The "queryblock" message command will work in both the "enabled"
and "disabled" modes, allowing the verification of whether a block
@@ -198,12 +199,28 @@ following message command::
After clearing the bad block list, the following message will appear::
- kernel: device-mapper: dust: clearbadblocks: badblocks cleared
+ dust_clear_badblocks: badblocks cleared
If there were no bad blocks to clear, the following message will
appear::
- kernel: device-mapper: dust: clearbadblocks: no badblocks found
+ dust_clear_badblocks: no badblocks found
+
+Listing the bad block list
+--------------------------
+
+To list all bad blocks in the bad block list (using an example device
+with blocks 1 and 2 in the bad block list), run the following message
+command::
+
+ $ sudo dmsetup message dust1 0 listbadblocks
+ 1
+ 2
+
+If there are no bad blocks in the bad block list, the command will
+execute with no output::
+
+ $ sudo dmsetup message dust1 0 listbadblocks
Message commands list
---------------------
@@ -223,6 +240,7 @@ Single argument message commands::
countbadblocks
clearbadblocks
+ listbadblocks
disable
enable
quiet
diff --git a/Documentation/admin-guide/device-mapper/dm-ebs.rst b/Documentation/admin-guide/device-mapper/dm-ebs.rst
index 534fa38e8862..c09f66db5621 100644
--- a/Documentation/admin-guide/device-mapper/dm-ebs.rst
+++ b/Documentation/admin-guide/device-mapper/dm-ebs.rst
@@ -31,7 +31,7 @@ Mandatory parameters:
Optional parameter:
- <underyling sectors>:
+ <underlying sectors>:
Number of sectors defining the logical block size of <dev path>.
2^N supported, e.g. 8 = emulate 8 sectors of 512 bytes = 4KiB.
If not provided, the logical block size of <dev path> will be used.
diff --git a/Documentation/admin-guide/device-mapper/dm-flakey.rst b/Documentation/admin-guide/device-mapper/dm-flakey.rst
index 86138735879d..f967c5fea219 100644
--- a/Documentation/admin-guide/device-mapper/dm-flakey.rst
+++ b/Documentation/admin-guide/device-mapper/dm-flakey.rst
@@ -39,6 +39,10 @@ Optional feature parameters:
If no feature parameters are present, during the periods of
unreliability, all I/O returns errors.
+ error_reads:
+ All read I/O is failed with an error signalled.
+ Write I/O is handled correctly.
+
drop_writes:
All write I/O is silently ignored.
Read I/O is handled correctly.
@@ -63,6 +67,16 @@ Optional feature parameters:
Perform the replacement only if bio->bi_opf has all the
selected flags set.
+ random_read_corrupt <probability>
+	During <down interval>, replace a random byte in a read bio
+	with a random value. The probability is an integer between
+	0 and 1000000000, meaning 0% to 100% probability of corruption.
+
+ random_write_corrupt <probability>
+	During <down interval>, replace a random byte in a write bio
+	with a random value. The probability is an integer between
+	0 and 1000000000, meaning 0% to 100% probability of corruption.
+
Examples:
Replaces the 32nd byte of READ bios with the value 1::
diff --git a/Documentation/admin-guide/device-mapper/dm-ima.rst b/Documentation/admin-guide/device-mapper/dm-ima.rst
new file mode 100644
index 000000000000..a4aa50a828e0
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-ima.rst
@@ -0,0 +1,715 @@
+======
+dm-ima
+======
+
+For a given system, various external services/infrastructure tools
+(including the attestation service) interact with it, both during
+setup and during the rest of the system's run-time. They share sensitive data
+and/or execute critical workload on that system. The external services
+may want to verify the current run-time state of the relevant kernel
+subsystems before fully trusting the system with business-critical
+data/workload.
+
+Device mapper plays a critical role on a given system by providing
+various important functionalities to the block devices using various
+target types like crypt, verity, integrity etc. Each of these target
+types’ functionalities can be configured with various attributes.
+The attributes chosen to configure these target types can significantly
+impact the security profile of the block device, and in-turn, of the
+system itself. For instance, the type of encryption algorithm and the
+key size determines the strength of encryption for a given block device.
+
+Therefore, verifying the current state of various block devices as well
+as their various target attributes is crucial for external services before
+fully trusting the system with business-critical data/workload.
+
+IMA kernel subsystem provides the necessary functionality for
+device mapper to measure the state and configuration of
+various block devices -
+
+- by device mapper itself, from within the kernel,
+- in a tamper resistant way,
+- and re-measured - triggered on state/configuration change.
+
+Setting the IMA Policy:
+=======================
+For IMA to measure the data on a given system, the IMA policy on the
+system needs to be updated to have the following line, and the system needs
+to be restarted for the measurements to take effect.
+
+::
+
+ /etc/ima/ima-policy
+ measure func=CRITICAL_DATA label=device-mapper template=ima-buf
+
+The measurements will be reflected in the IMA logs, which are located at:
+
+::
+
+ /sys/kernel/security/integrity/ima/ascii_runtime_measurements
+ /sys/kernel/security/integrity/ima/binary_runtime_measurements
+
+The IMA ASCII measurement log has the following format:
+
+::
+
+ <PCR> <TEMPLATE_DATA_DIGEST> <TEMPLATE_NAME> <TEMPLATE_DATA>
+
+ PCR := Platform Configuration Register, in which the values are registered.
+ This is applicable if TPM chip is in use.
+
+ TEMPLATE_DATA_DIGEST := Template data digest of the IMA record.
+ TEMPLATE_NAME := Template name that registered the integrity value (e.g. ima-buf).
+
+ TEMPLATE_DATA := <ALG> ":" <EVENT_DIGEST> <EVENT_NAME> <EVENT_DATA>
+ It contains data for the specific event to be measured,
+ in a given template data format.
+
+ ALG := Algorithm to compute event digest
+ EVENT_DIGEST := Digest of the event data
+ EVENT_NAME := Description of the event (e.g. 'dm_table_load').
+ EVENT_DATA := The event data to be measured.
+
+|
+
+| *NOTE #1:*
+| The DM target data measured by IMA subsystem can alternatively
+ be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with
+ DM_TABLE_STATUS_CMD.
+
+|
+
+| *NOTE #2:*
+| The Kernel configuration CONFIG_IMA_DISABLE_HTABLE allows measurement of duplicate records.
+| To support recording duplicate IMA events in the IMA log, the Kernel needs to be configured with
+ CONFIG_IMA_DISABLE_HTABLE=y.
+
+Supported Device States:
+========================
+The following device state changes will trigger IMA measurements:
+
+ 1. Table load
+ #. Device resume
+ #. Device remove
+ #. Table clear
+ #. Device rename
+
+1. Table load:
+---------------
+When a new table is loaded in a device's inactive table slot,
+the device information and target specific details from the
+targets in the table are measured.
+
+The IMA measurement log has the following format for 'dm_table_load':
+
+::
+
+ EVENT_NAME := "dm_table_load"
+ EVENT_DATA := <dm_version_str> ";" <device_metadata> ";" <table_load_data>
+
+ dm_version_str := "dm_version=" <N> "." <N> "." <N>
+ Same as Device Mapper driver version.
+ device_metadata := <device_name> "," <device_uuid> "," <device_major> "," <device_minor> ","
+ <minor_count> "," <num_device_targets> ";"
+
+ device_name := "name=" <dm-device-name>
+ device_uuid := "uuid=" <dm-device-uuid>
+ device_major := "major=" <N>
+ device_minor := "minor=" <N>
+ minor_count := "minor_count=" <N>
+ num_device_targets := "num_targets=" <N>
+ dm-device-name := Name of the device. If it contains special characters like '\', ',', ';',
+ they are prefixed with '\'.
+ dm-device-uuid := UUID of the device. If it contains special characters like '\', ',', ';',
+ they are prefixed with '\'.
+
+ table_load_data := <target_data>
+ Represents the data (as name=value pairs) from various targets in the table,
+ which is being loaded into the DM device's inactive table slot.
+ target_data := <target_data_row> | <target_data><target_data_row>
+
+ target_data_row := <target_index> "," <target_begin> "," <target_len> "," <target_name> ","
+ <target_version> "," <target_attributes> ";"
+ target_index := "target_index=" <N>
+ Represents the nth target in the table (from 0 to N-1 targets specified in <num_device_targets>).
+ If all the data for N targets doesn't fit in the given buffer, then the data that fits
+ in the buffer (say from target 0 to x) is measured in a given IMA event.
+ The remaining data from targets x+1 to N-1 is measured in the subsequent IMA events,
+ with the same format as that of 'dm_table_load'
+ i.e. <dm_version_str> ";" <device_metadata> ";" <table_load_data>.
+
+ target_begin := "target_begin=" <N>
+ target_len := "target_len=" <N>
+ target_name := Name of the target, e.g. 'linear', 'crypt', 'integrity'.
+ The targets that are supported for IMA measurements are documented below in the
+ 'Supported targets' section.
+ target_version := "target_version=" <N> "." <N> "." <N>
+ target_attributes := Data containing comma separated list of name=value pairs of target specific attributes.
+
+ For instance, if a linear device is created with the following table entries,
+ # dmsetup create linear1
+ 0 2 linear /dev/loop0 512
+ 2 2 linear /dev/loop0 512
+ 4 2 linear /dev/loop0 512
+ 6 2 linear /dev/loop0 512
+
+ Then the IMA ASCII measurement log will have the following entry:
+ (converted from ASCII to text for readability)
+
+ 10 a8c5ff755561c7a28146389d1514c318592af49a ima-buf sha256:4d73481ecce5eadba8ab084640d85bb9ca899af4d0a122989252a76efadc5b72
+ dm_table_load
+ dm_version=4.45.0;
+ name=linear1,uuid=,major=253,minor=0,minor_count=1,num_targets=4;
+ target_index=0,target_begin=0,target_len=2,target_name=linear,target_version=1.4.0,device_name=7:0,start=512;
+ target_index=1,target_begin=2,target_len=2,target_name=linear,target_version=1.4.0,device_name=7:0,start=512;
+ target_index=2,target_begin=4,target_len=2,target_name=linear,target_version=1.4.0,device_name=7:0,start=512;
+ target_index=3,target_begin=6,target_len=2,target_name=linear,target_version=1.4.0,device_name=7:0,start=512;
+
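+The event data of a 'dm_table_load' record can be broken down along the
+grammar above. The following Python sketch is illustrative only; it does not
+handle device names or UUIDs containing escaped ';' or ',' characters, and
+the function name is hypothetical::
+
+ def parse_dm_table_load(event_data):
+     # "<dm_version_str>;<device_metadata>;<target_data_row>;..."
+     sections = [s for s in event_data.split(';') if s]
+     dm_version = sections[0]                      # e.g. "dm_version=4.45.0"
+     device = dict(kv.split('=', 1) for kv in sections[1].split(','))
+     targets = [dict(kv.split('=', 1) for kv in row.split(','))
+                for row in sections[2:]]
+     return dm_version, device, targets
+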
+2. Device resume:
+------------------
+When a suspended device is resumed, the device information and the hash of the
+data from the previous load of the active table are measured.
+
+The IMA measurement log has the following format for 'dm_device_resume':
+
+::
+
+ EVENT_NAME := "dm_device_resume"
+ EVENT_DATA := <dm_version_str> ";" <device_metadata> ";" <active_table_hash> ";" <current_device_capacity> ";"
+
+ dm_version_str := As described in the 'Table load' section above.
+ device_metadata := As described in the 'Table load' section above.
+ active_table_hash := "active_table_hash=" <table_hash_alg> ":" <table_hash>
+ Represents the hash of the IMA data being measured for the
+ active table for the device.
+ table_hash_alg := Algorithm used to compute the hash.
+ table_hash := Hash of the (<dm_version_str> ";" <device_metadata> ";" <table_load_data> ";")
+ as described in the 'dm_table_load' above.
+ Note: If the table_load data spans across multiple IMA 'dm_table_load'
+ events for a given device, the hash is computed combining all the event data
+ i.e. (<dm_version_str> ";" <device_metadata> ";" <table_load_data> ";")
+ across all those events.
+ current_device_capacity := "current_device_capacity=" <N>
+
+ For instance, if a linear device is resumed with the following command,
+ #dmsetup resume linear1
+
+ then the IMA ASCII measurement log will have an entry with:
+ (converted from ASCII to text for readability)
+
+ 10 56c00cc062ffc24ccd9ac2d67d194af3282b934e ima-buf sha256:e7d12c03b958b4e0e53e7363a06376be88d98a1ac191fdbd3baf5e4b77f329b6
+ dm_device_resume
+ dm_version=4.45.0;
+ name=linear1,uuid=,major=253,minor=0,minor_count=1,num_targets=4;
+ active_table_hash=sha256:4d73481ecce5eadba8ab084640d85bb9ca899af4d0a122989252a76efadc5b72;current_device_capacity=8;
+
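+Based on the description above, an external verifier can recompute the
+expected active_table_hash by hashing the concatenated event data of all the
+'dm_table_load' events recorded for the device. A minimal sketch (assuming
+sha256, as in the examples, and a hypothetical helper name)::
+
+ import hashlib
+
+ def expected_active_table_hash(table_load_event_data):
+     # table_load_event_data: decoded EVENT_DATA strings of every
+     # 'dm_table_load' event for the device, in the order they were logged
+     combined = ''.join(table_load_event_data)
+     return hashlib.sha256(combined.encode()).hexdigest()
+
+In the single-event example above, this value matches both the event digest
+of the 'dm_table_load' record and the active_table_hash reported in the
+'dm_device_resume' record.
+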
+3. Device remove:
+------------------
+When a device is removed, the device information and a sha256 hash of the
+data from an active and inactive table are measured.
+
+The IMA measurement log has the following format for 'dm_device_remove':
+
+::
+
+ EVENT_NAME := "dm_device_remove"
+ EVENT_DATA := <dm_version_str> ";" <device_active_metadata> ";" <device_inactive_metadata> ";"
+ <active_table_hash> "," <inactive_table_hash> "," <remove_all> ";" <current_device_capacity> ";"
+
+ dm_version_str := As described in the 'Table load' section above.
+ device_active_metadata := Device metadata that reflects the currently loaded active table.
+ The format is the same as 'device_metadata' described in the 'Table load' section above.
+ device_inactive_metadata := Device metadata that reflects the inactive table.
+ The format is the same as 'device_metadata' described in the 'Table load' section above.
+ active_table_hash := Hash of the currently loaded active table.
+ The format is the same as 'active_table_hash' described in the 'Device resume' section above.
+ inactive_table_hash := Hash of the inactive table.
+ The format is the same as 'active_table_hash' described in the 'Device resume' section above.
+ remove_all := "remove_all=" <yes_no>
+ yes_no := "y" | "n"
+ current_device_capacity := "current_device_capacity=" <N>
+
+ For instance, if a linear device is removed with the following command,
+ #dmsetup remove l1
+
+ then the IMA ASCII measurement log will have the following entry:
+ (converted from ASCII to text for readability)
+
+ 10 790e830a3a7a31590824ac0642b3b31c2d0e8b38 ima-buf sha256:ab9f3c959367a8f5d4403d6ce9c3627dadfa8f9f0e7ec7899299782388de3840
+ dm_device_remove
+ dm_version=4.45.0;
+ device_active_metadata=name=l1,uuid=,major=253,minor=2,minor_count=1,num_targets=2;
+ device_inactive_metadata=name=l1,uuid=,major=253,minor=2,minor_count=1,num_targets=1;
+ active_table_hash=sha256:4a7e62efaebfc86af755831998b7db6f59b60d23c9534fb16a4455907957953a,
+ inactive_table_hash=sha256:9d79c175bc2302d55a183e8f50ad4bafd60f7692fd6249e5fd213e2464384b86,remove_all=n;
+ current_device_capacity=2048;
+
+4. Table clear:
+----------------
+When an inactive table is cleared from the device, the device information and a sha256 hash of the
+data from an inactive table are measured.
+
+The IMA measurement log has the following format for 'dm_table_clear':
+
+::
+
+ EVENT_NAME := "dm_table_clear"
+ EVENT_DATA := <dm_version_str> ";" <device_inactive_metadata> ";" <inactive_table_hash> ";" <current_device_capacity> ";"
+
+ dm_version_str := As described in the 'Table load' section above.
+ device_inactive_metadata := Device metadata that was captured at the load time of the inactive table being cleared.
+ The format is the same as 'device_metadata' described in the 'Table load' section above.
+ inactive_table_hash := Hash of the inactive table being cleared from the device.
+ The format is the same as 'active_table_hash' described in the 'Device resume' section above.
+ current_device_capacity := "current_device_capacity=" <N>
+
+ For instance, if a linear device's inactive table is cleared,
+ #dmsetup clear l1
+
+ then the IMA ASCII measurement log will have an entry with:
+ (converted from ASCII to text for readability)
+
+ 10 77d347408f557f68f0041acb0072946bb2367fe5 ima-buf sha256:42f9ca22163fdfa548e6229dece2959bc5ce295c681644240035827ada0e1db5
+ dm_table_clear
+ dm_version=4.45.0;
+ name=l1,uuid=,major=253,minor=2,minor_count=1,num_targets=1;
+ inactive_table_hash=sha256:75c0dc347063bf474d28a9907037eba060bfe39d8847fc0646d75e149045d545;current_device_capacity=1024;
+
+5. Device rename:
+------------------
+When a device's NAME or UUID is changed, the device information and the new NAME and UUID
+are measured.
+
+The IMA measurement log has the following format for 'dm_device_rename':
+
+::
+
+ EVENT_NAME := "dm_device_rename"
+ EVENT_DATA := <dm_version_str> ";" <device_active_metadata> ";" <new_device_name> "," <new_device_uuid> ";" <current_device_capacity> ";"
+
+ dm_version_str := As described in the 'Table load' section above.
+ device_active_metadata := Device metadata that reflects the currently loaded active table.
+ The format is the same as 'device_metadata' described in the 'Table load' section above.
+ new_device_name := "new_name=" <dm-device-name>
+ dm-device-name := Same as <dm-device-name> described in the 'Table load' section above.
+ new_device_uuid := "new_uuid=" <dm-device-uuid>
+ dm-device-uuid := Same as <dm-device-uuid> described in the 'Table load' section above.
+ current_device_capacity := "current_device_capacity=" <N>
+
+ E.g. 1: if a linear device's uuid is set with the following command,
+ #dmsetup rename linear1 --setuuid 1234-5678
+
+ then the IMA ASCII measurement log will have an entry with:
+ (converted from ASCII to text for readability)
+
+ 10 8b0423209b4c66ac1523f4c9848c9b51ee332f48 ima-buf sha256:6847b7258134189531db593e9230b257c84f04038b5a18fd2e1473860e0569ac
+ dm_device_rename
+ dm_version=4.45.0;
+ name=linear1,uuid=,major=253,minor=2,minor_count=1,num_targets=1;new_name=linear1,new_uuid=1234-5678;
+ current_device_capacity=1024;
+
+ E.g. 2: if a linear device's name is changed with the following command,
+ # dmsetup rename linear1 linear=2
+
+ then the IMA ASCII measurement log will have an entry with:
+ (converted from ASCII to text for readability)
+
+ 10 bef70476b99c2bdf7136fae033aa8627da1bf76f ima-buf sha256:8c6f9f53b9ef9dc8f92a2f2cca8910e622543d0f0d37d484870cb16b95111402
+ dm_device_rename
+ dm_version=4.45.0;
+ name=linear1,uuid=1234-5678,major=253,minor=2,minor_count=1,num_targets=1;
+ new_name=linear\=2,new_uuid=1234-5678;
+ current_device_capacity=1024;
+
+Supported targets:
+==================
+
+The following targets are supported for measuring their data using IMA:
+
+ 1. cache
+ #. crypt
+ #. integrity
+ #. linear
+ #. mirror
+ #. multipath
+ #. raid
+ #. snapshot
+ #. striped
+ #. verity
+
+1. cache
+---------
+The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
+section above) has the following data format for the 'cache' target.
+
+::
+
+ target_attributes := <target_name> "," <target_version> "," <metadata_mode> "," <cache_metadata_device> ","
+ <cache_device> "," <cache_origin_device> "," <writethrough> "," <writeback> ","
+ <passthrough> "," <no_discard_passdown> ";"
+
+ target_name := "target_name=cache"
+ target_version := "target_version=" <N> "." <N> "." <N>
+ metadata_mode := "metadata_mode=" <cache_metadata_mode>
+ cache_metadata_mode := "fail" | "ro" | "rw"
+ cache_device := "cache_device=" <cache_device_name_string>
+ cache_origin_device := "cache_origin_device=" <cache_origin_device_string>
+ writethrough := "writethrough=" <yes_no>
+ writeback := "writeback=" <yes_no>
+ passthrough := "passthrough=" <yes_no>
+ no_discard_passdown := "no_discard_passdown=" <yes_no>
+ yes_no := "y" | "n"
+
+ E.g.
+ When a 'cache' target is loaded, the IMA ASCII measurement log will have an entry
+ similar to the following, depicting what 'cache' attributes are measured in EVENT_DATA
+ for the 'dm_table_load' event.
+ (converted from ASCII to text for readability)
+
+ dm_version=4.45.0;name=cache1,uuid=cache_uuid,major=253,minor=2,minor_count=1,num_targets=1;
+ target_index=0,target_begin=0,target_len=28672,target_name=cache,target_version=2.2.0,metadata_mode=rw,
+ cache_metadata_device=253:4,cache_device=253:3,cache_origin_device=253:5,writethrough=y,writeback=n,
+ passthrough=n,metadata2=y,no_discard_passdown=n;
+
+
+2. crypt
+---------
+The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
+section above) has the following data format for the 'crypt' target.
+
+::
+
+ target_attributes := <target_name> "," <target_version> "," <allow_discards> "," <same_cpu_crypt> ","
+ <submit_from_crypt_cpus> "," <no_read_workqueue> "," <no_write_workqueue> ","
+ <iv_large_sectors> "," [<integrity_tag_size> ","] [<cipher_auth> ","]
+ [<sector_size> ","] [<cipher_string> ","] <key_size> "," <key_parts> ","
+ <key_extra_size> "," <key_mac_size> ";"
+
+ target_name := "target_name=crypt"
+ target_version := "target_version=" <N> "." <N> "." <N>
+ allow_discards := "allow_discards=" <yes_no>
+ same_cpu_crypt := "same_cpu_crypt=" <yes_no>
+ submit_from_crypt_cpus := "submit_from_crypt_cpus=" <yes_no>
+ no_read_workqueue := "no_read_workqueue=" <yes_no>
+ no_write_workqueue := "no_write_workqueue=" <yes_no>
+ iv_large_sectors := "iv_large_sectors=" <yes_no>
+ integrity_tag_size := "integrity_tag_size=" <N>
+ cipher_auth := "cipher_auth=" <string>
+ sector_size := "sector_size=" <N>
+ cipher_string := "cipher_string="
+ key_size := "key_size=" <N>
+ key_parts := "key_parts=" <N>
+ key_extra_size := "key_extra_size=" <N>
+ key_mac_size := "key_mac_size=" <N>
+ yes_no := "y" | "n"
+
+ E.g.
+ When a 'crypt' target is loaded, the IMA ASCII measurement log will have an entry
+ similar to the following, depicting what 'crypt' attributes are measured in EVENT_DATA
+ for the 'dm_table_load' event.
+ (converted from ASCII to text for readability)
+
+ dm_version=4.45.0;
+ name=crypt1,uuid=crypt_uuid1,major=253,minor=0,minor_count=1,num_targets=1;
+ target_index=0,target_begin=0,target_len=1953125,target_name=crypt,target_version=1.23.0,
+ allow_discards=y,same_cpu=n,submit_from_crypt_cpus=n,no_read_workqueue=n,no_write_workqueue=n,
+ iv_large_sectors=n,cipher_string=aes-xts-plain64,key_size=32,key_parts=1,key_extra_size=0,key_mac_size=0;
+
+3. integrity
+-------------
+The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
+section above) has the following data format for the 'integrity' target.
+
+::
+
+ target_attributes := <target_name> "," <target_version> "," <dev_name> "," <start> ","
+ <tag_size> "," <mode> "," [<meta_device> ","] [<block_size> ","] <recalculate> ","
+ <allow_discards> "," <fix_padding> "," <fix_hmac> "," <legacy_recalculate> ","
+ <journal_sectors> "," <interleave_sectors> "," <buffer_sectors> ";"
+
+ target_name := "target_name=integrity"
+ target_version := "target_version=" <N> "." <N> "." <N>
+ dev_name := "dev_name=" <device_name_str>
+ start := "start=" <N>
+ tag_size := "tag_size=" <N>
+ mode := "mode=" <integrity_mode_str>
+ integrity_mode_str := "J" | "B" | "D" | "R"
+ meta_device := "meta_device=" <meta_device_str>
+ block_size := "block_size=" <N>
+ recalculate := "recalculate=" <yes_no>
+ allow_discards := "allow_discards=" <yes_no>
+ fix_padding := "fix_padding=" <yes_no>
+ fix_hmac := "fix_hmac=" <yes_no>
+ legacy_recalculate := "legacy_recalculate=" <yes_no>
+ journal_sectors := "journal_sectors=" <N>
+ interleave_sectors := "interleave_sectors=" <N>
+ buffer_sectors := "buffer_sectors=" <N>
+ yes_no := "y" | "n"
+
+ E.g.
+ When an 'integrity' target is loaded, the IMA ASCII measurement log will have an entry
+ similar to the following, depicting what 'integrity' attributes are measured in EVENT_DATA
+ for the 'dm_table_load' event.
+ (converted from ASCII to text for readability)
+
+ dm_version=4.45.0;
+ name=integrity1,uuid=,major=253,minor=1,minor_count=1,num_targets=1;
+ target_index=0,target_begin=0,target_len=7856,target_name=integrity,target_version=1.10.0,
+ dev_name=253:0,start=0,tag_size=32,mode=J,recalculate=n,allow_discards=n,fix_padding=n,
+ fix_hmac=n,legacy_recalculate=n,journal_sectors=88,interleave_sectors=32768,buffer_sectors=128;
+
+
+4. linear
+----------
+The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
+section above) has the following data format for the 'linear' target.
+
+::
+
+ target_attributes := <target_name> "," <target_version> "," <device_name> "," <start> ";"
+
+ target_name := "target_name=linear"
+ target_version := "target_version=" <N> "." <N> "." <N>
+ device_name := "device_name=" <linear_device_name_str>
+ start := "start=" <N>
+
+ E.g.
+ When a 'linear' target is loaded, the IMA ASCII measurement log will have an entry
+ similar to the following, depicting what 'linear' attributes are measured in EVENT_DATA
+ for the 'dm_table_load' event.
+ (converted from ASCII to text for readability)
+
+ dm_version=4.45.0;
+ name=linear1,uuid=linear_uuid1,major=253,minor=2,minor_count=1,num_targets=1;
+ target_index=0,target_begin=0,target_len=28672,target_name=linear,target_version=1.4.0,
+ device_name=253:1,start=2048;
+
+5. mirror
+----------
+The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
+section above) has the following data format for the 'mirror' target.
+
+::
+
+ target_attributes := <target_name> "," <target_version> "," <nr_mirrors> ","
+ <mirror_device_data> "," <handle_errors> "," <keep_log> "," <log_type_status> ";"
+
+ target_name := "target_name=mirror"
+ target_version := "target_version=" <N> "." <N> "." <N>
+ nr_mirrors := "nr_mirrors=" <NR>
+ mirror_device_data := <mirror_device_row> | <mirror_device_data><mirror_device_row>
+ <mirror_device_row> is repeated <NR> times - for <NR> described in <nr_mirrors>.
+ mirror_device_row := <mirror_device_name> "," <mirror_device_status>
+ mirror_device_name := "mirror_device_" <X> "=" <mirror_device_name_str>
+ where <X> ranges from 0 to (<NR> -1) - for <NR> described in <nr_mirrors>.
+ mirror_device_status := "mirror_device_" <X> "_status=" <mirror_device_status_char>
+ where <X> ranges from 0 to (<NR> -1) - for <NR> described in <nr_mirrors>.
+ mirror_device_status_char := "A" | "F" | "D" | "S" | "R" | "U"
+ handle_errors := "handle_errors=" <yes_no>
+ keep_log := "keep_log=" <yes_no>
+ log_type_status := "log_type_status=" <log_type_status_str>
+ yes_no := "y" | "n"
+
+ E.g.
+ When a 'mirror' target is loaded, the IMA ASCII measurement log will have an entry
+ similar to the following, depicting what 'mirror' attributes are measured in EVENT_DATA
+ for the 'dm_table_load' event.
+ (converted from ASCII to text for readability)
+
+ dm_version=4.45.0;
+ name=mirror1,uuid=mirror_uuid1,major=253,minor=6,minor_count=1,num_targets=1;
+ target_index=0,target_begin=0,target_len=2048,target_name=mirror,target_version=1.14.0,nr_mirrors=2,
+ mirror_device_0=253:4,mirror_device_0_status=A,
+ mirror_device_1=253:5,mirror_device_1_status=A,
+ handle_errors=y,keep_log=n,log_type_status=;
+
+6. multipath
+-------------
+The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
+section above) has the following data format for the 'multipath' target.
+
+::
+
+ target_attributes := <target_name> "," <target_version> "," <nr_priority_groups>
+ ["," <pg_state> "," <priority_groups> "," <priority_group_paths>] ";"
+
+ target_name := "target_name=multipath"
+ target_version := "target_version=" <N> "." <N> "." <N>
+ nr_priority_groups := "nr_priority_groups=" <NPG>
+ priority_groups := <priority_groups_row>|<priority_groups_row><priority_groups>
+ priority_groups_row := "pg_state_" <X> "=" <pg_state_str> "," "nr_pgpaths_" <X> "=" <NPGP> ","
+ "path_selector_name_" <X> "=" <string> "," <priority_group_paths>
+ where <X> ranges from 0 to (<NPG> -1) - for <NPG> described in <nr_priority_groups>.
+ pg_state_str := "E" | "A" | "D"
+ <priority_group_paths> := <priority_group_paths_row> | <priority_group_paths_row><priority_group_paths>
+ priority_group_paths_row := "path_name_" <X> "_" <Y> "=" <string> "," "is_active_" <X> "_" <Y> "=" <is_active_str> ","
+ "fail_count_" <X> "_" <Y> "=" <N> "," "path_selector_status_" <X> "_" <Y> "=" <path_selector_status_str>
+ where <X> ranges from 0 to (<NPG> -1) - for <NPG> described in <nr_priority_groups>,
+ and <Y> ranges from 0 to (<NPGP> -1) - for <NPGP> described in <priority_groups_row>.
+ is_active_str := "A" | "F"
+
+ E.g.
+ When a 'multipath' target is loaded, the IMA ASCII measurement log will have an entry
+ similar to the following, depicting what 'multipath' attributes are measured in EVENT_DATA
+ for the 'dm_table_load' event.
+ (converted from ASCII to text for readability)
+
+ dm_version=4.45.0;
+ name=mp,uuid=,major=253,minor=0,minor_count=1,num_targets=1;
+ target_index=0,target_begin=0,target_len=2097152,target_name=multipath,target_version=1.14.0,nr_priority_groups=2,
+ pg_state_0=E,nr_pgpaths_0=2,path_selector_name_0=queue-length,
+ path_name_0_0=8:16,is_active_0_0=A,fail_count_0_0=0,path_selector_status_0_0=,
+ path_name_0_1=8:32,is_active_0_1=A,fail_count_0_1=0,path_selector_status_0_1=,
+ pg_state_1=E,nr_pgpaths_1=2,path_selector_name_1=queue-length,
+ path_name_1_0=8:48,is_active_1_0=A,fail_count_1_0=0,path_selector_status_1_0=,
+ path_name_1_1=8:64,is_active_1_1=A,fail_count_1_1=0,path_selector_status_1_1=;
+
+7. raid
+--------
+The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
+section above) has the following data format for the 'raid' target.
+
+::
+
+ target_attributes := <target_name> "," <target_version> "," <raid_type> "," <raid_disks> "," <raid_state> ","
+ <raid_device_status> ["," <journal_dev_mode>] ";"
+
+ target_name := "target_name=raid"
+ target_version := "target_version=" <N> "." <N> "." <N>
+ raid_type := "raid_type=" <raid_type_str>
+ raid_disks := "raid_disks=" <NRD>
+ raid_state := "raid_state=" <raid_state_str>
+ raid_state_str := "frozen" | "reshape" |"resync" | "check" | "repair" | "recover" | "idle" |"undef"
+ raid_device_status := <raid_device_status_row> | <raid_device_status_row><raid_device_status>
+ <raid_device_status_row> is repeated <NRD> times - for <NRD> described in <raid_disks>.
+ raid_device_status_row := "raid_device_" <X> "_status=" <raid_device_status_str>
+ where <X> ranges from 0 to (<NRD> -1) - for <NRD> described in <raid_disks>.
+ raid_device_status_str := "A" | "D" | "a" | "-"
+ journal_dev_mode := "journal_dev_mode=" <journal_dev_mode_str>
+ journal_dev_mode_str := "writethrough" | "writeback" | "invalid"
+
+ E.g.
+ When a 'raid' target is loaded, the IMA ASCII measurement log will have an entry
+ similar to the following, depicting what 'raid' attributes are measured in EVENT_DATA
+ for the 'dm_table_load' event.
+ (converted from ASCII to text for readability)
+
+ dm_version=4.45.0;
+ name=raid_LV1,uuid=uuid_raid_LV1,major=253,minor=12,minor_count=1,num_targets=1;
+ target_index=0,target_begin=0,target_len=2048,target_name=raid,target_version=1.15.1,
+ raid_type=raid10,raid_disks=4,raid_state=idle,
+ raid_device_0_status=A,
+ raid_device_1_status=A,
+ raid_device_2_status=A,
+ raid_device_3_status=A;
+
+
+8. snapshot
+------------
+The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
+section above) has the following data format for the 'snapshot' target.
+
+::
+
+ target_attributes := <target_name> "," <target_version> "," <snap_origin_name> ","
+ <snap_cow_name> "," <snap_valid> "," <snap_merge_failed> "," <snapshot_overflowed> ";"
+
+ target_name := "target_name=snapshot"
+ target_version := "target_version=" <N> "." <N> "." <N>
+ snap_origin_name := "snap_origin_name=" <string>
+ snap_cow_name := "snap_cow_name=" <string>
+ snap_valid := "snap_valid=" <yes_no>
+ snap_merge_failed := "snap_merge_failed=" <yes_no>
+ snapshot_overflowed := "snapshot_overflowed=" <yes_no>
+ yes_no := "y" | "n"
+
+ E.g.
+ When a 'snapshot' target is loaded, the IMA ASCII measurement log will have an entry
+ similar to the following, depicting what 'snapshot' attributes are measured in EVENT_DATA
+ for the 'dm_table_load' event.
+ (converted from ASCII to text for readability)
+
+ dm_version=4.45.0;
+ name=snap1,uuid=snap_uuid1,major=253,minor=13,minor_count=1,num_targets=1;
+ target_index=0,target_begin=0,target_len=4096,target_name=snapshot,target_version=1.16.0,
+ snap_origin_name=253:11,snap_cow_name=253:12,snap_valid=y,snap_merge_failed=n,snapshot_overflowed=n;
+
+9. striped
+-----------
+The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
+section above) has the following data format for the 'striped' target.
+
+::
+
+ target_attributes := <target_name> "," <target_version> "," <stripes> "," <chunk_size> ","
+ <stripe_data> ";"
+
+ target_name := "target_name=striped"
+ target_version := "target_version=" <N> "." <N> "." <N>
+ stripes := "stripes=" <NS>
+ chunk_size := "chunk_size=" <N>
+ stripe_data := <stripe_data_row>|<stripe_data><stripe_data_row>
+ stripe_data_row := <stripe_device_name> "," <stripe_physical_start> "," <stripe_status>
+ stripe_device_name := "stripe_" <X> "_device_name=" <stripe_device_name_str>
+ where <X> ranges from 0 to (<NS> -1) - for <NS> described in <stripes>.
+ stripe_physical_start := "stripe_" <X> "_physical_start=" <N>
+ where <X> ranges from 0 to (<NS> -1) - for <NS> described in <stripes>.
+ stripe_status := "stripe_" <X> "_status=" <stripe_status_str>
+ where <X> ranges from 0 to (<NS> -1) - for <NS> described in <stripes>.
+ stripe_status_str := "D" | "A"
+
+ E.g.
+ When a 'striped' target is loaded, the IMA ASCII measurement log will have an entry
+ similar to the following, depicting what 'striped' attributes are measured in EVENT_DATA
+ for the 'dm_table_load' event.
+ (converted from ASCII to text for readability)
+
+ dm_version=4.45.0;
+ name=striped1,uuid=striped_uuid1,major=253,minor=5,minor_count=1,num_targets=1;
+ target_index=0,target_begin=0,target_len=640,target_name=striped,target_version=1.6.0,stripes=2,chunk_size=64,
+ stripe_0_device_name=253:0,stripe_0_physical_start=2048,stripe_0_status=A,
+ stripe_1_device_name=253:3,stripe_1_physical_start=2048,stripe_1_status=A;
+
+10. verity
+----------
+The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
+section above) has the following data format for the 'verity' target.
+
+::
+
+ target_attributes := <target_name> "," <target_version> "," <hash_failed> "," <verity_version> ","
+ <data_device_name> "," <hash_device_name> "," <verity_algorithm> "," <root_digest> ","
+ <salt> "," <ignore_zero_blocks> "," <check_at_most_once> ["," <root_hash_sig_key_desc>]
+ ["," <verity_mode>] ";"
+
+ target_name := "target_name=verity"
+ target_version := "target_version=" <N> "." <N> "." <N>
+ hash_failed := "hash_failed=" <hash_failed_str>
+ hash_failed_str := "C" | "V"
+ verity_version := "verity_version=" <verity_version_str>
+ data_device_name := "data_device_name=" <data_device_name_str>
+ hash_device_name := "hash_device_name=" <hash_device_name_str>
+ verity_algorithm := "verity_algorithm=" <verity_algorithm_str>
+ root_digest := "root_digest=" <root_digest_str>
+ salt := "salt=" <salt_str>
+ salt_str := "-" <verity_salt_str>
+ ignore_zero_blocks := "ignore_zero_blocks=" <yes_no>
+ check_at_most_once := "check_at_most_once=" <yes_no>
+ root_hash_sig_key_desc := "root_hash_sig_key_desc="
+ verity_mode := "verity_mode=" <verity_mode_str>
+ verity_mode_str := "ignore_corruption" | "restart_on_corruption" | "panic_on_corruption" | "invalid"
+ yes_no := "y" | "n"
+
+ E.g.
+ When a 'verity' target is loaded, the IMA ASCII measurement log will have an entry
+ similar to the following, depicting what 'verity' attributes are measured in EVENT_DATA
+ for the 'dm_table_load' event.
+ (converted from ASCII to text for readability)
+
+ dm_version=4.45.0;
+ name=test-verity,uuid=,major=253,minor=2,minor_count=1,num_targets=1;
+ target_index=0,target_begin=0,target_len=1953120,target_name=verity,target_version=1.8.0,hash_failed=V,
+ verity_version=1,data_device_name=253:1,hash_device_name=253:0,verity_algorithm=sha256,
+ root_digest=29cb87e60ce7b12b443ba6008266f3e41e93e403d7f298f8e3f316b29ff89c5e,
+ salt=e48da609055204e89ae53b655ca2216dd983cf3cb829f34f63a297d106d53e2d,
+ ignore_zero_blocks=n,check_at_most_once=n;
diff --git a/Documentation/admin-guide/device-mapper/dm-init.rst b/Documentation/admin-guide/device-mapper/dm-init.rst
index e5242ff17e9b..981d6a907699 100644
--- a/Documentation/admin-guide/device-mapper/dm-init.rst
+++ b/Documentation/admin-guide/device-mapper/dm-init.rst
@@ -123,3 +123,11 @@ Other examples (per target):
0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
51934789604d1b92399c52e7cb149d1b3a1b74bbbcb103b2a0aaacbed5c08584
+
+For setups using device-mapper on top of asynchronously probed block
+devices (MMC, USB, ..), it may be necessary to tell dm-init to
+explicitly wait for them to become available before setting up the
+device-mapper tables. This can be done with the "dm-mod.waitfor="
+module parameter, which takes a list of devices to wait for::
+
+ dm-mod.waitfor=<device1>[,..,<deviceN>]
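+
+For example (the device names here are purely hypothetical), a system whose
+dm-init tables sit on a slowly probed eMMC card and a USB disk might use::
+
+ dm-mod.waitfor=/dev/mmcblk0p2,/dev/sda1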
diff --git a/Documentation/admin-guide/device-mapper/dm-integrity.rst b/Documentation/admin-guide/device-mapper/dm-integrity.rst
index 9edd45593abd..d8a5f14d0e3c 100644
--- a/Documentation/admin-guide/device-mapper/dm-integrity.rst
+++ b/Documentation/admin-guide/device-mapper/dm-integrity.rst
@@ -25,7 +25,7 @@ mode it calculates and verifies the integrity tag internally. In this
mode, the dm-integrity target can be used to detect silent data
corruption on the disk or in the I/O path.
-There's an alternate mode of operation where dm-integrity uses bitmap
+There's an alternate mode of operation where dm-integrity uses a bitmap
instead of a journal. If a bit in the bitmap is 1, the corresponding
region's data and integrity tags are not synchronized - if the machine
crashes, the unsynchronized regions will be recalculated. The bitmap mode
@@ -38,6 +38,15 @@ the device. But it will only format the device if the superblock contains
zeroes. If the superblock is neither valid nor zeroed, the dm-integrity
target can't be loaded.
+Accesses to the on-disk metadata area containing checksums (aka tags) are
+buffered using dm-bufio. When an access to any given metadata area
+occurs, each unique metadata area gets its own buffer(s). The buffer size
+is capped at the size of the metadata area, but may be smaller, thereby
+requiring multiple buffers to represent the full metadata area. For small
+reads/writes, a smaller buffer size produces smaller read/write operations
+to the metadata area. The metadata is still read even on a full write to
+the data covered by a single buffer.
+
To use the target for the first time:
1. overwrite the superblock with zeroes
@@ -45,7 +54,7 @@ To use the target for the first time:
will format the device
3. unload the dm-integrity target
4. read the "provided_data_sectors" value from the superblock
-5. load the dm-integrity target with the the target size
+5. load the dm-integrity target with the target size
"provided_data_sectors"
6. if you want to use dm-integrity with dm-crypt, load the dm-crypt target
with the size "provided_data_sectors"
@@ -93,31 +102,27 @@ journal_sectors:number
device. If the device is already formatted, the value from the
superblock is used.
-interleave_sectors:number
+interleave_sectors:number (default 32768)
The number of interleaved sectors. This values is rounded down to
a power of two. If the device is already formatted, the value from
the superblock is used.
meta_device:device
- Don't interleave the data and metadata on on device. Use a
+ Don't interleave the data and metadata on the device. Use a
separate device for metadata.
-buffer_sectors:number
- The number of sectors in one buffer. The value is rounded down to
- a power of two.
-
- The tag area is accessed using buffers, the buffer size is
- configurable. The large buffer size means that the I/O size will
- be larger, but there could be less I/Os issued.
+buffer_sectors:number (default 128)
+ The number of sectors in one metadata buffer. The value is rounded
+ down to a power of two.
-journal_watermark:number
+journal_watermark:number (default 50)
The journal watermark in percents. When the size of the journal
exceeds this watermark, the thread that flushes the journal will
be started.
-commit_time:number
+commit_time:number (default 10000)
Commit time in milliseconds. When this time passes, the journal is
- written. The journal is also written immediatelly if the FLUSH
+ written. The journal is also written immediately if the FLUSH
request is received.
internal_hash:algorithm(:key) (the key is optional)
@@ -143,11 +148,11 @@ recalculate
journal_crypt:algorithm(:key) (the key is optional)
Encrypt the journal using given algorithm to make sure that the
attacker can't read the journal. You can use a block cipher here
- (such as "cbc(aes)") or a stream cipher (for example "chacha20",
- "salsa20" or "ctr(aes)").
+ (such as "cbc(aes)") or a stream cipher (for example "chacha20"
+ or "ctr(aes)").
The journal contains history of last writes to the block device,
- an attacker reading the journal could see the last sector nubmers
+ an attacker reading the journal could see the last sector numbers
that were written. From the sector numbers, the attacker can infer
the size of files that were written. To protect against this
situation, you can encrypt the journal.
@@ -163,11 +168,10 @@ journal_mac:algorithm(:key) (the key is optional)
the journal. Thus, modified sector number would be detected at
this stage.
-block_size:number
- The size of a data block in bytes. The larger the block size the
+block_size:number (default 512)
+ The size of a data block in bytes. The larger the block size the
less overhead there is for per-block integrity metadata.
- Supported values are 512, 1024, 2048 and 4096 bytes. If not
- specified the default block size is 512 bytes.
+ Supported values are 512, 1024, 2048 and 4096 bytes.
sectors_per_bit:number
In the bitmap mode, this parameter specifies the number of
@@ -177,14 +181,31 @@ bitmap_flush_interval:number
The bitmap flush interval in milliseconds. The metadata buffers
are synchronized when this interval expires.
+allow_discards
+ Allow block discard requests (a.k.a. TRIM) for the integrity device.
+ Discards are only allowed to devices using internal hash.
+
fix_padding
Use a smaller padding of the tag area that is more
space-efficient. If this option is not present, large padding is
used - that is for compatibility with older kernels.
-allow_discards
- Allow block discard requests (a.k.a. TRIM) for the integrity device.
- Discards are only allowed to devices using internal hash.
+fix_hmac
+ Improve security of internal_hash and journal_mac:
+
+ - the section number is mixed into the mac, so that an attacker can't
+ copy sectors from one journal section to another journal section
+ - the superblock is protected by journal_mac
+ - a 16-byte salt stored in the superblock is mixed into the mac, so
+ that the attacker can't detect that two disks have the same hmac
+ key, and also to prevent the attacker from moving sectors from one
+ disk to another
+
+legacy_recalculate
+ Allow recalculation of volumes with HMAC keys. This is disabled by
+ default for security reasons - an attacker could modify the volume,
+ set recalc_sector to zero, and the kernel would not detect the
+ modification.
The journal mode (D/J), buffer_sectors, journal_watermark, commit_time and
allow_discards can be changed when reloading the target (load an inactive
@@ -192,6 +213,12 @@ table and swap the tables with suspend and resume). The other arguments
should not be changed when reloading the target because the layout of disk
data depend on them and the reloaded target would be non-functional.
+For example, on a device using the default interleave_sectors of 32768, a
+block_size of 512, and an internal_hash of crc32c with a tag size of 4
+bytes, it will take 128 KiB of tags to track a full data area, requiring
+256 sectors of metadata per data area. With the default buffer_sectors of
+128, that means there will be 2 buffers per metadata area, or 2 buffers
+per 16 MiB of data.
Status line:
@@ -269,7 +296,8 @@ The layout of the formatted block device:
Each run contains:
* tag area - it contains integrity tags. There is one tag for each
- sector in the data area
+ sector in the data area. The size of this area is always 4KiB or
+ greater.
* data area - it contains data sectors. The number of data sectors
in one run must be a power of two. log2 of this value is stored
in the superblock.
diff --git a/Documentation/admin-guide/device-mapper/dm-raid.rst b/Documentation/admin-guide/device-mapper/dm-raid.rst
index 695a2ea1d1ae..bb17e26e3c1b 100644
--- a/Documentation/admin-guide/device-mapper/dm-raid.rst
+++ b/Documentation/admin-guide/device-mapper/dm-raid.rst
@@ -71,7 +71,7 @@ The target is named "raid" and it accepts the following parameters::
============= ===============================================================
Reference: Chapter 4 of
- http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf
+ https://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf
<#raid_params>: The number of parameters that follow.
@@ -418,6 +418,6 @@ Version History
specific devices are requested via rebuild. Fix RAID leg
rebuild errors.
1.15.0 Fix size extensions not being synchronized in case of new MD bitmap
- pages allocated; also fix those not occuring after previous reductions
+ pages allocated; also fix those not occurring after previous reductions
1.15.1 Fix argument count and arguments for rebuild/write_mostly/journal_(dev|mode)
on the status line.
diff --git a/Documentation/admin-guide/device-mapper/dm-zoned.rst b/Documentation/admin-guide/device-mapper/dm-zoned.rst
index 553752ea2521..932383fe6e88 100644
--- a/Documentation/admin-guide/device-mapper/dm-zoned.rst
+++ b/Documentation/admin-guide/device-mapper/dm-zoned.rst
@@ -14,7 +14,7 @@ host-aware zoned block devices.
For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):
-http://www.t10.org/drafts.htm#ZBC_Family
+https://www.t10.org/drafts.htm#ZBC_Family
and (for ATA devices):
@@ -24,7 +24,7 @@ The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as little as 5 zones will be used
-internally for storing metadata and performaing reclaim operations.
+internally for storing metadata and performing reclaim operations.
dm-zoned target devices are formatted and checked using the dmzadm
utility available at:
@@ -46,7 +46,7 @@ just like conventional zones.
The zones of the device(s) are separated into 2 types:
1) Metadata zones: these are conventional zones used to store metadata.
-Metadata zones are not reported as useable capacity to the user.
+Metadata zones are not reported as usable capacity to the user.
2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
@@ -102,7 +102,7 @@ the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.
-After some time, the limited number of convnetional zones available may
+After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
@@ -158,7 +158,7 @@ Ex::
dmzadm --format /dev/sdxx /dev/sdyy
-Fomatted device(s) can be started with the dmzadm utility, too.:
+Formatted device(s) can be started with the dmzadm utility, too.:
Ex::
diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst
index ec62fcc8eece..cc5aec861576 100644
--- a/Documentation/admin-guide/device-mapper/index.rst
+++ b/Documentation/admin-guide/device-mapper/index.rst
@@ -11,7 +11,9 @@ Device Mapper
dm-clone
dm-crypt
dm-dust
+ dm-ebs
dm-flakey
+ dm-ima
dm-init
dm-integrity
dm-io
@@ -32,6 +34,8 @@ Device Mapper
switch
thin-provisioning
unstriped
+ vdo-design
+ vdo
verity
writecache
zero
diff --git a/Documentation/admin-guide/device-mapper/unstriped.rst b/Documentation/admin-guide/device-mapper/unstriped.rst
index 0a8d3eb3f072..5772ccdd1f5f 100644
--- a/Documentation/admin-guide/device-mapper/unstriped.rst
+++ b/Documentation/admin-guide/device-mapper/unstriped.rst
@@ -35,7 +35,7 @@ An example of undoing an existing dm-stripe
This small bash script will setup 4 loop devices and use the existing
striped target to combine the 4 devices into one. It then will use
-the unstriped target ontop of the striped device to access the
+the unstriped target on top of the striped device to access the
individual backing loop devices. We write data to the newly exposed
unstriped devices and verify the data written matches the correct
underlying device on the striped array::
@@ -110,8 +110,8 @@ to get a 92% reduction in read latency using this device mapper target.
Example dmsetup usage
=====================
-unstriped ontop of Intel NVMe device that has 2 cores
------------------------------------------------------
+unstriped on top of Intel NVMe device that has 2 cores
+------------------------------------------------------
::
@@ -124,8 +124,8 @@ respectively::
/dev/mapper/nvmset0
/dev/mapper/nvmset1
-unstriped ontop of striped with 4 drives using 128K chunk size
---------------------------------------------------------------
+unstriped on top of striped with 4 drives using 128K chunk size
+---------------------------------------------------------------
::
diff --git a/Documentation/admin-guide/device-mapper/vdo-design.rst b/Documentation/admin-guide/device-mapper/vdo-design.rst
new file mode 100644
index 000000000000..3cd59decbec0
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/vdo-design.rst
@@ -0,0 +1,633 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+================
+Design of dm-vdo
+================
+
+The dm-vdo (virtual data optimizer) target provides inline deduplication,
+compression, zero-block elimination, and thin provisioning. A dm-vdo target
+can be backed by up to 256TB of storage, and can present a logical size of
+up to 4PB. This target was originally developed at Permabit Technology
+Corp. starting in 2009. It was first released in 2013 and has been used in
+production environments ever since. It was made open-source in 2017 after
+Permabit was acquired by Red Hat. This document describes the design of
+dm-vdo. For usage, see vdo.rst in the same directory as this file.
+
+Because deduplication rates fall drastically as the block size increases, a
+vdo target has a maximum block size of 4K. However, it can achieve
+deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
+reference a single 4K block of actual storage. It can achieve compression rates
+of 14:1. All zero blocks consume no storage at all.
+
+Theory of Operation
+===================
+
+The design of dm-vdo is based on the idea that deduplication is a two-part
+problem. The first is to recognize duplicate data. The second is to avoid
+storing multiple copies of those duplicates. Therefore, dm-vdo has two main
+parts: a deduplication index (called UDS) that is used to discover
+duplicate data, and a data store with a reference counted block map that
+maps from logical block addresses to the actual storage location of the
+data.
+
+Zones and Threading
+-------------------
+
+Due to the complexity of data optimization, the number of metadata
+structures involved in a single write operation to a vdo target is larger
+than most other targets. Furthermore, because vdo must operate on small
+block sizes in order to achieve good deduplication rates, acceptable
+performance can only be achieved through parallelism. Therefore, vdo's
+design attempts to be lock-free.
+
+Most of a vdo's main data structures are designed to be easily divided into
+"zones" such that any given bio must only access a single zone of any zoned
+structure. Safety with minimal locking is achieved by ensuring that during
+normal operation, each zone is assigned to a specific thread, and only that
+thread will access the portion of the data structure in that zone.
+Associated with each thread is a work queue. Each bio is associated with a
+request object (the "data_vio") which will be added to a work queue when
+the next phase of its operation requires access to the structures in the
+zone associated with that queue.
+
+Another way of thinking about this arrangement is that the work queue for
+each zone has an implicit lock on the structures it manages for all its
+operations, because vdo guarantees that no other thread will alter those
+structures.
+
+Although each structure is divided into zones, this division is not
+reflected in the on-disk representation of each data structure. Therefore,
+the number of zones for each structure, and hence the number of threads,
+can be reconfigured each time a vdo target is started.
+
+The Deduplication Index
+-----------------------
+
+In order to identify duplicate data efficiently, vdo was designed to
+leverage some common characteristics of duplicate data. From empirical
+observations, we gathered two key insights. The first is that in most data
+sets with significant amounts of duplicate data, the duplicates tend to
+have temporal locality. When a duplicate appears, it is more likely that
+other duplicates will be detected, and that those duplicates will have been
+written at about the same time. This is why the index keeps records in
+temporal order. The second insight is that new data is more likely to
+duplicate recent data than it is to duplicate older data and in general,
+there are diminishing returns to looking further back in time. Therefore,
+when the index is full, it should cull its oldest records to make space for
+new ones. Another important idea behind the design of the index is that the
+ultimate goal of deduplication is to reduce storage costs. Since there is a
+trade-off between the storage saved and the resources expended to achieve
+those savings, vdo does not attempt to find every last duplicate block. It
+is sufficient to find and eliminate most of the redundancy.
+
+Each block of data is hashed to produce a 16-byte block name. An index
+record consists of this block name paired with the presumed location of
+that data on the underlying storage. However, it is not possible to
+guarantee that the index is accurate. In the most common case, this occurs
+because it is too costly to update the index when a block is over-written
+or discarded. Doing so would require either storing the block name along
+with the blocks, which is difficult to do efficiently in block-based
+storage, or reading and rehashing each block before overwriting it.
+Inaccuracy can also result from a hash collision where two different blocks
+have the same name. In practice, this is extremely unlikely, but because
+vdo does not use a cryptographic hash, a malicious workload could be
+constructed. Because of these inaccuracies, vdo treats the locations in the
+index as hints, and reads each indicated block to verify that it is indeed
+a duplicate before sharing the existing block with a new one.
+
+Records are collected into groups called chapters. New records are added to
+the newest chapter, called the open chapter. This chapter is stored in a
+format optimized for adding and modifying records, and the content of the
+open chapter is not finalized until it runs out of space for new records.
+When the open chapter fills up, it is closed and a new open chapter is
+created to collect new records.
+
+Closing a chapter converts it to a different format which is optimized for
+reading. The records are written to a series of record pages based on the
+order in which they were received. This means that records with temporal
+locality should be on a small number of pages, reducing the I/O required to
+retrieve them. The chapter also compiles an index that indicates which
+record page contains any given name. This index means that a request for a
+name can determine exactly which record page may contain that record,
+without having to load the entire chapter from storage. This index uses
+only a subset of the block name as its key, so it cannot guarantee that an
+index entry refers to the desired block name. It can only guarantee that if
+there is a record for this name, it will be on the indicated page. Closed
+chapters are read-only structures and their contents are never altered in
+any way.
+
+Once enough records have been written to fill up all the available index
+space, the oldest chapter is removed to make space for new chapters. Any
+time a request finds a matching record in the index, that record is copied
+into the open chapter. This ensures that useful block names remain available
+in the index, while unreferenced block names are forgotten over time.
+
+In order to find records in older chapters, the index also maintains a
+higher level structure called the volume index, which contains entries
+mapping each block name to the chapter containing its newest record. This
+mapping is updated as records for the block name are copied or updated,
+ensuring that only the newest record for a given block name can be found.
+An older record for a block name will no longer be found even though it has
+not been deleted from its chapter. Like the chapter index, the volume index
+uses only a subset of the block name as its key and can not definitively
+say that a record exists for a name. It can only say which chapter would
+contain the record if a record exists. The volume index is stored entirely
+in memory and is saved to storage only when the vdo target is shut down.
+
+From the viewpoint of a request for a particular block name, it will first
+look up the name in the volume index. This search will either indicate that
+the name is new, or which chapter to search. If it returns a chapter, the
+request looks up its name in the chapter index. This will indicate either
+that the name is new, or which record page to search. Finally, if it is not
+new, the request will look for its name in the indicated record page.
+This process may require up to two page reads per request (one for the
+chapter index page and one for the record page). However, recently
+accessed pages are cached so that these page reads can be amortized across
+many block name requests.
+
+The volume index and the chapter indexes are implemented using a
+memory-efficient structure called a delta index. Instead of storing the
+entire block name (the key) for each entry, the entries are sorted by name
+and only the difference between adjacent keys (the delta) is stored.
+Because we expect the hashes to be randomly distributed, the size of the
+deltas follows an exponential distribution. Because of this distribution,
+the deltas are expressed using a Huffman code to take up even less space.
+The entire sorted list of keys is called a delta list. This structure
+allows the index to use many fewer bytes per entry than a traditional hash
+table, but it is slightly more expensive to look up entries, because a
+request must read every entry in a delta list to add up the deltas in order
+to find the record it needs. The delta index reduces this lookup cost by
+splitting its key space into many sub-lists, each starting at a fixed key
+value, so that each individual list is short.
+
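+To make the delta-list idea concrete, here is a tiny illustrative sketch in
+Python. It is not vdo's actual in-memory or on-disk layout; in particular it
+omits the Huffman coding of the deltas and the splitting into sub-lists::
+
+ def make_delta_list(keys):
+     # store a sorted list of integer keys as the first key plus deltas
+     keys = sorted(keys)
+     return keys[0], [b - a for a, b in zip(keys, keys[1:])]
+
+ def delta_list_contains(base, deltas, wanted):
+     # a lookup walks the list, summing deltas until the key is reached
+     # or passed; this is the extra cost paid for the space savings
+     current = base
+     if current >= wanted:
+         return current == wanted
+     for delta in deltas:
+         current += delta
+         if current >= wanted:
+             return current == wanted
+     return False
+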
+The default index size can hold 64 million records, corresponding to about
+256GB of data. This means that the index can identify duplicate data if the
+original data was written within the last 256GB of writes. This range is
+called the deduplication window. If new writes duplicate data that is older
+than that, the index will not be able to find it because the records of the
+older data have been removed. This means that if an application writes a
+200 GB file to a vdo target and then immediately writes it again, the two
+copies will deduplicate perfectly. Doing the same with a 500 GB file will
+result in no deduplication, because the beginning of the file will no
+longer be in the index by the time the second write begins (assuming there
+is no duplication within the file itself).
+
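+The window arithmetic is simply the record capacity multiplied by the block
+size. A quick illustrative check of the numbers quoted above (reading "64
+million records" as 64 * 2^20)::
+
+ BLOCK_SIZE = 4096                     # bytes per 4K block
+ DEFAULT_RECORDS = 64 * 1024 * 1024    # default index capacity
+
+ window_bytes = DEFAULT_RECORDS * BLOCK_SIZE
+ print(window_bytes // 2**30)          # -> 256 (GB of recent writes remembered)
+ print(10 * window_bytes // 2**30)     # -> 2560, with sparse indexing
+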
+If an application anticipates a data workload that will see useful
+deduplication beyond the 256GB threshold, vdo can be configured to use a
+larger index with a correspondingly larger deduplication window. (This
+configuration can only be set when the target is created, not altered
+later. It is important to consider the expected workload for a vdo target
+before configuring it.) There are two ways to do this.
+
+One way is to increase the memory size of the index, which also increases
+the amount of backing storage required. Doubling the size of the index will
+double the length of the deduplication window at the expense of doubling
+the storage size and the memory requirements.
+
+The other option is to enable sparse indexing. Sparse indexing increases
+the deduplication window by a factor of 10, at the expense of also
+increasing the storage size by a factor of 10. However with sparse
+indexing, the memory requirements do not increase. The trade-off is
+slightly more computation per request and a slight decrease in the amount
+of deduplication detected. For most workloads with significant amounts of
+duplicate data, sparse indexing will detect 97-99% of the deduplication
+that a standard index will detect.
+
+The vio and data_vio Structures
+-------------------------------
+
+A vio (short for Vdo I/O) is conceptually similar to a bio, with additional
+fields and data to track vdo-specific information. A struct vio maintains a
+pointer to a bio but also tracks other fields specific to the operation of
+vdo. The vio is kept separate from its related bio because there are many
+circumstances where vdo completes the bio but must continue to do work
+related to deduplication or compression.
+
+Metadata reads and writes, and other writes that originate within vdo, use
+a struct vio directly. Application reads and writes use a larger structure
+called a data_vio to track information about their progress. A struct
+data_vio contains a struct vio and also includes several other fields
+related to deduplication and other vdo features. The data_vio is the
+primary unit of application work in vdo. Each data_vio proceeds through a
+set of steps to handle the application data, after which it is reset and
+returned to a pool of data_vios for reuse.
+
+There is a fixed pool of 2048 data_vios. This number was chosen to bound
+the amount of work that is required to recover from a crash. In addition,
+benchmarks have indicated that increasing the size of the pool does not
+significantly improve performance.
+
+The Data Store
+--------------
+
+The data store is implemented by three main data structures, all of which
+work in concert to reduce or amortize metadata updates across as many data
+writes as possible.
+
+*The Slab Depot*
+
+Most of the vdo volume belongs to the slab depot. The depot contains a
+collection of slabs. The slabs can be up to 32GB, and are divided into
+three sections. Most of a slab consists of a linear sequence of 4K blocks.
+These blocks are used either to store data, or to hold portions of the
+block map (see below). In addition to the data blocks, each slab has a set
+of reference counters, using 1 byte for each data block. Finally each slab
+has a journal.
+
+Reference updates are written to the slab journal. Slab journal blocks are
+written out either when they are full, or when the recovery journal
+requests they do so in order to allow the main recovery journal (see below)
+to free up space. The slab journal is used both to ensure that the main
+recovery journal can regularly free up space, and also to amortize the cost
+of updating individual reference blocks. The reference counters are kept in
+memory and are written out, a block at a time in oldest-dirtied-order, only
+when there is a need to reclaim slab journal space. The write operations
+are performed in the background as needed so they do not add latency to
+particular I/O operations.
+
+Each slab is independent of every other. They are assigned to "physical
+zones" in round-robin fashion. If there are P physical zones, then slab n
+is assigned to zone n mod P.
+
+The slab depot maintains an additional small data structure, the "slab
+summary," which is used to reduce the amount of work needed to come back
+online after a crash. The slab summary maintains an entry for each slab
+indicating whether or not the slab has ever been used, whether all of its
+reference count updates have been persisted to storage, and approximately
+how full it is. During recovery, each physical zone will attempt to recover
+at least one slab, stopping whenever it has recovered a slab which has some
+free blocks. Once each zone has some space, or has determined that none is
+available, the target can resume normal operation in a degraded mode. Read
+and write requests can be serviced, perhaps with degraded performance,
+while the remainder of the dirty slabs are recovered.
+
+*The Block Map*
+
+The block map contains the logical to physical mapping. It can be thought
+of as an array with one entry per logical address. Each entry is 5 bytes,
+36 bits of which contain the physical block number which holds the data for
+the given logical address. The other 4 bits are used to indicate the nature
+of the mapping. Of the 16 possible states, one represents a logical address
+which is unmapped (i.e. it has never been written, or has been discarded),
+one represents an uncompressed block, and the other 14 states are used to
+indicate that the mapped data is compressed, and which of the compression
+slots in the compressed block contains the data for this logical address.
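+
+One possible way to pack such a 5-byte entry is sketched below. This
+encoding is hypothetical and is not the actual vdo on-disk format; it only
+illustrates how a 4-bit state and a 36-bit physical block number fit in 40
+bits::
+
+  #include <stdint.h>
+
+  struct example_block_map_entry {
+          uint8_t bytes[5];          /* 4-bit state + 36-bit PBN = 40 bits */
+  };
+
+  static void example_pack(struct example_block_map_entry *e,
+                           uint64_t pbn, uint8_t state)
+  {
+          uint64_t v = ((uint64_t)(state & 0xf) << 36) |
+                       (pbn & 0xfffffffffULL);      /* low 36 bits */
+          for (int i = 0; i < 5; i++)
+                  e->bytes[i] = (v >> (8 * i)) & 0xff;
+  }
+
+  static uint64_t example_unpack_pbn(const struct example_block_map_entry *e)
+  {
+          uint64_t v = 0;
+          for (int i = 0; i < 5; i++)
+                  v |= (uint64_t)e->bytes[i] << (8 * i);
+          return v & 0xfffffffffULL;
+  }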
+
+In practice, the array of mapping entries is divided into "block map
+pages," each of which fits in a single 4K block. Each block map page
+consists of a header and 812 mapping entries. Each mapping page is actually
+a leaf of a radix tree which consists of block map pages at each level.
+There are 60 radix trees which are assigned to "logical zones" in round
+robin fashion. (If there are L logical zones, tree n will belong to zone n
+mod L.) At each level, the trees are interleaved, so logical addresses
+0-811 belong to tree 0, logical addresses 812-1623 belong to tree 1, and so
+on. The interleaving is maintained all the way up to the 60 root nodes.
+Choosing 60 trees results in an evenly distributed number of trees per zone
+for a large number of possible logical zone counts. The storage for the 60
+tree roots is allocated at format time. All other block map pages are
+allocated out of the slabs as needed. This flexible allocation avoids the
+need to pre-allocate space for the entire set of logical mappings and also
+makes growing the logical size of a vdo relatively easy.
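+
+The interleaving arithmetic described above reduces to a few lines of C.
+This is only a sketch of the address math implied by the text (812 entries
+per page, 60 trees); the helper names are invented::
+
+  #include <stdint.h>
+
+  enum {
+          ENTRIES_PER_PAGE = 812,   /* mapping entries per block map page */
+          TREE_COUNT = 60,          /* interleaved block map trees */
+  };
+
+  /* Which of the 60 trees holds a given logical block number. */
+  static unsigned int example_tree_for_lbn(uint64_t lbn)
+  {
+          return (lbn / ENTRIES_PER_PAGE) % TREE_COUNT;
+  }
+
+  /* Which logical zone owns that tree, given 'zones' logical zones. */
+  static unsigned int example_zone_for_tree(unsigned int tree,
+                                            unsigned int zones)
+  {
+          return tree % zones;
+  }
+
+  /* Slot of the mapping entry within its leaf page. */
+  static unsigned int example_slot_for_lbn(uint64_t lbn)
+  {
+          return lbn % ENTRIES_PER_PAGE;
+  }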
+
+In operation, the block map maintains two caches. It is prohibitive to keep
+the entire leaf level of the trees in memory, so each logical zone
+maintains its own cache of leaf pages. The size of this cache is
+configurable at target start time. The second cache is allocated at start
+time, and is large enough to hold all the non-leaf pages of the entire
+block map. This cache is populated as pages are needed.
+
+*The Recovery Journal*
+
+The recovery journal is used to amortize updates across the block map and
+slab depot. Each write request causes an entry to be made in the journal.
+Entries are either "data remappings" or "block map remappings." For a data
+remapping, the journal records the logical address affected and its old and
+new physical mappings. For a block map remapping, the journal records the
+block map page number and the physical block allocated for it. Block map
+pages are never reclaimed or repurposed, so the old mapping is always 0.
+
+Each journal entry is an intent record summarizing the metadata updates
+that are required for a data_vio. The recovery journal issues a flush
+before each journal block write to ensure that the physical data for the
+new block mappings in that block are stable on storage, and journal block
+writes are all issued with the FUA bit set to ensure the recovery journal
+entries themselves are stable. The journal entry and the data write it
+represents must be stable on disk before the other metadata structures may
+be updated to reflect the operation. These entries allow the vdo device to
+reconstruct the logical to physical mappings after an unexpected
+interruption such as a loss of power.
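+
+A hypothetical C rendering of such an intent record is shown below; the
+field names are invented and the actual vdo journal format differs::
+
+  #include <stdbool.h>
+  #include <stdint.h>
+
+  struct example_journal_entry {
+          bool is_block_map_remapping;   /* otherwise a data remapping */
+          uint64_t logical_address;      /* or the block map page number */
+          uint64_t old_physical_block;   /* always 0 for block map pages */
+          uint64_t new_physical_block;
+  };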
+
+*Write Path*
+
+All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon
+as vdo has done enough work to guarantee that it can complete the write
+eventually. Generally, the data for acknowledged but unflushed write I/O
+can be treated as though it is cached in memory. If an application
+requires data to be stable on storage, it must issue a flush or write the
+data with the FUA bit set like any other asynchronous I/O. Shutting down
+the vdo target will also flush any remaining I/O.
+
+Application write bios follow the steps outlined below.
+
+1. A data_vio is obtained from the data_vio pool and associated with the
+ application bio. If there are no data_vios available, the incoming bio
+ will block until a data_vio is available. This provides back pressure
+ to the application. The data_vio pool is protected by a spin lock.
+
+ The newly acquired data_vio is reset and the bio's data is copied into
+ the data_vio if it is a write and the data is not all zeroes. The data
+ must be copied because the application bio can be acknowledged before
+ the data_vio processing is complete, which means later processing steps
+ will no longer have access to the application bio. The application bio
+ may also be smaller than 4K, in which case the data_vio will have
+ already read the underlying block and the data is instead copied over
+ the relevant portion of the larger block.
+
+2. The data_vio places a claim (the "logical lock") on the logical address
+ of the bio. It is vital to prevent simultaneous modifications of the
+ same logical address, because deduplication involves sharing blocks.
+ This claim is implemented as an entry in a hashtable where the key is
+ the logical address and the value is a pointer to the data_vio
+ currently handling that address.
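+   (A minimal illustrative sketch of this hashtable scheme appears after
+   the end of this list.)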
+
+ If a data_vio looks in the hashtable and finds that another data_vio is
+ already operating on that logical address, it waits until the previous
+ operation finishes. It also sends a message to inform the current
+ lock holder that it is waiting. Most notably, a new data_vio waiting
+ for a logical lock will flush the previous lock holder out of the
+ compression packer (step 8d) rather than allowing it to continue
+ waiting to be packed.
+
+ This stage requires the data_vio to get an implicit lock on the
+ appropriate logical zone to prevent concurrent modifications of the
+ hashtable. This implicit locking is handled by the zone divisions
+ described above.
+
+3. The data_vio traverses the block map tree to ensure that all the
+ necessary internal tree nodes have been allocated, by trying to find
+ the leaf page for its logical address. If any interior tree page is
+ missing, it is allocated at this time out of the same physical storage
+ pool used to store application data.
+
+ a. If any page-node in the tree has not yet been allocated, it must be
+ allocated before the write can continue. This step requires the
+ data_vio to lock the page-node that needs to be allocated. This
+ lock, like the logical block lock in step 2, is a hashtable entry
+ that causes other data_vios to wait for the allocation process to
+ complete.
+
+ The implicit logical zone lock is released while the allocation is
+ happening, in order to allow other operations in the same logical
+ zone to proceed. The details of allocation are the same as in
+ step 4. Once a new node has been allocated, that node is added to
+ the tree using a similar process to adding a new data block mapping.
+ The data_vio journals the intent to add the new node to the block
+ map tree (step 10), updates the reference count of the new block
+ (step 11), and reacquires the implicit logical zone lock to add the
+ new mapping to the parent tree node (step 12). Once the tree is
+ updated, the data_vio proceeds down the tree. Any other data_vios
+ waiting on this allocation also proceed.
+
+ b. In the steady-state case, the block map tree nodes will already be
+ allocated, so the data_vio just traverses the tree until it finds
+ the required leaf node. The location of the mapping (the "block map
+ slot") is recorded in the data_vio so that later steps do not need
+ to traverse the tree again. The data_vio then releases the implicit
+ logical zone lock.
+
+4. If the block is a zero block, skip to step 9. Otherwise, an attempt is
+ made to allocate a free data block. This allocation ensures that the
+ data_vio can write its data somewhere even if deduplication and
+ compression are not possible. This stage gets an implicit lock on a
+ physical zone to search for free space within that zone.
+
+ The data_vio will search each slab in a zone until it finds a free
+ block or decides there are none. If the first zone has no free space,
+ it will proceed to search the next physical zone by taking the implicit
+ lock for that zone and releasing the previous one until it finds a
+ free block or runs out of zones to search. The data_vio will acquire a
+ struct pbn_lock (the "physical block lock") on the free block. The
+ struct pbn_lock also has several fields to record the various kinds of
+ claims that data_vios can have on physical blocks. The pbn_lock is
+ added to a hashtable like the logical block locks in step 2. This
+ hashtable is also covered by the implicit physical zone lock. The
+ reference count of the free block is updated to prevent any other
+ data_vio from considering it free. The reference counters are a
+ sub-component of the slab and are thus also covered by the implicit
+ physical zone lock.
+
+5. If an allocation was obtained, the data_vio has all the resources it
+ needs to complete the write. The application bio can safely be
+ acknowledged at this point. The acknowledgment happens on a separate
+ thread to prevent the application callback from blocking other data_vio
+ operations.
+
+ If an allocation could not be obtained, the data_vio continues to
+ attempt to deduplicate or compress the data, but the bio is not
+ acknowledged because the vdo device may be out of space.
+
+6. At this point vdo must determine where to store the application data.
+ The data_vio's data is hashed and the hash (the "record name") is
+ recorded in the data_vio.
+
+7. The data_vio reserves or joins a struct hash_lock, which manages all of
+ the data_vios currently writing the same data. Active hash locks are
+ tracked in a hashtable similar to the way logical block locks are
+ tracked in step 2. This hashtable is covered by the implicit lock on
+ the hash zone.
+
+ If there is no existing hash lock for this data_vio's record_name, the
+ data_vio obtains a hash lock from the pool, adds it to the hashtable,
+ and sets itself as the new hash lock's "agent." The hash_lock pool is
+ also covered by the implicit hash zone lock. The hash lock agent will
+ do all the work to decide where the application data will be
+ written. If a hash lock for the data_vio's record_name already exists,
+ and the data_vio's data is the same as the agent's data, the new
+ data_vio will wait for the agent to complete its work and then share
+ its result.
+
+ In the rare case that a hash lock exists for the data_vio's hash but
+ the data does not match the hash lock's agent, the data_vio skips to
+ step 8h and attempts to write its data directly. This can happen if two
+ different data blocks produce the same hash, for example.
+
+8. The hash lock agent attempts to deduplicate or compress its data with
+ the following steps.
+
+ a. The agent initializes and sends its embedded deduplication request
+ (struct uds_request) to the deduplication index. This does not
+ require the data_vio to get any locks because the index components
+ manage their own locking. The data_vio waits until it either gets a
+ response from the index or times out.
+
+ b. If the deduplication index returns advice, the data_vio attempts to
+ obtain a physical block lock on the indicated physical address, in
+ order to read the data and verify that it is the same as the
+ data_vio's data, and that it can accept more references. If the
+ physical address is already locked by another data_vio, the data at
+ that address may soon be overwritten so it is not safe to use the
+ address for deduplication.
+
+ c. If the data matches and the physical block can add references, the
+ agent and any other data_vios waiting on it will record this
+ physical block as their new physical address and proceed to step 9
+ to record their new mapping. If there are more data_vios in the hash
+ lock than there are references available, one of the remaining
+ data_vios becomes the new agent and continues to step 8d as if no
+ valid advice was returned.
+
+ d. If no usable duplicate block was found, the agent first checks that
+      it has an allocated physical block (from step 4) that it can write
+ to. If the agent does not have an allocation, some other data_vio in
+ the hash lock that does have an allocation takes over as agent. If
+ none of the data_vios have an allocated physical block, these writes
+ are out of space, so they proceed to step 13 for cleanup.
+
+ e. The agent attempts to compress its data. If the data does not
+ compress, the data_vio will continue to step 8h to write its data
+ directly.
+
+ If the compressed size is small enough, the agent will release the
+ implicit hash zone lock and go to the packer (struct packer) where
+ it will be placed in a bin (struct packer_bin) along with other
+ data_vios. All compression operations require the implicit lock on
+ the packer zone.
+
+ The packer can combine up to 14 compressed blocks in a single 4k
+ data block. Compression is only helpful if vdo can pack at least 2
+ data_vios into a single data block. This means that a data_vio may
+ wait in the packer for an arbitrarily long time for other data_vios
+ to fill out the compressed block. There is a mechanism for vdo to
+ evict waiting data_vios when continuing to wait would cause
+ problems. Circumstances causing an eviction include an application
+ flush, device shutdown, or a subsequent data_vio trying to overwrite
+ the same logical block address. A data_vio may also be evicted from
+ the packer if it cannot be paired with any other compressed block
+ before more compressible blocks need to use its bin. An evicted
+ data_vio will proceed to step 8h to write its data directly.
+
+ f. If the agent fills a packer bin, either because all 14 of its slots
+ are used or because it has no remaining space, it is written out
+ using the allocated physical block from one of its data_vios. Step
+ 8d has already ensured that an allocation is available.
+
+ g. Each data_vio sets the compressed block as its new physical address.
+ The data_vio obtains an implicit lock on the physical zone and
+ acquires the struct pbn_lock for the compressed block, which is
+ modified to be a shared lock. Then it releases the implicit physical
+ zone lock and proceeds to step 8i.
+
+ h. Any data_vio evicted from the packer will have an allocation from
+      step 4. It will write its data to that allocated physical block.
+
+ i. After the data is written, if the data_vio is the agent of a hash
+ lock, it will reacquire the implicit hash zone lock and share its
+ physical address with as many other data_vios in the hash lock as
+ possible. Each data_vio will then proceed to step 9 to record its
+ new mapping.
+
+ j. If the agent actually wrote new data (whether compressed or not),
+ the deduplication index is updated to reflect the location of the
+ new data. The agent then releases the implicit hash zone lock.
+
+9. The data_vio determines the previous mapping of the logical address.
+ There is a cache for block map leaf pages (the "block map cache"),
+ because there are usually too many block map leaf nodes to store
+ entirely in memory. If the desired leaf page is not in the cache, the
+ data_vio will reserve a slot in the cache and load the desired page
+ into it, possibly evicting an older cached page. The data_vio then
+ finds the current physical address for this logical address (the "old
+ physical mapping"), if any, and records it. This step requires a lock
+ on the block map cache structures, covered by the implicit logical zone
+ lock.
+
+10. The data_vio makes an entry in the recovery journal containing the
+ logical block address, the old physical mapping, and the new physical
+ mapping. Making this journal entry requires holding the implicit
+ recovery journal lock. The data_vio will wait in the journal until all
+ recovery blocks up to the one containing its entry have been written
+ and flushed to ensure the transaction is stable on storage.
+
+11. Once the recovery journal entry is stable, the data_vio makes two slab
+ journal entries: an increment entry for the new mapping, and a
+ decrement entry for the old mapping. These two operations each require
+ holding a lock on the affected physical slab, covered by its implicit
+ physical zone lock. For correctness during recovery, the slab journal
+ entries in any given slab journal must be in the same order as the
+ corresponding recovery journal entries. Therefore, if the two entries
+ are in different zones, they are made concurrently, and if they are in
+ the same zone, the increment is always made before the decrement in
+ order to avoid underflow. After each slab journal entry is made in
+ memory, the associated reference count is also updated in memory.
+
+12. Once both of the reference count updates are done, the data_vio
+ acquires the implicit logical zone lock and updates the
+ logical-to-physical mapping in the block map to point to the new
+ physical block. At this point the write operation is complete.
+
+13. If the data_vio has a hash lock, it acquires the implicit hash zone
+ lock and releases its hash lock to the pool.
+
+ The data_vio then acquires the implicit physical zone lock and releases
+ the struct pbn_lock it holds for its allocated block. If it had an
+ allocation that it did not use, it also sets the reference count for
+ that block back to zero to free it for use by subsequent data_vios.
+
+ The data_vio then acquires the implicit logical zone lock and releases
+ the logical block lock acquired in step 2.
+
+ The application bio is then acknowledged if it has not previously been
+ acknowledged, and the data_vio is returned to the pool.
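+
+The per-zone lock tables used in steps 2, 3a, 4, and 7 all follow the same
+pattern. Below is a minimal, self-contained sketch of the logical-lock
+variant from step 2; the names and data structure are illustrative and do
+not match the actual vdo implementation. Each logical zone serializes
+access to its own table (the implicit zone lock described above), so no
+explicit locking appears here::
+
+  #include <stdint.h>
+  #include <stdlib.h>
+
+  struct data_vio;                      /* opaque for this sketch */
+
+  struct logical_lock {
+          uint64_t lbn;                 /* logical block number (the key) */
+          struct data_vio *holder;      /* data_vio currently handling it */
+          struct logical_lock *next;    /* hash chain */
+  };
+
+  #define EXAMPLE_BUCKETS 1024
+
+  struct logical_lock_table {
+          struct logical_lock *buckets[EXAMPLE_BUCKETS];
+  };
+
+  /*
+   * Try to claim a logical block. Returns NULL on success, or the
+   * data_vio already holding the block, in which case the caller must
+   * wait for (and notify) that holder, as described in step 2.
+   */
+  static struct data_vio *example_claim(struct logical_lock_table *t,
+                                        uint64_t lbn,
+                                        struct data_vio *claimant)
+  {
+          struct logical_lock **bucket = &t->buckets[lbn % EXAMPLE_BUCKETS];
+          struct logical_lock *entry;
+
+          for (entry = *bucket; entry; entry = entry->next)
+                  if (entry->lbn == lbn)
+                          return entry->holder;
+
+          entry = calloc(1, sizeof(*entry));   /* error handling omitted */
+          entry->lbn = lbn;
+          entry->holder = claimant;
+          entry->next = *bucket;
+          *bucket = entry;
+          return NULL;
+  }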
+
+*Read Path*
+
+An application read bio follows a much simpler set of steps. It does steps
+1 and 2 in the write path to obtain a data_vio and lock its logical
+address. If there is already a write data_vio in progress for that logical
+address that is guaranteed to complete, the read data_vio will copy the
+data from the write data_vio and return it. Otherwise, it will look up the
+logical-to-physical mapping by traversing the block map tree as in step 3,
+and then read and possibly decompress the indicated data at the indicated
+physical block address. A read data_vio will not allocate block map tree
+nodes if they are missing. If the interior block map nodes do not exist
+yet, the logical block map address must still be unmapped and the read
+data_vio will return all zeroes. A read data_vio handles cleanup and
+acknowledgment as in step 13, although it only needs to release the logical
+lock and return itself to the pool.
+
+*Small Writes*
+
+All storage within vdo is managed as 4KB blocks, but it can accept writes
+as small as 512 bytes. Processing a write that is smaller than 4K requires
+a read-modify-write operation that reads the relevant 4K block, copies the
+new data over the appropriate sectors of the block, and then launches a
+write operation for the modified data block. The read and write stages of
+this operation are nearly identical to the normal read and write
+operations, and a single data_vio is used throughout this operation.
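+
+Conceptually, the modify stage amounts to the following sketch; the helper
+is hypothetical, and vdo performs the surrounding read and write stages
+with its normal read and write machinery::
+
+  #include <stdint.h>
+  #include <string.h>
+
+  enum { VDO_BLOCK_SIZE = 4096, SECTOR_SIZE = 512 };
+
+  /* 'block' already holds the current 4K contents (the read stage). */
+  static void example_modify(uint8_t block[VDO_BLOCK_SIZE],
+                             const uint8_t *new_data,
+                             unsigned int first_sector,
+                             unsigned int sector_count)
+  {
+          /* Assumes first_sector + sector_count <= 8 (one 4K block). */
+          memcpy(block + (size_t)first_sector * SECTOR_SIZE, new_data,
+                 (size_t)sector_count * SECTOR_SIZE);
+          /* The modified block is then written via the normal write path. */
+  }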
+
+*Recovery*
+
+When a vdo is restarted after a crash, it will attempt to recover from the
+recovery journal. During the pre-resume phase of the next start, the
+recovery journal is read. The increment portion of valid entries is played
+into the block map. Next, valid entries are played, in order as required,
+into the slab journals. Finally, each physical zone attempts to replay at
+least one slab journal to reconstruct the reference counts of one slab.
+Once each zone has some free space (or has determined that it has none),
+the vdo comes back online, while the remainder of the slab journals are
+used to reconstruct the rest of the reference counts in the background.
+
+*Read-only Rebuild*
+
+If a vdo encounters an unrecoverable error, it will enter read-only mode.
+This mode indicates that some previously acknowledged data may have been
+lost. The vdo may be instructed to rebuild as best it can in order to
+return to a writable state. However, this is never done automatically due
+to the possibility that data has been lost. During a read-only rebuild, the
+block map is recovered from the recovery journal as before. However, the
+reference counts are not rebuilt from the slab journals. Instead, the
+reference counts are zeroed, the entire block map is traversed, and the
+reference counts are updated from the block mappings. While this may lose
+some data, it ensures that the block map and reference counts are
+consistent with each other. This allows vdo to resume normal operation and
+accept further writes.
diff --git a/Documentation/admin-guide/device-mapper/vdo.rst b/Documentation/admin-guide/device-mapper/vdo.rst
new file mode 100644
index 000000000000..7e1ecafdf91e
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/vdo.rst
@@ -0,0 +1,406 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+dm-vdo
+======
+
+The dm-vdo (virtual data optimizer) device mapper target provides
+block-level deduplication, compression, and thin provisioning. As a device
+mapper target, it can add these features to the storage stack, compatible
+with any file system. The vdo target does not protect against data
+corruption, relying instead on integrity protection of the storage below
+it. It is strongly recommended that lvm be used to manage vdo volumes. See
+lvmvdo(7).
+
+Userspace component
+===================
+
+Formatting a vdo volume requires the use of the 'vdoformat' tool, available
+at:
+
+https://github.com/dm-vdo/vdo/
+
+In most cases, a vdo target will recover from a crash automatically the
+next time it is started. In cases where it encountered an unrecoverable
+error (either during normal operation or crash recovery), the target will
+enter or come up in read-only mode. Because read-only mode is indicative of
+data loss, a positive action must be taken to bring vdo out of read-only
+mode. The 'vdoforcerebuild' tool, available from the same repo, is used to
+prepare a read-only vdo to exit read-only mode. After running this tool,
+the vdo target will rebuild its metadata the next time it is
+started. Although some data may be lost, the rebuilt vdo's metadata will be
+internally consistent and the target will be writable again.
+
+The repo also contains additional userspace tools which can be used to
+inspect a vdo target's on-disk metadata. Fortunately, these tools are
+rarely needed except by dm-vdo developers.
+
+Metadata requirements
+=====================
+
+Each vdo volume reserves 3GB of space for metadata, or more depending on
+its configuration. It is helpful to check that the space saved by
+deduplication and compression is not cancelled out by the metadata
+requirements. An estimation of the space saved for a specific dataset can
+be computed with the vdo estimator tool, which is available at:
+
+https://github.com/dm-vdo/vdoestimator/
+
+Target interface
+================
+
+Table line
+----------
+
+::
+
+ <offset> <logical device size> vdo V4 <storage device>
+ <storage device size> <minimum I/O size> <block map cache size>
+ <block map era length> [optional arguments]
+
+
+Required parameters:
+
+ offset:
+ The offset, in sectors, at which the vdo volume's logical
+ space begins.
+
+ logical device size:
+ The size of the device which the vdo volume will service,
+ in sectors. Must match the current logical size of the vdo
+ volume.
+
+ storage device:
+ The device holding the vdo volume's data and metadata.
+
+ storage device size:
+ The size of the device holding the vdo volume, as a number
+ of 4096-byte blocks. Must match the current size of the vdo
+ volume.
+
+ minimum I/O size:
+ The minimum I/O size for this vdo volume to accept, in
+ bytes. Valid values are 512 or 4096. The recommended value
+ is 4096.
+
+ block map cache size:
+ The size of the block map cache, as a number of 4096-byte
+ blocks. The minimum and recommended value is 32768 blocks.
+ If the logical thread count is non-zero, the cache size
+ must be at least 4096 blocks per logical thread.
+
+ block map era length:
+ The speed with which the block map cache writes out
+ modified block map pages. A smaller era length is likely to
+ reduce the amount of time spent rebuilding, at the cost of
+ increased block map writes during normal operation. The
+ maximum and recommended value is 16380; the minimum value
+ is 1.
+
+Optional parameters:
+--------------------
+Some or all of these parameters may be specified as <key> <value> pairs.
+
+Thread related parameters:
+
+Different categories of work are assigned to separate thread groups, and
+the number of threads in each group can be configured separately.
+
+If <hash>, <logical>, and <physical> are all set to 0, the work of all
+three thread types will be handled by a single thread. If any of these
+values are non-zero, all of them must be non-zero.
+
+ ack:
+ The number of threads used to complete bios. Since
+ completing a bio calls an arbitrary completion function
+ outside the vdo volume, threads of this type allow the vdo
+ volume to continue processing requests even when bio
+ completion is slow. The default is 1.
+
+ bio:
+ The number of threads used to issue bios to the underlying
+ storage. Threads of this type allow the vdo volume to
+ continue processing requests even when bio submission is
+ slow. The default is 4.
+
+ bioRotationInterval:
+ The number of bios to enqueue on each bio thread before
+ switching to the next thread. The value must be greater
+ than 0 and not more than 1024; the default is 64.
+
+ cpu:
+ The number of threads used to do CPU-intensive work, such
+ as hashing and compression. The default is 1.
+
+ hash:
+ The number of threads used to manage data comparisons for
+ deduplication based on the hash value of data blocks. The
+ default is 0.
+
+ logical:
+ The number of threads used to manage caching and locking
+ based on the logical address of incoming bios. The default
+ is 0; the maximum is 60.
+
+ physical:
+ The number of threads used to manage administration of the
+ underlying storage device. At format time, a slab size for
+ the vdo is chosen; the vdo storage device must be large
+ enough to have at least 1 slab per physical thread. The
+ default is 0; the maximum is 16.
+
+Miscellaneous parameters:
+
+ maxDiscard:
+ The maximum size of discard bio accepted, in 4096-byte
+ blocks. I/O requests to a vdo volume are normally split
+ into 4096-byte blocks, and processed up to 2048 at a time.
+ However, discard requests to a vdo volume can be
+ automatically split to a larger size, up to <maxDiscard>
+ 4096-byte blocks in a single bio, and are limited to 1500
+ at a time. Increasing this value may provide better overall
+ performance, at the cost of increased latency for the
+ individual discard requests. The default and minimum is 1;
+ the maximum is UINT_MAX / 4096.
+
+ deduplication:
+ Whether deduplication is enabled. The default is 'on'; the
+ acceptable values are 'on' and 'off'.
+
+ compression:
+ Whether compression is enabled. The default is 'off'; the
+ acceptable values are 'on' and 'off'.
+
+Device modification
+-------------------
+
+A modified table may be loaded into a running, non-suspended vdo volume.
+The modifications will take effect when the device is next resumed. The
+modifiable parameters are <logical device size>, <physical device size>,
+<maxDiscard>, <compression>, and <deduplication>.
+
+If the logical device size or physical device size is changed, upon
+successful resume vdo will store the new values and require them on future
+startups. These two parameters may not be decreased. The logical device
+size may not exceed 4 PB. The physical device size must increase by at
+least 32832 4096-byte blocks if at all, and must not exceed the size of the
+underlying storage device. Additionally, when formatting the vdo device, a
+slab size is chosen: the physical device size may never increase above the
+size which provides 8192 slabs, and each increase must be large enough to
+add at least one new slab.
+
+Examples:
+
+Start a previously-formatted vdo volume with 1 GB logical space and 1 GB
+physical space, storing to /dev/dm-1 which has more than 1 GB of space.
+
+::
+
+ dmsetup create vdo0 --table \
+ "0 2097152 vdo V4 /dev/dm-1 262144 4096 32768 16380"
+
+Grow the logical size to 4 GB.
+
+::
+
+ dmsetup reload vdo0 --table \
+ "0 8388608 vdo V4 /dev/dm-1 262144 4096 32768 16380"
+ dmsetup resume vdo0
+
+Grow the physical size to 2 GB.
+
+::
+
+ dmsetup reload vdo0 --table \
+ "0 8388608 vdo V4 /dev/dm-1 524288 4096 32768 16380"
+ dmsetup resume vdo0
+
+Grow the physical size by 1 GB more and increase the maximum discard size.
+
+::
+
+ dmsetup reload vdo0 --table \
+ "0 10485760 vdo V4 /dev/dm-1 786432 4096 32768 16380 maxDiscard 8"
+ dmsetup resume vdo0
+
+Stop the vdo volume.
+
+::
+
+ dmsetup remove vdo0
+
+Start the vdo volume again. Note that the logical and physical device sizes
+must still match, but other parameters can change.
+
+::
+
+ dmsetup create vdo1 --table \
+ "0 10485760 vdo V4 /dev/dm-1 786432 512 65550 5000 hash 1 logical 3 physical 2"
+
+Messages
+--------
+All vdo devices accept messages in the form:
+
+::
+
+  dmsetup message <target-name> 0 <message-name> <message-parameters>
+
+The messages are:
+
+ stats:
+ Outputs the current view of the vdo statistics. Mostly used
+ by the vdostats userspace program to interpret the output
+ buffer.
+
+ dump:
+ Dumps many internal structures to the system log. This is
+ not always safe to run, so it should only be used to debug
+ a hung vdo. Optional parameters to specify structures to
+ dump are:
+
+ viopool: The pool of I/O requests incoming bios
+ pools: A synonym of 'viopool'
+ vdo: Most of the structures managing on-disk data
+ queues: Basic information about each vdo thread
+ threads: A synonym of 'queues'
+ default: Equivalent to 'queues vdo'
+ all: All of the above.
+
+ dump-on-shutdown:
+ Perform a default dump next time vdo shuts down.
+
+
+Status
+------
+
+::
+
+ <device> <operating mode> <in recovery> <index state>
+ <compression state> <physical blocks used> <total physical blocks>
+
+ device:
+ The name of the vdo volume.
+
+ operating mode:
+ The current operating mode of the vdo volume; values may be
+ 'normal', 'recovering' (the volume has detected an issue
+ with its metadata and is attempting to repair itself), and
+ 'read-only' (an error has occurred that forces the vdo
+ volume to only support read operations and not writes).
+
+ in recovery:
+ Whether the vdo volume is currently in recovery mode;
+ values may be 'recovering' or '-' which indicates not
+ recovering.
+
+ index state:
+ The current state of the deduplication index in the vdo
+ volume; values may be 'closed', 'closing', 'error',
+ 'offline', 'online', 'opening', and 'unknown'.
+
+ compression state:
+ The current state of compression in the vdo volume; values
+ may be 'offline' and 'online'.
+
+ used physical blocks:
+ The number of physical blocks in use by the vdo volume.
+
+ total physical blocks:
+ The total number of physical blocks the vdo volume may use;
+ the difference between this value and the
+ <used physical blocks> is the number of blocks the vdo
+ volume has left before being full.
+
+Memory Requirements
+===================
+
+A vdo target requires a fixed 38 MB of RAM along with the following amounts
+that scale with the target:
+
+- 1.15 MB of RAM for each 1 MB of configured block map cache size. The
+ block map cache requires a minimum of 150 MB.
+- 1.6 MB of RAM for each 1 TB of logical space.
+- 268 MB of RAM for each 1 TB of physical storage managed by the volume.
+
+The deduplication index requires additional memory which scales with the
+size of the deduplication window. For dense indexes, the index requires 1
+GB of RAM per 1 TB of window. For sparse indexes, the index requires 1 GB
+of RAM per 10 TB of window. The index configuration is set when the target
+is formatted and may not be modified.
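+
+As a worked example, the figures above can be combined as follows; all of
+the configuration values in this sketch are hypothetical::
+
+  #include <stdio.h>
+
+  int main(void)
+  {
+          double cache_mb = 1024.0;    /* block map cache size */
+          double logical_tb = 4.0;     /* logical space */
+          double physical_tb = 10.0;   /* physical storage */
+          double window_tb = 1.0;      /* dedup window, dense index */
+
+          double ram_mb = 38.0                  /* fixed overhead */
+                        + 1.15 * cache_mb       /* block map cache */
+                        + 1.6 * logical_tb      /* logical space */
+                        + 268.0 * physical_tb;  /* physical storage */
+          double index_gb = window_tb;          /* dense: 1 GB per 1 TB */
+
+          /* Prints roughly 3902 MB plus 1 GB for the index. */
+          printf("vdo RAM: ~%.0f MB, dedup index: ~%.0f GB\n",
+                 ram_mb, index_gb);
+          return 0;
+  }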
+
+Module Parameters
+=================
+
+The vdo driver has a numeric parameter 'log_level' which controls the
+verbosity of logging from the driver. The default setting is 6
+(LOGLEVEL_INFO and more severe messages).
+
+Run-time Usage
+==============
+
+When using dm-vdo, it is important to be aware of the ways in which its
+behavior differs from other storage targets.
+
+- There is no guarantee that over-writes of existing blocks will succeed.
+ Because the underlying storage may be multiply referenced, over-writing
+ an existing block generally requires a vdo to have a free block
+ available.
+
+- When blocks are no longer in use, sending a discard request for those
+  blocks lets the vdo release its references to them. If the vdo is
+ thinly provisioned, discarding unused blocks is essential to prevent the
+ target from running out of space. However, due to the sharing of
+ duplicate blocks, no discard request for any given logical block is
+ guaranteed to reclaim space.
+
+- Assuming the underlying storage properly implements flush requests, vdo
+  is resilient against crashes; however, unflushed writes may or may not
+ persist after a crash.
+
+- Each write to a vdo target entails a significant amount of processing.
+  However, much of the work is parallelizable. Therefore, vdo targets
+  achieve better throughput at higher I/O depths, and can support up to
+  2048 requests in parallel.
+
+Tuning
+======
+
+The vdo device has many options, and it can be difficult to make optimal
+choices without perfect knowledge of the workload. Additionally, most
+configuration options must be set when a vdo target is started and cannot
+be changed while the target is active. Ideally, tuning with simulated
+workloads should be performed before deploying vdo in production
+environments.
+
+The most important value to adjust is the block map cache size. In order to
+service a request for any logical address, a vdo must load the portion of
+the block map which holds the relevant mapping. These mappings are cached.
+Performance will suffer when the working set does not fit in the cache. By
+default, a vdo allocates 128 MB of metadata cache in RAM to support
+efficient access to 100 GB of logical space at a time. It should be scaled
+up proportionally for larger working sets.
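+
+For example, scaling the default proportionally (the working-set size here
+is hypothetical)::
+
+  #include <stdio.h>
+  #include <stdint.h>
+
+  int main(void)
+  {
+          double working_set_gb = 500.0;   /* hypothetical working set */
+
+          /* Scale the 128 MB-per-100 GB default proportionally. */
+          double cache_mb = 128.0 * (working_set_gb / 100.0);
+          uint64_t cache_blocks =
+                  (uint64_t)(cache_mb * 1024.0 * 1024.0) / 4096;
+
+          /* Prints 640 MB, i.e. 163840 4096-byte blocks. */
+          printf("block map cache: ~%.0f MB (%llu blocks)\n",
+                 cache_mb, (unsigned long long)cache_blocks);
+          return 0;
+  }
+
+The resulting block count must also respect the minimums described in the
+table line parameters above: at least 32768 blocks, and at least 4096
+blocks per configured logical thread.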
+
+The logical and physical thread counts should also be adjusted. A logical
+thread controls a disjoint section of the block map, so additional logical
+threads increase parallelism and can increase throughput. Physical threads
+control a disjoint section of the data blocks, so additional physical
+threads can also increase throughput. However, excess threads can waste
+resources and increase contention.
+
+Bio submission threads control the parallelism involved in sending I/O to
+the underlying storage; fewer threads mean there is more opportunity to
+reorder I/O requests for performance benefit, but also that each I/O
+request has to wait longer before being submitted.
+
+Bio acknowledgment threads are used for finishing I/O requests. This is
+done on dedicated threads since the amount of work required to execute a
+bio's callback cannot be controlled by the vdo itself. Usually one thread
+is sufficient, but additional threads may be beneficial, particularly when
+bios have CPU-heavy callbacks.
+
+CPU threads are used for hashing and for compression; in workloads with
+compression enabled, more threads may result in higher throughput.
+
+Hash threads are used to sort active requests by hash and determine whether
+they should deduplicate; the most CPU-intensive action done by these
+threads is comparing 4096-byte data blocks. In most cases, a single
+hash thread is sufficient.
diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst
index bb02caa45289..a65c1602cb23 100644
--- a/Documentation/admin-guide/device-mapper/verity.rst
+++ b/Documentation/admin-guide/device-mapper/verity.rst
@@ -69,7 +69,7 @@ Construction Parameters
<#opt_params>
Number of optional parameters. If there are no optional parameters,
- the optional paramaters section can be skipped or #opt_params can be zero.
+ the optional parameters section can be skipped or #opt_params can be zero.
Otherwise #opt_params is the number of following arguments.
Example of optional parameters section:
@@ -83,6 +83,10 @@ restart_on_corruption
not compatible with ignore_corruption and requires user space support to
avoid restart loops.
+panic_on_corruption
+ Panic the device when a corrupted block is discovered. This option is
+ not compatible with ignore_corruption and restart_on_corruption.
+
ignore_zero_blocks
Do not verify blocks that are expected to contain zeroes and always return
zeroes instead. This may be useful if the partition contains unused blocks
@@ -130,7 +134,16 @@ root_hash_sig_key_desc <key_description>
the pkcs7 signature of the roothash. The pkcs7 signature is used to validate
the root hash during the creation of the device mapper block device.
Verification of roothash depends on the config DM_VERITY_VERIFY_ROOTHASH_SIG
- being set in the kernel.
+ being set in the kernel. The signatures are checked against the builtin
+ trusted keyring by default, or the secondary trusted keyring if
+ DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING is set. The secondary
+ trusted keyring includes by default the builtin trusted keyring, and it can
+ also gain new certificates at run time if they are signed by a certificate
+ already in the secondary trusted keyring.
+
+try_verify_in_tasklet
+ If verity hashes are in cache, verify data blocks in kernel tasklet instead
+ of workqueue. This option can reduce IO latency.
Theory of operation
===================
diff --git a/Documentation/admin-guide/device-mapper/writecache.rst b/Documentation/admin-guide/device-mapper/writecache.rst
index d3d7690f5e8d..60c16b7fd5ac 100644
--- a/Documentation/admin-guide/device-mapper/writecache.rst
+++ b/Documentation/admin-guide/device-mapper/writecache.rst
@@ -12,7 +12,6 @@ first sector should contain valid superblock from previous invocation.
Constructor parameters:
1. type of the cache device - "p" or "s"
-
- p - persistent memory
- s - SSD
2. the underlying device that will be cached
@@ -37,10 +36,10 @@ Constructor parameters:
autocommit_blocks n (default: 64 for pmem, 65536 for ssd)
when the application writes this amount of blocks without
issuing the FLUSH request, the blocks are automatically
- commited
+ committed
autocommit_time ms (default: 1000)
autocommit time in milliseconds. The data is automatically
- commited if this time passes and no FLUSH request is
+ committed if this time passes and no FLUSH request is
received
fua (by default on)
applicable only to persistent memory - use the FUA flag
@@ -53,19 +52,51 @@ Constructor parameters:
- some underlying devices perform better with fua, some
with nofua. The user should test it
+ cleaner
+ when this option is activated (either in the constructor
+ arguments or by a message), the cache will not promote
+ new writes (however, writes to already cached blocks are
+ promoted, to avoid data corruption due to misordered
+ writes) and it will gradually writeback any cached
+ data. The userspace can then monitor the cleaning
+ process with "dmsetup status". When the number of cached
+ blocks drops to zero, userspace can unload the
+ dm-writecache target and replace it with dm-linear or
+ other targets.
+ max_age n
+ specifies the maximum age of a block in milliseconds. If
+ a block is stored in the cache for too long, it will be
+ written to the underlying device and cleaned up.
+ metadata_only
+ only metadata is promoted to the cache. This option
+ improves performance for heavier REQ_META workloads.
+ pause_writeback n (default: 3000)
+ pause writeback if there was some write I/O redirected to
+ the origin volume in the last n milliseconds
Status:
+
1. error indicator - 0 if there was no error, otherwise error number
2. the number of blocks
3. the number of free blocks
4. the number of blocks under writeback
+5. the number of read blocks
+6. the number of read blocks that hit the cache
+7. the number of write blocks
+8. the number of write blocks that hit uncommitted block
+9. the number of write blocks that hit committed block
+10. the number of write blocks that bypass the cache
+11. the number of write blocks that are allocated in the cache
+12. the number of write requests that are blocked on the freelist
+13. the number of flush requests
+14. the number of discarded blocks
Messages:
flush
- flush the cache device. The message returns successfully
+ Flush the cache device. The message returns successfully
if the cache device was flushed without an error
flush_on_suspend
- flush the cache device on next suspend. Use this message
+ Flush the cache device on next suspend. Use this message
when you are going to remove the cache device. The proper
sequence for removing the cache device is:
@@ -77,3 +108,7 @@ Messages:
5. resume the device, so that it will use the linear
target
6. the cache device is now inactive and it can be deleted
+ cleaner
+ See above "cleaner" constructor documentation.
+ clear_stats
+ Clear the statistics that are reported on the status line
diff --git a/Documentation/admin-guide/devices.rst b/Documentation/admin-guide/devices.rst
index 035275fedbdd..e3776d77374b 100644
--- a/Documentation/admin-guide/devices.rst
+++ b/Documentation/admin-guide/devices.rst
@@ -7,10 +7,9 @@ This list is the Linux Device List, the official registry of allocated
device numbers and ``/dev`` directory nodes for the Linux operating
system.
-The LaTeX version of this document is no longer maintained, nor is
-the document that used to reside at lanana.org. This version in the
-mainline Linux kernel is the master document. Updates shall be sent
-as patches to the kernel maintainers (see the
+The version of this document at lanana.org is no longer maintained. This
+version in the mainline Linux kernel is the master document. Updates
+shall be sent as patches to the kernel maintainers (see the
:ref:`Documentation/process/submitting-patches.rst <submittingpatches>` document).
Specifically explore the sections titled "CHAR and MISC DRIVERS", and
"BLOCK LAYER" in the MAINTAINERS file to find the right maintainers
diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt
index 2a97aaec8b12..94c98be1329a 100644
--- a/Documentation/admin-guide/devices.txt
+++ b/Documentation/admin-guide/devices.txt
@@ -4,7 +4,7 @@
1 char Memory devices
1 = /dev/mem Physical memory access
- 2 = /dev/kmem Kernel virtual memory access
+ 2 = /dev/kmem OBSOLETE - replaced by /proc/kcore
3 = /dev/null Null device
4 = /dev/port I/O port access
5 = /dev/zero Null byte source
@@ -289,7 +289,7 @@
152 = /dev/kpoll Kernel Poll Driver
153 = /dev/mergemem Memory merge device
154 = /dev/pmu Macintosh PowerBook power manager
- 155 = /dev/isictl MultiTech ISICom serial control
+ 155 =
156 = /dev/lcd Front panel LCD display
157 = /dev/ac Applicom Intl Profibus card
158 = /dev/nwbutton Netwinder external button
@@ -375,8 +375,9 @@
239 = /dev/uhid User-space I/O driver support for HID subsystem
240 = /dev/userio Serio driver testing device
241 = /dev/vhost-vsock Host kernel driver for virtio vsock
+ 242 = /dev/rfkill Turning off radio transmissions (rfkill)
- 242-254 Reserved for local use
+ 243-254 Reserved for local use
255 Reserved for MISC_DYNAMIC_MINOR
11 char Raw keyboard device (Linux/SPARC only)
@@ -476,11 +477,6 @@
18 block Sanyo CD-ROM
0 = /dev/sjcd Sanyo CD-ROM
- 19 char Cyclades serial card
- 0 = /dev/ttyC0 First Cyclades port
- ...
- 31 = /dev/ttyC31 32nd Cyclades port
-
19 block "Double" compressed disk
0 = /dev/double0 First compressed disk
...
@@ -492,11 +488,6 @@
See the Double documentation for the meaning of the
mirror devices.
- 20 char Cyclades serial card - alternate devices
- 0 = /dev/cub0 Callout device for ttyC0
- ...
- 31 = /dev/cub31 Callout device for ttyC31
-
20 block Hitachi CD-ROM (under development)
0 = /dev/hitcd Hitachi CD-ROM
@@ -1442,7 +1433,7 @@
...
The driver and documentation may be obtained from
- http://www.winradio.com/
+ https://www.winradio.com/
82 block I2O hard disk
0 = /dev/i2o/hdag 33rd I2O hard disk, whole disk
@@ -1656,12 +1647,12 @@
dynamically, so there is no fixed mapping from subdevice
pathnames to minor numbers.
- See http://www.comedi.org/ for information about the Comedi
+ See https://www.comedi.org/ for information about the Comedi
project.
98 block User-mode virtual block device
0 = /dev/ubda First user-mode block device
- 16 = /dev/udbb Second user-mode block device
+ 16 = /dev/ubdb Second user-mode block device
...
Partitions are handled in the same way as for IDE
@@ -1723,7 +1714,7 @@
implementations a kernel presence for caching and easy
mounting. For more information about the project,
write to <arla-drinkers@stacken.kth.se> or see
- http://www.stacken.kth.se/project/arla/
+ https://www.stacken.kth.se/project/arla/
103 block Audit device
0 = /dev/audit Audit device
@@ -1942,7 +1933,7 @@
...
255= /dev/umem/d15p15 15th partition of 16th board.
- 117 char COSA/SRP synchronous serial card
+ 117 char [REMOVED] COSA/SRP synchronous serial card
0 = /dev/cosa0c0 1st board, 1st channel
1 = /dev/cosa0c1 1st board, 2nd channel
...
@@ -2348,13 +2339,7 @@
disks (see major number 3) except that the limit on
partitions is 31.
- 162 char Raw block device interface
- 0 = /dev/rawctl Raw I/O control device
- 1 = /dev/raw/raw1 First raw I/O device
- 2 = /dev/raw/raw2 Second raw I/O device
- ...
- max minor number of raw device is set by kernel config
- MAX_RAW_DEVS or raw module parameter 'max_raw_devs'
+ 162 char Used for (now removed) raw block device interface
163 char
@@ -2706,18 +2691,9 @@
45 = /dev/ttyMM1 Marvell MPSC - port 1 (obsolete unused)
46 = /dev/ttyCPM0 PPC CPM (SCC or SMC) - port 0
...
- 47 = /dev/ttyCPM5 PPC CPM (SCC or SMC) - port 5
- 50 = /dev/ttyIOC0 Altix serial card
- ...
- 81 = /dev/ttyIOC31 Altix serial card
+ 51 = /dev/ttyCPM5 PPC CPM (SCC or SMC) - port 5
82 = /dev/ttyVR0 NEC VR4100 series SIU
83 = /dev/ttyVR1 NEC VR4100 series DSIU
- 84 = /dev/ttyIOC84 Altix ioc4 serial card
- ...
- 115 = /dev/ttyIOC115 Altix ioc4 serial card
- 116 = /dev/ttySIOC0 Altix ioc3 serial card
- ...
- 147 = /dev/ttySIOC31 Altix ioc3 serial card
148 = /dev/ttyPSC0 PPC PSC - port 0
...
153 = /dev/ttyPSC5 PPC PSC - port 5
@@ -2728,6 +2704,9 @@
...
185 = /dev/ttyNX15 Hilscher netX serial port 15
186 = /dev/ttyJ0 JTAG1 DCC protocol based serial port emulation
+
+ If maximum number of uartlite serial ports is more than 4, then the driver
+ uses dynamic allocation instead of static allocation for major number.
187 = /dev/ttyUL0 Xilinx uartlite - port 0
...
190 = /dev/ttyUL3 Xilinx uartlite - port 3
@@ -2776,10 +2755,7 @@
43 = /dev/ttycusmx2 Callout device for ttySMX2
46 = /dev/cucpm0 Callout device for ttyCPM0
...
- 49 = /dev/cucpm5 Callout device for ttyCPM5
- 50 = /dev/cuioc40 Callout device for ttyIOC40
- ...
- 81 = /dev/cuioc431 Callout device for ttyIOC431
+ 51 = /dev/cucpm5 Callout device for ttyCPM5
82 = /dev/cuvr0 Callout device for ttyVR0
83 = /dev/cuvr1 Callout device for ttyVR1
@@ -3002,10 +2978,10 @@
65 = /dev/infiniband/issm1 Second InfiniBand IsSM device
...
127 = /dev/infiniband/issm63 63rd InfiniBand IsSM device
- 128 = /dev/infiniband/uverbs0 First InfiniBand verbs device
- 129 = /dev/infiniband/uverbs1 Second InfiniBand verbs device
+ 192 = /dev/infiniband/uverbs0 First InfiniBand verbs device
+ 193 = /dev/infiniband/uverbs1 Second InfiniBand verbs device
...
- 159 = /dev/infiniband/uverbs31 31st InfiniBand verbs device
+ 223 = /dev/infiniband/uverbs31 31st InfiniBand verbs device
232 char Biometric Devices
0 = /dev/biometric/sensor0/fingerprint first fingerprint sensor on first device
@@ -3095,6 +3071,11 @@
...
255 = /dev/osd255 256th OSD Device
+ 261 char Compute Acceleration Devices
+ 0 = /dev/accel/accel0 First acceleration device
+ 1 = /dev/accel/accel1 Second acceleration device
+ ...
+
384-511 char RESERVED FOR DYNAMIC ASSIGNMENT
Character devices that request a dynamic allocation of major
number will take numbers starting from 511 and downward,
diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst
index 1012bd9305e9..0e9b48daf690 100644
--- a/Documentation/admin-guide/dynamic-debug-howto.rst
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -5,143 +5,115 @@ Dynamic debug
Introduction
============
-This document describes how to use the dynamic debug (dyndbg) feature.
+Dynamic debug allows you to dynamically enable/disable kernel
+debug-print code to obtain additional kernel information.
-Dynamic debug is designed to allow you to dynamically enable/disable
-kernel code to obtain additional kernel information. Currently, if
-``CONFIG_DYNAMIC_DEBUG`` is set, then all ``pr_debug()``/``dev_dbg()`` and
-``print_hex_dump_debug()``/``print_hex_dump_bytes()`` calls can be dynamically
-enabled per-callsite.
+If ``/proc/dynamic_debug/control`` exists, your kernel has dynamic
+debug. You'll need root access (sudo su) to use this.
-If you do not want to enable dynamic debug globally (i.e. in some embedded
-system), you may set ``CONFIG_DYNAMIC_DEBUG_CORE`` as basic support of dynamic
-debug and add ``ccflags := -DDYNAMIC_DEBUG_MODULE`` into the Makefile of any
-modules which you'd like to dynamically debug later.
-
-If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is just
-shortcut for ``print_hex_dump(KERN_DEBUG)``.
-
-For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, format string is
-its ``prefix_str`` argument, if it is constant string; or ``hexdump``
-in case ``prefix_str`` is built dynamically.
+Dynamic debug provides:
-Dynamic debug has even more useful features:
+ * a Catalog of all *prdbgs* in your kernel.
+ ``cat /proc/dynamic_debug/control`` to see them.
- * Simple query language allows turning on and off debugging
- statements by matching any combination of 0 or 1 of:
+ * a Simple query/command language to alter *prdbgs* by selecting on
+ any combination of 0 or 1 of:
- source filename
- function name
- line number (including ranges of line numbers)
- module name
- format string
-
- * Provides a debugfs control file: ``<debugfs>/dynamic_debug/control``
- which can be read to display the complete list of known debug
- statements, to help guide you
-
-Controlling dynamic debug Behaviour
-===================================
-
-The behaviour of ``pr_debug()``/``dev_dbg()`` are controlled via writing to a
-control file in the 'debugfs' filesystem. Thus, you must first mount
-the debugfs filesystem, in order to make use of this feature.
-Subsequently, we refer to the control file as:
-``<debugfs>/dynamic_debug/control``. For example, if you want to enable
-printing from source file ``svcsock.c``, line 1603 you simply do::
-
- nullarbor:~ # echo 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
-
-If you make a mistake with the syntax, the write will fail thus::
-
- nullarbor:~ # echo 'file svcsock.c wtf 1 +p' >
- <debugfs>/dynamic_debug/control
- -bash: echo: write error: Invalid argument
-
-Note, for systems without 'debugfs' enabled, the control file can be
-found in ``/proc/dynamic_debug/control``.
+ - class name (as known/declared by each module)
Viewing Dynamic Debug Behaviour
===============================
-You can view the currently configured behaviour of all the debug
-statements via::
+You can view the currently configured behaviour in the *prdbg* catalog::
- nullarbor:~ # cat <debugfs>/dynamic_debug/control
+ :#> head -n7 /proc/dynamic_debug/control
# filename:lineno [module]function flags format
- /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:323 [svcxprt_rdma]svc_rdma_cleanup =_ "SVCRDMA Module Removed, deregister RPC RDMA transport\012"
- /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:341 [svcxprt_rdma]svc_rdma_init =_ "\011max_inline : %d\012"
- /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:340 [svcxprt_rdma]svc_rdma_init =_ "\011sq_depth : %d\012"
- /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:338 [svcxprt_rdma]svc_rdma_init =_ "\011max_requests : %d\012"
- ...
+  init/main.c:1179 [main]initcall_blacklist =_ "blacklisting initcall %s\012"
+ init/main.c:1218 [main]initcall_blacklisted =_ "initcall %s blacklisted\012"
+ init/main.c:1424 [main]run_init_process =_ " with arguments:\012"
+ init/main.c:1426 [main]run_init_process =_ " %s\012"
+ init/main.c:1427 [main]run_init_process =_ " with environment:\012"
+ init/main.c:1429 [main]run_init_process =_ " %s\012"
+The 3rd space-delimited column shows the current flags, preceded by
+a ``=`` for easy use with grep/cut. ``=p`` shows enabled callsites.
-You can also apply standard Unix text manipulation filters to this
-data, e.g.::
+Controlling dynamic debug Behaviour
+===================================
- nullarbor:~ # grep -i rdma <debugfs>/dynamic_debug/control | wc -l
- 62
+The behaviour of *prdbg* sites is controlled by writing
+query/commands to the control file. Example::
- nullarbor:~ # grep -i tcp <debugfs>/dynamic_debug/control | wc -l
- 42
+ # grease the interface
+ :#> alias ddcmd='echo $* > /proc/dynamic_debug/control'
-The third column shows the currently enabled flags for each debug
-statement callsite (see below for definitions of the flags). The
-default value, with no flags enabled, is ``=_``. So you can view all
-the debug statement callsites with any non-default flags::
+ :#> ddcmd '-p; module main func run* +p'
+ :#> grep =p /proc/dynamic_debug/control
+ init/main.c:1424 [main]run_init_process =p " with arguments:\012"
+ init/main.c:1426 [main]run_init_process =p " %s\012"
+ init/main.c:1427 [main]run_init_process =p " with environment:\012"
+ init/main.c:1429 [main]run_init_process =p " %s\012"
- nullarbor:~ # awk '$3 != "=_"' <debugfs>/dynamic_debug/control
- # filename:lineno [module]function flags format
- /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svcsock.c:1603 [sunrpc]svc_send p "svc_process: st_sendto returned %d\012"
+Error messages go to console/syslog::
+
+ :#> ddcmd mode foo +p
+ dyndbg: unknown keyword "mode"
+ dyndbg: query parse failed
+ bash: echo: write error: Invalid argument
+
+If debugfs is also enabled and mounted, ``dynamic_debug/control`` is
+also under the mount-dir, typically ``/sys/kernel/debug/``.
Command Language Reference
==========================
-At the lexical level, a command comprises a sequence of words separated
+At the basic lexical level, a command is a sequence of words separated
by spaces or tabs. So these are all equivalent::
- nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
- nullarbor:~ # echo -n ' file svcsock.c line 1603 +p ' >
- <debugfs>/dynamic_debug/control
- nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd file svcsock.c line 1603 +p
+ :#> ddcmd "file svcsock.c line 1603 +p"
+ :#> ddcmd ' file svcsock.c line 1603 +p '
Command submissions are bounded by a write() system call.
Multiple commands can be written together, separated by ``;`` or ``\n``::
- ~# echo "func pnpacpi_get_resources +p; func pnp_assign_mem +p" \
- > <debugfs>/dynamic_debug/control
-
-If your query set is big, you can batch them too::
+ :#> ddcmd "func pnpacpi_get_resources +p; func pnp_assign_mem +p"
+ :#> ddcmd <<"EOC"
+ func pnpacpi_get_resources +p
+ func pnp_assign_mem +p
+ EOC
+ :#> cat query-batch-file > /proc/dynamic_debug/control
- ~# cat query-batch-file > <debugfs>/dynamic_debug/control
+You can also use wildcards in each query term. The match rule supports
+``*`` (matches zero or more characters) and ``?`` (matches exactly one
+character). For example, you can match all usb drivers::
-Another way is to use wildcards. The match rule supports ``*`` (matches
-zero or more characters) and ``?`` (matches exactly one character). For
-example, you can match all usb drivers::
+ :#> ddcmd file "drivers/usb/*" +p # "" to suppress shell expansion
- ~# echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control
-
-At the syntactical level, a command comprises a sequence of match
-specifications, followed by a flags change specification::
+Syntactically, a command is a sequence of keyword/value pairs, followed
+by a flags change or setting::
command ::= match-spec* flags-spec
-The match-spec's are used to choose a subset of the known pr_debug()
-callsites to which to apply the flags-spec. Think of them as a query
-with implicit ANDs between each pair. Note that an empty list of
-match-specs will select all debug statement callsites.
+The match-specs select *prdbgs* from the catalog, upon which to apply
+the flags-spec; all constraints are ANDed together. An absent keyword
+is the same as keyword "*".
+
-A match specification comprises a keyword, which controls the
-attribute of the callsite to be compared, and a value to compare
-against. Possible keywords are:::
+A match specification is a keyword, which selects the attribute of
+the callsite to be compared, and a value to compare against. Possible
+keywords are:::
match-spec ::= 'func' string |
'file' string |
'module' string |
'format' string |
+ 'class' string |
'line' line-range
line-range ::= lineno |
@@ -164,15 +136,18 @@ func
of each callsite. Example::
func svc_tcp_accept
+ func *recv* # in rfcomm, bluetooth, ping, tcp
file
- The given string is compared against either the full pathname, the
- src-root relative pathname, or the basename of the source file of
- each callsite. Examples::
+ The given string is compared against either the src-root relative
+ pathname, or the basename of the source file of each callsite.
+ Examples::
file svcsock.c
- file kernel/freezer.c
- file /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svcsock.c
+ file kernel/freezer.c # ie column 1 of control file
+ file drivers/usb/* # all callsites under it
+ file inode.c:start_* # parse :tail as a func (above)
+ file inode.c:1-100 # parse :tail as a line-range (above)
module
The given string is compared against the module name
@@ -182,6 +157,7 @@ module
module sunrpc
module nfsd
+ module drm* # both drm, drm_kms_helper
format
The given string is searched for in the dynamic debug format
@@ -199,6 +175,16 @@ format
format "nfsd: SETATTR" // a neater way to match a format with whitespace
format 'nfsd: SETATTR' // yet another way to match a format with whitespace
+class
+ The given class_name is validated against each module, which may
+ have declared a list of known class_names. If the class_name is
+ found for a module, callsite & class matching and adjustment
+ proceed. Examples::
+
+ class DRM_UT_KMS # a DRM.debug category
+ class JUNK # silent non-match
+ // class TLD_* # NOTICE: no wildcard in class names
+
line
The given line number or range of line numbers is compared
against the line number of each ``pr_debug()`` callsite. A single
@@ -224,20 +210,20 @@ of the characters::
The flags are::
p enables the pr_debug() callsite.
- f Include the function name in the printed message
- l Include line number in the printed message
- m Include module name in the printed message
- t Include thread ID in messages not generated from interrupt context
- _ No flags are set. (Or'd with others on input)
+ _ enables no flags.
-For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only ``p`` flag
-have meaning, other flags ignored.
+ Decorator flags add to the message-prefix, in order:
+ t Include thread ID, or <intr>
+ m Include module name
+ f Include the function name
+ s Include the source file name
+ l Include line number
-For display, the flags are preceded by ``=``
-(mnemonic: what the flags are currently equal to).
+For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only
+the ``p`` flag has meaning, other flags are ignored.
-Note the regexp ``^[-+=][flmpt_]+$`` matches a flags specification.
-To clear all flags at once, use ``=_`` or ``-flmpt``.
+Note the regexp ``^[-+=][fslmpt_]+$`` matches a flags specification.
+To clear all flags at once, use ``=_`` or ``-fslmpt``.
Debug messages during Boot Process
@@ -245,14 +231,13 @@ Debug messages during Boot Process
To activate debug messages for core code and built-in modules during
the boot process, even before userspace and debugfs exists, use
-``dyndbg="QUERY"``, ``module.dyndbg="QUERY"``, or ``ddebug_query="QUERY"``
-(``ddebug_query`` is obsoleted by ``dyndbg``, and deprecated). QUERY follows
+``dyndbg="QUERY"`` or ``module.dyndbg="QUERY"``. QUERY follows
the syntax described above, but must not exceed 1023 characters. Your
bootloader may impose lower limits.
These ``dyndbg`` params are processed just after the ddebug tables are
-processed, as part of the arch_initcall. Thus you can enable debug
-messages in all code run after this arch_initcall via this boot
+processed, as part of the early_initcall. Thus you can enable debug
+messages in all code run after this early_initcall via this boot
parameter.
On an x86 system for example ACPI enablement is a subsys_initcall and::
@@ -266,8 +251,7 @@ this boot parameter for debugging purposes.
If ``foo`` module is not built-in, ``foo.dyndbg`` will still be processed at
boot time, without effect, but will be reprocessed when module is
-loaded later. ``ddebug_query=`` and bare ``dyndbg=`` are only processed at
-boot.
+loaded later. Bare ``dyndbg=`` is only processed at boot.
Debug Messages at Module Initialization Time
@@ -275,7 +259,7 @@ Debug Messages at Module Initialization Time
When ``modprobe foo`` is called, modprobe scans ``/proc/cmdline`` for
``foo.params``, strips ``foo.``, and passes them to the kernel along with
-params given in modprobe args or ``/etc/modprob.d/*.conf`` files,
+params given in modprobe args or ``/etc/modprobe.d/*.conf`` files,
in the following order:
1. parameters given via ``/etc/modprobe.d/*.conf``::
@@ -311,7 +295,7 @@ For ``CONFIG_DYNAMIC_DEBUG`` kernels, any settings given at boot-time (or
enabled by ``-DDEBUG`` flag during compilation) can be disabled later via
the debugfs interface if the debug messages are no longer needed::
- echo "module module_name -p" > <debugfs>/dynamic_debug/control
+ echo "module module_name -p" > /proc/dynamic_debug/control
Examples
========
@@ -319,43 +303,75 @@ Examples
::
// enable the message at line 1603 of file svcsock.c
- nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'file svcsock.c line 1603 +p'
// enable all the messages in file svcsock.c
- nullarbor:~ # echo -n 'file svcsock.c +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'file svcsock.c +p'
// enable all the messages in the NFS server module
- nullarbor:~ # echo -n 'module nfsd +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'module nfsd +p'
// enable all 12 messages in the function svc_process()
- nullarbor:~ # echo -n 'func svc_process +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'func svc_process +p'
// disable all 12 messages in the function svc_process()
- nullarbor:~ # echo -n 'func svc_process -p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'func svc_process -p'
// enable messages for NFS calls READ, READLINK, READDIR and READDIR+.
- nullarbor:~ # echo -n 'format "nfsd: READ" +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'format "nfsd: READ" +p'
// enable messages in files of which the paths include string "usb"
- nullarbor:~ # echo -n '*usb* +p' > <debugfs>/dynamic_debug/control
+ :#> ddcmd 'file *usb* +p'
// enable all messages
- nullarbor:~ # echo -n '+p' > <debugfs>/dynamic_debug/control
+ :#> ddcmd '+p'
// add module, function to all enabled messages
- nullarbor:~ # echo -n '+mf' > <debugfs>/dynamic_debug/control
+ :#> ddcmd '+mf'
// boot-args example, with newlines and comments for readability
Kernel command line: ...
- // see whats going on in dyndbg=value processing
- dynamic_debug.verbose=1
- // enable pr_debugs in 2 builtins, #cmt is stripped
- dyndbg="module params +p #cmt ; module sys +p"
+ // see what's going on in dyndbg=value processing
+ dynamic_debug.verbose=3
+ // enable pr_debugs in the btrfs module (can be builtin or loadable)
+ btrfs.dyndbg="+p"
+ // enable pr_debugs in all files under init/
+ // and the function parse_one, #cmt is stripped
+ dyndbg="file init/* +p #cmt ; func parse_one +p"
// enable pr_debugs in 2 functions in a module loaded later
pc87360.dyndbg="func pc87360_init_device +p; func pc87360_find +p"
+
+Kernel Configuration
+====================
+
+Dynamic Debug is enabled via kernel config items::
+
+ CONFIG_DYNAMIC_DEBUG=y # build catalog, enables CORE
+ CONFIG_DYNAMIC_DEBUG_CORE=y # enable mechanics only, skip catalog
+
+If you do not want to enable dynamic debug globally (e.g. on an embedded
+system), you may set ``CONFIG_DYNAMIC_DEBUG_CORE`` for basic support of
+dynamic debug and add ``ccflags := -DDYNAMIC_DEBUG_MODULE`` to the Makefile
+of any modules which you'd like to debug dynamically later.
+
+
+Kernel *prdbg* API
+==================
+
+The following functions are cataloged and controllable when dynamic
+debug is enabled::
+
+ pr_debug()
+ dev_dbg()
+ print_hex_dump_debug()
+ print_hex_dump_bytes()
+
+Otherwise, they are off by default; ``ccflags += -DDEBUG`` or
+``#define DEBUG`` in a source file will enable them appropriately.
+
+If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is
+just a shortcut for ``print_hex_dump(KERN_DEBUG)``.
+
+For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, the format
+string is its ``prefix_str`` argument, if it is a constant string, or
+``hexdump`` in case ``prefix_str`` is built dynamically.
diff --git a/Documentation/admin-guide/edid.rst b/Documentation/admin-guide/edid.rst
index 80deeb21a265..1a9b965aa486 100644
--- a/Documentation/admin-guide/edid.rst
+++ b/Documentation/admin-guide/edid.rst
@@ -24,37 +24,4 @@ restrictions later on.
As a remedy for such situations, the kernel configuration item
CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an
individually prepared or corrected EDID data set in the /lib/firmware
-directory from where it is loaded via the firmware interface. The code
-(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for
-commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200,
-1680x1050, 1920x1080) as binary blobs, but the kernel source tree does
-not contain code to create these data. In order to elucidate the origin
-of the built-in binary EDID blobs and to facilitate the creation of
-individual data for a specific misbehaving monitor, commented sources
-and a Makefile environment are given here.
-
-To create binary EDID and C source code files from the existing data
-material, simply type "make" in tools/edid/.
-
-If you want to create your own EDID file, copy the file 1024x768.S,
-replace the settings with your own data and add a new target to the
-Makefile. Please note that the EDID data structure expects the timing
-values in a different way as compared to the standard X11 format.
-
-X11:
- HTimings:
- hdisp hsyncstart hsyncend htotal
- VTimings:
- vdisp vsyncstart vsyncend vtotal
-
-EDID::
-
- #define XPIX hdisp
- #define XBLANK htotal-hdisp
- #define XOFFSET hsyncstart-hdisp
- #define XPULSE hsyncend-hsyncstart
-
- #define YPIX vdisp
- #define YBLANK vtotal-vdisp
- #define YOFFSET vsyncstart-vdisp
- #define YPULSE vsyncend-vsyncstart
+directory from where it is loaded via the firmware interface.
diff --git a/Documentation/admin-guide/efi-stub.rst b/Documentation/admin-guide/efi-stub.rst
index 833edb0d0bc4..090f3a185e18 100644
--- a/Documentation/admin-guide/efi-stub.rst
+++ b/Documentation/admin-guide/efi-stub.rst
@@ -7,15 +7,15 @@ as a PE/COFF image, thereby convincing EFI firmware loaders to load
it as an EFI executable. The code that modifies the bzImage header,
along with the EFI-specific entry point that the firmware loader
jumps to are collectively known as the "EFI boot stub", and live in
-arch/x86/boot/header.S and arch/x86/boot/compressed/eboot.c,
+arch/x86/boot/header.S and drivers/firmware/efi/libstub/x86-stub.c,
respectively. For ARM the EFI stub is implemented in
arch/arm/boot/compressed/efi-header.S and
-arch/arm/boot/compressed/efi-stub.c. EFI stub code that is shared
+drivers/firmware/efi/libstub/arm32-stub.c. EFI stub code that is shared
between architectures is in drivers/firmware/efi/libstub.
For arm64, there is no compressed kernel support, so the Image itself
masquerades as a PE/COFF image and the EFI stub is linked into the
-kernel. The arm64 EFI stub lives in arch/arm64/kernel/efi-entry.S
+kernel. The arm64 EFI stub lives in drivers/firmware/efi/libstub/arm64.c
and drivers/firmware/efi/libstub/arm64-stub.c.
By using the EFI boot stub it's possible to boot a Linux kernel
diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
index 9443fcef1876..5740d85439ff 100644
--- a/Documentation/admin-guide/ext4.rst
+++ b/Documentation/admin-guide/ext4.rst
@@ -392,9 +392,16 @@ When mounting an ext4 filesystem, the following option are accepted:
dax
Use direct access (no page cache). See
- Documentation/filesystems/dax.txt. Note that this option is
+ Documentation/filesystems/dax.rst. Note that this option is
incompatible with data=journal.
+ inlinecrypt
+ When possible, encrypt/decrypt the contents of encrypted files using the
+ blk-crypto framework rather than filesystem-layer encryption. This
+ allows the use of inline encryption hardware. The on-disk format is
+ unaffected. For more details, see
+ Documentation/block/inline-encryption.rst.
+
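+ For illustration only, a hedged sketch of requesting this option from a
+ program via mount(2); the device and mount point below are placeholders::
+
+  /* Sketch: mount an ext4 filesystem with inline encryption enabled. */
+  #include <stdio.h>
+  #include <sys/mount.h>
+
+  int main(void)
+  {
+          if (mount("/dev/vdb", "/mnt/crypt", "ext4", 0, "inlinecrypt")) {
+                  perror("mount");
+                  return 1;
+          }
+          return 0;
+  }
+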
Data Mode
=========
There are 3 different data modes:
@@ -522,21 +529,21 @@ Files in /sys/fs/ext4/<devname>:
Ioctls
======
-There is some Ext4 specific functionality which can be accessed by applications
-through the system call interfaces. The list of all Ext4 specific ioctls are
-shown in the table below.
+Ext4 implements various ioctls which can be used by applications to access
+ext4-specific functionality. An incomplete list of these ioctls is shown in the
+table below. This list includes truly ext4-specific ioctls (``EXT4_IOC_*``) as
+well as ioctls that may have been ext4-specific originally but are now supported
+by some other filesystem(s) too (``FS_IOC_*``).
-Table of Ext4 specific ioctls
+Table of Ext4 ioctls
- EXT4_IOC_GETFLAGS
+ FS_IOC_GETFLAGS
Get additional attributes associated with inode. The ioctl argument is
- an integer bitfield, with bit values described in ext4.h. This ioctl is
- an alias for FS_IOC_GETFLAGS.
+ an integer bitfield, with bit values described in ext4.h.
- EXT4_IOC_SETFLAGS
+ FS_IOC_SETFLAGS
Set additional attributes associated with inode. The ioctl argument is
- an integer bitfield, with bit values described in ext4.h. This ioctl is
- an alias for FS_IOC_SETFLAGS.
+ an integer bitfield, with bit values described in ext4.h.
EXT4_IOC_GETVERSION, EXT4_IOC_GETVERSION_OLD
Get the inode i_generation number stored for each inode. The
@@ -611,7 +618,7 @@ kernel source: <file:fs/ext4/>
programs: http://e2fsprogs.sourceforge.net/
-useful links: http://fedoraproject.org/wiki/ext3-devel
+useful links: https://fedoraproject.org/wiki/ext3-devel
http://www.bullopensource.org/ext4/
http://ext4.wiki.kernel.org/index.php/Main_Page
- http://fedoraproject.org/wiki/Features/Ext4
+ https://fedoraproject.org/wiki/Features/Ext4
diff --git a/Documentation/admin-guide/features.rst b/Documentation/admin-guide/features.rst
new file mode 100644
index 000000000000..7651eca38227
--- /dev/null
+++ b/Documentation/admin-guide/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: features
diff --git a/Documentation/admin-guide/filesystem-monitoring.rst b/Documentation/admin-guide/filesystem-monitoring.rst
new file mode 100644
index 000000000000..ab8dba76283c
--- /dev/null
+++ b/Documentation/admin-guide/filesystem-monitoring.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+File system Monitoring with fanotify
+====================================
+
+File system Error Reporting
+===========================
+
+Fanotify supports the FAN_FS_ERROR event type for file system-wide error
+reporting. It is meant to be used by file system health monitoring
+daemons, which listen for these events and take actions (notify
+sysadmin, start recovery) when a file system problem is detected.
+
+By design, a FAN_FS_ERROR notification exposes sufficient information
+for a monitoring tool to know a problem in the file system has happened.
+It doesn't necessarily provide a user space application with semantics
+to verify an IO operation was successfully executed. That is out of
+scope for this feature. Instead, it is only meant as a framework for
+early file system problem detection and reporting to recovery tools.
+
+When a file system operation fails, it is common for dozens of kernel
+errors to cascade after the initial failure, hiding the original failure
+log, which is usually the most useful debug data to troubleshoot the
+problem. For this reason, FAN_FS_ERROR tries to report only the first
+error that occurred for a file system since the last notification, and
+it simply counts additional errors. This ensures that the most
+important pieces of information are never lost.
+
+FAN_FS_ERROR requires the fanotify group to be setup with the
+FAN_REPORT_FID flag.
+
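+A hedged sketch of a minimal group set up to receive these events ("/mnt" is
+a placeholder mount point; the libc headers must know FAN_FS_ERROR)::
+
+  /* Sketch only: create a fanotify group suitable for FAN_FS_ERROR. */
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <sys/fanotify.h>
+
+  int main(void)
+  {
+          /* FAN_REPORT_FID is mandatory for FAN_FS_ERROR groups. */
+          int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);
+
+          if (fd < 0) {
+                  perror("fanotify_init");
+                  return 1;
+          }
+
+          /* Watch the whole filesystem backing /mnt for errors. */
+          if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
+                            FAN_FS_ERROR, AT_FDCWD, "/mnt")) {
+                  perror("fanotify_mark");
+                  return 1;
+          }
+
+          /* read(fd, ...) now returns event metadata plus info records. */
+          return 0;
+  }
+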
+At the time of this writing, the only file system that emits FAN_FS_ERROR
+notifications is Ext4.
+
+A FAN_FS_ERROR notification has the following format::
+
+ [ Notification Metadata (Mandatory) ]
+ [ Generic Error Record (Mandatory) ]
+ [ FID record (Mandatory) ]
+
+The order of records is not guaranteed, and new records might be added
+in the future. Therefore, applications must not rely on the order and
+must be prepared to skip over unknown records. Please refer to
+``samples/fanotify/fs-monitor.c`` for an example parser.
+
+Generic error record
+--------------------
+
+The generic error record provides enough information for a file system
+agnostic tool to learn about a problem in the file system, without
+providing any additional details about the problem. This record is
+identified by ``struct fanotify_event_info_header.info_type`` being set
+to FAN_EVENT_INFO_TYPE_ERROR.
+
+ ::
+
+ struct fanotify_event_info_error {
+ struct fanotify_event_info_header hdr;
+ __s32 error;
+ __u32 error_count;
+ };
+
+The `error` field identifies the type of error using errno values.
+`error_count` tracks the number of errors that occurred and were
+suppressed since the last notification, in order to preserve the
+original error information.
+
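+A hedged sketch of walking the info records of one event and picking out the
+error record (types as in ``<linux/fanotify.h>``; buffer handling omitted)::
+
+  /* Sketch only: iterate the info records that follow the event metadata. */
+  #include <stdio.h>
+  #include <sys/fanotify.h>
+
+  void handle_event(const struct fanotify_event_metadata *md)
+  {
+          const char *p = (const char *)(md + 1);
+          const char *end = (const char *)md + md->event_len;
+
+          while (p < end) {
+                  const struct fanotify_event_info_header *hdr =
+                          (const struct fanotify_event_info_header *)p;
+
+                  if (hdr->info_type == FAN_EVENT_INFO_TYPE_ERROR) {
+                          const struct fanotify_event_info_error *err =
+                                  (const struct fanotify_event_info_error *)hdr;
+
+                          printf("fs error %d, %u suppressed\n",
+                                 err->error, err->error_count);
+                  }
+                  /* Unknown record types are simply skipped. */
+                  p += hdr->len;
+          }
+  }
+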
+FID record
+----------
+
+The FID record can be used to uniquely identify the inode that triggered
+the error through the combination of fsid and file handle. A file system
+specific application can use that information to attempt a recovery
+procedure. Errors that are not related to an inode are reported with an
+empty file handle of type FILEID_INVALID.
diff --git a/Documentation/admin-guide/gpio/gpio-mockup.rst b/Documentation/admin-guide/gpio/gpio-mockup.rst
new file mode 100644
index 000000000000..d6e7438a7550
--- /dev/null
+++ b/Documentation/admin-guide/gpio/gpio-mockup.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+GPIO Testing Driver
+===================
+
+.. note::
+
+ This module has been obsoleted by the more flexible gpio-sim.rst.
+ New developments should use that API and existing developments are
+ encouraged to migrate as soon as possible.
+ This module will continue to be maintained but no new features will be
+ added.
+
+The GPIO Testing Driver (gpio-mockup) provides a way to create simulated GPIO
+chips for testing purposes. The lines exposed by these chips can be accessed
+using the standard GPIO character device interface as well as manipulated
+using the dedicated debugfs directory structure.
+
+Creating simulated chips using module params
+--------------------------------------------
+
+When loading the gpio-mockup driver a number of parameters can be passed to the
+module.
+
+ gpio_mockup_ranges
+
+ This parameter takes an argument in the form of an array of integer
+ pairs. Each pair defines the base GPIO number (a non-negative integer)
+ and the first GPIO number after the last line of this chip. If the base
+ GPIO is -1, gpiolib will assign it automatically; in that case the
+ second number of the pair is the number of lines exposed by the chip.
+
+ Example: gpio_mockup_ranges=-1,8,-1,16,405,409
+
+ The line above creates three chips. The first one will expose 8 lines,
+ the second 16 and the third 4. The base GPIO for the third chip is set
+ to 405, while for the first two chips it will be assigned automatically.
+
+ gpio_mockup_named_lines
+
+ This parameter doesn't take any arguments. It lets the driver know that
+ GPIO lines exposed by it should be named.
+
+ The name format is: gpio-mockup-X-Y where X is mockup chip's ID
+ and Y is the line offset.
+
+Manipulating simulated lines
+----------------------------
+
+Each mockup chip creates its own subdirectory in /sys/kernel/debug/gpio-mockup/.
+The directory is named after the chip's label. A symlink is also created, named
+after the chip's name, which points to the label directory.
+
+Inside each subdirectory, there's a separate attribute for each GPIO line. The
+name of the attribute represents the line's offset in the chip.
+
+Reading from a line attribute returns the current value. Writing to it (0 or 1)
+changes the configuration of the simulated pull-up/pull-down resistor
+(1 - pull-up, 0 - pull-down).
diff --git a/Documentation/admin-guide/gpio/gpio-sim.rst b/Documentation/admin-guide/gpio/gpio-sim.rst
new file mode 100644
index 000000000000..1cc5567a4bbe
--- /dev/null
+++ b/Documentation/admin-guide/gpio/gpio-sim.rst
@@ -0,0 +1,134 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+Configfs GPIO Simulator
+=======================
+
+The configfs GPIO Simulator (gpio-sim) provides a way to create simulated GPIO
+chips for testing purposes. The lines exposed by these chips can be accessed
+using the standard GPIO character device interface as well as manipulated
+using sysfs attributes.
+
+Creating simulated chips
+------------------------
+
+The gpio-sim module registers a configfs subsystem called ``'gpio-sim'``. For
+details of the configfs filesystem, please refer to the configfs documentation.
+
+The user can create a hierarchy of configfs groups and items as well as modify
+values of exposed attributes. Once the chip is instantiated, this hierarchy
+will be translated to appropriate device properties. The general structure is:
+
+**Group:** ``/config/gpio-sim``
+
+This is the top directory of the gpio-sim configfs tree.
+
+**Group:** ``/config/gpio-sim/gpio-device``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/dev_name``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/live``
+
+This is a directory representing a GPIO platform device. The ``'dev_name'``
+attribute is read-only and allows user-space to read the platform device
+name (e.g. ``'gpio-sim.0'``). The ``'live'`` attribute is used to trigger the
+actual creation of the device once it's fully configured. The accepted values
+are: ``'1'`` to enable the simulated device and ``'0'`` to disable and tear
+it down.
+
+**Group:** ``/config/gpio-sim/gpio-device/gpio-bankX``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/chip_name``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/num_lines``
+
+This group represents a bank of GPIOs under the top platform device. The
+``'chip_name'`` attribute is read-only and allows user-space to read the
+device name of the bank device. The ``'num_lines'`` attribute specifies
+the number of lines exposed by this bank.
+
+**Group:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/name``
+
+This group represents a single line at the offset Y. The 'name' attribute
+is used to set the line name as represented by the 'gpio-line-names' property.
+
+**Item:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog/name``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog/direction``
+
+This item makes the gpio-sim module hog the associated line. The ``'name'``
+attribute specifies the in-kernel consumer name to use. The ``'direction'``
+attribute specifies the hog direction and must be one of: ``'input'``,
+``'output-high'`` and ``'output-low'``.
+
+Inside each bank directory, there's a set of attributes that can be used to
+configure the new chip. Additionally, the user can ``mkdir()`` subdirectories
+inside the chip's directory that allow passing additional configuration for
+specific lines. The name of those subdirectories must take the form of:
+``'line<offset>'`` (e.g. ``'line0'``, ``'line20'``, etc.) as the name will be
+used by the module to assign the config to the specific line at the given
+offset.
+
+Once the configuration is complete, the ``'live'`` attribute must be set to 1 in
+order to instantiate the chip. It can be set back to 0 to destroy the simulated
+chip. The module will synchronously wait for the new simulated device to be
+successfully probed and if this doesn't happen, writing to ``'live'`` will
+result in an error.
+
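+As a hedged illustration, the same steps can be driven from a program with
+plain ``mkdir()`` and ``write()`` calls. The sketch below builds one bank with
+8 lines and assumes configfs is mounted at ``/config`` as in the listing
+above; the ``dev0`` and ``gpio-bank0`` names are arbitrary::
+
+  /* Sketch only: create and activate a simulated chip via configfs. */
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <string.h>
+  #include <sys/stat.h>
+  #include <unistd.h>
+
+  static int write_attr(const char *path, const char *val)
+  {
+          int fd = open(path, O_WRONLY);
+          ssize_t ret;
+
+          if (fd < 0) {
+                  perror(path);
+                  return -1;
+          }
+          ret = write(fd, val, strlen(val));
+          close(fd);
+          return ret < 0 ? -1 : 0;
+  }
+
+  int main(void)
+  {
+          mkdir("/config/gpio-sim/dev0", 0755);
+          mkdir("/config/gpio-sim/dev0/gpio-bank0", 0755);
+          write_attr("/config/gpio-sim/dev0/gpio-bank0/num_lines", "8");
+          /* Committing 'live' instantiates the platform device. */
+          return write_attr("/config/gpio-sim/dev0/live", "1") ? 1 : 0;
+  }
+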
+Simulated GPIO chips can also be defined in device-tree. The compatible string
+must be: ``"gpio-simulator"``. Supported properties are:
+
+ ``"gpio-sim,label"`` - chip label
+
+Other standard GPIO properties (like ``"gpio-line-names"``, ``"ngpios"`` or
+``"gpio-hog"``) are also supported. Please refer to the GPIO documentation for
+details.
+
+An example device-tree code defining a GPIO simulator:
+
+.. code-block:: none
+
+ gpio-sim {
+ compatible = "gpio-simulator";
+
+ bank0 {
+ gpio-controller;
+ #gpio-cells = <2>;
+ ngpios = <16>;
+ gpio-sim,label = "dt-bank0";
+ gpio-line-names = "", "sim-foo", "", "sim-bar";
+ };
+
+ bank1 {
+ gpio-controller;
+ #gpio-cells = <2>;
+ ngpios = <8>;
+ gpio-sim,label = "dt-bank1";
+
+ line3 {
+ gpio-hog;
+ gpios = <3 0>;
+ output-high;
+ line-name = "sim-hog-from-dt";
+ };
+ };
+ };
+
+Manipulating simulated lines
+----------------------------
+
+Each simulated GPIO chip creates a separate sysfs group under its device
+directory for each exposed line
+(e.g. ``/sys/devices/platform/gpio-sim.X/gpiochipY/``). The name of each group
+is of the form: ``'sim_gpioX'`` where X is the offset of the line. Inside each
+group there are two attributes:
+
+ ``pull`` - allows reading and setting the current simulated pull setting
+ for every line; when writing, the value must be one of ``'pull-up'``
+ or ``'pull-down'``
+
+ ``value`` - allows reading the current value of the line, which may be
+ different from the pull if the line is being driven from
+ user-space
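+
+For instance, a hedged sketch (the ``gpio-sim.0``/``gpiochip0`` path below is
+a placeholder; the real names come from ``dev_name`` and ``chip_name``) that
+forces a pull-up on line 0 and reads the resulting value back::
+
+  /* Sketch only: drive the simulated pull and read the line value. */
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <unistd.h>
+
+  #define LINE "/sys/devices/platform/gpio-sim.0/gpiochip0/sim_gpio0"
+
+  int main(void)
+  {
+          char val[4] = "";
+          int fd = open(LINE "/pull", O_WRONLY);
+
+          if (fd < 0 || write(fd, "pull-up", 7) < 0) {
+                  perror("pull");
+                  return 1;
+          }
+          close(fd);
+
+          fd = open(LINE "/value", O_RDONLY);
+          if (fd < 0 || read(fd, val, sizeof(val) - 1) < 0) {
+                  perror("value");
+                  return 1;
+          }
+          printf("value: %s\n", val); /* "1" unless a consumer drives it */
+          return 0;
+  }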
diff --git a/Documentation/admin-guide/gpio/index.rst b/Documentation/admin-guide/gpio/index.rst
index ef2838638e96..460afd29617e 100644
--- a/Documentation/admin-guide/gpio/index.rst
+++ b/Documentation/admin-guide/gpio/index.rst
@@ -1,14 +1,16 @@
.. SPDX-License-Identifier: GPL-2.0
====
-gpio
+GPIO
====
.. toctree::
:maxdepth: 1
+ Character Device Userspace API <../../userspace-api/gpio/chardev>
gpio-aggregator
- sysfs
+ gpio-sim
+ Obsolete APIs <obsolete>
.. only:: subproject and html
diff --git a/Documentation/admin-guide/gpio/obsolete.rst b/Documentation/admin-guide/gpio/obsolete.rst
new file mode 100644
index 000000000000..5adbff02d61f
--- /dev/null
+++ b/Documentation/admin-guide/gpio/obsolete.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+Obsolete GPIO APIs
+==================
+
+.. toctree::
+ :maxdepth: 1
+
+ Character Device Userspace API (v1) <../../userspace-api/gpio/chardev_v1>
+ Sysfs Interface <../../userspace-api/gpio/sysfs>
+ Mockup Testing Module <gpio-mockup>
+
diff --git a/Documentation/admin-guide/gpio/sysfs.rst b/Documentation/admin-guide/gpio/sysfs.rst
deleted file mode 100644
index ec09ffd983e7..000000000000
--- a/Documentation/admin-guide/gpio/sysfs.rst
+++ /dev/null
@@ -1,167 +0,0 @@
-GPIO Sysfs Interface for Userspace
-==================================
-
-.. warning::
-
- THIS ABI IS DEPRECATED, THE ABI DOCUMENTATION HAS BEEN MOVED TO
- Documentation/ABI/obsolete/sysfs-gpio AND NEW USERSPACE CONSUMERS
- ARE SUPPOSED TO USE THE CHARACTER DEVICE ABI. THIS OLD SYSFS ABI WILL
- NOT BE DEVELOPED (NO NEW FEATURES), IT WILL JUST BE MAINTAINED.
-
-Refer to the examples in tools/gpio/* for an introduction to the new
-character device ABI. Also see the userspace header in
-include/uapi/linux/gpio.h
-
-The deprecated sysfs ABI
-------------------------
-Platforms which use the "gpiolib" implementors framework may choose to
-configure a sysfs user interface to GPIOs. This is different from the
-debugfs interface, since it provides control over GPIO direction and
-value instead of just showing a gpio state summary. Plus, it could be
-present on production systems without debugging support.
-
-Given appropriate hardware documentation for the system, userspace could
-know for example that GPIO #23 controls the write protect line used to
-protect boot loader segments in flash memory. System upgrade procedures
-may need to temporarily remove that protection, first importing a GPIO,
-then changing its output state, then updating the code before re-enabling
-the write protection. In normal use, GPIO #23 would never be touched,
-and the kernel would have no need to know about it.
-
-Again depending on appropriate hardware documentation, on some systems
-userspace GPIO can be used to determine system configuration data that
-standard kernels won't know about. And for some tasks, simple userspace
-GPIO drivers could be all that the system really needs.
-
-DO NOT ABUSE SYSFS TO CONTROL HARDWARE THAT HAS PROPER KERNEL DRIVERS.
-PLEASE READ THE DOCUMENT AT Documentation/driver-api/gpio/drivers-on-gpio.rst
-TO AVOID REINVENTING KERNEL WHEELS IN USERSPACE. I MEAN IT. REALLY.
-
-Paths in Sysfs
---------------
-There are three kinds of entries in /sys/class/gpio:
-
- - Control interfaces used to get userspace control over GPIOs;
-
- - GPIOs themselves; and
-
- - GPIO controllers ("gpio_chip" instances).
-
-That's in addition to standard files including the "device" symlink.
-
-The control interfaces are write-only:
-
- /sys/class/gpio/
-
- "export" ...
- Userspace may ask the kernel to export control of
- a GPIO to userspace by writing its number to this file.
-
- Example: "echo 19 > export" will create a "gpio19" node
- for GPIO #19, if that's not requested by kernel code.
-
- "unexport" ...
- Reverses the effect of exporting to userspace.
-
- Example: "echo 19 > unexport" will remove a "gpio19"
- node exported using the "export" file.
-
-GPIO signals have paths like /sys/class/gpio/gpio42/ (for GPIO #42)
-and have the following read/write attributes:
-
- /sys/class/gpio/gpioN/
-
- "direction" ...
- reads as either "in" or "out". This value may
- normally be written. Writing as "out" defaults to
- initializing the value as low. To ensure glitch free
- operation, values "low" and "high" may be written to
- configure the GPIO as an output with that initial value.
-
- Note that this attribute *will not exist* if the kernel
- doesn't support changing the direction of a GPIO, or
- it was exported by kernel code that didn't explicitly
- allow userspace to reconfigure this GPIO's direction.
-
- "value" ...
- reads as either 0 (low) or 1 (high). If the GPIO
- is configured as an output, this value may be written;
- any nonzero value is treated as high.
-
- If the pin can be configured as interrupt-generating interrupt
- and if it has been configured to generate interrupts (see the
- description of "edge"), you can poll(2) on that file and
- poll(2) will return whenever the interrupt was triggered. If
- you use poll(2), set the events POLLPRI and POLLERR. If you
- use select(2), set the file descriptor in exceptfds. After
- poll(2) returns, either lseek(2) to the beginning of the sysfs
- file and read the new value or close the file and re-open it
- to read the value.
-
- "edge" ...
- reads as either "none", "rising", "falling", or
- "both". Write these strings to select the signal edge(s)
- that will make poll(2) on the "value" file return.
-
- This file exists only if the pin can be configured as an
- interrupt generating input pin.
-
- "active_low" ...
- reads as either 0 (false) or 1 (true). Write
- any nonzero value to invert the value attribute both
- for reading and writing. Existing and subsequent
- poll(2) support configuration via the edge attribute
- for "rising" and "falling" edges will follow this
- setting.
-
-GPIO controllers have paths like /sys/class/gpio/gpiochip42/ (for the
-controller implementing GPIOs starting at #42) and have the following
-read-only attributes:
-
- /sys/class/gpio/gpiochipN/
-
- "base" ...
- same as N, the first GPIO managed by this chip
-
- "label" ...
- provided for diagnostics (not always unique)
-
- "ngpio" ...
- how many GPIOs this manages (N to N + ngpio - 1)
-
-Board documentation should in most cases cover what GPIOs are used for
-what purposes. However, those numbers are not always stable; GPIOs on
-a daughtercard might be different depending on the base board being used,
-or other cards in the stack. In such cases, you may need to use the
-gpiochip nodes (possibly in conjunction with schematics) to determine
-the correct GPIO number to use for a given signal.
-
-
-Exporting from Kernel code
---------------------------
-Kernel code can explicitly manage exports of GPIOs which have already been
-requested using gpio_request()::
-
- /* export the GPIO to userspace */
- int gpiod_export(struct gpio_desc *desc, bool direction_may_change);
-
- /* reverse gpio_export() */
- void gpiod_unexport(struct gpio_desc *desc);
-
- /* create a sysfs link to an exported GPIO node */
- int gpiod_export_link(struct device *dev, const char *name,
- struct gpio_desc *desc);
-
-After a kernel driver requests a GPIO, it may only be made available in
-the sysfs interface by gpiod_export(). The driver can control whether the
-signal direction may change. This helps drivers prevent userspace code
-from accidentally clobbering important system state.
-
-This explicit exporting can help with debugging (by making some kinds
-of experiments easier), or can provide an always-there interface that's
-suitable for documenting as part of a board support package.
-
-After the GPIO has been exported, gpiod_export_link() allows creating
-symlinks from elsewhere in sysfs to the GPIO sysfs node. Drivers can
-use this to provide the interface under their own device in sysfs with
-a descriptive name.
diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index 000000000000..cf1eeefdfc32
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,226 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Core Scheduling
+===============
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core, or may prefer different cores if they
+do share hardware resource needs). This document only describes the security
+usecase.
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different Hyper
+Threads of the same core. MDS and L1TF are examples of such attacks. The only
+full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
+scheduling is a scheduler feature that can mitigate some (not all) cross-HT
+attacks. It allows HT to be turned on safely by ensuring that only tasks in a
+user-designated trusted group can share a core. This increase in core sharing
+can also improve performance, however it is not guaranteed that performance
+will always improve, though that is seen to be the case with a number of real
+world workloads. In theory, core scheduling aims to perform at least as well
+as when Hyper Threading is disabled. In practice, this is mostly the case,
+though not always: synchronizing scheduling decisions across 2 or more CPUs in
+a core involves additional overhead, especially when the system is lightly
+loaded. When ``total_threads <= N_CPUS/2``, the extra overhead may cause core
+scheduling to perform more poorly compared to SMT-disabled, where N_CPUS is the
+total number of CPUs. Always measure the performance of your workloads.
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that can be co-scheduled
+on the same core. The core scheduler uses this information to make sure that
+tasks that are not in the same group never run simultaneously on a core, while
+doing its best to satisfy the system's scheduling requirements.
+
+Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
+This interface provides support for the creation of core scheduling groups, as
+well as admission and removal of tasks from created groups::
+
+ #include <sys/prctl.h>
+
+ int prctl(int option, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+
+option:
+ ``PR_SCHED_CORE``
+
+arg2:
+ Command for operation, must be one of:
+
+ - ``PR_SCHED_CORE_GET`` -- get core_sched cookie of ``pid``.
+ - ``PR_SCHED_CORE_CREATE`` -- create a new unique cookie for ``pid``.
+ - ``PR_SCHED_CORE_SHARE_TO`` -- push core_sched cookie to ``pid``.
+ - ``PR_SCHED_CORE_SHARE_FROM`` -- pull core_sched cookie from ``pid``.
+
+arg3:
+ ``pid`` of the task for which the operation applies.
+
+arg4:
+ ``pid_type`` for which the operation applies. It is one of
+ ``PR_SCHED_CORE_SCOPE_``-prefixed macro constants. For example, if arg4
+ is ``PR_SCHED_CORE_SCOPE_THREAD_GROUP``, then the operation of this command
+ will be performed for all tasks in the task group of ``pid``.
+
+arg5:
+ userspace pointer to an unsigned long for storing the cookie returned by
+ ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.
+
+In order for a process to push a cookie to, or pull a cookie from, another
+process, it must have the `PTRACE_MODE_READ_REALCREDS` ptrace access mode to
+that process.
+
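+A hedged sketch of the create-and-query flow (constants from
+``linux/prctl.h``, v5.14 or later; not a complete production program)::
+
+  /* Sketch only: tag the calling thread group with a fresh cookie,
+   * then read the cookie back. */
+  #include <linux/prctl.h>
+  #include <stdio.h>
+  #include <sys/prctl.h>
+
+  int main(void)
+  {
+          unsigned long cookie = 0;
+
+          /* pid 0 means the calling task; the THREAD_GROUP scope tags
+           * every thread of this process. */
+          if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
+                    PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0)) {
+                  perror("PR_SCHED_CORE_CREATE");
+                  return 1;
+          }
+
+          /* GET takes the per-thread scope and an aligned pointer. */
+          if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0,
+                    PR_SCHED_CORE_SCOPE_THREAD, (unsigned long)&cookie)) {
+                  perror("PR_SCHED_CORE_GET");
+                  return 1;
+          }
+          printf("core-sched cookie: %#lx\n", cookie);
+          return 0;
+  }
+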
+Building hierarchies of tasks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The simplest way to build hierarchies of threads/processes which share a
+cookie and thus a core is to rely on the fact that the core-sched cookie is
+inherited across forks/clones and execs, thus setting a cookie for the
+'initial' script/executable/daemon will place every spawned child in the
+same core-sched group.
+
+Cookie Transferral
+~~~~~~~~~~~~~~~~~~
+Transferring a cookie between the current and other tasks is possible using
+PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a
+specified task or to share a cookie with a task. In combination this allows a
+simple helper program to pull a cookie from a task in an existing core
+scheduling group and share it with already running tasks.
+
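+A hedged sketch of such a helper; ``donor_pid`` and ``target_pid`` are
+placeholders supplied by the caller, and both calls need the ptrace access
+described above::
+
+  /* Sketch only: copy the cookie of an existing group member onto
+   * another process. */
+  #include <linux/prctl.h>
+  #include <stdio.h>
+  #include <sys/prctl.h>
+  #include <sys/types.h>
+
+  int share_cookie(pid_t donor_pid, pid_t target_pid)
+  {
+          /* Pull the donor's cookie onto the current (helper) task. */
+          if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, donor_pid,
+                    PR_SCHED_CORE_SCOPE_THREAD, 0)) {
+                  perror("PR_SCHED_CORE_SHARE_FROM");
+                  return -1;
+          }
+          /* Push it onto every thread of the target process. */
+          if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, target_pid,
+                    PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0)) {
+                  perror("PR_SCHED_CORE_SHARE_TO");
+                  return -1;
+          }
+          return 0;
+  }
+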
+Design/Implementation
+---------------------
+Each task that is tagged is assigned a cookie internally in the kernel. As
+mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
+each other and share a core.
+
+The basic idea is that, every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks running on a core are
+trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
+The idle task is considered special, as it trusts everything and everything
+trusts it.
+
+During a schedule() event on any sibling of a core, the highest priority task on
+the sibling's core is picked and assigned to the sibling calling schedule(), if
+the sibling has the task enqueued. For the rest of the siblings in the core,
+the highest priority task with the same cookie is selected if there is one
+runnable in their individual run queues. If a task with the same cookie is not
+available, the idle task is selected. The idle task is globally trusted.
+
+Once a task has been selected for all the siblings in the core, an IPI is sent to
+siblings for whom a new task was selected. On receiving the IPI, the siblings
+switch to the new task immediately. If an idle task is selected for a sibling,
+then the sibling is considered to be in a `forced idle` state. I.e., it may
+have tasks on its own runqueue to run, however it will still have to run idle.
+More on this in the next section.
+
+Forced-idling of hyperthreads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in a core. However,
+it is possible that some runqueues had tasks that were incompatible with the
+highest priority ones in the core. Favoring security over fairness, one or more
+siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core wide highest priority
+task. If a sibling does not have a trusted task to run, it will be forced idle
+by the scheduler (idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT::
+
+ HT1 (attack) HT2 (victim)
+ A idle -> user space user space -> idle
+ B idle -> user space guest -> idle
+ C idle -> guest user space -> idle
+ D idle -> guest guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode. This is because the sending of the IPI would bring
+the destination CPU immediately into kernel mode from user space, or VMEXIT
+in the case of guests. At best, this would only leak some scheduler metadata
+which may not be worth protecting. It is also possible that the IPI is received
+too late on some architectures, but this has not been observed in the case of
+x86.
+
+Trust model
+~~~~~~~~~~~
+Core scheduling maintains trust relationships amongst groups of tasks by
+assigning them a tag that is the same cookie value.
+When a system with core scheduling boots, all tasks are considered to trust
+each other. This is because the core scheduler does not have information about
+trust relationships until userspace uses the above mentioned interfaces, to
+communicate them. In other words, all tasks have a default cookie value of 0
+and are considered system-wide trusted. The forced-idling of siblings running
+cookie-0 tasks is also avoided.
+
+Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
+within such groups are considered to trust each other, but do not trust those
+outside. Tasks outside the group also don't trust tasks within.
+
+Limitations of core-scheduling
+------------------------------
+Core scheduling tries to guarantee that only trusted tasks run concurrently on a
+core. But there could be a small window of time during which untrusted tasks
+run concurrently, or the kernel could be running concurrently with a task not
+trusted by the kernel.
+
+IPI processing delays
+~~~~~~~~~~~~~~~~~~~~~
+Core scheduling selects only trusted tasks to run together. An IPI is used to
+notify the siblings to switch to the new task. But there could be hardware
+delays in receiving the IPI on some architectures (on x86, this has not been
+observed). This may cause an attacker task to start running on a CPU before its
+siblings receive the IPI. Even though the cache is flushed on entry to user
+mode, victim tasks on siblings may populate data in the cache and
+microarchitectural buffers after the attacker starts to run, and this is a
+possibility for data leaks.
+
+Open cross-HT issues that core scheduling does not solve
+--------------------------------------------------------
+1. For MDS
+~~~~~~~~~~
+Core scheduling cannot protect against MDS attacks between the siblings
+running in user mode and the others running in kernel mode. Even though all
+siblings run tasks which trust each other, when the kernel is executing
+code on behalf of a task, it cannot trust the code running in the
+sibling. Such attacks are possible for any combination of sibling CPU modes
+(host or guest mode).
+
+2. For L1TF
+~~~~~~~~~~~
+Core scheduling cannot protect against an L1TF guest attacker exploiting a
+guest or host victim. This is because the guest attacker can craft invalid
+PTEs which are not inverted due to a vulnerable guest kernel. The only
+solution is to disable EPT (Extended Page Tables).
+
+For both MDS and L1TF, if the guest vCPUs are configured to not trust each
+other (by tagging them separately), then guest-to-guest attacks would go away.
+Alternatively, a system admin policy could consider guest-to-guest attacks a
+guest problem.
+
+Another approach to resolve these would be to make every untrusted task on the
+system not trust every other untrusted task. While this could reduce
+parallelism of the untrusted tasks, it would still solve the above issues while
+allowing system processes (trusted tasks) to share a core.
+
+3. Protecting the kernel (IRQ, syscall, VMEXIT)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Unfortunately, core scheduling does not protect kernel contexts running on
+sibling hyperthreads from one another. Prototypes of mitigations have been posted
+to LKML to solve this, but it is debatable whether such windows are practically
+exploitable, and whether the performance overhead of the prototypes is worth
+it (not to mention, the added code complexity).
+
+Other Use cases
+---------------
+The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
+with SMT enabled. There are other use cases where this feature could be used:
+
+- Isolating tasks that need a whole core: Examples include realtime tasks,
+ tasks that use SIMD instructions, etc.
+- Gang scheduling: Requirements for a group of tasks that need to be scheduled
+ together could also be realized using core scheduling. One example is vCPUs of
+ a VM.
diff --git a/Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst b/Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst
new file mode 100644
index 000000000000..875616d675fe
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst
@@ -0,0 +1,91 @@
+
+.. SPDX-License-Identifier: GPL-2.0
+
+Cross-Thread Return Address Predictions
+=======================================
+
+Certain AMD and Hygon processors are subject to a cross-thread return address
+predictions vulnerability. When running in SMT mode and one sibling thread
+transitions out of C0 state, the other sibling thread could use return target
+predictions from the sibling thread that transitioned out of C0.
+
+The Spectre v2 mitigations protect the Linux kernel, as it fills the return
+address prediction entries with safe targets when context switching to the idle
+thread. However, KVM does allow a VMM to prevent exiting guest mode when
+transitioning out of C0. This could result in a guest-controlled return target
+being consumed by the sibling thread.
+
+Affected processors
+-------------------
+
+The following CPUs are vulnerable:
+
+ - AMD Family 17h processors
+ - Hygon Family 18h processors
+
+Related CVEs
+------------
+
+The following CVE entry is related to this issue:
+
+ ============== =======================================
+ CVE-2022-27672 Cross-Thread Return Address Predictions
+ ============== =======================================
+
+Problem
+-------
+
+Affected SMT-capable processors support 1T and 2T modes of execution when SMT
+is enabled. In 2T mode, both threads in a core are executing code. For the
+processor core to enter 1T mode, it is required that one of the threads
+requests to transition out of the C0 state. This can be communicated with the
+HLT instruction or with an MWAIT instruction that requests non-C0.
+When the thread re-enters the C0 state, the processor transitions back
+to 2T mode, assuming the other thread is also still in C0 state.
+
+In affected processors, the return address predictor (RAP) is partitioned
+depending on the SMT mode. For instance, in 2T mode each thread uses a private
+16-entry RAP, but in 1T mode, the active thread uses a 32-entry RAP. Upon
+transition between 1T/2T mode, the RAP contents are not modified but the RAP
+pointers (which control the next return target to use for predictions) may
+change. This behavior may result in return targets from one SMT thread being
+used by RET predictions in the sibling thread following a 1T/2T switch. In
+particular, a RET instruction executed immediately after a transition to 1T may
+use a return target from the thread that just became idle. In theory, this
+could lead to information disclosure if the return targets used do not come
+from trustworthy code.
+
+Attack scenarios
+----------------
+
+An attack can be mounted on affected processors by performing a series of CALL
+instructions with targeted return locations and then transitioning out of C0
+state.
+
+Mitigation mechanism
+--------------------
+
+Before entering idle state, the kernel context switches to the idle thread. The
+context switch fills the RAP entries (referred to as the RSB in Linux) with safe
+targets by performing a sequence of CALL instructions.
+
+Prevent a guest VM from directly putting the processor into an idle state by
+intercepting HLT and MWAIT instructions.
+
+Both mitigations are required to fully address this issue.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+Use existing Spectre v2 mitigations that will fill the RSB on context switch.
+
+Mitigation control for KVM - module parameter
+---------------------------------------------
+
+By default, the KVM hypervisor mitigates this issue by intercepting guest
+attempts to transition out of C0. A VMM can use the KVM_CAP_X86_DISABLE_EXITS
+capability to override those interceptions, but since this is not common, the
+mitigation that covers this path is not enabled by default.
+
+The mitigation for the KVM_CAP_X86_DISABLE_EXITS capability can be turned on
+using the boolean module parameter mitigate_smt_rsb, e.g. ``kvm.mitigate_smt_rsb=1``.
diff --git a/Documentation/admin-guide/hw-vuln/gather_data_sampling.rst b/Documentation/admin-guide/hw-vuln/gather_data_sampling.rst
new file mode 100644
index 000000000000..264bfa937f7d
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/gather_data_sampling.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+GDS - Gather Data Sampling
+==========================
+
+Gather Data Sampling is a hardware vulnerability which allows unprivileged
+speculative access to data which was previously stored in vector registers.
+
+Problem
+-------
+When a gather instruction performs loads from memory, different data elements
+are merged into the destination vector register. However, when a gather
+instruction that is transiently executed encounters a fault, stale data from
+architectural or internal vector registers may get transiently forwarded to the
+destination vector register instead. This will allow a malicious attacker to
+infer stale data using typical side channel techniques like cache timing
+attacks. GDS is a purely sampling-based attack.
+
+The attacker uses gather instructions to infer the stale vector register data.
+The victim does not need to do anything special other than use the vector
+registers. The victim does not need to use gather instructions to be
+vulnerable.
+
+Because the buffers are shared between Hyper-Threads, cross-Hyper-Thread
+attacks are possible.
+
+Attack scenarios
+----------------
+Without mitigation, GDS can infer stale data across virtually all
+permission boundaries:
+
+ Non-enclaves can infer SGX enclave data
+ Userspace can infer kernel data
+ Guests can infer data from hosts
+ Guests can infer data from other guests
+ Users can infer data from other users
+
+Because of this, it is important to ensure that the mitigation stays enabled in
+lower-privilege contexts like guests and when running outside SGX enclaves.
+
+The hardware enforces the mitigation for SGX. Likewise, VMMs should ensure
+that guests are not allowed to disable the GDS mitigation. If a host erred and
+allowed this, a guest could theoretically disable GDS mitigation, mount an
+attack, and re-enable it.
+
+Mitigation mechanism
+--------------------
+This issue is mitigated in microcode. The microcode defines the following new
+bits:
+
+ ================================ === ============================
+ IA32_ARCH_CAPABILITIES[GDS_CTRL] R/O Enumerates GDS vulnerability
+ and mitigation support.
+ IA32_ARCH_CAPABILITIES[GDS_NO] R/O Processor is not vulnerable.
+ IA32_MCU_OPT_CTRL[GDS_MITG_DIS] R/W Disables the mitigation
+ 0 by default.
+ IA32_MCU_OPT_CTRL[GDS_MITG_LOCK] R/W Locks GDS_MITG_DIS=0. Writes
+ to GDS_MITG_DIS are ignored
+ Can't be cleared once set.
+ ================================ === ============================
+
+GDS can also be mitigated on systems that don't have updated microcode by
+disabling AVX. This can be done by setting gather_data_sampling="force" or
+"clearcpuid=avx" on the kernel command-line.
+
+If used, these options will disable AVX use by turning off XSAVE YMM support.
+However, the processor will still enumerate AVX support. Userspace that
+does not follow proper AVX enumeration to check both AVX *and* XSAVE YMM
+support will break.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+The mitigation can be disabled by setting "gather_data_sampling=off" or
+"mitigations=off" on the kernel command line. Not specifying either will default
+to the mitigation being enabled. Specifying "gather_data_sampling=force" will
+use the microcode mitigation when available or disable AVX on affected systems
+where the microcode hasn't been updated to include the mitigation.
+
+GDS System Information
+------------------------
+The kernel provides vulnerability status information through sysfs. For
+GDS this can be accessed by the following sysfs file:
+
+/sys/devices/system/cpu/vulnerabilities/gather_data_sampling
+
+The possible values contained in this file are:
+
+ ============================== =============================================
+ Not affected Processor not vulnerable.
+ Vulnerable Processor vulnerable and mitigation disabled.
+ Vulnerable: No microcode Processor vulnerable and microcode is missing
+ mitigation.
+ Mitigation: AVX disabled,
+ no microcode Processor is vulnerable and microcode is missing
+ mitigation. AVX disabled as mitigation.
+ Mitigation: Microcode Processor is vulnerable and mitigation is in
+ effect.
+ Mitigation: Microcode (locked) Processor is vulnerable and mitigation is in
+ effect and cannot be disabled.
+ Unknown: Dependent on
+ hypervisor status Running on a virtual guest processor that is
+ affected but with no way to know if host
+ processor is mitigated or vulnerable.
+ ============================== =============================================
+
+GDS Default mitigation
+----------------------
+The updated microcode will enable the mitigation by default. The kernel's
+default action is to leave the mitigation enabled.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index ca4dbdd9016d..ff0b440ef2dc 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -13,5 +13,12 @@ are configurable at compile, boot or run time.
l1tf
mds
tsx_async_abort
- multihit.rst
- special-register-buffer-data-sampling.rst
+ multihit
+ special-register-buffer-data-sampling
+ core-scheduling
+ l1d_flush
+ processor_mmio_stale_data
+ cross-thread-rsb
+ srso
+ gather_data_sampling
+ reg-file-data-sampling
diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
new file mode 100644
index 000000000000..210020bc3f56
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
@@ -0,0 +1,69 @@
+L1D Flushing
+============
+
+With an increasing number of vulnerabilities being reported around data
+leaks from the Level 1 Data cache (L1D) the kernel provides an opt-in
+mechanism to flush the L1D cache on context switch.
+
+This mechanism can be used to address e.g. CVE-2020-0550. For applications
+that opt in, the mechanism keeps them safe from vulnerabilities related to
+leaks (snooping) of the L1D cache.
+
+
+Related CVEs
+------------
+The following CVEs can be addressed by this mechanism:
+
+ ============= ======================== ==================
+ CVE-2020-0550 Improper Data Forwarding OS related aspects
+ ============= ======================== ==================
+
+Usage Guidelines
+----------------
+
+Please see :ref:`Documentation/userspace-api/spec_ctrl.rst
+<set_spec_ctrl>` for details.
+
+**NOTE**: The feature is disabled by default; applications need to
+opt into the feature explicitly to enable it.
+
+Mitigation
+----------
+
+When PR_SET_L1D_FLUSH is enabled for a task, a flush of the L1D cache is
+performed when the task is scheduled out and the incoming task belongs to a
+different process and therefore to a different address space.
+
+If the underlying CPU supports L1D flushing in hardware, the hardware
+mechanism is used; a software fallback for the mitigation is not supported.
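+
+A task opts into the flush via the speculation control prctl() documented in
+:ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`. A minimal
+sketch (for illustration only, not part of the kernel sources; it assumes
+PR_SPEC_L1D_FLUSH is available in the installed UAPI headers and that
+PR_SPEC_ENABLE requests the flush)::
+
+  #include <stdio.h>
+  #include <errno.h>
+  #include <string.h>
+  #include <sys/prctl.h>
+  #include <linux/prctl.h>
+
+  #ifndef PR_SPEC_L1D_FLUSH
+  #define PR_SPEC_L1D_FLUSH 2   /* value from include/uapi/linux/prctl.h */
+  #endif
+
+  int main(void)
+  {
+          /*
+           * Ask for an L1D flush when this task is scheduled out and a
+           * task from a different address space is scheduled in.  The
+           * call fails when l1d_flush=on was not given on the kernel
+           * command line or the CPU has no hardware L1D flush support.
+           */
+          if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH,
+                    PR_SPEC_ENABLE, 0, 0))
+                  fprintf(stderr, "L1D flush opt-in failed: %s\n",
+                          strerror(errno));
+          else
+                  printf("L1D flush enabled for this task\n");
+          return 0;
+  }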
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows controlling the L1D flush mitigation at boot
+time with the option "l1d_flush=". The valid arguments for this option are:
+
+ ============ =============================================================
+ on Enables the prctl interface; applications trying to use
+ the prctl() will fail with an error if l1d_flush is not
+ enabled
+ ============ =============================================================
+
+By default the mechanism is disabled.
+
+Limitations
+-----------
+
+The mechanism does not mitigate L1D data leaks between tasks belonging to
+different processes which are concurrently executing on sibling threads of
+a physical CPU core when SMT is enabled on the system.
+
+This can be addressed by controlled placement of processes on physical CPU
+cores or by disabling SMT. See the relevant chapter in the L1TF mitigation
+document: :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
+
+**NOTE**: The opt-in of a task for L1D flushing works only when the task's
+affinity is limited to cores running in non-SMT mode. If a task which
+requested L1D flushing is scheduled on an SMT-enabled core, the kernel sends
+a SIGBUS to the task.
diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst
index 2d19c9f4c1fe..48c7b0b72aed 100644
--- a/Documentation/admin-guide/hw-vuln/mds.rst
+++ b/Documentation/admin-guide/hw-vuln/mds.rst
@@ -58,14 +58,14 @@ Because the buffers are potentially shared between Hyper-Threads cross
Hyper-Thread attacks are possible.
Deeper technical information is available in the MDS specific x86
-architecture section: :ref:`Documentation/x86/mds.rst <mds>`.
+architecture section: :ref:`Documentation/arch/x86/mds.rst <mds>`.
Attack scenarios
----------------
-Attacks against the MDS vulnerabilities can be mounted from malicious non
-priviledged user space applications running on hosts or guest. Malicious
+Attacks against the MDS vulnerabilities can be mounted from malicious non-
+privileged user space applications running on hosts or guests. Malicious
guest OSes can obviously mount attacks as well.
Contrary to other speculation based vulnerabilities the MDS vulnerability
@@ -102,9 +102,19 @@ The possible values in this file are:
* - 'Vulnerable'
- The processor is vulnerable, but no mitigation enabled
* - 'Vulnerable: Clear CPU buffers attempted, no microcode'
- - The processor is vulnerable but microcode is not updated.
-
- The mitigation is enabled on a best effort basis. See :ref:`vmwerv`
+ - The processor is vulnerable but microcode is not updated. The
+ mitigation is enabled on a best effort basis.
+
+ If the processor is vulnerable but the availability of the microcode
+ based mitigation mechanism is not advertised via CPUID, the kernel
+ selects a best effort mitigation mode. This mode invokes the mitigation
+ instructions without a guarantee that they clear the CPU buffers.
+
+ This is done to address virtualization scenarios where the host has the
+ microcode update applied, but the hypervisor is not yet updated to
+ expose the CPUID to the guest. If the host has updated microcode the
+ protection takes effect; otherwise a few CPU cycles are wasted
+ pointlessly.
* - 'Mitigation: Clear CPU buffers'
- The processor is vulnerable and the CPU buffer clearing mitigation is
enabled.
@@ -119,24 +129,6 @@ to the above information:
'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown
======================== ============================================
-.. _vmwerv:
-
-Best effort mitigation mode
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- If the processor is vulnerable, but the availability of the microcode based
- mitigation mechanism is not advertised via CPUID the kernel selects a best
- effort mitigation mode. This mode invokes the mitigation instructions
- without a guarantee that they clear the CPU buffers.
-
- This is done to address virtualization scenarios where the host has the
- microcode update applied, but the hypervisor is not yet updated to expose
- the CPUID to the guest. If the host has updated microcode the protection
- takes effect otherwise a few cpu cycles are wasted pointlessly.
-
- The state in the mds sysfs file reflects this situation accordingly.
-
-
Mitigation mechanism
-------------------------
diff --git a/Documentation/admin-guide/hw-vuln/multihit.rst b/Documentation/admin-guide/hw-vuln/multihit.rst
index ba9988d8bce5..140e4cec38c3 100644
--- a/Documentation/admin-guide/hw-vuln/multihit.rst
+++ b/Documentation/admin-guide/hw-vuln/multihit.rst
@@ -80,6 +80,10 @@ The possible values in this file are:
- The processor is not vulnerable.
* - KVM: Mitigation: Split huge pages
- Software changes mitigate this issue.
+ * - KVM: Mitigation: VMX unsupported
+ - KVM is not vulnerable because Virtual Machine Extensions (VMX) is not supported.
+ * - KVM: Mitigation: VMX disabled
+ - KVM is not vulnerable because Virtual Machine Extensions (VMX) is disabled.
* - KVM: Vulnerable
- The processor is vulnerable, but no mitigation enabled
diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
new file mode 100644
index 000000000000..1302fd1b55e8
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
@@ -0,0 +1,271 @@
+=========================================
+Processor MMIO Stale Data Vulnerabilities
+=========================================
+
+Processor MMIO Stale Data Vulnerabilities are a class of memory-mapped I/O
+(MMIO) vulnerabilities that can expose data. The sequences of operations for
+exposing data range from simple to very complex. Because most of the
+vulnerabilities require the attacker to have access to MMIO, many environments
+are not affected. System environments using virtualization where MMIO access is
+provided to untrusted guests may need mitigation. These vulnerabilities are
+not transient execution attacks. However, these vulnerabilities may propagate
+stale data into core fill buffers where the data can subsequently be inferred
+by an unmitigated transient execution attack. Mitigation for these
+vulnerabilities includes a combination of microcode update and software
+changes, depending on the platform and usage model. Some of these mitigations
+are similar to those used to mitigate Microarchitectural Data Sampling (MDS) or
+those used to mitigate Special Register Buffer Data Sampling (SRBDS).
+
+Data Propagators
+================
+Propagators are operations that result in stale data being copied or moved from
+one microarchitectural buffer or register to another. Processor MMIO Stale Data
+Vulnerabilities are operations that may result in stale data being directly
+read into an architectural, software-visible state or sampled from a buffer or
+register.
+
+Fill Buffer Stale Data Propagator (FBSDP)
+-----------------------------------------
+Stale data may propagate from fill buffers (FB) into the non-coherent portion
+of the uncore on some non-coherent writes. Fill buffer propagation by itself
+does not make stale data architecturally visible. Stale data must be propagated
+to a location where it is subject to reading or sampling.
+
+Sideband Stale Data Propagator (SSDP)
+-------------------------------------
+The sideband stale data propagator (SSDP) is limited to the client (including
+Intel Xeon server E3) uncore implementation. The sideband response buffer is
+shared by all client cores. For non-coherent reads that go to sideband
+destinations, the uncore logic returns 64 bytes of data to the core, including
+both requested data and unrequested stale data, from a transaction buffer and
+the sideband response buffer. As a result, stale data from the sideband
+response and transaction buffers may now reside in a core fill buffer.
+
+Primary Stale Data Propagator (PSDP)
+------------------------------------
+The primary stale data propagator (PSDP) is limited to the client (including
+Intel Xeon server E3) uncore implementation. Similar to the sideband response
+buffer, the primary response buffer is shared by all client cores. For some
+processors, MMIO primary reads will return 64 bytes of data to the core fill
+buffer including both requested data and unrequested stale data. This is
+similar to the sideband stale data propagator.
+
+Vulnerabilities
+===============
+Device Register Partial Write (DRPW) (CVE-2022-21166)
+-----------------------------------------------------
+Some endpoint MMIO registers incorrectly handle writes that are smaller than
+the register size. Instead of aborting the write or only copying the correct
+subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than
+specified by the write transaction may be written to the register. On
+processors affected by FBSDP, this may expose stale data from the fill buffers
+of the core that created the write transaction.
+
+Shared Buffers Data Sampling (SBDS) (CVE-2022-21125)
+----------------------------------------------------
+After propagators may have moved data around the uncore and copied stale data
+into client core fill buffers, processors affected by MFBDS can leak data from
+the fill buffer. It is limited to the client (including Intel Xeon server E3)
+uncore implementation.
+
+Shared Buffers Data Read (SBDR) (CVE-2022-21123)
+------------------------------------------------
+It is similar to Shared Buffer Data Sampling (SBDS) except that the data is
+directly read into the architectural software-visible state. It is limited to
+the client (including Intel Xeon server E3) uncore implementation.
+
+Affected Processors
+===================
+Not all the CPUs are affected by all the variants. For instance, most
+processors for the server market (excluding Intel Xeon E3 processors) are
+impacted by only Device Register Partial Write (DRPW).
+
+Below is the list of affected Intel processors [#f1]_:
+
+ =================== ============ =========
+ Common name Family_Model Steppings
+ =================== ============ =========
+ HASWELL_X 06_3FH 2,4
+ SKYLAKE_L 06_4EH 3
+ BROADWELL_X 06_4FH All
+ SKYLAKE_X 06_55H 3,4,6,7,11
+ BROADWELL_D 06_56H 3,4,5
+ SKYLAKE 06_5EH 3
+ ICELAKE_X 06_6AH 4,5,6
+ ICELAKE_D 06_6CH 1
+ ICELAKE_L 06_7EH 5
+ ATOM_TREMONT_D 06_86H All
+ LAKEFIELD 06_8AH 1
+ KABYLAKE_L 06_8EH 9 to 12
+ ATOM_TREMONT 06_96H 1
+ ATOM_TREMONT_L 06_9CH 0
+ KABYLAKE 06_9EH 9 to 13
+ COMETLAKE 06_A5H 2,3,5
+ COMETLAKE_L 06_A6H 0,1
+ ROCKETLAKE 06_A7H 1
+ =================== ============ =========
+
+If a CPU is in the affected processor list, but not affected by a variant, it
+is indicated by new bits in MSR IA32_ARCH_CAPABILITIES. As described in a later
+section, mitigation largely remains the same for all the variants, i.e. to
+clear the CPU fill buffers via VERW instruction.
+
+New bits in MSRs
+================
+Newer processors and microcode updates on existing affected processors added
+new bits to the IA32_ARCH_CAPABILITIES MSR. These bits can be used to enumerate
+specific variants of Processor MMIO Stale Data vulnerabilities and mitigation
+capability.
+
+MSR IA32_ARCH_CAPABILITIES
+--------------------------
+Bit 13 - SBDR_SSDP_NO - When set, processor is not affected by either the
+ Shared Buffers Data Read (SBDR) vulnerability or the sideband stale
+ data propagator (SSDP).
+Bit 14 - FBSDP_NO - When set, processor is not affected by the Fill Buffer
+ Stale Data Propagator (FBSDP).
+Bit 15 - PSDP_NO - When set, processor is not affected by Primary Stale Data
+ Propagator (PSDP).
+Bit 17 - FB_CLEAR - When set, VERW instruction will overwrite CPU fill buffer
+ values as part of MD_CLEAR operations. Processors that do not
+ enumerate MDS_NO (meaning they are affected by MDS) but that do
+ enumerate support for both L1D_FLUSH and MD_CLEAR implicitly enumerate
+ FB_CLEAR as part of their MD_CLEAR support.
+Bit 18 - FB_CLEAR_CTRL - Processor supports read and write to MSR
+ IA32_MCU_OPT_CTRL[FB_CLEAR_DIS]. On such processors, the FB_CLEAR_DIS
+ bit can be set to cause the VERW instruction to not perform the
+ FB_CLEAR action. Not all processors that support FB_CLEAR will support
+ FB_CLEAR_CTRL.
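+
+For illustration only (this example is not part of the kernel sources), the
+enumeration bits above can be inspected from userspace through the msr driver
+(requires root and CONFIG_X86_MSR); IA32_ARCH_CAPABILITIES is MSR 0x10a::
+
+  #include <stdio.h>
+  #include <stdint.h>
+  #include <fcntl.h>
+  #include <unistd.h>
+
+  #define MSR_IA32_ARCH_CAPABILITIES 0x10a
+
+  int main(void)
+  {
+          uint64_t val;
+          int fd = open("/dev/cpu/0/msr", O_RDONLY);
+
+          if (fd < 0) {
+                  perror("open /dev/cpu/0/msr (is the msr module loaded?)");
+                  return 1;
+          }
+          if (pread(fd, &val, sizeof(val),
+                    MSR_IA32_ARCH_CAPABILITIES) != sizeof(val)) {
+                  perror("pread");
+                  close(fd);
+                  return 1;
+          }
+          printf("SBDR_SSDP_NO : %llu\n", (unsigned long long)(val >> 13) & 1);
+          printf("FBSDP_NO     : %llu\n", (unsigned long long)(val >> 14) & 1);
+          printf("PSDP_NO      : %llu\n", (unsigned long long)(val >> 15) & 1);
+          printf("FB_CLEAR     : %llu\n", (unsigned long long)(val >> 17) & 1);
+          printf("FB_CLEAR_CTRL: %llu\n", (unsigned long long)(val >> 18) & 1);
+          close(fd);
+          return 0;
+  }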
+
+MSR IA32_MCU_OPT_CTRL
+---------------------
+Bit 3 - FB_CLEAR_DIS - When set, VERW instruction does not perform the FB_CLEAR
+action. This may be useful to reduce the performance impact of FB_CLEAR in
+cases where system software deems it warranted (for example, when performance
+is more critical, or the untrusted software has no MMIO access). Note that
+FB_CLEAR_DIS has no impact on enumeration (for example, it does not change
+FB_CLEAR or MD_CLEAR enumeration) and it may not be supported on all processors
+that enumerate FB_CLEAR.
+
+Mitigation
+==========
+Like MDS, all variants of Processor MMIO Stale Data vulnerabilities have the
+same mitigation strategy to force the CPU to clear the affected buffers before
+an attacker can extract the secrets.
+
+This is achieved by using the otherwise unused and obsolete VERW instruction in
+combination with a microcode update. The microcode clears the affected CPU
+buffers when the VERW instruction is executed.
+
+The kernel reuses the MDS function to invoke the buffer clearing:
+
+ mds_clear_cpu_buffers()
+
+On MDS affected CPUs, the kernel already invokes CPU buffer clear on
+kernel/userspace, hypervisor/guest and C-state (idle) transitions. No
+additional mitigation is needed on such CPUs.
+
+For CPUs not affected by MDS or TAA, mitigation is needed only for the attacker
+with MMIO capability. Therefore, VERW is not required for kernel/userspace. For
+the virtualization case, VERW is only needed at VMENTER for a guest with MMIO
+capability.
+
+Mitigation points
+-----------------
+Return to user space
+^^^^^^^^^^^^^^^^^^^^
+Same mitigation as MDS when affected by MDS/TAA, otherwise no mitigation
+needed.
+
+C-State transition
+^^^^^^^^^^^^^^^^^^
+Control register writes by CPU during C-state transition can propagate data
+from fill buffer to uncore buffers. Execute VERW before C-state transition to
+clear CPU fill buffers.
+
+Guest entry point
+^^^^^^^^^^^^^^^^^
+Same mitigation as MDS when processor is also affected by MDS/TAA, otherwise
+execute VERW at VMENTER only for MMIO capable guests. On CPUs not affected by
+MDS/TAA, guest without MMIO access cannot extract secrets using Processor MMIO
+Stale Data vulnerabilities, so there is no need to execute VERW for such guests.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+The kernel command line allows controlling the Processor MMIO Stale Data
+mitigations at boot time with the option "mmio_stale_data=". The valid
+arguments for this option are:
+
+ ========== =================================================================
+ full If the CPU is vulnerable, enable mitigation; CPU buffer clearing
+ on exit to userspace and when entering a VM. Idle transitions are
+ protected as well. It does not automatically disable SMT.
+ full,nosmt Same as full, with SMT disabled on vulnerable CPUs. This is the
+ complete mitigation.
+ off Disables mitigation completely.
+ ========== =================================================================
+
+If the CPU is affected and mmio_stale_data=off is not supplied on the kernel
+command line, then the kernel selects the appropriate mitigation.
+
+Mitigation status information
+-----------------------------
+The Linux kernel provides a sysfs interface to enumerate the current
+vulnerability status of the system: whether the system is vulnerable, and
+which mitigations are active. The relevant sysfs file is:
+
+ /sys/devices/system/cpu/vulnerabilities/mmio_stale_data
+
+The possible values in this file are:
+
+ .. list-table::
+
+ * - 'Not affected'
+ - The processor is not vulnerable
+ * - 'Vulnerable'
+ - The processor is vulnerable, but no mitigation enabled
+ * - 'Vulnerable: Clear CPU buffers attempted, no microcode'
+ - The processor is vulnerable but microcode is not updated. The
+ mitigation is enabled on a best effort basis.
+
+ If the processor is vulnerable but the availability of the microcode
+ based mitigation mechanism is not advertised via CPUID, the kernel
+ selects a best effort mitigation mode. This mode invokes the mitigation
+ instructions without a guarantee that they clear the CPU buffers.
+
+ This is done to address virtualization scenarios where the host has the
+ microcode update applied, but the hypervisor is not yet updated to
+ expose the CPUID to the guest. If the host has updated microcode the
+ protection takes effect; otherwise a few CPU cycles are wasted
+ pointlessly.
+ * - 'Mitigation: Clear CPU buffers'
+ - The processor is vulnerable and the CPU buffer clearing mitigation is
+ enabled.
+ * - 'Unknown: No mitigations'
+ - The processor vulnerability status is unknown because it is
+ out of its Servicing period. Mitigation is not attempted.
+
+Definitions:
+------------
+
+Servicing period: The process of providing functional and security updates to
+Intel processors or platforms, utilizing the Intel Platform Update (IPU)
+process or other similar mechanisms.
+
+End of Servicing Updates (ESU): ESU is the date at which Intel will no
+longer provide Servicing, such as through IPU or other similar update
+processes. ESU dates will typically be aligned to end of quarter.
+
+If the processor is vulnerable then the following information is appended to
+the above information:
+
+ ======================== ===========================================
+ 'SMT vulnerable' SMT is enabled
+ 'SMT disabled' SMT is disabled
+ 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown
+ ======================== ===========================================
+
+References
+----------
+.. [#f1] Affected Processors
+ https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html
diff --git a/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
new file mode 100644
index 000000000000..0585d02b9a6c
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
@@ -0,0 +1,104 @@
+==================================
+Register File Data Sampling (RFDS)
+==================================
+
+Register File Data Sampling (RFDS) is a microarchitectural vulnerability that
+only affects Intel Atom parts (also branded as E-cores). RFDS may allow
+a malicious actor to infer data values previously used in floating point
+registers, vector registers, or integer registers. RFDS does not provide the
+ability to choose which data is inferred. CVE-2023-28746 is assigned to RFDS.
+
+Affected Processors
+===================
+Below is the list of affected Intel processors [#f1]_:
+
+ =================== ============
+ Common name Family_Model
+ =================== ============
+ ATOM_GOLDMONT 06_5CH
+ ATOM_GOLDMONT_D 06_5FH
+ ATOM_GOLDMONT_PLUS 06_7AH
+ ATOM_TREMONT_D 06_86H
+ ATOM_TREMONT 06_96H
+ ALDERLAKE 06_97H
+ ALDERLAKE_L 06_9AH
+ ATOM_TREMONT_L 06_9CH
+ RAPTORLAKE 06_B7H
+ RAPTORLAKE_P 06_BAH
+ ATOM_GRACEMONT 06_BEH
+ RAPTORLAKE_S 06_BFH
+ =================== ============
+
+As an exception to this table, Intel Xeon E family parts ALDERLAKE(06_97H) and
+RAPTORLAKE(06_B7H) codenamed Catlow are not affected. They are reported as
+vulnerable in Linux because they share the same family/model with an affected
+part. Unlike their affected counterparts, they do not enumerate RFDS_CLEAR or
+CPUID.HYBRID. This information could be used to distinguish between the
+affected and unaffected parts, but it is deemed not worth adding complexity as
+the reporting is fixed automatically when these parts enumerate RFDS_NO.
+
+Mitigation
+==========
+Intel released a microcode update that enables software to clear sensitive
+information using the VERW instruction. Like MDS, RFDS deploys the same
+mitigation strategy to force the CPU to clear the affected buffers before an
+attacker can extract the secrets. This is achieved by using the otherwise
+unused and obsolete VERW instruction in combination with a microcode update.
+The microcode clears the affected CPU buffers when the VERW instruction is
+executed.
+
+Mitigation points
+-----------------
+VERW is executed by the kernel before returning to user space, and by KVM
+before VMentry. None of the affected cores support SMT, so VERW is not required
+at C-state transitions.
+
+New bits in IA32_ARCH_CAPABILITIES
+----------------------------------
+Newer processors and microcode updates on existing affected processors added
+new bits to the IA32_ARCH_CAPABILITIES MSR. These bits can be used to enumerate
+vulnerability and mitigation capability:
+
+- Bit 27 - RFDS_NO - When set, processor is not affected by RFDS.
+- Bit 28 - RFDS_CLEAR - When set, processor is affected by RFDS, and has the
+ microcode that clears the affected buffers on VERW execution.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+The kernel command line allows controlling the RFDS mitigation at boot time
+with the parameter "reg_file_data_sampling=". The valid arguments are:
+
+ ========== =================================================================
+ on If the CPU is vulnerable, enable mitigation; CPU buffer clearing
+ on exit to userspace and before entering a VM.
+ off Disables mitigation.
+ ========== =================================================================
+
+Mitigation default is selected by CONFIG_MITIGATION_RFDS.
+
+Mitigation status information
+-----------------------------
+The Linux kernel provides a sysfs interface to enumerate the current
+vulnerability status of the system: whether the system is vulnerable, and
+which mitigations are active. The relevant sysfs file is:
+
+ /sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling
+
+The possible values in this file are:
+
+ .. list-table::
+
+ * - 'Not affected'
+ - The processor is not vulnerable
+ * - 'Vulnerable'
+ - The processor is vulnerable, but no mitigation enabled
+ * - 'Vulnerable: No microcode'
+ - The processor is vulnerable but microcode is not updated.
+ * - 'Mitigation: Clear Register File'
+ - The processor is vulnerable and the CPU buffer clearing mitigation is
+ enabled.
+
+References
+----------
+.. [#f1] Affected Processors
+ https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html
diff --git a/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst b/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst
index 47b1b3afac99..966c9b3296ea 100644
--- a/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst
+++ b/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst
@@ -3,7 +3,8 @@
SRBDS - Special Register Buffer Data Sampling
=============================================
-SRBDS is a hardware vulnerability that allows MDS :doc:`mds` techniques to
+SRBDS is a hardware vulnerability that allows MDS
+Documentation/admin-guide/hw-vuln/mds.rst techniques to
infer values returned from special register accesses. Special register
accesses are accesses to off core registers. According to Intel's evaluation,
the special register reads that have a security expectation of privacy are
@@ -14,7 +15,7 @@ to the core through the special register mechanism that is susceptible
to MDS attacks.
Affected processors
---------------------
+-------------------
Core models (desktop, mobile, Xeon-E3) that implement RDRAND and/or RDSEED may
be affected.
@@ -59,7 +60,7 @@ executed on another core or sibling thread using MDS techniques.
Mitigation mechanism
--------------------
+--------------------
Intel will release microcode updates that modify the RDRAND, RDSEED, and
EGETKEY instructions to overwrite secret special register data in the shared
staging buffer before the secret data can be accessed by another logical
@@ -118,7 +119,7 @@ with the option "srbds=". The option for this is:
============= =============================================================
SRBDS System Information
------------------------
+------------------------
The Linux kernel provides vulnerability status information through sysfs. For
SRBDS this can be accessed by the following sysfs file:
/sys/devices/system/cpu/vulnerabilities/srbds
diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst
index e05e581af5cf..cce768afec6b 100644
--- a/Documentation/admin-guide/hw-vuln/spectre.rst
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -60,8 +60,8 @@ privileged data touched during the speculative execution.
Spectre variant 1 attacks take advantage of speculative execution of
conditional branches, while Spectre variant 2 attacks use speculative
execution of indirect branches to leak privileged memory.
-See :ref:`[1] <spec_ref1>` :ref:`[5] <spec_ref5>` :ref:`[7] <spec_ref7>`
-:ref:`[10] <spec_ref10>` :ref:`[11] <spec_ref11>`.
+See :ref:`[1] <spec_ref1>` :ref:`[5] <spec_ref5>` :ref:`[6] <spec_ref6>`
+:ref:`[7] <spec_ref7>` :ref:`[10] <spec_ref10>` :ref:`[11] <spec_ref11>`.
Spectre variant 1 (Bounds Check Bypass)
---------------------------------------
@@ -131,6 +131,19 @@ steer its indirect branch speculations to gadget code, and measure the
speculative execution's side effects left in level 1 cache to infer the
victim's data.
+Yet another variant 2 attack vector is for the attacker to poison the
+Branch History Buffer (BHB) to speculatively steer an indirect branch
+to a specific Branch Target Buffer (BTB) entry, even if the entry isn't
+associated with the source address of the indirect branch. Specifically,
+the BHB might be shared across privilege levels even in the presence of
+Enhanced IBRS.
+
+Currently the only known real-world BHB attack vector is via
+unprivileged eBPF. Therefore, it's highly recommended to not enable
+unprivileged eBPF, especially when eIBRS is used (without retpolines).
+For a full mitigation against BHB attacks, it's recommended to use
+retpolines (or eIBRS combined with retpolines).
+
Attack scenarios
----------------
@@ -364,13 +377,15 @@ The possible values in this file are:
- Kernel status:
- ==================================== =================================
- 'Not affected' The processor is not vulnerable
- 'Vulnerable' Vulnerable, no mitigation
- 'Mitigation: Full generic retpoline' Software-focused mitigation
- 'Mitigation: Full AMD retpoline' AMD-specific software mitigation
- 'Mitigation: Enhanced IBRS' Hardware-focused mitigation
- ==================================== =================================
+ ======================================== =================================
+ 'Not affected' The processor is not vulnerable
+ 'Mitigation: None' Vulnerable, no mitigation
+ 'Mitigation: Retpolines' Use Retpoline thunks
+ 'Mitigation: LFENCE' Use LFENCE instructions
+ 'Mitigation: Enhanced IBRS' Hardware-focused mitigation
+ 'Mitigation: Enhanced IBRS + Retpolines' Hardware-focused + Retpolines
+ 'Mitigation: Enhanced IBRS + LFENCE' Hardware-focused + LFENCE
+ ======================================== =================================
- Firmware status: Show if Indirect Branch Restricted Speculation (IBRS) is
used to protect against Spectre variant 2 attacks when calling firmware (x86 only).
@@ -407,6 +422,14 @@ The possible values in this file are:
'RSB filling' Protection of RSB on context switch enabled
============= ===========================================
+ - EIBRS Post-barrier Return Stack Buffer (PBRSB) protection status:
+
+ =========================== =======================================================
+ 'PBRSB-eIBRS: SW sequence' CPU is affected and protection of RSB on VMEXIT enabled
+ 'PBRSB-eIBRS: Vulnerable' CPU is vulnerable
+ 'PBRSB-eIBRS: Not affected' CPU is not affected by PBRSB
+ =========================== =======================================================
+
Full mitigation might require a microcode update from the CPU
vendor. When the necessary microcode is not available, the kernel will
report vulnerability.
@@ -450,14 +473,25 @@ Spectre variant 2
-mindirect-branch=thunk-extern -mindirect-branch-register options.
If the kernel is compiled with a Clang compiler, the compiler needs
to support -mretpoline-external-thunk option. The kernel config
- CONFIG_RETPOLINE needs to be turned on, and the CPU needs to run with
- the latest updated microcode.
+ CONFIG_MITIGATION_RETPOLINE needs to be turned on, and the CPU needs
+ to run with the latest updated microcode.
On Intel Skylake-era systems the mitigation covers most, but not all,
cases. See :ref:`[3] <spec_ref3>` for more details.
- On CPUs with hardware mitigation for Spectre variant 2 (e.g. Enhanced
- IBRS on x86), retpoline is automatically disabled at run time.
+ On CPUs with hardware mitigation for Spectre variant 2 (e.g. IBRS
+ or enhanced IBRS on x86), retpoline is automatically disabled at run time.
+
+ Systems which support enhanced IBRS (eIBRS) enable IBRS protection once at
+ boot, by setting the IBRS bit, and they're automatically protected against
+ Spectre v2 variant attacks.
+
+ On Intel's enhanced IBRS systems, this includes cross-thread branch target
+ injections on SMT systems (STIBP). In other words, Intel eIBRS enables
+ STIBP, too.
+
+ AMD Automatic IBRS does not protect userspace, and Legacy IBRS systems clear
+ the IBRS bit on exit to userspace; therefore, both explicitly enable STIBP.
The retpoline mitigation is turned on by default on vulnerable
CPUs. It can be forced on or off by the administrator
@@ -468,7 +502,7 @@ Spectre variant 2
before invoking any firmware code to prevent Spectre variant 2 exploits
using the firmware.
- Using kernel address space randomization (CONFIG_RANDOMIZE_SLAB=y
+ Using kernel address space randomization (CONFIG_RANDOMIZE_BASE=y
and CONFIG_SLAB_FREELIST_RANDOM=y in the kernel configuration) makes
attacks on the kernel generally more difficult.
@@ -481,18 +515,20 @@ Spectre variant 2
For Spectre variant 2 mitigation, individual user programs
can be compiled with return trampolines for indirect branches.
This protects them from consuming poisoned entries in the branch
- target buffer left by malicious software. Alternatively, the
- programs can disable their indirect branch speculation via prctl()
- (See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
+ target buffer left by malicious software.
+
+ On legacy IBRS systems, at return to userspace, implicit STIBP is disabled
+ because the kernel clears the IBRS bit. In this case, the userspace programs
+ can disable indirect branch speculation via prctl() (See
+ :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
On x86, this will turn on STIBP to guard against attacks from the
sibling thread when the user program is running, and use IBPB to
flush the branch target buffer when switching to/from the program.
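+
+   As an illustration (this sketch is not part of the kernel sources and
+   assumes the spectre_v2_user mitigation is in a prctl-capable mode such
+   as prctl or seccomp), a program can restrict its own indirect branch
+   speculation like this::
+
+     #include <stdio.h>
+     #include <sys/prctl.h>
+     #include <linux/prctl.h>
+
+     int main(void)
+     {
+             /* Query the current indirect branch speculation state. */
+             int state = prctl(PR_GET_SPECULATION_CTRL,
+                               PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
+
+             printf("indirect branch speculation state: %d\n", state);
+
+             /* Restrict indirect branch speculation for this task. */
+             if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
+                       PR_SPEC_DISABLE, 0, 0))
+                     perror("PR_SET_SPECULATION_CTRL");
+             return 0;
+     }
+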
Restricting indirect branch speculation on a user program will
also prevent the program from launching a variant 2 attack
- on x86. All sand-boxed SECCOMP programs have indirect branch
- speculation restricted by default. Administrators can change
- that behavior via the kernel command line and sysfs control files.
+ on x86. Administrators can change that behavior via the kernel
+ command line and sysfs control files.
See :ref:`spectre_mitigation_control_command_line`.
Programs that disable their indirect branch speculation will have
@@ -573,8 +609,8 @@ kernel command line.
Selecting 'on' will, and 'auto' may, choose a
mitigation method at run time according to the
CPU, the available microcode, the setting of the
- CONFIG_RETPOLINE configuration option, and the
- compiler with which the kernel was built.
+ CONFIG_MITIGATION_RETPOLINE configuration option,
+ and the compiler with which the kernel was built.
Selecting 'on' will also enable the mitigation
against user space to user space task attacks.
@@ -584,71 +620,26 @@ kernel command line.
Specific mitigations can also be selected manually:
- retpoline
- replace indirect branches
- retpoline,generic
- google's original retpoline
- retpoline,amd
- AMD-specific minimal thunk
+ retpoline auto pick between generic,lfence
+ retpoline,generic Retpolines
+ retpoline,lfence LFENCE; indirect branch
+ retpoline,amd alias for retpoline,lfence
+ eibrs Enhanced/Auto IBRS
+ eibrs,retpoline Enhanced/Auto IBRS + Retpolines
+ eibrs,lfence Enhanced/Auto IBRS + LFENCE
+ ibrs use IBRS to protect kernel
Not specifying this option is equivalent to
spectre_v2=auto.
-For user space mitigation:
-
- spectre_v2_user=
-
- [X86] Control mitigation of Spectre variant 2
- (indirect branch speculation) vulnerability between
- user space tasks
-
- on
- Unconditionally enable mitigations. Is
- enforced by spectre_v2=on
-
- off
- Unconditionally disable mitigations. Is
- enforced by spectre_v2=off
-
- prctl
- Indirect branch speculation is enabled,
- but mitigation can be enabled via prctl
- per thread. The mitigation control state
- is inherited on fork.
-
- prctl,ibpb
- Like "prctl" above, but only STIBP is
- controlled per thread. IBPB is issued
- always when switching between different user
- space processes.
-
- seccomp
- Same as "prctl" above, but all seccomp
- threads will enable the mitigation unless
- they explicitly opt out.
-
- seccomp,ibpb
- Like "seccomp" above, but only STIBP is
- controlled per thread. IBPB is issued
- always when switching between different
- user space processes.
-
- auto
- Kernel selects the mitigation depending on
- the available CPU features and vulnerability.
-
- Default mitigation:
- If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl"
-
- Not specifying this option is equivalent to
- spectre_v2_user=auto.
-
In general the kernel by default selects
reasonable mitigations for the current CPU. To
disable Spectre variant 2 mitigations, boot with
spectre_v2=off. Spectre variant 1 mitigations
cannot be disabled.
+For spectre_v2_user, see Documentation/admin-guide/kernel-parameters.txt
+
Mitigation selection guide
--------------------------
@@ -674,9 +665,8 @@ Mitigation selection guide
off by disabling their indirect branch speculation when they are run
(See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
This prevents untrusted programs from polluting the branch target
- buffer. All programs running in SECCOMP sandboxes have indirect
- branch speculation restricted by default. This behavior can be
- changed via the kernel command line and sysfs control files. See
+ buffer. This behavior can be changed via the kernel command line
+ and sysfs control files. See
:ref:`spectre_mitigation_control_command_line`.
3. High security mode
@@ -730,7 +720,7 @@ AMD white papers:
.. _spec_ref6:
-[6] `Software techniques for managing speculation on AMD processors <https://developer.amd.com/wp-content/resources/90343-B_SoftwareTechniquesforManagingSpeculation_WP_7-18Update_FNL.pdf>`_.
+[6] `Software techniques for managing speculation on AMD processors <https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf>`_.
ARM white papers:
diff --git a/Documentation/admin-guide/hw-vuln/srso.rst b/Documentation/admin-guide/hw-vuln/srso.rst
new file mode 100644
index 000000000000..e715bfc09879
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/srso.rst
@@ -0,0 +1,160 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Speculative Return Stack Overflow (SRSO)
+========================================
+
+This is a mitigation for the speculative return stack overflow (SRSO)
+vulnerability found on AMD processors. The mechanism is by now the well
+known scenario of poisoning CPU functional units - the Branch Target
+Buffer (BTB) and Return Address Predictor (RAP) in this case - and then
+tricking the elevated privilege domain (the kernel) into leaking
+sensitive data.
+
+AMD CPUs predict RET instructions using a Return Address Predictor (aka
+Return Address Stack/Return Stack Buffer). In some cases, a non-architectural
+CALL instruction (i.e., an instruction predicted to be a CALL but is
+not actually a CALL) can create an entry in the RAP which may be used
+to predict the target of a subsequent RET instruction.
+
+The specific circumstances that lead to this vary by microarchitecture,
+but the concern is that an attacker can mis-train the CPU BTB to predict
+non-architectural CALL instructions in kernel space and use this to
+control the speculative target of a subsequent kernel RET, potentially
+leading to information disclosure via a speculative side-channel.
+
+The issue is tracked under CVE-2023-20569.
+
+Affected processors
+-------------------
+
+AMD Zen, generations 1-4. That is, all families 0x17 and 0x19. Older
+processors have not been investigated.
+
+System information and options
+------------------------------
+
+First of all, the latest microcode must be loaded for the mitigations to be
+effective.
+
+The sysfs file showing SRSO mitigation status is:
+
+ /sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow
+
+The possible values in this file are:
+
+ * 'Not affected':
+
+ The processor is not vulnerable
+
+ * 'Vulnerable':
+
+ The processor is vulnerable and no mitigations have been applied.
+
+ * 'Vulnerable: No microcode':
+
+ The processor is vulnerable, no microcode extending IBPB
+ functionality to address the vulnerability has been applied.
+
+ * 'Vulnerable: Safe RET, no microcode':
+
+ The "Safe RET" mitigation (see below) has been applied to protect the
+ kernel, but the IBPB-extending microcode has not been applied. User
+ space tasks may still be vulnerable.
+
+ * 'Vulnerable: Microcode, no safe RET':
+
+ The microcode patch extending IBPB functionality has been applied. It does
+ not address User->Kernel and Guest->Host transition protection but it
+ does address User->User and VM->VM attack vectors.
+
+ Note that User->User mitigation is controlled by how the IBPB aspect in
+ the Spectre v2 mitigation is selected:
+
+ * conditional IBPB:
+
+ where each process can select whether it needs an IBPB issued
+ around it via PR_SPEC_DISABLE/_ENABLE etc.; see :doc:`spectre`
+
+ * strict:
+
+ i.e., always on - by supplying spectre_v2_user=on on the kernel
+ command line
+
+ (spec_rstack_overflow=microcode)
+
+ * 'Mitigation: Safe RET':
+
+ Combined microcode/software mitigation. It complements the
+ extended IBPB microcode patch functionality by addressing
+ User->Kernel and Guest->Host transition protection.
+
+ Selected by default or by spec_rstack_overflow=safe-ret
+
+ * 'Mitigation: IBPB':
+
+ Similar protection as "safe RET" above but employs an IBPB barrier on
+ privilege domain crossings (User->Kernel, Guest->Host).
+
+ (spec_rstack_overflow=ibpb)
+
+ * 'Mitigation: IBPB on VMEXIT':
+
+ Mitigation addressing the cloud provider scenario - the Guest->Host
+ transitions only.
+
+ (spec_rstack_overflow=ibpb-vmexit)
+
+
+
+In order to exploit the vulnerability, an attacker needs to:
+
+ - gain local access on the machine
+
+ - break kASLR
+
+ - find gadgets in the running kernel in order to use them in the exploit
+
+ - potentially create and pin an additional workload on the sibling
+ thread, depending on the microarchitecture (not necessary on fam 0x19)
+
+ - run the exploit
+
+Considering the performance implications of each mitigation type, the
+default one is 'Mitigation: Safe RET', which should take care of most
+attack vectors, including the local User->Kernel one.
+
+As always, the user is advised to keep her/his system up-to-date by
+applying software updates regularly.
+
+The default setting will be reevaluated when needed and especially when
+new attack vectors appear.
+
+As one can surmise, 'Mitigation: Safe RET' does come at the cost of some
+performance depending on the workload. If one trusts her/his userspace
+and does not want to suffer the performance impact, one can always
+disable the mitigation with spec_rstack_overflow=off.
+
+Similarly, 'Mitigation: IBPB' is another full mitigation type employing
+an indirect branch prediction barrier after having applied the required
+microcode patch for one's system. This mitigation also comes at
+a performance cost.
+
+Mitigation: Safe RET
+--------------------
+
+The mitigation works by ensuring all RET instructions speculate to
+a controlled location, similar to how speculation is controlled in the
+retpoline sequence. To accomplish this, the __x86_return_thunk forces
+the CPU to mispredict every function return using a 'safe return'
+sequence.
+
+To ensure the safety of this mitigation, the kernel must ensure that the
+safe return sequence is itself free from attacker interference. In Zen3
+and Zen4, this is accomplished by creating a BTB alias between the
+untraining function srso_alias_untrain_ret() and the safe return
+function srso_alias_safe_ret(), which results in evicting a potentially
+poisoned BTB entry and using that safe one for all function returns.
+
+In older Zen1 and Zen2, this is accomplished using a reinterpretation
+technique similar to the Retbleed one: srso_untrain_ret() and
+srso_safe_ret().
diff --git a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
index 68d96f0e9c95..444f84e22a91 100644
--- a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
+++ b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
@@ -60,10 +60,10 @@ Hyper-Thread attacks are possible.
The victim of a malicious actor does not need to make use of TSX. Only the
attacker needs to begin a TSX transaction and raise an asynchronous abort
-which in turn potenitally leaks data stored in the buffers.
+which in turn potentially leaks data stored in the buffers.
More detailed technical information is available in the TAA specific x86
-architecture section: :ref:`Documentation/x86/tsx_async_abort.rst <tsx_async_abort>`.
+architecture section: :ref:`Documentation/arch/x86/tsx_async_abort.rst <tsx_async_abort>`.
Attack scenarios
@@ -98,7 +98,19 @@ The possible values in this file are:
* - 'Vulnerable'
- The CPU is affected by this vulnerability and the microcode and kernel mitigation are not applied.
* - 'Vulnerable: Clear CPU buffers attempted, no microcode'
- - The system tries to clear the buffers but the microcode might not support the operation.
+ - The processor is vulnerable but microcode is not updated. The
+ mitigation is enabled on a best effort basis.
+
+ If the processor is vulnerable but the availability of the microcode
+ based mitigation mechanism is not advertised via CPUID, the kernel
+ selects a best effort mitigation mode. This mode invokes the mitigation
+ instructions without a guarantee that they clear the CPU buffers.
+
+ This is done to address virtualization scenarios where the host has the
+ microcode update applied, but the hypervisor is not yet updated to
+ expose the CPUID to the guest. If the host has updated microcode the
+ protection takes effect; otherwise a few CPU cycles are wasted
+ pointlessly.
* - 'Mitigation: Clear CPU buffers'
- The microcode has been updated to clear the buffers. TSX is still enabled.
* - 'Mitigation: TSX disabled'
@@ -106,25 +118,6 @@ The possible values in this file are:
* - 'Not affected'
- The CPU is not affected by this issue.
-.. _ucode_needed:
-
-Best effort mitigation mode
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-If the processor is vulnerable, but the availability of the microcode-based
-mitigation mechanism is not advertised via CPUID the kernel selects a best
-effort mitigation mode. This mode invokes the mitigation instructions
-without a guarantee that they clear the CPU buffers.
-
-This is done to address virtualization scenarios where the host has the
-microcode update applied, but the hypervisor is not yet updated to expose the
-CPUID to the guest. If the host has updated microcode the protection takes
-effect; otherwise a few CPU cycles are wasted pointlessly.
-
-The state in the tsx_async_abort sysfs file reflects this situation
-accordingly.
-
-
Mitigation mechanism
--------------------
diff --git a/Documentation/admin-guide/hw_random.rst b/Documentation/admin-guide/hw_random.rst
index 121de96e395e..bfc39f1cf470 100644
--- a/Documentation/admin-guide/hw_random.rst
+++ b/Documentation/admin-guide/hw_random.rst
@@ -1,6 +1,6 @@
-==========================================================
-Linux support for random number generator in i8xx chipsets
-==========================================================
+=================================
+Hardware random number generators
+=================================
Introduction
============
@@ -14,10 +14,9 @@ into that core.
To make the most effective use of these mechanisms, you
should download the support software as well. Download the
-latest version of the "rng-tools" package from the
-hw_random driver's official Web site:
+latest version of the "rng-tools" package from:
- http://sourceforge.net/projects/gkernel/
+ https://github.com/nhorman/rng-tools
Those tools use /dev/hwrng to fill the kernel entropy pool,
which is used internally and exported by the /dev/urandom and
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 58c7f9fc2396..32ea52f1d150 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -1,3 +1,4 @@
+=================================================
The Linux kernel user's and administrator's guide
=================================================
@@ -18,6 +19,9 @@ etc.
devices
sysctl/index
+ abi
+ features
+
This section describes CPU vulnerabilities and their mitigations.
.. toctree::
@@ -31,8 +35,10 @@ problems and bugs in particular.
.. toctree::
:maxdepth: 1
- reporting-bugs
- security-bugs
+ reporting-issues
+ reporting-regressions
+ quickly-build-trimmed-linux
+ verify-bugs-and-bisect-regressions
bug-hunting
bug-bisect
tainted-kernels
@@ -41,6 +47,7 @@ problems and bugs in particular.
init
kdump/index
perf/index
+ pstore-blk
This is the beginning of a section with information of interest to
application developers. Documents covering various aspects of the kernel
@@ -51,6 +58,17 @@ ABI will be found here.
sysfs-rules
+This is the beginning of a section with information of interest to
+application developers and system integrators doing analysis of the
+Linux kernel for safety critical applications. Documents supporting
+analysis of kernel interactions with applications, and key kernel
+subsystems expectations will be found here.
+
+.. toctree::
+ :maxdepth: 1
+
+ workload-tracing
+
The rest of this manual consists of various unordered guides on how to
configure specific aspects of kernel behavior to your liking.
@@ -78,6 +96,7 @@ configure specific aspects of kernel behavior to your liking.
edid
efi-stub
ext4
+ filesystem-monitoring
nfs/index
gpio/index
highuid
@@ -102,19 +121,21 @@ configure specific aspects of kernel behavior to your liking.
parport
perf-security
pm/index
+ pmf
pnp
rapidio
- ras
+ RAS/index
rtc
serial-console
svga
+ syscall-user-dispatch
sysrq
+ thermal/index
thunderbolt
ufs
unicode
vga-softcursor
video-output
- wimax/index
xfs
.. only:: subproject and html
diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
index 9b14b0c2c9c4..609a3201fd4e 100644
--- a/Documentation/admin-guide/iostats.rst
+++ b/Documentation/admin-guide/iostats.rst
@@ -76,7 +76,7 @@ Field 3 -- # of sectors read (unsigned long)
Field 4 -- # of milliseconds spent reading (unsigned int)
This is the total number of milliseconds spent by all reads (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 5 -- # of writes completed (unsigned long)
This is the total number of writes completed successfully.
@@ -89,7 +89,7 @@ Field 7 -- # of sectors written (unsigned long)
Field 8 -- # of milliseconds spent writing (unsigned int)
This is the total number of milliseconds spent by all writes (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 9 -- # of I/Os currently in progress (unsigned int)
The only field that should go to zero. Incremented as requests are
@@ -120,7 +120,7 @@ Field 14 -- # of sectors discarded (unsigned long)
Field 15 -- # of milliseconds spent discarding (unsigned int)
This is the total number of milliseconds spent by all discards (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 16 -- # of flush requests completed
This is the total number of flush requests completed successfully.
diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt
index 220d0a80ca2c..030de95e3e6b 100644
--- a/Documentation/admin-guide/kdump/gdbmacros.txt
+++ b/Documentation/admin-guide/kdump/gdbmacros.txt
@@ -170,57 +170,103 @@ document trapinfo
address the kernel panicked.
end
-define dump_log_idx
- set $idx = $arg0
- if ($argc > 1)
- set $prev_flags = $arg1
+define dump_record
+ set var $desc = $arg0
+ set var $info = $arg1
+ if ($argc > 2)
+ set var $prev_flags = $arg2
else
- set $prev_flags = 0
+ set var $prev_flags = 0
end
- set $msg = ((struct printk_log *) (log_buf + $idx))
- set $prefix = 1
- set $newline = 1
- set $log = log_buf + $idx + sizeof(*$msg)
-
- # prev & LOG_CONT && !(msg->flags & LOG_PREIX)
- if (($prev_flags & 8) && !($msg->flags & 4))
- set $prefix = 0
+
+ set var $prefix = 1
+ set var $newline = 1
+
+ set var $begin = $desc->text_blk_lpos.begin % (1U << prb->text_data_ring.size_bits)
+ set var $next = $desc->text_blk_lpos.next % (1U << prb->text_data_ring.size_bits)
+
+ # handle data-less record
+ if ($begin & 1)
+ set var $text_len = 0
+ set var $log = ""
+ else
+ # handle wrapping data block
+ if ($begin > $next)
+ set var $begin = 0
+ end
+
+ # skip over descriptor id
+ set var $begin = $begin + sizeof(long)
+
+ # handle truncated message
+ if ($next - $begin < $info->text_len)
+ set var $text_len = $next - $begin
+ else
+ set var $text_len = $info->text_len
+ end
+
+ set var $log = &prb->text_data_ring.data[$begin]
+ end
+
+ # prev & LOG_CONT && !(info->flags & LOG_PREFIX)
+ if (($prev_flags & 8) && !($info->flags & 4))
+ set var $prefix = 0
end
- # msg->flags & LOG_CONT
- if ($msg->flags & 8)
+ # info->flags & LOG_CONT
+ if ($info->flags & 8)
# (prev & LOG_CONT && !(prev & LOG_NEWLINE))
if (($prev_flags & 8) && !($prev_flags & 2))
- set $prefix = 0
+ set var $prefix = 0
end
- # (!(msg->flags & LOG_NEWLINE))
- if (!($msg->flags & 2))
- set $newline = 0
+ # (!(info->flags & LOG_NEWLINE))
+ if (!($info->flags & 2))
+ set var $newline = 0
end
end
if ($prefix)
- printf "[%5lu.%06lu] ", $msg->ts_nsec / 1000000000, $msg->ts_nsec % 1000000000
+ printf "[%5lu.%06lu] ", $info->ts_nsec / 1000000000, $info->ts_nsec % 1000000000
end
- if ($msg->text_len != 0)
- eval "printf \"%%%d.%ds\", $log", $msg->text_len, $msg->text_len
+ if ($text_len)
+ eval "printf \"%%%d.%ds\", $log", $text_len, $text_len
end
if ($newline)
printf "\n"
end
- if ($msg->dict_len > 0)
- set $dict = $log + $msg->text_len
- set $idx = 0
- set $line = 1
- while ($idx < $msg->dict_len)
- if ($line)
- printf " "
- set $line = 0
+
+ # handle dictionary data
+
+ set var $dict = &$info->dev_info.subsystem[0]
+ set var $dict_len = sizeof($info->dev_info.subsystem)
+ if ($dict[0] != '\0')
+ printf " SUBSYSTEM="
+ set var $idx = 0
+ while ($idx < $dict_len)
+ set var $c = $dict[$idx]
+ if ($c == '\0')
+ loop_break
+ else
+ if ($c < ' ' || $c >= 127 || $c == '\\')
+ printf "\\x%02x", $c
+ else
+ printf "%c", $c
+ end
end
- set $c = $dict[$idx]
+ set var $idx = $idx + 1
+ end
+ printf "\n"
+ end
+
+ set var $dict = &$info->dev_info.device[0]
+ set var $dict_len = sizeof($info->dev_info.device)
+ if ($dict[0] != '\0')
+ printf " DEVICE="
+ set var $idx = 0
+ while ($idx < $dict_len)
+ set var $c = $dict[$idx]
if ($c == '\0')
- printf "\n"
- set $line = 1
+ loop_break
else
if ($c < ' ' || $c >= 127 || $c == '\\')
printf "\\x%02x", $c
@@ -228,35 +274,48 @@ define dump_log_idx
printf "%c", $c
end
end
- set $idx = $idx + 1
+ set var $idx = $idx + 1
end
printf "\n"
end
end
-document dump_log_idx
- Dump a single log given its index in the log buffer. The first
- parameter is the index into log_buf, the second is optional and
- specified the previous log buffer's flags, used for properly
- formatting continued lines.
+document dump_record
+ Dump a single record. The first parameter is the descriptor,
+ the second parameter is the info, the third parameter is
+ optional and specifies the previous record's flags, used for
+ properly formatting continued lines.
end
define dmesg
- set $i = log_first_idx
- set $end_idx = log_first_idx
- set $prev_flags = 0
+ # definitions from kernel/printk/printk_ringbuffer.h
+ set var $desc_committed = 1
+ set var $desc_finalized = 2
+ set var $desc_sv_bits = sizeof(long) * 8
+ set var $desc_flags_shift = $desc_sv_bits - 2
+ set var $desc_flags_mask = 3 << $desc_flags_shift
+ set var $id_mask = ~$desc_flags_mask
+
+ set var $desc_count = 1U << prb->desc_ring.count_bits
+ set var $prev_flags = 0
+
+ set var $id = prb->desc_ring.tail_id.counter
+ set var $end_id = prb->desc_ring.head_id.counter
while (1)
- set $msg = ((struct printk_log *) (log_buf + $i))
- if ($msg->len == 0)
- set $i = 0
- else
- dump_log_idx $i $prev_flags
- set $i = $i + $msg->len
- set $prev_flags = $msg->flags
+ set var $desc = &prb->desc_ring.descs[$id % $desc_count]
+ set var $info = &prb->desc_ring.infos[$id % $desc_count]
+
+ # skip non-committed record
+ set var $state = 3 & ($desc->state_var.counter >> $desc_flags_shift)
+ if ($state == $desc_committed || $state == $desc_finalized)
+ dump_record $desc $info $prev_flags
+ set var $prev_flags = $info->flags
end
- if ($i == $end_idx)
+
+ if ($id == $end_id)
loop_break
end
+ set var $id = ($id + 1) & $id_mask
end
end
document dmesg
diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index 2da65fef2a1c..0302a93b1d40 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -2,7 +2,7 @@
Documentation for Kdump - The kexec-based Crash Dumping Solution
================================================================
-This document includes overview, setup and installation, and analysis
+This document includes overview, setup, installation, and analysis
information.
Overview
@@ -13,11 +13,11 @@ dump of the system kernel's memory needs to be taken (for example, when
the system panics). The system kernel's memory image is preserved across
the reboot and is accessible to the dump-capture kernel.
-You can use common commands, such as cp and scp, to copy the
-memory image to a dump file on the local disk, or across the network to
-a remote system.
+You can use common commands, such as cp, scp or makedumpfile, to copy
+the memory image to a dump file on the local disk, or across the network
+to a remote system.
-Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
+Kdump and kexec are currently supported on the x86, x86_64, ppc64,
s390x, arm and arm64 architectures.
When the system kernel boots, it reserves a small section of memory for
@@ -26,13 +26,15 @@ the dump-capture kernel. This ensures that ongoing Direct Memory Access
The kexec -p command loads the dump-capture kernel into this reserved
memory.
-On x86 machines, the first 640 KB of physical memory is needed to boot,
-regardless of where the kernel loads. Therefore, kexec backs up this
-region just before rebooting into the dump-capture kernel.
+On x86 machines, the first 640 KB of physical memory is needed for boot,
+regardless of where the kernel loads. For simpler handling, the whole
+low 1M is reserved to avoid any later kernel or device driver writing
+data into this area. This way, the low 1M can be reused as system RAM
+by the kdump kernel without extra handling.
-Similarly on PPC64 machines first 32KB of physical memory is needed for
-booting regardless of where the kernel is loaded and to support 64K page
-size kexec backs up the first 64KB memory.
+On PPC64 machines, the first 32KB of physical memory is needed for booting
+regardless of where the kernel is loaded, and to support a 64K page size
+kexec backs up the first 64KB of memory.
For s390x, when kdump is triggered, the crashkernel region is exchanged
with the region [0, crashkernel region size] and then the kdump kernel
@@ -46,14 +48,14 @@ passed to the dump-capture kernel through the elfcorehdr= boot
parameter. Optionally the size of the ELF header can also be passed
when using the elfcorehdr=[size[KMG]@]offset[KMG] syntax.
-
With the dump-capture kernel, you can access the memory image through
/proc/vmcore. This exports the dump as an ELF-format file that you can
-write out using file copy commands such as cp or scp. Further, you can
-use analysis tools such as the GNU Debugger (GDB) and the Crash tool to
-debug the dump file. This method ensures that the dump pages are correctly
-ordered.
-
+write out using file copy commands such as cp or scp. You can also use
+the makedumpfile utility to analyze and write out filtered contents with
+options; e.g. with '-d 31' it will only write out kernel data. Further,
+you can use analysis tools such as the GNU Debugger (GDB) and the Crash
+tool to debug the dump file. This method ensures that the dump pages are
+correctly ordered.
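+
+For example, a filtered dump of this kind could be written with a
+makedumpfile invocation such as the following (the options here are
+illustrative; see the makedumpfile example later in this document)::
+
+   makedumpfile -d 31 /proc/vmcore <dump-file>
+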
Setup and Installation
======================
@@ -111,7 +113,7 @@ There are two possible methods of using Kdump.
2) Or use the system kernel binary itself as dump-capture kernel and there is
no need to build a separate dump-capture kernel. This is possible
only with the architectures which support a relocatable kernel. As
- of today, i386, x86_64, ppc64, ia64, arm and arm64 architectures support
+ of today, i386, x86_64, ppc64, arm and arm64 architectures support
relocatable kernel.
Building a relocatable kernel is advantageous from the point of view that
@@ -125,9 +127,18 @@ dump-capture kernels for enabling kdump support.
System kernel config options
----------------------------
-1) Enable "kexec system call" in "Processor type and features."::
+1) Enable "kexec system call" or "kexec file based system call" in
+ "Processor type and features."::
+
+ CONFIG_KEXEC=y or CONFIG_KEXEC_FILE=y
+
+ Both of these options select KEXEC_CORE::
+
+ CONFIG_KEXEC_CORE=y
+
+ Subsequently, CRASH_CORE is selected by KEXEC_CORE::
- CONFIG_KEXEC=y
+ CONFIG_CRASH_CORE=y
2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo
filesystems." This is usually enabled by default::
@@ -135,9 +146,9 @@ System kernel config options
CONFIG_SYSFS=y
Note that "sysfs file system support" might not appear in the "Pseudo
- filesystems" menu if "Configure standard kernel features (for small
- systems)" is not enabled in "General Setup." In this case, check the
- .config file itself to ensure that sysfs is turned on, as follows::
+ filesystems" menu if "Configure standard kernel features (expert users)"
+ is not enabled in "General Setup." In this case, check the .config file
+ itself to ensure that sysfs is turned on, as follows::
grep 'CONFIG_SYSFS' .config
@@ -175,17 +186,17 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
CONFIG_HIGHMEM4G
-2) On i386 and x86_64, disable symmetric multi-processing support
- under "Processor type and features"::
+2) With CONFIG_SMP=y, nr_cpus=1 usually needs to be specified on the kernel
+   command line when loading the dump-capture kernel, because one
+   CPU is enough for the kdump kernel to dump the vmcore on most systems.
- CONFIG_SMP=n
+ However, you can also specify nr_cpus=X to enable multiple processors
+   in the kdump kernel.
- (If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line
- when loading the dump-capture kernel, see section "Load the Dump-capture
- Kernel".)
+   With CONFIG_SMP=n, the above does not apply.
-3) If one wants to build and use a relocatable kernel,
- Enable "Build a relocatable kernel" support under "Processor type and
+3) Building a relocatable kernel is recommended and is usually the default. If not,
+ enable "Build a relocatable kernel" support under "Processor type and
features"::
CONFIG_RELOCATABLE=y
@@ -223,28 +234,6 @@ Dump-capture kernel config options (Arch Dependent, ppc64)
Make and install the kernel and its modules.
-Dump-capture kernel config options (Arch Dependent, ia64)
-----------------------------------------------------------
-
-- No specific options are required to create a dump-capture kernel
- for ia64, other than those specified in the arch independent section
- above. This means that it is possible to use the system kernel
- as a dump-capture kernel if desired.
-
- The crashkernel region can be automatically placed by the system
- kernel at run time. This is done by specifying the base address as 0,
- or omitting it all together::
-
- crashkernel=256M@0
-
- or::
-
- crashkernel=256M
-
- If the start address is specified, note that the start address of the
- kernel will be aligned to 64Mb, so if the start address is not then
- any space below the alignment point will be wasted.
-
Dump-capture kernel config options (Arch Dependent, arm)
----------------------------------------------------------
@@ -260,54 +249,85 @@ Dump-capture kernel config options (Arch Dependent, arm64)
on non-VHE systems even if it is configured. This is because the CPU
will not be reset to EL2 on panic.
-Extended crashkernel syntax
+crashkernel syntax
===========================
+1) crashkernel=size@offset
-While the "crashkernel=size[@offset]" syntax is sufficient for most
-configurations, sometimes it's handy to have the reserved memory dependent
-on the value of System RAM -- that's mostly for distributors that pre-setup
-the kernel command line to avoid a unbootable system after some memory has
-been removed from the machine.
+ Here 'size' specifies how much memory to reserve for the dump-capture kernel
+ and 'offset' specifies the beginning of this reserved memory. For example,
+ "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
+ starting at physical address 0x01000000 (16MB) for the dump-capture kernel.
-The syntax is::
+ The crashkernel region can be automatically placed by the system
+ kernel at run time. This is done by specifying the base address as 0,
+ or omitting it all together::
- crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
- range=start-[end]
+ crashkernel=256M@0
-For example::
+ or::
- crashkernel=512M-2G:64M,2G-:128M
+ crashkernel=256M
-This would mean:
+ If the start address is specified, note that the start address of the
+ kernel will be aligned to an architecture-dependent value, so if the
+ given start address is not aligned, any space below the alignment point
+ will be wasted.
- 1) if the RAM is smaller than 512M, then don't reserve anything
- (this is the "rescue" case)
- 2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
- 3) if the RAM size is larger than 2G, then reserve 128M
+2) range1:size1[,range2:size2,...][@offset]
+ While the "crashkernel=size[@offset]" syntax is sufficient for most
+ configurations, sometimes it's handy to have the reserved memory dependent
+ on the value of System RAM -- that's mostly for distributors that pre-setup
+ the kernel command line to avoid an unbootable system after some memory has
+ been removed from the machine.
+ The syntax is::
-Boot into System Kernel
-=======================
+ crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
+ range=start-[end]
+
+ For example::
+
+ crashkernel=512M-2G:64M,2G-:128M
+
+ This would mean:
+ 1) if the RAM is smaller than 512M, then don't reserve anything
+ (this is the "rescue" case)
+ 2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
+ 3) if the RAM size is larger than 2G, then reserve 128M
+
+3) crashkernel=size,high and crashkernel=size,low
+
+ If memory above 4G is preferred, crashkernel=size,high can be used to
+ request it. With this option, physical memory may be allocated from the
+ top, so it can be above 4G if the system has more than 4G of RAM installed.
+ Otherwise, the memory region will be allocated below 4G if available.
+
+ When crashkernel=X,high is passed, the kernel may allocate the physical
+ memory region above 4G, but some low memory under 4G is still needed in
+ this case. There are three ways to get low memory (see the example after
+ this list):
+
+ 1) The kernel will automatically allocate at least 256M of memory below
+    4G if crashkernel=Y,low is not specified.
+ 2) The user can specify the low memory size instead.
+ 3) Specifying a value of 0 disables low memory allocation::
+
+ crashkernel=0,low
+
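+ For example, to reserve the bulk of the crash kernel memory above 4G
+ while still keeping a small dedicated low region, one could combine the
+ two options (the sizes here are purely illustrative)::
+
+       crashkernel=512M,high crashkernel=128M,low
+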
+Boot into System Kernel
+-----------------------
1) Update the boot loader (such as grub, yaboot, or lilo) configuration
files as necessary.
-2) Boot the system kernel with the boot parameter "crashkernel=Y@X",
- where Y specifies how much memory to reserve for the dump-capture kernel
- and X specifies the beginning of this reserved memory. For example,
- "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
- starting at physical address 0x01000000 (16MB) for the dump-capture kernel.
+2) Boot the system kernel with the boot parameter "crashkernel=Y@X".
- On x86 and x86_64, use "crashkernel=64M@16M".
+ On x86 and x86_64, use "crashkernel=Y[@X]". Most of the time the
+ start address 'X' is not necessary, because the kernel will search for
+ a suitable area, unless an explicit start address is required.
On ppc64, use "crashkernel=128M@32M".
- On ia64, 256M@256M is a generous value that typically works.
- The region may be automatically placed on ia64, see the
- dump-capture kernel config option notes above.
- If use sparse memory, the size should be rounded to GRANULE boundaries.
-
On s390x, typically use "crashkernel=xxM". The value of xx is dependent
on the memory consumption of the kdump system. In general this is not
dependent on the memory size of the production system.
@@ -331,17 +351,13 @@ of dump-capture kernel. Following is the summary.
For i386 and x86_64:
- - Use vmlinux if kernel is not relocatable.
- Use bzImage/vmlinuz if kernel is relocatable.
+ - Use vmlinux if kernel is not relocatable.
For ppc64:
- Use vmlinux
-For ia64:
-
- - Use vmlinux or vmlinuz.gz
-
For s390x:
- Use image or bzImage
@@ -383,16 +399,12 @@ to load dump-capture kernel::
--initrd=<initrd-for-dump-capture-kernel> \
--append="root=<root-dev> <arch-specific-options>"
-Please note, that --args-linux does not need to be specified for ia64.
-It is planned to make this a no-op on that architecture, but for now
-it should be omitted
-
Following are the arch specific command line options to be used while
loading dump-capture kernel.
-For i386, x86_64 and ia64:
+For i386 and x86_64:
- "1 irqpoll maxcpus=1 reset_devices"
+ "1 irqpoll nr_cpus=1 reset_devices"
For ppc64:
@@ -400,7 +412,7 @@ For ppc64:
For s390x:
- "1 maxcpus=1 cgroup_disable=memory"
+ "1 nr_cpus=1 cgroup_disable=memory"
For arm:
@@ -408,7 +420,7 @@ For arm:
For arm64:
- "1 maxcpus=1 reset_devices"
+ "1 nr_cpus=1 reset_devices"
Notes on loading the dump-capture kernel:
@@ -440,8 +452,7 @@ Notes on loading the dump-capture kernel:
to use multi-thread programs with it, such as parallel dump feature of
makedumpfile. Otherwise, the multi-thread program may have a great
performance degradation. To enable multi-cpu support, you should bring up an
- SMP dump-capture kernel and specify maxcpus/nr_cpus, disable_cpu_apicid=[X]
- options while loading it.
+ SMP dump-capture kernel and specify maxcpus/nr_cpus options while loading it.
* For s390x there are two kdump modes: If a ELF header is specified with
the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
@@ -488,6 +499,14 @@ the following command::
cp /proc/vmcore <dump-file>
+or use scp to write out the dump file between hosts on a network, e.g.::
+
+ scp /proc/vmcore remote_username@remote_ip:<dump-file>
+
+You can also use the makedumpfile utility to write out the dump file
+with specified options to filter out unwanted contents, e.g.::
+
+ makedumpfile -l --message-level 1 -d 31 /proc/vmcore <dump-file>
Analysis
========
@@ -509,9 +528,12 @@ ELF32-format headers using the --elf32-core-headers kernel option on the
dump kernel.
You can also use the Crash utility to analyze dump files in Kdump
-format. Crash is available on Dave Anderson's site at the following URL:
+format. Crash is available at the following URL:
+
+ https://github.com/crash-utility/crash
- http://people.redhat.com/~anderson/
+The Crash documentation can be found at:
+
+  https://crash-utility.github.io/
Trigger Kdump on WARN()
=======================
@@ -532,8 +554,7 @@ This will cause a kdump to occur at the add_taint()->panic() call.
Contact
=======
-- Vivek Goyal (vgoyal@redhat.com)
-- Maneesh Soni (maneesh@in.ibm.com)
+- kexec@lists.infradead.org
GDB macros
==========
diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index e4ee8b2db604..0f714fc945ac 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -39,6 +39,12 @@ call.
User-space tools can get the kernel name, host name, kernel release
number, kernel version, architecture name and OS type from it.
+(uts_namespace, name)
+---------------------
+
+Offset of uts_namespace's name member. Crash Utility and Makedumpfile get
+the start address of init_uts_ns.name from this.
+
node_online_map
---------------
@@ -59,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
the kernel start address. Used to convert a virtual address from the
direct kernel map to a physical address.
-vmap_area_list
---------------
+VMALLOC_START
+-------------
-Stores the virtual area list. makedumpfile gets the vmalloc start value
-from this variable and its value is necessary for vmalloc translation.
+Stores the base address of the vmalloc area. makedumpfile gets this value,
+as it is necessary for vmalloc translation.
mem_map
-------
@@ -93,6 +99,11 @@ It exists in the sparse memory mapping model, and it is also somewhat
similar to the mem_map variable, both of them are used to translate an
address.
+MAX_PHYSMEM_BITS
+----------------
+
+Defines the maximum supported physical address space size.
+
page
----
@@ -130,8 +141,8 @@ nodemask_t
The size of a nodemask_t type. Used to compute the number of online
nodes.
-(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|compound_order|compound_head)
--------------------------------------------------------------------------------------------------
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head)
+----------------------------------------------------------------------------------
User-space tools compute their values based on the offset of these
variables. The variables are used when excluding unnecessary pages.
@@ -161,7 +172,7 @@ variables.
Offset of the free_list's member. This value is used to compute the number
of free pages.
-Each zone has a free_area structure array called free_area[MAX_ORDER].
+Each zone has a free_area structure array called free_area[NR_PAGE_ORDERS].
The free_list represents a linked list of free page blocks.
(list_head, next|prev)
@@ -178,56 +189,129 @@ Offsets of the vmap_area's members. They carry vmalloc-specific
information. Makedumpfile gets the start address of the vmalloc region
from this.
-(zone.free_area, MAX_ORDER)
----------------------------
+(zone.free_area, NR_PAGE_ORDERS)
+--------------------------------
Free areas descriptor. User-space tools use this value to iterate the
-free_area ranges. MAX_ORDER is used by the zone buddy allocator.
+free_area ranges. NR_PAGE_ORDERS is used by the zone buddy allocator.
+
+prb
+---
+
+A pointer to the printk ringbuffer (struct printk_ringbuffer). This
+may be pointing to the static boot ringbuffer or the dynamically
+allocated ringbuffer, depending on when the core dump occurred.
+Used by user-space tools to read the active kernel log buffer.
+
+printk_rb_static
+----------------
-log_first_idx
+A pointer to the static boot printk ringbuffer. If @prb has a
+different value, this is useful for viewing the initial boot messages,
+which may have been overwritten in the dynamically allocated
+ringbuffer.
+
+clear_seq
+---------
+
+The sequence number of the printk() record after the last clear
+command. It indicates the first record after the last
+SYSLOG_ACTION_CLEAR, such as issued by 'dmesg -c'. Used by user-space
+tools to dump a subset of the dmesg log.
+
+printk_ringbuffer
+-----------------
+
+The size of a printk_ringbuffer structure. This structure contains all
+information required for accessing the various components of the
+kernel log buffer.
+
+(printk_ringbuffer, desc_ring|text_data_ring|dict_data_ring|fail)
+-----------------------------------------------------------------
+
+Offsets for the various components of the printk ringbuffer. Used by
+user-space tools to view the kernel log buffer without requiring the
+declaration of the structure.
+
+prb_desc_ring
-------------
-Index of the first record stored in the buffer log_buf. Used by
-user-space tools to read the strings in the log_buf.
+The size of the prb_desc_ring structure. This structure contains
+information about the set of record descriptors.
-log_buf
--------
+(prb_desc_ring, count_bits|descs|head_id|tail_id)
+-------------------------------------------------
-Console output is written to the ring buffer log_buf at index
-log_first_idx. Used to get the kernel log.
+Offsets for the fields describing the set of record descriptors. Used
+by user-space tools to be able to traverse the descriptors without
+requiring the declaration of the structure.
-log_buf_len
+prb_desc
+--------
+
+The size of the prb_desc structure. This structure contains
+information about a single record descriptor.
+
+(prb_desc, info|state_var|text_blk_lpos|dict_blk_lpos)
+------------------------------------------------------
+
+Offsets for the fields describing a record descriptor. Used by
+user-space tools to be able to read descriptors without requiring
+the declaration of the structure.
+
+prb_data_blk_lpos
+-----------------
+
+The size of the prb_data_blk_lpos structure. This structure contains
+information about where the text or dictionary data (data block) is
+located within the respective data ring.
+
+(prb_data_blk_lpos, begin|next)
+-------------------------------
+
+Offsets for the fields describing the location of a data block. Used
+by user-space tools to be able to locate data blocks without
+requiring the declaration of the structure.
+
+printk_info
-----------
-log_buf's length.
+The size of the printk_info structure. This structure contains all
+the meta-data for a record.
-clear_idx
----------
+(printk_info, seq|ts_nsec|text_len|dict_len|caller_id)
+------------------------------------------------------
-The index that the next printk() record to read after the last clear
-command. It indicates the first record after the last SYSLOG_ACTION
-_CLEAR, like issued by 'dmesg -c'. Used by user-space tools to dump
-the dmesg log.
+Offsets for the fields providing the meta-data for a record. Used by
+user-space tools to be able to read the information without requiring
+the declaration of the structure.
-log_next_idx
-------------
+prb_data_ring
+-------------
-The index of the next record to store in the buffer log_buf. Used to
-compute the index of the current buffer position.
+The size of the prb_data_ring structure. This structure contains
+information about a set of data blocks.
-printk_log
-----------
+(prb_data_ring, size_bits|data|head_lpos|tail_lpos)
+---------------------------------------------------
+
+Offsets for the fields describing a set of data blocks. Used by
+user-space tools to be able to access the data blocks without
+requiring the declaration of the structure.
-The size of a structure printk_log. Used to compute the size of
-messages, and extract dmesg log. It encapsulates header information for
-log_buf, such as timestamp, syslog level, etc.
+atomic_long_t
+-------------
+
+The size of the atomic_long_t structure. Used by user-space tools to
+be able to copy the full structure, regardless of its
+architecture-specific implementation.
-(printk_log, ts_nsec|len|text_len|dict_len)
--------------------------------------------
+(atomic_long_t, counter)
+------------------------
-It represents field offsets in struct printk_log. User space tools
-parse it and check whether the values of printk_log's members have been
-changed.
+Offset for the long value of an atomic_long_t variable. Used by
+user-space tools to access the long value without requiring the
+architecture-specific declaration.
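+
+As an illustration only (not part of the exported VMCOREINFO data), a
+user-space tool that has parsed the size and offset above could read the
+long value of an atomic_long_t roughly as follows, where read_vmcore()
+is a hypothetical helper that copies bytes out of the dump::
+
+	unsigned long read_atomic_long(unsigned long long addr,
+				       unsigned long counter_offset)
+	{
+		unsigned long val;
+
+		/* addr is the address of the atomic_long_t in the dump */
+		read_vmcore(addr + counter_offset, &val, sizeof(val));
+		return val;
+	}
+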
(free_area.free_list, MIGRATE_TYPES)
------------------------------------
@@ -241,8 +325,8 @@ NR_FREE_PAGES
On linux-2.6.21 or later, the number of free pages is in
vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
-PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask
-------------------------------------------------------------------------------
+PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoison|PG_head_mask|PG_hugetlb
+-----------------------------------------------------------------------------------------
Page attributes. These flags are used to filter various unnecessary for
dumping pages.
@@ -254,12 +338,6 @@ More page attributes. These flags are used to filter various unnecessary for
dumping pages.
-HUGETLB_PAGE_DTOR
------------------
-
-The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
-excludes these pages.
-
x86_64
======
@@ -335,36 +413,6 @@ of a higher page table lookup overhead, and also consumes more page
table space per process. Used to check whether PAE was enabled in the
crash kernel when converting virtual addresses to physical addresses.
-ia64
-====
-
-pgdat_list|(pgdat_list, MAX_NUMNODES)
--------------------------------------
-
-pg_data_t array storing all NUMA nodes information. MAX_NUMNODES
-indicates the number of the nodes.
-
-node_memblk|(node_memblk, NR_NODE_MEMBLKS)
-------------------------------------------
-
-List of node memory chunks. Filled when parsing the SRAT table to obtain
-information about memory nodes. NR_NODE_MEMBLKS indicates the number of
-node memory chunks.
-
-These values are used to compute the number of nodes the crashed kernel used.
-
-node_memblk_s|(node_memblk_s, start_paddr)|(node_memblk_s, size)
-----------------------------------------------------------------
-
-The size of a struct node_memblk_s and the offsets of the
-node_memblk_s's members. Used to compute the number of nodes.
-
-PGTABLE_3|PGTABLE_4
--------------------
-
-User-space tools need to know whether the crash kernel was in 3-level or
-4-level paging mode. Used to distinguish the page table.
-
ARM64
=====
@@ -399,6 +447,25 @@ KERNELPACMASK
The mask to extract the Pointer Authentication Code from a kernel virtual
address.
+TCR_EL1.T1SZ
+------------
+
+Indicates the size offset of the memory region addressed by TTBR1_EL1.
+The region size is 2^(64-T1SZ) bytes.
+
+TTBR1_EL1 is the translation table base register specified by the ARMv8-A
+architecture, used to look up the page tables for virtual
+addresses in the higher VA range (refer to the ARMv8 ARM document for
+more details).
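+
+For instance (illustrative value), with T1SZ = 16::
+
+	region size = 2^(64 - 16) = 2^48 bytes (a 48-bit VA range)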
+
+MODULES_VADDR|MODULES_END|VMALLOC_START|VMALLOC_END|VMEMMAP_START|VMEMMAP_END
+-----------------------------------------------------------------------------
+
+Used to get the correct ranges:
+ MODULES_VADDR ~ MODULES_END-1 : Kernel module space.
+ VMALLOC_START ~ VMALLOC_END-1 : vmalloc() / ioremap() space.
+ VMEMMAP_START ~ VMEMMAP_END-1 : vmemmap region, used for struct page array.
+
arm
===
@@ -492,3 +559,38 @@ X2TLB
-----
Indicates whether the crashed kernel enabled SH extended mode.
+
+RISCV64
+=======
+
+VA_BITS
+-------
+
+The maximum number of bits for virtual addresses. Used to compute the
+virtual memory ranges.
+
+PAGE_OFFSET
+-----------
+
+Indicates the virtual kernel start address of the direct-mapped RAM region.
+
+phys_ram_base
+-------------
+
+Indicates the start physical RAM address.
+
+MODULES_VADDR|MODULES_END|VMALLOC_START|VMALLOC_END|VMEMMAP_START|VMEMMAP_END|KERNEL_LINK_ADDR
+----------------------------------------------------------------------------------------------
+
+Used to get the correct ranges:
+
+ * MODULES_VADDR ~ MODULES_END : Kernel module space.
+ * VMALLOC_START ~ VMALLOC_END : vmalloc() / ioremap() space.
+ * VMEMMAP_START ~ VMEMMAP_END : vmemmap space, used for struct page array.
+ * KERNEL_LINK_ADDR : start address of Kernel link and BPF
+
+va_kernel_pa_offset
+-------------------
+
+Indicates the offset between the kernel virtual and physical mappings.
+Used to translate virtual to physical addresses.
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index 6d421694d98e..e8bdf5e86a9b 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -3,8 +3,8 @@
The kernel's command-line parameters
====================================
-The following is a consolidated list of the kernel parameters as
-implemented by the __setup(), core_param() and module_param() macros
+The following is a consolidated list of the kernel parameters as implemented
+by the __setup(), early_param(), core_param() and module_param() macros
and sorted into English Dictionary order (defined as ignoring all
punctuation and sorting digits before letters in a case insensitive
manner), and with descriptions where known.
@@ -60,7 +60,7 @@ Note that for the special case of a range one can split the range into equal
sized groups and for each group use some amount from the beginning of that
group:
- <cpu number>-cpu number>:<used size>/<group size>
+ <cpu number>-<cpu number>:<used size>/<group size>
For example one can add to the command line following parameter:
@@ -68,7 +68,19 @@ For example one can add to the command line following parameter:
where the final item represents CPUs 100,101,125,126,150,151,...
+The value "N" can be used to represent the numerically last CPU on the system,
+i.e. "foo_cpus=16-N" would be equivalent to "16-31" on a 32-core system.
+Keep in mind that "N" is dynamic, so if system changes cause the bitmap width
+to change, such as fewer cores in the CPU list, then N and any ranges using N
+will also change. Using the same input on a small 4-core system, "16-N"
+becomes "16-3", and the same boot input will now be flagged as invalid
+(start > end).
+
+The special, case-insensitive group name "all" selects all CPUs, so that
+"nohz_full=all" is equivalent to "nohz_full=0-N".
+
+The semantics of "N" and "all" are implemented at the bitmap level and hold
+for all users of bitmap_parselist(), as the sketch below illustrates.
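+
+As a rough in-kernel sketch (the wrapper function below is hypothetical;
+only bitmap_parselist() itself is an existing kernel API), such a CPU
+list could be parsed with::
+
+	#include <linux/bitmap.h>
+	#include <linux/cpumask.h>
+
+	static int example_parse_cpus(const char *buf, struct cpumask *mask)
+	{
+		/* "N" and "all" are resolved against nr_cpu_ids, the current
+		 * bitmap width, so the result depends on the running system. */
+		return bitmap_parselist(buf, cpumask_bits(mask), nr_cpu_ids);
+	}
+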
This document may not be entirely up to date and comprehensive. The command
"modinfo -p ${modulename}" shows a current list of all parameters of a loadable
@@ -77,16 +89,18 @@ reveal their parameters in /sys/module/${modulename}/parameters/. Some of these
parameters may be changed at runtime by the command
``echo -n ${value} > /sys/module/${modulename}/parameters/${parm}``.
-The parameters listed below are only valid if certain kernel build options were
-enabled and if respective hardware is present. The text in square brackets at
-the beginning of each description states the restrictions within which a
-parameter is applicable::
+The parameters listed below are only valid if certain kernel build options
+were enabled and if the respective hardware is present. This list should be kept
+in alphabetical order. The text in square brackets at the beginning
+of each description states the restrictions within which a parameter
+is applicable::
ACPI ACPI support is enabled.
AGP AGP (Accelerated Graphics Port) is enabled.
ALSA ALSA sound support is enabled.
APIC APIC support is enabled.
APM Advanced Power Management support is enabled.
+ APPARMOR AppArmor support is enabled.
ARM ARM architecture is enabled.
ARM64 ARM64 architecture is enabled.
AX25 Appropriate AX.25 support is enabled.
@@ -94,17 +108,18 @@ parameter is applicable::
CMA Contiguous Memory Area support is enabled.
DRM Direct Rendering Management support is enabled.
DYNAMIC_DEBUG Build in debug messages and enable them at runtime
+ EARLY Parameter processed too early to be embedded in initrd.
EDD BIOS Enhanced Disk Drive Services (EDD) is enabled
EFI EFI Partitioning (GPT) is enabled
- EIDE EIDE/ATAPI support is enabled.
EVM Extended Verification Module
FB The frame buffer device is enabled.
FTRACE Function tracing enabled.
GCOV GCOV profiling is enabled.
+ HIBERNATION HIBERNATION is enabled.
HW Appropriate hardware is enabled.
+ HYPER_V HYPERV support is enabled.
IA-64 IA-64 architecture is enabled.
IMA Integrity measurement architecture is enabled.
- IOSCHED More than one I/O scheduler is enabled.
IP_PNP IP DHCP, BOOTP, or RARP is enabled.
IPV6 IPv6 support is enabled.
ISAPNP ISA PnP code is enabled.
@@ -114,23 +129,21 @@ parameter is applicable::
KGDB Kernel debugger support is enabled.
KVM Kernel Virtual Machine support is enabled.
LIBATA Libata driver is enabled
- LP Printer support is enabled.
+ LOONGARCH LoongArch architecture is enabled.
LOOP Loopback device support is enabled.
+ LP Printer support is enabled.
M68k M68k architecture is enabled.
These options have more detailed description inside of
- Documentation/m68k/kernel-options.rst.
+ Documentation/arch/m68k/kernel-options.rst.
MDA MDA console support is enabled.
MIPS MIPS architecture is enabled.
MOUSE Appropriate mouse support is enabled.
MSI Message Signaled Interrupts (PCI).
MTD MTD (Memory Technology Device) support is enabled.
NET Appropriate network support is enabled.
- NUMA NUMA support is enabled.
NFS Appropriate NFS support is enabled.
+ NUMA NUMA support is enabled.
OF Devicetree is enabled.
- OSS OSS sound support is enabled.
- PV_OPS A paravirtualized kernel is enabled.
- PARIDE The ParIDE (parallel port IDE) subsystem is enabled.
PARISC The PA-RISC architecture is enabled.
PCI PCI bus support is enabled.
PCIE PCI Express support is enabled.
@@ -139,53 +152,53 @@ parameter is applicable::
PPC PowerPC architecture is enabled.
PPT Parallel port support is enabled.
PS2 Appropriate PS/2 support is enabled.
+ PV_OPS A paravirtualized kernel is enabled.
RAM RAM disk support is enabled.
RDT Intel Resource Director Technology.
+ RISCV RISCV architecture is enabled.
S390 S390 architecture is enabled.
SCSI Appropriate SCSI support is enabled.
A lot of drivers have their options described inside
the Documentation/scsi/ sub-directory.
SECURITY Different security models are enabled.
SELINUX SELinux support is enabled.
- APPARMOR AppArmor support is enabled.
SERIAL Serial support is enabled.
SH SuperH architecture is enabled.
SMP The kernel is an SMP kernel.
SPARC Sparc architecture is enabled.
- SWSUSP Software suspend (hibernation) is enabled.
SUSPEND System suspend states are enabled.
+ SWSUSP Software suspend (hibernation) is enabled.
TPM TPM drivers are enabled.
- TS Appropriate touchscreen support is enabled.
UMS USB Mass Storage support is enabled.
USB USB support is enabled.
USBHID USB Human Interface Device support is enabled.
V4L Video For Linux support is enabled.
- VMMIO Driver for memory mapped virtio devices is enabled.
VGA The VGA console has been enabled.
+ VMMIO Driver for memory mapped virtio devices is enabled.
VT Virtual terminal support is enabled.
WDT Watchdog support is enabled.
- XT IBM PC/XT MFM hard disk support is enabled.
X86-32 X86-32, aka i386 architecture is enabled.
X86-64 X86-64 architecture is enabled.
More X86-64 boot options can be found in
- Documentation/x86/x86_64/boot-options.rst.
+ Documentation/arch/x86/x86_64/boot-options.rst.
X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64)
X86_UV SGI UV support is enabled.
XEN Xen support is enabled
+ XTENSA xtensa architecture is enabled.
In addition, the following text indicates that the option::
+ BOOT Is a boot loader parameter.
BUGS= Relates to possible processor bugs on the said processor.
KNL Is a kernel start-up parameter.
- BOOT Is a boot loader parameter.
Parameters denoted with BOOT are actually interpreted by the boot
loader, and have no meaning to the kernel directly.
Do not modify the syntax of boot loader parameters without extreme
-need or coordination with <Documentation/x86/boot.rst>.
+need or coordination with <Documentation/arch/x86/boot.rst>.
There are also arch-specific kernel-parameters not documented here.
-See for example <Documentation/x86/x86_64/boot-options.rst>.
+See for example <Documentation/arch/x86/x86_64/boot-options.rst>.
Note that ALL kernel parameters listed below are CASE SENSITIVE, and that
a trailing = on the name of any parameter states that that parameter will
@@ -197,7 +210,7 @@ The number of kernel parameters is not limited, but the length of the
complete command line (parameters including spaces etc.) is limited to
a fixed number of characters. This limit depends on the architecture
and is between 256 and 4096 characters. It is defined in the file
-./include/asm/setup.h as COMMAND_LINE_SIZE.
+./include/uapi/asm-generic/setup.h as COMMAND_LINE_SIZE.
Finally, the [KMG] suffix is commonly described after a number of kernel
parameter values. These 'K', 'M', and 'G' letters represent the _binary_
@@ -206,8 +219,3 @@ bytes respectively. Such letter suffixes can also be entirely omitted:
.. include:: kernel-parameters.txt
:literal:
-
-Todo
-----
-
- Add more DRM drivers.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..bb884c14b2f6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1,21 +1,32 @@
- acpi= [HW,ACPI,X86,ARM64]
+ accept_memory= [MM]
+ Format: { eager | lazy }
+ default: lazy
+ By default, unaccepted memory is accepted lazily to
+ avoid prolonged boot times. The lazy option will add
+ some runtime overhead until all memory is eventually
+ accepted. In most cases the overhead is negligible.
+ For some workloads or for debugging purposes
+ accept_memory=eager can be used to accept all memory
+ at once during boot.
+
+ acpi= [HW,ACPI,X86,ARM64,RISCV64,EARLY]
Advanced Configuration and Power Interface
Format: { force | on | off | strict | noirq | rsdt |
copy_dsdt }
force -- enable ACPI if default was off
- on -- enable ACPI but allow fallback to DT [arm64]
+ on -- enable ACPI but allow fallback to DT [arm64,riscv64]
off -- disable ACPI if default was on
noirq -- do not use ACPI for IRQ routing
strict -- Be less tolerant of platforms that are not
strictly ACPI specification compliant.
rsdt -- prefer RSDT over (default) XSDT
copy_dsdt -- copy DSDT to memory
- For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force"
- are available
+ For ARM64 and RISCV64, ONLY "acpi=off", "acpi=on" or
+ "acpi=force" are available
See also Documentation/power/runtime_pm.rst, pci=noacpi
- acpi_apic_instance= [ACPI, IOAPIC]
+ acpi_apic_instance= [ACPI,IOAPIC,EARLY]
Format: <int>
2: use 2nd APIC table, if available
1,0: use 1st APIC table
@@ -30,7 +41,7 @@
If set to native, use the device's native backlight mode.
If set to none, disable the ACPI backlight interface.
- acpi_force_32bit_fadt_addr
+ acpi_force_32bit_fadt_addr [ACPI,EARLY]
force FADT to use 32 bit addresses rather than the
64 bit X_* addresses. Some firmware have broken 64
bit addresses for force ACPI ignore these and use
@@ -50,7 +61,7 @@
CONFIG_ACPI_DEBUG must be enabled to produce any ACPI
debug output. Bits in debug_layer correspond to a
_COMPONENT in an ACPI source file, e.g.,
- #define _COMPONENT ACPI_PCI_COMPONENT
+ #define _COMPONENT ACPI_EVENTS
Bits in debug_level correspond to a level in
ACPI_DEBUG_PRINT statements, e.g.,
ACPI_DEBUG_PRINT((ACPI_DB_INFO, ...
@@ -60,8 +71,6 @@
Enable processor driver info messages:
acpi.debug_layer=0x20000000
- Enable PCI/PCI interrupt routing info messages:
- acpi.debug_layer=0x400000
Enable AML "Debug" output, i.e., stores to the Debug
object while interpreting AML:
acpi.debug_layer=0xffffffff acpi.debug_level=0x2
@@ -88,7 +97,7 @@
no: ACPI OperationRegions are not marked as reserved,
no further checks are performed.
- acpi_force_table_verification [HW,ACPI]
+ acpi_force_table_verification [HW,ACPI,EARLY]
Enable table checksum verification during early stage.
By default, this is disabled due to x86 early mapping
size limitation.
@@ -115,7 +124,7 @@
the GPE dispatcher.
This facility can be used to prevent such uncontrolled
GPE floodings.
- Format: <byte>
+ Format: <byte> or <bitmap-list>
acpi_no_auto_serialize [HW,ACPI]
Disable auto-serialization of AML methods
@@ -128,7 +137,7 @@
acpi_no_memhotplug [ACPI] Disable memory hotplug. Useful for kdump
kernels.
- acpi_no_static_ssdt [HW,ACPI]
+ acpi_no_static_ssdt [HW,ACPI,EARLY]
Disable installation of static SSDTs at early boot time
By default, SSDTs contained in the RSDT/XSDT will be
installed automatically and they will appear under
@@ -142,7 +151,7 @@
Ignore the ACPI-based watchdog interface (WDAT) and let
a native driver control the watchdog device instead.
- acpi_rsdp= [ACPI,EFI,KEXEC]
+ acpi_rsdp= [ACPI,EFI,KEXEC,EARLY]
Pass the RSDP address to the kernel, mostly used
on machines running EFI runtime service to boot the
second kernel for kdump.
@@ -219,22 +228,31 @@
to assume that this machine's pmtimer latches its value
and always returns good values.
- acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode
+ acpi_sci= [HW,ACPI,EARLY] ACPI System Control Interrupt trigger mode
Format: { level | edge | high | low }
- acpi_skip_timer_override [HW,ACPI]
+ acpi_skip_timer_override [HW,ACPI,EARLY]
Recognize and ignore IRQ0/pin2 Interrupt Override.
For broken nForce2 BIOS resulting in XT-PIC timer.
acpi_sleep= [HW,ACPI] Sleep options
- Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig,
- old_ordering, nonvs, sci_force_enable, nobl }
+ Format: { s3_bios, s3_mode, s3_beep, s4_hwsig,
+ s4_nohwsig, old_ordering, nonvs,
+ sci_force_enable, nobl }
See Documentation/power/video.rst for information on
s3_bios and s3_mode.
s3_beep is for debugging; it makes the PC's speaker beep
as soon as the kernel's real-mode entry point is called.
+ s4_hwsig causes the kernel to check the ACPI hardware
+ signature during resume from hibernation, and gracefully
+ refuse to resume if it has changed. This complies with
+ the ACPI specification but not with reality, since
+ Windows does not do this and many laptops do change it
+ on docking. So the default behaviour is to allow resume
+ and simply warn when the signature changes, unless the
+ s4_hwsig option is enabled.
s4_nohwsig prevents ACPI hardware signature from being
- used during resume from hibernation.
+ used (or even warned about) during resume.
old_ordering causes the ACPI 1.0 ordering of the _PTS
control method, with respect to putting devices into
low power states, to be enforced (the ACPI 2.0 ordering
@@ -248,11 +266,11 @@
behave incorrectly in some ways with respect to system
suspend and resume to be ignored (use wisely).
- acpi_use_timer_override [HW,ACPI]
+ acpi_use_timer_override [HW,ACPI,EARLY]
Use timer override. For some broken Nvidia NF5 boards
that require a timer override, but don't have HPET
- add_efi_memmap [EFI; X86] Include EFI memory map in
+ add_efi_memmap [EFI,X86,EARLY] Include EFI memory map in
kernel's map of available physical RAM.
agp= [AGP]
@@ -289,13 +307,21 @@
do not want to use tracing_snapshot_alloc() as it needs
to be done where GFP_KERNEL allocations are allowed.
+ allow_mismatched_32bit_el0 [ARM64,EARLY]
+ Allow execve() of 32-bit applications and setting of the
+ PER_LINUX32 personality on systems where only a strict
+ subset of the CPUs support 32-bit EL0. When this
+ parameter is present, the set of CPUs supporting 32-bit
+ EL0 is indicated by /sys/devices/system/cpu/aarch32_el0
+ and hot-unplug operations may be restricted.
+
+ See Documentation/arch/arm64/asymmetric-32bit.rst for more
+ information.
+
amd_iommu= [HW,X86-64]
Pass parameters to the AMD IOMMU driver in the system.
Possible values are:
- fullflush - enable flushing of IO/TLB entries when
- they are unmapped. Otherwise they are
- flushed before they will be reused, which
- is a lot of faster
+ fullflush - Deprecated, equivalent to iommu.strict=1
off - do not initialize any AMD IOMMU found in
the system
force_isolation - Force device isolation for all
@@ -303,6 +329,12 @@
allowed anymore to lift isolation
requirements as needed. This option
does not override iommu=pt
+ force_enable - Force enable the IOMMU on platforms known
+ to be buggy with IOMMU enabled. Use this
+ option with care.
+ pgtbl_v1 - Use v1 page table for DMA-API (Default).
+ pgtbl_v2 - Use v2 page table for DMA-API.
+ irtcachedis - Disable Interrupt Remapping Table (IRT) caching.
amd_iommu_dump= [HW,X86-64]
Enable AMD IOMMU driver option to dump the ACPI table
@@ -319,6 +351,34 @@
This mode requires kvm-amd.avic=1.
(Default when IOMMU HW support is present.)
+ amd_pstate= [X86,EARLY]
+ disable
+ Do not enable amd_pstate as the default
+ scaling driver for the supported processors
+ passive
+ Use amd_pstate with passive mode as a scaling driver.
+ In this mode autonomous selection is disabled.
+ The driver requests a desired performance level and the
+ platform tries to match that level if it is satisfied by
+ the guaranteed performance level.
+ active
+ Use the amd_pstate_epp driver instance as the scaling driver.
+ The driver provides a hint to the CPPC firmware indicating
+ whether software wants to bias toward performance (0x0) or
+ energy efficiency (0xff). The CPPC power algorithm will then
+ calculate the runtime workload and adjust the real-time core
+ frequency.
+ guided
+ Activate guided autonomous mode. Driver requests minimum and
+ maximum performance level and the platform autonomously
+ selects a performance level in this range and appropriate
+ to the current workload.
+
+ amd_prefcore=
+ [X86]
+ disable
+ Disable amd-pstate preferred core.
+
amijoy.map= [HW,JOY] Amiga joystick support
Map of devices attached to JOY0DAT and JOY1DAT
Format: <a>,<b>
@@ -336,7 +396,7 @@
not play well with APC CPU idle - disable it if you have
APC and your system crashes randomly.
- apic= [APIC,X86] Advanced Programmable Interrupt Controller
+ apic= [APIC,X86,EARLY] Advanced Programmable Interrupt Controller
Change the output verbosity while booting
Format: { quiet (default) | verbose | debug }
Change the amount of debugging information output
@@ -346,7 +406,7 @@
Format: apic=driver_name
Examples: apic=bigsmp
- apic_extnmi= [APIC,X86] External NMI delivery setting
+ apic_extnmi= [APIC,X86,EARLY] External NMI delivery setting
Format: { bsp (default) | all | none }
bsp: External NMI is delivered only to CPU 0
all: External NMIs are broadcast to all CPUs as a
@@ -358,21 +418,37 @@
autoconf= [IPV6]
See Documentation/networking/ipv6.rst.
- show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller
- Limit apic dumping. The parameter defines the maximal
- number of local apics being dumped. Also it is possible
- to set it to "all" by meaning -- no limit here.
- Format: { 1 (default) | 2 | ... | all }.
- The parameter valid if only apic=debug or
- apic=verbose is specified.
- Example: apic=debug show_lapic=all
-
apm= [APM] Advanced Power Management
See header of arch/x86/kernel/apm_32.c.
+ apparmor= [APPARMOR] Disable or enable AppArmor at boot time
+ Format: { "0" | "1" }
+ See security/apparmor/Kconfig help text
+ 0 -- disable.
+ 1 -- enable.
+ Default value is set via kernel config option.
+
arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards
Format: <io>,<irq>,<nodeID>
+ arm64.nobti [ARM64] Unconditionally disable Branch Target
+ Identification support
+
+ arm64.nomops [ARM64] Unconditionally disable Memory Copy and Memory
+ Set instructions support
+
+ arm64.nomte [ARM64] Unconditionally disable Memory Tagging Extension
+ support
+
+ arm64.nopauth [ARM64] Unconditionally disable Pointer Authentication
+ support
+
+ arm64.nosme [ARM64] Unconditionally disable Scalable Matrix
+ Extension support
+
+ arm64.nosve [ARM64] Unconditionally disable Scalable Vector
+ Extension support
+
ataflop= [HW,M68k]
atarimouse= [HW,MOUSE] Atari Mouse
@@ -434,27 +510,30 @@
Format: <io>,<irq>,<mode>
See header of drivers/net/hamradio/baycom_ser_hdx.c.
+ bert_disable [ACPI]
+ Disable BERT OS support on buggy BIOSes.
+
+ bgrt_disable [ACPI,X86,EARLY]
+ Disable BGRT to avoid flickering OEM logo.
+
blkdevparts= Manual partition parsing of block device(s) for
embedded devices based on command line input.
See Documentation/block/cmdline-partition.rst
- boot_delay= Milliseconds to delay each printk during boot.
- Values larger than 10 seconds (10000) are changed to
- no delay (0).
+ boot_delay= [KNL,EARLY]
+ Milliseconds to delay each printk during boot.
+ Only works if CONFIG_BOOT_PRINTK_DELAY is enabled,
+ and you may also have to specify "lpj=". Boot_delay
+ values larger than 10 seconds (10000) are assumed
+ erroneous and ignored.
Format: integer
- bootconfig [KNL]
+ bootconfig [KNL,EARLY]
Extended command line options can be added to an initrd
and this will cause the kernel to look for it.
See Documentation/admin-guide/bootconfig.rst
- bert_disable [ACPI]
- Disable BERT OS support on buggy BIOSes.
-
- bgrt_disable [ACPI][X86]
- Disable BGRT to avoid flickering OEM logo.
-
bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards)
bttv.radio= Most important insmod options are available as
kernel args too.
@@ -484,25 +563,30 @@
trust validation.
format: { id:<keyid> | builtin }
- cca= [MIPS] Override the kernel pages' cache coherency
+ cca= [MIPS,EARLY] Override the kernel pages' cache coherency
algorithm. Accepted values range from 0 to 7
inclusive. See arch/mips/include/asm/pgtable-bits.h
for platform specific values (SB1, Loongson3 and
others).
ccw_timeout_log [S390]
- See Documentation/s390/common_io.rst for details.
+ See Documentation/arch/s390/common_io.rst for details.
- cgroup_disable= [KNL] Disable a particular controller
- Format: {name of the controller(s) to disable}
+ cgroup_disable= [KNL] Disable a particular controller or optional feature
+ Format: {name of the controller(s) or feature(s) to disable}
The effects of cgroup_disable=foo are:
- foo isn't auto-mounted if you mount all cgroups in
a single hierarchy
- foo isn't visible as an individually mountable
subsystem
+ - if foo is an optional feature then the feature is
+ disabled and corresponding cgroup files are not
+ created
{Currently only "memory" controller deal with this and
cut the overhead, others just disable the usage. So
only cgroup_disable=memory is actually worthy}
+ Specifying "pressure" disables the per-cgroup pressure
+ stall information accounting feature.
cgroup_no_v1= [KNL] Disable cgroup controllers and named hierarchies in v1
Format: { { controller | "all" | "named" }
@@ -513,12 +597,17 @@
named mounts. Specifying both "all" and "named" disables
all v1 hierarchies.
+ cgroup_favordynmods= [KNL] Enable or Disable favordynmods.
+ Format: { "true" | "false" }
+ Defaults to the value of CONFIG_CGROUP_FAVOR_DYNMODS.
+
cgroup.memory= [KNL] Pass options to the cgroup memory controller.
Format: <string>
nosocket -- Disable socket memory accounting.
nokmem -- Disable kernel memory accounting.
+ nobpf -- Disable BPF memory accounting.
- checkreqprot [SELINUX] Set initial checkreqprot flag value.
+ checkreqprot= [SELINUX] Set initial checkreqprot flag value.
Format: { "0" | "1" }
See security/selinux/Kconfig help text.
0 -- check protection applied by kernel (includes
@@ -530,7 +619,26 @@
Setting checkreqprot to 1 is deprecated.
cio_ignore= [S390]
- See Documentation/s390/common_io.rst for details.
+ See Documentation/arch/s390/common_io.rst for details.
+
+ clearcpuid=X[,X...] [X86]
+ Disable CPUID feature X for the kernel. See
+ arch/x86/include/asm/cpufeatures.h for the valid bit
+ numbers X. Note the Linux-specific bits are not necessarily
+ stable over kernel options, but the vendor-specific
+ ones should be.
+ X can also be a string as appearing in the flags: line
+ in /proc/cpuinfo which does not have the above
+ instability issue. However, not all features have names
+ in /proc/cpuinfo.
+ Note that using this option will taint your kernel.
+ Also note that user programs calling CPUID directly
+ or using the feature without checking anything
+ will still see it. This just prevents it from
+ being used by the kernel or shown in /proc/cpuinfo.
+ Also note the kernel might malfunction if you disable
+ some critical bits.
+
clk_ignore_unused
[CLK]
Prevents the clock framework from automatically gating
@@ -570,34 +678,59 @@
[X86-64] hpet,tsc
clocksource.arm_arch_timer.evtstrm=
- [ARM,ARM64]
+ [ARM,ARM64,EARLY]
Format: <bool>
Enable/disable the eventstream feature of the ARM
architected timer so that code using WFE-based polling
loops can be debugged more effectively on production
systems.
- clearcpuid=BITNUM [X86]
- Disable CPUID feature X for the kernel. See
- arch/x86/include/asm/cpufeatures.h for the valid bit
- numbers. Note the Linux specific bits are not necessarily
- stable over kernel options, but the vendor specific
- ones should be.
- Also note that user programs calling CPUID directly
- or using the feature without checking anything
- will still see it. This just prevents it from
- being used by the kernel or shown in /proc/cpuinfo.
- Also note the kernel might malfunction if you disable
- some critical bits.
+ clocksource.verify_n_cpus= [KNL]
+ Limit the number of CPUs checked for clocksources
+ marked with CLOCK_SOURCE_VERIFY_PERCPU that
+ are marked unstable due to excessive skew.
+ A negative value says to check all CPUs, while
+ zero says not to check any. Values larger than
+ nr_cpu_ids are silently truncated to nr_cpu_ids.
+ The actual CPUs are chosen randomly, with
+ no replacement if the same CPU is chosen twice.
+
+ clocksource-wdtest.holdoff= [KNL]
+ Set the time in seconds that the clocksource
+ watchdog test waits before commencing its tests.
+ Defaults to zero when built as a module and to
+ 10 seconds when built into the kernel.
cma=nn[MG]@[start[MG][-end[MG]]]
- [ARM,X86,KNL]
+ [KNL,CMA,EARLY]
Sets the size of kernel global memory area for
contiguous memory allocations and optionally the
placement constraint by the physical address range of
memory allocations. A value of 0 disables CMA
altogether. For more information, see
- include/linux/dma-contiguous.h
+ kernel/dma/contiguous.c
+
+ cma_pernuma=nn[MG]
+ [KNL,CMA,EARLY]
+ Sets the size of kernel per-numa memory area for
+ contiguous memory allocations. A value of 0 disables
+ per-numa CMA altogether. If this option is not
+ specified, the default value is 0.
+ With per-numa CMA enabled, DMA users on node nid will
+ first try to allocate buffer from the pernuma area
+ which is located in node nid. If the allocation fails,
+ they will fall back to the global default memory area.
+
+ numa_cma=<node>:nn[MG][,<node>:nn[MG]]
+ [KNL,CMA,EARLY]
+ Sets the size of kernel numa memory area for
+ contiguous memory allocations. It will reserve CMA
+ area for the specified node.
+
+ With numa CMA enabled, DMA users on node nid will
+ first try to allocate buffer from the numa area
+ which is located in node nid. If the allocation fails,
+ they will fall back to the global default memory area.
cmo_free_hint= [PPC] Format: { yes | no }
Specify whether pages are marked as being inactive
@@ -606,7 +739,7 @@
a hypervisor.
Default: yes
- coherent_pool=nn[KMG] [ARM,KNL]
+ coherent_pool=nn[KMG] [ARM,KNL,EARLY]
Sets the size of memory pool for coherent, atomic dma
allocations, by default set to 256K.
@@ -624,6 +757,17 @@
condev= [HW,S390] console device
conmode=
+ con3215_drop= [S390,EARLY] 3215 console drop mode.
+ Format: y|n|Y|N|1|0
+ When set to true, drop data on the 3215 console when
+ the console buffer is full. In this case the
+ operator using a 3270 terminal emulator (for example
+ x3270) does not have to enter the clear key for the
+ console output to advance and the kernel to continue.
+ This leads to a much faster boot time when a 3270
+ terminal emulator is active. If no 3270 terminal
+ emulator is used, this parameter has no effect.
+
console= [KNL] Output console device and options.
tty<n> Use the virtual console device <n>.
@@ -659,6 +803,12 @@
hvc<n> Use the hypervisor console device <n>. This is for
both Xen and PowerPC hypervisors.
+ { null | "" }
+ Use to disable console output, i.e., to have kernel
+ console messages discarded.
+ This must be the only console= parameter used on the
+ kernel command line.
+
If the device connected to the port is not a TTY but a braille
device, prepend "brl," before the device type, for instance
console=brl,ttyS0
@@ -694,6 +844,10 @@
0: default value, disable debugging
1: enable debugging at boot time
+ cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
+ Format:
+ <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
+
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system
@@ -703,25 +857,40 @@
cpufreq.off=1 [CPU_FREQ]
disable the cpufreq sub-system
+ cpufreq.default_governor=
+ [CPU_FREQ] Name of the default cpufreq governor or
+ policy to use. This governor must be registered in the
+ kernel before the cpufreq driver probes.
+
cpu_init_udelay=N
- [X86] Delay for N microsec between assert and de-assert
+ [X86,EARLY] Delay for N microsec between assert and de-assert
of APIC INIT to start processors. This delay occurs
on every CPU online, such as boot, and resume from suspend.
Default: 10000
- cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
- Format:
- <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
+ cpuhp.parallel=
+ [SMP] Enable/disable parallel bringup of secondary CPUs
+ Format: <bool>
+ Default is enabled if CONFIG_HOTPLUG_PARALLEL=y. Otherwise
+ the parameter has no effect.
+
+ crash_kexec_post_notifiers
+ Run kdump after running panic-notifiers and dumping
+ kmsg. This only for the users who doubt kdump always
+ kmsg. This is only for users who doubt that kdump always
+ succeeds in any situation.
+ because some panic notifiers can make the crashed
+ kernel more unstable.
crashkernel=size[KMG][@offset[KMG]]
- [KNL] Using kexec, Linux can switch to a 'crash kernel'
+ [KNL,EARLY] Using kexec, Linux can switch to a 'crash kernel'
upon panic. This parameter reserves the physical
memory region [offset, offset + size] for that kernel
image. If '@offset' is omitted, then a suitable offset
is selected automatically.
- [KNL, x86_64] select a region under 4G first, and
- fall back to reserve region above 4G when '@offset'
- hasn't been specified.
+ [KNL, X86-64, ARM64, RISCV, LoongArch] Select a region
+ under 4G first, and fall back to reserve region above
+ 4G when '@offset' hasn't been specified.
See Documentation/admin-guide/kdump/kdump.rst for further details.
crashkernel=range1:size1[,range2:size2,...][@offset]
@@ -732,22 +901,28 @@
Documentation/admin-guide/kdump/kdump.rst for an example.
crashkernel=size[KMG],high
- [KNL, x86_64] range could be above 4G. Allow kernel
- to allocate physical memory region from top, so could
- be above 4G if system have more than 4G ram installed.
- Otherwise memory region will be allocated below 4G, if
- available.
+ [KNL, X86-64, ARM64, RISCV, LoongArch] range could be
+ above 4G.
+ Allow the kernel to allocate the physical memory region from
+ the top, so it could be above 4G if the system has more than
+ 4G of RAM installed. Otherwise the memory region will be
+ allocated below 4G, if available.
It will be ignored if crashkernel=X is specified.
crashkernel=size[KMG],low
- [KNL, x86_64] range under 4G. When crashkernel=X,high
- is passed, kernel could allocate physical memory region
- above 4G, that cause second kernel crash on system
- that require some amount of low memory, e.g. swiotlb
- requires at least 64M+32K low memory, also enough extra
- low memory is needed to make sure DMA buffers for 32-bit
- devices won't run out. Kernel would try to allocate at
- at least 256M below 4G automatically.
- This one let user to specify own low range under 4G
+ [KNL, X86-64, ARM64, RISCV, LoongArch] range under 4G.
+ When crashkernel=X,high is passed, the kernel could allocate
+ the physical memory region above 4G, which causes the second
+ kernel to crash on systems that require some amount of low
+ memory, e.g. swiotlb requires at least 64M+32K of low memory,
+ and enough extra low memory is needed to make sure DMA buffers
+ for 32-bit devices won't run out. The kernel will try to
+ allocate a default amount of memory below 4G automatically.
+ The default size is platform dependent.
+ --> x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
+ --> arm64: 128MiB
+ --> riscv: 128MiB
+ --> loongarch: 128MiB
+ This option lets the user specify their own low range under 4G
for second kernel instead.
0: to disable low allocation.
It will be ignored when crashkernel=X,high is not used
@@ -762,6 +937,15 @@
cs89x0_media= [HW,NET]
Format: { rj45 | aui | bnc }
+ csdlock_debug= [KNL] Enable or disable debug add-ons of cross-CPU
+ function call handling. When switched on,
+ additional debug data is printed to the console
+ in case a hanging CPU is detected, and that
+ CPU is pinged again in order to try to resolve
+ the hang situation. The default value of this
+ option depends on the CSD_LOCK_WAIT_DEBUG_DEFAULT
+ Kconfig option.
+
dasd= [HW,NET]
See header of drivers/s390/block/dasd_devmap.c.
@@ -770,15 +954,10 @@
Format: <port#>,<type>
See also Documentation/input/devices/joystick-parport.rst
- ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
- time. See
- Documentation/admin-guide/dynamic-debug-howto.rst for
- details. Deprecated, see dyndbg.
-
- debug [KNL] Enable kernel debugging (events log level).
+ debug [KNL,EARLY] Enable kernel debugging (events log level).
debug_boot_weak_hash
- [KNL] Enable printing [hashed] pointers early in the
+ [KNL,EARLY] Enable printing [hashed] pointers early in the
boot sequence. If enabled, we use a weak hash instead
of siphash to hash pointers. Use this option if you are
seeing instances of '(___ptrval___)') and need to see a
@@ -786,40 +965,38 @@
insecure, please do not use on production kernels.
debug_locks_verbose=
- [KNL] verbose self-tests
- Format=<0|1>
+ [KNL] verbose locking self-tests
+ Format: <int>
Print debugging info while doing the locking API
self-tests.
- We default to 0 (no extra messages), setting it to
- 1 will print _a lot_ more information - normally
- only useful to kernel developers.
+ Bitmask for the various LOCKTYPE_ tests. Defaults to 0
+ (no extra messages), setting it to -1 (all bits set)
+ will print _a_lot_ more information - normally only
+ useful to lockdep developers.
- debug_objects [KNL] Enable object debugging
-
- no_debug_objects
- [KNL] Disable object debugging
+ debug_objects [KNL,EARLY] Enable object debugging
debug_guardpage_minorder=
- [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
+ [KNL,EARLY] When CONFIG_DEBUG_PAGEALLOC is set, this
parameter allows control of the order of pages that will
be intentionally kept free (and hence protected) by the
buddy allocator. Bigger value increase the probability
of catching random memory corruption, but reduce the
amount of memory for normal system use. The maximum
- possible value is MAX_ORDER/2. Setting this parameter
- to 1 or 2 should be enough to identify most random
- memory corruption problems caused by bugs in kernel or
- driver code when a CPU writes to (or reads from) a
- random memory location. Note that there exists a class
- of memory corruptions problems caused by buggy H/W or
- F/W or by drivers badly programing DMA (basically when
- memory is written at bus level and the CPU MMU is
- bypassed) which are not detectable by
- CONFIG_DEBUG_PAGEALLOC, hence this option will not help
- tracking down these problems.
+ possible value is MAX_PAGE_ORDER/2. Setting this
+ parameter to 1 or 2 should be enough to identify most
+ random memory corruption problems caused by bugs in
+ kernel or driver code when a CPU writes to (or reads
+ from) a random memory location. Note that there exists
+ a class of memory corruptions problems caused by buggy
+ H/W or F/W or by drivers badly programming DMA
+ (basically when memory is written at bus level and the
+ CPU MMU is bypassed) which are not detectable by
+ CONFIG_DEBUG_PAGEALLOC, hence this option will not
+ help tracking down these problems.
debug_pagealloc=
- [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter
+ [KNL,EARLY] When CONFIG_DEBUG_PAGEALLOC is set, this parameter
enables the feature at boot time. By default, it is
disabled and the system will work mostly the same as a
kernel built without CONFIG_DEBUG_PAGEALLOC.
@@ -827,11 +1004,22 @@
useful to also enable the page_owner functionality.
on: enable the feature
- debugpat [X86] Enable PAT debugging
+ debugfs= [KNL,EARLY] This parameter controls what is exposed to
+ userspace and to debugfs internal clients.
+ Format: { on, no-mount, off }
+ on: All functions are enabled.
+ no-mount:
+ Filesystem is not registered but kernel clients can
+ access APIs and a crashkernel can be used to read
+ its content. There is nothing to mount.
+ off: Filesystem is not registered and clients
+ get a -EPERM as result when trying to register files
+ or directories within debugfs.
+ This is equivalent to the runtime behaviour when
+ debugfs is not enabled in the kernel at all.
+ The default value is set at build time by the kernel
+ configuration.
- decnet.addr= [HW,NET]
- Format: <area>[,<node>]
- See also Documentation/networking/decnet.rst.
+ debugpat [X86] Enable PAT debugging
default_hugepagesz=
[HW] The size of the default HugeTLB page. This is
@@ -848,11 +1036,39 @@
[KNL] Debugging option to set a timeout in seconds for
deferred probe to give up waiting on dependencies to
probe. Only specific dependencies (subsystems or
- drivers) that have opted in will be ignored. A timeout of 0
- will timeout at the end of initcalls. This option will also
+ drivers) that have opted in will be ignored. A timeout
+ of 0 will time out at the end of initcalls. If the
+ timeout hasn't expired, it is restarted by each
+ successful driver registration. This option will also
dump out devices still on the deferred probe list after
retrying.
+ delayacct [KNL] Enable per-task delay accounting
+
+ dell_smm_hwmon.ignore_dmi=
+ [HW] Continue probing hardware even if DMI data
+ indicates that the driver is running on unsupported
+ hardware.
+
+ dell_smm_hwmon.force=
+ [HW] Activate driver even if SMM BIOS signature does
+ not match list of supported models and enable otherwise
+ blacklisted features.
+
+ dell_smm_hwmon.power_status=
+ [HW] Report power status in /proc/i8k
+ (disabled by default).
+
+ dell_smm_hwmon.restricted=
+ [HW] Allow controlling fans only if SYS_ADMIN
+ capability is set.
+
+ dell_smm_hwmon.fan_mult=
+ [HW] Factor to multiply fan speed with.
+
+ dell_smm_hwmon.fan_max=
+ [HW] Maximum configurable fan speed.
+
dfltcc= [HW,S390]
Format: { on | off | def_only | inf_only | always }
on: s390 zlib hardware support for compression on
@@ -868,72 +1084,41 @@
dhash_entries= [KNL]
Set number of hash buckets for dentry cache.
- disable_1tb_segments [PPC]
+ disable_1tb_segments [PPC,EARLY]
Disables the use of 1TB hash page table segments. This
causes the kernel to fall back to 256MB segments which
can be useful when debugging issues that require an SLB
miss to occur.
- stress_slb [PPC]
- Limits the number of kernel SLB entries, and flushes
- them frequently to increase the rate of SLB faults
- on kernel addresses.
-
disable= [IPV6]
See Documentation/networking/ipv6.rst.
- hardened_usercopy=
- [KNL] Under CONFIG_HARDENED_USERCOPY, whether
- hardening is enabled for this boot. Hardened
- usercopy checking is used to protect the kernel
- from reading or writing beyond known memory
- allocation boundaries as a proactive defense
- against bounds-checking flaws in the kernel's
- copy_to_user()/copy_from_user() interface.
- on Perform hardened usercopy checks (default).
- off Disable hardened usercopy checks.
-
- disable_radix [PPC]
+ disable_radix [PPC,EARLY]
Disable RADIX MMU mode on POWER9
disable_tlbie [PPC]
Disable TLBIE instruction. Currently does not work
with KVM, with HASH MMU, or with coherent accelerators.
- disable_cpu_apicid= [X86,APIC,SMP]
- Format: <int>
- The number of initial APIC ID for the
- corresponding CPU to be disabled at boot,
- mostly used for the kdump 2nd kernel to
- disable BSP to wake up multiple CPUs without
- causing system reset or hang due to sending
- INIT from AP to BSP.
-
- perf_v4_pmi= [X86,INTEL]
- Format: <bool>
- Disable Intel PMU counter freezing feature.
- The feature only exists starting from
- Arch Perfmon v4 (Skylake and newer).
-
- disable_ddw [PPC/PSERIES]
- Disable Dynamic DMA Window support. Use this if
+ disable_ddw [PPC/PSERIES,EARLY]
+ Disable Dynamic DMA Window support. Use this
to workaround buggy firmware.
disable_ipv6= [IPV6]
See Documentation/networking/ipv6.rst.
- disable_mtrr_cleanup [X86]
+ disable_mtrr_cleanup [X86,EARLY]
The kernel tries to adjust MTRR layout from continuous
to discrete, to make X server driver able to add WB
entry later. This parameter disables that.
- disable_mtrr_trim [X86, Intel and AMD only]
+ disable_mtrr_trim [X86, Intel and AMD only,EARLY]
By default the kernel will trim any uncacheable
memory out of your available memory pool based on
MTRR settings. This parameter disables that behavior,
possibly causing your machine to run very slowly.
- disable_timer_pin_1 [X86]
+ disable_timer_pin_1 [X86,EARLY]
Disable PIN 1 of APIC timer
Can be useful to work around chipset bugs.
@@ -956,8 +1141,31 @@
The filter can be disabled or changed to another
driver later using sysfs.
+ reg_file_data_sampling=
+ [X86] Controls mitigation for Register File Data
+ Sampling (RFDS) vulnerability. RFDS is a CPU
+ vulnerability which may allow userspace to infer
+ kernel data values previously stored in floating point
+ registers, vector registers, or integer registers.
+ RFDS only affects Intel Atom processors.
+
+ on: Turns ON the mitigation.
+ off: Turns OFF the mitigation.
+
+ This parameter overrides the compile time default set
+ by CONFIG_MITIGATION_RFDS. Mitigation cannot be
+ disabled when other VERW based mitigations (like MDS)
+ are enabled. In order to disable RFDS mitigation all
+ VERW based mitigations need to be disabled.
+
+ For details see:
+ Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
+
driver_async_probe= [KNL]
- List of driver names to be probed asynchronously.
+ List of driver names to be probed asynchronously. '*'
+ matches all driver names. If '*' is specified, the
+ other driver names listed are exceptions and will NOT
+ be probed asynchronously.
Format: <driver_name1>,<driver_name2>...
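+ For example, to probe all drivers asynchronously except
+ a hypothetical driver named "mydrv":
+   driver_async_probe=*,mydrv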
drm.edid_firmware=[<connector>:]<file>[,[<connector>:]<file>]
@@ -965,22 +1173,16 @@
panels may send no or incorrect EDID data sets.
This parameter allows to specify an EDID data sets
in the /lib/firmware directory that are used instead.
- Generic built-in EDID data sets are used, if one of
- edid/1024x768.bin, edid/1280x1024.bin,
- edid/1680x1050.bin, or edid/1920x1080.bin is given
- and no file with the same name exists. Details and
- instructions how to build your own EDID data are
- available in Documentation/admin-guide/edid.rst. An EDID
- data set will only be used for a particular connector,
- if its name and a colon are prepended to the EDID
- name. Each connector may use a unique EDID data
- set by separating the files with a comma. An EDID
+ An EDID data set will only be used for a particular
+ connector, if its name and a colon are prepended to
+ the EDID name. Each connector may use a unique EDID
+ data set by separating the files with a comma. An EDID
data set with no connector name will be used for
any connectors not explicitly specified.
dscc4.setup= [NET]
- dt_cpu_ftrs= [PPC]
+ dt_cpu_ftrs= [PPC,EARLY]
Format: {"off" | "known"}
Control how the dt_cpu_ftrs device-tree binding is
used for CPU feature discovery and setup (if it
@@ -995,23 +1197,17 @@
what data is available or for reverse-engineering.
dyndbg[="val"] [KNL,DYNAMIC_DEBUG]
- module.dyndbg[="val"]
+ <module>.dyndbg[="val"]
Enable debug messages at boot time. See
Documentation/admin-guide/dynamic-debug-howto.rst
for details.
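+ A minimal illustration (module name is hypothetical):
+   mymodule.dyndbg=+p
+ enables all pr_debug() callsites in "mymodule" at boot.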
- nopku [X86] Disable Memory Protection Keys CPU feature found
- in some Intel CPUs.
-
- module.async_probe [KNL]
- Enable asynchronous probe on this module.
-
- early_ioremap_debug [KNL]
+ early_ioremap_debug [KNL,EARLY]
Enable debug messages in early_ioremap support. This
is useful for tracking down temporary early mappings
which are not unmapped.
- earlycon= [KNL] Output early console device and options.
+ earlycon= [KNL,EARLY] Output early console device and options.
When used with no options, the early console is
determined by stdout-path property in device tree's
@@ -1025,10 +1221,10 @@
specified, the serial port must already be setup and
configured.
- uart[8250],io,<addr>[,options]
- uart[8250],mmio,<addr>[,options]
- uart[8250],mmio32,<addr>[,options]
- uart[8250],mmio32be,<addr>[,options]
+ uart[8250],io,<addr>[,options[,uartclk]]
+ uart[8250],mmio,<addr>[,options[,uartclk]]
+ uart[8250],mmio32,<addr>[,options[,uartclk]]
+ uart[8250],mmio32be,<addr>[,options[,uartclk]]
uart[8250],0x<addr>[,options]
Start an early, polled-mode console on the 8250/16550
UART at the specified I/O port or MMIO address.
@@ -1037,7 +1233,9 @@
If none of [io|mmio|mmio32|mmio32be], <addr> is assumed
to be equivalent to 'mmio'. 'options' are specified
in the same format described for "console=ttyS<n>"; if
- unspecified, the h/w is not initialized.
+ unspecified, the h/w is not initialized. 'uartclk' is
+ the uart clock frequency; if unspecified, it is set
+ to 'BASE_BAUD' * 16.
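+ For example, an 8250 early console on a 32-bit MMIO
+ UART with an explicit 48 MHz clock (address and clock
+ are illustrative):
+   earlycon=uart8250,mmio32,0xfe001000,115200n8,48000000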
pl011,<addr>
pl011,mmio32,<addr>
@@ -1048,6 +1246,11 @@
the driver will use only 32-bit accessors to read/write
the device registers.
+ liteuart,<addr>
+ Start an early console on a litex serial port at the
+ specified address. The serial port must already be
+ setup and configured. Options are not yet supported.
+
meson,<addr>
Start an early, polled-mode console on a meson serial
port at the specified address. The serial port must
@@ -1140,7 +1343,7 @@
address must be provided, and the serial port must
already be setup and configured.
- earlyprintk= [X86,SH,ARM,M68k,S390]
+ earlyprintk= [X86,SH,ARM,M68k,S390,UM,EARLY]
earlyprintk=vga
earlyprintk=sclp
earlyprintk=xen
@@ -1150,6 +1353,7 @@
earlyprintk=dbgp[debugController#]
earlyprintk=pciserial[,force],bus:device.function[,baudrate]
earlyprintk=xdbc[xhciController#]
+ earlyprintk=bios
earlyprintk is useful when the kernel crashes before
the normal console is initialized. It is not enabled by
@@ -1158,7 +1362,7 @@
Append ",keep" to not disable it when the real console
takes over.
- Only one of vga, efi, serial, or usb debug port can
+ Only one of vga, serial, or usb debug port can
be used at a time.
Currently only ttyS0 and ttyS1 may be specified by
@@ -1173,13 +1377,15 @@
Interaction with the standard serial driver is not
very good.
- The VGA and EFI output is eventually overwritten by
+ The VGA output is eventually overwritten by
the real console.
- The xen output can only be used by Xen PV guests.
+ The xen option can only be used in Xen domains.
The sclp output can only be used on s390.
+ The bios output can only be used on SuperH.
+
The optional "force" to "pciserial" enables use of a
PCI device even when its classcode is not of the
UART class.
@@ -1192,49 +1398,37 @@
force: enforce the use of EDAC to report H/W event.
default: on.
- ekgdboc= [X86,KGDB] Allow early kernel console debugging
- ekgdboc=kbd
-
- This is designed to be used in conjunction with
- the boot argument: earlyprintk=vga
-
- This parameter works in place of the kgdboc parameter
- but can only be used if the backing tty is available
- very early in the boot process. For early debugging
- via a serial port see kgdboc_earlycon instead.
-
edd= [EDD]
Format: {"off" | "on" | "skip[mbr]"}
- efi= [EFI]
- Format: { "old_map", "nochunk", "noruntime", "debug",
- "nosoftreserve", "disable_early_pci_dma",
- "no_disable_early_pci_dma" }
- old_map [X86-64]: switch to the old ioremap-based EFI
- runtime services mapping. [Needs CONFIG_X86_UV=y]
+ efi= [EFI,EARLY]
+ Format: { "debug", "disable_early_pci_dma",
+ "nochunk", "noruntime", "nosoftreserve",
+ "novamap", "no_disable_early_pci_dma" }
+ debug: enable misc debug output.
+ disable_early_pci_dma: disable the busmaster bit on all
+ PCI bridges while in the EFI boot stub.
nochunk: disable reading files in "chunks" in the EFI
boot stub, as chunking can cause problems with some
firmware implementations.
noruntime : disable EFI runtime services support
- debug: enable misc debug output
nosoftreserve: The EFI_MEMORY_SP (Specific Purpose)
attribute may cause the kernel to reserve the
memory range for a memory mapping driver to
claim. Specify efi=nosoftreserve to disable this
reservation and treat the memory by its base type
(i.e. EFI_CONVENTIONAL_MEMORY / "System RAM").
- disable_early_pci_dma: Disable the busmaster bit on all
- PCI bridges while in the EFI boot stub
+ novamap: do not call SetVirtualAddressMap().
no_disable_early_pci_dma: Leave the busmaster bit set
on all PCI bridges while in the EFI boot stub
- efi_no_storage_paranoia [EFI; X86]
+ efi_no_storage_paranoia [EFI,X86,EARLY]
Using this parameter you can use more than 50% of
your efi variable storage. Use this parameter only if
you are really sure that your UEFI does sane gc and
fulfills the spec otherwise your board may brick.
- efi_fake_mem= nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI; X86]
+ efi_fake_mem= nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI,X86,EARLY]
Add arbitrary attribute to specific memory range by
updating original EFI memory map.
Region of memory which aa attribute is added to is
@@ -1265,17 +1459,28 @@
eisa_irq_edge= [PARISC,HW]
See header of drivers/parisc/eisa.c.
+ ekgdboc= [X86,KGDB,EARLY] Allow early kernel console debugging
+ Format: ekgdboc=kbd
+
+ This is designed to be used in conjunction with
+ the boot argument: earlyprintk=vga
+
+ This parameter works in place of the kgdboc parameter
+ but can only be used if the backing tty is available
+ very early in the boot process. For early debugging
+ via a serial port see kgdboc_earlycon instead.
+
elanfreq= [X86-32]
See comment before function elanfreq_setup() in
arch/x86/kernel/cpu/cpufreq/elanfreq.c.
- elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390]
+ elfcorehdr=[size[KMG]@]offset[KMG] [PPC,SH,X86,S390,EARLY]
Specifies physical address of start of kernel core
image elf header and optionally the size. Generally
kexec loader will pass this option to capture kernel.
See Documentation/admin-guide/kdump/kdump.rst for details.
- enable_mtrr_cleanup [X86]
+ enable_mtrr_cleanup [X86,EARLY]
The kernel tries to adjust MTRR layout from continuous
to discrete, to make X server driver able to add WB
entry later. This parameter enables that.
@@ -1286,7 +1491,7 @@
(in particular on some ATI chipsets).
The kernel tries to set a reasonable default.
- enforcing [SELINUX] Set initial enforcing status.
+ enforcing= [SELINUX] Set initial enforcing status.
Format: {"0" | "1"}
See security/selinux/Kconfig help text.
0 -- permissive (log only, no denials).
@@ -1308,22 +1513,30 @@
Permit 'security.evm' to be updated regardless of
current integrity status.
+ early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
+ stages so that more early boot allocations are covered.
+ Please note that as side effect some optimizations
+ might be disabled to achieve that (e.g. parallelized
+ memory initialization is disabled) so the boot process
+ might take longer, especially on systems with a lot of
+ memory. Available with CONFIG_PAGE_EXTENSION=y.
+
failslab=
+ fail_usercopy=
fail_page_alloc=
fail_make_request=[KNL]
General fault injection mechanism.
Format: <interval>,<probability>,<space>,<times>
See also Documentation/fault-injection/.
+ fb_tunnels= [NET]
+ Format: { initns | none }
+ See Documentation/admin-guide/sysctl/net.rst for
+ fb_tunnels_only_for_init_ns
+
floppy= [HW]
See Documentation/admin-guide/blockdev/floppy.rst.
- force_pal_cache_flush
- [IA-64] Avoid check_sal_cache_flush which may hang on
- buggy SAL_CACHE_FLUSH implementations. Using this
- parameter will force ia64_sal_cache_flush to call
- ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.
-
forcepae [X86-32]
Forcefully enable Physical Address Extension (PAE).
Many Pentium M systems disable PAE but may have a
@@ -1331,21 +1544,60 @@
Warning: use of this parameter will taint the kernel
and may cause unknown problems.
+ fred= [X86-64]
+ Enable/disable Flexible Return and Event Delivery.
+ Format: { on | off }
+ on: enable FRED when it's present.
+ off: disable FRED, the default setting.
+
ftrace=[tracer]
[FTRACE] will set and start the specified tracer
as early as possible in order to facilitate early
boot debugging.
- ftrace_dump_on_oops[=orig_cpu]
+ ftrace_boot_snapshot
+ [FTRACE] On boot up, a snapshot will be taken of the
+ ftrace ring buffer that can be read at:
+ /sys/kernel/tracing/snapshot.
+ This is useful if you need tracing information from kernel
+ boot up that is likely to be overridden by user space
+ start up functionality.
+
+ Optionally, the snapshot can also be defined for a tracing
+ instance that was created by the trace_instance= command
+ line parameter.
+
+ trace_instance=foo,sched_switch ftrace_boot_snapshot=foo
+
+ The above will cause the "foo" tracing instance to trigger
+ a snapshot at the end of boot up.
+
+ ftrace_dump_on_oops[=2(orig_cpu) | =<instance>][,<instance> |
+ ,<instance>=2(orig_cpu)]
[FTRACE] will dump the trace buffers on oops.
- If no parameter is passed, ftrace will dump
- buffers of all CPUs, but if you pass orig_cpu, it will
- dump only the buffer of the CPU that triggered the
- oops.
+ If no parameter is passed, ftrace will dump the global
+ buffers of all CPUs; if you pass 2 or orig_cpu, it
+ will dump only the buffer of the CPU that triggered
+ the oops; if an instance name is passed, that specific
+ instance will be dumped. Dumping multiple instances is
+ also supported, with instances separated by commas. An
+ individual instance can be limited to the CPU that
+ triggered the oops by passing 2 or orig_cpu to it.
+
+ ftrace_dump_on_oops=foo=orig_cpu
+
+ The above will dump only the buffer of "foo" instance
+ on CPU that triggered the oops.
+
+ ftrace_dump_on_oops,foo,bar=orig_cpu
+
+ The above will dump global buffer on all CPUs, the
+ buffer of "foo" instance on all CPUs and the buffer
+ of "bar" instance on CPU that triggered the oops.
ftrace_filter=[function-list]
[FTRACE] Limit the functions traced by the function
- tracer at boot up. function-list is a comma separated
+ tracer at boot up. function-list is a comma-separated
list of functions. This list can be changed at run
time by the set_ftrace_filter file in the debugfs
tracing directory.
@@ -1359,13 +1611,13 @@
ftrace_graph_filter=[function-list]
[FTRACE] Limit the top level callers functions traced
by the function graph tracer at boot up.
- function-list is a comma separated list of functions
+ function-list is a comma-separated list of functions
that can be changed at run time by the
set_graph_function file in the debugfs tracing directory.
ftrace_graph_notrace=[function-list]
[FTRACE] Do not trace from the functions specified in
- function-list. This list is a comma separated list of
+ function-list. This list is a comma-separated list of
functions that can be changed at run time by the
set_graph_notrace file in the debugfs tracing directory.
@@ -1375,7 +1627,7 @@
can be changed at run time by the max_graph_depth file
in the tracefs tracing directory. default: 0 (no limit)
- fw_devlink= [KNL] Create device links between consumer and supplier
+ fw_devlink= [KNL,EARLY] Create device links between consumer and supplier
devices by scanning the firmware to infer the
consumer/supplier relationships. This feature is
especially useful when drivers are loaded as modules as
@@ -1393,6 +1645,25 @@
to enforce probe and suspend/resume ordering.
rpm -- Like "on", but also use to order runtime PM.
+ fw_devlink.strict=<bool>
+ [KNL,EARLY] Treat all inferred dependencies as mandatory
+ dependencies. This only applies for fw_devlink=on|rpm.
+ Format: <bool>
+
+ fw_devlink.sync_state=
+ [KNL,EARLY] When all devices that could probe have finished
+ probing, this parameter controls what to do with
+ devices that haven't yet received their sync_state()
+ calls.
+ Format: { strict | timeout }
+ strict -- Default. Continue waiting on consumers to
+ probe successfully.
+ timeout -- Give up waiting on consumers and call
+ sync_state() on any devices that haven't yet
+ received their sync_state() calls after
+ deferred_probe_timeout has expired or by
+ late_initcall() if !CONFIG_MODULES.
+
gamecon.map[2|3]=
[HW,JOY] Multisystem joystick and NES/SNES/PSX pad
support via parallel port (up to 5 devices per port)
@@ -1401,10 +1672,30 @@
gamma= [HW,DRM]
- gart_fix_e820= [X86_64] disable the fix e820 for K8 GART
+ gart_fix_e820= [X86-64,EARLY] disable the fix e820 for K8 GART
Format: off | on
default: on
+ gather_data_sampling=
+ [X86,INTEL,EARLY] Control the Gather Data Sampling (GDS)
+ mitigation.
+
+ Gather Data Sampling is a hardware vulnerability which
+ allows unprivileged speculative access to data which was
+ previously stored in vector registers.
+
+ This issue is mitigated by default in updated microcode.
+ The mitigation may have a performance impact but can be
+ disabled. On systems without the microcode mitigation,
+ disabling AVX serves as a mitigation.
+
+ force: Disable AVX to mitigate systems without
+ microcode mitigation. No effect if the microcode
+ mitigation is present. Known to cause crashes in
+ userspace with buggy AVX enumeration.
+
+ off: Disable GDS mitigation.
+
gcov_persist= [GCOV] When non-zero (default), profiling data for
kernel modules is saved and remains accessible via
debugfs, even when the module is unloaded/reloaded.
@@ -1415,6 +1706,12 @@
Don't use this when you are not running on the
android emulator
+ gpio-mockup.gpio_mockup_ranges
+ [HW] Sets the ranges of the gpiochip for this device.
+ Format: <start1>,<end1>,<start2>,<end2>...
+ gpio-mockup.gpio_mockup_named_lines
+ [HW] Let the driver know GPIO lines should be named.
+
gpt [EFI] Forces disk with valid GPT signature but
invalid Protective MBR to be treated as GPT. If the
primary GPT is corrupted, it enables the backup/alternate
@@ -1438,9 +1735,16 @@
Format: <unsigned int> such that (rxsize & ~0x1fffc0) == 0.
Default: 1024
- gpio-mockup.gpio_mockup_ranges
- [HW] Sets the ranges of gpiochip of for this device.
- Format: <start1>,<end1>,<start2>,<end2>...
+ hardened_usercopy=
+ [KNL] Under CONFIG_HARDENED_USERCOPY, whether
+ hardening is enabled for this boot. Hardened
+ usercopy checking is used to protect the kernel
+ from reading or writing beyond known memory
+ allocation boundaries as a proactive defense
+ against bounds-checking flaws in the kernel's
+ copy_to_user()/copy_from_user() interface.
+ on Perform hardened usercopy checks (default).
+ off Disable hardened usercopy checks.
hardlockup_all_cpu_backtrace=
[KNL] Should the hard-lockup detector generate
@@ -1462,7 +1766,27 @@
corresponding firmware-first mode error processing
logic will be disabled.
- highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact
+ hibernate= [HIBERNATION]
+ noresume Don't check if there's a hibernation image
+ present during boot.
+ nocompress Don't compress/decompress hibernation images.
+ no Disable hibernation and resume.
+ protect_image Turn on image protection during restoration
+ (that will set all pages holding image data
+ during restoration read-only).
+
+ hibernate.compressor= [HIBERNATION] Compression algorithm to be
+ used with hibernation.
+ Format: { lzo | lz4 }
+ Default: lzo
+
+ lzo: Select LZO compression algorithm to
+ compress/decompress hibernation image.
+
+ lz4: Select LZ4 compression algorithm to
+ compress/decompress hibernation image.
+
+ highmem=nn[KMG] [KNL,BOOT,EARLY] forces the highmem zone to have an exact
size of <nn>. This works even on boxes that have no
highmem otherwise. This also works to reduce highmem
size on bigger boxes.
@@ -1473,6 +1797,19 @@
hlt [BUGS=ARM,SH]
+ hostname= [KNL,EARLY] Set the hostname (aka UTS nodename).
+ Format: <string>
+ This allows setting the system's hostname during early
+ startup. This sets the name returned by gethostname.
+ Using this parameter to set the hostname makes it
+ possible to ensure the hostname is correctly set before
+ any userspace processes run, avoiding the possibility
+ that a process may call gethostname before the hostname
+ has been explicitly set, resulting in the calling
+ process getting an incorrect result. The string must
+ not exceed the maximum allowed hostname length (usually
+ 64 characters) and will be truncated otherwise.
+
hpet= [X86-32,HPET] option to control HPET usage
Format: { enable (default) | disable | force |
verbose }
@@ -1484,22 +1821,16 @@
hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET
registers. Default set by CONFIG_HPET_MMAP_DEFAULT.
- hugetlb_cma= [HW] The size of a cma area used for allocation
- of gigantic hugepages.
- Format: nn[KMGTPE]
-
- Reserve a cma area of given size and allocate gigantic
- hugepages using the cma allocator. If enabled, the
- boot-time allocation of gigantic hugepages is skipped.
-
hugepages= [HW] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated.
If this is the first HugeTLB parameter on the command
line, it specifies the number of pages to allocate for
- the default huge page size. See also
- Documentation/admin-guide/mm/hugetlbpage.rst.
- Format: <integer>
+ the default huge page size. If using node format, the
+ number of pages to allocate per-node can be specified.
+ See also Documentation/admin-guide/mm/hugetlbpage.rst.
+ Format: <integer> or (node format)
+ <node>:<integer>[,<node>:<integer>]
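+ For example, on a two-node system, the following would
+ request 1024 2M pages on node 0 and 512 on node 1
+ (counts are illustrative):
+   hugepagesz=2M hugepages=0:1024,1:512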
hugepagesz=
[HW] The size of the HugeTLB pages. This is used in
@@ -1511,6 +1842,36 @@
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
+ hugetlb_cma= [HW,CMA,EARLY] The size of a CMA area used for allocation
+ of gigantic hugepages. Or using node format, the size
+ of a CMA area per node can be specified.
+ Format: nn[KMGTPE] or (node format)
+ <node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
+
+ Reserve a CMA area of given size and allocate gigantic
+ hugepages using the CMA allocator. If enabled, the
+ boot-time allocation of gigantic hugepages is skipped.
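+ For example, to reserve a 2G CMA area on each of two
+ nodes for gigantic pages (sizes are illustrative):
+   hugetlb_cma=0:2G,1:2G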
+
+ hugetlb_free_vmemmap=
+ [KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+ enabled.
+ Control whether HugeTLB Vmemmap Optimization (HVO) is enabled.
+ Allows heavy hugetlb users to free up some more
+ memory (7 * PAGE_SIZE for each 2MB hugetlb page).
+ Format: { on | off (default) }
+
+ on: enable HVO
+ off: disable HVO
+
+ Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
+ the default is on.
+
+ Note that when memory_hotplug.memmap_on_memory is
+ enabled, the vmemmap pages may be allocated from the
+ added memory block itself; those vmemmap pages cannot
+ be optimized even if this feature is enabled. Other
+ vmemmap pages, not allocated from the added memory
+ block itself, are not affected.
+
hung_task_panic=
[KNL] Should the hung task detector generate panics.
Format: 0 | 1
@@ -1527,15 +1888,10 @@
If specified, z/VM IUCV HVC accepts connections
from listed z/VM user IDs only.
- hv_nopvspin [X86,HYPER_V] Disables the paravirt spinlock optimizations
- which allow the hypervisor to 'idle' the
- guest on lock contention.
-
- keep_bootcon [KNL]
- Do not unregister boot console at start. This is only
- useful for debugging when something happens in the window
- between unregistering the boot console and initializing
- the real console.
+ hv_nopvspin [X86,HYPER_V,EARLY]
+ Disables the paravirt spinlock optimizations
+ which allow the hypervisor to 'idle' the guest
+ on lock contention.
i2c_bus= [HW] Override the default board specific I2C bus speed
or register an additional I2C bus that is not
@@ -1571,20 +1927,11 @@
architectures force reset to be always executed
i8042.unlock [HW] Unlock (ignore) the keylock
i8042.kbdreset [HW] Reset device connected to KBD port
+ i8042.probe_defer
+ [HW] Allow deferred probing upon i8042 probe errors
i810= [HW,DRM]
- i8k.ignore_dmi [HW] Continue probing hardware even if DMI data
- indicates that the driver is running on unsupported
- hardware.
- i8k.force [HW] Activate i8k driver even if SMM BIOS signature
- does not match list of supported models.
- i8k.power_status
- [HW] Report power status in /proc/i8k
- (disabled by default)
- i8k.restricted [HW] Allow controlling fans only if SYS_ADMIN
- capability is set.
-
i915.invert_brightness=
[DRM] Invert the sense of the variable that is used to
set the brightness of the panel backlight. Normally a
@@ -1599,31 +1946,17 @@
0 -- machine default
1 -- force brightness inversion
+ ia32_emulation= [X86-64]
+ Format: <bool>
+ When true, allows loading 32-bit programs and executing 32-bit
+ syscalls, essentially overriding IA32_EMULATION_DEFAULT_DISABLED at
+ boot time. When false, unconditionally disables IA32 emulation.
+
icn= [HW,ISDN]
Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]]
- ide-core.nodma= [HW] (E)IDE subsystem
- Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc
- .vlb_clock .pci_clock .noflush .nohpa .noprobe .nowerr
- .cdrom .chs .ignore_cable are additional options
- See Documentation/ide/ide.rst.
- ide-generic.probe-mask= [HW] (E)IDE subsystem
- Format: <int>
- Probe mask for legacy ISA IDE ports. Depending on
- platform up to 6 ports are supported, enabled by
- setting corresponding bits in the mask to 1. The
- default value is 0x0, which has a special meaning.
- On systems that have PCI, it triggers scanning the
- PCI bus for the first and the second port, which
- are then probed. On systems without PCI the value
- of 0x0 enables probing the two first ports as if it
- was 0x3.
-
- ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem
- Claim all unknown PCI IDE storage controllers.
-
- idle= [X86]
+ idle= [X86,EARLY]
Format: idle=poll, idle=halt, idle=nomwait
Poll forces a polling idle loop that can slightly
improve the performance of waking up a idle CPU, but
@@ -1633,6 +1966,17 @@
In such case C2/C3 won't be used again.
idle=nomwait: Disable mwait for CPU C-states
+ idxd.sva= [HW]
+ Format: <bool>
+ Allow force disabling of Shared Virtual Memory (SVA)
+ support for the idxd driver. By default it is set to
+ true (1).
+
+ idxd.tc_override= [HW]
+ Format: <bool>
+ Allow override of default traffic class configuration
+ for the device. By default it is set to false (0).
+
ieee754= [MIPS] Select IEEE Std 754 conformance mode
Format: { strict | legacy | 2008 | relaxed }
Default: strict
@@ -1668,7 +2012,7 @@
mode generally follows that for the NaN encoding,
except where unsupported by hardware.
- ignore_loglevel [KNL]
+ ignore_loglevel [KNL,EARLY]
Ignore loglevel setting - this will print /all/
kernel messages to the console. Useful for debugging.
We also add it as printk module parameter, so users
@@ -1706,7 +2050,7 @@
ima_policy= [IMA]
The builtin policies to load during IMA setup.
Format: "tcb | appraise_tcb | secure_boot |
- fail_securely"
+ fail_securely | critical_data"
The "tcb" policy measures all programs exec'd, files
mmap'd for exec, and all files opened with the read
@@ -1725,6 +2069,9 @@
filesystems with the SB_I_UNVERIFIABLE_SIGNATURE
flag.
+ The "critical_data" policy measures kernel integrity
+ critical data.
+
ima_tcb [IMA] Deprecated. Use ima_policy= instead.
Load a policy which meets the needs of the Trusted
Computing Base. This means IMA will measure all
@@ -1733,7 +2080,8 @@
ima_template= [IMA]
Select one of defined IMA measurements template formats.
- Formats: { "ima" | "ima-ng" | "ima-sig" }
+ Formats: { "ima" | "ima-ng" | "ima-ngv2" | "ima-sig" |
+ "ima-sigv2" }
Default: "ima-ng"
ima_template_fmt=
@@ -1770,25 +2118,37 @@
initcall functions. Useful for debugging built-in
modules and initcalls.
- initrd= [BOOT] Specify the location of the initial ramdisk
-
- initrdmem= [KNL] Specify a physical address and size from which to
+ initramfs_async= [KNL]
+ Format: <bool>
+ Default: 1
+ This parameter controls whether the initramfs
+ image is unpacked asynchronously, concurrently
+ with devices being probed and
+ initialized. This should normally just work,
+ but as a debugging aid, one can get the
+ historical behaviour of the initramfs
+ unpacking being completed before device_ and
+ late_ initcalls.
+
+ initrd= [BOOT,EARLY] Specify the location of the initial ramdisk
+
+ initrdmem= [KNL,EARLY] Specify a physical address and size from which to
load the initrd. If an initrd is compiled in or
specified in the bootparams, it takes priority over this
setting.
Format: ss[KMG],nn[KMG]
Default is 0, 0
- init_on_alloc= [MM] Fill newly allocated pages and heap objects with
+ init_on_alloc= [MM,EARLY] Fill newly allocated pages and heap objects with
zeroes.
Format: 0 | 1
Default set by CONFIG_INIT_ON_ALLOC_DEFAULT_ON.
- init_on_free= [MM] Fill freed pages and heap objects with zeroes.
+ init_on_free= [MM,EARLY] Fill freed pages and heap objects with zeroes.
Format: 0 | 1
Default set by CONFIG_INIT_ON_FREE_DEFAULT_ON.
- init_pkru= [x86] Specify the default memory protection keys rights
+ init_pkru= [X86] Specify the default memory protection keys rights
register contents for all processes. 0x55555554 by
default (disallow access to all but pkey 0). Can
override in debugfs after boot.
@@ -1796,7 +2156,7 @@
inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver
Format: <irq>
- int_pln_enable [x86] Enable power limit notification interrupt
+ int_pln_enable [X86] Enable power limit notification interrupt
integrity_audit=[IMA]
Format: { "0" | "1" }
@@ -1814,26 +2174,18 @@
bypassed by not enabling DMAR with this option. In
this case, gfx device will use physical address for
DMA.
- forcedac [x86_64]
- With this option iommu will not optimize to look
- for io virtual address below 32-bit forcing dual
- address cycle on pci bus for cards supporting greater
- than 32-bit addressing. The default is to look
- for translation below 32-bit and if not available
- then look in the higher range.
strict [Default Off]
- With this option on every unmap_single operation will
- result in a hardware IOTLB flush operation as opposed
- to batching them for performance.
+ Deprecated, equivalent to iommu.strict=1.
sp_off [Default Off]
By default, super page will be supported if Intel IOMMU
has the capability. With this option, super page will
not be supported.
- sm_on [Default Off]
- By default, scalable mode will be disabled even if the
- hardware advertises that it has support for the scalable
- mode translation. With this option set, scalable mode
- will be used on hardware which claims to support it.
+ sm_on
+ Enable the Intel IOMMU scalable mode if the hardware
+ advertises that it has support for the scalable mode
+ translation.
+ sm_off
+ Disallow use of the Intel IOMMU scalable mode.
tboot_noforce [Default Off]
Do not force the Intel IOMMU enabled under tboot.
By default, tboot will force Intel IOMMU on, which
@@ -1843,20 +2195,25 @@
Note that using this option lowers the security
provided by tboot because it makes the system
vulnerable to DMA attacks.
- nobounce [Default off]
- Disable bounce buffer for untrusted devices such as
- the Thunderbolt devices. This will treat the untrusted
- devices as the trusted ones, hence might expose security
- risks of DMA attacks.
intel_idle.max_cstate= [KNL,HW,ACPI,X86]
0 disables intel_idle and fall back on acpi_idle.
1 to 9 specify maximum depth of C-state.
- intel_pstate= [X86]
+ intel_pstate= [X86,EARLY]
disable
Do not enable intel_pstate as the default
scaling driver for the supported processors
+ active
+ Use the intel_pstate driver to bypass the scaling
+ governors layer of cpufreq and provide its own
+ algorithms for P-state selection. There are two
+ P-state selection algorithms provided by
+ intel_pstate in the active mode: powersave and
+ performance. The way they both operate depends
+ on whether or not the hardware managed P-states
+ (HWP) feature has been enabled in the processor
+ and possibly on the processor model.
passive
Use intel_pstate as a scaling driver, but configure it
to work with generic cpufreq governors (instead of
@@ -1887,7 +2244,7 @@
Allow per-logical-CPU P-State performance control limits using
cpufreq sysfs interface
- intremap= [X86-64, Intel-IOMMU]
+ intremap= [X86-64,Intel-IOMMU,EARLY]
on enable Interrupt Remapping (default)
off disable Interrupt Remapping
nosid disable Source ID checking
@@ -1899,7 +2256,7 @@
strict regions from userspace.
relaxed
- iommu= [x86]
+ iommu= [X86,EARLY]
off
force
noforce
@@ -1909,12 +2266,20 @@
merge
nomerge
soft
- pt [x86]
- nopt [x86]
+ pt [X86]
+ nopt [X86]
nobypass [PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.
- iommu.strict= [ARM64] Configure TLB invalidation behaviour
+ iommu.forcedac= [ARM64,X86,EARLY] Control IOVA allocation for PCI devices.
+ Format: { "0" | "1" }
+ 0 - Try to allocate a 32-bit DMA address first, before
+ falling back to the full range if needed.
+ 1 - Allocate directly from the full usable range,
+ forcing Dual Address Cycle for PCI cards supporting
+ greater than 32-bit addressing.
+
+ iommu.strict= [ARM64,X86,S390,EARLY] Configure TLB invalidation behaviour
Format: { "0" | "1" }
0 - Lazy mode.
Request that DMA unmap operations use deferred
@@ -1922,22 +2287,25 @@
throughput at the cost of reduced device isolation.
Will fall back to strict mode if not supported by
the relevant IOMMU driver.
- 1 - Strict mode (default).
+ 1 - Strict mode.
DMA unmap operations invalidate IOMMU hardware TLBs
synchronously.
+ unset - Use value of CONFIG_IOMMU_DEFAULT_DMA_{LAZY,STRICT}.
+ Note: on x86, strict mode specified via one of the
+ legacy driver-specific options takes precedence.
iommu.passthrough=
- [ARM64, X86] Configure DMA to bypass the IOMMU by default.
+ [ARM64,X86,EARLY] Configure DMA to bypass the IOMMU by default.
Format: { "0" | "1" }
0 - Use IOMMU translation for DMA.
1 - Bypass the IOMMU for DMA.
unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
- io7= [HW] IO7 for Marvel based alpha systems
+ io7= [HW] IO7 for Marvel-based Alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
- io_delay= [X86] I/O delay method
+ io_delay= [X86,EARLY] I/O delay method
0x80
Standard port 0x80 based delay
0xed
@@ -1950,28 +2318,28 @@
ip= [IP_PNP]
See Documentation/admin-guide/nfs/nfsroot.rst.
- ipcmni_extend [KNL] Extend the maximum number of unique System V
+ ipcmni_extend [KNL,EARLY] Extend the maximum number of unique System V
IPC identifiers from 32,768 to 16,777,216.
irqaffinity= [SMP] Set the default irq affinity mask
The argument is a cpu list, as described above.
irqchip.gicv2_force_probe=
- [ARM, ARM64]
+ [ARM,ARM64,EARLY]
Format: <bool>
Force the kernel to look for the second 4kB page
of a GICv2 controller even if the memory range
exposed by the device tree is too small.
irqchip.gicv3_nolpi=
- [ARM, ARM64]
+ [ARM,ARM64,EARLY]
Force the kernel to ignore the availability of
LPIs (and by consequence ITSs). Intended for system
that use the kernel as a bootloader, and thus want
to let secondary kernels in charge of setting up
LPIs.
- irqchip.gicv3_pseudo_nmi= [ARM64]
+ irqchip.gicv3_pseudo_nmi= [ARM64,EARLY]
Enables support for pseudo-NMIs in the kernel. This
requires the kernel to be built with
CONFIG_ARM64_PSEUDO_NMI.
@@ -2053,44 +2421,78 @@
iucv= [HW,NET]
- ivrs_ioapic [HW,X86_64]
+ ivrs_ioapic [HW,X86-64]
Provide an override to the IOAPIC-ID<->DEVICE-ID
- mapping provided in the IVRS ACPI table. For
- example, to map IOAPIC-ID decimal 10 to
- PCI device 00:14.0 write the parameter as:
+ mapping provided in the IVRS ACPI table.
+ By default, the PCI segment is 0 and can be omitted.
+
+ For example, to map IOAPIC-ID decimal 10 to
+ PCI segment 0x1 and PCI device 00:14.0,
+ write the parameter as:
+ ivrs_ioapic=10@0001:00:14.0
+
+ Deprecated formats:
+ * To map IOAPIC-ID decimal 10 to PCI device 00:14.0
+ write the parameter as:
ivrs_ioapic[10]=00:14.0
+ * To map IOAPIC-ID decimal 10 to PCI segment 0x1 and
+ PCI device 00:14.0 write the parameter as:
+ ivrs_ioapic[10]=0001:00:14.0
- ivrs_hpet [HW,X86_64]
+ ivrs_hpet [HW,X86-64]
Provide an override to the HPET-ID<->DEVICE-ID
- mapping provided in the IVRS ACPI table. For
- example, to map HPET-ID decimal 0 to
- PCI device 00:14.0 write the parameter as:
+ mapping provided in the IVRS ACPI table.
+ By default, the PCI segment is 0 and can be omitted.
+
+ For example, to map HPET-ID decimal 10 to
+ PCI segment 0x1 and PCI device 00:14.0,
+ write the parameter as:
+ ivrs_hpet=10@0001:00:14.0
+
+ Deprecated formats:
+ * To map HPET-ID decimal 0 to PCI device 00:14.0
+ write the parameter as:
ivrs_hpet[0]=00:14.0
+ * To map HPET-ID decimal 10 to PCI segment 0x1 and
+ PCI device 00:14.0 write the parameter as:
+ ivrs_hpet[10]=0001:00:14.0
- ivrs_acpihid [HW,X86_64]
+ ivrs_acpihid [HW,X86-64]
Provide an override to the ACPI-HID:UID<->DEVICE-ID
- mapping provided in the IVRS ACPI table. For
- example, to map UART-HID:UID AMD0020:0 to
- PCI device 00:14.5 write the parameter as:
+ mapping provided in the IVRS ACPI table.
+ By default, the PCI segment is 0 and can be omitted.
+
+ For example, to map UART-HID:UID AMD0020:0 to
+ PCI segment 0x1 and PCI device ID 00:14.5,
+ write the parameter as:
+ ivrs_acpihid=AMD0020:0@0001:00:14.5
+
+ Deprecated formats:
+ * To map UART-HID:UID AMD0020:0 to PCI segment 0 and
+ PCI device ID 00:14.5, write the parameter as:
ivrs_acpihid[00:14.5]=AMD0020:0
+ * To map UART-HID:UID AMD0020:0 to PCI segment 0x1 and
+ PCI device ID 00:14.5, write the parameter as:
+ ivrs_acpihid[0001:00:14.5]=AMD0020:0
js= [HW,JOY] Analog joystick
See Documentation/input/joydev/joystick.rst.
- nokaslr [KNL]
- When CONFIG_RANDOMIZE_BASE is set, this disables
- kernel and module base offset ASLR (Address Space
- Layout Randomization).
-
kasan_multi_shot
[KNL] Enforce KASAN (Kernel Address Sanitizer) to print
report on every invalid memory access. Without this
parameter KASAN will print report only for the first
invalid access.
- keepinitrd [HW,ARM]
+ keep_bootcon [KNL,EARLY]
+ Do not unregister boot console at start. This is only
+ useful for debugging when something happens in the window
+ between unregistering the boot console and initializing
+ the real console.
+
+ keepinitrd [HW,ARM] See retain_initrd.
- kernelcore= [KNL,X86,IA-64,PPC]
+ kernelcore= [KNL,X86,IA-64,PPC,EARLY]
Format: nn[KMGTPE] | nn% | "mirror"
This parameter specifies the amount of memory usable by
the kernel for non-movable allocations. The requested
@@ -2115,7 +2517,7 @@
for Movable pages. "nn[KMGTPE]", "nn%", and "mirror"
are exclusive, so you cannot specify multiple forms.
- kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
+ kgdbdbgp= [KGDB,HW,EARLY] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
port as it is probed via PCI. The poll interval is
@@ -2136,7 +2538,7 @@
kms, kbd format: kms,kbd
kms, kbd and serial format: kms,kbd,<ser_dev>[,baud]
- kgdboc_earlycon= [KGDB,HW]
+ kgdboc_earlycon= [KGDB,HW,EARLY]
If the boot console provides the ability to read
characters and can work in polling mode, you can use
this parameter to tell kgdb to use it as a backend
@@ -2151,14 +2553,14 @@
blank and the first boot console that implements
read() will be picked.
- kgdbwait [KGDB] Stop kernel execution and enter the
+ kgdbwait [KGDB,EARLY] Stop kernel execution and enter the
kernel debugger at the earliest opportunity.
- kmac= [MIPS] korina ethernet MAC address.
+ kmac= [MIPS] Korina ethernet MAC address.
Configure the RouterBoard 532 series on-chip
Ethernet adapter MAC address.
- kmemleak= [KNL] Boot-time kmemleak enable/disable
+ kmemleak= [KNL,EARLY] Boot-time kmemleak enable/disable
Valid arguments: on, off
Default: on
Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,
@@ -2177,22 +2579,49 @@
See also Documentation/trace/kprobetrace.rst "Kernel
Boot Parameter" section.
- kpti= [ARM64] Control page table isolation of user
- and kernel address spaces.
+ kpti= [ARM64,EARLY] Control page table isolation of
+ user and kernel address spaces.
Default: enabled on cores which need mitigation.
0: force disabled
1: force enabled
+ kunit.enable= [KUNIT] Enable executing KUnit tests. Requires
+ CONFIG_KUNIT to be set to be fully enabled. The
+ default value can be overridden via
+ KUNIT_DEFAULT_ENABLED.
+ Default is 1 (enabled)
+
kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
Default is 0 (don't ignore, but inject #GP)
+ kvm.eager_page_split=
+ [KVM,X86] Controls whether or not KVM will try to
+ proactively split all huge pages during dirty logging.
+ Eager page splitting reduces interruptions to vCPU
+ execution by eliminating the write-protection faults
+ and MMU lock contention that would otherwise be
+ required to split huge pages lazily.
+
+ VM workloads that rarely perform writes or that write
+ only to a small region of VM memory may benefit from
+ disabling eager page splitting to allow huge pages to
+ still be used for reads.
+
+ The behavior of eager page splitting depends on whether
+ KVM_DIRTY_LOG_INITIALLY_SET is enabled or disabled. If
+ disabled, all huge pages in a memslot will be eagerly
+ split when dirty logging is enabled on that memslot. If
+ enabled, eager page splitting will be performed during
+ the KVM_CLEAR_DIRTY_LOG ioctl, and only for the pages being
+ cleared.
+
+ Eager page splitting is only supported when kvm.tdp_mmu=Y.
+
+ Default is Y (on).
+
kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
Default is false (don't support).
- kvm.mmu_audit= [KVM] This is a R/W parameter which allows audit
- KVM MMU at runtime.
- Default is 0 (off)
-
kvm.nx_huge_pages=
[KVM] Controls the software workaround for the
X86_BUG_ITLB_MULTIHIT bug.
@@ -2210,51 +2639,95 @@
[KVM] Controls how many 4KiB pages are periodically zapped
back to huge pages. 0 disables the recovery, otherwise if
the value is N KVM will zap 1/Nth of the 4KiB pages every
- minute. The default is 60.
+ period (see below). The default is 60.
- kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM.
- Default is 1 (enabled)
+ kvm.nx_huge_pages_recovery_period_ms=
+ [KVM] Controls the time period at which KVM zaps 4KiB pages
+ back to huge pages. If the value is a non-zero N, KVM will
+ zap a portion (see ratio above) of the pages every N msecs.
+ If the value is 0 (the default), KVM will pick a period based
+ on the ratio, such that a page is zapped after 1 hour on average.
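+ For example (values are illustrative), with
+ kvm.nx_huge_pages_recovery_ratio=10 and
+ kvm.nx_huge_pages_recovery_period_ms=1000, KVM zaps
+ 1/10th of the eligible 4KiB pages every second.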
+
+ kvm-amd.nested= [KVM,AMD] Control nested virtualization feature in
+ KVM/SVM. Default is 1 (enabled).
+
+ kvm-amd.npt= [KVM,AMD] Control KVM's use of Nested Page Tables,
+ a.k.a. Two-Dimensional Page Tables. Default is 1
+ (enabled). Disabled by KVM if hardware lacks support
+ for NPT.
+
+ kvm-arm.mode=
+ [KVM,ARM,EARLY] Select one of KVM/arm64's modes of
+ operation.
+
+ none: Forcefully disable KVM.
+
+ nvhe: Standard nVHE-based mode, without support for
+ protected guests.
+
+ protected: nVHE-based mode with support for guests whose
+ state is kept private from the host.
- kvm-amd.npt= [KVM,AMD] Disable nested paging (virtualized MMU)
- for all guests.
- Default is 1 (enabled) if in 64-bit or 32-bit PAE mode.
+ nested: VHE-based mode with support for nested
+ virtualization. Requires at least ARMv8.3
+ hardware.
+
+ Defaults to VHE/nVHE based on hardware support. Setting
+ mode to "protected" will disable kexec and hibernation
+ for the host. "nested" is experimental and should be
+ used with extreme caution.
kvm-arm.vgic_v3_group0_trap=
- [KVM,ARM] Trap guest accesses to GICv3 group-0
+ [KVM,ARM,EARLY] Trap guest accesses to GICv3 group-0
system registers
kvm-arm.vgic_v3_group1_trap=
- [KVM,ARM] Trap guest accesses to GICv3 group-1
+ [KVM,ARM,EARLY] Trap guest accesses to GICv3 group-1
system registers
kvm-arm.vgic_v3_common_trap=
- [KVM,ARM] Trap guest accesses to GICv3 common
+ [KVM,ARM,EARLY] Trap guest accesses to GICv3 common
system registers
kvm-arm.vgic_v4_enable=
- [KVM,ARM] Allow use of GICv4 for direct injection of
- LPIs.
+ [KVM,ARM,EARLY] Allow use of GICv4 for direct
+ injection of LPIs.
+
+ kvm_cma_resv_ratio=n [PPC,EARLY]
+ Reserves given percentage from system memory area for
+ contiguous memory allocation for KVM hash pagetable
+ allocation.
+ By default it reserves 5% of total system memory.
+ Format: <integer>
+ Default: 5
- kvm-intel.ept= [KVM,Intel] Disable extended page tables
- (virtualized MMU) support on capable Intel chips.
- Default is 1 (enabled)
+ kvm-intel.ept= [KVM,Intel] Control KVM's use of Extended Page Tables,
+ a.k.a. Two-Dimensional Page Tables. Default is 1
+ (enabled). Disabled by KVM if hardware lacks support
+ for EPT.
kvm-intel.emulate_invalid_guest_state=
- [KVM,Intel] Enable emulation of invalid guest states
- Default is 0 (disabled)
+ [KVM,Intel] Control whether to emulate invalid guest
+ state. Ignored if kvm-intel.enable_unrestricted_guest=1,
+ as guest state is never invalid for unrestricted
+ guests. This param doesn't apply to nested guests (L2),
+ as KVM never emulates invalid L2 guest state.
+ Default is 1 (enabled).
kvm-intel.flexpriority=
- [KVM,Intel] Disable FlexPriority feature (TPR shadow).
- Default is 1 (enabled)
+ [KVM,Intel] Control KVM's use of FlexPriority feature
+ (TPR shadow). Default is 1 (enabled). Disabled by KVM if
+ hardware lacks support for it.
kvm-intel.nested=
- [KVM,Intel] Enable VMX nesting (nVMX).
- Default is 0 (disabled)
+ [KVM,Intel] Control nested virtualization feature in
+ KVM/VMX. Default is 1 (enabled).
kvm-intel.unrestricted_guest=
- [KVM,Intel] Disable unrestricted guest feature
- (virtualized real and unpaged mode) on capable
- Intel chips. Default is 1 (enabled)
+ [KVM,Intel] Control KVM's use of unrestricted guest
+ feature (virtualized real and unpaged mode). Default
+ is 1 (enabled). Disabled by KVM if EPT is disabled or
+ hardware lacks support for it.
kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
CVE-2018-3620.
@@ -2268,11 +2741,29 @@
Default is cond (do L1 cache flush in specific instances)
- kvm-intel.vpid= [KVM,Intel] Disable Virtual Processor Identification
- feature (tagged TLBs) on capable Intel chips.
- Default is 1 (enabled)
+ kvm-intel.vpid= [KVM,Intel] Control KVM's use of Virtual Processor
+ Identification feature (tagged TLBs). Default is 1
+ (enabled). Disabled by KVM if hardware lacks support
+ for it.
+
+ l1d_flush= [X86,INTEL,EARLY]
+ Control mitigation for L1D based snooping vulnerability.
- l1tf= [X86] Control mitigation of the L1TF vulnerability on
+ Certain CPUs are vulnerable to an exploit against CPU
+ internal buffers which can forward information to a
+ disclosure gadget under certain conditions.
+
+ In vulnerable processors, the speculatively
+ forwarded data can be used in a cache side channel
+ attack, to access data to which the attacker does
+ not have direct access.
+
+ This parameter controls the mitigation. The
+ options are:
+
+ on - enable the interface for the mitigation
+
+ l1tf= [X86,EARLY] Control mitigation of the L1TF vulnerability on
affected CPUs
The kernel PTE inversion protection is unconditionally
@@ -2341,14 +2832,15 @@
l3cr= [PPC]
- lapic [X86-32,APIC] Enable the local APIC even if BIOS
+ lapic [X86-32,APIC,EARLY] Enable the local APIC even if BIOS
disabled it.
- lapic= [x86,APIC] "notscdeadline" Do not use TSC deadline
+ lapic= [X86,APIC] Do not use TSC deadline
value for LAPIC timer one-shot implementation. Default
back to the programmable timer unit in the LAPIC.
+ Format: notscdeadline
- lapic_timer_c2_ok [X86,APIC] trust the local apic timer
+ lapic_timer_c2_ok [X86,APIC,EARLY] trust the local apic timer
in C2 power state.
libata.dma= [LIBATA] DMA control
@@ -2367,14 +2859,14 @@
when set.
Format: <int>
- libata.force= [LIBATA] Force configurations. The format is comma
- separated list of "[ID:]VAL" where ID is
- PORT[.DEVICE]. PORT and DEVICE are decimal numbers
- matching port, link or device. Basically, it matches
- the ATA ID string printed on console by libata. If
- the whole ID part is omitted, the last PORT and DEVICE
- values are used. If ID hasn't been specified yet, the
- configuration applies to all ports, links and devices.
+ libata.force= [LIBATA] Force configurations. The format is a comma-
+ separated list of "[ID:]VAL" where ID is PORT[.DEVICE].
+ PORT and DEVICE are decimal numbers matching port, link
+ or device. Basically, it matches the ATA ID string
+ printed on console by libata. If the whole ID part is
+ omitted, the last PORT and DEVICE values are used. If
+ ID hasn't been specified yet, the configuration applies
+ to all ports, links and devices.
If only DEVICE is omitted, the parameter applies to
the port and all links and devices behind it. DEVICE
@@ -2384,7 +2876,7 @@
host link and device attached to it.
The VAL specifies the configuration to force. As long
- as there's no ambiguity shortcut notation is allowed.
+ as there is no ambiguity, shortcut notation is allowed.
For example, both 1.5 and 1.5G would work for 1.5Gbps.
The following configurations can be forced.
@@ -2397,29 +2889,68 @@
udma[/][16,25,33,44,66,100,133] notation is also
allowed.
+ * nohrst, nosrst, norst: suppress hard, soft and both
+ resets.
+
+ * rstonce: only attempt one reset during hot-unplug
+ link recovery.
+
+ * [no]dbdelay: Enable or disable the extra 200ms delay
+ before debouncing a link PHY and device presence
+ detection.
+
* [no]ncq: Turn on or off NCQ.
- * [no]ncqtrim: Turn off queued DSM TRIM.
+ * [no]ncqtrim: Enable or disable queued DSM TRIM.
+
+ * [no]ncqati: Enable or disable NCQ trim on ATI chipset.
+
+ * [no]trim: Enable or disable (unqueued) TRIM.
+
+ * trim_zero: Indicate that TRIM command zeroes data.
+
+ * max_trim_128m: Set 128M maximum trim size limit.
+
+ * [no]dma: Turn on or off DMA transfers.
+
+ * atapi_dmadir: Enable ATAPI DMADIR bridge support.
+
+ * atapi_mod16_dma: Enable the use of ATAPI DMA for
+ commands that are not a multiple of 16 bytes.
+
+ * [no]dmalog: Enable or disable the use of the
+ READ LOG DMA EXT command to access logs.
+
+ * [no]iddevlog: Enable or disable access to the
+ identify device data log.
+
+ * [no]logdir: Enable or disable access to the general
+ purpose log directory.
+
+ * max_sec_128: Set transfer size limit to 128 sectors.
+
+ * max_sec_1024: Set or clear transfer size limit to
+ 1024 sectors.
+
+ * max_sec_lba48: Set or clear transfer size limit to
+ 65535 sectors.
- * nohrst, nosrst, norst: suppress hard, soft
- and both resets.
+ * [no]lpm: Enable or disable link power management.
- * rstonce: only attempt one reset during
- hot-unplug link recovery
+ * [no]setxfer: Indicate if transfer speed mode setting
+ should be skipped.
- * dump_id: dump IDENTIFY data.
+ * [no]fua: Disable or enable FUA (Force Unit Access)
+ support for devices supporting this feature.
- * atapi_dmadir: Enable ATAPI DMADIR bridge support
+ * dump_id: Dump IDENTIFY data.
* disable: Disable this device.
If there are multiple matching configurations changing
the same attribute, the last one is used.
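+
+ As an illustration only (the IDs and values below are
+ arbitrary), a combined setting following the format
+ described above could look like:
+
+ Example: libata.force=1.5G,2:noncq,3.1:disable
+
+ which would limit all links to 1.5Gbps, turn off NCQ
+ on port 2, and disable device 1 on port 3.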
- memblock=debug [KNL] Enable memblock debug messages.
-
- load_ramdisk= [RAM] List of ramdisks to load from floppy
- See Documentation/admin-guide/blockdev/ramdisk.rst.
+ load_ramdisk= [RAM] [Deprecated]
lockd.nlm_grace_period=P [NFS] Assign grace period.
Format: <integer>
@@ -2433,7 +2964,7 @@
lockd.nlm_udpport=M [NFS] Assign UDP port.
Format: <integer>
- lockdown= [SECURITY]
+ lockdown= [SECURITY,EARLY]
{ integrity | confidentiality }
Enable the kernel lockdown feature. If set to
integrity, kernel features that allow userland to
@@ -2442,6 +2973,38 @@
to extract confidential information from the kernel
are also disabled.
+ locktorture.acq_writer_lim= [KNL]
+ Set the time limit in jiffies for a lock
+ acquisition. Acquisitions exceeding this limit
+ will result in a splat once they do complete.
+
+ locktorture.bind_readers= [KNL]
+ Specify the list of CPUs to which the readers are
+ to be bound.
+
+ locktorture.bind_writers= [KNL]
+ Specify the list of CPUs to which the writers are
+ to be bound.
+
+ locktorture.call_rcu_chains= [KNL]
+ Specify the number of self-propagating call_rcu()
+ chains to set up. These are used to ensure that
+ there is a high probability of an RCU grace period
+ in progress at any given time. Defaults to 0,
+ which disables these call_rcu() chains.
+
+ locktorture.long_hold= [KNL]
+ Specify the duration in milliseconds for the
+ occasional long-duration lock hold time. Defaults
+ to 100 milliseconds. Select 0 to disable.
+
+ locktorture.nested_locks= [KNL]
+ Specify the maximum lock nesting depth that
+ locktorture is to exercise, up to a limit of 8
+ (MAX_NESTED_LOCKS). Specify zero to disable.
+ Note that this parameter is ineffective on types
+ of locks that do not support nested acquisition.
+
locktorture.nreaders_stress= [KNL]
Set the number of locking read-acquisition kthreads.
Defaults to being automatically set based on the
@@ -2457,6 +3020,25 @@
Set time (s) between CPU-hotplug operations, or
zero to disable CPU-hotplug testing.
+ locktorture.rt_boost= [KNL]
+ Do periodic testing of real-time lock priority
+ boosting. Select 0 to disable, 1 to boost
+ only rt_mutex, and 2 to boost unconditionally.
+ Defaults to 2, which might seem to be an
+ odd choice, but which should be harmless for
+ non-real-time spinlocks, due to their disabling
+ of preemption. Note that non-realtime mutexes
+ disable boosting.
+
+ locktorture.rt_boost_factor= [KNL]
+ Number that determines how often and for how
+ long priority boosting is exercised. This is
+ scaled down by the number of writers, so that the
+ number of boosts per unit time remains roughly
+ constant as the number of writers increases.
+ On the other hand, the duration of each boost
+ increases with the number of writers.
+
locktorture.shuffle_interval= [KNL]
Set task-shuffle interval (jiffies). Shuffling
tasks allows some CPUs to go into dyntick-idle
@@ -2482,10 +3064,15 @@
locktorture.verbose= [KNL]
Enable additional printk() statements.
+ locktorture.writer_fifo= [KNL]
+ Run the write-side locktorture kthreads at
+ sched_set_fifo() real-time priority.
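+
+ When the locktorture test is built as a module, the
+ same parameters can be passed at load time without the
+ "locktorture." prefix; the values below are only an
+ illustrative sketch:
+
+ Example: modprobe locktorture long_hold=200 shuffle_interval=5 verbose=1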
+
logibm.irq= [HW,MOUSE] Logitech Bus Mouse Driver
Format: <irq>
- loglevel= All Kernel Messages with a loglevel smaller than the
+ loglevel= [KNL,EARLY]
+ All Kernel Messages with a loglevel smaller than the
console loglevel will be printed to the console. It can
also be changed with klogd or other programs. The
loglevels are defined as follows:
@@ -2499,13 +3086,15 @@
6 (KERN_INFO) informational
7 (KERN_DEBUG) debug-level messages
- log_buf_len=n[KMG] Sets the size of the printk ring buffer,
- in bytes. n must be a power of two and greater
- than the minimal size. The minimal size is defined
- by LOG_BUF_SHIFT kernel config parameter. There is
- also CONFIG_LOG_CPU_MAX_BUF_SHIFT config parameter
- that allows to increase the default size depending on
- the number of CPUs. See init/Kconfig for more details.
+ log_buf_len=n[KMG] [KNL,EARLY]
+ Sets the size of the printk ring buffer, in bytes.
+ n must be a power of two and greater than the
+ minimal size. The minimal size is defined by
+ LOG_BUF_SHIFT kernel config parameter. There
+ is also the CONFIG_LOG_CPU_MAX_BUF_SHIFT config
+ parameter, which allows the default size to be increased
+ depending on the number of CPUs. See init/Kconfig
+ for more details.
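+
+ For example, a 4 MiB ring buffer (a power of two and
+ larger than the minimal size in common configurations)
+ can be requested with:
+
+ Example: log_buf_len=4M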
logo.nologo [FB] Disables display of the built-in Linux logo.
This may be used to provide more screen space for
@@ -2556,14 +3145,14 @@
(machvec) in a generic kernel.
Example: machvec=hpzx1
- machtype= [Loongson] Share the same kernel image file between different
- yeeloong laptop.
+ machtype= [Loongson] Share the same kernel image file between
+ different yeeloong laptops.
Example: machtype=lemote-yeeloong-2f-7inch
- max_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory greater
+ max_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory greater
than or equal to this physical address is ignored.
- maxcpus= [SMP] Maximum number of processors that an SMP kernel
+ maxcpus= [SMP,EARLY] Maximum number of processors that an SMP kernel
will bring up during bootup. maxcpus=n : n >= 0 limits
the kernel to bring up 'n' processors. After bootup,
you can still bring up the other plugged CPUs by executing
@@ -2581,7 +3170,7 @@
mce [X86-32] Machine Check Exception
- mce=option [X86-64] See Documentation/x86/x86_64/boot-options.rst
+ mce=option [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
md= [HW] RAID subsystems devices and level
See Documentation/admin-guide/md.rst.
@@ -2590,7 +3179,7 @@
Format: <first>,<last>
Specifies range of consoles to be captured by the MDA.
- mds= [X86,INTEL]
+ mds= [X86,INTEL,EARLY]
Control mitigation for the Micro-architectural Data
Sampling (MDS) vulnerability.
@@ -2622,13 +3211,24 @@
For details see: Documentation/admin-guide/hw-vuln/mds.rst
- mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory
- Amount of memory to be used in cases as follows:
+ mem=nn[KMG] [HEXAGON,EARLY] Set the memory size.
+ Must be specified, otherwise memory size will be 0.
+
+ mem=nn[KMG] [KNL,BOOT,EARLY] Force usage of a specific amount
+ of memory. Amount of memory to be used in cases
+ as follows:
1 for test;
2 when the kernel is not able to see the whole system memory;
3 memory that lies after 'mem=' boundary is excluded from
the hypervisor, then assigned to KVM guests.
+ 4 to limit the memory available for kdump kernel.
+
+ [ARC,MICROBLAZE] - the limit applies only to low memory,
+ high memory is not affected.
+
+ [ARM64] - only limits memory covered by the linear
+ mapping. The NOMAP regions are not affected.
[X86] Work as limiting max address. Use together
with memmap= to avoid physical address space collisions.
@@ -2639,14 +3239,24 @@
in above case 3, memory may need be hot added after boot
if system memory of hypervisor is not sufficient.
+ mem=nn[KMG]@ss[KMG]
+ [ARM,MIPS,EARLY] - override the memory layout
+ reported by firmware.
+ Define a memory region of size nn[KMG] starting at
+ ss[KMG].
+ Multiple different regions can be specified with
+ multiple mem= parameters on the command line.
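+
+ A purely illustrative layout describing two 512M
+ regions, one starting at 1G and one at 3G, would be:
+
+ Example: mem=512M@1G mem=512M@3G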
+
mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel
memory.
+ memblock=debug [KNL,EARLY] Enable memblock debug messages.
+
memchunk=nn[KMG]
[KNL,SH] Allow user to override the default size for
per-device physically contiguous DMA buffers.
- memhp_default_state=online/offline
+ memhp_default_state=online/offline/online_kernel/online_movable
[KNL] Set the initial state for the memory hotplug
onlining policy. If not specified, the default value is
set according to the
@@ -2654,14 +3264,14 @@
option.
See Documentation/admin-guide/mm/memory-hotplug.rst.
- memmap=exactmap [KNL,X86] Enable setting of an exact
+ memmap=exactmap [KNL,X86,EARLY] Enable setting of an exact
E820 memory map, as specified by the user.
Such memmap=exactmap lines can be constructed based on
BIOS output or other requirements. See the memmap=nn@ss
option description.
memmap=nn[KMG]@ss[KMG]
- [KNL] Force usage of a specific region of memory.
+ [KNL,X86,MIPS,XTENSA,EARLY] Force usage of a specific region of memory.
Region of memory to be used is from ss to ss+nn.
If @ss[KMG] is omitted, it is equivalent to mem=nn[KMG],
which limits max address to nn[KMG].
@@ -2671,11 +3281,11 @@
memmap=100M@2G,100M#3G,1G!1024G
memmap=nn[KMG]#ss[KMG]
- [KNL,ACPI] Mark specific memory as ACPI data.
+ [KNL,ACPI,EARLY] Mark specific memory as ACPI data.
Region of memory to be marked is from ss to ss+nn.
memmap=nn[KMG]$ss[KMG]
- [KNL,ACPI] Mark specific memory as reserved.
+ [KNL,ACPI,EARLY] Mark specific memory as reserved.
Region of memory to be reserved is from ss to ss+nn.
Example: Exclude memory from 0x18690000-0x1869ffff
memmap=64K$0x18690000
@@ -2685,14 +3295,14 @@
like Grub2, otherwise '$' and the following number
will be eaten.
- memmap=nn[KMG]!ss[KMG]
+ memmap=nn[KMG]!ss[KMG]
+ [KNL,X86,EARLY] Mark specific memory as protected.
Region of memory to be used, from ss to ss+nn.
The memory region may be marked as e820 type 12 (0xc)
and is NVDIMM or ADR memory.
memmap=<size>%<offset>-<oldtype>+<newtype>
- [KNL,ACPI] Convert memory within the specified region
+ [KNL,ACPI,EARLY] Convert memory within the specified region
from <oldtype> to <newtype>. If "-<oldtype>" is left
out, the whole region will be marked as <newtype>,
even if previously unavailable. If "+<newtype>" is left
@@ -2700,7 +3310,7 @@
specified as e820 types, e.g., 1 = RAM, 2 = reserved,
3 = ACPI, 12 = PRAM.
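+
+ As a sketch of the syntax only, converting 100M of RAM
+ (type 1) starting at 16G into PRAM (type 12) would be
+ written as:
+
+ Example: memmap=100M%16G-1+12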
- memory_corruption_check=0/1 [X86]
+ memory_corruption_check=0/1 [X86,EARLY]
Some BIOSes seem to corrupt the first 64k of
memory when doing things like suspend/resume.
Setting this option will scan the memory
@@ -2712,18 +3322,37 @@
affects the same memory, you can use memmap=
to prevent the kernel from using that memory.
- memory_corruption_check_size=size [X86]
+ memory_corruption_check_size=size [X86,EARLY]
By default it checks for corruption in the low
64k, making this memory unavailable for normal
use. Use this parameter to scan for
corruption in more or less memory.
- memory_corruption_check_period=seconds [X86]
+ memory_corruption_check_period=seconds [X86,EARLY]
By default it checks for corruption every 60
seconds. Use this parameter to check at some
other rate. 0 disables periodic checking.
- memtest= [KNL,X86,ARM,PPC] Enable memtest
+ memory_hotplug.memmap_on_memory
+ [KNL,X86,ARM] Boolean flag to enable this feature.
+ Format: {on | off (default)}
+ When enabled, runtime hotplugged memory will
+ allocate its internal metadata (struct pages,
+ those vmemmap pages cannot be optimized even
+ if hugetlb_free_vmemmap is enabled) from the
+ hotadded memory, which allows a lot of memory to
+ be hotadded without requiring additional memory
+ to do so.
+ This feature is disabled by default because it
+ has some implications for large (e.g. GB)
+ allocations in some configurations (e.g. small
+ memory blocks).
+ The state of the flag can be read in
+ /sys/module/memory_hotplug/parameters/memmap_on_memory.
+ Note that even when enabled, there are a few cases where
+ the feature is not effective.
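+
+ For example, the feature can be requested at boot with
+ memory_hotplug.memmap_on_memory=on and its resulting
+ state checked afterwards via the sysfs file noted above:
+
+ Example: cat /sys/module/memory_hotplug/parameters/memmap_on_memory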
+
+ memtest= [KNL,X86,ARM,M68K,PPC,RISCV,EARLY] Enable memtest
Format: <integer>
default : 0 <disable>
Specifies the number of memtest passes to be
@@ -2735,13 +3364,11 @@
mem_encrypt= [X86-64] AMD Secure Memory Encryption (SME) control
Valid arguments: on, off
- Default (depends on kernel configuration option):
- on (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y)
- off (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=n)
+ Default: off
mem_encrypt=on: Activate SME
mem_encrypt=off: Do not activate SME
- Refer to Documentation/virt/kvm/amd-memory-encryption.rst
+ Refer to Documentation/virt/kvm/x86/amd-memory-encryption.rst
for details on when memory encryption can be activated.
mem_sleep_default= [SUSPEND] Default system suspend mode:
@@ -2750,9 +3377,6 @@
deep - Suspend-To-RAM or equivalent (if supported)
See Documentation/admin-guide/pm/sleep-states.rst.
- meye.*= [HW] Set MotionEye Camera parameters
- See Documentation/admin-guide/media/meye.rst.
-
mfgpt_irq= [IA-32] Specify the IRQ to use for the
Multi-Function General Purpose Timers on AMD Geode
platforms.
@@ -2764,7 +3388,12 @@
mga= [HW,DRM]
- min_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory below this
+ microcode.force_minrev= [X86]
+ Format: <bool>
+ Enable or disable the microcode minimal revision
+ enforcement for the runtime microcode loader.
+
+ min_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory below this
physical address is ignored.
mini2440= [ARM,HW,KNL]
@@ -2786,10 +3415,10 @@
touchscreen support is not enabled in the mainstream
kernel as of 2.6.30, a preliminary port can be found
in the "bleeding edge" mini2440 support kernel at
- http://repo.or.cz/w/linux-2.6/mini2440.git
+ https://repo.or.cz/w/linux-2.6/mini2440.git
mitigations=
- [X86,PPC,S390,ARM64] Control optional mitigations for
+ [X86,PPC,S390,ARM64,EARLY] Control optional mitigations for
CPU vulnerabilities. This is a set of curated,
arch-independent options, each of which is an
aggregation of existing arch-specific options.
@@ -2798,18 +3427,27 @@
Disable all optional CPU mitigations. This
improves system performance, but it may also
expose users to several CPU vulnerabilities.
- Equivalent to: nopti [X86,PPC]
- kpti=0 [ARM64]
- nospectre_v1 [X86,PPC]
+ Equivalent to: if nokaslr then kpti=0 [ARM64]
+ gather_data_sampling=off [X86]
+ kvm.nx_huge_pages=off [X86]
+ l1tf=off [X86]
+ mds=off [X86]
+ mmio_stale_data=off [X86]
+ no_entry_flush [PPC]
+ no_uaccess_flush [PPC]
nobp=0 [S390]
+ nopti [X86,PPC]
+ nospectre_bhb [ARM64]
+ nospectre_v1 [X86,PPC]
nospectre_v2 [X86,PPC,S390,ARM64]
- spectre_v2_user=off [X86]
+ reg_file_data_sampling=off [X86]
+ retbleed=off [X86]
+ spec_rstack_overflow=off [X86]
spec_store_bypass_disable=off [X86,PPC]
+ spectre_v2_user=off [X86]
+ srbds=off [X86,INTEL]
ssbd=force-off [ARM64]
- l1tf=off [X86]
- mds=off [X86]
tsx_async_abort=off [X86]
- kvm.nx_huge_pages=off [X86]
Exceptions:
This does not have any effect on
@@ -2831,15 +3469,73 @@
Equivalent to: l1tf=flush,nosmt [X86]
mds=full,nosmt [X86]
tsx_async_abort=full,nosmt [X86]
+ mmio_stale_data=full,nosmt [X86]
+ retbleed=auto,nosmt [X86]
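+
+ On many distributions the chosen value can be made
+ persistent by appending it to GRUB_CMDLINE_LINUX_DEFAULT
+ in /etc/default/grub and regenerating the GRUB config
+ (tooling varies by distribution, e.g. update-grub or
+ grub2-mkconfig -o /boot/grub2/grub.cfg):
+
+ Example: GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=auto,nosmt"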
mminit_loglevel=
- [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
+ [KNL,EARLY] When CONFIG_DEBUG_MEMORY_INIT is set, this
parameter allows control of the logging verbosity for
the additional memory initialisation checks. A value
of 0 disables mminit logging and a level of 4 will
log everything. Information is printed at KERN_DEBUG
so loglevel=8 may also need to be specified.
+ mmio_stale_data=
+ [X86,INTEL,EARLY] Control mitigation for the Processor
+ MMIO Stale Data vulnerabilities.
+
+ Processor MMIO Stale Data is a class of
+ vulnerabilities that may expose data after an MMIO
+ operation. Exposed data could originate or end in
+ the same CPU buffers as affected by MDS and TAA.
+ Therefore, similar to MDS and TAA, the mitigation
+ is to clear the affected CPU buffers.
+
+ This parameter controls the mitigation. The
+ options are:
+
+ full - Enable mitigation on vulnerable CPUs
+
+ full,nosmt - Enable mitigation and disable SMT on
+ vulnerable CPUs.
+
+ off - Unconditionally disable mitigation
+
+ On MDS- or TAA-affected machines,
+ mmio_stale_data=off can be prevented by an active
+ MDS or TAA mitigation, as these vulnerabilities
+ are mitigated with the same mechanism. So, in
+ order to disable this mitigation, you need to
+ specify mds=off and tsx_async_abort=off too.
+
+ Not specifying this option is equivalent to
+ mmio_stale_data=full.
+
+ For details see:
+ Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
+
+ <module>.async_probe[=<bool>] [KNL]
+ If no <bool> value is specified or if the value
+ specified is not a valid <bool>, enable asynchronous
+ probe on this module. Otherwise, enable/disable
+ asynchronous probe on this module as indicated by the
+ <bool> value. See also: module.async_probe
+
+ module.async_probe=<bool>
+ [KNL] When set to true, modules will use async probing
+ by default. To enable/disable async probing for a
+ specific module, use the module specific control that
+ is documented under <module>.async_probe. When both
+ module.async_probe and <module>.async_probe are
+ specified, <module>.async_probe takes precedence for
+ the specific module.
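+
+ As an illustration (the module name "foo" is only a
+ placeholder), asynchronous probing could be enabled
+ globally while opting a single module out:
+
+ Example: module.async_probe=1 foo.async_probe=0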
+
+ module.enable_dups_trace
+ [KNL] When CONFIG_MODULE_DEBUG_AUTOLOAD_DUPS is set,
+ this means that duplicate request_module() calls will
+ trigger a WARN_ON() instead of a pr_warn(). Note that
+ if MODULE_DEBUG_AUTOLOAD_DUPS_TRACE is set, WARN_ON()s
+ will always be issued and this option does nothing.
module.sig_enforce
[KNL] When CONFIG_MODULE_SIG is set, this means that
modules without (valid) signatures will fail to load.
@@ -2860,7 +3556,7 @@
mousedev.yres= [MOUSE] Vertical screen resolution, used for devices
reporting absolute coordinates, such as tablets
- movablecore= [KNL,X86,IA-64,PPC]
+ movablecore= [KNL,X86,IA-64,PPC,EARLY]
Format: nn[KMGTPE] | nn%
This parameter is the complement to kernelcore=, it
specifies the amount of memory used for migratable
@@ -2871,7 +3567,7 @@
that the amount of memory usable for all allocations
is not too small.
- movable_node [KNL] Boot-time switch to make hotplugable memory
+ movable_node [KNL,EARLY] Boot-time switch to make hotpluggable memory
NUMA nodes movable. This means that the memory
of such nodes will be usable only for movable
allocations which rules out almost all kernel
@@ -2886,46 +3582,40 @@
mtdparts= [MTD]
See drivers/mtd/parsers/cmdlinepart.c
- multitce=off [PPC] This parameter disables the use of the pSeries
- firmware feature for updating multiple TCE entries
- at a time.
-
- onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration
-
- Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]
-
- boundary - index of last SLC block on Flex-OneNAND.
- The remaining blocks are configured as MLC blocks.
- lock - Configure if Flex-OneNAND boundary should be locked.
- Once locked, the boundary cannot be changed.
- 1 indicates lock status, 0 indicates unlock status.
-
mtdset= [ARM]
ARM/S3C2412 JIVE boot control
- See arch/arm/mach-s3c2412/mach-jive.c
+ See arch/arm/mach-s3c/mach-jive.c
mtouchusb.raw_coordinates=
[HW] Make the MicroTouch USB driver use raw coordinates
('y', default) or cooked coordinates ('n')
- mtrr_chunk_size=nn[KMG] [X86]
+ mtrr=debug [X86,EARLY]
+ Enable printing debug information related to MTRR
+ registers at boot time.
+
+ mtrr_chunk_size=nn[KMG] [X86,EARLY]
Used for mtrr cleanup. It is the largest continuous
chunk that could hold holes, aka. UC entries.
- mtrr_gran_size=nn[KMG] [X86]
+ mtrr_gran_size=nn[KMG] [X86,EARLY]
Used for mtrr cleanup. It is the granularity of an mtrr block.
Default is 1.
A large value could prevent small alignment from
using up MTRRs.
- mtrr_spare_reg_nr=n [X86]
+ mtrr_spare_reg_nr=n [X86,EARLY]
Format: <integer>
Range: 0,7 : spare reg number
Default : 1
Used for mtrr cleanup. It is the number of spare mtrr entries.
Set to 2 or more if your graphics card needs more.
+ multitce=off [PPC] This parameter disables the use of the pSeries
+ firmware feature for updating multiple TCE entries
+ at a time.
+
n2= [NET] SDL Inc. RISCom/N2 synchronous serial card
netdev= [NET] Network devices parameters
@@ -2935,20 +3625,24 @@
This usage is only documented in each driver source
file if at all.
+ netpoll.carrier_timeout=
+ [NET] Specifies amount of time (in seconds) that
+ netpoll should wait for a carrier. By default netpoll
+ waits 4 seconds.
+
nf_conntrack.acct=
[NETFILTER] Enable connection tracking flow accounting
0 to disable accounting
1 to enable accounting
Default value is 0.
- nfsaddrs= [NFS] Deprecated. Use ip= instead.
- See Documentation/admin-guide/nfs/nfsroot.rst.
-
- nfsroot= [NFS] nfs root filesystem for disk-less boxes.
- See Documentation/admin-guide/nfs/nfsroot.rst.
+ nfs.cache_getent=
+ [NFS] sets the pathname to the program which is used
+ to update the NFS client cache entries.
- nfsrootdebug [NFS] enable nfsroot debugging messages.
- See Documentation/admin-guide/nfs/nfsroot.rst.
+ nfs.cache_getent_timeout=
+ [NFS] sets the timeout after which an attempt to
+ update a cache entry is deemed to have failed.
nfs.callback_nr_threads=
[NFSv4] set the total number of threads that the
@@ -2959,17 +3653,12 @@
[NFS] set the TCP port on which the NFSv4 callback
channel should listen.
- nfs.cache_getent=
- [NFS] sets the pathname to the program which is used
- to update the NFS client cache entries.
-
- nfs.cache_getent_timeout=
- [NFS] sets the timeout after which an attempt to
- update a cache entry is deemed to have failed.
-
- nfs.idmap_cache_timeout=
- [NFS] set the maximum lifetime for idmapper cache
- entries.
+ nfs.delay_retrans=
+ [NFS] specifies the number of times the NFSv4 client
+ retries the request before returning an EAGAIN error,
+ after a reply of NFS4ERR_DELAY from the server.
+ Only applies if the softerr mount option is enabled,
+ and the specified value is >= 0.
nfs.enable_ino64=
[NFS] enable 64-bit inode numbers.
@@ -2978,6 +3667,10 @@
of returning the full 64-bit number.
The default is to return 64-bit inode numbers.
+ nfs.idmap_cache_timeout=
+ [NFS] set the maximum lifetime for idmapper cache
+ entries.
+
nfs.max_session_cb_slots=
[NFSv4.1] Sets the maximum number of session
slots the client will assign to the callback
@@ -3005,21 +3698,14 @@
will be autodetected by the client, and it will fall
back to using the idmapper.
To turn off this behaviour, set the value to '0'.
+
nfs.nfs4_unique_id=
[NFS4] Specify an additional fixed unique ident-
ification string that NFSv4 clients can insert into
their nfs_client_id4 string. This is typically a
UUID that is generated at system install time.
- nfs.send_implementation_id =
- [NFSv4.1] Send client implementation identification
- information in exchange_id requests.
- If zero, no implementation identification information
- will be sent.
- The default is to send the implementation identification
- information.
-
- nfs.recover_lost_locks =
+ nfs.recover_lost_locks=
[NFSv4] Attempt to recover locks that were lost due
to a lease timeout on the server. Please note that
doing this risks data corruption, since there are
@@ -3031,7 +3717,15 @@
The default parameter value of '0' causes the kernel
not to attempt recovery of lost locks.
- nfs4.layoutstats_timer =
+ nfs.send_implementation_id=
+ [NFSv4.1] Send client implementation identification
+ information in exchange_id requests.
+ If zero, no implementation identification information
+ will be sent.
+ The default is to send the implementation identification
+ information.
+
+ nfs4.layoutstats_timer=
[NFSv4.2] Change the rate at which the kernel sends
layoutstats to the pNFS metadata server.
@@ -3040,6 +3734,11 @@
driver. A non-zero value sets the minimum interval
in seconds between layoutstats transmissions.
+ nfsd.inter_copy_offload_enable=
+ [NFSv4.2] When set to 1, the server will support
+ server-to-server copies for which this server is
+ the destination of the copy.
+
nfsd.nfs4_disable_idmapping=
[NFSv4] When set to the default of '1', the NFSv4
server will return only numeric uids and gids to
@@ -3047,6 +3746,27 @@
and gids from such clients. This is intended to ease
migration from NFSv2/v3.
+ nfsd.nfsd4_ssc_umount_timeout=
+ [NFSv4.2] When used as the destination of a
+ server-to-server copy, knfsd temporarily mounts
+ the source server. It caches the mount in case
+ it will be needed again, and discards it if not
+ used for the number of milliseconds specified by
+ this parameter.
+
+ nfsaddrs= [NFS] Deprecated. Use ip= instead.
+ See Documentation/admin-guide/nfs/nfsroot.rst.
+
+ nfsroot= [NFS] nfs root filesystem for disk-less boxes.
+ See Documentation/admin-guide/nfs/nfsroot.rst.
+
+ nfsrootdebug [NFS] enable nfsroot debugging messages.
+ See Documentation/admin-guide/nfs/nfsroot.rst.
+
+ nmi_backtrace.backtrace_idle [KNL]
+ Dump stacks even of idle CPUs in response to an
+ NMI stack-backtrace request.
+
nmi_debug= [KNL,SH] Specify one or more actions to take
when an NMI is triggered.
Format: [state][,regs][,debounce][,die]
@@ -3067,18 +3787,28 @@
These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls.
- netpoll.carrier_timeout=
- [NET] Specifies amount of time (in seconds) that
- netpoll should wait for a carrier. By default netpoll
- waits 4 seconds.
-
no387 [BUGS=X86-32] Tells the kernel to use the 387 maths
emulation library even if a 387 maths coprocessor
is present.
- no5lvl [X86-64] Disable 5-level paging mode. Forces
+ no4lvl [RISCV,EARLY] Disable 4-level and 5-level paging modes.
+ Forces kernel to use 3-level paging instead.
+
+ no5lvl [X86-64,RISCV,EARLY] Disable 5-level paging mode. Forces
kernel to use 4-level paging instead.
+ noalign [KNL,ARM]
+
+ noaltinstr [S390,EARLY] Disables alternative instructions
+ patching (CPU alternatives feature).
+
+ noapic [SMP,APIC,EARLY] Tells the kernel to not make use of any
+ IOAPICs that may be present in the system.
+
+ noautogroup Disable scheduler automatic task group creation.
+
+ nocache [ARM,EARLY]
+
no_console_suspend
[HW] Never suspend the console
Disable suspending of consoles during suspend and
@@ -3094,58 +3824,16 @@
/sys/module/printk/parameters/console_suspend) to
turn it on/off dynamically.
- novmcoredd [KNL,KDUMP]
- Disable device dump. Device dump allows drivers to
- append dump data to vmcore so you can collect driver
- specified debug info. Drivers can append the data
- without any limit and this data is stored in memory,
- so this may cause significant memory stress. Disabling
- device dump can help save memory but the driver debug
- data will be no longer available. This parameter
- is only available when CONFIG_PROC_VMCORE_DEVICE_DUMP
- is set.
-
- noaliencache [MM, NUMA, SLAB] Disables the allocation of alien
- caches in the slab allocator. Saves per-node memory,
- but will impact performance.
-
- noalign [KNL,ARM]
-
- noaltinstr [S390] Disables alternative instructions patching
- (CPU alternatives feature).
-
- noapic [SMP,APIC] Tells the kernel to not make use of any
- IOAPICs that may be present in the system.
-
- noautogroup Disable scheduler automatic task group creation.
-
- nobats [PPC] Do not use BATs for mapping kernel lowmem
- on "Classic" PPC cores.
-
- nocache [ARM]
-
- noclflush [BUGS=X86] Don't use the CLFLUSH instruction
-
- nodelayacct [KNL] Disable per-task delay accounting
+ no_debug_objects
+ [KNL,EARLY] Disable object debugging
nodsp [SH] Disable hardware DSP at boot time.
- noefi Disable EFI runtime services support.
-
- noexec [IA-64]
+ noefi [EFI,EARLY] Disable EFI runtime services support.
- noexec [X86]
- On X86-32 available only on PAE configured kernels.
- noexec=on: enable non-executable mappings (default)
- noexec=off: disable non-executable mappings
-
- nosmap [X86,PPC]
- Disable SMAP (Supervisor Mode Access Prevention)
- even if it is supported by processor.
+ no_entry_flush [PPC,EARLY] Don't flush the L1-D cache when entering the kernel.
- nosmep [X86,PPC]
- Disable SMEP (Supervisor Mode Execution Prevention)
- even if it is supported by processor.
+ noexec [IA-64]
noexec32 [X86-64]
This affects only 32-bit executables.
@@ -3154,60 +3842,18 @@
noexec32=off: disable non-executable mappings
read implies executable mappings
+ no_file_caps Tells the kernel not to honor file capabilities. The
+ only way then for a file to be executed with privilege
+ is to be setuid root or executed by root.
+
nofpu [MIPS,SH] Disable hardware FPU at boot time.
+ nofsgsbase [X86] Disables FSGSBASE instructions.
+
nofxsr [BUGS=X86-32] Disables x86 floating point extended
register save and restore. The kernel will only save
legacy floating-point registers on task switch.
- nohugeiomap [KNL,x86,PPC] Disable kernel huge I/O mappings.
-
- nosmt [KNL,S390] Disable symmetric multithreading (SMT).
- Equivalent to smt=1.
-
- [KNL,x86] Disable symmetric multithreading (SMT).
- nosmt=force: Force disable SMT, cannot be undone
- via the sysfs control file.
-
- nospectre_v1 [X86,PPC] Disable mitigations for Spectre Variant 1
- (bounds check bypass). With this option data leaks are
- possible in the system.
-
- nospectre_v2 [X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for
- the Spectre variant 2 (indirect branch prediction)
- vulnerability. System may allow data leaks with this
- option.
-
- nospec_store_bypass_disable
- [HW] Disable all mitigations for the Speculative Store Bypass vulnerability
-
- noxsave [BUGS=X86] Disables x86 extended register state save
- and restore using xsave. The kernel will fallback to
- enabling legacy floating-point and sse state.
-
- noxsaveopt [X86] Disables xsaveopt used in saving x86 extended
- register states. The kernel will fall back to use
- xsave to save the states. By using this parameter,
- performance of saving the states is degraded because
- xsave doesn't support modified optimization while
- xsaveopt supports it on xsaveopt enabled systems.
-
- noxsaves [X86] Disables xsaves and xrstors used in saving and
- restoring x86 extended register state in compacted
- form of xsave area. The kernel will fall back to use
- xsaveopt and xrstor to save and restore the states
- in standard form of xsave area. By using this
- parameter, xsave area per process might occupy more
- memory on xsaves enabled systems.
-
- nohlt [BUGS=ARM,SH] Tells the kernel that the sleep(SH) or
- wfi(ARM) instruction doesn't work correctly and not to
- use it. This is also useful when using JTAG debugger.
-
- no_file_caps Tells the kernel not to honor file capabilities. The
- only way then for a file to be executed with privilege
- is to be setuid root or executed by root.
-
nohalt [IA-64] Tells the kernel not to use the power saving
function PAL_HALT_LIGHT when idle. This increases
power-consumption. On the positive side, it reduces
@@ -3215,8 +3861,36 @@
in certain environments such as networked servers or
real-time systems.
+ no_hash_pointers
+ [KNL,EARLY]
+ Force pointers printed to the console or buffers to be
+ unhashed. By default, when a pointer is printed via %p
+ format string, that pointer is "hashed", i.e. obscured
+ by hashing the pointer value. This is a security feature
+ that hides actual kernel addresses from unprivileged
+ users, but it also makes debugging the kernel more
+ difficult since unequal pointers can no longer be
+ compared. However, if this command-line option is
+ specified, then all normal pointers will have their true
+ value printed. This option should only be specified when
+ debugging the kernel. Please do not use on production
+ kernels.
+
nohibernate [HIBERNATION] Disable hibernation and resume.
+ nohlt [ARM,ARM64,MICROBLAZE,MIPS,PPC,SH] Forces the kernel to
+ busy wait in do_idle() and not use the arch_cpu_idle()
+ implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP
+ to be effective. This is useful on platforms where the
+ sleep(SH) or wfi(ARM,ARM64) instructions do not work
+ correctly or when doing power measurements to evaluate
+ the impact of the sleep instructions. This is also
+ useful when using a JTAG debugger.
+
+ nohugeiomap [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge I/O mappings.
+
+ nohugevmalloc [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge vmalloc mappings.
+
nohz= [KNL] Boottime enable/disable dynamic ticks
Valid arguments: on, off
Default: on
@@ -3231,48 +3905,42 @@
just as if they had also been called out in the
rcu_nocbs= boot parameter.
- noiotrap [SH] Disables trapped I/O port accesses.
-
- noirqdebug [X86-32] Disables the code which attempts to detect and
- disable unhandled interrupt sources.
-
- no_timer_check [X86,APIC] Disables the code which tests for
- broken timer IRQ sources.
-
- noisapnp [ISAPNP] Disables ISA PnP code.
+ Note that this argument takes precedence over
+ the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
noinitrd [RAM] Tells the kernel not to load any configured
initial RAM disk.
- nointremap [X86-64, Intel-IOMMU] Do not enable interrupt
+ nointremap [X86-64,Intel-IOMMU,EARLY] Do not enable interrupt
remapping.
[Deprecated - use intremap=off]
nointroute [IA-64]
- noinvpcid [X86] Disable the INVPCID cpu feature.
+ noinvpcid [X86,EARLY] Disable the INVPCID cpu feature.
- nojitter [IA-64] Disables jitter checking for ITC timers.
+ noiotrap [SH] Disables trapped I/O port accesses.
- no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver
+ noirqdebug [X86-32] Disables the code which attempts to detect and
+ disable unhandled interrupt sources.
- no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page
- fault handling.
+ noisapnp [ISAPNP] Disables ISA PnP code.
- no-vmw-sched-clock
- [X86,PV_OPS] Disable paravirtualized VMware scheduler
- clock and use the default one.
+ nojitter [IA-64] Disables jitter checking for ITC timers.
- no-steal-acc [X86,PV_OPS,ARM64] Disable paravirtualized steal time
- accounting. steal time is computed, but won't
- influence scheduler behaviour
+ nokaslr [KNL,EARLY]
+ When CONFIG_RANDOMIZE_BASE is set, this disables
+ kernel and module base offset ASLR (Address Space
+ Layout Randomization).
+
+ no-kvmapf [X86,KVM,EARLY] Disable paravirtualized asynchronous page
+ fault handling.
- nolapic [X86-32,APIC] Do not enable or use the local APIC.
+ no-kvmclock [X86,KVM,EARLY] Disable paravirtualized KVM clock driver
- nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
+ nolapic [X86-32,APIC,EARLY] Do not enable or use the local APIC.
- noltlbs [PPC] Do not use large page/tlb entries for kernel
- lowmem mapping on PPC40x and PPC8xx
+ nolapic_timer [X86-32,APIC,EARLY] Do not use the local APIC timer.
nomca [IA-64] Disable machine check abort handling
@@ -3281,16 +3949,42 @@
nomfgpt [X86-32] Disable Multi-Function General Purpose
Timer usage (for AMD Geode machines).
+ nomodeset Disable kernel modesetting. Most systems' firmware
+ sets up a display mode and provides framebuffer memory
+ for output. With nomodeset, DRM and fbdev drivers will
+ not load if they could possibly displace the pre-
+ initialized output. Only the system framebuffer will
+ be available for use. The respective drivers will not
+ perform display-mode changes or accelerated rendering.
+
+ Useful as error fallback, or for testing and debugging.
+
+ nomodule Disable module load
+
nonmi_ipi [X86] Disable using NMI IPIs during panic/reboot to
shutdown the other cpus. Instead use the REBOOT_VECTOR
irq.
- nomodule Disable module load
-
- nopat [X86] Disable PAT (page attribute table extension of
+ nopat [X86,EARLY] Disable PAT (page attribute table extension of
pagetables) support.
- nopcid [X86-64] Disable the PCID cpu feature.
+ nopcid [X86-64,EARLY] Disable the PCID cpu feature.
+
+ nopku [X86] Disable Memory Protection Keys CPU feature found
+ in some Intel CPUs.
+
+ nopti [X86-64,EARLY]
+ Equivalent to pti=off
+
+ nopv= [X86,XEN,KVM,HYPER_V,VMWARE,EARLY]
+ Disables the PV optimizations forcing the guest to run
+ as a generic guest with no PV drivers. Currently
+ supports XEN HVM, KVM, HYPER_V and VMWARE guests.
+
+ nopvspin [X86,XEN,KVM,EARLY]
+ Disables the qspinlock slow path using PV optimizations
+ which allow the hypervisor to 'idle' the guest on lock
+ contention.
norandmaps Don't use address space randomization. Equivalent to
echo 0 > /proc/sys/kernel/randomize_va_space
@@ -3298,49 +3992,110 @@
noreplace-smp [X86-32,SMP] Don't replace SMP instructions
with UP alternatives
- nordrand [X86] Disable kernel use of the RDRAND and
- RDSEED instructions even if they are supported
- by the processor. RDRAND and RDSEED are still
- available to user space applications.
-
noresume [SWSUSP] Disables resume and restores original swap
space.
+ nosbagart [IA-64]
+
no-scroll [VGA] Disables scrollback.
This is required for the Braillex ib80-piezo Braille
reader made by F.H. Papenmeier (Germany).
- nosbagart [IA-64]
+ nosgx [X86-64,SGX,EARLY] Disables Intel SGX kernel support.
+
+ nosmap [PPC,EARLY]
+ Disable SMAP (Supervisor Mode Access Prevention)
+ even if it is supported by processor.
- nosep [BUGS=X86-32] Disables x86 SYSENTER/SYSEXIT support.
+ nosmep [PPC64s,EARLY]
+ Disable SMEP (Supervisor Mode Execution Prevention)
+ even if it is supported by processor.
- nosmp [SMP] Tells an SMP kernel to act as a UP kernel,
+ nosmp [SMP,EARLY] Tells an SMP kernel to act as a UP kernel,
and disable the IO APIC. legacy for "maxcpus=0".
+ nosmt [KNL,MIPS,PPC,S390,EARLY] Disable symmetric multithreading (SMT).
+ Equivalent to smt=1.
+
+ [KNL,X86,PPC] Disable symmetric multithreading (SMT).
+ nosmt=force: Force disable SMT, cannot be undone
+ via the sysfs control file.
+
nosoftlockup [KNL] Disable the soft-lockup detector.
+ nospec_store_bypass_disable
+ [HW,EARLY] Disable all mitigations for the Speculative
+ Store Bypass vulnerability
+
+ nospectre_bhb [ARM64,EARLY] Disable all mitigations for Spectre-BHB (branch
+ history injection) vulnerability. System may allow data leaks
+ with this option.
+
+ nospectre_v1 [X86,PPC,EARLY] Disable mitigations for Spectre Variant 1
+ (bounds check bypass). With this option data leaks are
+ possible in the system.
+
+ nospectre_v2 [X86,PPC_E500,ARM64,EARLY] Disable all mitigations
+ for the Spectre variant 2 (indirect branch
+ prediction) vulnerability. System may allow data
+ leaks with this option.
+
+ no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES,RISCV,EARLY] Disable
+ paravirtualized steal time accounting. steal time is
+ computed, but won't influence scheduler behaviour
+
nosync [HW,M68K] Disables sync negotiation for all devices.
+ no_timer_check [X86,APIC] Disables the code which tests for
+ broken timer IRQ sources.
+
+ no_uaccess_flush
+ [PPC,EARLY] Don't flush the L1-D cache after accessing user data.
+
+ novmcoredd [KNL,KDUMP]
+ Disable device dump. Device dump allows drivers to
+ append dump data to vmcore so you can collect driver
+ specified debug info. Drivers can append the data
+ without any limit and this data is stored in memory,
+ so this may cause significant memory stress. Disabling
+ device dump can help save memory but the driver debug
+ data will no longer be available. This parameter
+ is only available when CONFIG_PROC_VMCORE_DEVICE_DUMP
+ is set.
+
+ no-vmw-sched-clock
+ [X86,PV_OPS,EARLY] Disable paravirtualized VMware
+ scheduler clock and use the default one.
+
nowatchdog [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).
- nowb [ARM]
+ nowb [ARM,EARLY]
- nox2apic [X86-64,APIC] Do not enable x2APIC mode.
+ nox2apic [X86-64,APIC,EARLY] Do not enable x2APIC mode.
- cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
- CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
- Some features depend on CPU0. Known dependencies are:
- 1. Resume from suspend/hibernate depends on CPU0.
- Suspend/hibernate will fail if CPU0 is offline and you
- need to online CPU0 before suspend/hibernate.
- 2. PIC interrupts also depend on CPU0. CPU0 can't be
- removed if a PIC interrupt is detected.
- It's said poweroff/reboot may depend on CPU0 on some
- machines although I haven't seen such issues so far
- after CPU0 is offline on a few tested machines.
- If the dependencies are under your control, you can
- turn on cpu0_hotplug.
+ NOTE: this parameter will be ignored on systems with the
+ LEGACY_XAPIC_DISABLED bit set in the
+ IA32_XAPIC_DISABLE_STATUS MSR.
+
+ noxsave [BUGS=X86] Disables x86 extended register state save
+ and restore using xsave. The kernel will fall back to
+ enabling legacy floating-point and sse state.
+
+ noxsaveopt [X86] Disables xsaveopt used in saving x86 extended
+ register states. The kernel will fall back to using
+ xsave to save the states. By using this parameter,
+ performance of saving the states is degraded because
+ xsave doesn't support modified optimization while
+ xsaveopt supports it on xsaveopt enabled systems.
+
+ noxsaves [X86] Disables xsaves and xrstors used in saving and
+ restoring x86 extended register state in compacted
+ form of xsave area. The kernel will fall back to using
+ xsaveopt and xrstor to save and restore the states
+ in standard form of xsave area. By using this
+ parameter, xsave area per process might occupy more
+ memory on xsaves enabled systems.
nps_mtm_hs_ctr= [KNL,ARC]
This parameter sets the maximum duration, in
@@ -3355,7 +4110,7 @@
purges which is reported from either PAL_VM_SUMMARY or
SAL PALO.
- nr_cpus= [SMP] Maximum number of processors that an SMP kernel
+ nr_cpus= [SMP,EARLY] Maximum number of processors that an SMP kernel
could support. nr_cpus=n : n >= 1 limits the kernel to
support 'n' processors. It could be larger than the
number of already plugged CPU during bootup, later in
@@ -3366,7 +4121,12 @@
nr_uarts= [SERIAL] maximum number of UARTs to be registered.
- numa_balancing= [KNL,X86] Enable or disable automatic NUMA balancing.
+ numa=off [KNL, ARM64, PPC, RISCV, SPARC, X86, EARLY]
+ Disable NUMA; only set up a single NUMA node
+ spanning all memory.
+
+ numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic
+ NUMA balancing.
Allowed values are enable and disable
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
@@ -3374,7 +4134,7 @@
This can be set from sysctl after boot.
See Documentation/admin-guide/sysctl/vm.rst for details.
- ohci1394_dma=early [HW] enable debugging via the ohci1394 driver.
+ ohci1394_dma=early [HW,EARLY] enable debugging via the ohci1394 driver.
See Documentation/core-api/debugging-via-ohci1394.rst for more
info.
@@ -3390,21 +4150,18 @@
For example, to override I2C bus2:
omap_mux=i2c2_scl.i2c2_scl=0x100,i2c2_sda.i2c2_sda=0x100
- oprofile.timer= [HW]
- Use timer interrupt instead of performance counters
-
- oprofile.cpu_type= Force an oprofile cpu type
- This might be useful if you have an older oprofile
- userland or if you want common events.
- Format: { arch_perfmon }
- arch_perfmon: [X86] Force use of architectural
- perfmon on Intel CPUs instead of the
- CPU specific event set.
- timer: [X86] Force use of architectural NMI
- timer mode (see also oprofile.timer
- for generic hr timer mode)
-
- oops=panic Always panic on oopses. Default is to just kill the
+ onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration
+
+ Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]
+
+ boundary - index of last SLC block on Flex-OneNAND.
+ The remaining blocks are configured as MLC blocks.
+ lock - Configure if Flex-OneNAND boundary should be locked.
+ Once locked, the boundary cannot be changed.
+ 1 indicates lock status, 0 indicates unlock status.
+
+ oops=panic [KNL,EARLY]
+ Always panic on oopses. Default is to just kill the
process, but there is a small probability of
deadlocking the machine.
This will also cause panics on machine check exceptions.
@@ -3420,34 +4177,32 @@
can be read from sysfs at:
/sys/module/page_alloc/parameters/shuffle.
- page_owner= [KNL] Boot-time page_owner enabling option.
+ page_owner= [KNL,EARLY] Boot-time page_owner enabling option.
Storage of the information about who allocated
each page is disabled by default. With this switch,
we can turn it on.
on: enable the feature
- page_poison= [KNL] Boot-time parameter changing the state of
+ page_poison= [KNL,EARLY] Boot-time parameter changing the state of
poisoning on the buddy allocator, available with
CONFIG_PAGE_POISONING=y.
off: turn off poisoning (default)
on: turn on poisoning
+ page_reporting.page_reporting_order=
+ [KNL] Minimal page reporting order
+ Format: <integer>
+ Adjust the minimal page reporting order. The page
+ reporting is disabled when it exceeds MAX_PAGE_ORDER.
+
panic= [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting
timeout = 0: wait forever
timeout < 0: reboot immediately
Format: <timeout>
- panic_print= Bitmask for printing system info when panic happens.
- User can chose combination of the following bits:
- bit 0: print all tasks info
- bit 1: print system memory info
- bit 2: print timer info
- bit 3: print locks info if CONFIG_LOCKDEP is on
- bit 4: print ftrace buffer
- bit 5: print all printk messages in buffer
-
- panic_on_taint= Bitmask for conditionally calling panic() in add_taint()
+ panic_on_taint= [KNL,EARLY]
+ Bitmask for conditionally calling panic() in add_taint()
Format: <hex>[,nousertaint]
Hexadecimal bitmask representing the set of TAINT flags
that will cause the kernel to panic when add_taint() is
@@ -3460,16 +4215,23 @@
extra details on the taint flags that users can pick
to compose the bitmask to assign to panic_on_taint.
- panic_on_warn panic() instead of WARN(). Useful to cause kdump
+ panic_on_warn=1 panic() instead of WARN(). Useful to cause kdump
on a WARN().
- crash_kexec_post_notifiers
- Run kdump after running panic-notifiers and dumping
- kmsg. This only for the users who doubt kdump always
- succeeds in any situation.
- Note that this also increases risks of kdump failure,
- because some panic notifiers can make the crashed
- kernel more unstable.
+ panic_print= Bitmask for printing system info when panic happens.
+ User can choose a combination of the following bits:
+ bit 0: print all tasks info
+ bit 1: print system memory info
+ bit 2: print timer info
+ bit 3: print locks info if CONFIG_LOCKDEP is on
+ bit 4: print ftrace buffer
+ bit 5: print all printk messages in buffer
+ bit 6: print all CPUs backtrace (if available in the arch)
+ bit 7: print only tasks in uninterruptible (blocked) state
+ *Be aware* that this option may print a _lot_ of lines,
+ so there are risks of losing older messages in the log.
+ Use this option carefully; it may be worth setting up
+ a bigger log buffer with "log_buf_len" along with this.
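+
+ For example, printing task info, memory info and the
+ ftrace buffer (bits 0, 1 and 4, i.e. 0x1 + 0x2 + 0x10)
+ is requested with:
+
+ Example: panic_print=0x13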
parkbd.port= [HW] Parallel port number the keyboard adapter is
connected to, default is 0.
@@ -3500,18 +4262,104 @@
Currently this function knows 686a and 8231 chips.
Format: [spp|ps2|epp|ecp|ecpepp]
- pause_on_oops=
+ pata_legacy.all= [HW,LIBATA]
+ Format: <int>
+ Set to non-zero to probe primary and secondary ISA
+ port ranges on PCI systems where no PCI PATA device
+ has been found at either range. Disabled by default.
+
+ pata_legacy.autospeed= [HW,LIBATA]
+ Format: <int>
+ Set to non-zero if a chip is present that snoops speed
+ changes. Disabled by default.
+
+ pata_legacy.ht6560a= [HW,LIBATA]
+ Format: <int>
+ Set to 1, 2, or 3 for HT 6560A on the primary channel,
+ the secondary channel, or both channels respectively.
+ Disabled by default.
+
+ pata_legacy.ht6560b= [HW,LIBATA]
+ Format: <int>
+ Set to 1, 2, or 3 for HT 6560B on the primary channel,
+ the secondary channel, or both channels respectively.
+ Disabled by default.
+
+ pata_legacy.iordy_mask= [HW,LIBATA]
+ Format: <int>
+ IORDY enable mask. Set individual bits to allow IORDY
+ for the respective channel. Bit 0 is for the first
+ legacy channel handled by this driver, bit 1 is for
+ the second channel, and so on. The sequence will often
+ correspond to the primary legacy channel, the secondary
+ legacy channel, and so on, but the handling of a PCI
+ bus and the use of other driver options may interfere
+ with the sequence. By default IORDY is allowed across
+ all channels.
+
+ pata_legacy.opti82c46x= [HW,LIBATA]
+ Format: <int>
+ Set to 1, 2, or 3 for Opti 82c465MV on the primary
+ channel, the secondary channel, or both channels
+ respectively. Disabled by default.
+
+ pata_legacy.opti82c611a= [HW,LIBATA]
+ Format: <int>
+ Set to 1, 2, or 3 for Opti 82c611A on the primary
+ channel, the secondary channel, or both channels
+ respectively. Disabled by default.
+
+ pata_legacy.pio_mask= [HW,LIBATA]
+ Format: <int>
+ PIO mode mask for autospeed devices. Set individual
+ bits to allow the use of the respective PIO modes.
+ Bit 0 is for mode 0, bit 1 is for mode 1, and so on.
+ All modes allowed by default.
+
+ pata_legacy.probe_all= [HW,LIBATA]
+ Format: <int>
+ Set to non-zero to probe tertiary and further ISA
+ port ranges on PCI systems. Disabled by default.
+
+ pata_legacy.probe_mask= [HW,LIBATA]
+ Format: <int>
+ Probe mask for legacy ISA PATA ports. Depending on
+ platform configuration and the use of other driver
+ options up to 6 legacy ports are supported: 0x1f0,
+ 0x170, 0x1e8, 0x168, 0x1e0, 0x160, however probing
+ of individual ports can be disabled by setting the
+ corresponding bits in the mask to 1. Bit 0 is for
+ the first port in the list above (0x1f0), and so on.
+ By default all supported ports are probed.
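+
+ Following the bit numbering above, skipping the two
+ standard port ranges (0x1f0 and 0x170) while probing
+ the rest would, as a sketch, be:
+
+ Example: pata_legacy.probe_mask=0x3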
+
+ pata_legacy.qdi= [HW,LIBATA]
+ Format: <int>
+ Set to non-zero to probe QDI controllers. By default
+ set to 1 if CONFIG_PATA_QDI_MODULE, 0 otherwise.
+
+ pata_legacy.winbond= [HW,LIBATA]
+ Format: <int>
+ Set to non-zero to probe Winbond controllers. Use
+ the standard I/O port (0x130) if 1, otherwise the
+ value given is the I/O port to use (typically 0x1b0).
+ By default set to 1 if CONFIG_PATA_WINBOND_VLB_MODULE,
+ 0 otherwise.
+
+ pata_platform.pio_mask= [HW,LIBATA]
+ Format: <int>
+ Supported PIO mode mask. Set individual bits to allow
+ the use of the respective PIO modes. Bit 0 is for
+ mode 0, bit 1 is for mode 1, and so on. Mode 0 only
+ allowed by default.
+
+ pause_on_oops=<int>
Halt all CPUs after the first oops has been printed for
the specified number of seconds. This is to be used if
your oopses keep scrolling off the screen.
pcbit= [HW,ISDN]
- pcd. [PARIDE]
- See header of drivers/block/paride/pcd.c.
- See also Documentation/admin-guide/blockdev/paride.rst.
-
- pci=option[,option...] [PCI] various PCI subsystem options.
+ pci=option[,option...] [PCI,EARLY] various PCI subsystem options.
Some options herein operate on a specific device
or a set of devices (<pci_dev>). These are
@@ -3625,6 +4473,15 @@
please report a bug.
nocrs [X86] Ignore PCI host bridge windows from ACPI.
If you need to use this, please report a bug.
+ use_e820 [X86] Use E820 reservations to exclude parts of
+ PCI host bridge windows. This is a workaround
+ for BIOS defects in host bridge _CRS methods.
+ If you need to use this, please report a bug to
+ <linux-pci@vger.kernel.org>.
+ no_e820 [X86] Ignore E820 reservations for PCI host
+ bridge windows. This is the default on modern
+ hardware. If you need to use this, please report
+ a bug to <linux-pci@vger.kernel.org>.
routeirq Do IRQ routing for all PCI devices.
This is normally done in pci_enable_device(),
so this option is a temporary workaround
@@ -3677,7 +4534,9 @@
specified, e.g., 12@pci:8086:9c22:103c:198f
for 4096-byte alignment.
ecrc= Enable/disable PCIe ECRC (transaction layer
- end-to-end CRC checking).
+ end-to-end CRC checking). Only effective if
+ OS has native AER control (either granted by
+ ACPI _OSC or forced via "pcie_ports=native")
bios: Use BIOS/firmware settings. This is the
the default.
off: Turn ECRC off
@@ -3764,29 +4623,21 @@
for debug and development, but should not be
needed on a platform with proper driver support.
- pd. [PARIDE]
- See Documentation/admin-guide/blockdev/paride.rst.
-
pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at
boot time.
Format: { 0 | 1 }
See arch/parisc/kernel/pdc_chassis.c
- percpu_alloc= Select which percpu first chunk allocator to use.
+ percpu_alloc= [MM,EARLY]
+ Select which percpu first chunk allocator to use.
Currently supported values are "embed" and "page".
Archs may support subset or none of the selections.
See comments in mm/percpu.c for details on each
allocator. This parameter is primarily for debugging
and performance comparison.
- pf. [PARIDE]
- See Documentation/admin-guide/blockdev/paride.rst.
-
- pg. [PARIDE]
- See Documentation/admin-guide/blockdev/paride.rst.
-
pirq= [SMP,APIC] Manual mp-table setup
- See Documentation/x86/i386/IO-APIC.rst.
+ See Documentation/arch/x86/i386/IO-APIC.rst.
plip= [PPT,NET] Parallel port network link
Format: { parport<nr> | timid | 0 }
@@ -3796,6 +4647,14 @@
Override pmtimer IOPort with a hex value.
e.g. pmtmr=0x508
+ pmu_override= [PPC] Override the PMU.
+ This option takes over the PMU facility, so it is no
+ longer usable by perf. Setting this option starts the
+ PMU counters by setting MMCR0 to 0 (the FC bit is
+ cleared). If a number is given, then MMCR1 is set to
+ that number, otherwise (e.g., 'pmu_override=on'), MMCR1
+ remains 0.
+
pm_debug_messages [SUSPEND,KNL]
Enable suspend/resume debug messages during boot up.
@@ -3832,6 +4691,11 @@
may be specified.
Format: <port>,<port>....
+ possible_cpus= [SMP,S390,X86]
+ Format: <unsigned int>
+ Set the number of possible CPUs, overriding the
+ regular discovery mechanisms (such as ACPI/FW, etc).
+
powersave=off [PPC] This option disables power saving features.
It specifically disables cpuidle and sets the
platform machine description specific power_save
@@ -3839,15 +4703,22 @@
execution priority.
ppc_strict_facility_enable
- [PPC] This option catches any kernel floating point,
+ [PPC,ENABLE] This option catches any kernel floating point,
Altivec, VSX and SPE outside of regions specifically
allowed (eg kernel_enable_fpu()/kernel_disable_fpu()).
There is some performance impact when enabling this.
- ppc_tm= [PPC]
+ ppc_tm= [PPC,EARLY]
Format: {"off"}
Disable Hardware Transactional Memory
+ preempt= [KNL]
+ Select preemption mode if you have CONFIG_PREEMPT_DYNAMIC
+ none - Limited to cond_resched() calls
+ voluntary - Limited to cond_resched() and might_sleep() calls
+ full - Any section that isn't explicitly preempt disabled
+ can be preempted anytime.
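+
+ For example, full preemption can be selected at boot
+ with:
+
+ Example: preempt=full
+
+ On kernels built with CONFIG_PREEMPT_DYNAMIC, the mode
+ currently in effect can usually also be inspected via
+ debugfs at /sys/kernel/debug/sched/preempt.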
+
print-fatal-signals=
[KNL] debug: print fatal signals
@@ -3867,6 +4738,15 @@
Format: <bool> (1/Y/y=enable, 0/N/n=disable)
default: disabled
+ printk.console_no_auto_verbose=
+ Disable console loglevel raise on oops, panic
+ or lockdep-detected issues (only if lock debug is on).
+ Except for setups with a low-baudrate serial
+ console, keeping this 0 is a good choice in order
+ to provide more debug information.
+ Format: <bool>
+ default: 0 (auto_verbose is enabled)
+
printk.devkmsg={on,off,ratelimit}
Control writing to /dev/kmsg.
on - unlimited logging to /dev/kmsg from userspace
@@ -3896,9 +4776,7 @@
Param: <number> - step/bucket size as a power of 2 for
statistical time based profiling.
- prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk
- before loading.
- See Documentation/admin-guide/blockdev/ramdisk.rst.
+ prompt_ramdisk= [RAM] [Deprecated]
prot_virt= [S390] enable hosting protected virtual machines
isolated from the hypervisor (if hardware supports
@@ -3924,10 +4802,7 @@
pstore.backend= Specify the name of the pstore backend to use
- pt. [PARIDE]
- See Documentation/admin-guide/blockdev/paride.rst.
-
- pti= [X86_64] Control Page Table Isolation of user and
+ pti= [X86-64] Control Page Table Isolation of user and
kernel address spaces. Disabling this feature
removes hardening, but improves performance of
system calls and interrupts.
@@ -3939,28 +4814,46 @@
Not specifying this option is equivalent to pti=auto.
- nopti [X86_64]
- Equivalent to pti=off
-
pty.legacy_count=
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
- quiet [KNL] Disable most log messages
+ quiet [KNL,EARLY] Disable most log messages
r128= [HW,DRM]
+ radix_hcall_invalidate=on [PPC/PSERIES]
+ Disable RADIX GTSE feature and use hcall for TLB
+ invalidate.
+
raid= [HW,RAID]
See Documentation/admin-guide/md.rst.
ramdisk_size= [RAM] Sizes of RAM disks in kilobytes
See Documentation/admin-guide/blockdev/ramdisk.rst.
- random.trust_cpu={on,off}
- [KNL] Enable or disable trusting the use of the
- CPU's random number generator (if available) to
- fully seed the kernel's CRNG. Default is controlled
- by CONFIG_RANDOM_TRUST_CPU.
+ ramdisk_start= [RAM] RAM disk image start address
+
+ random.trust_cpu=off
+ [KNL,EARLY] Disable trusting the use of the CPU's
+ random number generator (if available) to
+ initialize the kernel's RNG.
+
+ random.trust_bootloader=off
+ [KNL,EARLY] Disable trusting the use of a seed
+ passed by the bootloader (if available) to
+ initialize the kernel's RNG.
+
+ randomize_kstack_offset=
+ [KNL,EARLY] Enable or disable kernel stack offset
+ randomization, which provides roughly 5 bits of
+ entropy, frustrating memory corruption attacks
+ that depend on stack address determinism or
+ cross-syscall address exposures. This is only
+ available on architectures that have defined
+ CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET.
+ Format: <bool> (1/Y/y=enable, 0/N/n=disable)
+ Default is CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
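+ Example: randomize_kstack_offset=1 enables the
+ randomization on supported architectures.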
ras=option[,option,...] [KNL] RAS-specific options
@@ -3968,21 +4861,33 @@
Disable the Correctable Errors Collector,
see CONFIG_RAS_CEC help text.
- rcu_nocbs= [KNL]
- The argument is a cpu list, as described above,
- except that the string "all" can be used to
- specify every CPU on the system.
-
- In kernels built with CONFIG_RCU_NOCB_CPU=y, set
- the specified list of CPUs to be no-callback CPUs.
- Invocation of these CPUs' RCU callbacks will be
- offloaded to "rcuox/N" kthreads created for that
- purpose, where "x" is "p" for RCU-preempt, and
- "s" for RCU-sched, and "N" is the CPU number.
- This reduces OS jitter on the offloaded CPUs,
- which can be useful for HPC and real-time
- workloads. It can also improve energy efficiency
- for asymmetric multiprocessors.
+ rcu_nocbs[=cpu-list]
+ [KNL] The optional argument is a cpu list,
+ as described above.
+
+ In kernels built with CONFIG_RCU_NOCB_CPU=y,
+ enable the no-callback CPU mode, which prevents
+ such CPUs' callbacks from being invoked in
+ softirq context. Invocation of such CPUs' RCU
+ callbacks will instead be offloaded to "rcuox/N"
+ kthreads created for that purpose, where "x" is
+ "p" for RCU-preempt, "s" for RCU-sched, and "g"
+ for the kthreads that mediate grace periods; and
+ "N" is the CPU number. This reduces OS jitter on
+ the offloaded CPUs, which can be useful for HPC
+ and real-time workloads. It can also improve
+ energy efficiency for asymmetric multiprocessors.
+
+ If a cpulist is passed as an argument, the specified
+ list of CPUs is set to no-callback mode from boot.
+
+ Otherwise, if the '=' sign and the cpulist
+ arguments are omitted, no CPU will be set to
+ no-callback mode from boot but the mode may be
+ toggled at runtime via cpusets.
+
+ Note that this argument takes precedence over
+ the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
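+ Example (illustrative cpulist): rcu_nocbs=1-7
+ offloads callbacks for CPUs 1 through 7 from boot.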
rcu_nocb_poll [KNL]
Rather than requiring that offloaded CPUs
@@ -3999,6 +4904,13 @@
Set maximum number of finished RCU callbacks to
process in one batch.
+ rcutree.do_rcu_barrier= [KNL]
+ Request a call to rcu_barrier(). This is
+ throttled so that userspace tests can safely
+ hammer on the sysfs variable if they so choose.
+ If triggered before the RCU grace-period machinery
+ is fully active, this will error out with EAGAIN.
+
rcutree.dump_tree= [KNL]
Dump the structure of the rcu_node combining tree
out at early boot. This is used for diagnostic
@@ -4018,26 +4930,6 @@
the propagation of recent CPU-hotplug changes up
the rcu_node combining tree.
- rcutree.use_softirq= [KNL]
- If set to zero, move all RCU_SOFTIRQ processing to
- per-CPU rcuc kthreads. Defaults to a non-zero
- value, meaning that RCU_SOFTIRQ is used by default.
- Specify rcutree.use_softirq=0 to use rcuc kthreads.
-
- rcutree.rcu_fanout_exact= [KNL]
- Disable autobalancing of the rcu_node combining
- tree. This is used by rcutorture, and might
- possibly be useful for architectures having high
- cache-to-cache transfer latencies.
-
- rcutree.rcu_fanout_leaf= [KNL]
- Change the number of CPUs assigned to each
- leaf rcu_node structure. Useful for very
- large systems, which will choose the value 64,
- and for NUMA systems with large remote-access
- latencies, which will choose a value aligned
- with the appropriate hardware boundaries.
-
rcutree.jiffies_till_first_fqs= [KNL]
Set delay from grace-period initialization to
first attempt to force quiescent states.
@@ -4073,14 +4965,21 @@
(the least-favored priority). Otherwise, when
RCU_BOOST is not set, valid values are 0-99 and
the default is zero (non-realtime operation).
-
- rcutree.rcu_nocb_gp_stride= [KNL]
- Set the number of NOCB callback kthreads in
- each group, which defaults to the square root
- of the number of CPUs. Larger numbers reduce
- the wakeup overhead on the global grace-period
- kthread, but increases that same overhead on
- each group's NOCB grace-period kthread.
+ When RCU_NOCB_CPU is set, also adjust the
+ priority of NOCB callback kthreads.
+
+ rcutree.nocb_nobypass_lim_per_jiffy= [KNL]
+ On callback-offloaded (rcu_nocbs) CPUs,
+ RCU reduces the lock contention that would
+ otherwise be caused by callback floods through
+ use of the ->nocb_bypass list. However, in the
+ common non-flooded case, RCU queues directly to
+ the main ->cblist in order to avoid the extra
+ overhead of the ->nocb_bypass list and its lock.
+ But if there are too many callbacks queued during
+ a single jiffy, RCU pre-queues the callbacks into
+ the ->nocb_bypass queue. The definition of "too
+ many" is supplied by this kernel boot parameter.
rcutree.qhimark= [KNL]
Set threshold of queued RCU callbacks beyond which
@@ -4099,15 +4998,55 @@
on rcutree.qhimark at boot time and to zero to
disable more aggressive help enlistment.
- rcutree.rcu_idle_gp_delay= [KNL]
- Set wakeup interval for idle CPUs that have
- RCU callbacks (RCU_FAST_NO_HZ=y).
+ rcutree.rcu_delay_page_cache_fill_msec= [KNL]
+ Set the page-cache refill delay (in milliseconds)
+ in response to low-memory conditions. Permitted
+ values are in the range 0:100000.
+
+ rcutree.rcu_divisor= [KNL]
+ Set the shift-right count to use to compute
+ the callback-invocation batch limit bl from
+ the number of callbacks queued on this CPU.
+ The result will be bounded below by the value of
+ the rcutree.blimit kernel parameter. Every bl
+ callbacks, the softirq handler will exit in
+ order to allow the CPU to do other work.
+
+ Please note that this callback-invocation batch
+ limit applies only to non-offloaded callback
+ invocation. Offloaded callbacks are instead
+ invoked in the context of an rcuoc kthread, which
+ the scheduler will preempt as it does any other task.
- rcutree.rcu_idle_lazy_gp_delay= [KNL]
- Set wakeup interval for idle CPUs that have
- only "lazy" RCU callbacks (RCU_FAST_NO_HZ=y).
- Lazy RCU callbacks are those which RCU can
- prove do nothing more than free memory.
+ rcutree.rcu_fanout_exact= [KNL]
+ Disable autobalancing of the rcu_node combining
+ tree. This is used by rcutorture, and might
+ possibly be useful for architectures having high
+ cache-to-cache transfer latencies.
+
+ rcutree.rcu_fanout_leaf= [KNL]
+ Change the number of CPUs assigned to each
+ leaf rcu_node structure. Useful for very
+ large systems, which will choose the value 64,
+ and for NUMA systems with large remote-access
+ latencies, which will choose a value aligned
+ with the appropriate hardware boundaries.
+
+ rcutree.rcu_min_cached_objs= [KNL]
+ Minimum number of objects which are cached and
+ maintained per CPU. Object size is equal
+ to PAGE_SIZE. The cache reduces pressure on the
+ page allocator and makes the whole algorithm
+ behave better under low-memory conditions.
+
+ rcutree.rcu_nocb_gp_stride= [KNL]
+ Set the number of NOCB callback kthreads in
+ each group, which defaults to the square root
+ of the number of CPUs. Larger numbers reduce
+ the wakeup overhead on the global grace-period
+ kthread, but increases that same overhead on
+ each group's NOCB grace-period kthread.
rcutree.rcu_kick_kthreads= [KNL]
Cause the grace-period kthread to get an extra
@@ -4116,46 +5055,104 @@
This wake_up() will be accompanied by a
WARN_ONCE() splat and an ftrace_dump().
+ rcutree.rcu_resched_ns= [KNL]
+ Limit the time spent invoking a batch of RCU
+ callbacks to the specified number of nanoseconds.
+ By default, this limit is checked only once
+ every 32 callbacks in order to limit the pain
+ inflicted by local_clock() overhead.
+
+ rcutree.rcu_unlock_delay= [KNL]
+ In CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels,
+ this specifies an rcu_read_unlock()-time delay
+ in microseconds. This defaults to zero.
+ Larger delays increase the probability of
+ catching RCU pointer leaks, that is, buggy use
+ of RCU-protected pointers after the relevant
+ rcu_read_unlock() has completed.
+
rcutree.sysrq_rcu= [KNL]
Commandeer a sysrq key to dump out Tree RCU's
rcu_node tree with an eye towards determining
why a new grace period has not yet started.
- rcuperf.gp_async= [KNL]
+ rcutree.use_softirq= [KNL]
+ If set to zero, move all RCU_SOFTIRQ processing to
+ per-CPU rcuc kthreads. Defaults to a non-zero
+ value, meaning that RCU_SOFTIRQ is used by default.
+ Specify rcutree.use_softirq=0 to use rcuc kthreads.
+
+ But note that CONFIG_PREEMPT_RT=y kernels disable
+ this kernel boot parameter, forcibly setting it
+ to zero.
+
+ rcutree.enable_rcu_lazy= [KNL]
+ To save power, batch RCU callbacks and flush after
+ delay, memory pressure or callback list growing too
+ big.
+
+ rcuscale.gp_async= [KNL]
Measure performance of asynchronous
grace-period primitives such as call_rcu().
- rcuperf.gp_async_max= [KNL]
+ rcuscale.gp_async_max= [KNL]
Specify the maximum number of outstanding
callbacks per writer thread. When a writer
thread exceeds this limit, it invokes the
corresponding flavor of rcu_barrier() to allow
previously posted callbacks to drain.
- rcuperf.gp_exp= [KNL]
+ rcuscale.gp_exp= [KNL]
Measure performance of expedited synchronous
grace-period primitives.
- rcuperf.holdoff= [KNL]
+ rcuscale.holdoff= [KNL]
Set test-start holdoff period. The purpose of
this parameter is to delay the start of the
test until boot completes in order to avoid
interference.
- rcuperf.kfree_rcu_test= [KNL]
+ rcuscale.kfree_by_call_rcu= [KNL]
+ In kernels built with CONFIG_RCU_LAZY=y, test
+ call_rcu() instead of kfree_rcu().
+
+ rcuscale.kfree_mult= [KNL]
+ Instead of allocating an object of size kfree_obj,
+ allocate one of kfree_mult * sizeof(kfree_obj).
+ Defaults to 1.
+
+ rcuscale.kfree_rcu_test= [KNL]
Set to measure performance of kfree_rcu() flooding.
- rcuperf.kfree_nthreads= [KNL]
+ rcuscale.kfree_rcu_test_double= [KNL]
+ Test the double-argument variant of kfree_rcu().
+ If this parameter has the same value as
+ rcuscale.kfree_rcu_test_single, both the single-
+ and double-argument variants are tested.
+
+ rcuscale.kfree_rcu_test_single= [KNL]
+ Test the single-argument variant of kfree_rcu().
+ If this parameter has the same value as
+ rcuscale.kfree_rcu_test_double, both the single-
+ and double-argument variants are tested.
+
+ rcuscale.kfree_nthreads= [KNL]
The number of threads running loops of kfree_rcu().
- rcuperf.kfree_alloc_num= [KNL]
+ rcuscale.kfree_alloc_num= [KNL]
Number of allocations and frees done in an iteration.
- rcuperf.kfree_loops= [KNL]
- Number of loops doing rcuperf.kfree_alloc_num number
+ rcuscale.kfree_loops= [KNL]
+ Number of loops doing rcuscale.kfree_alloc_num number
of allocations and frees.
- rcuperf.nreaders= [KNL]
+ rcuscale.minruntime= [KNL]
+ Set the minimum test run time in seconds. This
+ does not affect the data-collection interval,
+ but instead allows better measurement of things
+ like CPU consumption.
+
+ rcuscale.nreaders= [KNL]
Set number of RCU readers. The value -1 selects
N, where N is the number of CPUs. A value
"n" less than -1 selects N-n+1, where N is again
@@ -4164,27 +5161,32 @@
A value of "n" less than or equal to -N selects
a single reader.
- rcuperf.nwriters= [KNL]
+ rcuscale.nwriters= [KNL]
Set number of RCU writers. The values operate
- the same as for rcuperf.nreaders.
+ the same as for rcuscale.nreaders.
N, where N is the number of CPUs
- rcuperf.perf_type= [KNL]
+ rcuscale.scale_type= [KNL]
Specify the RCU implementation to test.
- rcuperf.shutdown= [KNL]
+ rcuscale.shutdown= [KNL]
Shut the system down after performance tests
complete. This is useful for hands-off automated
testing.
- rcuperf.verbose= [KNL]
+ rcuscale.verbose= [KNL]
Enable additional printk() statements.
- rcuperf.writer_holdoff= [KNL]
+ rcuscale.writer_holdoff= [KNL]
Write-side holdoff between grace periods,
in microseconds. The default of zero says
no holdoff.
+ rcuscale.writer_holdoff_jiffies= [KNL]
+ Additional write-side holdoff between grace
+ periods, but in jiffies. The default of zero
+ says no holdoff.
+
rcutorture.fqs_duration= [KNL]
Set duration of force_quiescent_state bursts
in microseconds.
@@ -4198,8 +5200,12 @@
in seconds.
rcutorture.fwd_progress= [KNL]
- Enable RCU grace-period forward-progress testing
+ Specifies the number of kthreads to be used
+ for RCU grace-period forward-progress testing
for the types of RCU supporting this notion.
+ Defaults to 1 kthread, values less than zero or
+ greater than the number of CPUs cause the number
+ of CPUs to be used.
rcutorture.fwd_progress_div= [KNL]
Specify the fraction of a CPU-stall-warning
@@ -4233,6 +5239,18 @@
are zero, rcutorture acts as if they are all
non-zero.
+ rcutorture.irqreader= [KNL]
+ Run RCU readers from irq handlers, or, more
+ accurately, from a timer handler. Not all RCU
+ flavors take kindly to this sort of thing.
+
+ rcutorture.leakpointer= [KNL]
+ Leak an RCU-protected pointer out of the reader.
+ This can of course result in splats, and is
+ intended to test the ability of things like
+ CONFIG_RCU_STRICT_GRACE_PERIOD=y to detect
+ such leaks.
+
rcutorture.n_barrier_cbs= [KNL]
Set callbacks/threads for rcu_barrier() testing.
@@ -4241,6 +5259,14 @@
stress RCU, they don't participate in the actual
test, hence the "fake".
+ rcutorture.nocbs_nthreads= [KNL]
+ Set number of RCU callback-offload togglers.
+ Zero (the default) disables toggling.
+
+ rcutorture.nocbs_toggle= [KNL]
+ Set the delay in milliseconds between successive
+ callback-offload toggling attempts.
+
rcutorture.nreaders= [KNL]
Set number of RCU readers. The value -1 selects
N-1, where N is the number of CPUs. A value
@@ -4258,6 +5284,20 @@
Set time (jiffies) between CPU-hotplug operations,
or zero to disable CPU-hotplug testing.
+ rcutorture.read_exit= [KNL]
+ Set the number of read-then-exit kthreads used
+ to test the interaction of RCU updaters and
+ task-exit processing.
+
+ rcutorture.read_exit_burst= [KNL]
+ The number of times in a given read-then-exit
+ episode that a set of read-then-exit kthreads
+ is spawned.
+
+ rcutorture.read_exit_delay= [KNL]
+ The delay, in seconds, between successive
+ read-then-exit testing episodes.
+
rcutorture.shuffle_interval= [KNL]
Set task-shuffle interval (s). Shuffling tasks
allows some CPUs to go into dyntick-idle mode
@@ -4273,8 +5313,17 @@
rcutorture.stall_cpu_block= [KNL]
Sleep while stalling if set. This will result
- in warnings from preemptible RCU in addition
- to any other stall-related activity.
+ in warnings from preemptible RCU in addition to
+ any other stall-related activity. Note that
+ in kernels built with CONFIG_PREEMPTION=n and
+ CONFIG_PREEMPT_COUNT=y, this parameter will
+ cause the CPU to pass through a quiescent state.
+ Given CONFIG_PREEMPTION=n, this will suppress
+ RCU CPU stall warnings, but will instead result
+ in scheduling-while-atomic splats.
+
+ Use of this module parameter results in splats.
+
rcutorture.stall_cpu_holdoff= [KNL]
Time to wait (s) after boot before inducing stall.
@@ -4323,6 +5372,12 @@
Dump ftrace buffer after reporting RCU CPU
stall warning.
+ rcupdate.rcu_cpu_stall_notifiers= [KNL]
+ Provide RCU CPU stall notifiers, but see the
+ warnings in the RCU_CPU_STALL_NOTIFIER Kconfig
+ option's help text. TL;DR: You almost certainly
+ do not want rcupdate.rcu_cpu_stall_notifiers.
+
rcupdate.rcu_cpu_stall_suppress= [KNL]
Suppress RCU CPU stall warning messages.
@@ -4334,6 +5389,29 @@
rcupdate.rcu_cpu_stall_timeout= [KNL]
Set timeout for RCU CPU stall warning messages.
+ The value is in seconds and the maximum allowed
+ value is 300 seconds.
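+ Example (illustrative value):
+ rcupdate.rcu_cpu_stall_timeout=60 warns after
+ 60 seconds.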
+
+ rcupdate.rcu_exp_cpu_stall_timeout= [KNL]
+ Set timeout for expedited RCU CPU stall warning
+ messages. The value is in milliseconds
+ and the maximum allowed value is 21000
+ milliseconds. Please note that this value is
+ adjusted to an arch timer tick resolution.
+ Setting this to zero causes the value from
+ rcupdate.rcu_cpu_stall_timeout to be used (after
+ conversion from seconds to milliseconds).
+
+ rcupdate.rcu_cpu_stall_cputime= [KNL]
+ Provide statistics on the cputime and count of
+ interrupts and tasks during the sampling period. For
+ multiple continuous RCU stalls, all sampling periods
+ begin at half of the first RCU stall timeout.
+
+ rcupdate.rcu_exp_stall_task_details= [KNL]
+ Print stack dumps of any tasks blocking the
+ current expedited RCU grace period during an
+ expedited RCU CPU stall warning.
rcupdate.rcu_expedited= [KNL]
Use expedited grace-period primitives, for
@@ -4359,6 +5437,36 @@
only normal grace-period primitives. No effect
on CONFIG_TINY_RCU kernels.
+ But note that CONFIG_PREEMPT_RT=y kernels enable
+ this kernel boot parameter, forcibly setting
+ it to the value one, that is, converting any
+ post-boot attempt at an expedited RCU grace
+ period to instead use normal non-expedited
+ grace-period processing.
+
+ rcupdate.rcu_task_collapse_lim= [KNL]
+ Set the maximum number of callbacks present
+ at the beginning of a grace period that allows
+ the RCU Tasks flavors to collapse back to using
+ a single callback queue. This switching only
+ occurs when rcupdate.rcu_task_enqueue_lim is
+ set to the default value of -1.
+
+ rcupdate.rcu_task_contend_lim= [KNL]
+ Set the minimum number of callback-queuing-time
+ lock-contention events per jiffy required to
+ cause the RCU Tasks flavors to switch to per-CPU
+ callback queuing. This switching only occurs
+ when rcupdate.rcu_task_enqueue_lim is set to
+ the default value of -1.
+
+ rcupdate.rcu_task_enqueue_lim= [KNL]
+ Set the number of callback queues to use for the
+ RCU Tasks family of RCU flavors. The default
+ of -1 allows this to be automatically (and
+ dynamically) adjusted. This parameter is intended
+ for use in testing.
+
rcupdate.rcu_task_ipi_delay= [KNL]
Set time in jiffies during which RCU tasks will
avoid sending IPIs, starting with the beginning
@@ -4366,10 +5474,64 @@
number avoids disturbing real-time workloads,
but lengthens grace periods.
+ rcupdate.rcu_task_lazy_lim= [KNL]
+ Number of callbacks on a given CPU that will
+ cancel laziness on that CPU. Use -1 to disable
+ cancellation of laziness, but be advised that
+ doing so increases the danger of OOM due to
+ callback flooding.
+
+ rcupdate.rcu_task_stall_info= [KNL]
+ Set initial timeout in jiffies for RCU task stall
+ informational messages, which give some indication
+ of the problem for those not patient enough to
+ wait for ten minutes. Informational messages are
+ only printed prior to the stall-warning message
+ for a given grace period. Disable with a value
+ less than or equal to zero. Defaults to ten
+ seconds. A change in value does not take effect
+ until the beginning of the next grace period.
+
+ rcupdate.rcu_task_stall_info_mult= [KNL]
+ Multiplier for time interval between successive
+ RCU task stall informational messages for a given
+ RCU tasks grace period. This value is clamped
+ to one through ten, inclusive. It defaults to
+ the value three, so that the first informational
+ message is printed 10 seconds into the grace
+ period, the second at 40 seconds, the third at
+ 160 seconds, and then the stall warning at 600
+ seconds would prevent a fourth at 640 seconds.
+
rcupdate.rcu_task_stall_timeout= [KNL]
- Set timeout in jiffies for RCU task stall warning
- messages. Disable with a value less than or equal
- to zero.
+ Set timeout in jiffies for RCU task stall
+ warning messages. Disable with a value less
+ than or equal to zero. Defaults to ten minutes.
+ A change in value does not take effect until
+ the beginning of the next grace period.
+
+ rcupdate.rcu_tasks_lazy_ms= [KNL]
+ Set timeout in milliseconds for RCU Tasks asynchronous
+ callback batching for call_rcu_tasks().
+ A negative value will take the default. A value
+ of zero will disable batching. Batching is
+ always disabled for synchronize_rcu_tasks().
+
+ rcupdate.rcu_tasks_rude_lazy_ms= [KNL]
+ Set timeout in milliseconds for RCU Tasks
+ Rude asynchronous callback batching for
+ call_rcu_tasks_rude(). A negative value
+ will take the default. A value of zero will
+ disable batching. Batching is always disabled
+ for synchronize_rcu_tasks_rude().
+
+ rcupdate.rcu_tasks_trace_lazy_ms= [KNL]
+ Set timeout in milliseconds for RCU Tasks
+ Trace asynchronous callback batching for
+ call_rcu_tasks_trace(). A negative value
+ will take the default. A value of zero will
+ disable batching. Batching is always disabled
+ for synchronize_rcu_tasks_trace().
rcupdate.rcu_self_test= [KNL]
Run the RCU early boot self tests
@@ -4379,7 +5541,7 @@
Run specified binary instead of /init from the ramdisk,
used for early userspace startup. See initrd.
- rdrand= [X86]
+ rdrand= [X86,EARLY]
force - Override the decision by the kernel to hide the
advertisement of RDRAND support (this affects
certain AMD processors because of buggy BIOS
@@ -4389,13 +5551,13 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba.
+ mba, smba, bmec.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
reboot= [KNL]
Format (x86 or x86_64):
- [w[arm] | c[old] | h[ard] | s[oft] | g[pio]] \
+ [w[arm] | c[old] | h[ard] | s[oft] | g[pio]] | d[efault] \
[[,]s[mp]#### \
[[,]b[ios] | a[cpi] | k[bd] | t[riple] | e[fi] | p[ci]] \
[[,]f[orce]
@@ -4407,6 +5569,64 @@
reboot_cpu is s[mp]#### with #### being the processor
to be used for rebooting.
+ refscale.holdoff= [KNL]
+ Set test-start holdoff period. The purpose of
+ this parameter is to delay the start of the
+ test until boot completes in order to avoid
+ interference.
+
+ refscale.lookup_instances= [KNL]
+ Number of data elements to use for the forms of
+ SLAB_TYPESAFE_BY_RCU testing. A negative number
+ is negated and multiplied by nr_cpu_ids, while
+ zero specifies nr_cpu_ids.
+
+ refscale.loops= [KNL]
+ Set the number of loops over the synchronization
+ primitive under test. Increasing this number
+ reduces noise due to loop start/end overhead,
+ but the default has already reduced the per-pass
+ noise to a handful of picoseconds on ca. 2020
+ x86 laptops.
+
+ refscale.nreaders= [KNL]
+ Set number of readers. The default value of -1
+ selects N, where N is roughly 75% of the number
+ of CPUs. A value of zero is an interesting choice.
+
+ refscale.nruns= [KNL]
+ Set number of runs, each of which is dumped onto
+ the console log.
+
+ refscale.readdelay= [KNL]
+ Set the read-side critical-section duration,
+ measured in microseconds.
+
+ refscale.scale_type= [KNL]
+ Specify the read-protection implementation to test.
+
+ refscale.shutdown= [KNL]
+ Shut down the system at the end of the performance
+ test. This defaults to 1 (shut it down) when
+ refscale is built into the kernel and to 0 (leave
+ it running) when refscale is built as a module.
+
+ refscale.verbose= [KNL]
+ Enable additional printk() statements.
+
+ refscale.verbose_batched= [KNL]
+ Batch the additional printk() statements. If zero
+ (the default) or negative, print everything. Otherwise,
+ print every Nth verbose statement, where N is the value
+ specified.
+
+ regulator_ignore_unused
+ [REGULATOR]
+ Prevents the regulator framework from disabling
+ regulators that are unused because no driver claims
+ them. This may
+ be useful for debug and development, but should not be
+ needed on a platform with proper driver support.
+
relax_domain_level=
[KNL, SMP] Set scheduler's default relax_domain_level.
See Documentation/admin-guide/cgroup-v1/cpusets.rst.
@@ -4417,16 +5637,11 @@
them. If <base> is less than 0x10000, the region
is assumed to be I/O ports; otherwise it is memory.
- reservetop= [X86-32]
+ reservetop= [X86-32,EARLY]
Format: nn[KMG]
Reserves a hole at the top of the kernel virtual
address space.
- reservelow= [X86]
- Format: nn[K]
- Set the amount of memory to reserve for BIOS at
- the bottom of the address space.
-
reset_devices [KNL] Force drivers to reset the underlying device
during initialization.
@@ -4448,16 +5663,45 @@
Useful for devices that are detected asynchronously
(e.g. USB and MMC devices).
- hibernate= [HIBERNATION]
- noresume Don't check if there's a hibernation image
- present during boot.
- nocompress Don't compress/decompress hibernation images.
- no Disable hibernation and resume.
- protect_image Turn on image protection during restoration
- (that will set all pages holding image data
- during restoration read-only).
-
- retain_initrd [RAM] Keep initrd memory after extraction
+ retain_initrd [RAM] Keep initrd memory after extraction. After boot, it will
+ be accessible via /sys/firmware/initrd.
+
+ retbleed= [X86] Control mitigation of RETBleed (Arbitrary
+ Speculative Code Execution with Return Instructions)
+ vulnerability.
+
+ AMD-based UNRET and IBPB mitigations alone do not stop
+ sibling threads from influencing the predictions of other
+ sibling threads. For that reason, STIBP is used on
+ processors that support it, and SMT is mitigated on
+ processors that don't.
+
+ off - no mitigation
+ auto - automatically select a mitigation
+ auto,nosmt - automatically select a mitigation,
+ disabling SMT if necessary for
+ the full mitigation (only on Zen1
+ and older without STIBP).
+ ibpb - On AMD, mitigate short speculation
+ windows on basic block boundaries too.
+ Safe, highest perf impact. It also
+ enables STIBP if present. Not suitable
+ on Intel.
+ ibpb,nosmt - Like "ibpb" above but will disable SMT
+ when STIBP is not available. This is
+ the alternative for systems which do not
+ have STIBP.
+ unret - Force enable untrained return thunks,
+ only effective on AMD f15h-f17h based
+ systems.
+ unret,nosmt - Like unret, but will disable SMT when STIBP
+ is not available. This is the alternative for
+ systems which do not have STIBP.
+
+ Selecting 'auto' will choose a mitigation method at run
+ time according to the CPU.
+
+ Not specifying this option is equivalent to retbleed=auto.
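+ Example: retbleed=auto,nosmt selects a mitigation
+ automatically and disables SMT if that is needed
+ for the full mitigation.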
rfkill.default_state=
0 "airplane mode". All wifi, bluetooth, wimax, gps, fm,
@@ -4478,20 +5722,35 @@
[KNL] Disable ring 3 MONITOR/MWAIT feature on supported
CPUs.
+ riscv_isa_fallback [RISCV,EARLY]
+ When CONFIG_RISCV_ISA_FALLBACK is not enabled, permit
+ falling back to detecting extension support by parsing
+ "riscv,isa" property on devicetree systems when the
+ replacement properties are not found. See the Kconfig
+ entry for RISCV_ISA_FALLBACK.
+
ro [KNL] Mount root device read-only on boot
- rodata= [KNL]
+ rodata= [KNL,EARLY]
on Mark read-only kernel memory as read-only (default).
off Leave read-only kernel memory writable for debugging.
+ full Mark read-only kernel memory and aliases as read-only
+ [arm64]
rockchip.usb_uart
+ [EARLY]
Enable the uart passthrough on the designated usb port
on Rockchip SoCs. When active, the signals of the
debug-uart get routed to the D+ and D- pins of the usb
port and the regular usb controller gets disabled.
root= [KNL] Root filesystem
- See name_to_dev_t comment in init/do_mounts.c.
+ Usually this is a block device specifier of some kind,
+ see the early_lookup_bdev comment in
+ block/early-lookup.c for details.
+ Alternatively this can be "ram" for the legacy initial
+ ramdisk, "nfs" and "cifs" for root on a network file
+ system, or "mtd" and "ubi" for mounting from raw flash.
rootdelay= [KNL] Delay (in seconds) to pause before attempting to
mount the root filesystem
@@ -4504,6 +5763,10 @@
Useful for devices that are detected asynchronously
(e.g. USB and MMC devices).
+ rootwait= [KNL] Maximum time (in seconds) to wait for root device
+ to show up before attempting to mount the root
+ filesystem.
+
rproc_mem=nn[KMG][@address]
[KNL,ARM,CMA] Remoteproc physical memory block.
Memory area to be used by remote processor image,
@@ -4516,16 +5779,27 @@
s390_iommu= [HW,S390]
Set s390 IOTLB flushing mode
strict
- With strict flushing every unmap operation will result in
- an IOTLB flush. Default is lazy flushing before reuse,
- which is faster.
+ With strict flushing every unmap operation will result
+ in an IOTLB flush. Default is lazy flushing before
+ reuse, which is faster. Deprecated, equivalent to
+ iommu.strict=1.
+
+ s390_iommu_aperture= [KNL,S390]
+ Specifies the size of the per device DMA address space
+ accessible through the DMA and IOMMU APIs as a decimal
+ factor of the size of main memory.
+ The default is 1 meaning that one can concurrently use
+ as many DMA addresses as physical memory is installed,
+ if supported by hardware, and thus map all of memory
+ once. With a value of 2 one can map all of memory twice
+ and so on. As a special case a factor of 0 imposes no
+ restrictions other than those given by hardware at the
+ cost of significant additional memory use for tables.
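+ Example: s390_iommu_aperture=2 allows mapping all
+ of memory twice.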
sa1100ir [NET]
See drivers/net/irda/sa1100_ir.c.
- sbni= [NET] Granch SBNI12 leased line adapter
-
- sched_debug [KNL] Enables verbose scheduler debug messages.
+ sched_verbose [KNL,EARLY] Enables verbose scheduler debug messages.
schedstats= [KNL,X86] Enable or disable scheduled statistics.
Allowed values are enable and disable. This feature
@@ -4548,7 +5822,99 @@
Format: integer between 0 and 10
Default is 0.
- skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
+ scftorture.holdoff= [KNL]
+ Number of seconds to hold off before starting
+ test. Defaults to zero for module insertion and
+ to 10 seconds for built-in smp_call_function()
+ tests.
+
+ scftorture.longwait= [KNL]
+ Request ridiculously long waits randomly selected
+ up to the chosen limit in seconds. Zero (the
+ default) disables this feature. Please note
+ that requesting even small non-zero numbers of
+ seconds can result in RCU CPU stall warnings,
+ softlockup complaints, and so on.
+
+ scftorture.nthreads= [KNL]
+ Number of kthreads to spawn to invoke the
+ smp_call_function() family of functions.
+ The default of -1 specifies a number of kthreads
+ equal to the number of CPUs.
+
+ scftorture.onoff_holdoff= [KNL]
+ Number of seconds to wait after the start of the
+ test before initiating CPU-hotplug operations.
+
+ scftorture.onoff_interval= [KNL]
+ Number of seconds to wait between successive
+ CPU-hotplug operations. Specifying zero (which
+ is the default) disables CPU-hotplug operations.
+
+ scftorture.shutdown_secs= [KNL]
+ The number of seconds following the start of the
+ test after which to shut down the system. The
+ default of zero avoids shutting down the system.
+ Non-zero values are useful for automated tests.
+
+ scftorture.stat_interval= [KNL]
+ The number of seconds between outputting the
+ current test statistics to the console. A value
+ of zero disables statistics output.
+
+ scftorture.stutter_cpus= [KNL]
+ The number of jiffies to wait between each change
+ to the set of CPUs under test.
+
+ scftorture.use_cpus_read_lock= [KNL]
+ Use cpus_read_lock() instead of the default
+ preempt_disable() to disable CPU hotplug
+ while invoking one of the smp_call_function*()
+ functions.
+
+ scftorture.verbose= [KNL]
+ Enable additional printk() statements.
+
+ scftorture.weight_single= [KNL]
+ The probability weighting to use for the
+ smp_call_function_single() function with a zero
+ "wait" parameter. A value of -1 selects the
+ default if all other weights are -1. However,
+ if at least one weight has some other value, a
+ value of -1 will instead select a weight of zero.
+
+ scftorture.weight_single_wait= [KNL]
+ The probability weighting to use for the
+ smp_call_function_single() function with a
+ non-zero "wait" parameter. See weight_single.
+
+ scftorture.weight_many= [KNL]
+ The probability weighting to use for the
+ smp_call_function_many() function with a zero
+ "wait" parameter. See weight_single.
+ Note well that setting a high probability for
+ this weighting can place serious IPI load
+ on the system.
+
+ scftorture.weight_many_wait= [KNL]
+ The probability weighting to use for the
+ smp_call_function_many() function with a
+ non-zero "wait" parameter. See weight_single
+ and weight_many.
+
+ scftorture.weight_all= [KNL]
+ The probability weighting to use for the
+ smp_call_function() function with a zero
+ "wait" parameter. See weight_single and
+ weight_many.
+
+ scftorture.weight_all_wait= [KNL]
+ The probability weighting to use for the
+ smp_call_function() function with a
+ non-zero "wait" parameter. See weight_single
+ and weight_many.
+
+ skew_tick= [KNL,EARLY] Offset the periodic timer tick per cpu to mitigate
xtime_lock contention on larger systems, and/or RCU lock
contention on all systems with CONFIG_MAXSMP set.
Format: { "0" | "1" }
@@ -4568,85 +5934,99 @@
1 -- enable.
Default value is 1.
- apparmor= [APPARMOR] Disable or enable AppArmor at boot time
- Format: { "0" | "1" }
- See security/apparmor/Kconfig help text
- 0 -- disable.
- 1 -- enable.
- Default value is set via kernel config option.
-
serialnumber [BUGS=X86-32]
+ sev=option[,option...] [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
+
shapers= [NET]
Maximal number of shapers.
+ show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller
+ Limit apic dumping. The parameter defines the maximal
+ number of local apics being dumped. It is also
+ possible to set it to "all", meaning no limit.
+ Format: { 1 (default) | 2 | ... | all }.
+ The parameter is only valid if apic=debug or
+ apic=verbose is specified.
+ Example: apic=debug show_lapic=all
+
simeth= [IA-64]
simscsi=
- slram= [HW,MTD]
-
- slab_nomerge [MM]
- Disable merging of slabs with similar size. May be
- necessary if there is some reason to distinguish
- allocs to different slabs, especially in hardened
- environments where the risk of heap overflows and
- layout control by attackers can usually be
- frustrated by disabling merging. This will reduce
- most of the exposure of a heap attack to a single
- cache (risks via metadata attacks are mostly
- unchanged). Debug options disable merging on their
- own.
- For more information see Documentation/vm/slub.rst.
-
- slab_max_order= [MM, SLAB]
- Determines the maximum allowed order for slabs.
- A high setting may cause OOMs due to memory
- fragmentation. Defaults to 1 for systems with
- more than 32MB of RAM, 0 otherwise.
-
- slub_debug[=options[,slabs]] [MM, SLUB]
- Enabling slub_debug allows one to determine the
+ slab_debug[=options[,slabs][;[options[,slabs]]...] [MM]
+ Enabling slab_debug allows one to determine the
culprit if slab objects become corrupted. Enabling
- slub_debug can create guard zones around objects and
+ slab_debug can create guard zones around objects and
may poison objects when not in use. Also tracks the
last alloc / free. For more information see
- Documentation/vm/slub.rst.
+ Documentation/mm/slub.rst.
+ (slub_debug legacy name also accepted for now)
- slub_memcg_sysfs= [MM, SLUB]
- Determines whether to enable sysfs directories for
- memory cgroup sub-caches. 1 to enable, 0 to disable.
- The default is determined by CONFIG_SLUB_MEMCG_SYSFS_ON.
- Enabling this can lead to a very high number of debug
- directories and files being created under
- /sys/kernel/slub.
-
- slub_max_order= [MM, SLUB]
+ slab_max_order= [MM]
Determines the maximum allowed order for slabs.
A high setting may cause OOMs due to memory
fragmentation. For more information see
- Documentation/vm/slub.rst.
+ Documentation/mm/slub.rst.
+ (slub_max_order legacy name also accepted for now)
+
+ slab_merge [MM]
+ Enable merging of slabs with similar size when the
+ kernel is built without CONFIG_SLAB_MERGE_DEFAULT.
+ (slub_merge legacy name also accepted for now)
- slub_min_objects= [MM, SLUB]
+ slab_min_objects= [MM]
The minimum number of objects per slab. SLUB will
- increase the slab order up to slub_max_order to
+ increase the slab order up to slab_max_order to
generate a sufficiently large slab able to contain
the number of objects indicated. The higher the number
of objects the smaller the overhead of tracking slabs
and the less frequently locks need to be acquired.
- For more information see Documentation/vm/slub.rst.
+ For more information see Documentation/mm/slub.rst.
+ (slub_min_objects legacy name also accepted for now)
- slub_min_order= [MM, SLUB]
+ slab_min_order= [MM]
Determines the minimum page order for slabs. Must be
- lower than slub_max_order.
- For more information see Documentation/vm/slub.rst.
+ lower than or equal to slab_max_order. For more information see
+ Documentation/mm/slub.rst.
+ (slub_min_order legacy name also accepted for now)
- slub_nomerge [MM, SLUB]
- Same with slab_nomerge. This is supported for legacy.
- See slab_nomerge for more information.
+ slab_nomerge [MM]
+ Disable merging of slabs with similar size. May be
+ necessary if there is some reason to distinguish
+ allocs to different slabs, especially in hardened
+ environments where the risk of heap overflows and
+ layout control by attackers can usually be
+ frustrated by disabling merging. This will reduce
+ most of the exposure of a heap attack to a single
+ cache (risks via metadata attacks are mostly
+ unchanged). Debug options disable merging on their
+ own.
+ For more information see Documentation/mm/slub.rst.
+ (slub_nomerge legacy name also accepted for now)
+
+ slram= [HW,MTD]
smart2= [HW]
Format: <io1>[,<io2>[,...,<io8>]]
+ smp.csd_lock_timeout= [KNL]
+ Specify the period of time in milliseconds
+ that smp_call_function() and friends will wait
+ for a CPU to release the CSD lock. This is
+ useful when diagnosing bugs involving CPUs
+ disabling interrupts for extended periods
+ of time. Defaults to 5,000 milliseconds, and
+ setting a value of zero disables this feature.
+ This feature may be more efficiently disabled
+ using the csdlock_debug- kernel parameter.
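+ Example (illustrative value): smp.csd_lock_timeout=10000
+ raises the wait to 10 seconds.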
+
+ smp.panic_on_ipistall= [KNL]
+ If a csd_lock_timeout extends for more than
+ the specified number of milliseconds, panic the
+ system. By default, CSD-lock acquisition is allowed
+ to take as long as it takes. Specifying 300,000
+ for this value provides a 5-minute timeout.
+
smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices
smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port
smsc-ircc2.ircc_sir= [HW] SIR base I/O port
@@ -4658,10 +6038,10 @@
1: Fast pin select (default)
2: ATC IRMode
- smt [KNL,S390] Set the maximum number of threads (logical
- CPUs) to use per physical CPU on systems capable of
- symmetric multithreading (SMT). Will be capped to the
- actual hardware limit.
+ smt= [KNL,MIPS,S390,EARLY] Set the maximum number of threads
+ (logical CPUs) to use per physical CPU on systems
+ capable of symmetric multithreading (SMT). Will
+ be capped to the actual hardware limit.
Format: <integer>
Default: -1 (no limit)
@@ -4683,7 +6063,7 @@
sonypi.*= [HW] Sony Programmable I/O Control Device driver
See Documentation/admin-guide/laptops/sonypi.rst
- spectre_v2= [X86] Control mitigation of Spectre variant 2
+ spectre_v2= [X86,EARLY] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability.
The default operation protects the kernel from
user space attacks.
@@ -4698,8 +6078,8 @@
Selecting 'on' will, and 'auto' may, choose a
mitigation method at run time according to the
CPU, the available microcode, the setting of the
- CONFIG_RETPOLINE configuration option, and the
- compiler with which the kernel was built.
+ CONFIG_MITIGATION_RETPOLINE configuration option,
+ and the compiler with which the kernel was built.
Selecting 'on' will also enable the mitigation
against user space to user space task attacks.
@@ -4710,8 +6090,13 @@
Specific mitigations can also be selected manually:
retpoline - replace indirect branches
- retpoline,generic - google's original retpoline
- retpoline,amd - AMD-specific minimal thunk
+ retpoline,generic - Retpolines
+ retpoline,lfence - LFENCE; indirect branch
+ retpoline,amd - alias for retpoline,lfence
+ eibrs - Enhanced/Auto IBRS
+ eibrs,retpoline - Enhanced/Auto IBRS + Retpolines
+ eibrs,lfence - Enhanced/Auto IBRS + LFENCE
+ ibrs - use IBRS to protect kernel
Not specifying this option is equivalent to
spectre_v2=auto.
@@ -4752,14 +6137,24 @@
auto - Kernel selects the mitigation depending on
the available CPU features and vulnerability.
- Default mitigation:
- If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl"
+ Default mitigation: "prctl"
Not specifying this option is equivalent to
spectre_v2_user=auto.
+ spec_rstack_overflow=
+ [X86,EARLY] Control RAS overflow mitigation on AMD Zen CPUs
+
+ off - Disable mitigation
+ microcode - Enable microcode mitigation only
+ safe-ret - Enable sw-only safe RET mitigation (default)
+ ibpb - Enable mitigation by issuing IBPB on
+ kernel entry
+ ibpb-vmexit - Issue IBPB only on VMEXIT
+ (cloud-specific mitigation)
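+ Example: spec_rstack_overflow=ibpb-vmexit limits
+ the IBPB mitigation to VM exits.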
+
spec_store_bypass_disable=
- [HW] Control Speculative Store Bypass (SSB) Disable mitigation
+ [HW,EARLY] Control Speculative Store Bypass (SSB) Disable mitigation
(Speculative Store Bypass vulnerability)
Certain CPUs are vulnerable to an exploit against a
@@ -4797,7 +6192,7 @@
will disable SSB unless they explicitly opt out.
Default mitigations:
- X86: If CONFIG_SECCOMP=y "seccomp", otherwise "prctl"
+ X86: "prctl"
On powerpc the options are:
@@ -4816,28 +6211,46 @@
spia_peddr=
split_lock_detect=
- [X86] Enable split lock detection
+ [X86] Enable split lock detection or bus lock detection
When enabled (and if hardware support is present), atomic
instructions that access data across cache line
- boundaries will result in an alignment check exception.
+ boundaries will result in an alignment check exception
+ for split lock detection or a debug exception for
+ bus lock detection.
off - not enabled
- warn - the kernel will emit rate limited warnings
+ warn - the kernel will emit rate-limited warnings
about applications triggering the #AC
- exception. This mode is the default on CPUs
- that supports split lock detection.
+ exception or the #DB exception. This mode is
+ the default on CPUs that support split lock
+ detection or bus lock detection. Default
+ behavior is by #AC if both features are
+ enabled in hardware.
fatal - the kernel will send SIGBUS to applications
- that trigger the #AC exception.
+ that trigger the #AC exception or the #DB
+ exception. Default behavior is by #AC if
+ both features are enabled in hardware.
+
+ ratelimit:N -
+ Set system wide rate limit to N bus locks
+ per second for bus lock detection.
+ 0 < N <= 1000.
+
+ N/A for split lock detection.
+
If an #AC exception is hit in the kernel or in
firmware (i.e. not while executing in user mode)
the kernel will oops in either "warn" or "fatal"
mode.
- srbds= [X86,INTEL]
+ #DB exception for bus lock is triggered only when
+ CPL > 0.
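+ Example (illustrative value):
+ split_lock_detect=ratelimit:100 sets a system wide
+ limit of 100 bus locks per second.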
+
+ srbds= [X86,INTEL,EARLY]
Control the Special Register Buffer Data Sampling
(SRBDS) mitigation.
@@ -4857,6 +6270,30 @@
off: Disable mitigation and remove
performance impact to RDRAND and RDSEED
+ srcutree.big_cpu_lim [KNL]
+ Specifies the number of CPUs constituting a
+ large system, such that srcu_struct structures
+ should immediately allocate an srcu_node array.
+ This kernel-boot parameter defaults to 128,
+ but takes effect only when the low-order four
+ bits of srcutree.convert_to_big is equal to 3
+ (decide at boot).
+
+ srcutree.convert_to_big [KNL]
+ Specifies under what conditions an SRCU tree
+ srcu_struct structure will be converted to big
+ form, that is, with an rcu_node tree:
+
+ 0: Never.
+ 1: At init_srcu_struct() time.
+ 2: When rcutorture decides to.
+ 3: Decide at boot time (default).
+ 0x1X: Above plus if high contention.
+
+ Either way, the srcu_node tree will be sized based
+ on the actual runtime number of CPUs (nr_cpu_ids)
+ instead of the compile-time CONFIG_NR_CPUS.
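+ Example: srcutree.convert_to_big=2 converts only
+ when rcutorture decides to.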
+
srcutree.counter_wrap_check [KNL]
Specifies how frequently to check for
grace-period sequence counter wrap for the
@@ -4874,7 +6311,33 @@
expediting. Set to zero to disable automatic
expediting.
- ssbd= [ARM64,HW]
+ srcutree.srcu_max_nodelay [KNL]
+ Specifies the number of no-delay instances
+ per jiffy for which the SRCU grace period
+ worker thread will be rescheduled with zero
+ delay. Beyond this limit, worker thread will
+ be rescheduled with a sleep delay of one jiffy.
+
+ srcutree.srcu_max_nodelay_phase [KNL]
+ Specifies, per grace-period phase, the number of
+ non-sleeping polls of readers. Beyond this limit,
+ grace period worker thread will be rescheduled
+ with a sleep delay of one jiffy, between each
+ rescan of the readers, for a grace period phase.
+
+ srcutree.srcu_retry_check_delay [KNL]
+ Specifies number of microseconds of non-sleeping
+ delay between each non-sleeping poll of readers.
+
+ srcutree.small_contention_lim [KNL]
+ Specifies the number of update-side contention
+ events per jiffy that will be tolerated before
+ initiating a conversion of an srcu_struct
+ structure to big form. Note that the value of
+ srcutree.convert_to_big must have the 0x10 bit
+ set for contention-based conversions to occur.
+
+ ssbd= [ARM64,HW,EARLY]
Speculative Store Bypass Disable control
On CPUs that are vulnerable to the Speculative
@@ -4898,12 +6361,18 @@
growing up) the main stack are reserved for no other
mapping. Default value is 256 pages.
+ stack_depot_disable= [KNL,EARLY]
+ Setting this to true on the kernel command line will
+ disable the stack depot thereby saving the static memory
+ consumed by the stack hash table. By default this is set
+ to false.
+
stacktrace [FTRACE]
Enabled the stack tracer on boot up.
stacktrace_filter=[function-list]
[FTRACE] Limit the functions that the stack tracer
- will trace at boot up. function-list is a comma separated
+ will trace at boot up. function-list is a comma-separated
list of functions. This list can be changed at run
time by the stack_trace_filter file in the debugfs
tracing directory. Note, this enables stack tracing
@@ -4922,6 +6391,25 @@
stifb= [HW]
Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]]
+ strict_sas_size=
+ [X86]
+ Format: <bool>
+ Enable or disable strict sigaltstack size checks
+ against the required signal frame size which
+ depends on the supported FPU features. This can
+ be used to filter out binaries which have
+ not yet been made aware of AT_MINSIGSTKSZ.
+
+ stress_hpt [PPC,EARLY]
+ Limits the number of kernel HPT entries in the hash
+ page table to increase the rate of hash page table
+ faults on kernel addresses.
+
+ stress_slb [PPC,EARLY]
+ Limits the number of kernel SLB entries, and flushes
+ them frequently to increase the rate of SLB faults
+ on kernel addresses.
+
sunrpc.min_resvport=
sunrpc.max_resvport=
[NFS,SUNRPC]
@@ -4977,19 +6465,17 @@
This parameter controls use of the Protected
Execution Facility on pSeries.
- swapaccount=[0|1]
- [KNL] Enable accounting of swap in memory resource
- controller if no parameter or 1 is given or disable
- it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
-
- swiotlb= [ARM,IA-64,PPC,MIPS,X86]
- Format: { <int> | force | noforce }
+ swiotlb= [ARM,IA-64,PPC,MIPS,X86,EARLY]
+ Format: { <int> [,<int>] | force | noforce }
<int> -- Number of I/O TLB slabs
+ <int> -- Second integer after comma. Number of swiotlb
+ areas with their own lock. Will be rounded up
+ to a power of 2.
force -- force use of bounce buffers even if they
wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)
- switches= [HW,M68k]
+ switches= [HW,M68k,EARLY]
sysctl.*= [KNL]
Set a sysctl parameter, right before loading the init
@@ -5000,15 +6486,6 @@
later by a loaded module cannot be set this way.
Example: sysctl.vm.swappiness=40
- sysfs.deprecated=0|1 [KNL]
- Enable/disable old style sysfs layout for old udev
- on older distributions. When this option is enabled
- very new udev will not work anymore. When this option
- is disabled (or CONFIG_SYSFS_DEPRECATED not compiled)
- in older udev will not work anymore.
- Default depends on CONFIG_SYSFS_DEPRECATED_V2 set in
- the kernel configuration.
-
sysrq_always_enabled
[KNL]
Ignore sysrq setting - this boot parameter will
@@ -5024,7 +6501,8 @@
tdfx= [HW,DRM]
- test_suspend= [SUSPEND][,N]
+ test_suspend= [SUSPEND]
+ Format: { "mem" | "standby" | "freeze" }[,N]
Specify "mem" (for Suspend-to-RAM) or "standby" (for
standby suspend) or "freeze" (for suspend type freeze)
as the system sleep state during system startup with
@@ -5043,10 +6521,6 @@
-1: disable all critical trip points in all thermal zones
<degrees C>: override all critical trip points
- thermal.nocrt= [HW,ACPI]
- Set to disable actions on ACPI thermal zone
- critical and hot trip points.
-
thermal.off= [HW,ACPI]
1: disable ACPI thermal control
@@ -5060,11 +6534,11 @@
<deci-seconds>: poll all this frequency
0: no polling (default)
- threadirqs [KNL]
+ threadirqs [KNL,EARLY]
Force threading of all interrupt handlers except those
marked explicitly IRQF_NO_THREAD.
- topology= [S390]
+ topology= [S390,EARLY]
Format: {off | on}
Specify if the kernel should make use of the cpu
topology information if the hardware supports this.
@@ -5082,6 +6556,21 @@
Prevent the CPU-hotplug component of torturing
until after init has spawned.
+ torture.ftrace_dump_at_shutdown= [KNL]
+ Dump the ftrace buffer at torture-test shutdown,
+ even if there were no errors. This can be a
+ very costly operation when many torture tests
+ are running concurrently, especially on systems
+ with rotating-rust storage.
+
+ torture.verbose_sleep_frequency= [KNL]
+ Specifies how many verbose printk()s should be
+ emitted between each sleep. The default of zero
+ disables verbose-printk() sleeping.
+
+ torture.verbose_sleep_duration= [KNL]
+ Duration of each verbose-printk() sleep in jiffies.
+
tp720= [HW,PS2]
tpm_suspend_pcr=[HW,TPM]
@@ -5093,22 +6582,102 @@
This will guarantee that all the other pcrs
are saved.
+ tpm_tis.interrupts= [HW,TPM]
+ Enable interrupts for the MMIO based physical layer
+ for the FIFO interface. By default it is set to false
+ (0). For more information about TPM hardware interfaces
+ defined by Trusted Computing Group (TCG) see
+ https://trustedcomputinggroup.org/resource/pc-client-platform-tpm-profile-ptp-specification/
+
+ tp_printk [FTRACE]
+ Have the tracepoints sent to printk as well as the
+ tracing ring buffer. This is useful for early boot up
+ where the system hangs or reboots and does not give the
+ option for reading the tracing buffer or performing a
+ ftrace_dump_on_oops.
+
+ To turn off having tracepoints sent to printk,
+ echo 0 > /proc/sys/kernel/tracepoint_printk
+ Note, echoing 1 into this file without the
+ tracepoint_printk kernel cmdline option has no effect.
+
+ The tp_printk_stop_on_boot (see below) can also be used
+ to stop the printing of events to console at
+ late_initcall_sync.
+
+ ** CAUTION **
+
+ Having tracepoints sent to printk() and activating high
+ frequency tracepoints such as irq or sched, can cause
+ the system to live lock.
+
+ tp_printk_stop_on_boot [FTRACE]
+ When tp_printk (above) is set, it can cause a lot of noise
+ on the console. It may be useful to only include the
+ printing of events during boot up, as user space may
+ make the system inoperable.
+
+ This command line option will stop the printing of events
+ to console at the late_initcall_sync() time frame.
+
trace_buf_size=nn[KMG]
[FTRACE] will set tracing buffer size on each cpu.
+ trace_clock= [FTRACE] Set the clock used for tracing events
+ at boot up.
+ local - Use the per CPU time stamp counter
+ (converted into nanoseconds). Fast, but
+ depending on the architecture, may not be
+ in sync between CPUs.
+ global - Event time stamps are synchronized across
+ CPUs. May be slower than the local clock,
+ but better for some race conditions.
+ counter - Simple counting of events (1, 2, ..)
+ note, some counts may be skipped due to the
+ infrastructure grabbing the clock more than
+ once per event.
+ uptime - Use jiffies as the time stamp.
+ perf - Use the same clock that perf uses.
+ mono - Use ktime_get_mono_fast_ns() for time stamps.
+ mono_raw - Use ktime_get_raw_fast_ns() for time
+ stamps.
+ boot - Use ktime_get_boot_fast_ns() for time stamps.
+ Architectures may add more clocks. See
+ Documentation/trace/ftrace.rst for more details.
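+ Example: trace_clock=global selects time stamps
+ that are synchronized across CPUs.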
+
trace_event=[event-list]
[FTRACE] Set and start specified trace events in order
to facilitate early boot debugging. The event-list is a
- comma separated list of trace events to enable. See
+ comma-separated list of trace events to enable. See
also Documentation/trace/events.rst
+ trace_instance=[instance-info]
+ [FTRACE] Create a ring buffer instance early in boot up.
+ This will be listed in:
+
+ /sys/kernel/tracing/instances
+
+ Events can be enabled at the time the instance is created
+ via:
+
+ trace_instance=<name>,<system1>:<event1>,<system2>:<event2>
+
+ Note, the "<system*>:" portion is optional if the event is
+ unique.
+
+ trace_instance=foo,sched:sched_switch,irq_handler_entry,initcall
+
+ will enable the "sched_switch" event (note, the "sched:" is optional, and
+ the same thing would happen if it was left off), the irq_handler_entry
+ event, and all events under the "initcall" system.
+
trace_options=[option-list]
[FTRACE] Enable or disable tracer options at boot.
The option-list is a comma delimited list of options
that can be enabled or disabled just as if you were
to echo the option name into
- /sys/kernel/debug/tracing/trace_options
+ /sys/kernel/tracing/trace_options
For example, to enable stacktrace option (to dump the
stack trace of each event), add to the command line:
@@ -5118,29 +6687,30 @@
See also Documentation/trace/ftrace.rst "trace options"
section.
- tp_printk[FTRACE]
- Have the tracepoints sent to printk as well as the
- tracing ring buffer. This is useful for early boot up
- where the system hangs or reboots and does not give the
- option for reading the tracing buffer or performing a
- ftrace_dump_on_oops.
+ trace_trigger=[trigger-list]
+ [FTRACE] Add an event trigger on specific events.
+ Set a trigger on top of a specific event, with an optional
+ filter.
- To turn off having tracepoints sent to printk,
- echo 0 > /proc/sys/kernel/tracepoint_printk
- Note, echoing 1 into this file without the
- tracepoint_printk kernel cmdline option has no effect.
+ The format is "trace_trigger=<event>.<trigger>[ if <filter>],..."
+ More than one trigger may be specified, comma-delimited.
- ** CAUTION **
+ For example:
+
+ trace_trigger="sched_switch.stacktrace if prev_state == 2"
+
+ The above will enable the "stacktrace" trigger on the "sched_switch"
+ event but only trigger it if the "prev_state" of the "sched_switch"
+ event is "2" (TASK_UNINTERUPTIBLE).
+
+ See also "Event triggers" in Documentation/trace/events.rst
- Having tracepoints sent to printk() and activating high
- frequency tracepoints such as irq or sched, can cause
- the system to live lock.
traceoff_on_warning
[FTRACE] enable this option to disable tracing when a
warning is hit. This turns off "tracing_on". Tracing can
be enabled again by echoing '1' into the "tracing_on"
- file located in /sys/kernel/debug/tracing/
+ file located in /sys/kernel/tracing/
This option is useful, as it disables the trace before
the WARNING dump is called, which prevents the trace to
@@ -5157,6 +6727,29 @@
See Documentation/admin-guide/mm/transhuge.rst
for more details.
+ trusted.source= [KEYS]
+ Format: <string>
+ This parameter identifies the trust source as a backend
+ for trusted keys implementation. Supported trust
+ sources:
+ - "tpm"
+ - "tee"
+ - "caam"
+ If not specified, the trust source list is iterated,
+ starting with TPM, and the first trust source that
+ initializes successfully is assigned as the backend.
+
+ trusted.rng= [KEYS]
+ Format: <string>
+ The RNG used to generate key material for trusted keys.
+ Can be one of:
+ - "kernel"
+ - the same value as trusted.source: "tpm" or "tee"
+ - "default"
+ If not specified, "default" is used. In this case,
+ the choice of RNG is left to each individual trust source.
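+ Example (illustrative combination): trusted.source=tee
+ trusted.rng=kernel selects the TEE backend with the
+ kernel RNG.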
+
tsc= Disable clocksource stability checks for TSC.
Format: <string>
[x86] reliable: mark tsc clocksource as reliable, this
@@ -5175,8 +6768,18 @@
in situations with strict latency requirements (where
interruptions from clocksource watchdog are not
acceptable).
-
- tsc_early_khz= [X86] Skip early TSC calibration and use the given
+ [x86] recalibrate: force recalibration against a HW timer
+ (HPET or PM timer) on systems whose TSC frequency was
+ obtained from HW or FW using either an MSR or CPUID(0x15).
+ Warn if the difference is more than 500 ppm.
+ [x86] watchdog: Use TSC as the watchdog clocksource with
+ which to check other HW timers (HPET or PM timer), but
+ only on systems where TSC has been deemed trustworthy.
+ This will be suppressed by an earlier tsc=nowatchdog and
+ can be overridden by a later tsc=nowatchdog. A console
+ message will flag any such suppression or overriding.
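+
+ For example, to skip runtime clocksource verification on a
+ system whose TSC is known to be stable (a sketch):
+
+ tsc=reliable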
+
+ tsc_early_khz= [X86,EARLY] Skip early TSC calibration and use the given
value instead. Useful when the early TSC frequency discovery
procedure is not reliable, such as on overclocked systems
with CPUID.16h support and partial CPUID.15h support.
@@ -5211,7 +6814,7 @@
See Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
for more details.
- tsx_async_abort= [X86,INTEL] Control mitigation for the TSX Async
+ tsx_async_abort= [X86,INTEL,EARLY] Control mitigation for the TSX Async
Abort (TAA) vulnerability.
Similar to Micro-architectural Data Sampling (MDS)
@@ -5277,9 +6880,15 @@
unknown_nmi_panic
[X86] Cause panic on unknown NMI.
+ unwind_debug [X86-64,EARLY]
+ Enable unwinder debug output. This can be
+ useful for debugging certain unwinder error
+ conditions, including corrupt stacks and
+ bad/missing unwinder metadata.
+
usbcore.authorized_default=
[USB] Default USB device authorization:
- (default -1 = authorized except for wireless USB,
+ (default -1 = authorized (same as 1),
0 = not authorized, 1 = authorized, 2 = authorized
if device connected to internal port)
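
For example (a sketch): booting with
usbcore.authorized_default=0 keeps newly connected USB
devices deauthorized until user space writes 1 to their
"authorized" sysfs attribute.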
@@ -5377,6 +6986,9 @@
pause after every control message);
o = USB_QUIRK_HUB_SLOW_RESET (Hub needs extra
delay after resetting its port);
+ p = USB_QUIRK_SHORT_SET_ADDRESS_REQ_TIMEOUT
+ (Reduce timeout of the SET_ADDRESS
+ request from 5000 ms to 500 ms);
Example: quirks=0781:5580:bk,0a5c:5834:gij
usbhid.mousepoll=
@@ -5421,6 +7033,7 @@
device);
j = NO_REPORT_LUNS (don't use report luns
command, uas only);
+ k = NO_SAME (do not use WRITE_SAME, uas only)
l = NOT_LOCKABLE (don't try to lock and
unlock ejectable media, not on uas);
m = MAX_SECTORS_64 (don't transfer more
@@ -5457,13 +7070,13 @@
Example: user_debug=31
userpte=
- [X86] Flags controlling user PTE allocations.
+ [X86,EARLY] Flags controlling user PTE allocations.
nohigh = do not allocate PTE pages in
HIGHMEM regardless of setting
of CONFIG_HIGHPTE.
- vdso= [X86,SH]
+ vdso= [X86,SH,SPARC]
On X86_32, this is an alias for vdso32=. Otherwise:
vdso=1: enable VDSO (the default)
@@ -5486,14 +7099,15 @@
vector= [IA-64,SMP]
vector=percpu: enable percpu vector domain
- video= [FB] Frame buffer configuration
+ video= [FB,EARLY] Frame buffer configuration
See Documentation/fb/modedb.rst.
- video.brightness_switch_enabled= [0,1]
+ video.brightness_switch_enabled= [ACPI]
+ Format: [0|1]
If set to 1, on receiving an ACPI notify event
generated by hotkey, video driver will adjust brightness
level and then send out the event to user space through
- the allocated input device; If set to 0, video driver
+ the allocated input device. If set to 0, video driver
will only send out the event without touching backlight
brightness level.
default: 1
@@ -5515,7 +7129,7 @@
Can be used multiple times for multiple devices.
vga= [BOOT,X86-32] Select a particular video mode
- See Documentation/x86/boot.rst and
+ See Documentation/arch/x86/boot.rst and
Documentation/admin-guide/svga.rst.
Use vga=ask for menu.
This is actually a boot loader parameter; the value is
@@ -5533,13 +7147,13 @@
P Enable page structure init time poisoning
- Disable all of the above options
- vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact
- size of <nn>. This can be used to increase the
- minimum size (128MB on x86). It can also be used to
- decrease the size and leave more room for directly
- mapped kernel RAM.
+ vmalloc=nn[KMG] [KNL,BOOT,EARLY] Forces the vmalloc area to have an
+ exact size of <nn>. This can be used to increase
+ the minimum size (128MB on x86). It can also be
+ used to decrease the size and leave more room
+ for directly mapped kernel RAM.
- vmcp_cma=nn[MG] [KNL,S390]
+ vmcp_cma=nn[MG] [KNL,S390,EARLY]
Sets the memory size reserved for contiguous memory
allocations for the vmcp device driver.
@@ -5552,7 +7166,7 @@
vmpoff= [KNL,S390] Perform z/VM CP command after power off.
Format: <command>
- vsyscall= [X86-64]
+ vsyscall= [X86-64,EARLY]
Controls the behavior of vsyscalls (i.e. calls to
fixed addresses of 0xffffffffff600x00 from legacy
code). Most statically-linked binaries and older
@@ -5560,11 +7174,11 @@
functions are at fixed addresses, they make nice
targets for exploits that can control RIP.
- emulate [default] Vsyscalls turn into traps and are
- emulated reasonably safely. The vsyscall
- page is readable.
+ emulate Vsyscalls turn into traps and are emulated
+ reasonably safely. The vsyscall page is
+ readable.
- xonly Vsyscalls turn into traps and are
+ xonly [default] Vsyscalls turn into traps and are
emulated reasonably safely. The vsyscall
page is not readable.
@@ -5634,6 +7248,13 @@
disables both lockup detectors. Default is 10
seconds.
+ workqueue.unbound_cpus=
+ [KNL,SMP] Constrain unbound workqueues to use only
+ the specified CPUs.
+ Format: <cpu-list>
+ By default, all online CPUs are available for
+ unbound workqueues.
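+
+ For example, to restrict unbound workqueue work to CPUs
+ 4-7 (a sketch):
+
+ workqueue.unbound_cpus=4-7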
+
workqueue.watchdog_thresh=
If CONFIG_WQ_WATCHDOG is configured, workqueue can
warn stall conditions and dump internal state to
@@ -5643,14 +7264,26 @@
it can be updated at runtime by writing to the
corresponding sysfs file.
- workqueue.disable_numa
- By default, all work items queued to unbound
- workqueues are affine to the NUMA nodes they're
- issued on, which results in better behavior in
- general. If NUMA affinity needs to be disabled for
- whatever reason, this option can be used. Note
- that this also can be controlled per-workqueue for
- workqueues visible under /sys/bus/workqueue/.
+ workqueue.cpu_intensive_thresh_us=
+ Per-cpu work items which run for longer than this
+ threshold are automatically considered CPU intensive
+ and excluded from concurrency management to prevent
+ them from noticeably delaying other per-cpu work
+ items. Default is 10000 (10ms).
+
+ If CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel
+ will report the work functions which violate this
+ threshold repeatedly. They are likely good
+ candidates for using WQ_UNBOUND workqueues instead.
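+
+ The threshold can also be tuned at runtime, assuming the
+ module parameter is exposed read-write (a sketch):
+
+ echo 20000 > /sys/module/workqueue/parameters/cpu_intensive_thresh_us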
+
+ workqueue.cpu_intensive_warning_thresh=<uint>
+ If CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel
+ will report the work functions which violate
+ cpu_intensive_thresh_us repeatedly. In order to prevent
+ spurious warnings, printing starts only after a work
+ function has violated the threshold this many times.
+
+ The default is 4 times. 0 disables the warning.
workqueue.power_efficient
Per-cpu workqueues are generally preferred because
@@ -5667,6 +7300,18 @@
The default value of this parameter is determined by
the config option CONFIG_WQ_POWER_EFFICIENT_DEFAULT.
+ workqueue.default_affinity_scope=
+ Select the default affinity scope to use for unbound
+ workqueues. Can be one of "cpu", "smt", "cache",
+ "numa" and "system". Default is "cache". For more
+ information, see the Affinity Scopes section in
+ Documentation/core-api/workqueue.rst.
+
+ This can be changed after boot by writing to the
+ matching /sys/module/workqueue/parameters file. All
+ workqueues with the "default" affinity scope will be
+ updated accordingly.
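+
+ For example, to switch all default-scope unbound
+ workqueues to NUMA affinity at runtime (a sketch):
+
+ echo numa > /sys/module/workqueue/parameters/default_affinity_scope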
+
workqueue.debug_force_rr_cpu
Workqueue used to implicitly guarantee that work
items queued without explicit CPU specified are put
@@ -5678,16 +7323,16 @@
When enabled, memory and cache locality will be
impacted.
- x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
+ writecombine= [LOONGARCH,EARLY] Control the MAT (Memory Access
+ Type) of ioremap_wc().
+
+ on - Enable writecombine, use WUC for ioremap_wc()
+ off - Disable writecombine, use SUC for ioremap_wc()
+
+ x2apic_phys [X86-64,APIC,EARLY] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
- x86_intel_mid_timer= [X86-32,APBT]
- Choose timer option for x86 Intel MID platform.
- Two valid options are apbt timer only and lapic timer
- plus one apbt timer for broadcast timer.
- x86_intel_mid_timer=apbt_only | lapic_and_apbt
-
xen_512gb_limit [KNL,X86-64,XEN]
Restricts the kernel running paravirtualized under Xen
to use only up to 512 GB of RAM. The reason to do so is
@@ -5695,7 +7340,7 @@
save/restore/migration must be enabled to handle larger
domains.
- xen_emul_unplug= [HW,X86,XEN]
+ xen_emul_unplug= [HW,X86,XEN,EARLY]
Unplug Xen emulated devices
Format: [unplug0,][unplug1]
ide-disks -- unplug primary master IDE devices
@@ -5707,13 +7352,20 @@
the unplug protocol
never -- do not unplug even if version check succeeds
- xen_legacy_crash [X86,XEN]
+ xen_legacy_crash [X86,XEN,EARLY]
Crash from Xen panic notifier, without executing late
panic() code such as dumping handler.
- xen_nopvspin [X86,XEN]
- Disables the ticketlock slowpath using Xen PV
- optimizations.
+ xen_msr_safe= [X86,XEN,EARLY]
+ Format: <bool>
+ Select whether to always use non-faulting (safe) MSR
+ access functions when running as Xen PV guest. The
+ default value is controlled by CONFIG_XEN_PV_MSR_SAFE.
+
+ xen_nopvspin [X86,XEN,EARLY]
+ Disables the qspinlock slowpath using Xen PV optimizations.
+ This parameter is obsoleted by the "nopvspin" parameter,
+ which has an equivalent effect on the XEN platform.
xen_nopv [X86]
Disables the PV optimizations forcing the HVM guest to
@@ -5721,23 +7373,44 @@
This option is obsoleted by the "nopv" option, which
has equivalent effect for XEN platform.
+ xen_no_vector_callback
+ [KNL,X86,XEN,EARLY] Disable the vector callback for Xen
+ event channel interrupts.
+
xen_scrub_pages= [XEN]
Boolean option to control scrubbing pages before giving them back
to Xen, for use by other domains. Can be also changed at runtime
with /sys/devices/system/xen_memory/xen_memory0/scrub_pages.
Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT.
- xen_timer_slop= [X86-64,XEN]
+ xen_timer_slop= [X86-64,XEN,EARLY]
Set the timer slop (in nanoseconds) for the virtual Xen
timers (default is 100000). This adjusts the minimum
delta of virtualized Xen timers, where lower values
improve timer resolution at the expense of processing
more timer interrupts.
- nopv= [X86,XEN,KVM,HYPER_V,VMWARE]
- Disables the PV optimizations forcing the guest to run
- as generic guest with no PV drivers. Currently support
- XEN HVM, KVM, HYPER_V and VMWARE guest.
+ xen.balloon_boot_timeout= [XEN]
+ The time (in seconds) to wait before giving up on booting
+ in case initial ballooning fails to free enough memory.
+ Applies only when running as HVM or PVH guest and
+ started with less memory configured than allowed at
+ max. Default is 180.
+
+ xen.event_eoi_delay= [XEN]
+ How long to delay EOI handling in case of event
+ storms (jiffies). Default is 10.
+
+ xen.event_loop_timeout= [XEN]
+ After which time (jiffies) the event handling loop
+ should start to delay EOI handling. Default is 2.
+
+ xen.fifo_events= [XEN]
+ Boolean parameter to disable using fifo event handling
+ even if available. Normally fifo event handling is
+ preferred over the 2-level event handling, as it is
+ fairer and the number of possible event channels is
+ much higher. Default is on (use fifo events).
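+
+ For example, a guest could be booted with (a sketch using
+ hypothetical values):
+
+ xen.balloon_boot_timeout=300 xen.fifo_events=0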
xirc2ps_cs= [NET,PCMCIA]
Format:
@@ -5752,12 +7425,18 @@
controller on both pseries and powernv
platforms. Only useful on POWER9 and above.
+ xive.store-eoi=off [PPC]
+ By default on POWER10 and above, the kernel will use
+ stores for EOI handling when the XIVE interrupt mode
+ is active. This option allows the XIVE driver to use
+ loads instead, as on POWER9.
+
xhci-hcd.quirks [USB,KNL]
A hex value specifying bitmask with supplemental xhci
host controller quirks. Meaning of each bit can be
consulted in header drivers/usb/host/xhci.h.
- xmon [PPC]
+ xmon [PPC,EARLY]
Format: { early | on | rw | ro | off }
Controls if xmon debugger is enabled. Default is off.
Passing only "xmon" is equivalent to "xmon=early".
@@ -5775,3 +7454,4 @@
memory, and other data can't be written using
xmon commands.
off xmon is disabled.
+
diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
index dc36aeb65d0a..b6aeae3327ce 100644
--- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
+++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
@@ -25,7 +25,7 @@ References
- In order to locate kernel-generated OS jitter on CPU N:
- cd /sys/kernel/debug/tracing
+ cd /sys/kernel/tracing
echo 1 > max_graph_depth # Increase the "1" for more detail
echo function_graph > current_tracer
# run workload
@@ -208,7 +208,7 @@ Do at least one of the following:
2. Enable RCU to do its processing remotely via dyntick-idle by
doing all of the following:
- a. Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y.
+ a. Build with CONFIG_NO_HZ=y.
b. Ensure that the CPU goes idle frequently, allowing other
CPUs to detect that it has passed through an RCU quiescent
state. If the kernel is built with CONFIG_NO_HZ_FULL=y,
@@ -243,13 +243,9 @@ To reduce its OS jitter, do any of the following:
3. Do any of the following needed to avoid jitter that your
application cannot tolerate:
- a. Build your kernel with CONFIG_SLUB=y rather than
- CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
- use of each CPU's workqueues to run its cache_reap()
- function.
- b. Avoid using oprofile, thus avoiding OS jitter from
+ a. Avoid using oprofile, thus avoiding OS jitter from
wq_sync_buffer().
- c. Limit your CPU frequency so that a CPU-frequency
+ b. Limit your CPU frequency so that a CPU-frequency
governor is not required, possibly enlisting the aid of
special heatsinks or other cooling technologies. If done
correctly, and if your CPU architecture permits, you should
@@ -259,7 +255,7 @@ To reduce its OS jitter, do any of the following:
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
- d. As of v3.18, Christoph Lameter's on-demand vmstat workers
+ c. As of v3.18, Christoph Lameter's on-demand vmstat workers
commit prevents OS jitter due to vmstat_update() on
CONFIG_SMP=y systems. Before v3.18, it is not possible
to entirely get rid of the OS jitter, but you can
@@ -273,8 +269,8 @@ To reduce its OS jitter, do any of the following:
However, there is an RFC patch from Christoph Lameter
(based on an earlier one from Gilad Ben-Yossef) that
reduces or even eliminates vmstat overhead for some
- workloads at https://lkml.org/lkml/2013/9/4/379.
- e. If running on high-end powerpc servers, build with
+ workloads at https://lore.kernel.org/r/00000140e9dfd6bd-40db3d4f-c1be-434f-8132-7820f81bb586-000000@email.amazonses.com.
+ d. If running on high-end powerpc servers, build with
CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
daemon from running on each CPU every second or so.
(This will require editing Kconfig files and will defeat
@@ -282,12 +278,12 @@ To reduce its OS jitter, do any of the following:
due to the rtas_event_scan() function.
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
- f. If running on Cell Processor, build your kernel with
+ e. If running on Cell Processor, build your kernel with
CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
spu_gov_work().
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
- g. If running on PowerMAC, build your kernel with
+ f. If running on PowerMAC, build your kernel with
CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
avoiding OS jitter from rackmeter_do_timer().
@@ -332,23 +328,3 @@ To reduce its OS jitter, do at least one of the following:
kthreads from being created in the first place. However, please
note that this will not eliminate OS jitter, but will instead
shift it to RCU_SOFTIRQ.
-
-Name:
- watchdog/%u
-
-Purpose:
- Detect software lockups on each CPU.
-
-To reduce its OS jitter, do at least one of the following:
-
-1. Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these
- kthreads from being created in the first place.
-2. Boot with "nosoftlockup=0", which will also prevent these kthreads
- from being created. Other related watchdog and softlockup boot
- parameters may be found in Documentation/admin-guide/kernel-parameters.rst
- and Documentation/watchdog/watchdog-parameters.rst.
-3. Echo a zero to /proc/sys/kernel/watchdog to disable the
- watchdog timer.
-4. Echo a large number of /proc/sys/kernel/watchdog_thresh in
- order to reduce the frequency of OS jitter due to the watchdog
- timer down to a level that is acceptable for your workload.
diff --git a/Documentation/admin-guide/laptops/disk-shock-protection.rst b/Documentation/admin-guide/laptops/disk-shock-protection.rst
index e97c5f78d8c3..22c7ec3e84cf 100644
--- a/Documentation/admin-guide/laptops/disk-shock-protection.rst
+++ b/Documentation/admin-guide/laptops/disk-shock-protection.rst
@@ -135,7 +135,7 @@ single project which, although still considered experimental, is fit
for use. Please feel free to add projects that have been the victims
of my ignorance.
-- http://www.thinkwiki.org/wiki/HDAPS
+- https://www.thinkwiki.org/wiki/HDAPS
See this page for information about Linux support of the hard disk
active protection system as implemented in IBM/Lenovo Thinkpads.
diff --git a/Documentation/admin-guide/laptops/laptop-mode.rst b/Documentation/admin-guide/laptops/laptop-mode.rst
index c984c4262f2e..b61cc601d298 100644
--- a/Documentation/admin-guide/laptops/laptop-mode.rst
+++ b/Documentation/admin-guide/laptops/laptop-mode.rst
@@ -101,17 +101,6 @@ this results in concentration of disk activity in a small time interval which
occurs only once every 10 minutes, or whenever the disk is forced to spin up by
a cache miss. The disk can then be spun down in the periods of inactivity.
-If you want to find out which process caused the disk to spin up, you can
-gather information by setting the flag /proc/sys/vm/block_dump. When this flag
-is set, Linux reports all disk read and write operations that take place, and
-all block dirtyings done to files. This makes it possible to debug why a disk
-needs to spin up, and to increase battery life even more. The output of
-block_dump is written to the kernel output, and it can be retrieved using
-"dmesg". When you use block_dump and your kernel logging level also includes
-kernel debugging messages, you probably want to turn off klogd, otherwise
-the output of block_dump will be logged, causing disk activity that is not
-normally there.
-
Configuration
-------------
diff --git a/Documentation/admin-guide/laptops/lg-laptop.rst b/Documentation/admin-guide/laptops/lg-laptop.rst
index ce9b14671cb9..67fd6932cef4 100644
--- a/Documentation/admin-guide/laptops/lg-laptop.rst
+++ b/Documentation/admin-guide/laptops/lg-laptop.rst
@@ -13,10 +13,8 @@ Hotkeys
The following FN keys are ignored by the kernel without this driver:
- FN-F1 (LG control panel) - Generates F15
-- FN-F5 (Touchpad toggle) - Generates F13
+- FN-F5 (Touchpad toggle) - Generates F21
- FN-F6 (Airplane mode) - Generates RFKILL
-- FN-F8 (Keyboard backlight) - Generates F16.
- This key also changes keyboard backlight mode.
- FN-F9 (Reader mode) - Generates F14
The rest of the FN keys work without a need for a special driver.
@@ -40,7 +38,7 @@ FN lock.
Battery care limit
------------------
-Writing 80/100 to /sys/devices/platform/lg-laptop/battery_care_limit
+Writing 80/100 to /sys/class/power_supply/CMB0/charge_control_end_threshold
sets the maximum capacity to charge the battery. Limiting the charge
reduces battery capacity loss over time.
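
For example (a sketch; the power supply name CMB0 is taken from the path
above)::

    echo 80 > /sys/class/power_supply/CMB0/charge_control_end_threshold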
diff --git a/Documentation/admin-guide/laptops/sonypi.rst b/Documentation/admin-guide/laptops/sonypi.rst
index c6eaaf48f7c1..190da1234314 100644
--- a/Documentation/admin-guide/laptops/sonypi.rst
+++ b/Documentation/admin-guide/laptops/sonypi.rst
@@ -151,7 +151,7 @@ Bugs:
different way to adjust the backlighting of the screen. There
is a userspace utility to adjust the brightness on those models,
which can be downloaded from
- http://www.acc.umu.se/~erikw/program/smartdimmer-0.1.tar.bz2
+ https://www.acc.umu.se/~erikw/program/smartdimmer-0.1.tar.bz2
- since all development was done by reverse engineering, there is
*absolutely no guarantee* that this driver will not crash your
diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
index 822907dcc845..7f674a6cfa8a 100644
--- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst
+++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
@@ -50,6 +50,10 @@ detailed description):
- WAN enable and disable
- UWB enable and disable
- LCD Shadow (PrivacyGuard) enable and disable
+ - Lap mode sensor
+ - Setting keyboard language
+ - WWAN Antenna type
+ - Auxmac
A compatibility table by model and feature is maintained on the web
site, http://ibm-acpi.sf.net/. I appreciate any success or failure
@@ -440,7 +444,9 @@ event code Key Notes
0x1008 0x07 FN+F8 IBM: toggle screen expand
Lenovo: configure UltraNav,
- or toggle screen expand
+ or toggle screen expand.
+ On newer platforms (2024+)
+ replaced by 0x131f (see below)
0x1009 0x08 FN+F9 -
@@ -500,6 +506,9 @@ event code Key Notes
0x1019 0x18 unknown
+0x131f ... FN+F8 Platform Mode change.
+ Implemented in driver.
+
... ... ...
0x1020 0x1F unknown
@@ -904,7 +913,7 @@ temperatures:
The mapping of thermal sensors to physical locations varies depending on
system-board model (and thus, on ThinkPad model).
-http://thinkwiki.org/wiki/Thermal_Sensors is a public wiki page that
+https://thinkwiki.org/wiki/Thermal_Sensors is a public wiki page that
tries to track down these locations for various models.
Most (newer?) models seem to follow this pattern:
@@ -925,7 +934,7 @@ For the R51 (source: Thomas Gruber):
- 3: Internal HDD
For the T43, T43/p (source: Shmidoax/Thinkwiki.org)
-http://thinkwiki.org/wiki/Thermal_Sensors#ThinkPad_T43.2C_T43p
+https://thinkwiki.org/wiki/Thermal_Sensors#ThinkPad_T43.2C_T43p
- 2: System board, left side (near PCMCIA slot), reported as HDAPS temp
- 3: PCMCIA slot
@@ -935,7 +944,7 @@ http://thinkwiki.org/wiki/Thermal_Sensors#ThinkPad_T43.2C_T43p
- 11: Power regulator, underside of system board, below F2 key
The A31 has a very atypical layout for the thermal sensors
-(source: Milos Popovic, http://thinkwiki.org/wiki/Thermal_Sensors#ThinkPad_A31)
+(source: Milos Popovic, https://thinkwiki.org/wiki/Thermal_Sensors#ThinkPad_A31)
- 1: CPU
- 2: Main Battery: main sensor
@@ -1432,6 +1441,20 @@ The first command ensures the best viewing angle and the latter one turns
on the feature, restricting the viewing angles.
+DYTC Lapmode sensor
+-------------------
+
+sysfs: dytc_lapmode
+
+Newer thinkpads and mobile workstations have the ability to determine if
+the device is in deskmode or lapmode. This feature is used by user space
+to decide if WWAN transmission can be increased to maximum power and is
+also useful for understanding the different thermal modes available as
+they differ between desk and lap mode.
+
+The property is read-only. If the platform doesn't have support, the sysfs
+attribute is not created.
+
EXPERIMENTAL: UWB
-----------------
@@ -1451,6 +1474,68 @@ Sysfs notes
rfkill controller switch "tpacpi_uwb_sw": refer to
Documentation/driver-api/rfkill.rst for details.
+
+Setting keyboard language
+-------------------------
+
+sysfs: keyboard_lang
+
+This feature is used to set the keyboard language in the ECFW via the ASL
+interface. A few ThinkPad models, such as the T580, T590 and T15 Gen 1, have
+"=", "(" and ")" numeric keys which do not display correctly when the keyboard
+language is set to anything other than "english", because the default keyboard
+language in the ECFW is "english". Using this sysfs attribute, the user can set
+the correct keyboard language in the ECFW so that these keys work correctly.
+
+Example of command to set keyboard language is mentioned below::
+
+ echo jp > /sys/devices/platform/thinkpad_acpi/keyboard_lang
+
+The text values corresponding to the keyboard layouts that can be written to
+this sysfs attribute are: be(Belgian),
+cz(Czech), da(Danish), de(German), en(English), es(Spain), et(Estonian),
+fr(French), fr-ch(French(Switzerland)), hu(Hungarian), it(Italy), jp (Japan),
+nl(Dutch), nn(Norway), pl(Polish), pt(portuguese), sl(Slovenian), sv(Sweden),
+tr(Turkey)
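+
+For example, to switch the keyboard language to German (a sketch; whether a
+read reports the current setting depends on the firmware)::
+
+  echo de > /sys/devices/platform/thinkpad_acpi/keyboard_lang
+  cat /sys/devices/platform/thinkpad_acpi/keyboard_lang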
+
+WWAN Antenna type
+-----------------
+
+sysfs: wwan_antenna_type
+
+On some newer Thinkpads we need to set the SAR value based on the antenna
+type. This interface will be used by userspace to get the antenna type
+and set the corresponding SAR value, as is required for FCC certification.
+
+The available commands are::
+
+ cat /sys/devices/platform/thinkpad_acpi/wwan_antenna_type
+
+Currently 2 antenna types are supported as mentioned below:
+
+- type a
+- type b
+
+The property is read-only. If the platform doesn't have support, the sysfs
+attribute is not created.
+
+Auxmac
+------
+
+sysfs: auxmac
+
+Some newer Thinkpads have a feature called MAC Address Pass-through. This
+feature is implemented by the system firmware to provide a system unique MAC,
+that can override a dock or USB ethernet dongle MAC, when connected to a
+network. This property enables user-space to easily determine the MAC address
+if the feature is enabled.
+
+The auxiliary MAC address can be read with::
+
+ cat /sys/devices/platform/thinkpad_acpi/auxmac
+
+If the feature is disabled, the value will be 'disabled'.
+
+This property is read-only.
+
Adaptive keyboard
-----------------
@@ -1460,15 +1545,32 @@ This sysfs attribute controls the keyboard "face" that will be shown on the
Lenovo X1 Carbon 2nd gen (2014)'s adaptive keyboard. The value can be read
and set.
-- 1 = Home mode
-- 2 = Web-browser mode
-- 3 = Web-conference mode
-- 4 = Function mode
-- 5 = Layflat mode
+- 0 = Home mode
+- 1 = Web-browser mode
+- 2 = Web-conference mode
+- 3 = Function mode
+- 4 = Layflat mode
For more details about which buttons will appear depending on the mode, please
review the laptop's user guide:
-http://www.lenovo.com/shop/americas/content/user_guides/x1carbon_2_ug_en.pdf
+https://download.lenovo.com/ibmdl/pub/pc/pccbbs/mobiles_pdf/x1carbon_2_ug_en.pdf
+
+Battery charge control
+----------------------
+
+sysfs attributes:
+/sys/class/power_supply/BAT*/charge_control_{start,end}_threshold
+
+These two attributes are created for those batteries that are supported by the
+driver. They enable the user to control the battery charge thresholds of the
+given battery. Both values may be read and set. `charge_control_start_threshold`
+accepts an integer between 0 and 99 (inclusive); this value represents a battery
+percentage level, below which charging will begin. `charge_control_end_threshold`
+accepts an integer between 1 and 100 (inclusive); this value represents a battery
+percentage level, above which charging will stop.
+
+The exact semantics of the attributes may be found in
+Documentation/ABI/testing/sysfs-class-power.
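+
+For example, to start charging only below 60% and stop charging at 80% (a
+sketch; the battery name BAT0 may differ between models)::
+
+  echo 60 > /sys/class/power_supply/BAT0/charge_control_start_threshold
+  echo 80 > /sys/class/power_supply/BAT0/charge_control_end_threshold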
Multiple Commands, Module Parameters
------------------------------------
diff --git a/Documentation/admin-guide/lockup-watchdogs.rst b/Documentation/admin-guide/lockup-watchdogs.rst
index 290840c160af..3e09284a8b9b 100644
--- a/Documentation/admin-guide/lockup-watchdogs.rst
+++ b/Documentation/admin-guide/lockup-watchdogs.rst
@@ -39,7 +39,7 @@ in principle, they should work in any architecture where these
subsystems are present.
A periodic hrtimer runs to generate interrupts and kick the watchdog
-task. An NMI perf event is generated every "watchdog_thresh"
+job. An NMI perf event is generated every "watchdog_thresh"
(compile-time initialized to 10 and configurable through sysctl of the
same name) seconds to check for hardlockups. If any CPU in the system
does not receive any hrtimer interrupt during that time the
@@ -47,7 +47,7 @@ does not receive any hrtimer interrupt during that time the
generate a kernel warning or call panic, depending on the
configuration.
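
The "watchdog_thresh" value can be inspected and adjusted at runtime, for
example (a sketch)::

    sysctl kernel.watchdog_thresh
    sysctl -w kernel.watchdog_thresh=20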
-The watchdog task is a high priority kernel thread that updates a
+The watchdog job runs in a stop scheduling thread that updates a
timestamp every time it is scheduled. If that timestamp is not updated
for 2*watchdog_thresh seconds (the softlockup threshold) the
'softlockup detector' (coded inside the hrtimer callback function)
diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index d973d469ffc4..4ff2cc291d18 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -221,7 +221,7 @@ All md devices contain:
layout
The ``layout`` for the array for the particular level. This is
- simply a number that is interpretted differently by different
+ simply a number that is interpreted differently by different
levels. It can be written while assembling an array.
array_size
@@ -317,7 +317,7 @@ All md devices contain:
suspended (not supported yet)
All IO requests will block. The array can be reconfigured.
- Writing this, if accepted, will block until array is quiessent
+ Writing this, if accepted, will block until array is quiescent
readonly
no resync can happen. no superblocks get written.
@@ -426,6 +426,10 @@ All md devices contain:
The accepted values when writing to this file are ``ppl`` and ``resync``,
used to enable and disable PPL.
+ uuid
+ This indicates the UUID of the array in the following format:
+ xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
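+
+ For example, assuming the array is md0, the UUID can be
+ read with (a sketch)::
+
+ cat /sys/block/md0/md/uuid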
+
As component devices are added to an md array, they appear in the ``md``
directory as new directories named::
diff --git a/Documentation/admin-guide/media/bt8xx.rst b/Documentation/admin-guide/media/bt8xx.rst
index 1382ada1e38e..3589f6ab7e46 100644
--- a/Documentation/admin-guide/media/bt8xx.rst
+++ b/Documentation/admin-guide/media/bt8xx.rst
@@ -15,11 +15,12 @@ Authors:
General information
-------------------
-This class of cards has a bt878a as the PCI interface, and require the bttv driver
-for accessing the i2c bus and the gpio pins of the bt8xx chipset.
+This class of cards has a bt878a as the PCI interface, and requires the bttv
+driver for accessing the i2c bus and the gpio pins of the bt8xx chipset.
-Please see :doc:`bttv-cardlist` for a complete list of Cards based on the
-Conexant Bt8xx PCI bridge supported by the Linux Kernel.
+Please see Documentation/admin-guide/media/bttv-cardlist.rst for a complete
+list of Cards based on the Conexant Bt8xx PCI bridge supported by the
+Linux Kernel.
In order to be able to compile the kernel, some config options should be
enabled::
@@ -80,7 +81,7 @@ for dvb-bt8xx drivers by passing modprobe parameters may be necessary.
Running TwinHan and Clones
~~~~~~~~~~~~~~~~~~~~~~~~~~
-As shown at :doc:`bttv-cardlist`, TwinHan and
+As shown at Documentation/admin-guide/media/bttv-cardlist.rst, TwinHan and
clones use ``card=113`` modprobe parameter. So, in order to properly
detect it for devices without EEPROM, you should use::
@@ -105,12 +106,12 @@ The autodetected values are determined by the cards' "response string".
In your logs see f. ex.: dst_get_device_id: Recognize [DSTMCI].
For bug reports please send in a complete log with verbose=4 activated.
-Please also see :doc:`ci`.
+Please also see Documentation/admin-guide/media/ci.rst.
Running multiple cards
~~~~~~~~~~~~~~~~~~~~~~
-See :doc:`bttv-cardlist` for a complete list of
+See Documentation/admin-guide/media/bttv-cardlist.rst for a complete list of
Card ID. Some examples:
=========================== ===
diff --git a/Documentation/admin-guide/media/bttv.rst b/Documentation/admin-guide/media/bttv.rst
index 49382377b1dc..58cbaf6df694 100644
--- a/Documentation/admin-guide/media/bttv.rst
+++ b/Documentation/admin-guide/media/bttv.rst
@@ -24,7 +24,8 @@ If your board has digital TV, you'll also need::
./scripts/config -m DVB_BT8XX
-In this case, please see :doc:`bt8xx` for additional notes.
+In this case, please see Documentation/admin-guide/media/bt8xx.rst
+for additional notes.
Make bttv work with your card
-----------------------------
@@ -39,7 +40,7 @@ If it doesn't bttv likely could not autodetect your card and needs some
insmod options. The most important insmod option for bttv is "card=n"
to select the correct card type. If you get video but no sound you've
very likely specified the wrong (or no) card type. A list of supported
-cards is in :doc:`bttv-cardlist`.
+cards is in Documentation/admin-guide/media/bttv-cardlist.rst.
If bttv takes very long to load (happens sometimes with the cheap
cards which have no tuner), try adding this to your modules configuration
@@ -57,8 +58,8 @@ directory should be enough for it to be autoload during the driver's
probing mode (e. g. when the Kernel boots or when the driver is
manually loaded via ``modprobe`` command).
-If your card isn't listed in :doc:`bttv-cardlist` or if you have
-trouble making audio work, please read :ref:`still_doesnt_work`.
+If your card isn't listed in Documentation/admin-guide/media/bttv-cardlist.rst
+or if you have trouble making audio work, please read :ref:`still_doesnt_work`.
Autodetecting cards
@@ -77,8 +78,8 @@ the Subsystem ID in the second line, looks like this:
only bt878-based cards can have a subsystem ID (which does not mean
that every card really has one). bt848 cards can't have a Subsystem
ID and therefore can't be autodetected. There is a list with the ID's
-at :doc:`bttv-cardlist` (in case you are intrested or want to mail
-patches with updates).
+at Documentation/admin-guide/media/bttv-cardlist.rst
+(in case you are interested or want to mail patches with updates).
.. _still_doesnt_work:
@@ -259,15 +260,15 @@ bug. It is very helpful if you can tell where exactly it broke
With a hard freeze you probably won't find anything in the logfiles.
The only way to capture any kernel messages is to hook up a serial
console and let some terminal application log the messages. /me uses
-screen. See :doc:`/admin-guide/serial-console` for details on setting
-up a serial console.
+screen. See Documentation/admin-guide/serial-console.rst for details on
+setting up a serial console.
-Read :doc:`/admin-guide/bug-hunting` to learn how to get any useful
+Read Documentation/admin-guide/bug-hunting.rst to learn how to get any useful
information out of a register+stack dump printed by the kernel on
protection faults (so-called "kernel oops").
If you run into some kind of deadlock, you can try to dump a call trace
-for each process using sysrq-t (see :doc:`/admin-guide/sysrq`).
+for each process using sysrq-t (see Documentation/admin-guide/sysrq.rst).
This way it is possible to figure where *exactly* some process in "D"
state is stuck.
@@ -908,7 +909,7 @@ DE hat diverse Treiber fuer diese Modelle (Stand 09/2002):
- TVPhone98 (Bt878)
- AVerTV und TVCapture98 w/VCR (Bt 878)
- AVerTVStudio und TVPhone98 w/VCR (Bt878)
- - AVerTV GO Serie (Kein SVideo Input)
+ - AVerTV GO Series (Kein SVideo Input)
- AVerTV98 (BT-878 chip)
- AVerTV98 mit Fernbedienung (BT-878 chip)
- AVerTV/FM98 (BT-878 chip)
diff --git a/Documentation/admin-guide/media/building.rst b/Documentation/admin-guide/media/building.rst
index c898e3a981c1..a06473429916 100644
--- a/Documentation/admin-guide/media/building.rst
+++ b/Documentation/admin-guide/media/building.rst
@@ -90,7 +90,7 @@ built as modules.
Those GPU-specific drivers are selected via the ``Graphics support``
menu, under ``Device Drivers``.
- When a GPU driver supports supports HDMI CEC, it will automatically
+ When a GPU driver supports HDMI CEC, it will automatically
enable the CEC core support at the media subsystem.
Media dependencies
@@ -137,7 +137,7 @@ The ``LIRC user interface`` option adds enhanced functionality when using the
from remote controllers.
The ``Support for eBPF programs attached to lirc devices`` option allows
-the usage of special programs (called eBPF) that would allow aplications
+the usage of special programs (called eBPF) that would allow applications
to add extra remote controller decoding functionality to the Linux Kernel.
The ``Remote controller decoders`` option allows selecting the
@@ -244,7 +244,7 @@ functionality.
If you have an hybrid card, you may need to enable both ``Analog TV``
and ``Digital TV`` at the menu.
-When using this option, the defaults for the the media support core
+When using this option, the defaults for the media support core
functionality are usually good enough to provide the basic functionality
for the driver. Yet, you could manually enable some desired extra (optional)
functionality using the settings under each of the following
diff --git a/Documentation/admin-guide/media/cec-drivers.rst b/Documentation/admin-guide/media/cec-drivers.rst
deleted file mode 100644
index 8d9686c08df9..000000000000
--- a/Documentation/admin-guide/media/cec-drivers.rst
+++ /dev/null
@@ -1,10 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=================================
-CEC driver-specific documentation
-=================================
-
-.. toctree::
- :maxdepth: 2
-
- pulse8-cec
diff --git a/Documentation/admin-guide/media/cec.rst b/Documentation/admin-guide/media/cec.rst
new file mode 100644
index 000000000000..6b30e355cf23
--- /dev/null
+++ b/Documentation/admin-guide/media/cec.rst
@@ -0,0 +1,380 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+HDMI CEC
+========
+
+Supported hardware in mainline
+==============================
+
+HDMI Transmitters:
+
+- Exynos4
+- Exynos5
+- STIH4xx HDMI CEC
+- V4L2 adv7511 (same HW, but a different driver from the drm adv7511)
+- stm32
+- Allwinner A10 (sun4i)
+- Raspberry Pi
+- dw-hdmi (Synopsis IP)
+- amlogic (meson ao-cec and ao-cec-g12a)
+- drm adv7511/adv7533
+- omap4
+- tegra
+- rk3288, rk3399
+- tda998x
+- DisplayPort CEC-Tunneling-over-AUX on i915, nouveau and amdgpu
+- ChromeOS EC CEC
+- CEC for SECO boards (UDOO x86).
+- Chrontel CH7322
+
+
+HDMI Receivers:
+
+- adv7604/11/12
+- adv7842
+- tc358743
+
+USB Dongles (see below for additional information on how to use these
+dongles):
+
+- Pulse-Eight: the pulse8-cec driver implements the following module option:
+ ``persistent_config``: by default this is off, but when set to 1 the driver
+ will store the current settings to the device's internal eeprom and restore
+ it the next time the device is connected to the USB port.
+- RainShadow Tech. Note: this driver does not support the persistent_config
+ module option of the Pulse-Eight driver. The hardware supports it, but I
+ have no plans to add this feature. But I accept patches :-)
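+
+For example, the Pulse-Eight option mentioned above can be set when loading
+the module (a sketch, assuming the driver is built as a module)::
+
+ modprobe pulse8-cec persistent_config=1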
+
+Miscellaneous:
+
+- vivid: emulates a CEC receiver and CEC transmitter.
+ Can be used to test CEC applications without actual CEC hardware.
+
+- cec-gpio. If the CEC pin is hooked up to a GPIO pin then
+ you can control the CEC line through this driver. This supports error
+ injection as well.
+
+- cec-gpio and Allwinner A10 (or any other driver that uses the CEC pin
+ framework to drive the CEC pin directly): the CEC pin framework uses
+ high-resolution timers. These timers are affected by NTP daemons that
+ speed up or slow down the clock to sync with the official time. The
+ chronyd server will by default increase or decrease the clock by
+ 1/12th. This will cause the CEC timings to go out of spec. To fix this,
+ add a 'maxslewrate 40000' line to chronyd.conf. This limits the clock
+ frequency change to 1/25th, which keeps the CEC timings within spec.
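+
+ As a sketch, the corresponding chrony configuration fragment is simply
+ (the config file path varies by distribution)::
+
+ maxslewrate 40000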
+
+
+Utilities
+=========
+
+Utilities are available here: https://git.linuxtv.org/v4l-utils.git
+
+``utils/cec-ctl``: control a CEC device
+
+``utils/cec-compliance``: test compliance of a remote CEC device
+
+``utils/cec-follower``: emulate a CEC follower device
+
+Note that ``cec-ctl`` has support for the CEC Hospitality Profile as is
+used in some hotel displays. See http://www.htng.org.
+
+Note that the libcec library (https://github.com/Pulse-Eight/libcec) supports
+the linux CEC framework.
+
+If you want to get the CEC specification, then look at the References of
+the HDMI wikipedia page: https://en.wikipedia.org/wiki/HDMI. CEC is part
+of the HDMI specification. HDMI 1.3 is freely available (very similar to
+HDMI 1.4 w.r.t. CEC) and should be good enough for most things.
+
+
+DisplayPort to HDMI Adapters with working CEC
+=============================================
+
+Background: most adapters do not support the CEC Tunneling feature,
+and of those that do many did not actually connect the CEC pin.
+Unfortunately, this means that while a CEC device is created, it
+is actually all alone in the world and will never be able to see other
+CEC devices.
+
+This is a list of known working adapters that have CEC Tunneling AND
+that properly connected the CEC pin. If you find adapters that work
+but are not in this list, then drop me a note.
+
+To test: hook up your DP-to-HDMI adapter to a CEC capable device
+(typically a TV), then run::
+
+ cec-ctl --playback # Configure the PC as a CEC Playback device
+ cec-ctl -S # Show the CEC topology
+
+The ``cec-ctl -S`` command should show at least two CEC devices,
+ourselves and the CEC device you are connected to (i.e. typically the TV).
+
+General note: I have only seen this work with the Parade PS175, PS176 and
+PS186 chipsets and the MegaChips 2900. While MegaChips 28x0 claims CEC support,
+I have never seen it work.
+
+USB-C to HDMI
+-------------
+
+Samsung Multiport Adapter EE-PW700: https://www.samsung.com/ie/support/model/EE-PW700BBEGWW/
+
+Kramer ADC-U31C/HF: https://www.kramerav.com/product/ADC-U31C/HF
+
+Club3D CAC-2504: https://www.club-3d.com/en/detail/2449/usb_3.1_type_c_to_hdmi_2.0_uhd_4k_60hz_active_adapter/
+
+DisplayPort to HDMI
+-------------------
+
+Club3D CAC-1080: https://www.club-3d.com/en/detail/2442/displayport_1.4_to_hdmi_2.0b_hdr/
+
+CableCreation (SKU: CD0712): https://www.cablecreation.com/products/active-displayport-to-hdmi-adapter-4k-hdr
+
+HP DisplayPort to HDMI True 4k Adapter (P/N 2JA63AA): https://www.hp.com/us-en/shop/pdp/hp-displayport-to-hdmi-true-4k-adapter
+
+Mini-DisplayPort to HDMI
+------------------------
+
+Club3D CAC-1180: https://www.club-3d.com/en/detail/2443/mini_displayport_1.4_to_hdmi_2.0b_hdr/
+
+Note that passive adapters will never work, you need an active adapter.
+
+The Club3D adapters in this list are all MegaChips 2900 based. Other Club3D adapters
+are PS176 based and do NOT have the CEC pin hooked up, so only the three Club3D
+adapters above are known to work.
+
+I suspect that MegaChips 2900 based designs in general are likely to work
+whereas with the PS176 it is more hit-and-miss (mostly miss). The PS186 is
+likely to have the CEC pin hooked up, it looks like they changed the reference
+design for that chipset.
+
+
+USB CEC Dongles
+===============
+
+These dongles appear as ``/dev/ttyACMX`` devices and need the ``inputattach``
+utility to create the ``/dev/cecX`` devices. Support for the Pulse-Eight
+has been added to ``inputattach`` 1.6.0. Support for the Rainshadow Tech has
+been added to ``inputattach`` 1.6.1.
+
+You also need udev rules to automatically start systemd services::
+
+ SUBSYSTEM=="tty", KERNEL=="ttyACM[0-9]*", ATTRS{idVendor}=="2548", ATTRS{idProduct}=="1002", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}+="pulse8-cec-inputattach@%k.service"
+ SUBSYSTEM=="tty", KERNEL=="ttyACM[0-9]*", ATTRS{idVendor}=="2548", ATTRS{idProduct}=="1001", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}+="pulse8-cec-inputattach@%k.service"
+ SUBSYSTEM=="tty", KERNEL=="ttyACM[0-9]*", ATTRS{idVendor}=="04d8", ATTRS{idProduct}=="ff59", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}+="rainshadow-cec-inputattach@%k.service"
+
+and these systemd services:
+
+For Pulse-Eight make /lib/systemd/system/pulse8-cec-inputattach@.service::
+
+ [Unit]
+ Description=inputattach for pulse8-cec device on %I
+
+ [Service]
+ Type=simple
+ ExecStart=/usr/bin/inputattach --pulse8-cec /dev/%I
+
+For the RainShadow Tech make /lib/systemd/system/rainshadow-cec-inputattach@.service::
+
+ [Unit]
+ Description=inputattach for rainshadow-cec device on %I
+
+ [Service]
+ Type=simple
+ ExecStart=/usr/bin/inputattach --rainshadow-cec /dev/%I
+
+
+For proper suspend/resume support create: /lib/systemd/system/restart-cec-inputattach.service::
+
+ [Unit]
+ Description=restart inputattach for cec devices
+ After=suspend.target
+
+ [Service]
+ Type=forking
+ ExecStart=/bin/bash -c 'for d in /dev/serial/by-id/usb-Pulse-Eight*; do /usr/bin/inputattach --daemon --pulse8-cec $d; done; for d in /dev/serial/by-id/usb-RainShadow_Tech*; do /usr/bin/inputattach --daemon --rainshadow-cec $d; done'
+
+ [Install]
+ WantedBy=suspend.target
+
+And run ``systemctl enable restart-cec-inputattach``.
+
+To automatically set the physical address of the CEC device whenever the
+EDID changes, you can use ``cec-ctl`` with the ``-E`` option::
+
+ cec-ctl -E /sys/class/drm/card0-DP-1/edid
+
+This assumes the dongle is connected to the card0-DP-1 output (``xrandr`` will tell
+you which output is used) and it will poll for changes to the EDID and update
+the Physical Address whenever they occur.
+
+To automatically run this command you can use cron. Edit crontab with
+``crontab -e`` and add this line::
+
+ @reboot /usr/local/bin/cec-ctl -E /sys/class/drm/card0-DP-1/edid
+
+This only works for display drivers that expose the EDID in ``/sys/class/drm``,
+such as the i915 driver.
+
+
+CEC Without HPD
+===============
+
+Some displays when in standby mode have no HDMI Hotplug Detect signal, but
+CEC is still enabled so connected devices can send an <Image View On> CEC
+message in order to wake up such displays. Unfortunately, not all CEC
+adapters can support this. An example is the Odroid-U3 SBC that has a
+level-shifter that is powered off when the HPD signal is low, thus
+blocking the CEC pin. Even though the SoC can use CEC without a HPD,
+the level-shifter will prevent this from functioning.
+
+There is a CEC capability flag to signal this: ``CEC_CAP_NEEDS_HPD``.
+If set, then the hardware cannot wake up displays with this behavior.
+
+Note for CEC application implementers: the <Image View On> message must
+be the first message you send, don't send any other messages before.
+Certain very bad but unfortunately not uncommon CEC implementations
+get very confused if they receive anything else but this message and
+they won't wake up.
+
+When writing a driver it can be tricky to test this. There are two
+ways to do this:
+
+1) Get a Pulse-Eight USB CEC dongle, connect an HDMI cable from your
+ device to the Pulse-Eight, but do not connect the Pulse-Eight to
+ the display.
+
+ Now configure the Pulse-Eight dongle::
+
+ cec-ctl -p0.0.0.0 --tv
+
+ and start monitoring::
+
+ sudo cec-ctl -M
+
+ On the device you are testing run::
+
+ cec-ctl --playback
+
+ It should report a physical address of f.f.f.f. Now run this
+ command::
+
+ cec-ctl -t0 --image-view-on
+
+ The Pulse-Eight should see the <Image View On> message. If not,
+ then something (hardware and/or software) is preventing the CEC
+ message from going out.
+
+ To make sure you have the wiring correct just connect the
+ Pulse-Eight to a CEC-enabled display and run the same command
+ on your device: now there is a HPD, so you should see the command
+ arriving at the Pulse-Eight.
+
+2) If you have another linux device supporting CEC without HPD, then
+ you can just connect your device to that device. Yes, you can connect
+ two HDMI outputs together. You won't have a HPD (which is what we
+ want for this test), but the second device can monitor the CEC pin.
+
+ Otherwise use the same commands as in 1.
+
+If CEC messages do not come through when there is no HPD, then you
+need to figure out why. Typically it is either a hardware restriction
+or the software powers off the CEC core when the HPD goes low. The
+first cannot be corrected of course, the second will likely require
+driver changes.
+
+
+Microcontrollers & CEC
+======================
+
+We have seen some CEC implementations in displays that use a microcontroller
+to sample the bus. This does not have to be a problem, but some implementations
+have timing issues. This is hard to discover unless you can hook up a low-level
+CEC debugger (see the next section).
+
+You will see cases where the CEC transmitter holds the CEC line high or low for
+a longer time than is allowed. For directed messages this is not a problem since
+if that happens the message will not be Acked and it will be retransmitted.
+For broadcast messages no such mechanism exists.
+
+It's not clear what to do about this. It is probably wise to transmit some
+broadcast messages twice to reduce the chance of them being lost. Specifically
+<Standby> and <Active Source> are candidates for that.
+
+
+Making a CEC debugger
+=====================
+
+By using a Raspberry Pi 4B and some cheap components you can make
+your own low-level CEC debugger.
+
+The critical component is one of these HDMI female-female passthrough connectors
+(full soldering type 1):
+
+https://elabbay.myshopify.com/collections/camera/products/hdmi-af-af-v1a-hdmi-type-a-female-to-hdmi-type-a-female-pass-through-adapter-breakout-board?variant=45533926147
+
+The video quality is variable and certainly not enough to pass-through 4kp60
+(594 MHz) video. You might be able to support 4kp30, but more likely you will
+be limited to 1080p60 (148.5 MHz). But for CEC testing that is fine.
+
+You need a breadboard and some breadboard wires:
+
+http://www.dx.com/p/diy-40p-male-to-female-male-to-male-female-to-female-dupont-line-wire-3pcs-356089#.WYLOOXWGN7I
+
+If you want to monitor the HPD and/or 5V lines as well, then you need one of
+these 5V to 3.3V level shifters:
+
+https://www.adafruit.com/product/757
+
+(This is just where I got these components, there are many other places you
+can get similar things).
+
+The ground pin of the HDMI connector needs to be connected to a ground
+pin of the Raspberry Pi, of course.
+
+The CEC pin of the HDMI connector needs to be connected to these pins:
+GPIO 6 and GPIO 7. The optional HPD pin of the HDMI connector should
+be connected via the level shifter to these pins: GPIO 23 and GPIO 12.
+The optional 5V pin of the HDMI connector should be connected via the
+level shifter to these pins: GPIO 25 and GPIO 22. Monitoring the HPD and
+5V lines is not necessary, but it is helpful.
+
+This device tree addition in ``arch/arm/boot/dts/bcm2711-rpi-4-b.dts``
+will hook up the cec-gpio driver correctly::
+
+ cec@6 {
+ compatible = "cec-gpio";
+ cec-gpios = <&gpio 6 (GPIO_ACTIVE_HIGH|GPIO_OPEN_DRAIN)>;
+ hpd-gpios = <&gpio 23 GPIO_ACTIVE_HIGH>;
+ v5-gpios = <&gpio 25 GPIO_ACTIVE_HIGH>;
+ };
+
+ cec@7 {
+ compatible = "cec-gpio";
+ cec-gpios = <&gpio 7 (GPIO_ACTIVE_HIGH|GPIO_OPEN_DRAIN)>;
+ hpd-gpios = <&gpio 12 GPIO_ACTIVE_HIGH>;
+ v5-gpios = <&gpio 22 GPIO_ACTIVE_HIGH>;
+ };
+
+If you haven't hooked up the HPD and/or 5V lines, then just delete those
+lines.
+
+This dts change will enable two cec GPIO devices: I typically use one to
+send/receive CEC commands and the other to monitor. If you monitor using
+an unconfigured CEC adapter then it will use GPIO interrupts which makes
+monitoring very accurate.
+
+If you just want to monitor traffic, then a single instance is sufficient.
+The minimum configuration is one HDMI female-female passthrough connector
+and two female-female breadboard wires: one for connecting the HDMI ground
+pin to a ground pin on the Raspberry Pi, and the other to connect the HDMI
+CEC pin to GPIO 6 on the Raspberry Pi.
+
+The documentation on how to use the error injection is here: :ref:`cec_pin_error_inj`.
+
+``cec-ctl --monitor-pin`` will do low-level CEC bus sniffing and analysis.
+You can also store the CEC traffic to file using ``--store-pin`` and analyze
+it later using ``--analyze-pin``.
+
+You can also use this as a full-fledged CEC device by configuring it
+using ``cec-ctl --tv -p0.0.0.0`` or ``cec-ctl --playback -p1.0.0.0``.
diff --git a/Documentation/admin-guide/media/cpia2.rst b/Documentation/admin-guide/media/cpia2.rst
deleted file mode 100644
index f6ffef686462..000000000000
--- a/Documentation/admin-guide/media/cpia2.rst
+++ /dev/null
@@ -1,145 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-The cpia2 driver
-================
-
-Authors: Peter Pregler <Peter_Pregler@email.com>,
-Scott J. Bertin <scottbertin@yahoo.com>, and
-Jarl Totland <Jarl.Totland@bdc.no> for the original cpia driver, which
-this one was modelled from.
-
-Introduction
-------------
-
-This is a driver for STMicroelectronics's CPiA2 (second generation
-Colour Processor Interface ASIC) based cameras. This camera outputs an MJPEG
-stream at up to vga size. It implements the Video4Linux interface as much as
-possible. Since the V4L interface does not support compressed formats, only
-an mjpeg enabled application can be used with the camera. We have modified the
-gqcam application to view this stream.
-
-The driver is implemented as two kernel modules. The cpia2 module
-contains the camera functions and the V4L interface. The cpia2_usb module
-contains usb specific functions. The main reason for this was the size of the
-module was getting out of hand, so I separated them. It is not likely that
-there will be a parallel port version.
-
-Features
---------
-
-- Supports cameras with the Vision stv6410 (CIF) and stv6500 (VGA) cmos
- sensors. I only have the vga sensor, so can't test the other.
-- Image formats: VGA, QVGA, CIF, QCIF, and a number of sizes in between.
- VGA and QVGA are the native image sizes for the VGA camera. CIF is done
- in the coprocessor by scaling QVGA. All other sizes are done by clipping.
-- Palette: YCrCb, compressed with MJPEG.
-- Some compression parameters are settable.
-- Sensor framerate is adjustable (up to 30 fps CIF, 15 fps VGA).
-- Adjust brightness, color, contrast while streaming.
-- Flicker control settable for 50 or 60 Hz mains frequency.
-
-Making and installing the stv672 driver modules
------------------------------------------------
-
-Requirements
-~~~~~~~~~~~~
-
-Video4Linux must be either compiled into the kernel or
-available as a module. Video4Linux2 is automatically detected and made
-available at compile time.
-
-Setup
-~~~~~
-
-Use ``modprobe cpia2`` to load and ``modprobe -r cpia2`` to unload. This
-may be done automatically by your distribution.
-
-Driver options
-~~~~~~~~~~~~~~
-
-.. tabularcolumns:: |p{13ex}|L|
-
-
-============== ========================================================
-Option Description
-============== ========================================================
-video_nr video device to register (0=/dev/video0, etc)
- range -1 to 64. default is -1 (first available)
- If you have more than 1 camera, this MUST be -1.
-buffer_size Size for each frame buffer in bytes (default 68k)
-num_buffers Number of frame buffers (1-32, default 3)
-alternate USB Alternate (2-7, default 7)
-flicker_freq Frequency for flicker reduction(50 or 60, default 60)
-flicker_mode 0 to disable, or 1 to enable flicker reduction.
- (default 0). This is only effective if the camera
- uses a stv0672 coprocessor.
-============== ========================================================
-
-Setting the options
-~~~~~~~~~~~~~~~~~~~
-
-If you are using modules, edit /etc/modules.conf and add an options
-line like this::
-
- options cpia2 num_buffers=3 buffer_size=65535
-
-If the driver is compiled into the kernel, at boot time specify them
-like this::
-
- cpia2.num_buffers=3 cpia2.buffer_size=65535
-
-What buffer size should I use?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The maximum image size depends on the alternate you choose, and the
-frame rate achieved by the camera. If the compression engine is able to
-keep up with the frame rate, the maximum image size is given by the table
-below.
-
-The compression engine starts out at maximum compression, and will
-increase image quality until it is close to the size in the table. As long
-as the compression engine can keep up with the frame rate, after a short time
-the images will all be about the size in the table, regardless of resolution.
-
-At low alternate settings, the compression engine may not be able to
-compress the image enough and will reduce the frame rate by producing larger
-images.
-
-The default of 68k should be good for most users. This will handle
-any alternate at frame rates down to 15fps. For lower frame rates, it may
-be necessary to increase the buffer size to avoid having frames dropped due
-to insufficient space.
-
-========== ========== ======== =====
-Alternate bytes/ms 15fps 30fps
-========== ========== ======== =====
- 2 128 8533 4267
- 3 384 25600 12800
- 4 640 42667 21333
- 5 768 51200 25600
- 6 896 59733 29867
- 7 1023 68200 34100
-========== ========== ======== =====
-
-Table: Image size(bytes)
-
-
-How many buffers should I use?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-For normal streaming, 3 should give the best results. With only 2,
-it is possible for the camera to finish sending one image just after a
-program has started reading the other. If this happens, the driver must drop
-a frame. The exception to this is if you have a heavily loaded machine. In
-this case use 2 buffers. You are probably not reading at the full frame rate.
-If the camera can send multiple images before a read finishes, it could
-overwrite the third buffer before the read finishes, leading to a corrupt
-image. Single and double buffering have extra checks to avoid overwriting.
-
-Using the camera
-~~~~~~~~~~~~~~~~
-
-We are providing a modified gqcam application to view the output. In
-order to avoid confusion, here it is called mview. There is also the qx5view
-program which can also control the lights on the qx5 microscope. MJPEG Tools
-(http://mjpeg.sourceforge.net) can also be used to record from the camera.
diff --git a/Documentation/admin-guide/media/davinci-vpbe.rst b/Documentation/admin-guide/media/davinci-vpbe.rst
deleted file mode 100644
index 9e6360fd02db..000000000000
--- a/Documentation/admin-guide/media/davinci-vpbe.rst
+++ /dev/null
@@ -1,65 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-The VPBE V4L2 driver design
-===========================
-
-Functional partitioning
------------------------
-
-Consists of the following:
-
- 1. V4L2 display driver
-
- Implements creation of video2 and video3 device nodes and
- provides v4l2 device interface to manage VID0 and VID1 layers.
-
- 2. Display controller
-
- Loads up VENC, OSD and external encoders such as ths8200. It provides
- a set of API calls to V4L2 drivers to set the output/standards
- in the VENC or external sub devices. It also provides
- a device object to access the services from OSD subdevice
- using sub device ops. The connection of external encoders to VENC LCD
- controller port is done at init time based on default output and standard
- selection or at run time when application change the output through
- V4L2 IOCTLs.
-
- When connected to an external encoder, vpbe controller is also responsible
- for setting up the interface between VENC and external encoders based on
- board specific settings (specified in board-xxx-evm.c). This allows
- interfacing external encoders such as ths8200. The setup_if_config()
- is implemented for this as well as configure_venc() (part of the next patch)
- API to set timings in VENC for a specific display resolution. As of this
- patch series, the interconnection and enabling and setting of the external
- encoders is not present, and would be a part of the next patch series.
-
- 3. VENC subdevice module
-
- Responsible for setting outputs provided through internal DACs and also
- setting timings at LCD controller port when external encoders are connected
- at the port or LCD panel timings required. When external encoder/LCD panel
- is connected, the timings for a specific standard/preset is retrieved from
- the board specific table and the values are used to set the timings in
- venc using non-standard timing mode.
-
- Support LCD Panel displays using the VENC. For example to support a Logic
- PD display, it requires setting up the LCD controller port with a set of
- timings for the resolution supported and setting the dot clock. So we could
- add the available outputs as a board specific entry (i.e add the "LogicPD"
- output name to board-xxx-evm.c). A table of timings for various LCDs
- supported can be maintained in the board specific setup file to support
- various LCD displays.As of this patch a basic driver is present, and this
- support for external encoders and displays forms a part of the next
- patch series.
-
- 4. OSD module
-
- OSD module implements all OSD layer management and hardware specific
- features. The VPBE module interacts with the OSD for enabling and
- disabling appropriate features of the OSD.
-
-Current status
---------------
-
-A fully functional working version of the V4L2 driver is available. This
-driver has been tested with NTSC and PAL standards and buffer streaming.
diff --git a/Documentation/admin-guide/media/dvb-drivers.rst b/Documentation/admin-guide/media/dvb-drivers.rst
index 8df637c375f9..66fa4edd0606 100644
--- a/Documentation/admin-guide/media/dvb-drivers.rst
+++ b/Documentation/admin-guide/media/dvb-drivers.rst
@@ -13,4 +13,3 @@ Digital TV driver-specific documentation
opera-firmware
technisat
ttusb-dec
- zr364xx
diff --git a/Documentation/admin-guide/media/dvb-usb-dvbsky-cardlist.rst b/Documentation/admin-guide/media/dvb-usb-dvbsky-cardlist.rst
index 4fb4ce56df7c..9f7b619f35f7 100644
--- a/Documentation/admin-guide/media/dvb-usb-dvbsky-cardlist.rst
+++ b/Documentation/admin-guide/media/dvb-usb-dvbsky-cardlist.rst
@@ -20,13 +20,13 @@ dvb-usb-dvbsky cards list
- 0572:0320
* - DVBSky T680CI
- 0572:680c
- * - MyGica Mini DVB-T2 USB Stick T230
+ * - MyGica Mini DVB-(T/T2/C) USB Stick T230
- 0572:c688
- * - MyGica Mini DVB-T2 USB Stick T230C
+ * - MyGica Mini DVB-(T/T2/C) USB Stick T230C
- 0572:c689
- * - MyGica Mini DVB-T2 USB Stick T230C Lite
+ * - MyGica Mini DVB-(T/T2/C) USB Stick T230C Lite
- 0572:c699
- * - MyGica Mini DVB-T2 USB Stick T230C v2
+ * - MyGica Mini DVB-(T/T2/C) USB Stick T230C v2
- 0572:c68a
* - TechnoTrend TT-connect CT2-4650 CI
- 0b48:3012
diff --git a/Documentation/admin-guide/media/dvb-usb-dw2102-cardlist.rst b/Documentation/admin-guide/media/dvb-usb-dw2102-cardlist.rst
index f01f9df1e249..e39bc8e4bffe 100644
--- a/Documentation/admin-guide/media/dvb-usb-dw2102-cardlist.rst
+++ b/Documentation/admin-guide/media/dvb-usb-dw2102-cardlist.rst
@@ -40,6 +40,10 @@ dvb-usb-dw2102 cards list
- 0b48:3011
* - TerraTec Cinergy S USB
- 0ccd:0064
+ * - Terratec Cinergy S2 PCIe Dual Port 1
+ - 153b:1181
+ * - Terratec Cinergy S2 PCIe Dual Port 2
+ - 153b:1182
* - Terratec Cinergy S2 USB BOX
- 0ccd:0x0105
* - Terratec Cinergy S2 USB HD
diff --git a/Documentation/admin-guide/media/dvb_references.rst b/Documentation/admin-guide/media/dvb_references.rst
index 48445ac76275..4f0fd4259cfa 100644
--- a/Documentation/admin-guide/media/dvb_references.rst
+++ b/Documentation/admin-guide/media/dvb_references.rst
@@ -10,7 +10,7 @@ The DVB mailing list linux-dvb is hosted at vger. Please see
http://vger.kernel.org/vger-lists.html#linux-media for details.
There are also some other old lists hosted at:
-https://linuxtv.org/lists.php. If you're insterested on that for historic
+https://linuxtv.org/lists.php. If you're interested in those for historic
reasons, please check the archive at https://linuxtv.org/pipermail/linux-dvb/.
The media subsystem Wiki is hosted at https://linuxtv.org/wiki/.
diff --git a/Documentation/admin-guide/media/em28xx-cardlist.rst b/Documentation/admin-guide/media/em28xx-cardlist.rst
index a5f0e6d22a1a..ace65718ea22 100644
--- a/Documentation/admin-guide/media/em28xx-cardlist.rst
+++ b/Documentation/admin-guide/media/em28xx-cardlist.rst
@@ -434,3 +434,7 @@ EM28xx cards list
- PCTV DVB-S2 Stick (461e v2)
- em28178
- 2013:0461, 2013:0259
+ * - 105
+ - MyGica iGrabber
+ - em2860
+ - 1f4d:1abe
diff --git a/Documentation/admin-guide/media/fimc.rst b/Documentation/admin-guide/media/fimc.rst
index 0b8ddc4a3008..267ef52fe387 100644
--- a/Documentation/admin-guide/media/fimc.rst
+++ b/Documentation/admin-guide/media/fimc.rst
@@ -2,7 +2,7 @@
.. include:: <isonum.txt>
-The Samsung S5P/EXYNOS4 FIMC driver
+The Samsung S5P/Exynos4 FIMC driver
===================================
Copyright |copy| 2012 - 2013 Samsung Electronics Co., Ltd.
@@ -14,12 +14,12 @@ data from LCD controller (FIMD) through the SoC internal writeback data
path. There are multiple FIMC instances in the SoCs (up to 4), having
slightly different capabilities, like pixel alignment constraints, rotator
availability, LCD writeback support, etc. The driver is located at
-drivers/media/platform/exynos4-is directory.
+drivers/media/platform/samsung/exynos4-is directory.
Supported SoCs
--------------
-S5PC100 (mem-to-mem only), S5PV210, EXYNOS4210
+S5PC100 (mem-to-mem only), S5PV210, Exynos4210
Supported features
------------------
@@ -45,7 +45,7 @@ Media device interface
~~~~~~~~~~~~~~~~~~~~~~
The driver supports Media Controller API as defined at :ref:`media_controller`.
-The media device driver name is "SAMSUNG S5P FIMC".
+The media device driver name is "Samsung S5P FIMC".
The purpose of this interface is to allow changing assignment of FIMC instances
to the SoC peripheral camera input at runtime and optionally to control internal
diff --git a/Documentation/admin-guide/media/frontend-cardlist.rst b/Documentation/admin-guide/media/frontend-cardlist.rst
index 73a248c1b064..ba5b7c69a978 100644
--- a/Documentation/admin-guide/media/frontend-cardlist.rst
+++ b/Documentation/admin-guide/media/frontend-cardlist.rst
@@ -68,7 +68,7 @@ cx24116 Conexant CX24116 based
cx24117 Conexant CX24117 based
cx24120 Conexant CX24120 based
cx24123 Conexant CX24123 based
-ds3000 Montage Tehnology DS3000 based
+ds3000 Montage Technology DS3000 based
mb86a16 Fujitsu MB86A16 based
mt312 Zarlink VP310/MT312/ZL10313 based
s5h1420 Samsung S5H1420 based
@@ -83,7 +83,7 @@ tda10086 Philips TDA10086 based
tda8083 Philips TDA8083 based
tda8261 Philips TDA8261 based
tda826x Philips TDA826X silicon tuner
-ts2020 Montage Tehnology TS2020 based tuners
+ts2020 Montage Technology TS2020 based tuners
tua6100 Infineon TUA6100 PLL
cx24113 Conexant CX24113/CX24128 tuner for DVB-S/DSS
itd1000 Integrant ITD1000 Zero IF tuner for DVB-S/DSS
diff --git a/Documentation/admin-guide/media/gspca-cardlist.rst b/Documentation/admin-guide/media/gspca-cardlist.rst
index adda933616f1..e3404d1589da 100644
--- a/Documentation/admin-guide/media/gspca-cardlist.rst
+++ b/Documentation/admin-guide/media/gspca-cardlist.rst
@@ -305,7 +305,7 @@ pac7302 093a:2625 Genius iSlim 310
pac7302 093a:2626 Labtec 2200
pac7302 093a:2627 Genius FaceCam 300
pac7302 093a:2628 Genius iLook 300
-pac7302 093a:2629 Genious iSlim 300
+pac7302 093a:2629 Genius iSlim 300
pac7302 093a:262a Webcam 300k
pac7302 093a:262c Philips SPC 230 NC
jl2005bcd 0979:0227 Various brands, 19 known cameras supported
diff --git a/Documentation/admin-guide/media/i2c-cardlist.rst b/Documentation/admin-guide/media/i2c-cardlist.rst
index e60d459d18a9..1825a0bb47bd 100644
--- a/Documentation/admin-guide/media/i2c-cardlist.rst
+++ b/Documentation/admin-guide/media/i2c-cardlist.rst
@@ -58,27 +58,29 @@ Camera sensor devices
============ ==========================================================
Driver Name
============ ==========================================================
+ccs MIPI CCS compliant camera sensors (also SMIA++ and SMIA)
et8ek8 ET8EK8 camera sensor
hi556 Hynix Hi-556 sensor
+hi846 Hynix Hi-846 sensor
+imx208 Sony IMX208 sensor
imx214 Sony IMX214 sensor
imx219 Sony IMX219 sensor
imx258 Sony IMX258 sensor
imx274 Sony IMX274 sensor
imx290 Sony IMX290 sensor
imx319 Sony IMX319 sensor
+imx334 Sony IMX334 sensor
imx355 Sony IMX355 sensor
-m5mols Fujitsu M-5MOLS 8MP sensor
+imx412 Sony IMX412 sensor
mt9m001 mt9m001
-mt9m032 MT9M032 camera sensor
mt9m111 mt9m111, mt9m112 and mt9m131
mt9p031 Aptina MT9P031
-mt9t001 Aptina MT9T001
mt9t112 Aptina MT9T111/MT9T112
mt9v011 Micron mt9v011 sensor
mt9v032 Micron MT9V032 sensor
mt9v111 Aptina MT9V111 sensor
-noon010pc30 Siliconfile NOON010PC30 sensor
ov13858 OmniVision OV13858 sensor
+ov13b10 OmniVision OV13B10 sensor
ov2640 OmniVision OV2640 sensor
ov2659 OmniVision OV2659 sensor
ov2680 OmniVision OV2680 sensor
@@ -103,10 +105,6 @@ s5c73m3 Samsung S5C73M3 sensor
s5k4ecgx Samsung S5K4ECGX sensor
s5k5baf Samsung S5K5BAF sensor
s5k6a3 Samsung S5K6A3 sensor
-s5k6aa Samsung S5K6AAFX sensor
-smiapp SMIA++/SMIA sensor
-sr030pc30 Siliconfile SR030PC30 sensor
-vs6624 ST VS6624 sensor
============ ==========================================================
Flash devices
@@ -138,6 +136,7 @@ Driver Name
ad5820 AD5820 lens voice coil
ak7375 AK7375 lens voice coil
dw9714 DW9714 lens voice coil
+dw9768 DW9768 lens voice coil
dw9807-vcm DW9807 lens voice coil
============ ==========================================================
@@ -216,7 +215,6 @@ Video encoders
============ ==========================================================
Driver Name
============ ==========================================================
-ad9389b Analog Devices AD9389B encoder
adv7170 Analog Devices ADV7170 video encoder
adv7175 Analog Devices ADV7175 video encoder
adv7343 ADV7343 video encoder
@@ -278,7 +276,7 @@ tda9887 TDA 9885/6/7 analog IF demodulator
tea5761 TEA 5761 radio tuner
tea5767 TEA 5767 radio tuner
tua9001 Infineon TUA9001 silicon tuner
-tuner-xc2028 XCeive xc2028/xc3028 tuners
+xc2028 XCeive xc2028/xc3028 tuners
xc4000 Xceive XC4000 silicon tuner
xc5000 Xceive XC5000 silicon tuner
============ ==================================================
diff --git a/Documentation/admin-guide/media/imx7.rst b/Documentation/admin-guide/media/imx7.rst
index 1e442c97da47..2fa27718f52a 100644
--- a/Documentation/admin-guide/media/imx7.rst
+++ b/Documentation/admin-guide/media/imx7.rst
@@ -33,7 +33,7 @@ reference manual [#f1]_.
Entities
--------
-imx7-mipi-csi2
+imx-mipi-csi2
--------------
This is the MIPI CSI-2 receiver entity. It has one sink pad to receive the pixel
@@ -155,6 +155,66 @@ the resolutions supported by the sensor.
[fmt:SBGGR10_1X10/800x600@1/30 field:none colorspace:srgb]
-> "imx7-mipi-csis.0":0 [ENABLED]
+i.MX6ULL-EVK with OV5640
+------------------------
+
+On this platform a parallel OV5640 sensor is connected to the CSI port.
+The following example configures a video capture pipeline with an output
+of 640x480 and UYVY8_2X8 format:
+
+.. code-block:: none
+
+ # Setup links
+ media-ctl -l "'ov5640 1-003c':0 -> 'csi':0[1]"
+ media-ctl -l "'csi':1 -> 'csi capture':0[1]"
+
+ # Configure pads for pipeline
+ media-ctl -v -V "'ov5640 1-003c':0 [fmt:UYVY8_2X8/640x480 field:none]"
+
+After this, streaming can start:
+
+.. code-block:: none
+
+ gst-launch-1.0 -v v4l2src device=/dev/video1 ! video/x-raw,format=UYVY,width=640,height=480 ! v4l2convert ! fbdevsink
+
+.. code-block:: none
+
+ # media-ctl -p
+ Media controller API version 5.14.0
+
+ Media device information
+ ------------------------
+ driver imx7-csi
+ model imx-media
+ serial
+ bus info
+ hw revision 0x0
+ driver version 5.14.0
+
+ Device topology
+ - entity 1: csi (2 pads, 2 links)
+ type V4L2 subdev subtype Unknown flags 0
+ device node name /dev/v4l-subdev0
+ pad0: Sink
+ [fmt:UYVY8_2X8/640x480 field:none colorspace:srgb xfer:srgb ycbcr:601 quantization:full-range]
+ <- "ov5640 1-003c":0 [ENABLED,IMMUTABLE]
+ pad1: Source
+ [fmt:UYVY8_2X8/640x480 field:none colorspace:srgb xfer:srgb ycbcr:601 quantization:full-range]
+ -> "csi capture":0 [ENABLED,IMMUTABLE]
+
+ - entity 4: csi capture (1 pad, 1 link)
+ type Node subtype V4L flags 0
+ device node name /dev/video1
+ pad0: Sink
+ <- "csi":1 [ENABLED,IMMUTABLE]
+
+ - entity 10: ov5640 1-003c (1 pad, 1 link)
+ type V4L2 subdev subtype Sensor flags 0
+ device node name /dev/v4l-subdev1
+ pad0: Source
+ [fmt:UYVY8_2X8/640x480@1/30 field:none colorspace:srgb xfer:srgb ycbcr:601 quantization:full-range]
+ -> "csi":0 [ENABLED,IMMUTABLE]
+
References
----------
diff --git a/Documentation/admin-guide/media/index.rst b/Documentation/admin-guide/media/index.rst
index 6e0d2bae7154..be7e0e4482ca 100644
--- a/Documentation/admin-guide/media/index.rst
+++ b/Documentation/admin-guide/media/index.rst
@@ -11,23 +11,17 @@ its supported drivers.
Please see:
-- :doc:`/userspace-api/media/index`
- for the userspace APIs used on media devices.
+Documentation/userspace-api/media/index.rst
-- :doc:`/driver-api/media/index`
- for driver development information and Kernel APIs used by
- media devices;
+ - for the userspace APIs used on media devices.
-The media subsystem
-===================
+Documentation/driver-api/media/index.rst
-.. only:: html
-
- .. class:: toc-title
-
- Table of Contents
+ - for driver development information and Kernel APIs used by
+ media devices;
.. toctree::
+ :caption: Table of Contents
:maxdepth: 2
:numbered:
@@ -36,13 +30,14 @@ The media subsystem
remote-controller
+ cec
+
dvb
cardlist
v4l-drivers
dvb-drivers
- cec-drivers
**Copyright** |copy| 1999-2020 : LinuxTV Developers
diff --git a/Documentation/admin-guide/media/ipu3.rst b/Documentation/admin-guide/media/ipu3.rst
index 9361c34f123e..83b3cd03b35c 100644
--- a/Documentation/admin-guide/media/ipu3.rst
+++ b/Documentation/admin-guide/media/ipu3.rst
@@ -51,10 +51,11 @@ to userspace as a V4L2 sub-device node and has two pads:
.. tabularcolumns:: |p{0.8cm}|p{4.0cm}|p{4.0cm}|
.. flat-table::
+ :header-rows: 1
- * - pad
- - direction
- - purpose
+ * - Pad
+ - Direction
+ - Purpose
* - 0
- sink
@@ -86,44 +87,44 @@ raw Bayer format that is specific to IPU3.
Let us take the example of ov5670 sensor connected to CSI2 port 0, for a
2592x1944 image capture.
-Using the media contorller APIs, the ov5670 sensor is configured to send
+Using the media controller APIs, the ov5670 sensor is configured to send
frames in packed raw Bayer format to IPU3 CSI2 receiver.
-# This example assumes /dev/media0 as the CIO2 media device
-
-export MDEV=/dev/media0
-
-# and that ov5670 sensor is connected to i2c bus 10 with address 0x36
-
-export SDEV=$(media-ctl -d $MDEV -e "ov5670 10-0036")
+.. code-block:: none
-# Establish the link for the media devices using media-ctl [#f3]_
-media-ctl -d $MDEV -l "ov5670:0 -> ipu3-csi2 0:0[1]"
+ # This example assumes /dev/media0 as the CIO2 media device
+ export MDEV=/dev/media0
-# Set the format for the media devices
-media-ctl -d $MDEV -V "ov5670:0 [fmt:SGRBG10/2592x1944]"
+ # and that ov5670 sensor is connected to i2c bus 10 with address 0x36
+ export SDEV=$(media-ctl -d $MDEV -e "ov5670 10-0036")
-media-ctl -d $MDEV -V "ipu3-csi2 0:0 [fmt:SGRBG10/2592x1944]"
+ # Establish the link for the media devices using media-ctl [#f3]_
+ media-ctl -d $MDEV -l "ov5670:0 -> ipu3-csi2 0:0[1]"
-media-ctl -d $MDEV -V "ipu3-csi2 0:1 [fmt:SGRBG10/2592x1944]"
+ # Set the format for the media devices
+ media-ctl -d $MDEV -V "ov5670:0 [fmt:SGRBG10/2592x1944]"
+ media-ctl -d $MDEV -V "ipu3-csi2 0:0 [fmt:SGRBG10/2592x1944]"
+ media-ctl -d $MDEV -V "ipu3-csi2 0:1 [fmt:SGRBG10/2592x1944]"
Once the media pipeline is configured, desired sensor specific settings
(such as exposure and gain settings) can be set, using the yavta tool.
e.g
-yavta -w 0x009e0903 444 $SDEV
-
-yavta -w 0x009e0913 1024 $SDEV
+.. code-block:: none
-yavta -w 0x009e0911 2046 $SDEV
+ yavta -w 0x009e0903 444 $SDEV
+ yavta -w 0x009e0913 1024 $SDEV
+ yavta -w 0x009e0911 2046 $SDEV
Once the desired sensor settings are set, frame captures can be done as below.
e.g
-yavta --data-prefix -u -c10 -n5 -I -s2592x1944 --file=/tmp/frame-#.bin \
- -f IPU3_SGRBG10 $(media-ctl -d $MDEV -e "ipu3-cio2 0")
+.. code-block:: none
+
+ yavta --data-prefix -u -c10 -n5 -I -s2592x1944 --file=/tmp/frame-#.bin \
+ -f IPU3_SGRBG10 $(media-ctl -d $MDEV -e "ipu3-cio2 0")
With the above command, 10 frames are captured at 2592x1944 resolution, with
sGRBG10 format and output as IPU3_SGRBG10 format.
@@ -148,10 +149,11 @@ Each pipe has two sink pads and three source pads for the following purpose:
.. tabularcolumns:: |p{0.8cm}|p{4.0cm}|p{4.0cm}|
.. flat-table::
+ :header-rows: 1
- * - pad
- - direction
- - purpose
+ * - Pad
+ - Direction
+ - Purpose
* - 0
- sink
@@ -234,22 +236,23 @@ The IPU3 ImgU pipelines can be configured using the Media Controller, defined at
Running mode and firmware binary selection
------------------------------------------
-ImgU works based on firmware, currently the ImgU firmware support run 2 pipes in
-time-sharing with single input frame data. Each pipe can run at certain mode -
-"VIDEO" or "STILL", "VIDEO" mode is commonly used for video frames capture, and
-"STILL" is used for still frame capture. However, you can also select "VIDEO" to
-capture still frames if you want to capture images with less system load and
-power. For "STILL" mode, ImgU will try to use smaller BDS factor and output
-larger bayer frame for further YUV processing than "VIDEO" mode to get high
-quality images. Besides, "STILL" mode need XNR3 to do noise reduction, hence
-"STILL" mode will need more power and memory bandwidth than "VIDEO" mode. TNR
-will be enabled in "VIDEO" mode and bypassed by "STILL" mode. ImgU is running at
-“VIDEO” mode by default, the user can use v4l2 control V4L2_CID_INTEL_IPU3_MODE
-(currently defined in drivers/staging/media/ipu3/include/intel-ipu3.h) to query
-and set the running mode. For user, there is no difference for buffer queueing
-between the "VIDEO" and "STILL" mode, mandatory input and main output node
-should be enabled and buffers need be queued, the statistics and the view-finder
-queues are optional.
+ImgU works based on firmware. Currently the ImgU firmware supports running two
+pipes in time-sharing with a single input frame of data. Each pipe can run in
+a certain mode - "VIDEO" or "STILL". "VIDEO" mode is commonly used for video
+frame capture, and "STILL" is used for still frame capture. However, you can
+also select "VIDEO" to capture still frames if you want to capture images with
+less system load and power. In "STILL" mode, ImgU will try to use a smaller
+BDS factor and output a larger bayer frame for further YUV processing than in
+"VIDEO" mode, to get higher quality images. Besides, "STILL" mode needs XNR3
+for noise reduction and hence needs more power and memory bandwidth than
+"VIDEO" mode. TNR will be enabled in "VIDEO" mode and bypassed in "STILL"
+mode. ImgU runs in "VIDEO" mode by default; the user can use the v4l2 control
+V4L2_CID_INTEL_IPU3_MODE (currently defined in
+drivers/staging/media/ipu3/include/uapi/intel-ipu3.h) to query and set the
+running mode. For the user, there is no difference in buffer queueing between
+the "VIDEO" and "STILL" modes; the mandatory input and main output nodes
+should be enabled and buffers queued, while the statistics and the view-finder
+queues are optional.
The firmware binary will be selected according to current running mode, such log
"using binary if_to_osys_striped " or "using binary if_to_osys_primary_striped"
@@ -269,21 +272,21 @@ all the video nodes setup correctly.
Let us take "ipu3-imgu 0" subdev as an example.
-media-ctl -d $MDEV -r
-
-media-ctl -d $MDEV -l "ipu3-imgu 0 input":0 -> "ipu3-imgu 0":0[1]
-
-media-ctl -d $MDEV -l "ipu3-imgu 0":2 -> "ipu3-imgu 0 output":0[1]
-
-media-ctl -d $MDEV -l "ipu3-imgu 0":3 -> "ipu3-imgu 0 viewfinder":0[1]
+.. code-block:: none
-media-ctl -d $MDEV -l "ipu3-imgu 0":4 -> "ipu3-imgu 0 3a stat":0[1]
+ media-ctl -d $MDEV -r
+ media-ctl -d $MDEV -l "ipu3-imgu 0 input":0 -> "ipu3-imgu 0":0[1]
+ media-ctl -d $MDEV -l "ipu3-imgu 0":2 -> "ipu3-imgu 0 output":0[1]
+ media-ctl -d $MDEV -l "ipu3-imgu 0":3 -> "ipu3-imgu 0 viewfinder":0[1]
+ media-ctl -d $MDEV -l "ipu3-imgu 0":4 -> "ipu3-imgu 0 3a stat":0[1]
Also the pipe mode of the corresponding V4L2 subdev should be set as desired
(e.g 0 for video mode or 1 for still mode) through the control id 0x009819a1 as
below.
-yavta -w "0x009819A1 1" /dev/v4l-subdev7
+.. code-block:: none
+
+ yavta -w "0x009819A1 1" /dev/v4l-subdev7
Certain hardware blocks in ImgU pipeline can change the frame resolution by
cropping or scaling, these hardware blocks include Input Feeder(IF), Bayer Down
@@ -313,8 +316,8 @@ configuration steps of 0.03125 (1/32).
**Geometric Distortion Correction**
-Geometric Distortion Correction is used to performe correction of distortions
-and image filtering. It needs some extra filter and envelop padding pixels to
+Geometric Distortion Correction is used to perform correction of distortions
+and image filtering. It needs some extra filter and envelope padding pixels to
work, so the input resolution of GDC should be larger than the output
resolution.
@@ -371,30 +374,32 @@ v4l2n command can be used. This helps process the raw Bayer frames and produces
the desired results for the main output image and the viewfinder output, in NV12
format.
-v4l2n --pipe=4 --load=/tmp/frame-#.bin --open=/dev/video4
---fmt=type:VIDEO_OUTPUT_MPLANE,width=2592,height=1944,pixelformat=0X47337069
---reqbufs=type:VIDEO_OUTPUT_MPLANE,count:1 --pipe=1 --output=/tmp/frames.out
---open=/dev/video5
---fmt=type:VIDEO_CAPTURE_MPLANE,width=2560,height=1920,pixelformat=NV12
---reqbufs=type:VIDEO_CAPTURE_MPLANE,count:1 --pipe=2 --output=/tmp/frames.vf
---open=/dev/video6
---fmt=type:VIDEO_CAPTURE_MPLANE,width=2560,height=1920,pixelformat=NV12
---reqbufs=type:VIDEO_CAPTURE_MPLANE,count:1 --pipe=3 --open=/dev/video7
---output=/tmp/frames.3A --fmt=type:META_CAPTURE,?
---reqbufs=count:1,type:META_CAPTURE --pipe=1,2,3,4 --stream=5
+.. code-block:: none
+
+ v4l2n --pipe=4 --load=/tmp/frame-#.bin --open=/dev/video4 \
+ --fmt=type:VIDEO_OUTPUT_MPLANE,width=2592,height=1944,pixelformat=0X47337069 \
+ --reqbufs=type:VIDEO_OUTPUT_MPLANE,count:1 --pipe=1 \
+ --output=/tmp/frames.out --open=/dev/video5 \
+ --fmt=type:VIDEO_CAPTURE_MPLANE,width=2560,height=1920,pixelformat=NV12 \
+ --reqbufs=type:VIDEO_CAPTURE_MPLANE,count:1 --pipe=2 \
+ --output=/tmp/frames.vf --open=/dev/video6 \
+ --fmt=type:VIDEO_CAPTURE_MPLANE,width=2560,height=1920,pixelformat=NV12 \
+ --reqbufs=type:VIDEO_CAPTURE_MPLANE,count:1 --pipe=3 --open=/dev/video7 \
+ --output=/tmp/frames.3A --fmt=type:META_CAPTURE,? \
+ --reqbufs=count:1,type:META_CAPTURE --pipe=1,2,3,4 --stream=5
You can also use yavta [#f2]_ command to do same thing as above:
.. code-block:: none
- yavta --data-prefix -Bcapture-mplane -c10 -n5 -I -s2592x1944 \
- --file=frame-#.out-f NV12 /dev/video5 & \
- yavta --data-prefix -Bcapture-mplane -c10 -n5 -I -s2592x1944 \
- --file=frame-#.vf -f NV12 /dev/video6 & \
- yavta --data-prefix -Bmeta-capture -c10 -n5 -I \
- --file=frame-#.3a /dev/video7 & \
- yavta --data-prefix -Boutput-mplane -c10 -n5 -I -s2592x1944 \
- --file=/tmp/frame-in.cio2 -f IPU3_SGRBG10 /dev/video4
+ yavta --data-prefix -Bcapture-mplane -c10 -n5 -I -s2592x1944 \
+ --file=frame-#.out -f NV12 /dev/video5 & \
+ yavta --data-prefix -Bcapture-mplane -c10 -n5 -I -s2592x1944 \
+ --file=frame-#.vf -f NV12 /dev/video6 & \
+ yavta --data-prefix -Bmeta-capture -c10 -n5 -I \
+ --file=frame-#.3a /dev/video7 & \
+ yavta --data-prefix -Boutput-mplane -c10 -n5 -I -s2592x1944 \
+ --file=/tmp/frame-in.cio2 -f IPU3_SGRBG10 /dev/video4
where /dev/video4, /dev/video5, /dev/video6 and /dev/video7 devices point to
input, output, viewfinder and 3A statistics video nodes respectively.
@@ -408,7 +413,9 @@ as below.
Main output frames
~~~~~~~~~~~~~~~~~~
-raw2pnm -x2560 -y1920 -fNV12 /tmp/frames.out /tmp/frames.out.ppm
+.. code-block:: none
+
+ raw2pnm -x2560 -y1920 -fNV12 /tmp/frames.out /tmp/frames.out.ppm
where 2560x1920 is output resolution, NV12 is the video format, followed
by input frame and output PNM file.
@@ -416,7 +423,9 @@ by input frame and output PNM file.
Viewfinder output frames
~~~~~~~~~~~~~~~~~~~~~~~~
-raw2pnm -x2560 -y1920 -fNV12 /tmp/frames.vf /tmp/frames.vf.ppm
+.. code-block:: none
+
+ raw2pnm -x2560 -y1920 -fNV12 /tmp/frames.vf /tmp/frames.vf.ppm
where 2560x1920 is output resolution, NV12 is the video format, followed
by input frame and output PNM file.
@@ -482,63 +491,63 @@ Name Description
Optical Black Correction Optical Black Correction block subtracts a pre-defined
value from the respective pixel values to obtain better
image quality.
- Defined in :c:type:`ipu3_uapi_obgrid_param`.
+ Defined in struct ipu3_uapi_obgrid_param.
Linearization This algo block uses linearization parameters to
address non-linearity sensor effects. The Lookup table
table is defined in
- :c:type:`ipu3_uapi_isp_lin_vmem_params`.
+ struct ipu3_uapi_isp_lin_vmem_params.
SHD Lens shading correction is used to correct spatial
non-uniformity of the pixel response due to optical
lens shading. This is done by applying a different gain
for each pixel. The gain, black level etc are
- configured in :c:type:`ipu3_uapi_shd_config_static`.
+ configured in struct ipu3_uapi_shd_config_static.
BNR Bayer noise reduction block removes image noise by
applying a bilateral filter.
- See :c:type:`ipu3_uapi_bnr_static_config` for details.
+ See struct ipu3_uapi_bnr_static_config for details.
ANR Advanced Noise Reduction is a block based algorithm
that performs noise reduction in the Bayer domain. The
convolution matrix etc can be found in
- :c:type:`ipu3_uapi_anr_config`.
+ struct ipu3_uapi_anr_config.
DM Demosaicing converts raw sensor data in Bayer format
into RGB (Red, Green, Blue) presentation. Then add
outputs of estimation of Y channel for following stream
processing by Firmware. The struct is defined as
- :c:type:`ipu3_uapi_dm_config`.
+ struct ipu3_uapi_dm_config.
Color Correction Color Correction algo transforms sensor specific color
space to the standard "sRGB" color space. This is done
by applying 3x3 matrix defined in
- :c:type:`ipu3_uapi_ccm_mat_config`.
-Gamma correction Gamma correction :c:type:`ipu3_uapi_gamma_config` is a
+ struct ipu3_uapi_ccm_mat_config.
+Gamma correction Gamma correction struct ipu3_uapi_gamma_config is a
basic non-linear tone mapping correction that is
applied per pixel for each pixel component.
CSC Color space conversion transforms each pixel from the
RGB primary presentation to YUV (Y: brightness,
UV: Luminance) presentation. This is done by applying
a 3x3 matrix defined in
- :c:type:`ipu3_uapi_csc_mat_config`
+ struct ipu3_uapi_csc_mat_config
CDS Chroma down sampling
After the CSC is performed, the Chroma Down Sampling
is applied for a UV plane down sampling by a factor
of 2 in each direction for YUV 4:2:0 using a 4x2
- configurable filter :c:type:`ipu3_uapi_cds_params`.
+ configurable filter struct ipu3_uapi_cds_params.
CHNR Chroma noise reduction
This block processes only the chrominance pixels and
performs noise reduction by cleaning the high
frequency noise.
- See struct :c:type:`ipu3_uapi_yuvp1_chnr_config`.
+ See struct ipu3_uapi_yuvp1_chnr_config.
TCC Total color correction as defined in struct
- :c:type:`ipu3_uapi_yuvp2_tcc_static_config`.
+ ipu3_uapi_yuvp2_tcc_static_config.
XNR3 eXtreme Noise Reduction V3 is the third revision of
noise reduction algorithm used to improve image
quality. This removes the low frequency noise in the
captured image. Two related structs are being defined,
- :c:type:`ipu3_uapi_isp_xnr3_params` for ISP data memory
- and :c:type:`ipu3_uapi_isp_xnr3_vmem_params` for vector
+ struct ipu3_uapi_isp_xnr3_params for ISP data memory
+ and struct ipu3_uapi_isp_xnr3_vmem_params for vector
memory.
TNR Temporal Noise Reduction block compares successive
frames in time to remove anomalies / noise in pixel
- values. :c:type:`ipu3_uapi_isp_tnr3_vmem_params` and
- :c:type:`ipu3_uapi_isp_tnr3_params` are defined for ISP
+ values. struct ipu3_uapi_isp_tnr3_vmem_params and
+ struct ipu3_uapi_isp_tnr3_params are defined for ISP
vector and data memory respectively.
======================== =======================================================
@@ -570,9 +579,9 @@ processor, while many others will use a set of fixed hardware blocks also
called accelerator cluster (ACC) to crunch pixel data and produce statistics.
ACC parameters of individual algorithms, as defined by
-:c:type:`ipu3_uapi_acc_param`, can be chosen to be applied by the user
-space through struct :c:type:`ipu3_uapi_flags` embedded in
-:c:type:`ipu3_uapi_params` structure. For parameters that are configured as
+struct ipu3_uapi_acc_param, can be chosen to be applied by the user
+space through struct ipu3_uapi_flags embedded in
+struct ipu3_uapi_params structure. For parameters that are configured as
not enabled by the user space, the corresponding structs are ignored by the
driver, in which case the existing configuration of the algorithm will be
preserved.
@@ -580,7 +589,7 @@ preserved.
References
==========
-.. [#f5] drivers/staging/media/ipu3/include/intel-ipu3.h
+.. [#f5] drivers/staging/media/ipu3/include/uapi/intel-ipu3.h
.. [#f1] https://github.com/intel/nvt
diff --git a/Documentation/admin-guide/media/ivtv.rst b/Documentation/admin-guide/media/ivtv.rst
index 7b8775d20214..101f16d0263e 100644
--- a/Documentation/admin-guide/media/ivtv.rst
+++ b/Documentation/admin-guide/media/ivtv.rst
@@ -159,7 +159,7 @@ whatever). Otherwise the device numbers can get confusing. The ivtv
Read-only
The raw YUV video output from the current video input. The YUV format
- is non-standard (V4L2_PIX_FMT_HM12).
+ is a 16x16 linear tiled NV12 format (V4L2_PIX_FMT_NV12_16L16).
Note that the YUV and PCM streams are not synchronized, so they are of
limited use.
diff --git a/Documentation/admin-guide/media/meye.rst b/Documentation/admin-guide/media/meye.rst
deleted file mode 100644
index 9098a1e65f8b..000000000000
--- a/Documentation/admin-guide/media/meye.rst
+++ /dev/null
@@ -1,93 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-.. include:: <isonum.txt>
-
-Vaio Picturebook Motion Eye Camera Driver
-=========================================
-
-Copyright |copy| 2001-2004 Stelian Pop <stelian@popies.net>
-
-Copyright |copy| 2001-2002 Alcôve <www.alcove.com>
-
-Copyright |copy| 2000 Andrew Tridgell <tridge@samba.org>
-
-This driver enable the use of video4linux compatible applications with the
-Motion Eye camera. This driver requires the "Sony Laptop Extras" driver (which
-can be found in the "Misc devices" section of the kernel configuration utility)
-to be compiled and installed (using its "camera=1" parameter).
-
-It can do at maximum 30 fps @ 320x240 or 15 fps @ 640x480.
-
-Grabbing is supported in packed YUV colorspace only.
-
-MJPEG hardware grabbing is supported via a private API (see below).
-
-Hardware supported
-------------------
-
-This driver supports the 'second' version of the MotionEye camera :)
-
-The first version was connected directly on the video bus of the Neomagic
-video card and is unsupported.
-
-The second one, made by Kawasaki Steel is fully supported by this
-driver (PCI vendor/device is 0x136b/0xff01)
-
-The third one, present in recent (more or less last year) Picturebooks
-(C1M* models), is not supported. The manufacturer has given the specs
-to the developers under a NDA (which allows the development of a GPL
-driver however), but things are not moving very fast (see
-http://r-engine.sourceforge.net/) (PCI vendor/device is 0x10cf/0x2011).
-
-There is a forth model connected on the USB bus in TR1* Vaio laptops.
-This camera is not supported at all by the current driver, in fact
-little information if any is available for this camera
-(USB vendor/device is 0x054c/0x0107).
-
-Driver options
---------------
-
-Several options can be passed to the meye driver using the standard
-module argument syntax (<param>=<value> when passing the option to the
-module or meye.<param>=<value> on the kernel boot line when meye is
-statically linked into the kernel). Those options are:
-
-.. code-block:: none
-
- gbuffers: number of capture buffers, default is 2 (32 max)
-
- gbufsize: size of each capture buffer, default is 614400
-
- video_nr: video device to register (0 = /dev/video0, etc)
-
-Module use
-----------
-
-In order to automatically load the meye module on use, you can put those lines
-in your /etc/modprobe.d/meye.conf file:
-
-.. code-block:: none
-
- alias char-major-81 videodev
- alias char-major-81-0 meye
- options meye gbuffers=32
-
-Usage:
-------
-
-.. code-block:: none
-
- xawtv >= 3.49 (<http://bytesex.org/xawtv/>)
- for display and uncompressed video capture:
-
- xawtv -c /dev/video0 -geometry 640x480
- or
- xawtv -c /dev/video0 -geometry 320x240
-
- motioneye (<http://popies.net/meye/>)
- for getting ppm or jpg snapshots, mjpeg video
-
-Bugs / Todo
------------
-
-- 'motioneye' still uses the meye private v4l1 API extensions.
diff --git a/Documentation/admin-guide/media/mgb4.rst b/Documentation/admin-guide/media/mgb4.rst
new file mode 100644
index 000000000000..2977f74d7e26
--- /dev/null
+++ b/Documentation/admin-guide/media/mgb4.rst
@@ -0,0 +1,374 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+mgb4 sysfs interface
+====================
+
+The mgb4 driver provides a sysfs interface that is used to configure video
+stream related parameters (some of which must be set properly before the v4l2
+device can be opened) and to obtain the video device/stream status.
+
+There are two types of parameters - global / PCI card related ones, found
+under ``/sys/class/video4linux/videoX/device``, and module specific ones,
+found under ``/sys/class/video4linux/videoX``.
+
+
+Global (PCI card) parameters
+============================
+
+**module_type** (R):
+ Module type.
+
+ | 0 - No module present
+ | 1 - FPDL3
+ | 2 - GMSL
+
+**module_version** (R):
+ Module version number. Zero in case of a missing module.
+
+**fw_type** (R):
+ Firmware type.
+
+ | 1 - FPDL3
+ | 2 - GMSL
+
+**fw_version** (R):
+ Firmware version number.
+
+**serial_number** (R):
+ Card serial number. The format is::
+
+ PRODUCT-REVISION-SERIES-SERIAL
+
+ where each component is an 8-bit number.
+
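+A minimal sketch of reading a few of these attributes from a shell, assuming
+the card is registered as ``video0`` (adjust the device name to match your
+system)::
+
+    cat /sys/class/video4linux/video0/device/module_type
+    cat /sys/class/video4linux/video0/device/fw_version
+    cat /sys/class/video4linux/video0/device/serial_number
+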
+
+Common FPDL3/GMSL input parameters
+==================================
+
+**input_id** (R):
+ Input number ID, zero based.
+
+**oldi_lane_width** (RW):
+ Number of deserializer output lanes.
+
+ | 0 - single
+ | 1 - dual (default)
+
+**color_mapping** (RW):
+ Mapping of the incoming bits in the signal to the colour bits of the pixels.
+
+ | 0 - OLDI/JEIDA
+ | 1 - SPWG/VESA (default)
+
+**link_status** (R):
+ Video link status. If the link is locked, chips are properly connected and
+ communicating at the same speed and protocol. The link can be locked without
+ an active video stream.
+
+ A value of 0 is equivalent to the V4L2_IN_ST_NO_SYNC flag of the V4L2
+ VIDIOC_ENUMINPUT status bits.
+
+ | 0 - unlocked
+ | 1 - locked
+
+**stream_status** (R):
+ Video stream status. A stream is detected if the link is locked, the input
+ pixel clock is running and the DE signal is moving.
+
+ A value of 0 is equivalent to the V4L2_IN_ST_NO_SIGNAL flag of the V4L2
+ VIDIOC_ENUMINPUT status bits.
+
+ | 0 - not detected
+ | 1 - detected
+
+**video_width** (R):
+ Video stream width. This is the actual width as detected by the HW.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in the width
+ field of the v4l2_bt_timings struct.
+
+**video_height** (R):
+ Video stream height. This is the actual height as detected by the HW.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in the height
+ field of the v4l2_bt_timings struct.
+
+**vsync_status** (R):
+ The type of VSYNC pulses as detected by the video format detector.
+
+ The value is equivalent to the flags returned by VIDIOC_QUERY_DV_TIMINGS in
+ the polarities field of the v4l2_bt_timings struct.
+
+ | 0 - active low
+ | 1 - active high
+ | 2 - not available
+
+**hsync_status** (R):
+ The type of HSYNC pulses as detected by the video format detector.
+
+ The value is equivalent to the flags returned by VIDIOC_QUERY_DV_TIMINGS in
+ the polarities field of the v4l2_bt_timings struct.
+
+ | 0 - active low
+ | 1 - active high
+ | 2 - not available
+
+**vsync_gap_length** (RW):
+ If the incoming video signal does not contain synchronization VSYNC and
+ HSYNC pulses, these must be generated internally in the FPGA to achieve
+ the correct frame ordering. This value indicates how many "empty" pixels
+ (pixels with deasserted Data Enable signal) are necessary to generate the
+ internal VSYNC pulse.
+
+**hsync_gap_length** (RW):
+ If the incoming video signal does not contain synchronization VSYNC and
+ HSYNC pulses, these must be generated internally in the FPGA to achieve
+ the correct frame ordering. This value indicates how many "empty" pixels
+ (pixels with deasserted Data Enable signal) are necessary to generate the
+ internal HSYNC pulse. The value must be greater than 1 and smaller than
+ vsync_gap_length.
+
+**pclk_frequency** (R):
+ Input pixel clock frequency in kHz.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the pixelclock field of the v4l2_bt_timings struct.
+
+ *Note: The frequency_range parameter must be set properly first to get
+ a valid frequency here.*
+
+**hsync_width** (R):
+ Width of the HSYNC signal in PCLK clock ticks.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the hsync field of the v4l2_bt_timings struct.
+
+**vsync_width** (R):
+ Width of the VSYNC signal in PCLK clock ticks.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the vsync field of the v4l2_bt_timings struct.
+
+**hback_porch** (R):
+ Number of PCLK pulses between deassertion of the HSYNC signal and the first
+ valid pixel in the video line (marked by DE=1).
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the hbackporch field of the v4l2_bt_timings struct.
+
+**hfront_porch** (R):
+ Number of PCLK pulses between the end of the last valid pixel in the video
+ line (marked by DE=1) and assertion of the HSYNC signal.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the hfrontporch field of the v4l2_bt_timings struct.
+
+**vback_porch** (R):
+ Number of video lines between deassertion of the VSYNC signal and the video
+ line with the first valid pixel (marked by DE=1).
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the vbackporch field of the v4l2_bt_timings struct.
+
+**vfront_porch** (R):
+ Number of video lines between the end of the last valid pixel line (marked
+ by DE=1) and assertion of the VSYNC signal.
+
+ The value is identical to what VIDIOC_QUERY_DV_TIMINGS returns in
+ the vfrontporch field of the v4l2_bt_timings struct.
+
+**frequency_range** (RW):
+ PLL frequency range of the OLDI input clock generator. The PLL frequency is
+ derived from the Pixel Clock Frequency (PCLK) and is equal to PCLK if
+ oldi_lane_width is set to "single" and PCLK/2 if oldi_lane_width is set to
+ "dual".
+
+ | 0 - PLL < 50MHz (default)
+ | 1 - PLL >= 50MHz
+
+ *Note: This parameter can not be changed while the input v4l2 device is
+ open.*
+
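+As an illustration only, an input could be checked and tuned from a shell
+before opening the v4l2 device; the attribute names are taken from the list
+above, while the device path is an assumption::
+
+    # switch the colour mapping to OLDI/JEIDA
+    echo 0 > /sys/class/video4linux/video0/color_mapping
+
+    # check that the link is locked and a stream is detected
+    cat /sys/class/video4linux/video0/link_status
+    cat /sys/class/video4linux/video0/stream_status
+    cat /sys/class/video4linux/video0/video_width
+    cat /sys/class/video4linux/video0/video_height
+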
+
+Common FPDL3/GMSL output parameters
+===================================
+
+**output_id** (R):
+ Output number ID, zero based.
+
+**video_source** (RW):
+ Output video source. If set to 0 or 1, the source is the corresponding card
+ input and the v4l2 output devices are disabled. If set to 2 or 3, the source
+ is the corresponding v4l2 video output device. The default is
+ the corresponding v4l2 output, i.e. 2 for OUT1 and 3 for OUT2.
+
+ | 0 - input 0
+ | 1 - input 1
+ | 2 - v4l2 output 0
+ | 3 - v4l2 output 1
+
+ *Note: This parameter can not be changed while ANY of the input/output v4l2
+ devices is open.*
+
+**display_width** (RW):
+ Display width. There is no autodetection of the connected display, so the
+ proper value must be set before the start of streaming. The default width
+ is 1280.
+
+ *Note: This parameter can not be changed while the output v4l2 device is
+ open.*
+
+**display_height** (RW):
+ Display height. There is no autodetection of the connected display, so the
+ proper value must be set before the start of streaming. The default height
+ is 640.
+
+ *Note: This parameter can not be changed while the output v4l2 device is
+ open.*
+
+**frame_rate** (RW):
+ Output video frame rate in frames per second. The default frame rate is
+ 60Hz.
+
+**hsync_polarity** (RW):
+ HSYNC signal polarity.
+
+ | 0 - active low (default)
+ | 1 - active high
+
+**vsync_polarity** (RW):
+ VSYNC signal polarity.
+
+ | 0 - active low (default)
+ | 1 - active high
+
+**de_polarity** (RW):
+ DE signal polarity.
+
+ | 0 - active low
+ | 1 - active high (default)
+
+**pclk_frequency** (RW):
+ Output pixel clock frequency. Allowed values are between 25000 and 190000
+ kHz and there is a non-linear stepping between two consecutive allowed
+ frequencies. The driver finds the nearest allowed frequency to the given
+ value and sets it. When reading this property, you get the exact
+ frequency set by the driver. The default frequency is 70000kHz.
+
+ *Note: This parameter can not be changed while the output v4l2 device is
+ open.*
+
+**hsync_width** (RW):
+ Width of the HSYNC signal in pixels. The default value is 16.
+
+**vsync_width** (RW):
+ Width of the VSYNC signal in video lines. The default value is 2.
+
+**hback_porch** (RW):
+ Number of PCLK pulses between deassertion of the HSYNC signal and the first
+ valid pixel in the video line (marked by DE=1). The default value is 32.
+
+**hfront_porch** (RW):
+ Number of PCLK pulses between the end of the last valid pixel in the video
+ line (marked by DE=1) and assertion of the HSYNC signal. The default value
+ is 32.
+
+**vback_porch** (RW):
+ Number of video lines between deassertion of the VSYNC signal and the video
+ line with the first valid pixel (marked by DE=1). The default value is 2.
+
+**vfront_porch** (RW):
+ Number of video lines between the end of the last valid pixel line (marked
+ by DE=1) and assertion of the VSYNC signal. The default value is 2.
+
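+A hedged example of preparing an output before streaming, again assuming the
+output v4l2 device shows up as ``video2`` (the display size and clock values
+below are placeholders, not recommendations)::
+
+    echo 1920 > /sys/class/video4linux/video2/display_width
+    echo 1080 > /sys/class/video4linux/video2/display_height
+    echo 60 > /sys/class/video4linux/video2/frame_rate
+    echo 148500 > /sys/class/video4linux/video2/pclk_frequency
+
+    # read back the exact pixel clock chosen by the driver
+    cat /sys/class/video4linux/video2/pclk_frequency
+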
+
+FPDL3 specific input parameters
+===============================
+
+**fpdl3_input_width** (RW):
+ Number of deserializer input lines.
+
+ | 0 - auto (default)
+ | 1 - single
+ | 2 - dual
+
+FPDL3 specific output parameters
+================================
+
+**fpdl3_output_width** (RW):
+ Number of serializer output lines.
+
+ | 0 - auto (default)
+ | 1 - single
+ | 2 - dual
+
+GMSL specific input parameters
+==============================
+
+**gmsl_mode** (RW):
+ GMSL speed mode.
+
+ | 0 - 12Gb/s (default)
+ | 1 - 6Gb/s
+ | 2 - 3Gb/s
+ | 3 - 1.5Gb/s
+
+**gmsl_stream_id** (RW):
+ The GMSL multi-stream contains up to four video streams. This parameter
+ selects which stream is captured by the video input. The value is the
+ zero-based index of the stream. The default stream id is 0.
+
+ *Note: This parameter can not be changed while the input v4l2 device is
+ open.*
+
+**gmsl_fec** (RW):
+ GMSL Forward Error Correction (FEC).
+
+ | 0 - disabled
+ | 1 - enabled (default)
+
+
+====================
+mgb4 mtd partitions
+====================
+
+The mgb4 driver creates an MTD device with two partitions:
+ - mgb4-fw.X - FPGA firmware.
+ - mgb4-data.X - Factory settings, e.g. card serial number.
+
+The *mgb4-fw* partition is writable and is used for FW updates, *mgb4-data* is
+read-only. The *X* attached to the partition name represents the card number.
+Depending on the CONFIG_MTD_PARTITIONED_MASTER kernel configuration, you may
+also have a third partition named *mgb4-flash* available in the system. This
+partition represents the card's whole, unpartitioned FLASH memory and one
+should not fiddle with it.
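+
+For example, a firmware image could be written with the standard mtd-utils,
+assuming the *mgb4-fw* partition shows up as /dev/mtd0 and the image file is
+named fw.bin (both are placeholders; check /proc/mtd first)::
+
+    cat /proc/mtd
+    flashcp -v fw.bin /dev/mtd0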
+
+====================
+mgb4 iio (triggers)
+====================
+
+The mgb4 driver creates an Industrial I/O (IIO) device that provides trigger and
+signal level status capability. The following scan elements are available:
+
+**activity**:
+ The trigger levels and pending status.
+
+ | bit 1 - trigger 1 pending
+ | bit 2 - trigger 2 pending
+ | bit 5 - trigger 1 level
+ | bit 6 - trigger 2 level
+
+**timestamp**:
+ The trigger event timestamp.
+
+The iio device can operate either in "raw" mode, where you can fetch the signal
+levels (activity bits 5 and 6) using sysfs access, or in triggered buffer mode.
+In the triggered buffer mode you can follow the signal level changes (activity
+bits 1 and 2) using the iio device in /dev. If you enable the timestamps, you
+will also get the exact trigger event time that can be matched to a video frame
+(every mgb4 video frame has a timestamp with the same clock source).
+
+*Note: although the activity sample always contains all the status bits, it makes
+no sense to get the pending bits in raw mode or the level bits in the triggered
+buffer mode - the values do not represent valid data in that case.*
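+
+To locate the corresponding IIO device, the standard IIO sysfs layout can be
+used; a short sketch (the device name string "mgb4" is an assumption)::
+
+    grep -l mgb4 /sys/bus/iio/devices/iio:device*/name
+
+The channels and attributes the device actually exposes can then be listed,
+for example with the ``iio_info`` utility from libiio, before switching to the
+triggered buffer mode.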
diff --git a/Documentation/admin-guide/media/omap3isp.rst b/Documentation/admin-guide/media/omap3isp.rst
index bc447bbec7ce..f32e7375a1a2 100644
--- a/Documentation/admin-guide/media/omap3isp.rst
+++ b/Documentation/admin-guide/media/omap3isp.rst
@@ -17,7 +17,7 @@ Introduction
------------
This file documents the Texas Instruments OMAP 3 Image Signal Processor (ISP)
-driver located under drivers/media/platform/omap3isp. The original driver was
+driver located under drivers/media/platform/ti/omap3isp. The original driver was
written by Texas Instruments but since that it has been rewritten (twice) at
Nokia.
diff --git a/Documentation/admin-guide/media/omap4_camera.rst b/Documentation/admin-guide/media/omap4_camera.rst
index 24db4222d36d..2ada9b1e6897 100644
--- a/Documentation/admin-guide/media/omap4_camera.rst
+++ b/Documentation/admin-guide/media/omap4_camera.rst
@@ -25,7 +25,7 @@ As of Revision AB, the ISS is described in detail in section 8.
This driver is supporting **only** the CSI2-A/B interfaces for now.
It makes use of the Media Controller framework [#f2]_, and inherited most of the
-code from OMAP3 ISP driver (found under drivers/media/platform/omap3isp/\*),
+code from OMAP3 ISP driver (found under drivers/media/platform/ti/omap3isp/\*),
except that it doesn't need an IOMMU now for ISS buffers memory mapping.
Supports usage of MMAP buffers only (for now).
diff --git a/Documentation/admin-guide/media/other-usb-cardlist.rst b/Documentation/admin-guide/media/other-usb-cardlist.rst
index bbfdb1389c18..fb88db50e861 100644
--- a/Documentation/admin-guide/media/other-usb-cardlist.rst
+++ b/Documentation/admin-guide/media/other-usb-cardlist.rst
@@ -14,8 +14,6 @@ dvb-as102 nBox DVB-T Dongle 0b89:0007
dvb-as102 Sky IT Digital Key (green led) 2137:0001
b2c2-flexcop-usb Technisat/B2C2 FlexCop II/IIb/III 0af7:0101
Digital TV
-cpia2 Vision's CPiA2 cameras 0553:0100, 0553:0140,
- such as the Digital Blue QX5 0553:0151
go7007 WIS GO7007 MPEG encoder 1943:a250, 093b:a002,
093b:a004, 0eb1:6666,
0eb1:6668
@@ -66,7 +64,6 @@ pwc Visionite VCS-UC300 0d81:1900
pwc Visionite VCS-UM100 0d81:1910
s2255drv Sensoray 2255 1943:2255, 1943:2257
stk1160 STK1160 USB video capture dongle 05e1:0408
-stkwebcam Syntek DC1125 174f:a311, 05e1:0501
dvb-ttusb-budget Technotrend/Hauppauge Nova-USB devices 0b48:1003, 0b48:1004,
0b48:1005
dvb-ttusb_dec Technotrend/Hauppauge MPEG decoder 0b48:1006
@@ -78,15 +75,4 @@ dvb-ttusb_dec Technotrend/Hauppauge MPEG decoder
DEC2540-t 0b48:1009
usbtv Fushicai USBTV007 Audio-Video Grabber 1b71:3002, 1f71:3301,
1f71:3306
-zr364xx USB ZR364XX Camera 08ca:0109, 041e:4024,
- 0d64:0108, 0546:3187,
- 0d64:3108, 0595:4343,
- 0bb0:500d, 0feb:2004,
- 055f:b500, 08ca:2062,
- 052b:1a18, 04c8:0729,
- 04f2:a208, 0784:0040,
- 06d6:0034, 0a17:0062,
- 06d6:003b, 0a17:004e,
- 041e:405d, 08ca:2102,
- 06d6:003d
================ ====================================== =====================
diff --git a/Documentation/admin-guide/media/pci-cardlist.rst b/Documentation/admin-guide/media/pci-cardlist.rst
index 434fe996b541..7d8e3c8987db 100644
--- a/Documentation/admin-guide/media/pci-cardlist.rst
+++ b/Documentation/admin-guide/media/pci-cardlist.rst
@@ -77,7 +77,7 @@ ipu3-cio2 Intel ipu3-cio2 driver
ivtv Conexant cx23416/cx23415 MPEG encoder/decoder
ivtvfb Conexant cx23415 framebuffer
mantis MANTIS based cards
-meye Sony Vaio Picturebook Motion Eye
+mgb4 Digiteq Automotive MGB4 frame grabber
mxb Siemens-Nixdorf 'Multimedia eXtension Board'
netup-unidvb NetUP Universal DVB card
ngene Micronas nGene
@@ -90,6 +90,7 @@ sta2x11_vip STA2X11 VIP Video For Linux
tw5864 Techwell TW5864 video/audio grabber and encoder
tw686x Intersil/Techwell TW686x
tw68 Techwell tw68x Video For Linux
+zoran Zoran-36057/36067 JPEG codec
================ ========================================================
Some of those drivers support multiple devices, as shown at the card
@@ -105,3 +106,4 @@ lists below:
ivtv-cardlist
saa7134-cardlist
saa7164-cardlist
+ zoran-cardlist
diff --git a/Documentation/admin-guide/media/platform-cardlist.rst b/Documentation/admin-guide/media/platform-cardlist.rst
index 261e7772eb3e..1230ae4037ad 100644
--- a/Documentation/admin-guide/media/platform-cardlist.rst
+++ b/Documentation/admin-guide/media/platform-cardlist.rst
@@ -30,7 +30,6 @@ exynos-fimc-is EXYNOS4x12 FIMC-IS (Imaging Subsystem)
exynos-fimc-lite EXYNOS FIMC-LITE camera interface
exynos-gsc Samsung Exynos G-Scaler
exy Samsung S5P/EXYNOS4 SoC series Camera Subsystem
-fsl-viu Freescale VIU
imx-pxp i.MX Pixel Pipeline (PXP)
isdf TI DM365 ISIF video capture
mmp_camera Marvell Armada 610 integrated camera controller
@@ -60,6 +59,7 @@ s5p-mfc Samsung S5P MFC Video Codec
sh_veu SuperH VEU mem2mem video processing
sh_vou SuperH VOU video output
stm32-dcmi STM32 Digital Camera Memory Interface (DCMI)
+stm32-dma2d STM32 Chrom-Art Accelerator Unit
sun4i-csi Allwinner A10 CMOS Sensor Interface Support
sun6i-csi Allwinner V3s Camera Sensor Interface
sun8i-di Allwinner Deinterlace
@@ -72,7 +72,6 @@ via-camera VIAFB camera controller
video-mux Video Multiplexer
vpif_display TI DaVinci VPIF V4L2-Display
vpif_capture TI DaVinci VPIF video capture
-vpss TI DaVinci VPBE V4L2-Display
vsp1 Renesas VSP1 Video Processing Engine
xilinx-tpg Xilinx Video Test Pattern Generator
xilinx-video Xilinx Video IP (EXPERIMENTAL)
diff --git a/Documentation/admin-guide/media/pulse8-cec.rst b/Documentation/admin-guide/media/pulse8-cec.rst
deleted file mode 100644
index 356d08b519f3..000000000000
--- a/Documentation/admin-guide/media/pulse8-cec.rst
+++ /dev/null
@@ -1,13 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Pulse-Eight CEC Adapter driver
-==============================
-
-The pulse8-cec driver implements the following module option:
-
-``persistent_config``
----------------------
-
-By default this is off, but when set to 1 the driver will store the current
-settings to the device's internal eeprom and restore it the next time the
-device is connected to the USB port.
diff --git a/Documentation/admin-guide/media/qcom_camss.rst b/Documentation/admin-guide/media/qcom_camss.rst
index a72e17d09cb7..8a8f3ff40105 100644
--- a/Documentation/admin-guide/media/qcom_camss.rst
+++ b/Documentation/admin-guide/media/qcom_camss.rst
@@ -18,7 +18,7 @@ The driver implements V4L2, Media controller and V4L2 subdev interfaces.
Camera sensor using V4L2 subdev interface in the kernel is supported.
The driver is implemented using as a reference the Qualcomm Camera Subsystem
-driver for Android as found in Code Aurora [#f1]_ [#f2]_.
+driver for Android as found in Code Linaro [#f1]_ [#f2]_.
Qualcomm Camera Subsystem hardware
@@ -181,5 +181,5 @@ Referenced 2018-06-22.
References
----------
-.. [#f1] https://source.codeaurora.org/quic/la/kernel/msm-3.10/
-.. [#f2] https://source.codeaurora.org/quic/la/kernel/msm-3.18/
+.. [#f1] https://git.codelinaro.org/clo/la/kernel/msm-3.10/
+.. [#f2] https://git.codelinaro.org/clo/la/kernel/msm-3.18/
diff --git a/Documentation/admin-guide/media/remote-controller.rst b/Documentation/admin-guide/media/remote-controller.rst
index fa05410c3cd5..188944b00f4f 100644
--- a/Documentation/admin-guide/media/remote-controller.rst
+++ b/Documentation/admin-guide/media/remote-controller.rst
@@ -68,7 +68,7 @@ Using without lircd
Xorg recognizes several IR keycodes that have its numerical value lower
than 247. With the advent of Wayland, the input driver got updated too,
-and should now accept all keycodes. Yet, you may want to just reasign
+and should now accept all keycodes. Yet, you may want to just reassign
the keycodes to something that your favorite media application likes.
This can be done by setting
diff --git a/Documentation/admin-guide/media/rkisp1.dot b/Documentation/admin-guide/media/rkisp1.dot
new file mode 100644
index 000000000000..54c1953a6130
--- /dev/null
+++ b/Documentation/admin-guide/media/rkisp1.dot
@@ -0,0 +1,18 @@
+digraph board {
+ rankdir=TB
+ n00000001 [label="{{<port0> 0 | <port1> 1} | rkisp1_isp\n/dev/v4l-subdev0 | {<port2> 2 | <port3> 3}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000001:port2 -> n00000006:port0
+ n00000001:port2 -> n00000009:port0
+ n00000001:port3 -> n00000014 [style=bold]
+ n00000006 [label="{{<port0> 0} | rkisp1_resizer_mainpath\n/dev/v4l-subdev1 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000006:port1 -> n0000000c [style=bold]
+ n00000009 [label="{{<port0> 0} | rkisp1_resizer_selfpath\n/dev/v4l-subdev2 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000009:port1 -> n00000010 [style=bold]
+ n0000000c [label="rkisp1_mainpath\n/dev/video0", shape=box, style=filled, fillcolor=yellow]
+ n00000010 [label="rkisp1_selfpath\n/dev/video1", shape=box, style=filled, fillcolor=yellow]
+ n00000014 [label="rkisp1_stats\n/dev/video2", shape=box, style=filled, fillcolor=yellow]
+ n00000018 [label="rkisp1_params\n/dev/video3", shape=box, style=filled, fillcolor=yellow]
+ n00000018 -> n00000001:port1 [style=bold]
+ n0000001c [label="{{} | imx219 4-0010\n/dev/v4l-subdev3 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
+ n0000001c:port0 -> n00000001:port0
+}
diff --git a/Documentation/admin-guide/media/rkisp1.rst b/Documentation/admin-guide/media/rkisp1.rst
new file mode 100644
index 000000000000..6f14d9561fa5
--- /dev/null
+++ b/Documentation/admin-guide/media/rkisp1.rst
@@ -0,0 +1,197 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: <isonum.txt>
+
+=========================================
+Rockchip Image Signal Processor (rkisp1)
+=========================================
+
+Introduction
+============
+
+This file documents the driver for the Rockchip ISP1 that is part of RK3288
+and RK3399 SoCs. The driver is located under drivers/media/platform/rockchip/
+rkisp1 and uses the Media-Controller API.
+
+Revisions
+=========
+
+There exist multiple smaller revisions to this ISP that got introduced in
+later SoCs. Revisions can be found in the enum :c:type:`rkisp1_cif_isp_version`
+in the UAPI and the revision of the ISP inside the running SoC can be read
+in the field hw_revision of struct media_device_info as returned by
+ioctl MEDIA_IOC_DEVICE_INFO.
+
+Versions in use are:
+
+- RKISP1_V10: used at least in rk3288 and rk3399
+- RKISP1_V11: declared in the original vendor code, but not used
+- RKISP1_V12: used at least in rk3326 and px30
+- RKISP1_V13: used at least in rk1808
+
+Topology
+========
+.. _rkisp1_topology_graph:
+
+.. kernel-figure:: rkisp1.dot
+ :alt: Diagram of the default media pipeline topology
+ :align: center
+
+
+The driver has 4 video devices:
+
+- rkisp1_mainpath: capture device for retrieving images, usually in higher
+ resolution.
+- rkisp1_selfpath: capture device for retrieving images.
+- rkisp1_stats: a metadata capture device that sends statistics.
+- rkisp1_params: a metadata output device that receives parameter
+  configurations from userspace.
+
+The driver has 3 subdevices:
+
+- rkisp1_resizer_mainpath: used to resize and downsample frames for the
+ mainpath capture device.
+- rkisp1_resizer_selfpath: used to resize and downsample frames for the
+ selfpath capture device.
+- rkisp1_isp: is connected to the sensor and is responsible for all the isp
+ operations.
+
+
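+The resulting topology, including all entities, pads and links, can be
+inspected with ``media-ctl`` (from v4l-utils), for example:
+
+.. code-block:: bash
+
+ # print the media graph of the rkisp1 media device
+ media-ctl -d platform:rkisp1 -p
+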
+rkisp1_mainpath, rkisp1_selfpath - Frames Capture Video Nodes
+-------------------------------------------------------------
+Those are the `mainpath` and `selfpath` capture devices to capture frames.
+Those entities are the DMA engines that write the frames to memory.
+The selfpath video device can capture YUV/RGB formats. Its input is a YUV
+encoded stream and it can convert it to RGB. The selfpath is not able to
+capture bayer formats.
+The mainpath can capture both bayer and YUV formats but it is not able to
+capture RGB formats.
+Both capture videos support
+the ``V4L2_CAP_IO_MC`` :ref:`capability <device-capabilities>`.
+
+
+rkisp1_resizer_mainpath, rkisp1_resizer_selfpath - Resizers Subdevices Nodes
+----------------------------------------------------------------------------
+Those are resizer entities for the mainpath and the selfpath. Those entities
+can scale the frames up and down and also change the YUV sampling (for example
+YUV4:2:2 -> YUV4:2:0). They also have cropping capability on the sink pad.
+The resizer entities can only operate on the YUV 4:2:2 format
+(MEDIA_BUS_FMT_YUYV8_2X8).
+The mainpath capture device supports capturing video in bayer formats. In that
+case the resizer of the mainpath is set to 'bypass' mode - it just forwards the
+frame without operating on it.
+
+rkisp1_isp - Image Signal Processing Subdevice Node
+---------------------------------------------------
+This is the isp entity. It is connected to the sensor on sink pad 0 and
+receives the frames using the CSI-2 protocol. It is responsible for configuring
+the CSI-2 protocol. It has a cropping capability on sink pad 0 that is
+connected to the sensor and on source pad 2 connected to the resizer entities.
+Cropping on sink pad 0 defines the image region from the sensor.
+Cropping on source pad 2 defines the region for the Image Stabilizer (IS).
+
+.. _rkisp1_stats:
+
+rkisp1_stats - Statistics Video Node
+------------------------------------
+The statistics video node outputs to userspace applications the 3A (auto
+focus, auto exposure and auto white balance) statistics, as well as histogram
+statistics, for the frames that are being processed by the rkisp1.
+Using these data, applications can implement algorithms and re-parameterize
+the driver through the rkisp1_params node to improve image quality during a
+video stream.
+The buffer format is defined by struct :c:type:`rkisp1_stat_buffer`, and
+userspace should set
+:ref:`V4L2_META_FMT_RK_ISP1_STAT_3A <v4l2-meta-fmt-rk-isp1-stat-3a>` as the
+dataformat.
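+
+The metadata formats supported by the stats video node can be listed with
+``v4l2-ctl``; the video node number below is taken from the topology graph
+above and may differ on a given system:
+
+.. code-block:: bash
+
+ v4l2-ctl -d /dev/video2 --list-formats-meta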
+
+.. _rkisp1_params:
+
+rkisp1_params - Parameters Video Node
+-------------------------------------
+The rkisp1_params video node receives a set of parameters from userspace
+to be applied to the hardware during a video stream, allowing userspace
+to dynamically modify values such as black level, cross talk corrections
+and others.
+
+The buffer format is defined by struct :c:type:`rkisp1_params_cfg`, and
+userspace should set
+:ref:`V4L2_META_FMT_RK_ISP1_PARAMS <v4l2-meta-fmt-rk-isp1-params>` as the
+dataformat.
+
+
+Capturing Video Frames Example
+==============================
+
+In the following example, the sensor connected to pad 0 of 'rkisp1_isp' is
+imx219.
+
+The following commands can be used to capture video from the selfpath video
+node with dimensions 900x800 in planar YUV 4:2:2 format. It uses all the
+possible cropping capabilities (see the explanation right below).
+
+.. code-block:: bash
+
+ # set the links
+ "media-ctl" "-d" "platform:rkisp1" "-r"
+ "media-ctl" "-d" "platform:rkisp1" "-l" "'imx219 4-0010':0 -> 'rkisp1_isp':0 [1]"
+ "media-ctl" "-d" "platform:rkisp1" "-l" "'rkisp1_isp':2 -> 'rkisp1_resizer_selfpath':0 [1]"
+ "media-ctl" "-d" "platform:rkisp1" "-l" "'rkisp1_isp':2 -> 'rkisp1_resizer_mainpath':0 [0]"
+
+ # set format for imx219 4-0010:0
+ "media-ctl" "-d" "platform:rkisp1" "--set-v4l2" '"imx219 4-0010":0 [fmt:SRGGB10_1X10/1640x1232]'
+
+ # set format for rkisp1_isp pads:
+ "media-ctl" "-d" "platform:rkisp1" "--set-v4l2" '"rkisp1_isp":0 [fmt:SRGGB10_1X10/1640x1232 crop: (0,0)/1600x1200]'
+ "media-ctl" "-d" "platform:rkisp1" "--set-v4l2" '"rkisp1_isp":2 [fmt:YUYV8_2X8/1600x1200 crop: (0,0)/1500x1100]'
+
+ # set format for rkisp1_resizer_selfpath pads:
+ "media-ctl" "-d" "platform:rkisp1" "--set-v4l2" '"rkisp1_resizer_selfpath":0 [fmt:YUYV8_2X8/1500x1100 crop: (300,400)/1400x1000]'
+ "media-ctl" "-d" "platform:rkisp1" "--set-v4l2" '"rkisp1_resizer_selfpath":1 [fmt:YUYV8_2X8/900x800]'
+
+ # set format for rkisp1_selfpath:
+ "v4l2-ctl" "-z" "platform:rkisp1" "-d" "rkisp1_selfpath" "-v" "width=900,height=800,"
+ "v4l2-ctl" "-z" "platform:rkisp1" "-d" "rkisp1_selfpath" "-v" "pixelformat=422P"
+
+ # start streaming:
+ v4l2-ctl "-z" "platform:rkisp1" "-d" "rkisp1_selfpath" "--stream-mmap" "--stream-count" "10"
+
+
+In the above example the sensor is configured to bayer format:
+`SRGGB10_1X10/1640x1232`. The rkisp1_isp:0 pad should be configured to the
+same mbus format and dimensions as the sensor, otherwise streaming will fail
+with an 'EPIPE' error. So it is also configured to `SRGGB10_1X10/1640x1232`.
+In addition, the rkisp1_isp:0 pad is configured to cropping `(0,0)/1600x1200`.
+
+The cropping dimensions are automatically propagated to be the format of the
+isp source pad `rkisp1_isp:2`. Another cropping operation is configured on
+the isp source pad: `(0,0)/1500x1100`.
+
+The resizer's sink pad `rkisp1_resizer_selfpath` should be configured to format
+`YUYV8_2X8/1500x1100` in order to match the format on the other side of the
+link. In addition a cropping `(300,400)/1400x1000` is configured on it.
+
+The source pad of the resizer, `rkisp1_resizer_selfpath:1`, is configured to
+format `YUYV8_2X8/900x800`. That means that the resizer first crops a window
+of `(300,400)/1400x1000` from the received frame and then scales this window
+to dimensions `900x800`.
+
+Note that the above example does not use the stats-params control loop.
+Therefore the captured frames will not go through the 3A algorithms and
+will probably not have good quality, and can even look dark and greenish.
+
+Configuring Quantization
+========================
+
+The driver supports limited and full range quantization on YUV formats,
+where limited is the default.
+To switch between one or the other, userspace should use the Colorspace
+Conversion API (CSC) for subdevices on source pad 2 of the
+isp (`rkisp1_isp:2`). The quantization configured on this pad is the
+quantization of the captured video frames on the mainpath and selfpath
+video nodes.
+Note that the resizer and capture entities will always report
+``V4L2_QUANTIZATION_DEFAULT`` even if the quantization is configured to full
+range on `rkisp1_isp:2`. So in order to get the configured quantization,
+applications should read it from pad `rkisp1_isp:2`.
+
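+As an illustrative sketch only (the subdev node number and the exact option
+names depend on the system and on the v4l-utils version), switching
+`rkisp1_isp:2` to full range quantization and reading it back could look
+like:
+
+.. code-block:: bash
+
+ # rkisp1_isp is /dev/v4l-subdev0 in the topology graph above
+ v4l2-ctl -d /dev/v4l-subdev0 --set-subdev-fmt pad=2,quantization=full-range
+
+ # read back the configured quantization from the same pad
+ v4l2-ctl -d /dev/v4l-subdev0 --get-subdev-fmt pad=2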
diff --git a/Documentation/admin-guide/media/saa7134.rst b/Documentation/admin-guide/media/saa7134.rst
index 7ab9c70b9abe..51eae7eb5ab7 100644
--- a/Documentation/admin-guide/media/saa7134.rst
+++ b/Documentation/admin-guide/media/saa7134.rst
@@ -50,7 +50,8 @@ To build and install, you should run::
Once the new Kernel is booted, saa7134 driver should be loaded automatically.
Depending on the card you might have to pass ``card=<nr>`` as insmod option.
-If so, please check :doc:`saa7134-cardlist` for valid choices.
+If so, please check Documentation/admin-guide/media/saa7134-cardlist.rst
+for valid choices.
Once you have your card type number, you can pass a modules configuration
via a file (usually, it is either ``/etc/modules.conf`` or some file at
diff --git a/Documentation/admin-guide/media/si476x.rst b/Documentation/admin-guide/media/si476x.rst
index 87062301d6a1..c8882ee9f208 100644
--- a/Documentation/admin-guide/media/si476x.rst
+++ b/Documentation/admin-guide/media/si476x.rst
@@ -142,7 +142,7 @@ The drivers exposes following files:
indicator
0x18 lassi Signed Low side adjacent Channel
Strength indicator
- 0x19 hassi ditto fpr High side
+ 0x19 hassi ditto for High side
0x20 mult Multipath indicator
0x21 dev Frequency deviation
0x24 assi Adjacent channel SSI
diff --git a/Documentation/admin-guide/media/siano-cardlist.rst b/Documentation/admin-guide/media/siano-cardlist.rst
index d387c04d753c..bb731a953878 100644
--- a/Documentation/admin-guide/media/siano-cardlist.rst
+++ b/Documentation/admin-guide/media/siano-cardlist.rst
@@ -20,7 +20,7 @@ Siano cards list
- 2040:1801
* - Hauppauge WinTV MiniCard
- 2040:2000, 2040:200a, 2040:2010, 2040:2011, 2040:2019
- * - Hauppauge WinTV MiniCard
+ * - Hauppauge WinTV MiniCard Rev 2
- 2040:2009
* - Hauppauge WinTV MiniStick
- 2040:5500, 2040:5510, 2040:5520, 2040:5530, 2040:5580, 2040:5590, 2040:b900, 2040:b910, 2040:b980, 2040:b990, 2040:c000, 2040:c010, 2040:c080, 2040:c090, 2040:c0a0, 2040:f5a0
diff --git a/Documentation/admin-guide/media/starfive_camss.rst b/Documentation/admin-guide/media/starfive_camss.rst
new file mode 100644
index 000000000000..ca42e9447c47
--- /dev/null
+++ b/Documentation/admin-guide/media/starfive_camss.rst
@@ -0,0 +1,72 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: <isonum.txt>
+
+================================
+Starfive Camera Subsystem driver
+================================
+
+Introduction
+------------
+
+This file documents the driver for the Starfive Camera Subsystem found on the
+Starfive JH7110 SoC. The driver is located under drivers/staging/media/starfive/
+camss.
+
+The driver implements the V4L2, Media controller and v4l2_subdev interfaces.
+Camera sensors using the V4L2 subdev interface in the kernel are supported.
+
+The driver has been successfully used with Gstreamer 1.18.5 and the v4l2src
+plugin.
+
+
+Starfive Camera Subsystem hardware
+----------------------------------
+
+The Starfive Camera Subsystem hardware consists of::
+
+ |\ +---------------+ +-----------+
+ +----------+ | \ | | | |
+ | | | | | | | |
+ | MIPI |----->| |----->| ISP |----->| |
+ | | | | | | | |
+ +----------+ | | | | | Memory |
+ |MUX| +---------------+ | Interface |
+ +----------+ | | | |
+ | | | |---------------------------->| |
+ | Parallel |----->| | | |
+ | | | | | |
+ +----------+ | / | |
+ |/ +-----------+
+
+- MIPI: The MIPI interface, receiving data from a MIPI CSI-2 camera sensor.
+
+- Parallel: The parallel interface, receiving data from a parallel sensor.
+
+- ISP: The ISP, processing raw Bayer data from an image sensor and producing
+ YUV frames.
+
+
+Topology
+--------
+
+The media controller pipeline graph is as follows:
+
+.. _starfive_camss_graph:
+
+.. kernel-figure:: starfive_camss_graph.dot
+ :alt: starfive_camss_graph.dot
+ :align: center
+
+The driver has 2 video devices:
+
+- capture_raw: The capture device, capturing image data directly from a sensor.
+- capture_yuv: The capture device, capturing YUV frame data processed by the
+  ISP module.
+
+The driver has 3 subdevices:
+
+- stf_isp: is responsible for all the isp operations, outputs YUV frames.
+- cdns_csi2rx: a CSI-2 bridge supporting up to 4 CSI lanes in input, and 4
+ different pixel streams in output.
+- imx219: an image sensor, image data is sent through MIPI CSI-2.
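+
+As a starting point, the video and media device nodes created by the driver
+and the media topology can be inspected with v4l-utils; the media device node
+below is an assumption and may differ on a given board:
+
+.. code-block:: bash
+
+ # list the video device nodes created by the driver
+ v4l2-ctl --list-devices
+
+ # print the media graph, assuming the subsystem registered /dev/media0
+ media-ctl -d /dev/media0 -p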
diff --git a/Documentation/admin-guide/media/starfive_camss_graph.dot b/Documentation/admin-guide/media/starfive_camss_graph.dot
new file mode 100644
index 000000000000..8eff1f161ac7
--- /dev/null
+++ b/Documentation/admin-guide/media/starfive_camss_graph.dot
@@ -0,0 +1,12 @@
+digraph board {
+ rankdir=TB
+ n00000001 [label="{{<port0> 0} | stf_isp\n/dev/v4l-subdev0 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000001:port1 -> n00000008 [style=dashed]
+ n00000004 [label="capture_raw\n/dev/video0", shape=box, style=filled, fillcolor=yellow]
+ n00000008 [label="capture_yuv\n/dev/video1", shape=box, style=filled, fillcolor=yellow]
+ n0000000e [label="{{<port0> 0} | cdns_csi2rx.19800000.csi-bridge\n | {<port1> 1 | <port2> 2 | <port3> 3 | <port4> 4}}", shape=Mrecord, style=filled, fillcolor=green]
+ n0000000e:port1 -> n00000001:port0 [style=dashed]
+ n0000000e:port1 -> n00000004 [style=dashed]
+ n00000018 [label="{{} | imx219 6-0010\n/dev/v4l-subdev1 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000018:port0 -> n0000000e:port0 [style=bold]
+}
diff --git a/Documentation/admin-guide/media/tm6000-cardlist.rst b/Documentation/admin-guide/media/tm6000-cardlist.rst
deleted file mode 100644
index 6d2769c0f4d8..000000000000
--- a/Documentation/admin-guide/media/tm6000-cardlist.rst
+++ /dev/null
@@ -1,83 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-TM6000 cards list
-=================
-
-.. tabularcolumns:: |p{1.4cm}|p{11.1cm}|p{4.2cm}|
-
-.. flat-table::
- :header-rows: 1
- :widths: 2 19 18
- :stub-columns: 0
-
- * - Card number
- - Card name
- - USB IDs
-
- * - 0
- - Unknown tm6000 video grabber
- -
-
- * - 1
- - Generic tm5600 board
- - 6000:0001
-
- * - 2
- - Generic tm6000 board
- -
-
- * - 3
- - Generic tm6010 board
- - 6000:0002
-
- * - 4
- - 10Moons UT 821
- -
-
- * - 5
- - 10Moons UT 330
- -
-
- * - 6
- - ADSTECH Dual TV USB
- - 06e1:f332
-
- * - 7
- - Freecom Hybrid Stick / Moka DVB-T Receiver Dual
- - 14aa:0620
-
- * - 8
- - ADSTECH Mini Dual TV USB
- - 06e1:b339
-
- * - 9
- - Hauppauge WinTV HVR-900H / WinTV USB2-Stick
- - 2040:6600, 2040:6601, 2040:6610, 2040:6611
-
- * - 10
- - Beholder Wander DVB-T/TV/FM USB2.0
- - 6000:dec0
-
- * - 11
- - Beholder Voyager TV/FM USB2.0
- - 6000:dec1
-
- * - 12
- - Terratec Cinergy Hybrid XE / Cinergy Hybrid-Stick
- - 0ccd:0086, 0ccd:00A5
-
- * - 13
- - Twinhan TU501(704D1)
- - 13d3:3240, 13d3:3241, 13d3:3243, 13d3:3264
-
- * - 14
- - Beholder Wander Lite DVB-T/TV/FM USB2.0
- - 6000:dec2
-
- * - 15
- - Beholder Voyager Lite TV/FM USB2.0
- - 6000:dec3
-
- * - 16
- - Terratec Grabster AV 150/250 MX
- - 0ccd:0079
diff --git a/Documentation/admin-guide/media/usb-cardlist.rst b/Documentation/admin-guide/media/usb-cardlist.rst
index 546fd40da4c3..5f5ab0723e48 100644
--- a/Documentation/admin-guide/media/usb-cardlist.rst
+++ b/Documentation/admin-guide/media/usb-cardlist.rst
@@ -43,7 +43,6 @@ Driver Name
airspy AirSpy
au0828 Auvitek AU0828
b2c2-flexcop-usb Technisat/B2C2 Air/Sky/Cable2PC USB
-cpia2 CPiA2 Video For Linux
cx231xx Conexant cx231xx USB video capture
dvb-as102 Abilis AS102 DVB receiver
dvb-ttusb-budget Technotrend/Hauppauge Nova - USB devices
@@ -93,15 +92,10 @@ pwc USB Philips Cameras
s2250 Sensoray 2250/2251
s2255drv USB Sensoray 2255 video capture device
smsusb Siano SMS1xxx based MDTV receiver
-stkwebcam USB Syntek DC1125 Camera
-tm6000-alsa TV Master TM5600/6000/6010 audio
-tm6000-dvb DVB Support for tm6000 based TV cards
-tm6000 TV Master TM5600/6000/6010 driver
ttusb_dec Technotrend/Hauppauge USB DEC devices
usbtv USBTV007 video capture
uvcvideo USB Video Class (UVC)
zd1301 ZyDAS ZD1301
-zr364xx USB ZR364XX Camera
====================== =========================================================
.. toctree::
@@ -110,9 +104,7 @@ zr364xx USB ZR364XX Camera
au0828-cardlist
cx231xx-cardlist
em28xx-cardlist
- tm6000-cardlist
siano-cardlist
- usbvision-cardlist
gspca-cardlist
diff --git a/Documentation/admin-guide/media/usbvision-cardlist.rst b/Documentation/admin-guide/media/usbvision-cardlist.rst
deleted file mode 100644
index 6aee115ee6e2..000000000000
--- a/Documentation/admin-guide/media/usbvision-cardlist.rst
+++ /dev/null
@@ -1,283 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-USBvision cards list
-====================
-
-.. tabularcolumns:: |p{1.4cm}|p{11.1cm}|p{4.2cm}|
-
-.. flat-table::
- :header-rows: 1
- :widths: 2 19 18
- :stub-columns: 0
-
- * - Card number
- - Card name
- - USB IDs
-
- * - 0
- - Xanboo
- - 0a6f:0400
-
- * - 1
- - Belkin USB VideoBus II Adapter
- - 050d:0106
-
- * - 2
- - Belkin Components USB VideoBus
- - 050d:0207
-
- * - 3
- - Belkin USB VideoBus II
- - 050d:0208
-
- * - 4
- - echoFX InterView Lite
- - 0571:0002
-
- * - 5
- - USBGear USBG-V1 resp. HAMA USB
- - 0573:0003
-
- * - 6
- - D-Link V100
- - 0573:0400
-
- * - 7
- - X10 USB Camera
- - 0573:2000
-
- * - 8
- - Hauppauge WinTV USB Live (PAL B/G)
- - 0573:2d00
-
- * - 9
- - Hauppauge WinTV USB Live Pro (NTSC M/N)
- - 0573:2d01
-
- * - 10
- - Zoran Co. PMD (Nogatech) AV-grabber Manhattan
- - 0573:2101
-
- * - 11
- - Nogatech USB-TV (NTSC) FM
- - 0573:4100
-
- * - 12
- - PNY USB-TV (NTSC) FM
- - 0573:4110
-
- * - 13
- - PixelView PlayTv-USB PRO (PAL) FM
- - 0573:4450
-
- * - 14
- - ZTV ZT-721 2.4GHz USB A/V Receiver
- - 0573:4550
-
- * - 15
- - Hauppauge WinTV USB (NTSC M/N)
- - 0573:4d00
-
- * - 16
- - Hauppauge WinTV USB (PAL B/G)
- - 0573:4d01
-
- * - 17
- - Hauppauge WinTV USB (PAL I)
- - 0573:4d02
-
- * - 18
- - Hauppauge WinTV USB (PAL/SECAM L)
- - 0573:4d03
-
- * - 19
- - Hauppauge WinTV USB (PAL D/K)
- - 0573:4d04
-
- * - 20
- - Hauppauge WinTV USB (NTSC FM)
- - 0573:4d10
-
- * - 21
- - Hauppauge WinTV USB (PAL B/G FM)
- - 0573:4d11
-
- * - 22
- - Hauppauge WinTV USB (PAL I FM)
- - 0573:4d12
-
- * - 23
- - Hauppauge WinTV USB (PAL D/K FM)
- - 0573:4d14
-
- * - 24
- - Hauppauge WinTV USB Pro (NTSC M/N)
- - 0573:4d2a
-
- * - 25
- - Hauppauge WinTV USB Pro (NTSC M/N) V2
- - 0573:4d2b
-
- * - 26
- - Hauppauge WinTV USB Pro (PAL/SECAM B/G/I/D/K/L)
- - 0573:4d2c
-
- * - 27
- - Hauppauge WinTV USB Pro (NTSC M/N) V3
- - 0573:4d20
-
- * - 28
- - Hauppauge WinTV USB Pro (PAL B/G)
- - 0573:4d21
-
- * - 29
- - Hauppauge WinTV USB Pro (PAL I)
- - 0573:4d22
-
- * - 30
- - Hauppauge WinTV USB Pro (PAL/SECAM L)
- - 0573:4d23
-
- * - 31
- - Hauppauge WinTV USB Pro (PAL D/K)
- - 0573:4d24
-
- * - 32
- - Hauppauge WinTV USB Pro (PAL/SECAM BGDK/I/L)
- - 0573:4d25
-
- * - 33
- - Hauppauge WinTV USB Pro (PAL/SECAM BGDK/I/L) V2
- - 0573:4d26
-
- * - 34
- - Hauppauge WinTV USB Pro (PAL B/G) V2
- - 0573:4d27
-
- * - 35
- - Hauppauge WinTV USB Pro (PAL B/G,D/K)
- - 0573:4d28
-
- * - 36
- - Hauppauge WinTV USB Pro (PAL I,D/K)
- - 0573:4d29
-
- * - 37
- - Hauppauge WinTV USB Pro (NTSC M/N FM)
- - 0573:4d30
-
- * - 38
- - Hauppauge WinTV USB Pro (PAL B/G FM)
- - 0573:4d31
-
- * - 39
- - Hauppauge WinTV USB Pro (PAL I FM)
- - 0573:4d32
-
- * - 40
- - Hauppauge WinTV USB Pro (PAL D/K FM)
- - 0573:4d34
-
- * - 41
- - Hauppauge WinTV USB Pro (Temic PAL/SECAM B/G/I/D/K/L FM)
- - 0573:4d35
-
- * - 42
- - Hauppauge WinTV USB Pro (Temic PAL B/G FM)
- - 0573:4d36
-
- * - 43
- - Hauppauge WinTV USB Pro (PAL/SECAM B/G/I/D/K/L FM)
- - 0573:4d37
-
- * - 44
- - Hauppauge WinTV USB Pro (NTSC M/N FM) V2
- - 0573:4d38
-
- * - 45
- - Camtel Technology USB TV Genie Pro FM Model TVB330
- - 0768:0006
-
- * - 46
- - Digital Video Creator I
- - 07d0:0001
-
- * - 47
- - Global Village GV-007 (NTSC)
- - 07d0:0002
-
- * - 48
- - Dazzle Fusion Model DVC-50 Rev 1 (NTSC)
- - 07d0:0003
-
- * - 49
- - Dazzle Fusion Model DVC-80 Rev 1 (PAL)
- - 07d0:0004
-
- * - 50
- - Dazzle Fusion Model DVC-90 Rev 1 (SECAM)
- - 07d0:0005
-
- * - 51
- - Eskape Labs MyTV2Go
- - 07f8:9104
-
- * - 52
- - Pinnacle Studio PCTV USB (PAL)
- - 2304:010d
-
- * - 53
- - Pinnacle Studio PCTV USB (SECAM)
- - 2304:0109
-
- * - 54
- - Pinnacle Studio PCTV USB (PAL) FM
- - 2304:0110
-
- * - 55
- - Miro PCTV USB
- - 2304:0111
-
- * - 56
- - Pinnacle Studio PCTV USB (NTSC) FM
- - 2304:0112
-
- * - 57
- - Pinnacle Studio PCTV USB (PAL) FM V2
- - 2304:0210
-
- * - 58
- - Pinnacle Studio PCTV USB (NTSC) FM V2
- - 2304:0212
-
- * - 59
- - Pinnacle Studio PCTV USB (PAL) FM V3
- - 2304:0214
-
- * - 60
- - Pinnacle Studio Linx Video input cable (NTSC)
- - 2304:0300
-
- * - 61
- - Pinnacle Studio Linx Video input cable (PAL)
- - 2304:0301
-
- * - 62
- - Pinnacle PCTV Bungee USB (PAL) FM
- - 2304:0419
-
- * - 63
- - Hauppauge WinTv-USB
- - 2400:4200
-
- * - 64
- - Pinnacle Studio PCTV USB (NTSC) FM V3
- - 2304:0113
-
- * - 65
- - Nogatech USB MicroCam NTSC (NV3000N)
- - 0573:3000
-
- * - 66
- - Nogatech USB MicroCam PAL (NV3001P)
- - 0573:3001
diff --git a/Documentation/admin-guide/media/v4l-drivers.rst b/Documentation/admin-guide/media/v4l-drivers.rst
index 251cc4ede0b6..f4bb2605f07e 100644
--- a/Documentation/admin-guide/media/v4l-drivers.rst
+++ b/Documentation/admin-guide/media/v4l-drivers.rst
@@ -11,23 +11,24 @@ Video4Linux (V4L) driver-specific documentation
bttv
cafe_ccic
- cpia2
cx88
- davinci-vpbe
fimc
imx
imx7
ipu3
ivtv
- meye
+ mgb4
omap3isp
omap4_camera
philips
qcom_camss
rcar-fdp1
+ rkisp1
saa7134
si470x
si4713
si476x
+ starfive_camss
vimc
+ visl
vivid
diff --git a/Documentation/admin-guide/media/vimc.dot b/Documentation/admin-guide/media/vimc.dot
index 57863a13fa39..92a5bb631235 100644
--- a/Documentation/admin-guide/media/vimc.dot
+++ b/Documentation/admin-guide/media/vimc.dot
@@ -5,18 +5,22 @@ digraph board {
n00000001 [label="{{} | Sensor A\n/dev/v4l-subdev0 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
n00000001:port0 -> n00000005:port0 [style=bold]
n00000001:port0 -> n0000000b [style=bold]
+ n00000001 -> n00000002
+ n00000002 [label="{{} | Lens A\n/dev/v4l-subdev5 | {<port0>}}", shape=Mrecord, style=filled, fillcolor=green]
n00000003 [label="{{} | Sensor B\n/dev/v4l-subdev1 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
n00000003:port0 -> n00000008:port0 [style=bold]
n00000003:port0 -> n0000000f [style=bold]
+ n00000003 -> n00000004
+ n00000004 [label="{{} | Lens B\n/dev/v4l-subdev6 | {<port0>}}", shape=Mrecord, style=filled, fillcolor=green]
n00000005 [label="{{<port0> 0} | Debayer A\n/dev/v4l-subdev2 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
- n00000005:port1 -> n00000017:port0
+ n00000005:port1 -> n00000015:port0
n00000008 [label="{{<port0> 0} | Debayer B\n/dev/v4l-subdev3 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
- n00000008:port1 -> n00000017:port0 [style=dashed]
+ n00000008:port1 -> n00000015:port0 [style=dashed]
n0000000b [label="Raw Capture 0\n/dev/video0", shape=box, style=filled, fillcolor=yellow]
n0000000f [label="Raw Capture 1\n/dev/video1", shape=box, style=filled, fillcolor=yellow]
- n00000013 [label="RGB/YUV Input\n/dev/video2", shape=box, style=filled, fillcolor=yellow]
- n00000013 -> n00000017:port0 [style=dashed]
- n00000017 [label="{{<port0> 0} | Scaler\n/dev/v4l-subdev4 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
- n00000017:port1 -> n0000001a [style=bold]
- n0000001a [label="RGB/YUV Capture\n/dev/video3", shape=box, style=filled, fillcolor=yellow]
+ n00000013 [label="{{} | RGB/YUV Input\n/dev/v4l-subdev4 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000013:port0 -> n00000015:port0 [style=dashed]
+ n00000015 [label="{{<port0> 0} | Scaler\n/dev/v4l-subdev5 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000015:port1 -> n00000018 [style=bold]
+ n00000018 [label="RGB/YUV Capture\n/dev/video2", shape=box, style=filled, fillcolor=yellow]
}
diff --git a/Documentation/admin-guide/media/vimc.rst b/Documentation/admin-guide/media/vimc.rst
index 211cc8972410..29d843a8ddb1 100644
--- a/Documentation/admin-guide/media/vimc.rst
+++ b/Documentation/admin-guide/media/vimc.rst
@@ -35,11 +35,11 @@ of commands fits for the default topology:
media-ctl -d platform:vimc -V '"Sensor A":0[fmt:SBGGR8_1X8/640x480]'
media-ctl -d platform:vimc -V '"Debayer A":0[fmt:SBGGR8_1X8/640x480]'
- media-ctl -d platform:vimc -V '"Sensor B":0[fmt:SBGGR8_1X8/640x480]'
- media-ctl -d platform:vimc -V '"Debayer B":0[fmt:SBGGR8_1X8/640x480]'
- v4l2-ctl -z platform:vimc -d "RGB/YUV Capture" -v width=1920,height=1440
+ media-ctl -d platform:vimc -V '"Scaler":0[fmt:RGB888_1X24/640x480]'
+ media-ctl -d platform:vimc -V '"Scaler":0[crop:(100,50)/400x150]'
+ media-ctl -d platform:vimc -V '"Scaler":1[fmt:RGB888_1X24/300x700]'
+ v4l2-ctl -z platform:vimc -d "RGB/YUV Capture" -v width=300,height=700
v4l2-ctl -z platform:vimc -d "Raw Capture 0" -v pixelformat=BA81
- v4l2-ctl -z platform:vimc -d "Raw Capture 1" -v pixelformat=BA81
Subdevices
----------
@@ -53,6 +53,25 @@ vimc-sensor:
* 1 Pad source
+vimc-lens:
+ Ancillary lens for a sensor. Supports auto focus control. Linked to
+ a vimc-sensor using an ancillary link. The lens supports FOCUS_ABSOLUTE
+ control.
+
+.. code-block:: bash
+
+ media-ctl -p
+ ...
+ - entity 28: Lens A (0 pad, 0 link)
+ type V4L2 subdev subtype Lens flags 0
+ device node name /dev/v4l-subdev6
+ - entity 29: Lens B (0 pad, 0 link)
+ type V4L2 subdev subtype Lens flags 0
+ device node name /dev/v4l-subdev7
+ v4l2-ctl -d /dev/v4l-subdev7 -C focus_absolute
+ focus_absolute: 0
+
+
vimc-debayer:
Transforms images in bayer format into a non-bayer format.
Exposes:
@@ -61,9 +80,10 @@ vimc-debayer:
* 1 Pad source
vimc-scaler:
- Scale up the image by a factor of 3. E.g.: a 640x480 image becomes a
- 1920x1440 image. (this value can be configured, see at
- `Module options`_).
+  Resize the image to meet the source pad resolution. E.g.: if the sink
+  pad is configured to 360x480 and the source to 1280x720, the image will
+ be stretched to fit the source resolution. Works for any resolution
+ within the vimc limitations (even shrinking the image if necessary).
Exposes:
* 1 Pad sink
@@ -76,15 +96,15 @@ vimc-capture:
* 1 Pad sink
* 1 Pad source
-
Module options
--------------
Vimc has a module parameter to configure the driver.
-* ``sca_mult=<unsigned int>``
+* ``allocator=<unsigned int>``
+
+  Memory allocator selection; the default is 0. It specifies the way buffers
+  will be allocated (see the example below).
- Image size multiplier factor to be used to multiply both width and
- height, so the image size will be ``sca_mult^2`` bigger than the
- original one. Currently, only supports scaling up (the default value
- is 3).
+ - 0: vmalloc
+ - 1: dma-contig
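+
+For instance, a hypothetical invocation selecting the dma-contig allocator
+when loading the module could look like:
+
+.. code-block:: bash
+
+ modprobe vimc allocator=1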
diff --git a/Documentation/admin-guide/media/visl.rst b/Documentation/admin-guide/media/visl.rst
new file mode 100644
index 000000000000..cd45145cde68
--- /dev/null
+++ b/Documentation/admin-guide/media/visl.rst
@@ -0,0 +1,185 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+The Virtual Stateless Decoder Driver (visl)
+===========================================
+
+A virtual stateless decoder device for stateless uAPI development
+purposes.
+
+This tool's objective is to help the development and testing of
+userspace applications that use the V4L2 stateless API to decode media.
+
+A userspace implementation can use visl to run a decoding loop even when
+no hardware is available or when the kernel uAPI for the codec has not
+been upstreamed yet. This can reveal bugs at an early stage.
+
+This driver can also trace the contents of the V4L2 controls submitted
+to it. It can also dump the contents of the vb2 buffers through a
+debugfs interface. This is in many ways similar to the tracing
+infrastructure available for other popular encode/decode APIs out there
+and can help develop a userspace application by using another (working)
+one as a reference.
+
+.. note::
+
+ No actual decoding of video frames is performed by visl. The
+ V4L2 test pattern generator is used to write various debug information
+ to the capture buffers instead.
+
+Module parameters
+-----------------
+
+- visl_debug: Activates debug info, printing various debug messages through
+ dprintk. Also controls whether per-frame debug info is shown. Defaults to off.
+ Note that enabling this feature can result in slow performance through serial.
+
+- visl_transtime_ms: Simulated process time in milliseconds. Slowing down the
+ decoding speed can be useful for debugging.
+
+- visl_dprintk_frame_start, visl_dprintk_frame_nframes: Dictates a range of
+ frames where dprintk is activated. This only controls the dprintk tracing on a
+ per-frame basis. Note that printing a lot of data can be slow through serial.
+
+- keep_bitstream_buffers: Controls whether bitstream (i.e. OUTPUT) buffers are
+ kept after a decoding session. Defaults to false so as to reduce the amount of
+ clutter. keep_bitstream_buffers == false works well when live debugging the
+ client program with GDB.
+
+- bitstream_trace_frame_start, bitstream_trace_nframes: Similar to
+ visl_dprintk_frame_start, visl_dprintk_nframes, but controls the dumping of
+ buffer data through debugfs instead.
+
+- tpg_verbose: Write extra information on each output frame to ease debugging
+ the API. When set to true, the output frames are not stable for a given input
+ as some information like pointers or queue status will be added to them.
+
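+These are regular module parameters; as a sketch (the values below are just
+arbitrary examples), they can be set when loading the module:
+
+.. code-block:: bash
+
+ # enable verbose debugging and simulate a 33 ms decode time per frame
+ $ modprobe visl visl_debug=1 visl_transtime_ms=33
+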
+What is the default use case for this driver?
+---------------------------------------------
+
+This driver can be used as a way to compare different userspace implementations.
+This assumes that a working client is run against visl and that the ftrace and
+OUTPUT buffer data is subsequently used to debug a work-in-progress
+implementation.
+
+Even though no video decoding is actually done, the output frames can be used
+against a reference for a given input, except if tpg_verbose is set to true.
+
+Depending on the tpg_verbose parameter value, information on reference frames,
+their timestamps, the status of the OUTPUT and CAPTURE queues and more can be
+read directly from the CAPTURE buffers.
+
+Supported codecs
+----------------
+
+The following codecs are supported:
+
+- FWHT
+- MPEG2
+- VP8
+- VP9
+- H.264
+- HEVC
+- AV1
+
+visl trace events
+-----------------
+The trace events are defined on a per-codec basis, e.g.:
+
+.. code-block:: bash
+
+ $ ls /sys/kernel/tracing/events/ | grep visl
+ visl_av1_controls
+ visl_fwht_controls
+ visl_h264_controls
+ visl_hevc_controls
+ visl_mpeg2_controls
+ visl_vp8_controls
+ visl_vp9_controls
+
+For example, in order to dump HEVC SPS data:
+
+.. code-block:: bash
+
+ $ echo 1 > /sys/kernel/tracing/events/visl_hevc_controls/v4l2_ctrl_hevc_sps/enable
+
+The SPS data will be dumped to the trace buffer, i.e.:
+
+.. code-block:: bash
+
+ $ cat /sys/kernel/tracing/trace
+ video_parameter_set_id 0
+ seq_parameter_set_id 0
+ pic_width_in_luma_samples 1920
+ pic_height_in_luma_samples 1080
+ bit_depth_luma_minus8 0
+ bit_depth_chroma_minus8 0
+ log2_max_pic_order_cnt_lsb_minus4 4
+ sps_max_dec_pic_buffering_minus1 6
+ sps_max_num_reorder_pics 2
+ sps_max_latency_increase_plus1 0
+ log2_min_luma_coding_block_size_minus3 0
+ log2_diff_max_min_luma_coding_block_size 3
+ log2_min_luma_transform_block_size_minus2 0
+ log2_diff_max_min_luma_transform_block_size 3
+ max_transform_hierarchy_depth_inter 2
+ max_transform_hierarchy_depth_intra 2
+ pcm_sample_bit_depth_luma_minus1 0
+ pcm_sample_bit_depth_chroma_minus1 0
+ log2_min_pcm_luma_coding_block_size_minus3 0
+ log2_diff_max_min_pcm_luma_coding_block_size 0
+ num_short_term_ref_pic_sets 0
+ num_long_term_ref_pics_sps 0
+ chroma_format_idc 1
+ sps_max_sub_layers_minus1 0
+ flags AMP_ENABLED|SAMPLE_ADAPTIVE_OFFSET|TEMPORAL_MVP_ENABLED|STRONG_INTRA_SMOOTHING_ENABLED
+
+
+Dumping OUTPUT buffer data through debugfs
+------------------------------------------
+
+If the **VISL_DEBUGFS** Kconfig is enabled, visl will populate
+**/sys/kernel/debug/visl/bitstream** with OUTPUT buffer data according to the
+values of bitstream_trace_frame_start and bitstream_trace_nframes. This can
+highlight errors as broken clients may fail to fill the buffers properly.
+
+A single file is created for each processed OUTPUT buffer. Its name contains an
+integer that denotes the buffer sequence, i.e.:
+
+.. code-block:: c
+
+ snprintf(name, 32, "bitstream%d", run->src->sequence);
+
+Dumping the values is simply a matter of reading from the file, i.e.:
+
+For the buffer with sequence == 0:
+
+.. code-block:: bash
+
+ $ xxd /sys/kernel/debug/visl/bitstream/bitstream0
+ 00000000: 2601 af04 d088 bc25 a173 0e41 a4f2 3274 &......%.s.A..2t
+ 00000010: c668 cb28 e775 b4ac f53a ba60 f8fd 3aa1 .h.(.u...:.`..:.
+ 00000020: 46b4 bcfc 506c e227 2372 e5f5 d7ea 579f F...Pl.'#r....W.
+ 00000030: 6371 5eb5 0eb8 23b5 ca6a 5de5 983a 19e4 cq^...#..j]..:..
+ 00000040: e8c3 4320 b4ba a226 cbc1 4138 3a12 32d6 ..C ...&..A8:.2.
+ 00000050: fef3 247b 3523 4e90 9682 ac8e eb0c a389 ..${5#N.........
+ 00000060: ddd0 6cfc 0187 0e20 7aae b15b 1812 3d33 ..l.... z..[..=3
+ 00000070: e1c5 f425 a83a 00b7 4f18 8127 3c4c aefb ...%.:..O..'<L..
+
+For the buffer with sequence == 1:
+
+.. code-block:: bash
+
+ $ xxd /sys/kernel/debug/visl/bitstream/bitstream1
+ 00000000: 0201 d021 49e1 0c40 aa11 1449 14a6 01dc ...!I..@...I....
+ 00000010: 7023 889a c8cd 2cd0 13b4 dab0 e8ca 21fe p#....,.......!.
+ 00000020: c4c8 ab4c 486e 4e2f b0df 96cc c74e 8dde ...LHnN/.....N..
+ 00000030: 8ce7 ee36 d880 4095 4d64 30a0 ff4f 0c5e ...6..@.Md0..O.^
+ 00000040: f16b a6a1 d806 ca2a 0ece a673 7bea 1f37 .k.....*...s{..7
+ 00000050: 370f 5bb9 1dc4 ba21 6434 bc53 0173 cba0 7.[....!d4.S.s..
+ 00000060: dfe6 bc99 01ea b6e0 346b 92b5 c8de 9f5d ........4k.....]
+ 00000070: e7cc 3484 1769 fef2 a693 a945 2c8b 31da ..4..i.....E,.1.
+
+And so on.
+
+By default, the files are removed during STREAMOFF. This is to reduce the amount
+of clutter.
diff --git a/Documentation/admin-guide/media/vivid.rst b/Documentation/admin-guide/media/vivid.rst
index 52e57b773f07..b6f658c0997e 100644
--- a/Documentation/admin-guide/media/vivid.rst
+++ b/Documentation/admin-guide/media/vivid.rst
@@ -60,7 +60,7 @@ all configurable using the following module options:
- node_types:
which devices should each driver instance create. An array of
- hexadecimal values, one for each instance. The default is 0x1d3d.
+ hexadecimal values, one for each instance. The default is 0xe1d3d.
Each value is a bitmask with the following meaning:
- bit 0: Video Capture node
@@ -293,6 +293,15 @@ all configurable using the following module options:
- 0: vmalloc
- 1: dma-contig
+- cache_hints:
+
+ specifies if the device should set queues' user-space cache and memory
+ consistency hint capability (V4L2_BUF_CAP_SUPPORTS_MMAP_CACHE_HINTS).
+ The hints are valid only when using MMAP streaming I/O. Default is 0.
+
+ - 0: forbid hints
+ - 1: allow hints
+
Taken together, all these module options allow you to precisely customize
the driver behavior and test your application with all sorts of permutations.
It is also very suitable to emulate hardware that is not yet available, e.g.
@@ -383,7 +392,7 @@ Which one is returned depends on the chosen channel, each next valid channel
will cycle through the possible audio subchannel combinations. This allows
you to test the various combinations by just switching channels..
-Finally, for these inputs the v4l2_timecode struct is filled in in the
+Finally, for these inputs the v4l2_timecode struct is filled in the
dequeued v4l2_buffer struct.
@@ -571,7 +580,7 @@ Metadata Capture
----------------
The Metadata capture generates UVC format metadata. The PTS and SCR are
-transmitted based on the values set in vivid contols.
+transmitted based on the values set in vivid controls.
The Metadata device will only work for the Webcam input, it will give
back an error for all other inputs.
@@ -705,6 +714,20 @@ The Test Pattern Controls are all specific to video capture.
does the same for the EAV (End of Active Video) code.
+- Insert Video Guard Band
+
+ adds 4 columns of pixels with the HDMI Video Guard Band code at the
+ left hand side of the image. This only works with 3 or 4 byte RGB pixel
+ formats. The RGB pixel value 0xab/0x55/0xab turns out to be equivalent
+ to the HDMI Video Guard Band code that precedes each active video line
+ (see section 5.2.2.1 in the HDMI 1.3 Specification). To test if a video
+ receiver has correct HDMI Video Guard Band processing, enable this
+ control and then move the image to the left hand side of the screen.
+ That will result in video lines that start with multiple pixels that
+ have the same value as the Video Guard Band that precedes them.
+ Receivers that will just keep skipping Video Guard Band values will
+  now fail and either lose sync or these video lines will shift.
+
Capture Feature Selection Controls
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -1295,7 +1318,7 @@ instance. This setup would require the following commands:
$ v4l2-ctl -d2 -i2
$ v4l2-ctl -d2 -c horizontal_movement=4
$ v4l2-ctl -d1 --overlay=1
- $ v4l2-ctl -d1 -c loop_video=1
+ $ v4l2-ctl -d0 -c loop_video=1
$ v4l2-ctl -d2 --stream-mmap --overlay=1
And from another console:
diff --git a/Documentation/admin-guide/media/zoran-cardlist.rst b/Documentation/admin-guide/media/zoran-cardlist.rst
new file mode 100644
index 000000000000..d7fc8bed62ff
--- /dev/null
+++ b/Documentation/admin-guide/media/zoran-cardlist.rst
@@ -0,0 +1,51 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Zoran cards list
+================
+
+.. tabularcolumns:: |p{1.4cm}|p{11.1cm}|p{4.2cm}|
+
+.. flat-table::
+ :header-rows: 1
+ :widths: 2 19 18
+ :stub-columns: 0
+
+ * - Card number
+ - Card name
+ - PCI subsystem IDs
+
+ * - 0
+ - DC10(old)
+ - <any>
+
+ * - 1
+ - DC10(new)
+ - <any>
+
+ * - 2
+ - DC10_PLUS
+ - 1031:7efe
+
+ * - 3
+ - DC30
+ - <any>
+
+ * - 4
+ - DC30_PLUS
+ - 1031:d801
+
+ * - 5
+ - LML33
+ - <any>
+
+ * - 6
+ - LML33R10
+ - 12f8:8a02
+
+ * - 7
+ - Buz
+ - 13ca:4231
+
+ * - 8
+ - 6-Eyes
+ - <any>
diff --git a/Documentation/admin-guide/media/zr364xx.rst b/Documentation/admin-guide/media/zr364xx.rst
deleted file mode 100644
index 7291e54b8be3..000000000000
--- a/Documentation/admin-guide/media/zr364xx.rst
+++ /dev/null
@@ -1,102 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Zoran 364xx based USB webcam module
-===================================
-
-site: http://royale.zerezo.com/zr364xx/
-
-mail: royale@zerezo.com
-
-
-Introduction
-------------
-
-
-This brings support under Linux for the Aiptek PocketDV 3300 and similar
-devices in webcam mode. If you just want to get on your PC the pictures
-and movies on the camera, you should use the usb-storage module instead.
-
-The driver works with several other cameras in webcam mode (see the list
-below).
-
-Possible chipsets are : ZR36430 (ZR36430BGC) and
-maybe ZR36431, ZR36440, ZR36442...
-
-You can try the experience changing the vendor/product ID values (look
-at the source code).
-
-You can get these values by looking at /var/log/messages when you plug
-your camera, or by typing : cat /sys/kernel/debug/usb/devices.
-
-
-Install
--------
-
-In order to use this driver, you must compile it with your kernel,
-with the following config options::
-
- ./scripts/config -e USB
- ./scripts/config -m MEDIA_SUPPORT
- ./scripts/config -e MEDIA_USB_SUPPORT
- ./scripts/config -e MEDIA_CAMERA_SUPPORT
- ./scripts/config -m USB_ZR364XX
-
-Usage
------
-
-modprobe zr364xx debug=X mode=Y
-
-- debug : set to 1 to enable verbose debug messages
-- mode : 0 = 320x240, 1 = 160x120, 2 = 640x480
-
-You can then use the camera with V4L2 compatible applications, for
-example Ekiga.
-
-To capture a single image, try this: dd if=/dev/video0 of=test.jpg bs=1M
-count=1
-
-links
------
-
-http://mxhaard.free.fr/ (support for many others cams including some Aiptek PocketDV)
-http://www.harmwal.nl/pccam880/ (this project also supports cameras based on this chipset)
-
-Supported devices
------------------
-
-====== ======= ============== ====================
-Vendor Product Distributor Model
-====== ======= ============== ====================
-0x08ca 0x0109 Aiptek PocketDV 3300
-0x08ca 0x0109 Maxell Maxcam PRO DV3
-0x041e 0x4024 Creative PC-CAM 880
-0x0d64 0x0108 Aiptek Fidelity 3200
-0x0d64 0x0108 Praktica DCZ 1.3 S
-0x0d64 0x0108 Genius Digital Camera (?)
-0x0d64 0x0108 DXG Technology Fashion Cam
-0x0546 0x3187 Polaroid iON 230
-0x0d64 0x3108 Praktica Exakta DC 2200
-0x0d64 0x3108 Genius G-Shot D211
-0x0595 0x4343 Concord Eye-Q Duo 1300
-0x0595 0x4343 Concord Eye-Q Duo 2000
-0x0595 0x4343 Fujifilm EX-10
-0x0595 0x4343 Ricoh RDC-6000
-0x0595 0x4343 Digitrex DSC 1300
-0x0595 0x4343 Firstline FDC 2000
-0x0bb0 0x500d Concord EyeQ Go Wireless
-0x0feb 0x2004 CRS Electronic 3.3 Digital Camera
-0x0feb 0x2004 Packard Bell DSC-300
-0x055f 0xb500 Mustek MDC 3000
-0x08ca 0x2062 Aiptek PocketDV 5700
-0x052b 0x1a18 Chiphead Megapix V12
-0x04c8 0x0729 Konica Revio 2
-0x04f2 0xa208 Creative PC-CAM 850
-0x0784 0x0040 Traveler Slimline X5
-0x06d6 0x0034 Trust Powerc@m 750
-0x0a17 0x0062 Pentax Optio 50L
-0x06d6 0x003b Trust Powerc@m 970Z
-0x0a17 0x004e Pentax Optio 50
-0x041e 0x405d Creative DiVi CAM 516
-0x08ca 0x2102 Aiptek DV T300
-0x06d6 0x003d Trust Powerc@m 910Z
-====== ======= ============== ====================
diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst
index 4e06ffabd78a..7367e6294ef6 100644
--- a/Documentation/admin-guide/mm/cma_debugfs.rst
+++ b/Documentation/admin-guide/mm/cma_debugfs.rst
@@ -5,10 +5,10 @@ CMA Debugfs Interface
The CMA debugfs interface is useful to retrieve basic information out of the
different CMA areas and to test allocation/release in each of the areas.
-Each CMA zone represents a directory under <debugfs>/cma/, indexed by the
-kernel's CMA index. So the first CMA zone would be:
+Each CMA area represents a directory under <debugfs>/cma/, represented by
+its CMA name like below:
- <debugfs>/cma/cma-0
+ <debugfs>/cma/<cma_name>
The structure of the files created under that directory is as follows:
@@ -18,8 +18,8 @@ The structure of the files created under that directory is as follows:
- [RO] bitmap: The bitmap of page states in the zone.
- [WO] alloc: Allocate N pages from that CMA area. For example::
- echo 5 > <debugfs>/cma/cma-2/alloc
+ echo 5 > <debugfs>/cma/<cma_name>/alloc
-would try to allocate 5 pages from the cma-2 area.
+would try to allocate 5 pages from the 'cma_name' area.
- [WO] free: Free N pages from that CMA area, similar to the above.
diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst
index c2531b14bf46..e796b0a7e4a5 100644
--- a/Documentation/admin-guide/mm/concepts.rst
+++ b/Documentation/admin-guide/mm/concepts.rst
@@ -1,5 +1,3 @@
-.. _mm_concepts:
-
=================
Concepts overview
=================
@@ -35,7 +33,7 @@ physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.
With virtual memory, each and every memory access uses a virtual
-address. When the CPU decodes the an instruction that reads (or
+address. When the CPU decodes an instruction that reads (or
writes) from (or to) the system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.
@@ -86,16 +84,15 @@ memory with the huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
the memory and mapped using huge pages. The hugetlbfs is described at
-:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
+Documentation/admin-guide/mm/hugetlbpage.rst.
Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user and hence the
-name. See
-:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
-for more details about THP.
+name. See Documentation/admin-guide/mm/transhuge.rst for more details
+about THP.
Zones
=====
@@ -125,8 +122,8 @@ processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
-:ref:`Documentation/vm/numa.rst <numa>` and in
-:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
+Documentation/mm/numa.rst and in
+Documentation/admin-guide/mm/numa_memory_policy.rst.
Page cache
==========
@@ -184,7 +181,7 @@ pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is free
and allocation requests will be satisfied immediately from the free
pages supply. As the load increases, the amount of the free pages goes
-down and when it reaches a certain threshold (high watermark), an
+down and when it reaches a certain threshold (low watermark), an
allocation request will awaken the ``kswapd`` daemon. It will
asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict to the backing storage
diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst
new file mode 100644
index 000000000000..33d37bb2fb4e
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/index.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+DAMON: Data Access MONitor
+==========================
+
+:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring.
+Using DAMON, users can analyze the memory access patterns of their systems and
+optimize those.
+
+.. toctree::
+ :maxdepth: 2
+
+ start
+ usage
+ reclaim
+ lru_sort
diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst
new file mode 100644
index 000000000000..7b0775d281b4
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/lru_sort.rst
@@ -0,0 +1,294 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+DAMON-based LRU-lists Sorting
+=============================
+
+DAMON-based LRU-lists Sorting (DAMON_LRU_SORT) is a static kernel module that
+is aimed to be used for proactive and lightweight data access pattern based
+(de)prioritization of pages on their LRU-lists, making LRU-lists a more
+trustworthy data access pattern source.
+
+Where Is Proactive LRU-lists Sorting Required?
+==============================================
+
+As page-granularity access checking overhead could be significant on huge
+systems, LRU lists are normally not proactively sorted but partially and
+reactively sorted for special events including specific user requests, system
+calls and memory pressure. As a result, LRU lists are sometimes not so
+perfectly prepared to be used as a trustworthy access pattern source for some
+situations including reclamation target pages selection under sudden memory
+pressure.
+
+Because DAMON can identify access patterns with best-effort accuracy while
+inducing only a user-specified range of overhead, proactively running
+DAMON_LRU_SORT could be helpful for making LRU lists a more trustworthy
+access pattern source with low and controlled overhead.
+
+How Does It Work?
+=================
+
+DAMON_LRU_SORT finds hot pages (pages of memory regions that show access
+rates higher than a user-specified threshold) and cold pages (pages of
+memory regions that show no access for a time longer than a user-specified
+threshold) using DAMON, and prioritizes hot pages while deprioritizing cold
+pages on their LRU-lists. To avoid consuming too much CPU for the
+prioritizations, a CPU time usage limit can be configured. Under the limit,
+it prioritizes and deprioritizes the hotter and the colder pages first,
+respectively. System administrators can also configure under what situation
+this scheme should be automatically activated and deactivated with three
+memory pressure watermarks.
+
+Its default parameters for the hotness/coldness thresholds and the CPU quota
+limit are conservatively chosen. That is, the module under its default
+parameters could be widely used without harm for common situations, while
+providing a level of benefit for systems having clear hot/cold access
+patterns under memory pressure, and consuming only a small and limited
+portion of CPU time.
+
+Interface: Module Parameters
+============================
+
+To use this feature, you should first ensure your system is running on a kernel
+that is built with ``CONFIG_DAMON_LRU_SORT=y``.
+
+To let sysadmins enable or disable it and tune for the given system,
+DAMON_LRU_SORT utilizes module parameters. That is, you can put
+``damon_lru_sort.<parameter>=<value>`` on the kernel boot command line or write
+proper values to ``/sys/module/damon_lru_sort/parameters/<parameter>`` files.
+
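+For example, a parameter can be read back and updated through its sysfs file
+(an illustrative sketch; the value that is read back depends on the system)::
+
+ # cat /sys/module/damon_lru_sort/parameters/hot_thres_access_freq
+ 500
+ # echo 250 > /sys/module/damon_lru_sort/parameters/hot_thres_access_freq
+
+Note that, as described below for ``commit_inputs``, parameters other than
+``enabled`` that are updated while DAMON_LRU_SORT is running take effect only
+after ``commit_inputs`` is set to ``Y``.
+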
+Below are the descriptions of each parameter.
+
+enabled
+-------
+
+Enable or disable DAMON_LRU_SORT.
+
+You can enable DAMON_LRU_SORT by setting the value of this parameter as ``Y``.
+Setting it as ``N`` disables DAMON_LRU_SORT. Note that DAMON_LRU_SORT could do
+no real monitoring and LRU-lists sorting due to the watermarks-based activation
+condition. Refer to the descriptions of the watermark parameters below.
+
+commit_inputs
+-------------
+
+Make DAMON_LRU_SORT read the input parameters again, except ``enabled``.
+
+Input parameters that are updated while DAMON_LRU_SORT is running are not
+applied by default. Once this parameter is set as ``Y``, DAMON_LRU_SORT reads
+the values of the parameters except ``enabled`` again. Once the re-reading is
+done, this parameter is set as ``N``. If invalid parameters are found during
+the re-reading, DAMON_LRU_SORT will be disabled.
+
+hot_thres_access_freq
+---------------------
+
+Access frequency threshold for hot memory regions identification in permil.
+
+If a memory region is accessed at this frequency or higher, DAMON_LRU_SORT
+identifies the region as hot, and marks it as accessed on the LRU list, so
+that it will not be reclaimed under memory pressure. 50% by default.
+
+cold_min_age
+------------
+
+Time threshold for cold memory regions identification in microseconds.
+
+If a memory region is not accessed for this time or longer, DAMON_LRU_SORT
+identifies the region as cold, and marks it as unaccessed on the LRU list, so
+that it can be reclaimed first under memory pressure. 120 seconds by
+default.
+
+quota_ms
+--------
+
+Limit of time for trying the LRU lists sorting in milliseconds.
+
+DAMON_LRU_SORT tries to use only up to this time within a time window
+(quota_reset_interval_ms) for trying LRU lists sorting. This can be used
+for limiting CPU consumption of DAMON_LRU_SORT. If the value is zero, the
+limit is disabled.
+
+10 ms by default.
+
+quota_reset_interval_ms
+-----------------------
+
+The time quota charge reset interval in milliseconds.
+
+The charge reset interval for the quota of time (quota_ms). That is,
+DAMON_LRU_SORT does not try LRU-lists sorting for more than quota_ms
+milliseconds or quota_sz bytes within quota_reset_interval_ms milliseconds.
+
+1 second by default.
+
+wmarks_interval
+---------------
+
+The watermarks check time interval in microseconds.
+
+Minimal time to wait before checking the watermarks, when DAMON_LRU_SORT is
+enabled but inactive due to its watermarks rule. 5 seconds by default.
+
+wmarks_high
+-----------
+
+Free memory rate (per thousand) for the high watermark.
+
+If the free memory rate of the system (per thousand bytes) is higher than
+this, DAMON_LRU_SORT becomes inactive, so it does nothing but periodically
+check the watermarks. 200 (20%) by default.
+
+wmarks_mid
+----------
+
+Free memory rate (per thousand) for the middle watermark.
+
+If the free memory rate of the system (per thousand bytes) is between this and
+the low watermark, DAMON_LRU_SORT becomes active, so it starts the monitoring
+and the LRU-lists sorting. 150 (15%) by default.
+
+wmarks_low
+----------
+
+Free memory rate (per thousand) for the low watermark.
+
+If the free memory rate of the system (per thousand bytes) is lower than this,
+DAMON_LRU_SORT becomes inactive, so it does nothing but periodically check the
+watermarks. 50 (5%) by default.
+
+sample_interval
+---------------
+
+Sampling interval for the monitoring in microseconds.
+
+The sampling interval of DAMON for the cold memory monitoring. Please refer to
+the DAMON documentation (:doc:`usage`) for more detail. 5ms by default.
+
+aggr_interval
+-------------
+
+Aggregation interval for the monitoring in microseconds.
+
+The aggregation interval of DAMON for the cold memory monitoring. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail. 100ms by
+default.
+
+min_nr_regions
+--------------
+
+Minimum number of monitoring regions.
+
+The minimal number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set a lower bound on the monitoring quality.
+But, setting this too high could result in increased monitoring overhead.
+Please refer to the DAMON documentation (:doc:`usage`) for more detail. 10 by
+default.
+
+max_nr_regions
+--------------
+
+Maximum number of monitoring regions.
+
+The maximum number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set an upper bound on the monitoring overhead.
+However, setting this too low could result in bad monitoring quality. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail. 1000 by
+default.
+
+monitor_region_start
+--------------------
+
+Start of target memory region in physical address.
+
+The start physical address of the memory region that DAMON_LRU_SORT will do
+work against. By default, the biggest System RAM is used as the region.
+
+monitor_region_end
+------------------
+
+End of target memory region in physical address.
+
+The end physical address of the memory region that DAMON_LRU_SORT will do
+work against. By default, the biggest System RAM is used as the region.
+
+kdamond_pid
+-----------
+
+PID of the DAMON thread.
+
+If DAMON_LRU_SORT is enabled, this becomes the PID of the worker thread. Else,
+-1.
+
+nr_lru_sort_tried_hot_regions
+-----------------------------
+
+Number of hot memory regions that DAMON_LRU_SORT tried to LRU-sort.
+
+bytes_lru_sort_tried_hot_regions
+--------------------------------
+
+Total bytes of hot memory regions that DAMON_LRU_SORT tried to LRU-sort.
+
+nr_lru_sorted_hot_regions
+-------------------------
+
+Number of hot memory regions that were successfully LRU-sorted.
+
+bytes_lru_sorted_hot_regions
+----------------------------
+
+Total bytes of hot memory regions that were successfully LRU-sorted.
+
+nr_hot_quota_exceeds
+--------------------
+
+Number of times that the time quota limit for hot regions has been exceeded.
+
+nr_lru_sort_tried_cold_regions
+------------------------------
+
+Number of cold memory regions that DAMON_LRU_SORT tried to LRU-sort.
+
+bytes_lru_sort_tried_cold_regions
+---------------------------------
+
+Total bytes of cold memory regions that DAMON_LRU_SORT tried to LRU-sort.
+
+nr_lru_sorted_cold_regions
+--------------------------
+
+Number of cold memory regions that were successfully LRU-sorted.
+
+bytes_lru_sorted_cold_regions
+-----------------------------
+
+Total bytes of cold memory regions that were successfully LRU-sorted.
+
+nr_cold_quota_exceeds
+---------------------
+
+Number of times that the time quota limit for cold regions has been exceeded.
+
+Example
+=======
+
+The example commands below make DAMON_LRU_SORT find memory regions having an
+access frequency of 50% or higher and LRU-prioritize those, while
+LRU-deprioritizing memory regions that have not been accessed for 120 seconds.
+The prioritization and deprioritization are limited to use only up to 1% of
+the CPU time, to avoid DAMON_LRU_SORT consuming too much CPU time for the
+(de)prioritization. The commands also ask DAMON_LRU_SORT to do nothing if the
+system's free memory rate is more than 50%, but to start the real work if it
+becomes lower than 40%. If DAMON_LRU_SORT doesn't make progress and therefore
+the free memory rate becomes lower than 20%, they ask DAMON_LRU_SORT to do
+nothing again, so that we can fall back to the LRU-list based page granularity
+reclamation. ::
+
+ # cd /sys/module/damon_lru_sort/parameters
+ # echo 500 > hot_thres_access_freq
+ # echo 120000000 > cold_min_age
+ # echo 10 > quota_ms
+ # echo 1000 > quota_reset_interval_ms
+ # echo 500 > wmarks_high
+ # echo 400 > wmarks_mid
+ # echo 200 > wmarks_low
+ # echo Y > enabled
diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst
new file mode 100644
index 000000000000..af05ae617018
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/reclaim.rst
@@ -0,0 +1,301 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+DAMON-based Reclamation
+=======================
+
+DAMON-based Reclamation (DAMON_RECLAIM) is a static kernel module that is aimed
+to be used for proactive and lightweight reclamation under light memory
+pressure. It doesn't aim to replace the LRU-list based page-granularity
+reclamation, but to be selectively used for different levels of memory pressure
+and requirements.
+
+Where Is Proactive Reclamation Required?
+========================================
+
+On general memory over-committed systems, proactively reclaiming cold pages
+helps save memory and reduce latency spikes incurred by the direct reclaim
+of the process or the CPU consumption of kswapd, while incurring only minimal
+performance degradation [1]_ [2]_ .
+
+Free Pages Reporting [3]_ based memory over-commit virtualization systems are
+a good example of such cases. In such systems, the guest VMs report their free
+memory to the host, and the host reallocates the reported memory to other
+guests. As a result, the memory of the systems is fully utilized. However,
+the guests could be not so memory-frugal, mainly because some kernel subsystems
+and user-space applications are designed to use as much memory as available.
+Then, guests could report only a small amount of memory as free to the host,
+resulting in a drop of the memory utilization of the systems. Running the
+proactive reclamation in guests could mitigate this problem.
+
+How It Works?
+=============
+
+DAMON_RECLAIM finds memory regions that haven't been accessed for a specific
+time duration and pages them out. To avoid it consuming too much CPU for the
+paging out operation, a speed limit can be configured. Under the speed limit,
+it pages out memory regions that haven't been accessed for a longer time
+first. System administrators can also configure under what situation this
+scheme should be automatically activated and deactivated using three memory
+pressure watermarks.
+
+Interface: Module Parameters
+============================
+
+To use this feature, you should first ensure your system is running on a kernel
+that is built with ``CONFIG_DAMON_RECLAIM=y``.
+
+To let sysadmins enable or disable it and tune for the given system,
+DAMON_RECLAIM utilizes module parameters. That is, you can put
+``damon_reclaim.<parameter>=<value>`` on the kernel boot command line or write
+proper values to ``/sys/module/damon_reclaim/parameters/<parameter>`` files.
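+
+For example, assuming one wants to set the reclamation time quota to 10
+milliseconds (an arbitrary value, used here only for illustration), either of
+the below could be used: the first on the kernel boot command line, the second
+at runtime through sysfs. ::
+
+ damon_reclaim.quota_ms=10
+
+ # echo 10 > /sys/module/damon_reclaim/parameters/quota_ms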
+
+Below is a description of each parameter.
+
+enabled
+-------
+
+Enable or disable DAMON_RECLAIM.
+
+You can enable DAMON_RECLAIM by setting the value of this parameter as ``Y``.
+Setting it as ``N`` disables DAMON_RECLAIM. Note that DAMON_RECLAIM could do
+no real monitoring and reclamation due to the watermarks-based activation
+condition. Refer to the descriptions of the watermarks parameters below for
+this.
+
+commit_inputs
+-------------
+
+Make DAMON_RECLAIM read the input parameters again, except ``enabled``.
+
+Input parameters that are updated while DAMON_RECLAIM is running are not
+applied by default. Once this parameter is set as ``Y``, DAMON_RECLAIM reads
+values of parameters except ``enabled`` again. Once the re-reading is done,
+this parameter is set as ``N``. If invalid parameters are found while the
+re-reading, DAMON_RECLAIM will be disabled.
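+
+For example, a minimal sketch of changing ``min_age`` of an already running
+DAMON_RECLAIM to an illustrative value of 60 seconds and applying it could
+look as below. ::
+
+ # cd /sys/module/damon_reclaim/parameters
+ # echo 60000000 > min_age
+ # echo Y > commit_inputs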
+
+min_age
+-------
+
+Time threshold for cold memory regions identification in microseconds.
+
+If a memory region is not accessed for this or longer time, DAMON_RECLAIM
+identifies the region as cold, and reclaims it.
+
+120 seconds by default.
+
+quota_ms
+--------
+
+Limit of time for the reclamation in milliseconds.
+
+DAMON_RECLAIM tries to use only up to this time within a time window
+(quota_reset_interval_ms) for trying reclamation of cold pages. This can be
+used for limiting CPU consumption of DAMON_RECLAIM. If the value is zero, the
+limit is disabled.
+
+10 ms by default.
+
+quota_sz
+--------
+
+Limit of size of memory for the reclamation in bytes.
+
+DAMON_RECLAIM charges the amount of memory which it tried to reclaim within a
+time window (quota_reset_interval_ms) and makes sure no more than this limit
+is tried. This can be used for limiting consumption of CPU and IO. If this
+value is zero, the limit is disabled.
+
+128 MiB by default.
+
+quota_reset_interval_ms
+-----------------------
+
+The time/size quota charge reset interval in milliseconds.
+
+The charge reset interval for the quota of time (quota_ms) and size
+(quota_sz). That is, DAMON_RECLAIM does not try reclamation for more than
+quota_ms milliseconds or quota_sz bytes within quota_reset_interval_ms
+milliseconds.
+
+1 second by default.
+
+quota_mem_pressure_us
+---------------------
+
+Desired level of memory pressure-stall time in microseconds.
+
+While keeping the caps set by other quotas, DAMON_RECLAIM automatically
+increases and decreases the effective level of the quota, aiming for this
+level of memory pressure to be incurred. System-wide ``some`` memory PSI in
+microseconds per quota reset interval (``quota_reset_interval_ms``) is
+collected and compared to this value to see if the aim is satisfied. A value
+of zero means disabling this auto-tuning feature.
+
+Disabled by default.
+
+quota_autotune_feedback
+-----------------------
+
+User-specifiable feedback for auto-tuning of the effective quota.
+
+While keeping the caps set by other quotas, DAMON_RECLAIM automatically
+increases and decreases the effective level of the quota, aiming to receive
+this feedback at the value of ``10,000`` from the user. DAMON_RECLAIM assumes
+the feedback value and the quota are positively proportional. A value of zero
+means disabling this auto-tuning feature.
+
+Disabled by default.
+
+wmarks_interval
+---------------
+
+Minimal time to wait before checking the watermarks, when DAMON_RECLAIM is
+enabled but inactive due to its watermarks rule.
+
+wmarks_high
+-----------
+
+Free memory rate (per thousand) for the high watermark.
+
+If the free memory of the system, in bytes per thousand bytes, is higher than
+this, DAMON_RECLAIM becomes inactive, so it does nothing but periodically
+check the watermarks.
+
+wmarks_mid
+----------
+
+Free memory rate (per thousand) for the middle watermark.
+
+If the free memory of the system, in bytes per thousand bytes, is between this
+and the low watermark, DAMON_RECLAIM becomes active, so it starts the
+monitoring and the reclaiming.
+
+wmarks_low
+----------
+
+Free memory rate (per thousand) for the low watermark.
+
+If the free memory of the system, in bytes per thousand bytes, is lower than
+this, DAMON_RECLAIM becomes inactive, so it does nothing but periodically
+check the watermarks. In that case, the system falls back to the LRU-list
+based page granularity reclamation logic.
+
+sample_interval
+---------------
+
+Sampling interval for the monitoring in microseconds.
+
+The sampling interval of DAMON for the cold memory monitoring. Please refer to
+the DAMON documentation (:doc:`usage`) for more detail.
+
+aggr_interval
+-------------
+
+Aggregation interval for the monitoring in microseconds.
+
+The aggregation interval of DAMON for the cold memory monitoring. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+min_nr_regions
+--------------
+
+Minimum number of monitoring regions.
+
+The minimal number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set a lower bound on the monitoring quality.
+But, setting this too high could result in increased monitoring overhead.
+Please refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+max_nr_regions
+--------------
+
+Maximum number of monitoring regions.
+
+The maximum number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set an upper bound on the monitoring overhead.
+However, setting this too low could result in bad monitoring quality. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+monitor_region_start
+--------------------
+
+Start of target memory region in physical address.
+
+The start physical address of the memory region that DAMON_RECLAIM will do
+work against. That is, DAMON_RECLAIM will find cold memory regions in this
+region and reclaim them. By default, the biggest System RAM is used as the
+region.
+
+monitor_region_end
+------------------
+
+End of target memory region in physical address.
+
+The end physical address of the memory region that DAMON_RECLAIM will do work
+against. That is, DAMON_RECLAIM will find cold memory regions in this region
+and reclaim them. By default, the biggest System RAM is used as the region.
+
+skip_anon
+---------
+
+Skip anonymous pages reclamation.
+
+If this parameter is set as ``Y``, DAMON_RECLAIM does not reclaim anonymous
+pages. By default, ``N``.
+
+
+kdamond_pid
+-----------
+
+PID of the DAMON thread.
+
+If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread. Else,
+-1.
+
+nr_reclaim_tried_regions
+------------------------
+
+Number of memory regions that DAMON_RECLAIM tried to reclaim.
+
+bytes_reclaim_tried_regions
+---------------------------
+
+Total bytes of memory regions that DAMON_RECLAIM tried to reclaim.
+
+nr_reclaimed_regions
+--------------------
+
+Number of memory regions that DAMON_RECLAIM successfully reclaimed.
+
+bytes_reclaimed_regions
+-----------------------
+
+Total bytes of memory regions that DAMON_RECLAIM successfully reclaimed.
+
+nr_quota_exceeds
+----------------
+
+Number of times that the time/space quota limits have been exceeded.
+
+Example
+=======
+
+The below runtime example commands make DAMON_RECLAIM find memory regions that
+have not been accessed for 30 seconds or more and page them out. The
+reclamation is limited to be done only up to 1 GiB per second to avoid
+DAMON_RECLAIM consuming too much CPU time for the paging out operation. It
+also asks DAMON_RECLAIM to do nothing if the system's free memory rate is more
+than 50%, but start the real work if it becomes lower than 40%. If
+DAMON_RECLAIM doesn't make progress and therefore the free memory rate becomes
+lower than 20%, it asks DAMON_RECLAIM to do nothing again, so that we can fall
+back to the LRU-list based page granularity reclamation. ::
+
+ # cd /sys/module/damon_reclaim/parameters
+ # echo 30000000 > min_age
+ # echo $((1 * 1024 * 1024 * 1024)) > quota_sz
+ # echo 1000 > quota_reset_interval_ms
+ # echo 500 > wmarks_high
+ # echo 400 > wmarks_mid
+ # echo 200 > wmarks_low
+ # echo Y > enabled
+
+.. [1] https://research.google/pubs/pub48551/
+.. [2] https://lwn.net/Articles/787611/
+.. [3] https://www.kernel.org/doc/html/latest/mm/free_page_reporting.html
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst
new file mode 100644
index 000000000000..7aa0071ff1c3
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/start.rst
@@ -0,0 +1,127 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Getting Started
+===============
+
+This document briefly describes how you can use DAMON by demonstrating its
+default user space tool. Please note that this document describes only a part
+of its features for brevity. Please refer to the usage `doc
+<https://github.com/awslabs/damo/blob/next/USAGE.md>`_ of the tool for more
+details.
+
+
+Prerequisites
+=============
+
+Kernel
+------
+
+You should first ensure your system is running on a kernel built with
+``CONFIG_DAMON_*=y``.
+
+
+User Space Tool
+---------------
+
+For the demonstration, we will use the default user space tool for DAMON,
+called DAMON Operator (DAMO). It is available at
+https://github.com/awslabs/damo. The examples below assume that ``damo`` is on
+your ``$PATH``. It's not mandatory, though.
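+
+One way to put ``damo`` on your ``$PATH`` is installing it with ``pip``, as
+sketched below; this assumes a working ``pip3`` and that the package is
+published as ``damo``, so please refer to the tool's own documentation if this
+doesn't apply to your system. ::
+
+ $ sudo pip3 install damo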
+
+Because DAMO is using the sysfs interface (refer to :doc:`usage` for the
+detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is
+mounted.
+
+
+Recording Data Access Patterns
+==============================
+
+The commands below record the memory access patterns of a program and save the
+monitoring results to a file. ::
+
+ $ git clone https://github.com/sjp38/masim
+ $ cd masim; make; ./masim ./configs/zigzag.cfg &
+ $ sudo damo record -o damon.data $(pidof masim)
+
+The first two lines of the commands download an artificial memory access
+generator program and run it in the background. The generator will repeatedly
+access two 100 MiB sized memory regions one by one. You can substitute this
+with your real workload. The last line asks ``damo`` to record the access
+pattern in the ``damon.data`` file.
+
+
+Visualizing Recorded Patterns
+=============================
+
+You can visualize the pattern in a heatmap, showing which memory region
+(x-axis) got accessed when (y-axis) and how frequently (number).::
+
+ $ sudo damo report heats --heatmap stdout
+ 22222222222222222222222222222222222222211111111111111111111111111111111111111100
+ 44444444444444444444444444444444444444434444444444444444444444444444444444443200
+ 44444444444444444444444444444444444444433444444444444444444444444444444444444200
+ 33333333333333333333333333333333333333344555555555555555555555555555555555555200
+ 33333333333333333333333333333333333344444444444444444444444444444444444444444200
+ 22222222222222222222222222222222222223355555555555555555555555555555555555555200
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ 33333333333333333333333333333333333333355555555555555555555555555555555555555200
+ 88888888888888888888888888888888888888600000000000000000000000000000000000000000
+ 88888888888888888888888888888888888888600000000000000000000000000000000000000000
+ 33333333333333333333333333333333333333444444444444444444444444444444444444443200
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ [...]
+ # access_frequency: 0 1 2 3 4 5 6 7 8 9
+ # x-axis: space (139728247021568-139728453431248: 196.848 MiB)
+ # y-axis: time (15256597248362-15326899978162: 1 m 10.303 s)
+ # resolution: 80x40 (2.461 MiB and 1.758 s for each character)
+
+You can also visualize the distribution of the working set size, sorted by the
+size.::
+
+ $ sudo damo report wss --range 0 101 10
+ # <percentile> <wss>
+ # target_id 18446632103789443072
+ # avr: 107.708 MiB
+ 0 0 B | |
+ 10 95.328 MiB |**************************** |
+ 20 95.332 MiB |**************************** |
+ 30 95.340 MiB |**************************** |
+ 40 95.387 MiB |**************************** |
+ 50 95.387 MiB |**************************** |
+ 60 95.398 MiB |**************************** |
+ 70 95.398 MiB |**************************** |
+ 80 95.504 MiB |**************************** |
+ 90 190.703 MiB |********************************************************* |
+ 100 196.875 MiB |***********************************************************|
+
+Using ``--sortby`` option with the above command, you can show how the working
+set size has chronologically changed.::
+
+ $ sudo damo report wss --range 0 101 10 --sortby time
+ # <percentile> <wss>
+ # target_id 18446632103789443072
+ # avr: 107.708 MiB
+ 0 3.051 MiB | |
+ 10 190.703 MiB |***********************************************************|
+ 20 95.336 MiB |***************************** |
+ 30 95.328 MiB |***************************** |
+ 40 95.387 MiB |***************************** |
+ 50 95.332 MiB |***************************** |
+ 60 95.320 MiB |***************************** |
+ 70 95.398 MiB |***************************** |
+ 80 95.398 MiB |***************************** |
+ 90 95.340 MiB |***************************** |
+ 100 95.398 MiB |***************************** |
+
+
+Data Access Pattern Aware Memory Management
+===========================================
+
+The below command makes every memory region of size >=4K that has not been
+accessed for >=60 seconds in your workload be swapped out. ::
+
+ $ sudo damo schemes --damos_access_rate 0 0 --damos_sz_region 4K max \
+ --damos_age 60s max --damos_action pageout \
+ <pid of your workload>
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
new file mode 100644
index 000000000000..6fce035fdbf5
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -0,0 +1,901 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Detailed Usages
+===============
+
+DAMON provides the below interfaces for different users.
+
+- *DAMON user space tool.*
+ `This <https://github.com/awslabs/damo>`_ is for privileged people such as
+ system administrators who want a just-working human-friendly interface.
+ Using this, users can use the DAMON’s major features in a human-friendly way.
+ It may not be highly tuned for special cases, though. For more detail,
+ please refer to its `usage document
+ <https://github.com/awslabs/damo/blob/next/USAGE.md>`_.
+- *sysfs interface.*
+ :ref:`This <sysfs_interface>` is for privileged user space programmers who
+ want more optimized use of DAMON. Using this, users can use DAMON’s major
+ features by reading from and writing to special sysfs files. Therefore,
+ you can write and use your personalized DAMON sysfs wrapper programs that
+ read/write the sysfs files on your behalf. The `DAMON user space tool
+ <https://github.com/awslabs/damo>`_ is one example of such programs.
+- *Kernel Space Programming Interface.*
+ :doc:`This </mm/damon/api>` is for kernel space programmers. Using this,
+ users can utilize every feature of DAMON most flexibly and efficiently by
+ writing kernel space DAMON application programs. You can even extend
+ DAMON for various address spaces. For detail, please refer to the interface
+ :doc:`document </mm/damon/api>`.
+- *debugfs interface. (DEPRECATED!)*
+ :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
+ <sysfs_interface>`. This is deprecated, so users should move to the
+ :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
+ move, please report your usecase to damon@lists.linux.dev and
+ linux-mm@kvack.org.
+
+.. _sysfs_interface:
+
+sysfs Interface
+===============
+
+DAMON sysfs interface is built when ``CONFIG_DAMON_SYSFS`` is defined. It
+creates multiple directories and files under its sysfs directory,
+``<sysfs>/kernel/mm/damon/``. You can control DAMON by writing to and reading
+from the files under the directory.
+
+For a short example, users can monitor the virtual address space of a given
+workload as below. ::
+
+ # cd /sys/kernel/mm/damon/admin/
+ # echo 1 > kdamonds/nr_kdamonds && echo 1 > kdamonds/0/contexts/nr_contexts
+ # echo vaddr > kdamonds/0/contexts/0/operations
+ # echo 1 > kdamonds/0/contexts/0/targets/nr_targets
+ # echo $(pidof <workload>) > kdamonds/0/contexts/0/targets/0/pid_target
+ # echo on > kdamonds/0/state
+
+Files Hierarchy
+---------------
+
+The files hierarchy of DAMON sysfs interface is shown below. In the below
+figure, parent-child relations are represented with indentations, each
+directory has a ``/`` suffix, and files in each directory are separated by
+comma (",").
+
+.. parsed-literal::
+
+ :ref:`/sys/kernel/mm/damon <sysfs_root>`/admin
+ │ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds
+ │ │ :ref:`0 <sysfs_kdamond>`/state,pid
+ │ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts
+ │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations
+ │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/
+ │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
+ │ │ │ │ │ │ nr_regions/min,max
+ │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
+ │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target
+ │ │ │ │ │ │ │ :ref:`regions <sysfs_regions>`/nr_regions
+ │ │ │ │ │ │ │ │ :ref:`0 <sysfs_region>`/start,end
+ │ │ │ │ │ │ │ │ ...
+ │ │ │ │ │ │ ...
+ │ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes
+ │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us
+ │ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/
+ │ │ │ │ │ │ │ │ sz/min,max
+ │ │ │ │ │ │ │ │ nr_accesses/min,max
+ │ │ │ │ │ │ │ │ age/min,max
+ │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes
+ │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
+ │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
+ │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value
+ │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
+ │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
+ │ │ │ │ │ │ │ │ 0/type,matching,memcg_path,addr_start,addr_end,target_idx
+ │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
+ │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
+ │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
+ │ │ │ │ │ │ │ │ ...
+ │ │ │ │ │ │ ...
+ │ │ │ │ ...
+ │ │ ...
+
+.. _sysfs_root:
+
+Root
+----
+
+The root of the DAMON sysfs interface is ``<sysfs>/kernel/mm/damon/``, and it
+has one directory named ``admin``. The directory contains the files for
+privileged user space programs' control of DAMON. User space tools or daemons
+having the root permission could use this directory.
+
+.. _sysfs_kdamonds:
+
+kdamonds/
+---------
+
+Under the ``admin`` directory, there is one directory, ``kdamonds``, which has
+files for controlling the kdamonds (refer to
+:ref:`design <damon_design_execution_model_and_data_structures>` for more
+details). In the beginning, this directory has only one file,
+``nr_kdamonds``. Writing a number (``N``) to the file creates the number of
+child directories named ``0`` to ``N-1``. Each directory represents each
+kdamond.
+
+.. _sysfs_kdamond:
+
+kdamonds/<N>/
+-------------
+
+In each kdamond directory, two files (``state`` and ``pid``) and one directory
+(``contexts``) exist.
+
+Reading ``state`` returns ``on`` if the kdamond is currently running, or
+``off`` if it is not running.
+
+Users can write below commands for the kdamond to the ``state`` file.
+
+- ``on``: Start running.
+- ``off``: Stop running.
+- ``commit``: Read the user inputs in the sysfs files except ``state`` file
+  again (an example sketch follows this list).
+- ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes'
+ :ref:`quota goals <sysfs_schemes_quota_goals>`.
+- ``update_schemes_stats``: Update the contents of stats files for each
+ DAMON-based operation scheme of the kdamond. For details of the stats,
+ please refer to :ref:`stats section <sysfs_schemes_stats>`.
+- ``update_schemes_tried_regions``: Update the DAMON-based operation scheme
+ action tried regions directory for each DAMON-based operation scheme of the
+ kdamond. For details of the DAMON-based operation scheme action tried
+ regions directory, please refer to
+ :ref:`tried_regions section <sysfs_schemes_tried_regions>`.
+- ``update_schemes_tried_bytes``: Update only ``.../tried_regions/total_bytes``
+ files.
+- ``clear_schemes_tried_regions``: Clear the DAMON-based operation scheme
+ action tried regions directory for each DAMON-based operation scheme of the
+ kdamond.
+- ``update_schemes_effective_bytes``: Update the contents of
+ ``effective_bytes`` files for each DAMON-based operation scheme of the
+ kdamond. For more details, refer to :ref:`quotas directory <sysfs_quotas>`.
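+
+For example, a minimal sketch of using the ``commit`` command to change the
+sampling interval of an already running kdamond to an illustrative value of
+10ms could look as below (paths are relative to
+``/sys/kernel/mm/damon/admin``). ::
+
+ # echo 10000 > kdamonds/0/contexts/0/monitoring_attrs/intervals/sample_us
+ # echo commit > kdamonds/0/state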
+
+If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
+
+``contexts`` directory contains files for controlling the monitoring contexts
+that this kdamond will execute.
+
+.. _sysfs_contexts:
+
+kdamonds/<N>/contexts/
+----------------------
+
+In the beginning, this directory has only one file, ``nr_contexts``. Writing a
+number (``N``) to the file creates the number of child directories named as
+``0`` to ``N-1``. Each directory represents each monitoring context (refer to
+:ref:`design <damon_design_execution_model_and_data_structures>` for more
+details). At the moment, only one context per kdamond is supported, so only
+``0`` or ``1`` can be written to the file.
+
+.. _sysfs_context:
+
+contexts/<N>/
+-------------
+
+In each context directory, two files (``avail_operations`` and ``operations``)
+and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
+exist.
+
+DAMON supports multiple types of :ref:`monitoring operations
+<damon_design_configurable_operations_set>`, including those for virtual address
+space and the physical address space. You can get the list of available
+monitoring operations set on the currently running kernel by reading
+``avail_operations`` file. Based on the kernel configuration, the file will
+list different available operation sets. Please refer to the :ref:`design
+<damon_operations_set>` for the list of all available operation sets and their
+brief explanations.
+
+You can set and get what type of monitoring operations DAMON will use for the
+context by writing one of the keywords listed in ``avail_operations`` file and
+reading from the ``operations`` file.
+
+.. _sysfs_monitoring_attrs:
+
+contexts/<N>/monitoring_attrs/
+------------------------------
+
+Files for specifying attributes of the monitoring including required quality
+and efficiency of the monitoring are in ``monitoring_attrs`` directory.
+Specifically, two directories, ``intervals`` and ``nr_regions`` exist in this
+directory.
+
+Under ``intervals`` directory, three files for DAMON's sampling interval
+(``sample_us``), aggregation interval (``aggr_us``), and update interval
+(``update_us``) exist. You can set and get the values in micro-seconds by
+writing to and reading from the files.
+
+Under ``nr_regions`` directory, two files for the lower-bound and upper-bound
+of DAMON's monitoring regions (``min`` and ``max``, respectively), which
+control the monitoring overhead, exist. You can set and get the values by
+writing to and reading from the files.
+
+For more details about the intervals and monitoring regions range, please refer
+to the Design document (:doc:`/mm/damon/design`).
+
+.. _sysfs_targets:
+
+contexts/<N>/targets/
+---------------------
+
+In the beginning, this directory has only one file, ``nr_targets``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each monitoring target.
+
+.. _sysfs_target:
+
+targets/<N>/
+------------
+
+In each target directory, one file (``pid_target``) and one directory
+(``regions``) exist.
+
+If you wrote ``vaddr`` to the ``contexts/<N>/operations``, each target should
+be a process. You can specify the process to DAMON by writing the pid of the
+process to the ``pid_target`` file.
+
+.. _sysfs_regions:
+
+targets/<N>/regions
+-------------------
+
+In case of ``fvaddr`` or ``paddr`` monitoring operations sets, users are
+required to set the monitoring target address ranges. In case of ``vaddr``
+operations set, it is not mandatory, but users can optionally set the initial
+monitoring region to specific address ranges. Please refer to the :ref:`design
+<damon_design_vaddr_target_regions_construction>` for more details.
+
+For such cases, users can explicitly set the initial monitoring target regions
+as they want, by writing proper values to the files under this directory.
+
+In the beginning, this directory has only one file, ``nr_regions``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each initial monitoring target region.
+
+.. _sysfs_region:
+
+regions/<N>/
+------------
+
+In each region directory, you will find two files (``start`` and ``end``). You
+can set and get the start and end addresses of the initial monitoring target
+region by writing to and reading from the files, respectively.
+
+Each region should not overlap with others. ``end`` of directory ``N`` should
+be equal to or smaller than ``start`` of directory ``N+1``.
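+
+For example, a minimal sketch of setting two initial monitoring regions for
+the first monitoring target of the first context could look as below; the
+addresses are hypothetical, only for illustration. ::
+
+ # cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/targets/0/regions
+ # echo 2 > nr_regions
+ # echo $((4 * 1024 * 1024 * 1024)) > 0/start
+ # echo $((5 * 1024 * 1024 * 1024)) > 0/end
+ # echo $((6 * 1024 * 1024 * 1024)) > 1/start
+ # echo $((7 * 1024 * 1024 * 1024)) > 1/end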
+
+.. _sysfs_schemes:
+
+contexts/<N>/schemes/
+---------------------
+
+The directory for DAMON-based Operation Schemes (:ref:`DAMOS
+<damon_design_damos>`). Users can get and set the schemes by reading from and
+writing to files under this directory.
+
+In the beginning, this directory has only one file, ``nr_schemes``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each DAMON-based operation scheme.
+
+.. _sysfs_scheme:
+
+schemes/<N>/
+------------
+
+In each scheme directory, six directories (``access_pattern``, ``quotas``,
+``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files
+(``action`` and ``apply_interval_us``) exist.
+
+The ``action`` file is for setting and getting the scheme's :ref:`action
+<damon_design_damos_action>`. The keywords that can be written to and read
+from the file and their meaning are the same as those of the list on
+:ref:`design doc <damon_design_damos_action>`.
+
+The ``apply_interval_us`` file is for setting and getting the scheme's
+:ref:`apply_interval <damon_design_damos>` in microseconds.
+
+.. _sysfs_access_pattern:
+
+schemes/<N>/access_pattern/
+---------------------------
+
+The directory for the target access :ref:`pattern
+<damon_design_damos_access_pattern>` of the given DAMON-based operation scheme.
+
+Under the ``access_pattern`` directory, three directories (``sz``,
+``nr_accesses``, and ``age``) each having two files (``min`` and ``max``)
+exist. You can set and get the access pattern for the given scheme by writing
+to and reading from the ``min`` and ``max`` files under ``sz``,
+``nr_accesses``, and ``age`` directories, respectively. Note that the ``min``
+and the ``max`` form a closed interval.
+
+.. _sysfs_quotas:
+
+schemes/<N>/quotas/
+-------------------
+
+The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given
+DAMON-based operation scheme.
+
+Under ``quotas`` directory, four files (``ms``, ``bytes``,
+``reset_interval_ms``, ``effective_bytes``) and two directories (``weights``
+and ``goals``) exist.
+
+You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
+``reset interval`` in milliseconds by writing the values to the three files,
+respectively. Then, DAMON tries to use only up to ``time quota`` milliseconds
+for applying the ``action`` to memory regions of the ``access_pattern``, and to
+apply the action to only up to ``bytes`` bytes of memory regions within the
+``reset_interval_ms``. Setting both ``ms`` and ``bytes`` zero disables the
+quota limits unless at least one :ref:`goal <sysfs_schemes_quota_goals>` is
+set.
+
+The time quota is internally transformed to a size quota. Between the
+transformed size quota and the user-specified size quota, the smaller one is
+applied.
+Based on the user-specified :ref:`goal <sysfs_schemes_quota_goals>`, the
+effective size quota is further adjusted. Reading ``effective_bytes`` returns
+the current effective size quota. The file is not updated in real time, so
+users should ask DAMON sysfs interface to update the content of the file for
+the stats by writing a special keyword, ``update_schemes_effective_bytes`` to
+the relevant ``kdamonds/<N>/state`` file.
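+
+For example, a minimal sketch of checking the current effective size quota of
+the first scheme of the first kdamond could look as below. ::
+
+ # echo update_schemes_effective_bytes > /sys/kernel/mm/damon/admin/kdamonds/0/state
+ # cat /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/quotas/effective_bytes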
+
+Under ``weights`` directory, three files (``sz_permil``,
+``nr_accesses_permil``, and ``age_permil``) exist.
+You can set the :ref:`prioritization weights
+<damon_design_damos_quotas_prioritization>` for size, access frequency, and age
+in per-thousand unit by writing the values to the three files under the
+``weights`` directory.
+
+.. _sysfs_schemes_quota_goals:
+
+schemes/<N>/quotas/goals/
+-------------------------
+
+The directory for the :ref:`automatic quota tuning goals
+<damon_design_damos_quotas_auto_tuning>` of the given DAMON-based operation
+scheme.
+
+In the beginning, this directory has only one file, ``nr_goals``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each goal and current achievement.
+When multiple goals are given, the best feedback among them is used.
+
+Each goal directory contains three files, namely ``target_metric``,
+``target_value`` and ``current_value``. Users can set and get the three
+parameters for the quota auto-tuning goals that are specified in the
+:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and
+reading from each of the files. Note that users should further write
+``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond
+directory <sysfs_kdamond>` to pass the feedback to DAMON.
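+
+As a minimal sketch, setting a single goal and passing a feedback value could
+look as below; the ``user_input`` metric keyword and the numbers are
+assumptions for illustration only, so please check the design doc for the
+actually supported ``target_metric`` keywords. ::
+
+ # cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/quotas/goals
+ # echo 1 > nr_goals
+ # # 'user_input' is an assumed metric keyword; see the design doc for valid ones
+ # echo user_input > 0/target_metric
+ # echo 10000 > 0/target_value
+ # echo 9000 > 0/current_value
+ # # let the running kdamond read the goal
+ # echo commit_schemes_quota_goals > /sys/kernel/mm/damon/admin/kdamonds/0/state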
+
+.. _sysfs_watermarks:
+
+schemes/<N>/watermarks/
+-----------------------
+
+The directory for the :ref:`watermarks <damon_design_damos_watermarks>` of the
+given DAMON-based operation scheme.
+
+Under the watermarks directory, five files (``metric``, ``interval_us``,
+``high``, ``mid``, and ``low``) for setting the metric, the time interval
+between checks of the metric, and the three watermarks exist. You can set and
+get the five values by writing to and reading from the files, respectively.
+
+Keywords and meanings of those that can be written to the ``metric`` file are
+as below.
+
+ - none: Ignore the watermarks
+ - free_mem_rate: System's free memory rate (per thousand)
+
+The ``interval_us`` value should be written in microseconds.
+
+.. _sysfs_filters:
+
+schemes/<N>/filters/
+--------------------
+
+The directory for the :ref:`filters <damon_design_damos_filters>` of the given
+DAMON-based operation scheme.
+
+In the beginning, this directory has only one file, ``nr_filters``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each filter. The filters are evaluated
+in the numeric order.
+
+Each filter directory contains six files, namely ``type``, ``matching``,
+``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To the
+``type`` file, you can write one of four special keywords: ``anon`` for
+anonymous pages, ``memcg`` for a specific memory cgroup, ``addr`` for a
+specific address range (an open-ended interval), or ``target`` for a specific
+DAMON monitoring target filtering. In case of the memory cgroup filtering,
+you can specify the memory cgroup of interest by writing the path of the
+memory cgroup from the cgroups mount point to the ``memcg_path`` file. In
+case of the address range filtering, you can specify the start and end
+addresses of the range to the ``addr_start`` and ``addr_end`` files,
+respectively. For the DAMON monitoring target filtering, you can specify the
+index of the target in the DAMON context's monitoring targets list to the
+``target_idx`` file. You can write ``Y`` or ``N`` to the ``matching`` file to
+filter out pages that do or do not match the type, respectively. Then, the
+scheme's action will not be applied to the pages that are specified to be
+filtered out.
+
+For example, below restricts a DAMOS action to be applied to only non-anonymous
+pages of all memory cgroups except ``/having_care_already``.::
+
+ # echo 2 > nr_filters
+ # # filter out anonymous pages
+ echo anon > 0/type
+ echo Y > 0/matching
+ # # further filter out all cgroups except one at '/having_care_already'
+ echo memcg > 1/type
+ echo /having_care_already > 1/memcg_path
+ echo N > 1/matching
+
+Note that ``anon`` and ``memcg`` filters are currently supported only when
+``paddr`` :ref:`implementation <sysfs_context>` is being used.
+
+Also, memory regions that are filtered out by ``addr`` or ``target`` filters
+are not counted as the scheme has tried to be applied to them, while regions
+that are filtered out by other types of filters are counted as such. The
+difference is applied to :ref:`stats <damos_stats>` and
+:ref:`tried regions <sysfs_schemes_tried_regions>`.
+
+.. _sysfs_schemes_stats:
+
+schemes/<N>/stats/
+------------------
+
+DAMON counts the total number and bytes of regions that each scheme was tried
+to be applied to, the two numbers for the regions that each scheme was
+successfully applied to, and the total number of times the quota limit was
+exceeded. These statistics can be used for online analysis or tuning of the
+schemes.
+
+The statistics can be retrieved by reading the files under ``stats`` directory
+(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, and
+``qt_exceeds``), respectively. The files are not updated in real time, so you
+should ask DAMON sysfs interface to update the content of the files for the
+stats by writing a special keyword, ``update_schemes_stats`` to the relevant
+``kdamonds/<N>/state`` file.
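+
+For example, a minimal sketch of updating and reading the statistics of the
+first scheme of the first kdamond could look as below. ::
+
+ # cd /sys/kernel/mm/damon/admin/kdamonds/0
+ # echo update_schemes_stats > state
+ # grep . contexts/0/schemes/0/stats/*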
+
+.. _sysfs_schemes_tried_regions:
+
+schemes/<N>/tried_regions/
+--------------------------
+
+This directory initially has one file, ``total_bytes``.
+
+When a special keyword, ``update_schemes_tried_regions``, is written to the
+relevant ``kdamonds/<N>/state`` file, DAMON updates the ``total_bytes`` file so
+that reading it returns the total size of the scheme tried regions, and creates
+directories named with integers starting from ``0`` under this directory. Each
+directory contains files exposing detailed information about each of the memory
+regions that the corresponding scheme's ``action`` has tried to be applied to
+during the next :ref:`apply interval <damon_design_damos>` of the corresponding
+scheme. The information includes the address range, ``nr_accesses``, and
+``age`` of the region.
+
+Writing ``update_schemes_tried_bytes`` to the relevant ``kdamonds/<N>/state``
+file will only update the ``total_bytes`` file, and will not create the
+subdirectories.
+
+The directories will be removed when another special keyword,
+``clear_schemes_tried_regions``, is written to the relevant
+``kdamonds/<N>/state`` file.
+
+The expected usage of this directory is investigation of the schemes'
+behaviors, and query-like efficient retrieval of data access monitoring
+results. For the latter use case, in particular, users can set the ``action``
+as ``stat`` and set the ``access pattern`` as the pattern that they want to
+query.
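+
+For example, a minimal sketch of taking and reading such a snapshot for the
+first scheme of the first kdamond could look as below. ::
+
+ # cd /sys/kernel/mm/damon/admin/kdamonds/0
+ # echo update_schemes_tried_regions > state
+ # cat contexts/0/schemes/0/tried_regions/total_bytes
+ # cat contexts/0/schemes/0/tried_regions/0/start
+ # cat contexts/0/schemes/0/tried_regions/0/nr_accesses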
+
+.. _sysfs_schemes_tried_region:
+
+tried_regions/<N>/
+------------------
+
+In each region directory, you will find four files (``start``, ``end``,
+``nr_accesses``, and ``age``). Reading the files will show the start and end
+addresses, ``nr_accesses``, and ``age`` of the region that the corresponding
+DAMON-based operation scheme's ``action`` has tried to be applied to.
+
+Example
+~~~~~~~
+
+The below commands apply a scheme saying "If a memory region of size in [4KiB,
+8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
+interval in [10, 20], page out the region. For the paging out, use only up to
+10ms per second, and also don't page out more than 1GiB per second. Under the
+limitation, page out memory regions having longer age first. Also, check the
+free memory rate of the system every 5 seconds, start the monitoring and paging
+out when the free memory rate becomes lower than 50%, but stop it if the free
+memory rate becomes larger than 60%, or lower than 30%". ::
+
+ # cd <sysfs>/kernel/mm/damon/admin
+ # # populate directories
+ # echo 1 > kdamonds/nr_kdamonds; echo 1 > kdamonds/0/contexts/nr_contexts;
+ # echo 1 > kdamonds/0/contexts/0/schemes/nr_schemes
+ # cd kdamonds/0/contexts/0/schemes/0
+ # # set the basic access pattern and the action
+ # echo 4096 > access_pattern/sz/min
+ # echo 8192 > access_pattern/sz/max
+ # echo 0 > access_pattern/nr_accesses/min
+ # echo 5 > access_pattern/nr_accesses/max
+ # echo 10 > access_pattern/age/min
+ # echo 20 > access_pattern/age/max
+ # echo pageout > action
+ # # set quotas
+ # echo 10 > quotas/ms
+ # echo $((1024*1024*1024)) > quotas/bytes
+ # echo 1000 > quotas/reset_interval_ms
+ # # set watermark
+ # echo free_mem_rate > watermarks/metric
+ # echo 5000000 > watermarks/interval_us
+ # echo 600 > watermarks/high
+ # echo 500 > watermarks/mid
+ # echo 300 > watermarks/low
+
+Please note that it's highly recommended to use user space tools like `damo
+<https://github.com/awslabs/damo>`_ rather than manually reading and writing
+the files as above. The above is only an example.
+
+.. _tracepoint:
+
+Tracepoints for Monitoring Results
+==================================
+
+Users can get the monitoring results via the :ref:`tried_regions
+<sysfs_schemes_tried_regions>` directory. The interface is useful for getting
+a snapshot, but it could be inefficient for fully recording all the monitoring
+results. For that purpose, two trace points, namely ``damon:damon_aggregated``
+and ``damon:damos_before_apply``, are provided. ``damon:damon_aggregated``
+provides the whole monitoring results, while ``damon:damos_before_apply``
+provides the monitoring results for the regions that each DAMON-based
+Operation Scheme (:ref:`DAMOS <damon_design_damos>`) is about to be applied
+to. Hence, ``damon:damos_before_apply`` is more useful for recording the
+internal behavior of DAMOS, or for query-like, efficient recording of the
+monitoring results for the DAMOS target access
+:ref:`pattern <damon_design_damos_access_pattern>`.
+
+While the monitoring is turned on, you could record the tracepoint events and
+show results using tracepoint supporting tools like ``perf``. For example::
+
+ # echo on > kdamonds/0/state
+ # perf record -e damon:damon_aggregated &
+ # sleep 5
+ # kill -9 $(pidof perf)
+ # echo off > kdamonds/0/state
+ # perf script
+ kdamond.0 46568 [027] 79357.842179: damon:damon_aggregated: target_id=0 nr_regions=11 122509119488-135708762112: 0 864
+ [...]
+
+Each line of the perf script output represents each monitoring region. The
+first five fields are as usual for other tracepoint outputs. The sixth field
+(``target_id=X``) shows the id of the monitoring target of the region. The
+seventh field (``nr_regions=X``) shows the total number of monitoring regions
+for the target. The eighth field (``X-Y:``) shows the start (``X``) and end
+(``Y``) addresses of the region in bytes. The ninth field (``X``) shows the
+``nr_accesses`` of the region (refer to
+:ref:`design <damon_design_region_based_sampling>` for more details of the
+counter). Finally the tenth field (``X``) shows the ``age`` of the region
+(refer to :ref:`design <damon_design_age_tracking>` for more details of the
+counter).
+
+If the event was ``damon:damos_before_apply``, the ``perf script`` output would
+be somewhat like below::
+
+ kdamond.0 47293 [000] 80801.060214: damon:damos_before_apply: ctx_idx=0 scheme_idx=0 target_idx=0 nr_regions=11 121932607488-135128711168: 0 136
+ [...]
+
+Each line of the output represents each monitoring region that a DAMON-based
+Operation Scheme was about to be applied to at the traced time. The first five
+fields are as usual. It shows the index of the DAMON context (``ctx_idx=X``)
+of the scheme in the list of the contexts of the context's kdamond, the index
+of the scheme (``scheme_idx=X``) in the list of the schemes of the context, in
+addition to the output of ``damon_aggregated`` tracepoint.
+
+
+.. _debugfs_interface:
+
+debugfs Interface (DEPRECATED!)
+===============================
+
+.. note::
+
+ THIS IS DEPRECATED!
+
+ DAMON debugfs interface is deprecated, so users should move to the
+ :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
+ move, please report your usecase to damon@lists.linux.dev and
+ linux-mm@kvack.org.
+
+DAMON exports nine files, ``DEPRECATED``, ``attrs``, ``target_ids``,
+``init_regions``, ``schemes``, ``monitor_on_DEPRECATED``, ``kdamond_pid``,
+``mk_contexts`` and ``rm_contexts`` under its debugfs directory,
+``<debugfs>/damon/``.
+
+
+``DEPRECATED`` is a read-only file for the DAMON debugfs interface deprecation
+notice. Reading it returns the deprecation notice, as below::
+
+ # cat DEPRECATED
+ DAMON debugfs interface is deprecated, so users should move to DAMON_SYSFS. If you cannot, please report your usecase to damon@lists.linux.dev and linux-mm@kvack.org.
+
+
+Attributes
+----------
+
+Users can get and set the ``sampling interval``, ``aggregation interval``,
+``update interval``, and min/max number of monitoring target regions by
+reading from and writing to the ``attrs`` file. To know about the monitoring
+attributes in detail, please refer to the :doc:`/mm/damon/design`. For
+example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and
+1000, and then check it again::
+
+ # cd <debugfs>/damon
+ # echo 5000 100000 1000000 10 1000 > attrs
+ # cat attrs
+ 5000 100000 1000000 10 1000
+
+
+Target IDs
+----------
+
+Some types of address spaces support multiple monitoring targets. For example,
+the virtual memory address spaces monitoring can have multiple processes as the
+monitoring targets. Users can set the targets by writing relevant id values of
+the targets to, and get the ids of the current targets by reading from the
+``target_ids`` file. In case of the virtual address spaces monitoring, the
+values should be pids of the monitoring target processes. For example, below
+commands set processes having pids 42 and 4242 as the monitoring targets and
+check it again::
+
+ # cd <debugfs>/damon
+ # echo 42 4242 > target_ids
+ # cat target_ids
+ 42 4242
+
+Users can also monitor the physical memory address space of the system by
+writing a special keyword, "``paddr\n``" to the file. Because physical address
+space monitoring doesn't support multiple targets, reading the file will show a
+fake value, ``42``, as below::
+
+ # cd <debugfs>/damon
+ # echo paddr > target_ids
+ # cat target_ids
+ 42
+
+Note that setting the target ids doesn't start the monitoring.
+
+
+Initial Monitoring Target Regions
+---------------------------------
+
+In case of the virtual address space monitoring, DAMON automatically sets and
+updates the monitoring target regions so that entire memory mappings of target
+processes can be covered. However, users may want to limit the monitoring
+region to specific address ranges, such as the heap, the stack, or a specific
+file-mapped area. Or, some users may know the initial access pattern of their
+workloads and therefore want to set optimal initial regions for the 'adaptive
+regions adjustment'.
+
+In contrast, DAMON does not automatically set and update the monitoring target
+regions in case of physical memory monitoring. Therefore, users should set the
+monitoring target regions by themselves.
+
+In such cases, users can explicitly set the initial monitoring target regions
+as they want, by writing proper values to the ``init_regions`` file. The input
+should be a sequence of three integers separated by white spaces that represent
+one region in below form.::
+
+ <target idx> <start address> <end address>
+
+The ``target idx`` should be the index of the target in ``target_ids`` file,
+starting from ``0``, and the regions should be passed in address order. For
+example, below commands will set a couple of address ranges, ``1-100`` and
+``100-200`` as the initial monitoring target region of pid 42, which is the
+first one (index ``0``) in ``target_ids``, and another couple of address
+ranges, ``20-40`` and ``50-100`` as that of pid 4242, which is the second one
+(index ``1``) in ``target_ids``.::
+
+ # cd <debugfs>/damon
+ # cat target_ids
+ 42 4242
+ # echo "0 1 100 \
+ 0 100 200 \
+ 1 20 40 \
+ 1 50 100" > init_regions
+
+Note that this sets the initial monitoring target regions only. In case of
+virtual memory monitoring, DAMON will automatically update the boundary of the
+regions after one ``update interval``. Therefore, users should set the
+``update interval`` large enough in this case, if they don't want the
+update.
+
+
+Schemes
+-------
+
+Users can get and set the DAMON-based operation :ref:`schemes
+<damon_design_damos>` by reading from and writing to ``schemes`` debugfs file.
+Reading the file also shows the statistics of each scheme. To the file, each
+of the schemes should be represented in each line in below form::
+
+ <target access pattern> <action> <quota> <watermarks>
+
+You can disable schemes by simply writing an empty string to the file.
+
+Target Access Pattern
+~~~~~~~~~~~~~~~~~~~~~
+
+The target access :ref:`pattern <damon_design_damos_access_pattern>` of the
+scheme. The ``<target access pattern>`` is constructed with three ranges in
+below form::
+
+ min-size max-size min-acc max-acc min-age max-age
+
+Specifically, bytes for the size of regions (``min-size`` and ``max-size``),
+number of monitored accesses per aggregate interval for access frequency
+(``min-acc`` and ``max-acc``), number of aggregate intervals for the age of
+regions (``min-age`` and ``max-age``) are specified. Note that the ranges are
+closed interval.
+
+Action
+~~~~~~
+
+The ``<action>`` is a predefined integer for memory management :ref:`actions
+<damon_design_damos_action>`. The mapping between the ``<action>`` values and
+the memory management actions is as below. For the detailed meaning of the
+action and DAMON operations set supporting each action, please refer to the
+list on :ref:`design doc <damon_design_damos_action>`.
+
+ - 0: ``willneed``
+ - 1: ``cold``
+ - 2: ``pageout``
+ - 3: ``hugepage``
+ - 4: ``nohugepage``
+ - 5: ``stat``
+
+Quota
+~~~~~
+
+Users can set the :ref:`quotas <damon_design_damos_quotas>` of the given scheme
+via the ``<quota>`` in below form::
+
+ <ms> <sz> <reset interval> <priority weights>
+
+This makes DAMON to try to use only up to ``<ms>`` milliseconds for applying
+the action to memory regions of the ``target access pattern`` within the
+``<reset interval>`` milliseconds, and to apply the action to only up to
+``<sz>`` bytes of memory regions within the ``<reset interval>``. Setting both
+``<ms>`` and ``<sz>`` zero disables the quota limits.
+
+For the :ref:`prioritization <damon_design_damos_quotas_prioritization>`, users
+can set the weights for the three properties in ``<priority weights>`` in below
+form::
+
+ <size weight> <access frequency weight> <age weight>
+
+Watermarks
+~~~~~~~~~~
+
+Users can specify :ref:`watermarks <damon_design_damos_watermarks>` of the
+given scheme via ``<watermarks>`` in below form::
+
+ <metric> <check interval> <high mark> <middle mark> <low mark>
+
+``<metric>`` is a predefined integer for the metric to be checked. The
+supported numbers and their meanings are as below.
+
+ - 0: Ignore the watermarks
+ - 1: System's free memory rate (per thousand)
+
+The value of the metric is checked every ``<check interval>`` microseconds.
+
+If the value is higher than ``<high mark>`` or lower than ``<low mark>``, the
+scheme is deactivated. If the value is lower than ``<mid mark>``, the scheme
+is activated.
+
+.. _damos_stats:
+
+Statistics
+~~~~~~~~~~
+
+It also counts the total number and bytes of regions that each scheme was tried
+to be applied to, the two numbers for the regions that each scheme was
+successfully applied to, and the total number of times the quota limit was
+exceeded. These statistics can be used for online analysis or tuning of the
+schemes.
+
+The statistics can be shown by reading the ``schemes`` file. Reading the file
+will show each scheme you entered in each line, and the five numbers for the
+statistics will be added at the end of each line.
+
+Example
+~~~~~~~
+
+The below commands apply a scheme saying "If a memory region of size in [4KiB,
+8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
+interval in [10, 20], page out the region. For the paging out, use only up to
+10ms per second, and also don't page out more than 1GiB per second. Under the
+limitation, page out memory regions having longer age first. Also, check the
+free memory rate of the system every 5 seconds, start the monitoring and paging
+out when the free memory rate becomes lower than 50%, but stop it if the free
+memory rate becomes larger than 60%, or lower than 30%".::
+
+ # cd <debugfs>/damon
+ # scheme="4096 8192 0 5 10 20 2" # target access pattern and action
+ # scheme+=" 10 $((1024*1024*1024)) 1000" # quotas
+ # scheme+=" 0 0 100" # prioritization weights
+ # scheme+=" 1 5000000 600 500 300" # watermarks
+ # echo "$scheme" > schemes
+
+
+Turning On/Off
+--------------
+
+Setting the files as described above doesn't incur any effect unless you
+explicitly start the monitoring. You can start, stop, and check the current
+status of the
+monitoring by writing to and reading from the ``monitor_on_DEPRECATED`` file.
+Writing ``on`` to the file starts the monitoring of the targets with the
+attributes. Writing ``off`` to the file stops those. DAMON also stops if
+every target process is terminated. Below example commands turn on, off, and
+check the status of DAMON::
+
+ # cd <debugfs>/damon
+ # echo on > monitor_on_DEPRECATED
+ # echo off > monitor_on_DEPRECATED
+ # cat monitor_on_DEPRECATED
+ off
+
+Please note that you cannot write to the above-mentioned debugfs files while
+the monitoring is turned on. If you write to the files while DAMON is running,
+an error code such as ``-EBUSY`` will be returned.
+
+
+Monitoring Thread PID
+---------------------
+
+DAMON does requested monitoring with a kernel thread called ``kdamond``. You
+can get the pid of the thread by reading the ``kdamond_pid`` file. When the
+monitoring is turned off, reading the file returns ``none``. ::
+
+ # cd <debugfs>/damon
+ # cat monitor_on_DEPRECATED
+ off
+ # cat kdamond_pid
+ none
+ # echo on > monitor_on_DEPRECATED
+ # cat kdamond_pid
+ 18594
+
+
+Using Multiple Monitoring Threads
+---------------------------------
+
+One ``kdamond`` thread is created for each monitoring context. You can create
+and remove monitoring contexts for use cases requiring multiple ``kdamond``
+threads using the ``mk_contexts`` and ``rm_contexts`` files.
+
+Writing the name of the new context to the ``mk_contexts`` file creates a
+directory of the name on the DAMON debugfs directory. The directory will have
+DAMON debugfs files for the context. ::
+
+ # cd <debugfs>/damon
+ # ls foo
+ # ls: cannot access 'foo': No such file or directory
+ # echo foo > mk_contexts
+ # ls foo
+ # attrs init_regions kdamond_pid schemes target_ids
+
+If the context is not needed anymore, you can remove it and the corresponding
+directory by putting the name of the context to the ``rm_contexts`` file. ::
+
+ # echo foo > rm_contexts
+ # ls foo
+ # ls: cannot access 'foo': No such file or directory
+
+Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on_DEPRECATED`` files
+are in the root directory only.
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 5026e58826e2..e4d4b4a8dc97 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -1,5 +1,3 @@
-.. _hugetlbpage:
-
=============
HugeTLB Pages
=============
@@ -60,8 +58,12 @@ HugePages_Surp
the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
maximum number of surplus huge pages is controlled by
``/proc/sys/vm/nr_overcommit_hugepages``.
+ Note: When the feature of freeing unused vmemmap pages associated
+ with each hugetlb page is enabled, the number of surplus huge pages
+ may be temporarily larger than the maximum number of surplus huge
+ pages when the system is under memory pressure.
Hugepagesize
- is the default hugepage size (in Kb).
+ is the default hugepage size (in kB).
Hugetlb
is the total amount of memory (in kB), consumed by huge
pages of all sizes.
@@ -80,6 +82,10 @@ returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of ``nr_hugepages``.
+Note: When the feature of freeing unused vmemmap pages associated with each
+hugetlb page is enabled, freeing of huge pages triggered by the user can fail
+when the system is under memory pressure. In that case, please try again later.
+
Pages that are used as huge pages are reserved inside the kernel and cannot
be used for other purposes. Huge pages cannot be swapped out under
memory pressure.
@@ -101,39 +107,63 @@ be specified in bytes with optional scale suffix [kKmMgG]. The default huge
page size may be selected with the "default_hugepagesz=<size>" boot parameter.
Hugetlb boot command line parameter semantics
-hugepagesz - Specify a huge page size. Used in conjunction with hugepages
+
+hugepagesz
+ Specify a huge page size. Used in conjunction with hugepages
parameter to preallocate a number of huge pages of the specified
size. Hence, hugepagesz and hugepages are typically specified in
- pairs such as:
+ pairs such as::
+
hugepagesz=2M hugepages=512
+
hugepagesz can only be specified once on the command line for a
specific huge page size. Valid huge page sizes are architecture
dependent.
-hugepages - Specify the number of huge pages to preallocate. This typically
+hugepages
+ Specify the number of huge pages to preallocate. This typically
follows a valid hugepagesz or default_hugepagesz parameter. However,
if hugepages is the first or only hugetlb command line parameter it
implicitly specifies the number of huge pages of default size to
allocate. If the number of huge pages of default size is implicitly
specified, it can not be overwritten by a hugepagesz,hugepages
- parameter pair for the default size.
- For example, on an architecture with 2M default huge page size:
+ parameter pair for the default size. This parameter also has a
+ node format. The node format specifies the number of huge pages
+ to allocate on specific nodes.
+
+ For example, on an architecture with 2M default huge page size::
+
hugepages=256 hugepagesz=2M hugepages=512
+
will result in 256 2M huge pages being allocated and a warning message
indicating that the hugepages=512 parameter is ignored. If a hugepages
parameter is preceded by an invalid hugepagesz parameter, it will
be ignored.
-default_hugepagesz - Specify the default huge page size. This parameter can
+
+ Node format example::
+
+ hugepagesz=2M hugepages=0:1,1:2
+
+ This will allocate one 2M hugepage on node0 and two 2M hugepages on node1.
+ If the node number is invalid, the parameter will be ignored.
+
+default_hugepagesz
+ Specify the default huge page size. This parameter can
only be specified once on the command line. default_hugepagesz can
optionally be followed by the hugepages parameter to preallocate a
specific number of huge pages of default size. The number of default
sized huge pages to preallocate can also be implicitly specified as
mentioned in the hugepages section above. Therefore, on an
- architecture with 2M default huge page size:
+ architecture with 2M default huge page size::
+
hugepages=256
default_hugepagesz=2M hugepages=256
hugepages=256 default_hugepagesz=2M
+
will all result in 256 2M huge pages being allocated. Valid default
huge page size is architecture dependent.
+hugetlb_free_vmemmap
+ When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HugeTLB
+ Vmemmap Optimization (HVO).
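+
+As an illustrative sketch only (the node numbers and counts are hypothetical),
+a boot command line combining the parameters above might look like::
+
+ hugepagesz=2M hugepages=0:1,1:2 hugetlb_free_vmemmap=on
+
+After boot, the resulting per-node allocation can be checked via sysfs, e.g.::
+
+ # cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
+ 2
+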
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
@@ -212,8 +242,12 @@ will exist, of the form::
hugepages-${size}kB
-Inside each of these directories, the same set of files will exist::
+Inside each of these directories, the set of files contained in ``/proc``
+will exist. In addition, two additional interfaces for demoting huge
+pages may exist::
+ demote
+ demote_size
nr_hugepages
nr_hugepages_mempolicy
nr_overcommit_hugepages
@@ -221,7 +255,29 @@ Inside each of these directories, the same set of files will exist::
resv_hugepages
surplus_hugepages
-which function as described above for the default huge page-sized case.
+The demote interfaces provide the ability to split a huge page into
+smaller huge pages. For example, the x86 architecture supports both
+1GB and 2MB huge page sizes. A 1GB huge page can be split into 512
+2MB huge pages. Demote interfaces are not available for the smallest
+huge page size. The demote interfaces are:
+
+demote_size
+ is the size of demoted pages. When a page is demoted a corresponding
+ number of huge pages of demote_size will be created. By default,
+ demote_size is set to the next smaller huge page size. If there are
+ multiple smaller huge page sizes, demote_size can be set to any of
+ these smaller sizes. Only huge page sizes less than the current huge
+ pages size are allowed.
+
+demote
+ is used to demote a number of huge pages. A user with root privileges
+ can write to this file. It may not be possible to demote the
+ requested number of huge pages. To determine how many pages were
+ actually demoted, compare the value of nr_hugepages before and after
+ writing to the demote interface. demote is a write only interface.
+
+The interfaces which are the same as in ``/proc`` (all except demote and
+demote_size) function as described above for the default huge page-sized case.
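+
+As a minimal sketch (the paths assume a system with 1GB and 2MB huge page
+sizes and at least one free 1GB huge page; the outputs are illustrative),
+demotion could be exercised like this::
+
+ # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size
+ 2048kB
+ # echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
+ # cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
+ 512
+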
.. _mem_policy_and_hp_alloc:
@@ -255,7 +311,7 @@ memory policy mode--bind, preferred, local or interleave--may be used. The
resulting effect on persistent huge page allocation is as follows:
#. Regardless of mempolicy mode [see
- :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`],
+ Documentation/admin-guide/mm/numa_memory_policy.rst],
persistent huge pages will be distributed across the node or nodes
specified in the mempolicy as if "interleave" had been specified.
However, if a node in the policy does not contain sufficient contiguous
@@ -403,13 +459,13 @@ Examples
.. _map_hugetlb:
``map_hugetlb``
- see tools/testing/selftests/vm/map_hugetlb.c
+ see tools/testing/selftests/mm/map_hugetlb.c
``hugepage-shm``
- see tools/testing/selftests/vm/hugepage-shm.c
+ see tools/testing/selftests/mm/hugepage-shm.c
``hugepage-mmap``
- see tools/testing/selftests/vm/hugepage-mmap.c
+ see tools/testing/selftests/mm/hugepage-mmap.c
The `libhugetlbfs`_ library provides a wide range of userspace tools
to help with huge page usability, environment setup, and control.
diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
index df9394fb39c2..16fcf38dac56 100644
--- a/Documentation/admin-guide/mm/idle_page_tracking.rst
+++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
@@ -1,5 +1,3 @@
-.. _idle_page_tracking:
-
==================
Idle Page Tracking
==================
@@ -65,14 +63,13 @@ workload one should:
are not reclaimable, he or she can filter them out using
``/proc/kpageflags``.
-The page-types tool in the tools/vm directory can be used to assist in this.
+The page-types tool in the tools/mm directory can be used to assist in this.
If the tool is run initially with the appropriate option, it will mark all the
queried pages as idle. Subsequent runs of the tool can then show which pages have
their idle flag cleared in the interim.
-See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
-information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
-``/proc/kpagecgroup``.
+See Documentation/admin-guide/mm/pagemap.rst for more information about
+``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``.
.. _impl_details:
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 11db46448354..1f883abf3f00 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -3,9 +3,9 @@ Memory Management
=================
Linux memory management subsystem is responsible, as the name implies,
-for managing the memory in the system. This includes implemnetation of
+for managing the memory in the system. This includes implementation of
virtual memory and demand paging, memory allocation both for kernel
-internal structures and user space programms, mapping of files into
+internal structures and user space programs, mapping of files into
processes address space and many other cool things.
Linux memory management is a complex system with many configurable
@@ -16,8 +16,7 @@ are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_.
.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
Linux memory management has its own jargon and if you are not yet
-familiar with it, consider reading
-:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.
+familiar with it, consider reading Documentation/admin-guide/mm/concepts.rst.
Here we document in detail how to interact with various mechanisms in
the Linux memory management.
@@ -27,13 +26,19 @@ the Linux memory management.
concepts
cma_debugfs
+ damon/index
hugetlbpage
idle_page_tracking
ksm
memory-hotplug
+ multigen_lru
+ nommu-mmap
numa_memory_policy
numaperf
pagemap
+ shrinker_debugfs
soft-dirty
+ swap_numa
transhuge
userfaultfd
+ zswap
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 874eb0c77d34..a639cac12477 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -1,5 +1,3 @@
-.. _admin_guide_ksm:
-
=======================
Kernel Samepage Merging
=======================
@@ -9,7 +7,7 @@ Overview
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation,
-and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
+and http://lwn.net/Articles/306704/ and https://lwn.net/Articles/330589/
KSM was originally developed for use with KVM (where it was known as
Kernel Shared Memory), to fit more virtual machines into physical memory,
@@ -22,7 +20,7 @@ content which can be replaced by a single write-protected page (which
is automatically copied if a process later wants to update its
content). The amount of pages that KSM daemon scans in a single pass
and the time between the passes are configured using :ref:`sysfs
-intraface <ksm_sysfs>`
+interface <ksm_sysfs>`
KSM only merges anonymous (private) pages, never pagecache (file) pages.
KSM's merged pages were originally locked into kernel memory, but can now
@@ -52,7 +50,7 @@ with EAGAIN, but more probably arousing the Out-Of-Memory killer.
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
built with CONFIG_KSM=y, those calls will normally succeed: even if the
-the KSM daemon is not currently running, MADV_MERGEABLE still registers
+KSM daemon is not currently running, MADV_MERGEABLE still registers
the range for whenever the KSM daemon is started; even if the range
cannot contain any pages which KSM could actually merge; even if
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
@@ -82,6 +80,9 @@ pages_to_scan
how many pages to scan before ksmd goes to sleep
e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.
+ The pages_to_scan value cannot be changed if ``advisor_mode`` has
+ been set to scan-time.
+
Default: 100 (chosen for demonstration purposes)
sleep_millisecs
@@ -157,8 +158,44 @@ stable_node_chains_prune_millisecs
scan. It's a noop if not a single KSM page hit the
``max_page_sharing`` yet.
+smart_scan
+ Historically KSM checked every candidate page for each scan. It did
+ not take into account historic information. When smart scan is
+ enabled, pages that have previously not been de-duplicated get
+ skipped. How often these pages are skipped depends on how often
+ de-duplication has already been tried and failed. By default this
+ optimization is enabled. The ``pages_skipped`` metric shows how
+ effective the setting is.
+
+advisor_mode
+ The ``advisor_mode`` selects the current advisor. Two modes are
+ supported: none and scan-time. The default is none. By setting
+ ``advisor_mode`` to scan-time, the scan time advisor is enabled.
+ The section about ``advisor`` explains in detail how the scan time
+ advisor works.
+
+advisor_max_cpu
+ specifies the upper limit of the cpu percent usage of the ksmd
+ background thread. The default is 70.
+
+advisor_target_scan_time
+ specifies the target scan time in seconds to scan all the candidate
+ pages. The default value is 200 seconds.
+
+advisor_min_pages_to_scan
+ specifies the lower limit of the ``pages_to_scan`` parameter of the
+ scan time advisor. The default is 500.
+
+advisor_max_pages_to_scan
+ specifies the upper limit of the ``pages_to_scan`` parameter of the
+ scan time advisor. The default is 30000.
+
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
+general_profit
+ how effective KSM is. The calculation is explained below.
+pages_scanned
+ how many pages are being scanned for ksm
pages_shared
how many shared pages are being used
pages_sharing
@@ -167,12 +204,21 @@ pages_unshared
how many pages unique but repeatedly checked for merging
pages_volatile
how many pages changing too fast to be placed in a tree
+pages_skipped
+ how many pages did the "smart" page scanning algorithm skip
full_scans
how many times all mergeable areas have been scanned
stable_node_chains
the number of KSM pages that hit the ``max_page_sharing`` limit
stable_node_dups
number of duplicated KSM pages
+ksm_zero_pages
+ how many zero pages that are still mapped into processes were mapped by
+ KSM when deduplicating.
+
+When ``use_zero_pages`` is/was enabled, the sum of ``pages_sharing`` +
+``ksm_zero_pages`` represents the actual number of pages saved by KSM.
+If ``use_zero_pages`` has never been enabled, ``ksm_zero_pages`` is 0.
A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
@@ -184,6 +230,94 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.
+Monitoring KSM profit
+=====================
+
+KSM can save memory by merging identical pages, but it can also consume
+additional memory, because it needs to generate a number of rmap_items to
+store each scanned page's brief rmap information. Some of these pages may
+be merged, but some may never be merged even after being checked several
+times; the memory consumed for those pages is unprofitable.
+
+1) How to determine whether KSM saves or consumes memory system-wide? Here
+ is a simple approximate calculation for reference::
+
+ general_profit =~ ksm_saved_pages * sizeof(page) - (all_rmap_items) *
+ sizeof(rmap_item);
+
+ where ksm_saved_pages equals the sum of ``pages_sharing`` +
+ ``ksm_zero_pages`` of the system, and all_rmap_items can be easily
+ obtained by summing ``pages_sharing``, ``pages_shared``, ``pages_unshared``
+ and ``pages_volatile``.
+
+2) The KSM profit within a single process can be similarly obtained by the
+ following approximate calculation::
+
+ process_profit =~ ksm_saved_pages * sizeof(page) -
+ ksm_rmap_items * sizeof(rmap_item).
+
+ where ksm_saved_pages equals the sum of ``ksm_merging_pages`` and
+ ``ksm_zero_pages``, both of which are shown under the directory
+ ``/proc/<pid>/ksm_stat``, and ksm_rmap_items is also shown in
+ ``/proc/<pid>/ksm_stat``. The process profit is also shown in
+ ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
+
+From the perspective of an application, a high ratio of ``ksm_rmap_items`` to
+``ksm_merging_pages`` means a badly applied madvise policy, so developers or
+administrators have to rethink how to change the madvise policy. As an example
+for reference, a page is usually 4K in size, and an rmap_item is 32B on 32-bit
+CPU architectures and 64B on 64-bit CPU architectures. So if the
+``ksm_rmap_items/ksm_merging_pages`` ratio exceeds 64 on a 64-bit CPU, or
+exceeds 128 on a 32-bit CPU, the app's madvise policy should be dropped,
+because the KSM profit is approximately zero or negative.
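+
+As a rough shell sketch of the system-wide calculation above (assuming 4K
+pages and 64B rmap_items, i.e. a 64-bit system; the result is only an
+approximation)::
+
+ # cd /sys/kernel/mm/ksm
+ # saved=$(( $(cat pages_sharing) + $(cat ksm_zero_pages) ))
+ # rmap=$(( $(cat pages_sharing) + $(cat pages_shared) + $(cat pages_unshared) + $(cat pages_volatile) ))
+ # echo "approximate general profit: $(( saved * 4096 - rmap * 64 )) bytes"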
+
+Monitoring KSM events
+=====================
+
+There are some counters in /proc/vmstat that may be used to monitor KSM events.
+KSM might help save memory, but this is a tradeoff: processes may suffer delays
+on KSM copy-on-write (COW) or on copying pages when swapping in. These events
+can help users evaluate whether or how to use KSM. For example, if cow_ksm
+increases too fast, users may decrease the range of madvise(, , MADV_MERGEABLE).
+
+cow_ksm
+ is incremented every time a KSM page triggers copy on write (COW);
+ when users try to write to a KSM page, a copy has to be made.
+
+ksm_swpin_copy
+ is incremented every time a KSM page is copied when swapping in;
+ note that a KSM page might be copied when swapping in because do_swap_page()
+ cannot do all the locking needed to reconstitute a cross-anon_vma KSM page.
+
+Advisor
+=======
+
+The number of candidate pages for KSM is dynamic. It can often be observed
+that during the startup of an application more candidate pages need to be
+processed. Without an advisor, the ``pages_to_scan`` parameter needs to be
+sized for the maximum number of candidate pages. The scan time advisor can
+change the ``pages_to_scan`` parameter based on demand.
+
+The advisor can be enabled, so KSM can automatically adapt to changes in the
+number of candidate pages to scan. Two advisors are implemented: none and
+scan-time. With none, no advisor is enabled. The default is none.
+
+The scan time advisor changes the ``pages_to_scan`` parameter based on the
+observed scan times. The possible values for the ``pages_to_scan`` parameter
+are limited by the ``advisor_max_cpu`` parameter. In addition, there is the
+``advisor_target_scan_time`` parameter, which sets the target time to scan all
+the KSM candidate pages and decides how aggressively the scan time advisor
+scans candidate pages. Lower values make the scan time advisor scan more
+aggressively. This is the most important parameter for the configuration of
+the scan time advisor.
+
+The initial value and the maximum value can be changed with
+``advisor_min_pages_to_scan`` and ``advisor_max_pages_to_scan``. The default
+values are sufficient for most workloads and use cases.
+
+The ``pages_to_scan`` parameter is re-calculated after a scan has been completed.
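+
+For example, the scan time advisor could be enabled and tuned like this (the
+target scan time of 100 seconds is only an illustration)::
+
+ # echo scan-time > /sys/kernel/mm/ksm/advisor_mode
+ # echo 100 > /sys/kernel/mm/ksm/advisor_target_scan_time
+ # cat /sys/kernel/mm/ksm/pages_to_scan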
+
+
--
Izik Eidus,
Hugh Dickins, 17 Nov 2009
diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index 5c4432c96c4b..098f14d83e99 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -1,444 +1,695 @@
-.. _admin_guide_memory_hotplug:
+==================
+Memory Hot(Un)Plug
+==================
-==============
-Memory Hotplug
-==============
+This document describes generic Linux support for memory hot(un)plug with
+a focus on System RAM, including ZONE_MOVABLE support.
-:Created: Jul 28 2007
-:Updated: Add some details about locking internals: Aug 20 2018
+.. contents:: :local:
-This document is about memory hotplug including how-to-use and current status.
-Because Memory Hotplug is still under development, contents of this text will
-be changed often.
+Introduction
+============
-.. contents:: :local:
+Memory hot(un)plug allows for increasing and decreasing the size of physical
+memory available to a machine at runtime. In the simplest case, it consists of
+physically plugging or unplugging a DIMM at runtime, coordinated with the
+operating system.
-.. note::
+Memory hot(un)plug is used for various purposes:
- (1) x86_64's has special implementation for memory hotplug.
- This text does not describe it.
- (2) This text assumes that sysfs is mounted at ``/sys``.
+- The physical memory available to a machine can be adjusted at runtime, up- or
+ downgrading the memory capacity. This dynamic memory resizing, sometimes
+ referred to as "capacity on demand", is frequently used with virtual machines
+ and logical partitions.
+- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
+ example is replacing failing memory modules.
-Introduction
-============
+- Reducing energy consumption either by physically unplugging memory modules or
+ by logically unplugging (parts of) memory modules from Linux.
+
+Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
+used to expose persistent memory, other performance-differentiated memory and
+reserved memory regions as ordinary system RAM to Linux.
-Purpose of memory hotplug
--------------------------
+Linux only supports memory hot(un)plug on selected 64 bit architectures, such as
+x86_64, arm64, ppc64 and s390x.
-Memory Hotplug allows users to increase/decrease the amount of memory.
-Generally, there are two purposes.
+Memory Hot(Un)Plug Granularity
+------------------------------
-(A) For changing the amount of memory.
- This is to allow a feature like capacity on demand.
-(B) For installing/removing DIMMs or NUMA-nodes physically.
- This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.
+Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
+physical memory address space into chunks of the same size: memory sections. The
+size of a memory section is architecture dependent. For example, x86_64 uses
+128 MiB and ppc64 uses 16 MiB.
-(A) is required by highly virtualized environments and (B) is required by
-hardware which supports memory power management.
+Memory sections are combined into chunks referred to as "memory blocks". The
+size of a memory block is architecture dependent and corresponds to the smallest
+granularity that can be hot(un)plugged. The default size of a memory block is
+the same as memory section size, unless an architecture specifies otherwise.
-Linux memory hotplug is designed for both purpose.
+All memory blocks have the same size.
-Phases of memory hotplug
+Phases of Memory Hotplug
------------------------
-There are 2 phases in Memory Hotplug:
+Memory hotplug consists of two phases:
- 1) Physical Memory Hotplug phase
- 2) Logical Memory Hotplug phase.
+(1) Adding the memory to Linux
+(2) Onlining memory blocks
-The First phase is to communicate hardware/firmware and make/erase
-environment for hotplugged memory. Basically, this phase is necessary
-for the purpose (B), but this is good phase for communication between
-highly virtualized environments too.
+In the first phase, metadata, such as the memory map ("memmap") and page tables
+for the direct mapping, is allocated and initialized, and memory blocks are
+created; the latter also creates sysfs files for managing newly created memory
+blocks.
-When memory is hotplugged, the kernel recognizes new memory, makes new memory
-management tables, and makes sysfs files for new memory's operation.
+In the second phase, added memory is exposed to the page allocator. After this
+phase, the memory is visible in memory statistics, such as free and total
+memory, of the system.
-If firmware supports notification of connection of new memory to OS,
-this phase is triggered automatically. ACPI can notify this event. If not,
-"probe" operation by system administration is used instead.
-(see :ref:`memory_hotplug_physical_mem`).
+Phases of Memory Hotunplug
+--------------------------
-Logical Memory Hotplug phase is to change memory state into
-available/unavailable for users. Amount of memory from user's view is
-changed by this phase. The kernel makes all memory in it as free pages
-when a memory range is available.
+Memory hotunplug consists of two phases:
-In this document, this phase is described as online/offline.
+(1) Offlining memory blocks
+(2) Removing the memory from Linux
-Logical Memory Hotplug phase is triggered by write of sysfs file by system
-administrator. For the hot-add case, it must be executed after Physical Hotplug
-phase by hand.
-(However, if you writes udev's hotplug scripts for memory hotplug, these
-phases can be execute in seamless way.)
+In the first phase, memory is "hidden" from the page allocator again, for
+example, by migrating busy memory to other memory locations and removing all
+relevant free pages from the page allocator. After this phase, the memory is no
+longer visible in memory statistics of the system.
-Unit of Memory online/offline operation
----------------------------------------
+In the second phase, the memory blocks are removed and metadata is freed.
-Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
-into chunks of the same size. These chunks are called "sections". The size of
-a memory section is architecture dependent. For example, power uses 16MiB, ia64
-uses 1GiB.
+Memory Hotplug Notifications
+============================
-Memory sections are combined into chunks referred to as "memory blocks". The
-size of a memory block is architecture dependent and represents the logical
-unit upon which memory online/offline operations are to be performed. The
-default size of a memory block is the same as memory section size unless an
-architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.)
+There are various ways in which Linux is notified about memory hotplug events
+such
+that it can start adding hotplugged memory. This description is limited to
+systems that support ACPI; mechanisms specific to other firmware interfaces or
+virtual machines are not described.
-To determine the size (in bytes) of a memory block please read this file::
+ACPI Notifications
+------------------
- /sys/devices/system/memory/block_size_bytes
+Platforms that support ACPI, such as x86_64, can support memory hotplug
+notifications via ACPI.
-Kernel Configuration
-====================
+In general, a firmware supporting memory hotplug defines a memory class object
+HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI
+driver will hotplug the memory to Linux.
-To use memory hotplug feature, kernel must be compiled with following
-config options.
+If the firmware supports hotplug of NUMA nodes, it defines an object _HID
+"ACPI0004", "PNP0A05", or "PNP0A06". When notified about an hotplug event, all
+assigned memory devices are added to Linux by the ACPI driver.
-- For all memory hotplug:
- - Memory model -> Sparse Memory (``CONFIG_SPARSEMEM``)
- - Allow for memory hot-add (``CONFIG_MEMORY_HOTPLUG``)
+Similarly, Linux can be notified about requests to hotunplug a memory device or
+a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
+blocks, and, if successful, hotunplug the memory from Linux.
-- To enable memory removal, the following are also necessary:
- - Allow for memory hot remove (``CONFIG_MEMORY_HOTREMOVE``)
- - Page Migration (``CONFIG_MIGRATION``)
+Manual Probing
+--------------
-- For ACPI memory hotplug, the following are also necessary:
- - Memory hotplug (under ACPI Support menu) (``CONFIG_ACPI_HOTPLUG_MEMORY``)
- - This option can be kernel module.
+On some architectures, the firmware may not be able to notify the operating
+system about a memory hotplug event. Instead, the memory has to be manually
+probed from user space.
-- As a related configuration, if your box has a feature of NUMA-node hotplug
- via ACPI, then this option is necessary too.
+The probe interface is located at::
- - ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
- (``CONFIG_ACPI_CONTAINER``).
+ /sys/devices/system/memory/probe
- This option can be kernel module too.
+Only complete memory blocks can be probed. Individual memory blocks are probed
+by providing the physical start address of the memory block::
+ % echo addr > /sys/devices/system/memory/probe
-.. _memory_hotplug_sysfs_files:
+Which results in a memory block for the range [addr, addr + memory_block_size)
+being created.
-sysfs files for memory hotplug
-==============================
+.. note::
-All memory blocks have their device information in sysfs. Each memory block
-is described under ``/sys/devices/system/memory`` as::
+ Using the probe interface is discouraged as it is easy to crash the kernel,
+ because Linux cannot validate user input; this interface might be removed in
+ the future.
- /sys/devices/system/memory/memoryXXX
+Onlining and Offlining Memory Blocks
+====================================
-where XXX is the memory block id.
+After a memory block has been created, Linux has to be instructed to actually
+make use of that memory: the memory block has to be "online".
-For the memory block covered by the sysfs directory. It is expected that all
-memory sections in this range are present and no memory holes exist in the
-range. Currently there is no way to determine if there is a memory hole, but
-the existence of one should not affect the hotplug capabilities of the memory
-block.
+Before a memory block can be removed, Linux has to stop using any memory part of
+the memory block: the memory block has to be "offlined".
-For example, assume 1GiB memory block size. A device for a memory starting at
-0x100000000 is ``/sys/device/system/memory/memory4``::
+The Linux kernel can be configured to automatically online added memory blocks,
+and drivers may automatically trigger offlining of memory blocks when attempting
+to hotunplug memory. Memory blocks can only be removed once offlining has
+succeeded.
- (0x100000000 / 1Gib = 4)
+Onlining Memory Blocks Manually
+-------------------------------
-This device covers address range [0x100000000 ... 0x140000000)
+If auto-onlining of memory blocks isn't enabled, user-space has to manually
+trigger onlining of memory blocks. Often, udev rules are used to automate this
+task in user space.
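+
+As a simplified sketch of such a rule (distributions often ship a variant of
+it), a udev rule that onlines every newly added, still offline memory block
+might look like::
+
+ SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
+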
-Under each memory block, you can see 5 files:
+Onlining of a memory block can be triggered via::
-- ``/sys/devices/system/memory/memoryXXX/phys_index``
-- ``/sys/devices/system/memory/memoryXXX/phys_device``
-- ``/sys/devices/system/memory/memoryXXX/state``
-- ``/sys/devices/system/memory/memoryXXX/removable``
-- ``/sys/devices/system/memory/memoryXXX/valid_zones``
+ % echo online > /sys/devices/system/memory/memoryXXX/state
-=================== ============================================================
-``phys_index`` read-only and contains memory block id, same as XXX.
-``state`` read-write
-
- - at read: contains online/offline state of memory.
- - at write: user can specify "online_kernel",
-
- "online_movable", "online", "offline" command
- which will be performed on all sections in the block.
-``phys_device`` read-only: designed to show the name of physical memory
- device. This is not well implemented now.
-``removable`` read-only: contains an integer value indicating
- whether the memory block is removable or not
- removable. A value of 1 indicates that the memory
- block is removable and a value of 0 indicates that
- it is not removable. A memory block is removable only if
- every section in the block is removable.
-``valid_zones`` read-only: designed to show which zones this memory block
- can be onlined to.
-
- The first column shows it`s default zone.
-
- "memory6/valid_zones: Normal Movable" shows this memoryblock
- can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE
- by online_movable.
-
- "memory7/valid_zones: Movable Normal" shows this memoryblock
- can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL
- by online_kernel.
-=================== ============================================================
+Or alternatively::
+
+ % echo 1 > /sys/devices/system/memory/memoryXXX/online
+
+The kernel will select the target zone automatically, depending on the
+configured ``online_policy``.
+
+One can explicitly request to associate an offline memory block with
+ZONE_MOVABLE by::
+
+ % echo online_movable > /sys/devices/system/memory/memoryXXX/state
+
+Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::
+
+ % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+
+In any case, if onlining succeeds, the state of the memory block is changed to
+be "online". If it fails, the state of the memory block will remain unchanged
+and the above commands will fail.
+
+Onlining Memory Blocks Automatically
+------------------------------------
+
+The kernel can be configured to try auto-onlining of newly added memory blocks.
+If this feature is disabled, the memory blocks will stay offline until
+explicitly onlined from user space.
+
+The configured auto-online behavior can be observed via::
+
+ % cat /sys/devices/system/memory/auto_online_blocks
+
+Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
+``online_movable`` to that file, like::
+
+ % echo online > /sys/devices/system/memory/auto_online_blocks
+
+Similarly to manual onlining, with ``online`` the kernel will select the
+target zone automatically, depending on the configured ``online_policy``.
+
+Modifying the auto-online behavior will only affect subsequently added memory
+blocks.
.. note::
- These directories/files appear after physical memory hotplug phase.
+ In corner cases, auto-onlining can fail. The kernel won't retry. Note that
+ auto-onlining is not expected to fail in default configurations.
-If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed
-via symbolic links located in the ``/sys/devices/system/node/node*`` directories.
+.. note::
-For example::
+ DLPAR on ppc64 ignores the ``offline`` setting and will still online added
+ memory blocks; if onlining fails, memory blocks are removed again.
- /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+Offlining Memory Blocks
+-----------------------
-A backlink will also be created::
+In the current implementation, Linux's memory offlining will try migrating all
+movable pages off the affected memory block. As most kernel allocations, such as
+page tables, are unmovable, page migration can fail and, therefore, inhibit
+memory offlining from succeeding.
- /sys/devices/system/memory/memory9/node0 -> ../../node/node0
+Having the memory provided by a memory block managed by ZONE_MOVABLE
+significantly increases memory offlining reliability; still, memory offlining
+can fail in some corner cases.
-.. _memory_hotplug_physical_mem:
+Further, memory offlining might retry for a long time (or even forever), until
+aborted by the user.
-Physical memory hot-add phase
-=============================
+Offlining of a memory block can be triggered via::
-Hardware(Firmware) Support
---------------------------
+ % echo offline > /sys/devices/system/memory/memoryXXX/state
-On x86_64/ia64 platform, memory hotplug by ACPI is supported.
+Or alternatively::
-In general, the firmware (ACPI) which supports memory hotplug defines
-memory class object of _HID "PNP0C80". When a notify is asserted to PNP0C80,
-Linux's ACPI handler does hot-add memory to the system and calls a hotplug udev
-script. This will be done automatically.
+ % echo 0 > /sys/devices/system/memory/memoryXXX/online
-But scripts for memory hotplug are not contained in generic udev package(now).
-You may have to write it by yourself or online/offline memory by hand.
-Please see :ref:`memory_hotplug_how_to_online_memory` and
-:ref:`memory_hotplug_how_to_offline_memory`.
+If offlining succeeds, the state of the memory block is changed to be "offline".
+If it fails, the state of the memory block will remain unchanged and the above
+commands will fail, for example, via::
-If firmware supports NUMA-node hotplug, and defines an object _HID "ACPI0004",
-"PNP0A05", or "PNP0A06", notification is asserted to it, and ACPI handler
-calls hotplug code for all of objects which are defined in it.
-If memory device is found, memory hotplug code will be called.
+ bash: echo: write error: Device or resource busy
-Notify memory hot-add event by hand
------------------------------------
+or via::
-On some architectures, the firmware may not notify the kernel of a memory
-hotplug event. Therefore, the memory "probe" interface is supported to
-explicitly notify the kernel. This interface depends on
-CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86
-if hotplug is supported, although for x86 this should be handled by ACPI
-notification.
+ bash: echo: write error: Invalid argument
-Probe interface is located at::
+Observing the State of Memory Blocks
+------------------------------------
- /sys/devices/system/memory/probe
+The state (online/offline/going-offline) of a memory block can be observed
+either via::
-You can tell the physical address of new memory to the kernel by::
+ % cat /sys/devices/system/memory/memoryXXX/state
- % echo start_address_of_new_memory > /sys/devices/system/memory/probe
+Or alternatively (1/0) via::
-Then, [start_address_of_new_memory, start_address_of_new_memory +
-memory_block_size] memory range is hot-added. In this case, hotplug script is
-not called (in current implementation). You'll have to online memory by
-yourself. Please see :ref:`memory_hotplug_how_to_online_memory`.
+ % cat /sys/devices/system/memory/memoryXXX/online
-Logical Memory hot-add phase
-============================
+For an online memory block, the managing zone can be observed via::
-State of memory
----------------
+ % cat /sys/devices/system/memory/memoryXXX/valid_zones
-To see (online/offline) state of a memory block, read 'state' file::
+Configuring Memory Hot(Un)Plug
+==============================
- % cat /sys/device/system/memory/memoryXXX/state
+There are various ways in which system administrators can configure memory
+hot(un)plug and interact with memory blocks, especially to online them.
+Memory Hot(Un)Plug Configuration via Sysfs
+------------------------------------------
-- If the memory block is online, you'll read "online".
-- If the memory block is offline, you'll read "offline".
+Some memory hot(un)plug properties can be configured or inspected via sysfs in::
+ /sys/devices/system/memory/
-.. _memory_hotplug_how_to_online_memory:
+The following files are currently defined:
-How to online memory
---------------------
+====================== =========================================================
+``auto_online_blocks`` read-write: set or get the default state of new memory
+ blocks; configure auto-onlining.
-When the memory is hot-added, the kernel decides whether or not to "online"
-it according to the policy which can be read from "auto_online_blocks" file::
+ The default value depends on the
+ CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
+ option.
- % cat /sys/devices/system/memory/auto_online_blocks
+ See the ``state`` property of memory blocks for details.
+``block_size_bytes`` read-only: the size in bytes of a memory block.
+``probe`` write-only: add (probe) selected memory blocks manually
+ from user space by supplying the physical start address.
-The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
-option. If it is disabled the default is "offline" which means the newly added
-memory is not in a ready-to-use state and you have to "online" the newly added
-memory blocks manually. Automatic onlining can be requested by writing "online"
-to "auto_online_blocks" file::
+ Availability depends on the CONFIG_ARCH_MEMORY_PROBE
+ kernel configuration option.
+``uevent`` read-write: generic udev file for device subsystems.
+``crash_hotplug`` read-only: when changes to the system memory map
+ occur due to hot un/plug of memory, this file contains
+ '1' if the kernel updates the kdump capture kernel memory
+ map itself (via elfcorehdr), or '0' if userspace must update
+ the kdump capture kernel memory map.
- % echo online > /sys/devices/system/memory/auto_online_blocks
+ Availability depends on the CONFIG_MEMORY_HOTPLUG kernel
+ configuration option.
+====================== =========================================================
-This sets a global policy and impacts all memory blocks that will subsequently
-be hotplugged. Currently offline blocks keep their state. It is possible, under
-certain circumstances, that some memory blocks will be added but will fail to
-online. User space tools can check their "state" files
-(``/sys/devices/system/memory/memoryXXX/state``) and try to online them manually.
+.. note::
-If the automatic onlining wasn't requested, failed, or some memory block was
-offlined it is possible to change the individual block's state by writing to the
-"state" file::
+ When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
+ additional files ``hard_offline_page`` and ``soft_offline_page`` are available
+ to trigger hwpoisoning of pages, for example, for testing purposes. Note that
+ this functionality is not really related to memory hot(un)plug or actual
+ offlining of memory blocks.
- % echo online > /sys/devices/system/memory/memoryXXX/state
+Memory Block Configuration via Sysfs
+------------------------------------
-This onlining will not change the ZONE type of the target memory block,
-If the memory block doesn't belong to any zone an appropriate kernel zone
-(usually ZONE_NORMAL) will be used unless movable_node kernel command line
-option is specified when ZONE_MOVABLE will be used.
+Each memory block is represented as a memory block device that can be
+onlined or offlined. All memory blocks have their device information located in
+sysfs. Each present memory block is listed under
+``/sys/devices/system/memory`` as::
-You can explicitly request to associate it with ZONE_MOVABLE by::
+ /sys/devices/system/memory/memoryXXX
- % echo online_movable > /sys/devices/system/memory/memoryXXX/state
+where XXX is the memory block id; the number of digits is variable.
-.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE
+A present memory block indicates that some memory in the range is present;
+however, a memory block might span memory holes. A memory block spanning memory
+holes cannot be offlined.
-Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by::
+For example, assume 1 GiB memory block size. A device for a memory starting at
+0x100000000 is ``/sys/devices/system/memory/memory4``::
- % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+ (0x100000000 / 1 GiB = 4)
+
+This device covers address range [0x100000000 ... 0x140000000)
-.. note:: current limit: this memory block must be adjacent to ZONE_NORMAL
+The following files are currently defined:
-An explicit zone onlining can fail (e.g. when the range is already within
-and existing and incompatible zone already).
+=================== ============================================================
+``online`` read-write: simplified interface to trigger onlining /
+ offlining and to observe the state of a memory block.
+ When onlining, the zone is selected automatically.
+``phys_device`` read-only: legacy interface only ever used on s390x to
+ expose the covered storage increment.
+``phys_index`` read-only: the memory block id (XXX).
+``removable`` read-only: legacy interface that indicated whether a memory
+ block was likely to be offlineable or not. Nowadays, the
+ kernel returns ``1`` if and only if it supports memory
+ offlining.
+``state`` read-write: advanced interface to trigger onlining /
+ offlining and to observe the state of a memory block.
+
+ When writing, ``online``, ``offline``, ``online_kernel`` and
+ ``online_movable`` are supported.
+
+ ``online_movable`` specifies onlining to ZONE_MOVABLE.
+ ``online_kernel`` specifies onlining to the default kernel
+ zone for the memory block, such as ZONE_NORMAL.
+ ``online`` lets the kernel select the zone automatically.
+
+ When reading, ``online``, ``offline`` and ``going-offline``
+ may be returned.
+``uevent`` read-write: generic uevent file for devices.
+``valid_zones`` read-only: when a block is online, shows the zone it
+ belongs to; when a block is offline, shows what zone will
+ manage it when the block is onlined.
+
+ For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
+ ``Movable`` and ``none`` may be returned. ``none`` indicates
+ that memory provided by a memory block is managed by
+ multiple zones or spans multiple nodes; such memory blocks
+ cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
+ Other values indicate a kernel zone.
+
+ For offline memory blocks, the first column shows the
+ zone the kernel would select when onlining the memory block
+ right now without further specifying a zone.
+
+ Availability depends on the CONFIG_MEMORY_HOTREMOVE
+ kernel configuration option.
+=================== ============================================================
+
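+For example, the state and zone of a hypothetical online memory block
+``memory32`` might be inspected like this (output illustrative)::
+
+ % cat /sys/devices/system/memory/memory32/state
+ online
+ % cat /sys/devices/system/memory/memory32/valid_zones
+ Normal
+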
+.. note::
-After this, memory block XXX's state will be 'online' and the amount of
-available memory will be increased.
+ If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
+ directories can also be accessed via symbolic links located in the
+ ``/sys/devices/system/node/node*`` directories.
-This may be changed in future.
+ For example::
-Logical memory remove
-=====================
+ /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
-Memory offline and ZONE_MOVABLE
--------------------------------
+ A backlink will also be created::
-Memory offlining is more complicated than memory online. Because memory offline
-has to make the whole memory block be unused, memory offline can fail if
-the memory block includes memory which cannot be freed.
+ /sys/devices/system/memory/memory9/node0 -> ../../node/node0
+
+Command Line Parameters
+-----------------------
+
+Some command line parameters affect memory hot(un)plug handling. The following
+command line parameters are relevant:
+
+======================== =======================================================
+``memhp_default_state`` configure auto-onlining by essentially setting
+ ``/sys/devices/system/memory/auto_online_blocks``.
+``movable_node`` configure automatic zone selection in the kernel when
+ using the ``contig-zones`` online policy. When
+ set, the kernel will default to ZONE_MOVABLE when
+ onlining a memory block, unless other zones can be kept
+ contiguous.
+======================== =======================================================
+
+See Documentation/admin-guide/kernel-parameters.txt for a more generic
+description of these command line parameters.
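+
+For example, to online newly added memory blocks automatically and to prefer
+ZONE_MOVABLE with the ``contig-zones`` policy, one might boot with (an
+illustrative combination, not a recommendation)::
+
+ memhp_default_state=online movable_node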
+
+Module Parameters
+------------------
+
+Instead of additional command line parameters or sysfs files, the
+``memory_hotplug`` subsystem now provides a dedicated namespace for module
+parameters. Module parameters can be set via the command line by prefixing
+them with ``memory_hotplug.``, such as::
+
+ memory_hotplug.memmap_on_memory=1
+
+and they can be observed (and some even modified at runtime) via::
+
+ /sys/module/memory_hotplug/parameters/
+
+The following module parameters are currently defined:
+
+================================ ===============================================
+``memmap_on_memory`` read-write: Allocate memory for the memmap from
+ the added memory block itself. Even if enabled,
+ actual support depends on various other system
+ properties and should only be regarded as a
+ hint whether the behavior would be desired.
+
+ While allocating the memmap from the memory
+ block itself makes memory hotplug less likely
+ to fail and keeps the memmap on the same NUMA
+ node in any case, it can fragment physical
+ memory in a way that huge pages in bigger
+ granularity cannot be formed on hotplugged
+ memory.
+
+ With value "force" it could result in memory
+ wastage due to memmap size limitations. For
+ example, if the memmap for a memory block
+ requires 1 MiB, but the pageblock size is 2
+ MiB, 1 MiB of hotplugged memory will be wasted.
+ Note that there are still cases where the
+ feature cannot be enforced: for example, if the
+ memmap is smaller than a single page, or if the
+ architecture does not support the forced mode
+ in all configurations.
+
+``online_policy`` read-write: Set the basic policy used for
+ automatic zone selection when onlining memory
+ blocks without specifying a target zone.
+ ``contig-zones`` has been the kernel default
+ before this parameter was added. After an
+ online policy was configured and memory was
+ online, the policy should not be changed
+ anymore.
+
+ When set to ``contig-zones``, the kernel will
+ try keeping zones contiguous. If a memory block
+ intersects multiple zones or no zone, the
+ behavior depends on the ``movable_node`` kernel
+ command line parameter: default to ZONE_MOVABLE
+ if set, default to the applicable kernel zone
+ (usually ZONE_NORMAL) if not set.
+
+ When set to ``auto-movable``, the kernel will
+ try onlining memory blocks to ZONE_MOVABLE if
+ possible according to the configuration and
+ memory device details. With this policy, one
+ can avoid zone imbalances when eventually
+ hotplugging a lot of memory later and still
+ wanting to be able to hotunplug as much as
+ possible reliably, very desirable in
+ virtualized environments. This policy ignores
+ the ``movable_node`` kernel command line
+ parameter and isn't really applicable in
+ environments that require it (e.g., bare metal
+ with hotunpluggable nodes) where hotplugged
+ memory might be exposed via the
+ firmware-provided memory map early during boot
+ to the system instead of getting detected,
+ added and onlined later during boot (such as
+ done by virtio-mem or by some hypervisors
+ implementing emulated DIMMs). As one example, a
+ hotplugged DIMM will be onlined either
+ completely to ZONE_MOVABLE or completely to
+ ZONE_NORMAL, not a mixture.
+ As another example, as many memory blocks
+ belonging to a virtio-mem device will be
+ onlined to ZONE_MOVABLE as possible,
+ special-casing units of memory blocks that can
+ only get hotunplugged together. *This policy
+ does not protect from setups that are
+ problematic with ZONE_MOVABLE and does not
+ change the zone of memory blocks dynamically
+ after they were onlined.*
+``auto_movable_ratio`` read-write: Set the maximum MOVABLE:KERNEL
+ memory ratio in % for the ``auto-movable``
+ online policy. Whether the ratio applies only
+ for the system across all NUMA nodes or also
+ per NUMA nodes depends on the
+ ``auto_movable_numa_aware`` configuration.
+
+ All accounting is based on present memory pages
+ in the zones combined with accounting per
+ memory device. Memory dedicated to the CMA
+ allocator is accounted as MOVABLE, although
+ residing on one of the kernel zones. The
+ possible ratio depends on the actual workload.
+ The kernel default is "301" %, for example,
+ allowing for hotplugging 24 GiB to a 8 GiB VM
+ and automatically onlining all hotplugged
+ memory to ZONE_MOVABLE in many setups. The
+ additional 1% deals with some pages being not
+ present, for example, because of some firmware
+ allocations.
+
+ Note that ZONE_NORMAL memory provided by one
+ memory device does not allow for more
+ ZONE_MOVABLE memory for a different memory
+ device. As one example, onlining memory of a
+ hotplugged DIMM to ZONE_NORMAL will not allow
+ for another hotplugged DIMM to get onlined to
+ ZONE_MOVABLE automatically. In contrast, memory
+ hotplugged by a virtio-mem device that got
+ onlined to ZONE_NORMAL will allow for more
+ ZONE_MOVABLE memory within *the same*
+ virtio-mem device.
+``auto_movable_numa_aware`` read-write: Configure whether the
+ ``auto_movable_ratio`` in the ``auto-movable``
+ online policy also applies per NUMA
+ node in addition to the whole system across all
+ NUMA nodes. The kernel default is "Y".
+
+ Disabling NUMA awareness can be helpful when
+ dealing with NUMA nodes that should be
+ completely hotunpluggable, onlining the memory
+ completely to ZONE_MOVABLE automatically if
+ possible.
+
+ Parameter availability depends on CONFIG_NUMA.
+================================ ===============================================
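+
+For example, the online policy and ratio might be configured at runtime like
+this (the values are only illustrative)::
+
+ % echo auto-movable > /sys/module/memory_hotplug/parameters/online_policy
+ % echo 301 > /sys/module/memory_hotplug/parameters/auto_movable_ratio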
+
+ZONE_MOVABLE
+============
-In general, memory offline can use 2 techniques.
+ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
+Further, having system RAM managed by ZONE_MOVABLE instead of one of the
+kernel zones can increase the number of possible transparent huge pages and
+dynamically allocated huge pages.
-(1) reclaim and free all memory in the memory block.
-(2) migrate all pages in the memory block.
+Most kernel allocations are unmovable. Important examples include the memory
+map (usually 1/64ths of memory), page tables, and kmalloc(). Such allocations
+can only be served from the kernel zones.
-In the current implementation, Linux's memory offline uses method (2), freeing
-all pages in the memory block by page migration. But not all pages are
-migratable. Under current Linux, migratable pages are anonymous pages and
-page caches. For offlining a memory block by migration, the kernel has to
-guarantee that the memory block contains only migratable pages.
+Most user space pages, such as anonymous memory, and page cache pages are
+movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.
-Now, a boot option for making a memory block which consists of migratable pages
-is supported. By specifying "kernelcore=" or "movablecore=" boot option, you can
-create ZONE_MOVABLE...a zone which is just used for movable pages.
-(See also Documentation/admin-guide/kernel-parameters.rst)
+Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
+allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
+absolutely no guarantee whether a memory block can be offlined successfully.
-Assume the system has "TOTAL" amount of memory at boot time, this boot option
-creates ZONE_MOVABLE as following.
+Zone Imbalances
+---------------
+
+Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
+which can harm the system or degrade performance. As one example, the kernel
+might crash because it runs out of free memory for unmovable allocations,
+although there is still plenty of free memory left in ZONE_MOVABLE.
-1) When kernelcore=YYYY boot option is used,
- Size of memory not for movable pages (not for offline) is YYYY.
- Size of memory for movable pages (for offline) is TOTAL-YYYY.
+Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
+are definitely impossible due to the overhead for the memory map.
-2) When movablecore=ZZZZ boot option is used,
- Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
- Size of memory for movable pages (for offline) is ZZZZ.
+Actual safe zone ratios depend on the workload. Extreme cases, like excessive
+long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.
.. note::
- Unfortunately, there is no information to show which memory block belongs
- to ZONE_MOVABLE. This is TBD.
+ CMA memory part of a kernel zone essentially behaves like memory in
+ ZONE_MOVABLE and similar considerations apply, especially when combining
+ CMA with ZONE_MOVABLE.
-.. _memory_hotplug_how_to_offline_memory:
+ZONE_MOVABLE Sizing Considerations
+----------------------------------
-How to offline memory
----------------------
+We usually expect that a large portion of available system RAM will actually
+be consumed by user space, either directly or indirectly via the page cache. In
+the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.
-You can offline a memory block by using the same sysfs interface that was used
-in memory onlining::
+With that in mind, it makes sense that we can have a big portion of system RAM
+managed by ZONE_MOVABLE. However, there are some things to consider when using
+ZONE_MOVABLE, especially when fine-tuning zone ratios:
- % echo offline > /sys/devices/system/memory/memoryXXX/state
+- Having a lot of offline memory blocks. Even offline memory blocks consume
+ memory for metadata and page tables in the direct map; having a lot of offline
+ memory blocks is not a typical case, though.
+
+- Memory ballooning without balloon compaction is incompatible with
+ ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
+ pseries CMM, fully support balloon compaction.
+
+ Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
+ disabled. In that case, balloon inflation will only perform unmovable
+ allocations and silently create a zone imbalance, usually triggered by
+ inflation requests from the hypervisor.
+
+- Gigantic pages are unmovable, resulting in user space consuming a
+ lot of unmovable memory.
+
+- Huge pages are unmovable when an architecture does not support huge
+ page migration, resulting in a similar issue as with gigantic pages.
+
+- Page tables are unmovable. Excessive swapping, mapping extremely large
+ files or ZONE_DEVICE memory can be problematic, although only really relevant
+ in corner cases. When we manage a lot of user space memory that has been
+ swapped out or is served from a file/persistent memory/... we still need a lot
+ of page tables to manage that memory once user space accessed that memory.
+
+- In certain DAX configurations the memory map for the device memory will be
+ allocated from the kernel zones.
+
+- KASAN can have a significant memory overhead, for example, consuming 1/8th of
+ the total system memory size as (unmovable) tracking metadata.
+
+- Long-term pinning of pages. Techniques that rely on long-term pinnings
+ (especially, RDMA and vfio/mdev) are fundamentally problematic with
+ ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
+ on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
+ have to be migrated off that zone while pinning. Pinning a page can fail
+ even if there is plenty of free memory in ZONE_MOVABLE.
+
+ In addition, using ZONE_MOVABLE might make page pinning more expensive,
+ because of the page migration overhead.
+
+By default, all the memory configured at boot time is managed by the kernel
+zones and ZONE_MOVABLE is not used.
+
+To enable ZONE_MOVABLE to include the memory present at boot and to control the
+ratio between movable and kernel zones there are two command line options:
+``kernelcore=`` and ``movablecore=``. See
+Documentation/admin-guide/kernel-parameters.rst for their description.
+
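+As a rough sketch (the sizes are purely illustrative and assume a 16 GiB
+machine), a MOVABLE:KERNEL ratio of about 3:1 could be requested on the kernel
+command line with either of::
+
+ kernelcore=4G
+ movablecore=12G
+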
+Memory Offlining and ZONE_MOVABLE
+---------------------------------
+
+Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
+block might fail:
+
+- Memory blocks with memory holes; this applies to memory blocks present during
+ boot and can apply to memory blocks hotplugged via the XEN balloon and the
+ Hyper-V balloon.
+
+- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
+ offlining; this applies to memory blocks present during boot only.
+
+- Special memory blocks prevented by the system from getting offlined. Examples
+ include any memory available during boot on arm64 or memory blocks spanning
+ the crashkernel area on s390x; this usually applies to memory blocks present
+ during boot only.
+
+- Memory blocks overlapping with CMA areas cannot be offlined; this applies to
+ memory blocks present during boot only.
+
+- Concurrent activity that operates on the same physical memory area, such as
+ allocating gigantic pages, can result in temporary offlining failures.
+
+- Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap
+ Optimization (HVO) is enabled.
+
+ Offlining code may be able to migrate huge page contents, but may not be able
+ to dissolve the source huge page because it fails allocating (unmovable) pages
+ for the vmemmap, because the system might not have free memory in the kernel
+ zones left.
+
+ Users that depend on memory offlining to succeed for movable zones should
+ carefully consider whether the memory savings gained from this feature are
+ worth the risk of possibly not being able to offline memory in certain
+ situations.
+
+Further, when running into out of memory situations while migrating pages, or
+when still encountering permanently unmovable pages within ZONE_MOVABLE
+(-> BUG), memory offlining will keep retrying until it eventually succeeds.
+
+When offlining is triggered from user space, the offlining context can be
+terminated by sending a signal. A timeout based offlining can easily be
+implemented via::
-If offline succeeds, the state of the memory block is changed to be "offline".
-If it fails, some error core (like -EBUSY) will be returned by the kernel.
-Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
-it. If it doesn't contain 'unmovable' memory, you'll get success.
-
-A memory block under ZONE_MOVABLE is considered to be able to be offlined
-easily. But under some busy state, it may return -EBUSY. Even if a memory
-block cannot be offlined due to -EBUSY, you can retry offlining it and may be
-able to offline it (or not). (For example, a page is referred to by some kernel
-internal call and released soon.)
-
-Consideration:
- Memory hotplug's design direction is to make the possibility of memory
- offlining higher and to guarantee unplugging memory under any situation. But
- it needs more work. Returning -EBUSY under some situation may be good because
- the user can decide to retry more or not by himself. Currently, memory
- offlining code does some amount of retry with 120 seconds timeout.
-
-Physical memory remove
-======================
-
-Need more implementation yet....
- - Notification completion of remove works by OS to firmware.
- - Guard from remove if not yet.
-
-
-Locking Internals
-=================
-
-When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
-the device_hotplug_lock should be held to:
-
-- synchronize against online/offline requests (e.g. via sysfs). This way, memory
- block devices can only be accessed (.online/.state attributes) by user
- space once memory has been fully added. And when removing memory, we
- know nobody is in critical sections.
-- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
-
-Especially, there is a possible lock inversion that is avoided using
-device_hotplug_lock when adding memory and user space tries to online that
-memory faster than expected:
-
-- device_online() will first take the device_lock(), followed by
- mem_hotplug_lock
-- add_memory_resource() will first take the mem_hotplug_lock, followed by
- the device_lock() (while creating the devices, during bus_add_device()).
-
-As the device is visible to user space before taking the device_lock(), this
-can result in a lock inversion.
-
-onlining/offlining of memory should be done via device_online()/
-device_offline() - to make sure it is properly synchronized to actions
-via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
-
-When adding/removing/onlining/offlining memory or adding/removing
-heterogeneous/device memory, we should always hold the mem_hotplug_lock in
-write mode to serialise memory hotplug (e.g. access to global/zone
-variables).
-
-In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
-mode allows for a quite efficient get_online_mems/put_online_mems
-implementation, so code accessing memory can protect from that memory
-vanishing.
-
-
-Future Work
-===========
-
- - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
- sysctl or new control file.
- - showing memory block and physical device relationship.
- - test and make it better memory offlining.
- - support HugeTLB page migration and offlining.
- - memmap removing at memory offline.
- - physical remove memory.
+ % timeout $TIMEOUT offline_block | failure_handling
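+
+A minimal sketch of such an ``offline_block`` helper (the helper itself and
+the memory block number XXX are placeholders, not something shipped with the
+kernel) could be::
+
+ #!/bin/sh
+ # Try to offline one memory block; timeout(1) interrupts the kernel's
+ # retry loop by sending a signal if this takes too long.
+ echo offline > /sys/devices/system/memory/memoryXXX/state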
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
new file mode 100644
index 000000000000..33e068830497
--- /dev/null
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Quick start
+===========
+Build the kernel with the following configurations.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+
+All set!
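+
+Whether a running kernel was built with these options can be checked, for
+example, with (the config file location is a distribution-dependent
+assumption)::
+
+ grep LRU_GEN /boot/config-$(uname -r)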
+
+Runtime options
+===============
+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
+following subsections.
+
+Kill switch
+-----------
+``enabled`` accepts different values to enable or disable the
+following components. Its default value depends on
+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
+unless some of them have unforeseen side effects. Writing to
+``enabled`` has no effect when a component is not supported by the
+hardware, and valid values will be accepted even when the main switch
+is off.
+
+====== ===============================================================
+Values Components
+====== ===============================================================
+0x0001 The main switch for the multi-gen LRU.
+0x0002 Clearing the accessed bit in leaf page table entries in large
+ batches, when MMU sets it (e.g., on x86). This behavior can
+ theoretically worsen lock contention (mmap_lock). If it is
+ disabled, the multi-gen LRU will suffer a minor performance
+ degradation for workloads that contiguously map hot pages,
+ whose accessed bits can be otherwise cleared by fewer larger
+ batches.
+0x0004 Clearing the accessed bit in non-leaf page table entries as
+ well, when MMU sets it (e.g., on x86). This behavior was not
+ verified on x86 varieties other than Intel and AMD. If it is
+ disabled, the multi-gen LRU will suffer a negligible
+ performance degradation.
+[yYnN] Apply to all the components above.
+====== ===============================================================
+
+E.g.,
+::
+
+ echo y >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0007
+ echo 5 >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0005
+
+Thrashing prevention
+--------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multi-gen LRU offers thrashing prevention to the
+majority of laptop and desktop users who do not have ``oomd``.
+
+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
+``N`` milliseconds from getting evicted. The OOM killer is triggered
+if this working set cannot be kept in memory. In other words, this
+option works as an adjustable pressure relief valve, and when open, it
+terminates applications that are hopefully not being used.
+
+Based on the average human detectable lag (~100ms), ``N=1000`` usually
+eliminates intolerable janks due to thrashing. Larger values like
+``N=3000`` make janks less noticeable at the risk of premature OOM
+kills.
+
+The default value ``0`` means disabled.
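+
+For example, to protect roughly the last second of the working set (the value
+follows the guidance above and is not a universal default)::
+
+ echo 1000 >/sys/kernel/mm/lru_gen/min_ttl_ms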
+
+Experimental features
+=====================
+``/sys/kernel/debug/lru_gen`` accepts commands described in the
+following subsections. Multiple command lines are supported, as is
+concatenation with delimiters ``,`` and ``;``.
+
+``/sys/kernel/debug/lru_gen_full`` provides additional stats for
+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
+evicted generations in this file.
+
+Working set estimation
+----------------------
+Working set estimation measures how much memory an application needs
+in a given time interval, and it is usually done with little impact on
+the performance of the application. E.g., data centers want to
+optimize job scheduling (bin packing) to improve memory utilization.
+When a new job comes in, the job scheduler needs to find out whether
+each server it manages can allocate a certain amount of memory for
+this new job before it can pick a candidate. To do so, the job
+scheduler needs to estimate the working sets of the existing jobs.
+
+When it is read, ``lru_gen`` returns a histogram of numbers of pages
+accessed over different time intervals for each memcg and node.
+``MAX_NR_GENS`` decides the number of bins for each histogram. The
+histograms are noncumulative.
+::
+
+ memcg memcg_id memcg_path
+ node node_id
+ min_gen_nr age_in_ms nr_anon_pages nr_file_pages
+ ...
+ max_gen_nr age_in_ms nr_anon_pages nr_file_pages
+
+Each bin contains an estimated number of pages that have been accessed
+within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
+and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
+the former is the largest and that of the latter is the smallest.
+
+Users can write the following command to ``lru_gen`` to create a new
+generation ``max_gen_nr+1``:
+
+ ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``
+
+``can_swap`` defaults to the swap setting and, if it is set to ``1``,
+it forces the scan of anon pages when swap is off, and vice versa.
+``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
+employs heuristics to reduce the overhead, which is likely to reduce
+the coverage as well.
+
+A typical use case is that a job scheduler runs this command at a
+certain time interval to create new generations, and it ranks the
+servers it manages based on the sizes of their cold pages defined by
+this time interval.
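+
+As an illustration (the memcg id ``1``, node ``0`` and ``max_gen_nr`` ``5`` are
+hypothetical and must match the current values reported by ``lru_gen``), a new
+generation could be created with::
+
+ echo '+ 1 0 5' >/sys/kernel/debug/lru_gen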
+
+Proactive reclaim
+-----------------
+Proactive reclaim induces page reclaim when there is no memory
+pressure. It usually targets cold pages only. E.g., when a new job
+comes in, the job scheduler wants to proactively reclaim cold pages on
+the server it selected, to improve the chance of successfully landing
+this new job.
+
+Users can write the following command to ``lru_gen`` to evict
+generations less than or equal to ``min_gen_nr``.
+
+ ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``
+
+``min_gen_nr`` should be less than ``max_gen_nr-1``, since
+``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
+the active list) and therefore cannot be evicted. ``swappiness``
+overrides the default value in ``/proc/sys/vm/swappiness``.
+``nr_to_reclaim`` limits the number of pages to evict.
+
+A typical use case is that a job scheduler runs this command before it
+tries to land a new job on a server. If it fails to materialize enough
+cold pages because of the overestimation, it retries on the next
+server according to the ranking result obtained from the working set
+estimation step. This less forceful approach limits the impacts on the
+existing jobs.
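+
+For instance (the ids and limits are again hypothetical), generations up to
+``3`` for memcg ``1`` on node ``0`` could be evicted, overriding swappiness to
+``60`` and reclaiming at most 4096 pages::
+
+ echo '- 1 0 3 60 4096' >/sys/kernel/debug/lru_gen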
diff --git a/Documentation/admin-guide/mm/nommu-mmap.rst b/Documentation/admin-guide/mm/nommu-mmap.rst
new file mode 100644
index 000000000000..530fed08de2c
--- /dev/null
+++ b/Documentation/admin-guide/mm/nommu-mmap.rst
@@ -0,0 +1,283 @@
+=============================
+No-MMU memory mapping support
+=============================
+
+The kernel has limited support for memory mapping under no-MMU conditions, such
+as are used in uClinux environments. From the userspace point of view, memory
+mapping is made use of in conjunction with the mmap() system call, the shmat()
+call and the execve() system call. From the kernel's point of view, execve()
+mapping is actually performed by the binfmt drivers, which call back into the
+mmap() routines to do the actual work.
+
+Memory mapping behaviour also involves the way fork(), vfork(), clone() and
+ptrace() work. Under uClinux there is no fork(), and clone() must be supplied
+the CLONE_VM flag.
+
+The behaviour is similar between the MMU and no-MMU cases, but not identical;
+and it's also much more restricted in the latter case:
+
+ (#) Anonymous mapping, MAP_PRIVATE
+
+ In the MMU case: VM regions backed by arbitrary pages; copy-on-write
+ across fork.
+
+ In the no-MMU case: VM regions backed by arbitrary contiguous runs of
+ pages.
+
+ (#) Anonymous mapping, MAP_SHARED
+
+ These behave very much like private mappings, except that they're
+ shared across fork() or clone() without CLONE_VM in the MMU case. Since
+ the no-MMU case doesn't support these, behaviour is identical to
+ MAP_PRIVATE there.
+
+ (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, !PROT_WRITE
+
+ In the MMU case: VM regions backed by pages read from file; changes to
+ the underlying file are reflected in the mapping; copied across fork.
+
+ In the no-MMU case:
+
+ - If one exists, the kernel will re-use an existing mapping to the
+ same segment of the same file if that has compatible permissions,
+ even if this was created by another process.
+
+ - If possible, the file mapping will be directly on the backing device
+ if the backing device has the NOMMU_MAP_DIRECT capability and
+ appropriate mapping protection capabilities. Ramfs, romfs, cramfs
+ and mtd might all permit this.
+
+ - If the backing device can't or won't permit direct sharing,
+ but does have the NOMMU_MAP_COPY capability, then a copy of the
+ appropriate bit of the file will be read into a contiguous bit of
+ memory and any extraneous space beyond the EOF will be cleared.
+
+ - Writes to the file do not affect the mapping; writes to the mapping
+ are visible in other processes (no MMU protection), but should not
+ happen.
+
+ (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, PROT_WRITE
+
+ In the MMU case: like the non-PROT_WRITE case, except that the pages in
+ question get copied before the write actually happens. From that point
+ on writes to the file underneath that page no longer get reflected into
+ the mapping's backing pages. The page is then backed by swap instead.
+
+ In the no-MMU case: works much like the non-PROT_WRITE case, except
+ that a copy is always taken and never shared.
+
+ (#) Regular file / blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
+
+ In the MMU case: VM regions backed by pages read from file; changes to
+ pages written back to file; writes to file reflected into pages backing
+ mapping; shared across fork.
+
+ In the no-MMU case: not supported.
+
+ (#) Memory backed regular file, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
+
+ In the MMU case: As for ordinary regular files.
+
+ In the no-MMU case: The filesystem providing the memory-backed file
+ (such as ramfs or tmpfs) may choose to honour an open, truncate, mmap
+ sequence by providing a contiguous sequence of pages to map. In that
+ case, a shared-writable memory mapping will be possible. It will work
+ as for the MMU case. If the filesystem does not provide any such
+ support, then the mapping request will be denied.
+
+ (#) Memory backed blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
+
+ In the MMU case: As for ordinary regular files.
+
+ In the no-MMU case: As for memory backed regular files, but the
+ blockdev must be able to provide a contiguous run of pages without
+ truncate being called. The ramdisk driver could do this if it allocated
+ all its memory as a contiguous array upfront.
+
+ (#) Memory backed chardev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
+
+ In the MMU case: As for ordinary regular files.
+
+ In the no-MMU case: The character device driver may choose to honour
+ the mmap() by providing direct access to the underlying device if it
+ provides memory or quasi-memory that can be accessed directly. Examples
+ of such are frame buffers and flash devices. If the driver does not
+ provide any such support, then the mapping request will be denied.
+
+
+Further notes on no-MMU MMAP
+============================
+
+ (#) A request for a private mapping of a file may return a buffer that is not
+ page-aligned. This is because XIP may take place, and the data may not be
+ page aligned in the backing store.
+
+ (#) A request for an anonymous mapping will always be page aligned. If
+ possible, the size of the request should be a power of two; otherwise some
+ of the space may be wasted, as the kernel must allocate a power-of-2
+ granule but will only discard the excess if appropriately configured,
+ since this has an effect on fragmentation.
+
+ (#) The memory allocated by a request for an anonymous mapping will normally
+ be cleared by the kernel before being returned in accordance with the
+ Linux man pages (ver 2.22 or later).
+
+ In the MMU case this can be achieved with reasonable performance as
+ regions are backed by virtual pages, with the contents only being mapped
+ to cleared physical pages when a write happens on that specific page
+ (prior to which, the pages are effectively mapped to the global zero page
+ from which reads can take place). This spreads out the time it takes to
+ initialize the contents of a page - depending on the write-usage of the
+ mapping.
+
+ In the no-MMU case, however, anonymous mappings are backed by physical
+ pages, and the entire map is cleared at allocation time. This can cause
+ significant delays during a userspace malloc() as the C library does an
+ anonymous mapping and the kernel then does a memset for the entire map.
+
+ However, for memory that isn't required to be precleared - such as that
+ returned by malloc() - mmap() can take a MAP_UNINITIALIZED flag to
+ indicate to the kernel that it shouldn't bother clearing the memory before
+ returning it. Note that CONFIG_MMAP_ALLOW_UNINITIALIZED must be enabled
+ to permit this, otherwise the flag will be ignored.
+
+ uClibc uses this to speed up malloc(), and the ELF-FDPIC binfmt uses this
+ to allocate the brk and stack region.
+
+ (#) A list of all the private copy and anonymous mappings on the system is
+ visible through /proc/maps in no-MMU mode.
+
+ (#) A list of all the mappings in use by a process is visible through
+ /proc/<pid>/maps in no-MMU mode.
+
+ (#) Supplying MAP_FIXED or requesting a particular mapping address will
+ result in an error.
+
+ (#) Files mapped privately usually have to have a read method provided by the
+ driver or filesystem so that the contents can be read into the memory
+ allocated if mmap() chooses not to map the backing device directly. An
+ error will result if they don't. This is most likely to be encountered
+ with character device files, pipes, fifos and sockets.
+
+
+Interprocess shared memory
+==========================
+
+Both SYSV IPC SHM shared memory and POSIX shared memory are supported in
+NOMMU mode. The former is provided through the usual mechanism, the latter
+through files created
+on ramfs or tmpfs mounts.
+
+
+Futexes
+=======
+
+Futexes are supported in NOMMU mode if the arch supports them. An error will
+be given if an address passed to the futex system call lies outside the
+mappings made by a process or if the mapping in which the address lies does not
+support futexes (such as an I/O chardev mapping).
+
+
+No-MMU mremap
+=============
+
+The mremap() function is partially supported. It may change the size of a
+mapping, and may move it [#]_ if MREMAP_MAYMOVE is specified and if the new size
+of the mapping exceeds the size of the slab object currently occupied by the
+memory to which the mapping refers, or if a smaller slab object could be used.
+
+MREMAP_FIXED is not supported, though it is ignored if there's no change of
+address and the object does not need to be moved.
+
+Shared mappings may not be moved. Shareable mappings may not be moved either,
+even if they are not currently shared.
+
+The mremap() function must be given an exact match for base address and size of
+a previously mapped object. It may not be used to create holes in existing
+mappings, move parts of existing mappings or resize parts of mappings. It must
+act on a complete mapping.
+
+.. [#] Not currently supported.
+
+
+Providing shareable character device support
+============================================
+
+To provide shareable character device support, a driver must provide a
+file->f_op->get_unmapped_area() operation. The mmap() routines will call this
+to get a proposed address for the mapping. This may return an error if it
+doesn't wish to honour the mapping because it's too long, at a weird offset,
+under some unsupported combination of flags or whatever.
+
+The driver should also provide backing device information with capabilities set
+to indicate the permitted types of mapping on such devices. The default is
+assumed to be readable and writable, not executable, and only shareable
+directly (can't be copied).
+
+The file->f_op->mmap() operation will be called to actually inaugurate the
+mapping. It can be rejected at that point. Returning the ENOSYS error will
+cause the mapping to be copied instead if NOMMU_MAP_COPY is specified.
+
+The vm_ops->close() routine will be invoked when the last mapping on a chardev
+is removed. An existing mapping will be shared, partially or not, if possible
+without notifying the driver.
+
+It is permitted also for the file->f_op->get_unmapped_area() operation to
+return -ENOSYS. This will be taken to mean that this operation just doesn't
+want to handle it, despite the fact it's got an operation. For instance, it
+might try directing the call to a secondary driver which turns out not to
+implement it. Such is the case for the framebuffer driver which attempts to
+direct the call to the device-specific driver. Under such circumstances, the
+mapping request will be rejected if NOMMU_MAP_COPY is not specified, and a
+copy mapped otherwise.
+
+.. important::
+
+ Some types of device may present a different appearance to anyone
+ looking at them in certain modes. Flash chips can be like this; for
+ instance if they're in programming or erase mode, you might see the
+ status reflected in the mapping, instead of the data.
+
+ In such a case, care must be taken lest userspace see a shared or a
+ private mapping showing such information when the driver is busy
+ controlling the device. Remember especially: private executable
+ mappings may still be mapped directly off the device under some
+ circumstances!
+
+
+Providing shareable memory-backed file support
+==============================================
+
+Provision of shared mappings on memory backed files is similar to the provision
+of support for shared mapped character devices. The main difference is that the
+filesystem providing the service will probably allocate a contiguous collection
+of pages and permit mappings to be made on that.
+
+It is recommended that a truncate operation applied to such a file that
+increases the file size, if that file is empty, be taken as a request to gather
+enough pages to honour a mapping. This is required to support POSIX shared
+memory.
+
+Memory backed devices are indicated by the mapping's backing device info having
+the memory_backed flag set.
+
+
+Providing shareable block device support
+========================================
+
+Provision of shared mappings on block device files is exactly the same as for
+character devices. If there isn't a real device underneath, then the driver
+should allocate sufficient contiguous memory to honour any supported mapping.
+
+
+Adjusting page trimming behaviour
+=================================
+
+NOMMU mmap automatically rounds up to the nearest power-of-2 number of pages
+when performing an allocation. This can have adverse effects on memory
+fragmentation, and as such, is left configurable. The default behaviour is to
+aggressively trim allocations and discard any excess pages back into the page
+allocator. In order to retain finer-grained control over fragmentation, this
+behaviour can either be disabled completely, or bumped up to a higher page
+watermark where trimming begins.
+
+Page trimming behaviour is configurable via the sysctl ``vm.nr_trim_pages``.
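+
+For example (assuming the usual semantics of this sysctl, where ``0`` disables
+trimming and any larger value acts as the watermark at which trimming starts;
+the value 32 below is arbitrary)::
+
+ # disable trimming of excess pages completely
+ sysctl -w vm.nr_trim_pages=0
+ # or only start trimming once an allocation leaves more excess pages than this
+ sysctl -w vm.nr_trim_pages=32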
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 067a90a1499c..a70f20ce1ffb 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -1,5 +1,3 @@
-.. _numa_memory_policy:
-
==================
NUMA Memory Policy
==================
@@ -111,7 +109,7 @@ VMA Policy
* A task may install a new VMA policy on a sub-range of a
previously mmap()ed region. When this happens, Linux splits
the existing virtual memory area into 2 or 3 VMAs, each with
- it's own policy.
+ its own policy.
* By default, VMA policy applies only to pages allocated after
the policy is installed. Any pages already faulted into the
@@ -245,6 +243,22 @@ MPOL_INTERLEAVED
address range or file. During system boot up, the temporary
interleaved system default policy works in this mode.
+MPOL_PREFERRED_MANY
+ This mode specifies that the allocation should be preferably
+ satisfied from the nodemask specified in the policy. If there is
+ a memory pressure on all nodes in the nodemask, the allocation
+ can fall back to all existing numa nodes. This is effectively
+ MPOL_PREFERRED allowed for a mask rather than a single node.
+
+MPOL_WEIGHTED_INTERLEAVE
+ This mode operates the same as MPOL_INTERLEAVE, except that
+ interleaving behavior is executed based on weights set in
+ /sys/kernel/mm/mempolicy/weighted_interleave/
+
+ Weighted interleave allocates pages on nodes according to a
+ weight. For example if nodes [0,1] are weighted [5,2], 5 pages
+ will be allocated on node0 for every 2 pages allocated on node1.
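+
+ As a sketch (assuming per-node weight files named ``nodeN`` under that
+ directory), the 5:2 example above could be configured with::
+
+ echo 5 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
+ echo 2 > /sys/kernel/mm/mempolicy/weighted_interleave/node1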
+
NUMA memory policy supports the following optional mode flags:
MPOL_F_STATIC_NODES
@@ -253,10 +267,10 @@ MPOL_F_STATIC_NODES
nodes changes after the memory policy has been defined.
Without this flag, any time a mempolicy is rebound because of a
- change in the set of allowed nodes, the node (Preferred) or
- nodemask (Bind, Interleave) is remapped to the new set of
- allowed nodes. This may result in nodes being used that were
- previously undesired.
+ change in the set of allowed nodes, the preferred nodemask (Preferred
+ Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
+ remapped to the new set of allowed nodes. This may result in nodes
+ being used that were previously undesired.
With this flag, if the user-specified nodes overlap with the
nodes allowed by the task's cpuset, then the memory policy is
@@ -353,7 +367,7 @@ and NUMA nodes. "Usage" here means one of the following:
2) examination of the policy to determine the policy mode and associated node
or node lists, if any, for page allocation. This is considered a "hot
path". Note that for MPOL_BIND, the "usage" extends across the entire
- allocation process, which may sleep during page reclaimation, because the
+ allocation process, which may sleep during page reclamation, because the
BIND policy nodemask is used, by reference, to filter ineligible nodes.
We can avoid taking an extra reference during the usages listed above as
@@ -401,7 +415,7 @@ follows:
Memory Policy APIs
==================
-Linux supports 3 system calls for controlling memory policy. These APIS
+Linux supports 4 system calls for controlling memory policy. These APIS
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.
@@ -453,6 +467,20 @@ requested via the 'flags' argument.
See the mbind(2) man page for more details.
+Set home node for a Range of Task's Address Space::
+
+ long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
+ unsigned long home_node,
+ unsigned long flags);
+
+sys_set_mempolicy_home_node() sets the home node for a VMA policy present in
+the task's address range. The system call updates the home node only for the
+existing mempolicy range; other address ranges are ignored. The home node is
+the NUMA node from which page allocations will preferentially be satisfied.
+Specifying the home node overrides the default allocation policy of
+allocating memory close to the node local to the executing CPU.
+
+
Memory Policy Command Line Interface
====================================
diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
index a80c3c37226e..90a12b6a8bfc 100644
--- a/Documentation/admin-guide/mm/numaperf.rst
+++ b/Documentation/admin-guide/mm/numaperf.rst
@@ -1,6 +1,7 @@
-.. _numaperf:
+=======================
+NUMA Memory Performance
+=======================
-=============
NUMA Locality
=============
@@ -56,7 +57,11 @@ nodes' access characteristics share the same performance relative to other
linked initiator nodes. Each target within an initiator's access class,
though, do not necessarily perform the same as each other.
-================
+The access class "1" is used to allow differentiation between initiators
+that are CPUs and hence suitable for generic task scheduling, and
+IO initiators such as GPUs and NICs. Unlike access class 0, only
+nodes containing CPUs are considered.
+
NUMA Performance
================
@@ -69,7 +74,7 @@ memory node's access class 0 initiators as follows::
/sys/devices/system/node/nodeY/access0/initiators/
These attributes apply only when accessed from nodes that have the
-are linked under the this access's inititiators.
+are linked under this access's initiators.
The performance characteristics the kernel provides for the local initiators
are exported are as follows::
@@ -88,7 +93,9 @@ The latency attributes are provided in nanoseconds.
The values reported here correspond to the rated latency and bandwidth
for the platform.
-==========
+Access class 1 takes the same form but only includes values for CPU to
+memory activity.
+
NUMA Cache
==========
@@ -129,7 +136,7 @@ will create the following directory::
/sys/devices/system/node/nodeX/memory_side_cache/
-If that directory is not present, the system either does not not provide
+If that directory is not present, the system either does not provide
a memory-side cache, or that information is not accessible to the kernel.
The attributes for each level of cache is provided under its cache
@@ -143,7 +150,7 @@ Each cache level's directory provides its attributes. For example, the
following shows a single cache level and the attributes available for
software to query::
- # tree sys/devices/system/node/node0/memory_side_cache/
+ # tree /sys/devices/system/node/node0/memory_side_cache/
/sys/devices/system/node/node0/memory_side_cache/
|-- index1
| |-- indexing
@@ -162,7 +169,6 @@ The "size" is the number of bytes provided by this cache level.
The "write_policy" will be 0 for write-back, and non-zero for
write-through caching.
-========
See Also
========
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index 340a5aee9b80..f5f065c67615 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -1,5 +1,3 @@
-.. _pagemap:
-
=============================
Examining Process Page Tables
=============================
@@ -19,9 +17,11 @@ There are four components to pagemap:
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see
- :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`)
+ Documentation/admin-guide/mm/soft-dirty.rst)
* Bit 56 page exclusively mapped (since 4.2)
- * Bits 57-60 zero
+ * Bit 57 pte is uffd-wp write-protected (since 5.13) (see
+ Documentation/admin-guide/mm/userfaultfd.rst)
+ * Bits 58-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5)
* Bit 62 page swapped
* Bit 63 page present
@@ -44,7 +44,7 @@ There are four components to pagemap:
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
times each page is mapped, indexed by PFN.
-The page-types tool in the tools/vm directory can be used to query the
+The page-types tool in the tools/mm directory can be used to query the
number of times a page is mapped.
* ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
@@ -88,13 +88,14 @@ Short descriptions to the page flags
====================================
0 - LOCKED
- page is being locked for exclusive access, e.g. by undergoing read/write IO
+ The page is being locked for exclusive access, e.g. by undergoing read/write
+ IO.
7 - SLAB
- page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
- When compound page is used, SLUB/SLQB will only set this flag on the head
- page; SLOB will not flag it at all.
+ The page is managed by the SLAB/SLUB kernel memory allocator.
+ When compound page is used, either will only set this flag on the head
+ page.
10 - BUDDY
- a free memory block managed by the buddy system allocator
+ A free memory block managed by the buddy system allocator.
The buddy system organizes free memory in blocks of various orders.
An order N block has 2^N physically contiguous pages, with the BUDDY flag
set for and _only_ for the first page.
@@ -102,75 +103,74 @@ Short descriptions to the page flags
A compound page with order N consists of 2^N physically contiguous pages.
A compound page with order 2 takes the form of "HTTT", where H donates its
head page and T donates its tail page(s). The major consumers of compound
- pages are hugeTLB pages
- (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`),
+ pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst),
the SLUB etc. memory allocators and various device drivers.
However in this interface, only huge/giga pages are made visible
to end users.
16 - COMPOUND_TAIL
A compound page tail (see description above).
17 - HUGE
- this is an integral part of a HugeTLB page
+ This is an integral part of a HugeTLB page.
19 - HWPOISON
- hardware detected memory corruption on this page: don't touch the data!
+ Hardware detected memory corruption on this page: don't touch the data!
20 - NOPAGE
- no page frame exists at the requested address
+ No page frame exists at the requested address.
21 - KSM
- identical memory pages dynamically shared between one or more processes
+ Identical memory pages dynamically shared between one or more processes.
22 - THP
- contiguous pages which construct transparent hugepages
+ Contiguous pages which construct transparent hugepages.
23 - OFFLINE
- page is logically offline
+ The page is logically offline.
24 - ZERO_PAGE
- zero page for pfn_zero or huge_zero page
+ Zero page for pfn_zero or huge_zero page.
25 - IDLE
- page has not been accessed since it was marked idle (see
- :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
+ The page has not been accessed since it was marked idle (see
+ Documentation/admin-guide/mm/idle_page_tracking.rst).
Note that this flag may be stale in case the page was accessed via
a PTE. To make sure the flag is up-to-date one has to read
``/sys/kernel/mm/page_idle/bitmap`` first.
26 - PGTABLE
- page is in use as a page table
+ The page is in use as a page table.
IO related page flags
---------------------
1 - ERROR
- IO error occurred
+ IO error occurred.
3 - UPTODATE
- page has up-to-date data
+ The page has up-to-date data.
ie. for file backed page: (in-memory data revision >= on-disk one)
4 - DIRTY
- page has been written to, hence contains new data
+ The page has been written to, hence contains new data.
i.e. for file backed page: (in-memory data revision > on-disk one)
8 - WRITEBACK
- page is being synced to disk
+ The page is being synced to disk.
LRU related page flags
----------------------
5 - LRU
- page is in one of the LRU lists
+ The page is in one of the LRU lists.
6 - ACTIVE
- page is in the active LRU list
+ The page is in the active LRU list.
18 - UNEVICTABLE
- page is in the unevictable (non-)LRU list It is somehow pinned and
+ The page is in the unevictable (non-)LRU list. It is somehow pinned and
not a candidate for LRU page reclaims, e.g. ramfs pages,
- shmctl(SHM_LOCK) and mlock() memory segments
+ shmctl(SHM_LOCK) and mlock() memory segments.
2 - REFERENCED
- page has been referenced since last LRU list enqueue/requeue
+ The page has been referenced since last LRU list enqueue/requeue.
9 - RECLAIM
- page will be reclaimed soon after its pageout IO completed
+ The page will be reclaimed soon after its pageout IO completed.
11 - MMAP
- a memory mapped page
+ A memory mapped page.
12 - ANON
- a memory mapped page that is not part of a file
+ A memory mapped page that is not part of a file.
13 - SWAPCACHE
- page is mapped to swap space, i.e. has an associated swap entry
+ The page is mapped to swap space, i.e. has an associated swap entry.
14 - SWAPBACKED
- page is backed by swap/RAM
+ The page is backed by swap/RAM.
-The page-types tool in the tools/vm directory can be used to query the
+The page-types tool in the tools/mm directory can be used to query the
above flags.
Using pagemap to do something useful
@@ -194,6 +194,28 @@ you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced
once.
+Exceptions for Shared Memory
+============================
+
+Page table entries for shared pages are cleared when the pages are zapped or
+swapped out. This makes swapped out pages indistinguishable from never-allocated
+ones.
+
+In kernel space, the swap location can still be retrieved from the page cache.
+However, values stored only on the normal PTE get lost irretrievably when the
+page is swapped out (i.e. SOFT_DIRTY).
+
+In user space, whether the page is present, swapped or none can be deduced with
+the help of lseek and/or mincore system calls.
+
+lseek() can differentiate between accessed pages (present or swapped out) and
+holes (none/non-allocated) by specifying the SEEK_DATA flag on the file where
+the pages are backed. For anonymous shared pages, the file can be found in
+``/proc/pid/map_files/``.
+
+mincore() can differentiate between pages in memory (present, including swap
+cache) and out of memory (swapped out or none/non-allocated).
+
Other notes
===========
@@ -205,3 +227,93 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
always 12 at most architectures). Since Linux 3.11 their meaning changes
after first clear of soft-dirty bits. Since Linux 4.2 they are used for
flags unconditionally.
+
+Pagemap Scan IOCTL
+==================
+
+The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get or optionally
+clear the info about page table entries. The following operations are supported
+in this IOCTL:
+
+- Scan the address range and get the memory ranges matching the provided criteria.
+ This is performed when the output buffer is specified.
+- Write-protect the pages. The ``PM_SCAN_WP_MATCHING`` is used to write-protect
+ the pages of interest. The ``PM_SCAN_CHECK_WPASYNC`` aborts the operation if
+ non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING`` can be
+ used with or without ``PM_SCAN_CHECK_WPASYNC``.
+- Both of those operations can be combined into one atomic operation where we can
+ get and write protect the pages as well.
+
+Following flags about pages are currently supported:
+
+- ``PAGE_IS_WPALLOWED`` - Page has async-write-protection enabled
+- ``PAGE_IS_WRITTEN`` - Page has been written to from the time it was write protected
+- ``PAGE_IS_FILE`` - Page is file backed
+- ``PAGE_IS_PRESENT`` - Page is present in the memory
+- ``PAGE_IS_SWAPPED`` - Page is swapped
+- ``PAGE_IS_PFNZERO`` - Page has zero PFN
+- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
+- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty
+
+The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
+
+ 1. The size of the ``struct pm_scan_arg`` must be specified in the ``size``
+ field. This field will be helpful in recognizing the structure if extensions
+ are done later.
+ 2. The flags can be specified in the ``flags`` field. The ``PM_SCAN_WP_MATCHING``
+ and ``PM_SCAN_CHECK_WPASYNC`` are the only added flags at this time. The get
+ operation is performed only if the output buffer is provided.
+ 3. The range is specified through ``start`` and ``end``.
+ 4. The walk can abort before visiting the complete range, for example when the
+ user buffer gets full. The walk ending address is specified in ``end_walk``.
+ 5. The output buffer of ``struct page_region`` array and size is specified in
+ ``vec`` and ``vec_len``.
+ 6. The optional maximum number of pages to report is specified in ``max_pages``.
+ 7. The masks are specified in ``category_mask``, ``category_anyof_mask``,
+ ``category_inverted`` and ``return_mask``.
+
+Find pages which have been written and WP them as well::
+
+ struct pm_scan_arg arg = {
+ .size = sizeof(arg),
+ .flags = PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC,
+ ..
+ .category_mask = PAGE_IS_WRITTEN,
+ .return_mask = PAGE_IS_WRITTEN,
+ };
+
+Find pages which have been written, are file backed, not swapped and either
+present or huge::
+
+ struct pm_scan_arg arg = {
+ .size = sizeof(arg),
+ .flags = 0,
+ ..
+ .category_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED,
+ .category_inverted = PAGE_IS_SWAPPED,
+ .category_anyof_mask = PAGE_IS_PRESENT | PAGE_IS_HUGE,
+ .return_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED |
+ PAGE_IS_PRESENT | PAGE_IS_HUGE,
+ };
+
+The ``PAGE_IS_WRITTEN`` flag can be considered a better-performing alternative
+to the soft-dirty flag. It is not affected by the kernel's VMA merging, and
+hence the user can find the true soft-dirty pages for normal pages. (There may
+still be extra dirty pages reported for THP or Hugetlb pages.)
+
+"PAGE_IS_WRITTEN" category is used with uffd write protect-enabled ranges to
+implement memory dirty tracking in userspace:
+
+ 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
+ 2. The ``UFFD_FEATURE_WP_UNPOPULATED`` and ``UFFD_FEATURE_WP_ASYNC`` features
+ are set by ``UFFDIO_API`` IOCTL.
+ 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
+ through ``UFFDIO_REGISTER`` IOCTL.
+ 4. Then any part of the registered memory or the whole memory region must
+ be write protected using ``PAGEMAP_SCAN`` IOCTL with flag ``PM_SCAN_WP_MATCHING``
+ or the ``UFFDIO_WRITEPROTECT`` IOCTL can be used. Both of these perform the
+ same operation. The former is better in terms of performance.
+ 5. Now the ``PAGEMAP_SCAN`` IOCTL can be used to either just find pages which
+ have been written to since they were last marked and/or optionally write protect
+ the pages as well.
diff --git a/Documentation/admin-guide/mm/shrinker_debugfs.rst b/Documentation/admin-guide/mm/shrinker_debugfs.rst
new file mode 100644
index 000000000000..c582033bd113
--- /dev/null
+++ b/Documentation/admin-guide/mm/shrinker_debugfs.rst
@@ -0,0 +1,133 @@
+==========================
+Shrinker Debugfs Interface
+==========================
+
+The shrinker debugfs interface provides visibility into the kernel memory
+shrinkers subsystem and allows getting information about individual shrinkers
+and interacting with them.
+
+For each shrinker registered in the system a directory in **<debugfs>/shrinker/**
+is created. The directory's name is composed of the shrinker's name and a
+unique id: e.g. *kfree_rcu-0* or *sb-xfs:vda1-36*.
+
+Each shrinker directory contains **count** and **scan** files, which allow
+triggering the *count_objects()* and *scan_objects()* callbacks for each memcg and
+numa node (if applicable).
+
+Usage:
+------
+
+1. *List registered shrinkers*
+
+ ::
+
+ $ cd /sys/kernel/debug/shrinker/
+ $ ls
+ dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
+ mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
+ mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
+ rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
+ sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
+ sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
+ sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
+ sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
+ sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
+ sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
+ sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
+ sb-dax-11 sb-proc-45 sb-tmpfs-35
+ sb-debugfs-7 sb-proc-46 sb-tmpfs-40
+
+2. *Get information about a specific shrinker*
+
+ ::
+
+ $ cd sb-btrfs\:vda2-24/
+ $ ls
+ count scan
+
+3. *Count objects*
+
+ Each line in the output has the following format::
+
+ <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1> ...
+ <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1> ...
+ ...
+
+ If a memcg has no objects on any NUMA node, its line is omitted. If there
+ are no objects at all, the output might be empty.
+
+ If the shrinker is not memcg-aware or CONFIG_MEMCG is off, 0 is printed
+ as cgroup inode id. If the shrinker is not numa-aware, 0's are printed
+ for all nodes except the first one.
+ ::
+
+ $ cat count
+ 1 224 2
+ 21 98 0
+ 55 818 10
+ 2367 2 0
+ 2401 30 0
+ 225 13 0
+ 599 35 0
+ 939 124 0
+ 1041 3 0
+ 1075 1 0
+ 1109 1 0
+ 1279 60 0
+ 1313 7 0
+ 1347 39 0
+ 1381 3 0
+ 1449 14 0
+ 1483 63 0
+ 1517 53 0
+ 1551 6 0
+ 1585 1 0
+ 1619 6 0
+ 1653 40 0
+ 1687 11 0
+ 1721 8 0
+ 1755 4 0
+ 1789 52 0
+ 1823 888 0
+ 1857 1 0
+ 1925 2 0
+ 1959 32 0
+ 2027 22 0
+ 2061 9 0
+ 2469 799 0
+ 2537 861 0
+ 2639 1 0
+ 2707 70 0
+ 2775 4 0
+ 2877 84 0
+ 293 1 0
+ 735 8 0
+
+4. *Scan objects*
+
+ The expected input format::
+
+ <cgroup inode id> <numa id> <number of objects to scan>
+
+ For a non-memcg-aware shrinker or on a system with no memory
+ cgroups, **0** should be passed as the cgroup id.
+ ::
+
+ $ cd /sys/kernel/debug/shrinker/
+ $ cd sb-btrfs\:vda2-24/
+
+ $ cat count | head -n 5
+ 1 212 0
+ 21 97 0
+ 55 802 5
+ 2367 2 0
+ 225 13 0
+
+ $ echo "55 0 200" > scan
+
+ $ cat count | head -n 5
+ 1 212 0
+ 21 96 0
+ 55 752 5
+ 2367 2 0
+ 225 13 0
diff --git a/Documentation/admin-guide/mm/soft-dirty.rst b/Documentation/admin-guide/mm/soft-dirty.rst
index cb0cfd6672fa..aeea936caa44 100644
--- a/Documentation/admin-guide/mm/soft-dirty.rst
+++ b/Documentation/admin-guide/mm/soft-dirty.rst
@@ -1,5 +1,3 @@
-.. _soft_dirty:
-
===============
Soft-Dirty PTEs
===============
diff --git a/Documentation/admin-guide/mm/swap_numa.rst b/Documentation/admin-guide/mm/swap_numa.rst
new file mode 100644
index 000000000000..2e630627bcee
--- /dev/null
+++ b/Documentation/admin-guide/mm/swap_numa.rst
@@ -0,0 +1,78 @@
+===========================================
+Automatically bind swap device to numa node
+===========================================
+
+If the system has more than one swap device and the swap devices have node
+information, we can make use of this information to decide which swap
+device to use in get_swap_pages() to get better performance.
+
+
+How to use this feature
+=======================
+
+Each swap device has a priority, which decides the order in which it is used.
+To make use of automatic binding, there is no need to manipulate priority settings
+for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and
+swapB, with swapA attached to node 0 and swapB attached to node 1, are going
+to be swapped on. Simply swapping them on by doing::
+
+ # swapon /dev/swapA
+ # swapon /dev/swapB
+
+Then node 0 will use the two swap devices in the order of swapA then swapB and
+node 1 will use the two swap devices in the order of swapB then swapA. Note
+that the order of them being swapped on doesn't matter.
+
+A more complex example on a 4 node machine. Assume 6 swap devices are going to
+be swapped on: swapA and swapB are attached to node 0, swapC is attached to
+node 1, swapD and swapE are attached to node 2 and swapF is attached to node3.
+The way to swap them on is the same as above::
+
+ # swapon /dev/swapA
+ # swapon /dev/swapB
+ # swapon /dev/swapC
+ # swapon /dev/swapD
+ # swapon /dev/swapE
+ # swapon /dev/swapF
+
+Then node 0 will use them in the order of::
+
+ swapA/swapB -> swapC -> swapD -> swapE -> swapF
+
+swapA and swapB will be used in a round robin mode before any other swap device.
+
+node 1 will use them in the order of::
+
+ swapC -> swapA -> swapB -> swapD -> swapE -> swapF
+
+node 2 will use them in the order of::
+
+ swapD/swapE -> swapA -> swapB -> swapC -> swapF
+
+Similarly, swapD and swapE will be used in a round robin mode before any
+other swap devices.
+
+node 3 will use them in the order of::
+
+ swapF -> swapA -> swapB -> swapC -> swapD -> swapE
+
+
+Implementation details
+======================
+
+The current code uses a priority based list, swap_avail_list, to decide
+which swap device to use and if multiple swap devices share the same
+priority, they are used round robin. This change here replaces the single
+global swap_avail_list with a per-numa-node list, i.e. for each numa node,
+it sees its own priority based list of available swap devices. Swap
+device's priority can be promoted on its matching node's swap_avail_list.
+
+A swap device's priority is set as follows: the user can set a value >= 0,
+or the system will pick one starting from -1 and counting downwards. The
+priority value in the swap_avail_list is the negated value of the swap
+device's priority, due to plist being sorted from low to high. The new
+policy doesn't change the semantics for the priority >= 0 cases; the
+previous "starting from -1 then downwards" now becomes "starting from -2
+then downwards", and -1 is reserved
+as the promoted value. So if multiple swap devices are attached to the same
+node, they will all be promoted to priority -1 on that node's plist and will
+be used round robin before any other swap devices.
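+
+For reference (the device name is a placeholder), the traditional global
+ordering can still be requested by assigning an explicit non-negative priority
+at swapon time, which is not subject to the per-node promotion described
+above::
+
+ # swapon -p 10 /dev/swapA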
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 6a233e42be08..04eb45a2f940 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -1,5 +1,3 @@
-.. _admin_guide_transhuge:
-
============================
Transparent Hugepage Support
============================
@@ -47,10 +45,25 @@ components:
the two is using hugepages just because of the fact the TLB miss is
going to run faster.
+Modern kernels support "multi-size THP" (mTHP), which introduces the
+ability to allocate memory in blocks that are bigger than a base page
+but smaller than traditional PMD-size (as described above), in
+increments of a power-of-2 number of pages. mTHP can back anonymous
+memory (for example 16K, 32K, 64K, etc). These THPs continue to be
+PTE-mapped, but in many cases can still provide similar benefits to
+those outlined above: Page faults are significantly reduced (by a
+factor of e.g. 4, 8, 16, etc), but latency spikes are much less
+prominent because the size of each page isn't as huge as the PMD-sized
+variant and there is less memory to clear in each page fault. Some
+architectures also employ TLB compression mechanisms to squeeze more
+entries in when a set of PTEs are virtually and physically contiguous
+and appropriately aligned. In this case, TLB misses will occur less
+often.
+
THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely
disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into huge pages.
+collapses sequences of basic pages into PMD-sized huge pages.
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
interface and using madvise(2) and prctl(2) system calls.
@@ -97,12 +110,40 @@ Global THP controls
Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
-system wide. This can be achieved with one of::
+system wide. This can be achieved per-supported-THP-size with one of::
+
+ echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+ echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+ echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+
+where <size> is the hugepage size being addressed, the available sizes
+for which vary by system.
+
+For example::
+
+ echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
+
+Alternatively it is possible to specify that a given hugepage size
+will inherit the top-level "enabled" value::
+
+ echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+
+For example::
+
+ echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
+
+The top-level setting (for use with "inherit") can be set by issuing
+one of the following commands::
echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled
+By default, PMD-sized hugepages have enabled="inherit" and all other
+hugepage sizes have enabled="never". If enabling multiple hugepage
+sizes, the kernel will select the most appropriate enabled size for a
+given allocation.
+
It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise
regions or to never try to defrag memory and simply fallback to regular
@@ -148,25 +189,34 @@ madvise
never
should be self-explanatory.
-By default kernel tries to use huge zero page on read page fault to
-anonymous mapping. It's possible to disable huge zero page by writing 0
-or enable it back by writing 1::
+By default kernel tries to use huge, PMD-mappable zero page on read
+page fault to anonymous mapping. It's possible to disable huge zero
+page by writing 0 or enable it back by writing 1::
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
-Some userspace (such as a test program, or an optimized memory allocation
-library) may want to know the size (in bytes) of a transparent hugepage::
+Some userspace (such as a test program, or an optimized memory
+allocation library) may want to know the size (in bytes) of a
+PMD-mappable transparent hugepage::
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
-khugepaged will be automatically started when
-transparent_hugepage/enabled is set to "always" or "madvise, and it'll
-be automatically shutdown if it's set to "never".
+khugepaged will be automatically started when one or more hugepage
+sizes are enabled (either by directly setting "always" or "madvise",
+or by setting "inherit" while the top-level enabled is set to "always"
+or "madvise"), and it'll be automatically shutdown when the last
+hugepage size is disabled (either by directly setting "never", or by
+setting "inherit" while the top-level enabled is set to "never").
Khugepaged controls
-------------------
+.. note::
+ khugepaged currently only searches for opportunities to collapse to
+ PMD-sized THP and no attempt is made to collapse to other THP
+ sizes.
+
khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
@@ -191,7 +241,14 @@ allocation failure to throttle the next allocation attempt::
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
-The khugepaged progress can be seen in the number of pages collapsed::
+The khugepaged progress can be seen in the number of pages collapsed (note
+that this counter may not be an exact count of the number of pages
+collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
+being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
+one 2M hugepage. Each may happen independently, or together, depending on
+the type of memory and the failures that occur. As such, this value should
+be interpreted roughly as a sign of progress, and counters in /proc/vmstat
+consulted for more accurate accounting)::
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
@@ -277,19 +334,26 @@ force
Need of application restart
===========================
-The transparent_hugepage/enabled values and tmpfs mount option only affect
-future behavior. So to make them effective you need to restart any
-application that could have been using hugepages. This also applies to the
-regions registered in khugepaged.
+The transparent_hugepage/enabled and
+transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
+option only affect future behavior. So to make them effective you need
+to restart any application that could have been using hugepages. This
+also applies to the regions registered in khugepaged.
Monitoring usage
================
-The number of anonymous transparent huge pages currently used by the
+.. note::
+ Currently the below counters only record events relating to
+ PMD-sized THP. Events relating to other THP sizes are not included.
+
+The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
-To identify what applications are using anonymous transparent huge pages,
-it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
-for each mapping.
+To identify what applications are using PMD-sized anonymous transparent huge
+pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
+fields for each mapping. (Note that AnonHugePages only applies to traditional
+PMD-sized THP for historical reasons and should have been called
+AnonHugePmdMapped).
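+
+For example, a small C helper along these lines (purely illustrative, not a
+kernel interface; it simply parses the smaps text) could sum those fields
+for a process::
+
+  #include <stdio.h>
+  #include <sys/types.h>
+
+  /* Sum of the AnonHugePages fields (in kB) across all mappings of a
+   * process, i.e. a rough estimate of its PMD-sized anonymous THP usage. */
+  static long anon_huge_kb(pid_t pid)
+  {
+          char path[64], line[256];
+          long total = 0, kb;
+          FILE *f;
+
+          snprintf(path, sizeof(path), "/proc/%d/smaps", (int)pid);
+          f = fopen(path, "r");
+          if (!f)
+                  return -1;
+          while (fgets(line, sizeof(line), f))
+                  if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
+                          total += kb;      /* one field per mapping */
+          fclose(f);
+          return total;
+  }
+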
The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
@@ -305,8 +369,7 @@ monitor how successfully the system is providing huge pages for use.
thp_fault_alloc
is incremented every time a huge page is successfully
- allocated to handle a page fault. This applies to both the
- first time a page is faulted and for COW faults.
+ allocated to handle a page fault.
thp_collapse_alloc
is incremented by khugepaged when it has found
@@ -367,10 +430,9 @@ thp_split_pmd
page table entry.
thp_zero_page_alloc
- is incremented every time a huge zero page is
- successfully allocated. It includes allocations which where
- dropped due race with other allocation. Note, it doesn't count
- every map of the huge zero page, only its allocation.
+ is incremented every time a huge zero page used for thp is
+ successfully allocated. Note, it doesn't count every map of
+ the huge zero page, only its allocation.
thp_zero_page_alloc_failed
is incremented if kernel fails to allocate
@@ -402,30 +464,15 @@ compact_fail
is incremented if the system tries to compact memory
but failed.
-compact_pages_moved
- is incremented each time a page is moved. If
- this value is increasing rapidly, it implies that the system
- is copying a lot of data to satisfy the huge page allocation.
- It is possible that the cost of copying exceeds any savings
- from reduced TLB misses.
-
-compact_pagemigrate_failed
- is incremented when the underlying mechanism
- for moving a page failed.
-
-compact_blocks_moved
- is incremented each time memory compaction examines
- a huge page aligned range of pages.
-
It is possible to establish how long the stalls were using the function
-tracer to record how long was spent in __alloc_pages_nodemask and
+tracer to record how long was spent in __alloc_pages() and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.
Optimizing the applications
===========================
-To be guaranteed that the kernel will map a 2M page immediately in any
+To be guaranteed that the kernel will map a THP immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 1dc2d5f823b4..e5cc8848dcb3 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -1,5 +1,3 @@
-.. _userfaultfd:
-
===========
Userfaultfd
===========
@@ -17,7 +15,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
Design
======
-Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
+Userspace creates a new userfaultfd, initializes it, and registers one or more
+regions of virtual memory with it. Then, any page faults which occur within the
+region(s) result in a message being delivered to the userfaultfd, notifying
+userspace of the fault.
The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:
@@ -34,12 +35,11 @@ The real advantage of userfaults if compared to regular virtual memory
management of mremap/mprotect is that the userfaults in all their
operations never involve heavyweight structures like vmas (in fact the
``userfaultfd`` runtime load never takes the mmap_lock for writing).
-
Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that.
-The ``userfaultfd`` once opened by invoking the syscall, can also be
+The ``userfaultfd``, once created, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on
@@ -50,6 +50,39 @@ is a corner case that would currently return ``-EBUSY``).
API
===
+Creating a userfaultfd
+----------------------
+
+There are two ways to create a new userfaultfd, each of which provides ways to
+restrict access to this functionality (since historically userfaultfds which
+handle kernel page faults have been a useful tool for exploiting the kernel).
+
+The first way, supported since userfaultfd was introduced, is the
+userfaultfd(2) syscall. Access to this is controlled in several ways:
+
+- Any user can always create a userfaultfd which traps userspace page faults
+ only. Such a userfaultfd can be created using the userfaultfd(2) syscall
+ with the flag UFFD_USER_MODE_ONLY.
+
+- In order to also trap kernel page faults for the address space, either the
+ process needs the CAP_SYS_PTRACE capability, or the system must have
+ vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd
+ is set to 0.
+
+The second way, added to the kernel more recently, is by opening
+/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method
+yields userfaultfds equivalent to those of the userfaultfd(2) syscall.
+
+Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal
+filesystem permissions (user/group/mode), which gives fine grained access to
+userfaultfd specifically, without also granting other unrelated privileges at
+the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access
+to /dev/userfaultfd can always create userfaultfds that trap kernel page faults;
+vm.unprivileged_userfaultfd is not considered.
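+
+As an illustration, a minimal C sketch of both creation paths might look as
+follows (assuming a recent ``linux/userfaultfd.h`` that defines
+``UFFD_USER_MODE_ONLY`` and ``USERFAULTFD_IOC_NEW``; error handling is
+abbreviated)::
+
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <sys/ioctl.h>
+  #include <sys/syscall.h>
+  #include <unistd.h>
+  #include <linux/userfaultfd.h>
+
+  static int create_uffd(void)
+  {
+          /* First way: the userfaultfd(2) syscall.  UFFD_USER_MODE_ONLY
+           * restricts the fd to userspace faults, so no privilege is
+           * needed. */
+          int uffd = syscall(__NR_userfaultfd,
+                             O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
+          if (uffd >= 0)
+                  return uffd;
+
+          /* Second way: /dev/userfaultfd, gated by file permissions only. */
+          int dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
+          if (dev < 0) {
+                  perror("open /dev/userfaultfd");
+                  return -1;
+          }
+          uffd = ioctl(dev, USERFAULTFD_IOC_NEW, O_CLOEXEC | O_NONBLOCK);
+          close(dev);
+          return uffd;
+  }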
+
+Initializing a userfaultfd
+--------------------------
+
When first opened the ``userfaultfd`` must be enabled invoking the
``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
a later API version) which will specify the ``read/POLLIN`` protocol
@@ -63,36 +96,40 @@ the generic ioctl available.
The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
defines what memory types are supported by the ``userfaultfd`` and what
-events, except page fault notifications, may be generated.
-
-If the kernel supports registering ``userfaultfd`` ranges on hugetlbfs
-virtual memory areas, ``UFFD_FEATURE_MISSING_HUGETLBFS`` will be set in
-``uffdio_api.features``. Similarly, ``UFFD_FEATURE_MISSING_SHMEM`` will be
-set if the kernel supports registering ``userfaultfd`` ranges on shared
-memory (covering all shmem APIs, i.e. tmpfs, ``IPCSHM``, ``/dev/zero``,
-``MAP_SHARED``, ``memfd_create``, etc).
-
-The userland application that wants to use ``userfaultfd`` with hugetlbfs
-or shared memory need to set the corresponding flag in
-``uffdio_api.features`` to enable those features.
-
-If the userland desires to receive notifications for events other than
-page faults, it has to verify that ``uffdio_api.features`` has appropriate
-``UFFD_FEATURE_EVENT_*`` bits set. These events are described in more
-detail below in `Non-cooperative userfaultfd`_ section.
-
-Once the ``userfaultfd`` has been enabled the ``UFFDIO_REGISTER`` ioctl should
-be invoked (if present in the returned ``uffdio_api.ioctls`` bitmask) to
-register a memory range in the ``userfaultfd`` by setting the
+events, except page fault notifications, may be generated:
+
+- The ``UFFD_FEATURE_EVENT_*`` flags indicate that various other events
+ other than page faults are supported. These events are described in more
+ detail below in the `Non-cooperative userfaultfd`_ section.
+
+- ``UFFD_FEATURE_MISSING_HUGETLBFS`` and ``UFFD_FEATURE_MISSING_SHMEM``
+ indicate that the kernel supports ``UFFDIO_REGISTER_MODE_MISSING``
+ registrations for hugetlbfs and shared memory (covering all shmem APIs,
+ i.e. tmpfs, ``IPCSHM``, ``/dev/zero``, ``MAP_SHARED``, ``memfd_create``,
+ etc) virtual memory areas, respectively.
+
+- ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
+ ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
+ areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
+ support for shmem virtual memory areas.
+
+- ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving an
+  existing page's contents from userspace.
+
+The userland application should set the feature flags it intends to use
+when invoking the ``UFFDIO_API`` ioctl, to request that those features be
+enabled if supported.
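+
+A minimal sketch of this handshake, requesting for example the shmem and
+hugetlbfs MISSING features (requesting a feature the kernel does not support
+makes the ioctl fail, so a real application may want to probe first with
+``features == 0``), might look like::
+
+  #include <stdio.h>
+  #include <string.h>
+  #include <sys/ioctl.h>
+  #include <linux/userfaultfd.h>
+
+  static int enable_api(int uffd)
+  {
+          struct uffdio_api api;
+
+          memset(&api, 0, sizeof(api));
+          api.api = UFFD_API;
+          /* Request only the features we intend to use. */
+          api.features = UFFD_FEATURE_MISSING_SHMEM |
+                         UFFD_FEATURE_MISSING_HUGETLBFS;
+          if (ioctl(uffd, UFFDIO_API, &api)) {
+                  perror("UFFDIO_API");
+                  return -1;
+          }
+          return 0;
+  }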
+
+Once the ``userfaultfd`` API has been enabled the ``UFFDIO_REGISTER``
+ioctl should be invoked (if present in the returned ``uffdio_api.ioctls``
+bitmask) to register a memory range in the ``userfaultfd`` by setting the
uffdio_register structure accordingly. The ``uffdio_register.mode``
bitmask will specify to the kernel which kind of faults to track for
-the range (``UFFDIO_REGISTER_MODE_MISSING`` would track missing
-pages). The ``UFFDIO_REGISTER`` ioctl will return the
+the range. The ``UFFDIO_REGISTER`` ioctl will return the
``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
-supported for all memory types depending on the underlying virtual
-memory backend (anonymous memory vs tmpfs vs real filebacked
-mappings).
+supported for all memory types (e.g. anonymous memory vs. shmem vs.
+hugetlbfs), or all types of intercepted faults.
Userland can use the ``uffdio_register.ioctls`` to manage the virtual
address space in the background (to add or potentially also remove
@@ -100,21 +137,46 @@ memory from the ``userfaultfd`` registered range). This means a userfault
could be triggering just before userland maps in the background the
user-faulted page.
-The primary ioctl to resolve userfaults is ``UFFDIO_COPY``. That
-atomically copies a page into the userfault registered range and wakes
-up the blocked userfaults
-(unless ``uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE`` is set).
-Other ioctl works similarly to ``UFFDIO_COPY``. They're atomic as in
-guaranteeing that nothing can see an half copied page since it'll
-keep userfaulting until the copy has finished.
+Resolving Userfaults
+--------------------
+
+There are three basic ways to resolve userfaults:
+
+- ``UFFDIO_COPY`` atomically copies some existing page contents from
+ userspace.
+
+- ``UFFDIO_ZEROPAGE`` atomically zeros the new page.
+
+- ``UFFDIO_CONTINUE`` maps an existing, previously-populated page.
+
+These operations are atomic in the sense that they guarantee nothing can
+see a half-populated page, since readers will keep userfaulting until the
+operation has finished.
+
+By default, these wake up userfaults blocked on the range in question.
+They support a ``UFFDIO_*_MODE_DONTWAKE`` ``mode`` flag, which indicates
+that waking will be done separately at some later time.
+
+Which ioctl to choose depends on the kind of page fault, and what we'd
+like to do to resolve it (a minimal sketch follows this list):
+
+- For ``UFFDIO_REGISTER_MODE_MISSING`` faults, the fault needs to be
+ resolved by either providing a new page (``UFFDIO_COPY``), or mapping
+ the zero page (``UFFDIO_ZEROPAGE``). By default, the kernel would map
+ the zero page for a missing fault. With userfaultfd, userspace can
+ decide what content to provide before the faulting thread continues.
+
+- For ``UFFDIO_REGISTER_MODE_MINOR`` faults, there is an existing page (in
+ the page cache). Userspace has the option of modifying the page's
+ contents before resolving the fault. Once the contents are correct
+ (modified or not), userspace asks the kernel to map the page and let the
+ faulting thread continue with ``UFFDIO_CONTINUE``.
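+
+As a minimal sketch of the MISSING path above, the following registers a
+range, waits for one fault message and resolves it with ``UFFDIO_COPY``
+(``area``, ``page`` and ``page_size`` are assumptions of the example, not
+part of the API; error handling is abbreviated)::
+
+  #include <poll.h>
+  #include <string.h>
+  #include <unistd.h>
+  #include <sys/ioctl.h>
+  #include <linux/userfaultfd.h>
+
+  static int handle_one_fault(int uffd, void *area, unsigned long len,
+                              void *page, unsigned long page_size)
+  {
+          struct uffdio_register reg = {
+                  .range = { .start = (unsigned long)area, .len = len },
+                  .mode  = UFFDIO_REGISTER_MODE_MISSING,
+          };
+          struct pollfd pfd = { .fd = uffd, .events = POLLIN };
+          struct uffd_msg msg;
+          struct uffdio_copy copy;
+
+          if (ioctl(uffd, UFFDIO_REGISTER, &reg))
+                  return -1;
+
+          /* Block until some thread faults inside the registered range. */
+          if (poll(&pfd, 1, -1) <= 0)
+                  return -1;
+          if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
+                  return -1;
+          if (msg.event != UFFD_EVENT_PAGEFAULT)
+                  return -1;
+
+          /* Install "page" at the faulting address (rounded down to a page
+           * boundary) and wake the faulting thread.  UFFDIO_ZEROPAGE would
+           * map the zero page instead. */
+          copy.dst  = msg.arg.pagefault.address & ~(page_size - 1);
+          copy.src  = (unsigned long)page;
+          copy.len  = page_size;
+          copy.mode = 0;          /* or UFFDIO_COPY_MODE_DONTWAKE */
+          copy.copy = 0;
+          return ioctl(uffd, UFFDIO_COPY, &copy);
+  }
+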
Notes:
-- If you requested ``UFFDIO_REGISTER_MODE_MISSING`` when registering then
- you must provide some kind of page in your thread after reading from
- the uffd. You must provide either ``UFFDIO_COPY`` or ``UFFDIO_ZEROPAGE``.
- The normal behavior of the OS automatically providing a zero page on
- an annonymous mmaping is not in place.
+- You can tell which kind of fault occurred by examining
+ ``pagefault.flags`` within the ``uffd_msg``, checking for the
+ ``UFFD_PAGEFAULT_FLAG_*`` flags.
- None of the page-delivering ioctls default to the range that you
registered with. You must fill in all fields for the appropriate
@@ -122,9 +184,9 @@ Notes:
- You get the address of the access that triggered the missing page
event out of a struct uffd_msg that you read in the thread from the
- uffd. You can supply as many pages as you want with ``UFFDIO_COPY`` or
- ``UFFDIO_ZEROPAGE``. Keep in mind that unless you used DONTWAKE then
- the first of any of those IOCTLs wakes up the faulting thread.
+ uffd. You can supply as many pages as you want with these IOCTLs.
+ Keep in mind that unless you used DONTWAKE then the first of any of
+ those IOCTLs wakes up the faulting thread.
- Be sure to test for all errors including
(``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges
@@ -160,6 +222,81 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
used.
+Userfaultfd write-protect mode currently behaves differently on none ptes
+(e.g. when a page is missing) across different types of memory.
+
+For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes
+(e.g. when pages are missing and not populated). For file-backed memory
+like shmem and hugetlbfs, none ptes will be write protected just like a
+present pte. In other words, a userfaultfd write fault message will be
+generated when writing to a missing page on file-backed memory, as long as
+the page range was write-protected before. Such a message will not be
+generated on anonymous memory by default.
+
+If the application wants to be able to write protect none ptes on anonymous
+memory, it can pre-populate the memory with e.g. MADV_POPULATE_READ. On
+newer kernels, it can also detect the UFFD_FEATURE_WP_UNPOPULATED feature
+and set the feature bit in advance to make sure none ptes will also be
+write protected even on anonymous memory.
+
+When using ``UFFDIO_REGISTER_MODE_WP`` in combination with either
+``UFFDIO_REGISTER_MODE_MISSING`` or ``UFFDIO_REGISTER_MODE_MINOR``, when
+resolving missing / minor faults with ``UFFDIO_COPY`` or ``UFFDIO_CONTINUE``
+respectively, it may be desirable for the new page / mapping to be
+write-protected (so future writes will also result in a WP fault). These ioctls
+support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
+respectively) to configure the mapping this way.
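+
+For instance, a hedged fragment of resolving a missing fault with a
+write-protected mapping could look like this (``fault_addr``, ``page`` and
+``page_size`` are assumed inputs from the fault handler)::
+
+  #include <sys/ioctl.h>
+  #include <linux/userfaultfd.h>
+
+  static int copy_in_write_protected(int uffd, unsigned long fault_addr,
+                                     void *page, unsigned long page_size)
+  {
+          struct uffdio_copy copy = {
+                  .dst  = fault_addr & ~(page_size - 1),
+                  .src  = (unsigned long)page,
+                  .len  = page_size,
+                  /* Keep the new pte write-protected so later writes also
+                   * fault; UFFDIO_CONTINUE_MODE_WP plays the same role for
+                   * UFFDIO_CONTINUE. */
+                  .mode = UFFDIO_COPY_MODE_WP,
+          };
+
+          return ioctl(uffd, UFFDIO_COPY, &copy);
+  }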
+
+If the userfaultfd context has ``UFFD_FEATURE_WP_ASYNC`` feature bit set,
+any vma registered with write-protection will work in async mode rather
+than the default sync mode.
+
+In async mode, there will be no message generated when a write operation
+happens, meanwhile the write-protection will be resolved automatically by
+the kernel. It can be seen as a more accurate version of soft-dirty
+tracking and it can be different in a few ways:
+
+ - The dirty result will not be affected by vma changes (e.g. vma
+ merging) because the dirty is only tracked by the pte.
+
+ - It supports range operations by default, so one can enable tracking on
+ any range of memory as long as page aligned.
+
+ - Dirty information will not get lost if the pte was zapped due to
+ various reasons (e.g. during split of a shmem transparent huge page).
+
+  - Because the meaning is inverted compared to soft-dirty (a page is
+    clean when the uffd-wp bit is set and dirty when it is cleared), some
+    memory operations have different semantics. For example,
+    ``MADV_DONTNEED`` on anonymous memory (or ``MADV_REMOVE`` on a file
+    mapping) will be treated as dirtying the memory, because the uffd-wp
+    bit is dropped during the operation.
+
+The user application can collect the "written/dirty" status by looking up
+the uffd-wp bit for the pages of interest in ``/proc/PID/pagemap``.
+
+A page will not be tracked by uffd-wp async mode until it is explicitly
+write-protected by ``ioctl(UFFDIO_WRITEPROTECT)`` with the mode flag
+``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page fault that
+was tracked by async mode userfaultfd-wp is invalid.
+
+When userfaultfd-wp async mode is used alone, it can be applied to all
+kinds of memory.
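+
+A rough sketch of this async usage, assuming ``UFFD_FEATURE_WP_ASYNC`` was
+requested at ``UFFDIO_API`` time and that the uffd-wp bit is bit 57 of a
+pagemap entry (an assumption of this example; see the pagemap documentation
+for the authoritative layout), might be::
+
+  #include <fcntl.h>
+  #include <stdint.h>
+  #include <unistd.h>
+  #include <sys/ioctl.h>
+  #include <linux/userfaultfd.h>
+
+  #define PM_UFFD_WP_BIT 57ULL    /* assumed pagemap bit for uffd-wp */
+
+  static int page_was_written(int uffd, unsigned long addr,
+                              unsigned long page_size)
+  {
+          struct uffdio_writeprotect wp = {
+                  .range = { .start = addr & ~(page_size - 1),
+                             .len   = page_size },
+                  .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
+          };
+          uint64_t entry = 0;
+          int fd;
+
+          /* Start tracking: in async mode no fault message is sent, the
+           * kernel just clears the uffd-wp bit on the first write. */
+          if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
+                  return -1;
+
+          /* ... the program may or may not write to addr here ... */
+
+          fd = open("/proc/self/pagemap", O_RDONLY);
+          if (fd < 0)
+                  return -1;
+          if (pread(fd, &entry, sizeof(entry),
+                    (addr / page_size) * sizeof(entry)) != sizeof(entry)) {
+                  close(fd);
+                  return -1;
+          }
+          close(fd);
+
+          /* uffd-wp bit still set means no write happened since tracking
+           * started. */
+          return !(entry & (1ULL << PM_UFFD_WP_BIT));
+  }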
+
+Memory Poisoning Emulation
+---------------------------
+
+In response to a fault (either missing or minor), an action userspace can
+take to "resolve" it is to issue a ``UFFDIO_POISON``. This will cause any
+future faulters to either get a SIGBUS, or in KVM's case the guest will
+receive an MCE as if there were hardware memory poisoning.
+
+This is used to emulate hardware memory poisoning. Imagine a VM running on a
+machine which experiences a real hardware memory error. Later, we live migrate
+the VM to another physical machine. Since we want the migration to be
+transparent to the guest, we want that same address range to act as if it was
+still poisoned, even though it's on a new physical host which ostensibly
+doesn't have a memory error in the exact same spot.
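+
+A minimal, hedged sketch of issuing the ioctl (assuming a kernel and
+``linux/userfaultfd.h`` that provide ``UFFDIO_POISON``)::
+
+  #include <sys/ioctl.h>
+  #include <linux/userfaultfd.h>
+
+  static int poison_page(int uffd, unsigned long fault_addr,
+                         unsigned long page_size)
+  {
+          struct uffdio_poison poison = {
+                  .range = { .start = fault_addr & ~(page_size - 1),
+                             .len   = page_size },
+                  .mode  = 0,     /* or UFFDIO_POISON_MODE_DONTWAKE */
+          };
+
+          /* Future accesses to this page get SIGBUS (or an MCE in KVM). */
+          return ioctl(uffd, UFFDIO_POISON, &poison);
+  }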
+
QEMU/KVM
========
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
new file mode 100644
index 000000000000..b42132969e31
--- /dev/null
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -0,0 +1,178 @@
+=====
+zswap
+=====
+
+Overview
+========
+
+Zswap is a lightweight compressed cache for swap pages. It takes pages that are
+in the process of being swapped out and attempts to compress them into a
+dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles
+for potentially reduced swap I/O. This trade-off can also result in a
+significant performance improvement if reads from the compressed cache are
+faster than reads from a swap device.
+
+Some potential benefits:
+
+* Desktop/laptop users with limited RAM capacities can mitigate the
+ performance impact of swapping.
+* Overcommitted guests that share a common I/O resource can
+ dramatically reduce their swap I/O pressure, avoiding heavy handed I/O
+ throttling by the hypervisor. This allows more work to get done with less
+ impact to the guest workload and guests sharing the I/O subsystem
+* Users with SSDs as swap devices can extend the life of the device by
+ drastically reducing life-shortening writes.
+
+Zswap evicts pages from compressed cache on an LRU basis to the backing swap
+device when the compressed pool reaches its size limit. This requirement had
+been identified in prior community discussions.
+
+Whether Zswap is enabled at boot time depends on whether
+the ``CONFIG_ZSWAP_DEFAULT_ON`` Kconfig option is enabled or not.
+This setting can then be overridden by providing the kernel command line
+``zswap.enabled=`` option, for example ``zswap.enabled=0``.
+Zswap can also be enabled and disabled at runtime using the sysfs interface.
+An example command to enable zswap at runtime, assuming sysfs is mounted
+at ``/sys``, is::
+
+ echo 1 > /sys/module/zswap/parameters/enabled
+
+When zswap is disabled at runtime it will stop storing pages that are
+being swapped out. However, it will _not_ immediately write out or fault
+back into memory all of the pages stored in the compressed pool. The
+pages stored in zswap will remain in the compressed pool until they are
+either invalidated or faulted back into memory. In order to force all
+pages out of the compressed pool, a swapoff on the swap device(s) will
+fault back into memory all swapped out pages, including those in the
+compressed pool.
+
+Design
+======
+
+Zswap receives pages for compression from the swap subsystem and is able to
+evict pages from its own compressed pool on an LRU basis and write them back to
+the backing swap device in the case that the compressed pool is full.
+
+Zswap makes use of zpool for managing the compressed memory pool. Each
+allocation in zpool is not directly accessible by address. Rather, a handle is
+returned by the allocation routine and that handle must be mapped before being
+accessed. The compressed memory pool grows on demand and shrinks as compressed
+pages are freed. The pool is not preallocated. By default, a zpool
+of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created,
+but it can be overridden at boot time by setting the ``zpool`` attribute,
+e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs
+``zpool`` attribute, e.g.::
+
+ echo zbud > /sys/module/zswap/parameters/zpool
+
+The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which
+means the compression ratio will always be 2:1 or worse (because of half-full
+zbud pages). The zsmalloc type zpool has a more complex compressed page
+storage method, and it can achieve greater storage densities.
+
+When a swap page is passed from swapout to zswap, zswap maintains a mapping
+of the swap entry, a combination of the swap type and swap offset, to the zpool
+handle that references that compressed swap page. This mapping is achieved
+with a red-black tree per swap type. The swap offset is the search key for the
+tree nodes.
+
+During a page fault on a PTE that is a swap entry, the swapin code calls the
+zswap load function to decompress the page into the page allocated by the page
+fault handler.
+
+Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
+in the swap_map goes to 0) the swap code calls the zswap invalidate function
+to free the compressed entry.
+
+Zswap seeks to be simple in its policies. Sysfs attributes allow for one user
+controlled policy:
+
+* max_pool_percent - The maximum percentage of memory that the compressed
+ pool can occupy.
+
+The default compressor is selected in ``CONFIG_ZSWAP_COMPRESSOR_DEFAULT``
+Kconfig option, but it can be overridden at boot time by setting the
+``compressor`` attribute, e.g. ``zswap.compressor=lzo``.
+It can also be changed at runtime using the sysfs "compressor"
+attribute, e.g.::
+
+ echo lzo > /sys/module/zswap/parameters/compressor
+
+When the zpool and/or compressor parameter is changed at runtime, any existing
+compressed pages are not modified; they are left in their own zpool. When a
+request is made for a page in an old zpool, it is uncompressed using its
+original compressor. Once all pages are removed from an old zpool, the zpool
+and its compressor are freed.
+
+Some of the pages in zswap are same-value filled pages (i.e. contents of the
+page have same value or repetitive pattern). These pages include zero-filled
+pages and they are handled differently. During store operation, a page is
+checked if it is a same-value filled page before compressing it. If true, the
+compressed length of the page is set to zero and the pattern or same-filled
+value is stored.
+
+Same-value filled pages identification feature is enabled by default and can be
+disabled at boot time by setting the ``same_filled_pages_enabled`` attribute
+to 0, e.g. ``zswap.same_filled_pages_enabled=0``. It can also be enabled and
+disabled at runtime using the sysfs ``same_filled_pages_enabled``
+attribute, e.g.::
+
+ echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled
+
+When zswap same-filled page identification is disabled at runtime, it will stop
+checking for the same-value filled pages during store operation.
+In other words, every page will be then considered non-same-value filled.
+However, the existing pages which are marked as same-value filled pages remain
+stored unchanged in zswap until they are either loaded or invalidated.
+
+In some circumstances it might be advantageous to make use of just the zswap
+ability to efficiently store same-filled pages without enabling the whole
+compressed page storage.
+In this case the handling of non-same-value pages by zswap (enabled by default)
+can be disabled by setting the ``non_same_filled_pages_enabled`` attribute
+to 0, e.g. ``zswap.non_same_filled_pages_enabled=0``.
+It can also be enabled and disabled at runtime using the sysfs
+``non_same_filled_pages_enabled`` attribute, e.g.::
+
+ echo 1 > /sys/module/zswap/parameters/non_same_filled_pages_enabled
+
+Disabling both ``zswap.same_filled_pages_enabled`` and
+``zswap.non_same_filled_pages_enabled`` effectively disables accepting any new
+pages by zswap.
+
+To prevent zswap from shrinking pool when zswap is full and there's a high
+pressure on swap (this will result in flipping pages in and out zswap pool
+without any real benefit but with a performance drop for the system), a
+special parameter has been introduced to implement a sort of hysteresis to
+refuse taking pages into zswap pool until it has sufficient space if the limit
+has been hit. To set the threshold at which zswap would start accepting pages
+again after it became full, use the sysfs ``accept_threshold_percent``
+attribute, e.g.::
+
+ echo 80 > /sys/module/zswap/parameters/accept_threshold_percent
+
+Setting this parameter to 100 will disable the hysteresis.
+
+Some users cannot tolerate the swapping that comes with zswap store failures
+and zswap writebacks. Swapping can be disabled entirely (without disabling
+zswap itself) on a per-cgroup basis as follows::
+
+ echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
+
+Note that if the store failures are recurring (e.g. if the pages are
+incompressible), users can observe reclaim inefficiency after disabling
+writeback (because the same pages might be rejected again and again).
+
+When there is a sizable amount of cold memory residing in the zswap pool, it
+can be advantageous to proactively write these cold pages to swap and reclaim
+the memory for other use cases. By default, the zswap shrinker is disabled.
+Users can enable it as follows::
+
+ echo Y > /sys/module/zswap/parameters/shrinker_enabled
+
+This can be enabled at boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
+selected.
+
+A debugfs interface is provided for various statistics about pool size, the
+number of pages stored, same-value filled pages and various counters for the
+reasons pages are rejected.
diff --git a/Documentation/admin-guide/module-signing.rst b/Documentation/admin-guide/module-signing.rst
index f8b584179cff..a8667a777490 100644
--- a/Documentation/admin-guide/module-signing.rst
+++ b/Documentation/admin-guide/module-signing.rst
@@ -28,10 +28,10 @@ trusted userspace bits.
This facility uses X.509 ITU-T standard certificates to encode the public keys
involved. The signatures are not themselves encoded in any industrial standard
-type. The facility currently only supports the RSA public key encryption
-standard (though it is pluggable and permits others to be used). The possible
-hash algorithms that can be used are SHA-1, SHA-224, SHA-256, SHA-384, and
-SHA-512 (the algorithm is selected by data in the signature).
+type. The built-in facility currently only supports the RSA and NIST P-384
+ECDSA public key signing standards (though it is pluggable and permits
+others to be used). The possible hash algorithms that can be used are
+SHA-2 and SHA-3 of sizes 256, 384, and 512 (the algorithm is selected by
+data in the signature).
==========================
@@ -81,11 +81,12 @@ This has a number of options available:
sign the modules with:
=============================== ==========================================
- ``CONFIG_MODULE_SIG_SHA1`` :menuselection:`Sign modules with SHA-1`
- ``CONFIG_MODULE_SIG_SHA224`` :menuselection:`Sign modules with SHA-224`
``CONFIG_MODULE_SIG_SHA256`` :menuselection:`Sign modules with SHA-256`
``CONFIG_MODULE_SIG_SHA384`` :menuselection:`Sign modules with SHA-384`
``CONFIG_MODULE_SIG_SHA512`` :menuselection:`Sign modules with SHA-512`
+ ``CONFIG_MODULE_SIG_SHA3_256`` :menuselection:`Sign modules with SHA3-256`
+ ``CONFIG_MODULE_SIG_SHA3_384`` :menuselection:`Sign modules with SHA3-384`
+ ``CONFIG_MODULE_SIG_SHA3_512`` :menuselection:`Sign modules with SHA3-512`
=============================== ==========================================
The algorithm selected here will also be built into the kernel (rather
@@ -106,7 +107,7 @@ This has a number of options available:
certificate and a private key.
If the PEM file containing the private key is encrypted, or if the
- PKCS#11 token requries a PIN, this can be provided at build time by
+ PKCS#11 token requires a PIN, this can be provided at build time by
means of the ``KBUILD_SIGN_PIN`` variable.
@@ -145,6 +146,10 @@ into vmlinux) using parameters in the::
file (which is also generated if it does not already exist).
+One can select between RSA (``MODULE_SIG_KEY_TYPE_RSA``) and ECDSA
+(``MODULE_SIG_KEY_TYPE_ECDSA``) to generate either an RSA 4096-bit or a
+NIST P-384 keypair.
+
It is strongly recommended that you provide your own x509.genkey file.
Most notably, in the x509.genkey file, the req_distinguished_name section
@@ -266,7 +271,7 @@ for which it has a public key. Otherwise, it will also load modules that are
unsigned. Any module for which the kernel has a key, but which proves to have
a signature mismatch will not be permitted to load.
-Any module that has an unparseable signature will be rejected.
+Any module that has an unparsable signature will be rejected.
=========================================
diff --git a/Documentation/admin-guide/nfs/fault_injection.rst b/Documentation/admin-guide/nfs/fault_injection.rst
deleted file mode 100644
index eb029c0c15ce..000000000000
--- a/Documentation/admin-guide/nfs/fault_injection.rst
+++ /dev/null
@@ -1,70 +0,0 @@
-===================
-NFS Fault Injection
-===================
-
-Fault injection is a method for forcing errors that may not normally occur, or
-may be difficult to reproduce. Forcing these errors in a controlled environment
-can help the developer find and fix bugs before their code is shipped in a
-production system. Injecting an error on the Linux NFS server will allow us to
-observe how the client reacts and if it manages to recover its state correctly.
-
-NFSD_FAULT_INJECTION must be selected when configuring the kernel to use this
-feature.
-
-
-Using Fault Injection
-=====================
-On the client, mount the fault injection server through NFS v4.0+ and do some
-work over NFS (open files, take locks, ...).
-
-On the server, mount the debugfs filesystem to <debug_dir> and ls
-<debug_dir>/nfsd. This will show a list of files that will be used for
-injecting faults on the NFS server. As root, write a number n to the file
-corresponding to the action you want the server to take. The server will then
-process the first n items it finds. So if you want to forget 5 locks, echo '5'
-to <debug_dir>/nfsd/forget_locks. A value of 0 will tell the server to forget
-all corresponding items. A log message will be created containing the number
-of items forgotten (check dmesg).
-
-Go back to work on the client and check if the client recovered from the error
-correctly.
-
-
-Available Faults
-================
-forget_clients:
- The NFS server keeps a list of clients that have placed a mount call. If
- this list is cleared, the server will have no knowledge of who the client
- is, forcing the client to reauthenticate with the server.
-
-forget_openowners:
- The NFS server keeps a list of what files are currently opened and who
- they were opened by. Clearing this list will force the client to reopen
- its files.
-
-forget_locks:
- The NFS server keeps a list of what files are currently locked in the VFS.
- Clearing this list will force the client to reclaim its locks (files are
- unlocked through the VFS as they are cleared from this list).
-
-forget_delegations:
- A delegation is used to assure the client that a file, or part of a file,
- has not changed since the delegation was awarded. Clearing this list will
- force the client to reacquire its delegation before accessing the file
- again.
-
-recall_delegations:
- Delegations can be recalled by the server when another client attempts to
- access a file. This test will notify the client that its delegation has
- been revoked, forcing the client to reacquire the delegation before using
- the file again.
-
-
-tools/nfs/inject_faults.sh script
-=================================
-This script has been created to ease the fault injection process. This script
-will detect the mounted debugfs directory and write to the files located there
-based on the arguments passed by the user. For example, running
-`inject_faults.sh forget_locks 1` as root will instruct the server to forget
-one lock. Running `inject_faults forget_locks` will instruct the server to
-forgetall locks.
diff --git a/Documentation/admin-guide/nfs/index.rst b/Documentation/admin-guide/nfs/index.rst
index 6b5a3c90fac5..3601a708f333 100644
--- a/Documentation/admin-guide/nfs/index.rst
+++ b/Documentation/admin-guide/nfs/index.rst
@@ -12,4 +12,3 @@ NFS
nfs-idmapper
pnfs-block-server
pnfs-scsi-server
- fault_injection
diff --git a/Documentation/admin-guide/nfs/nfs-client.rst b/Documentation/admin-guide/nfs/nfs-client.rst
index c4b777c7584b..36760685dd34 100644
--- a/Documentation/admin-guide/nfs/nfs-client.rst
+++ b/Documentation/admin-guide/nfs/nfs-client.rst
@@ -36,10 +36,9 @@ administrative requirements that require particular behavior that does not
work well as part of an nfs_client_id4 string.
The nfs.nfs4_unique_id boot parameter specifies a unique string that can be
-used instead of a system's node name when an NFS client identifies itself to
-a server. Thus, if the system's node name is not unique, or it changes, its
-nfs.nfs4_unique_id stays the same, preventing collision with other clients
-or loss of state during NFS reboot recovery or transparent state migration.
+used together with a system's node name when an NFS client identifies itself to
+a server. Thus, if the system's node name is not unique, its
+nfs.nfs4_unique_id can help prevent collisions with other clients.
The nfs.nfs4_unique_id string is typically a UUID, though it can contain
anything that is believed to be unique across all NFS clients. An
@@ -53,8 +52,12 @@ outstanding NFSv4 state has expired, to prevent loss of NFSv4 state.
This string can be stored in an NFS client's grub.conf, or it can be provided
via a net boot facility such as PXE. It may also be specified as an nfs.ko
-module parameter. Specifying a uniquifier string is not support for NFS
-clients running in containers.
+module parameter.
+
+This uniquifier string will be the same for all NFS clients running in
+containers unless it is overridden by a value written to
+/sys/fs/nfs/net/nfs_client/identifier, which will be local to the network
+namespace of the process that writes it.
The DNS resolver
@@ -65,8 +68,8 @@ migrated onto another server by means of the special "fs_locations"
attribute. See `RFC3530 Section 6: Filesystem Migration and Replication`_ and
`Implementation Guide for Referrals in NFSv4`_.
-.. _RFC3530 Section 6\: Filesystem Migration and Replication: http://tools.ietf.org/html/rfc3530#section-6
-.. _Implementation Guide for Referrals in NFSv4: http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00
+.. _RFC3530 Section 6\: Filesystem Migration and Replication: https://tools.ietf.org/html/rfc3530#section-6
+.. _Implementation Guide for Referrals in NFSv4: https://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00
The fs_locations information can take the form of either an ip address and
a path, or a DNS hostname and a path. The latter requires the NFS client to
diff --git a/Documentation/admin-guide/nfs/nfs-rdma.rst b/Documentation/admin-guide/nfs/nfs-rdma.rst
index ef0f3678b1fb..f137485f8bde 100644
--- a/Documentation/admin-guide/nfs/nfs-rdma.rst
+++ b/Documentation/admin-guide/nfs/nfs-rdma.rst
@@ -65,7 +65,7 @@ use with NFS/RDMA.
If the version is less than 1.1.2 or the command does not exist,
you should install the latest version of nfs-utils.
- Download the latest package from: http://www.kernel.org/pub/linux/utils/nfs
+ Download the latest package from: https://www.kernel.org/pub/linux/utils/nfs
Uncompress the package and follow the installation instructions.
diff --git a/Documentation/admin-guide/nfs/nfsroot.rst b/Documentation/admin-guide/nfs/nfsroot.rst
index c6772075c80c..135218f33394 100644
--- a/Documentation/admin-guide/nfs/nfsroot.rst
+++ b/Documentation/admin-guide/nfs/nfsroot.rst
@@ -264,7 +264,7 @@ They depend on various facilities being available:
access to the floppy drive device, /dev/fd0
For more information on syslinux, including how to create bootdisks
- for prebuilt kernels, see http://syslinux.zytor.com/
+ for prebuilt kernels, see https://syslinux.zytor.com/
.. note::
Previously it was possible to write a kernel directly to
@@ -292,7 +292,7 @@ They depend on various facilities being available:
cdrecord dev=ATAPI:1,0,0 arch/x86/boot/image.iso
For more information on isolinux, including how to create bootdisks
- for prebuilt kernels, see http://syslinux.zytor.com/
+ for prebuilt kernels, see https://syslinux.zytor.com/
- Using LILO
@@ -346,7 +346,7 @@ They depend on various facilities being available:
see Documentation/admin-guide/serial-console.rst for more information.
For more information on isolinux, including how to create bootdisks
- for prebuilt kernels, see http://syslinux.zytor.com/
+ for prebuilt kernels, see https://syslinux.zytor.com/
diff --git a/Documentation/admin-guide/nfs/pnfs-block-server.rst b/Documentation/admin-guide/nfs/pnfs-block-server.rst
index b00a2e705cc4..20fe9f5117fe 100644
--- a/Documentation/admin-guide/nfs/pnfs-block-server.rst
+++ b/Documentation/admin-guide/nfs/pnfs-block-server.rst
@@ -8,7 +8,7 @@ to handling all the metadata access to the NFS export also hands out layouts
to the clients to directly access the underlying block devices that are
shared with the client.
-To use pNFS block layouts with with the Linux NFS server the exported file
+To use pNFS block layouts with the Linux NFS server the exported file
system needs to support the pNFS block layouts (currently just XFS), and the
file system must sit on shared storage (typically iSCSI) that is accessible
to the clients in addition to the MDS. As of now the file system needs to
diff --git a/Documentation/admin-guide/nfs/pnfs-scsi-server.rst b/Documentation/admin-guide/nfs/pnfs-scsi-server.rst
index d2f6ee558071..b2eec2288329 100644
--- a/Documentation/admin-guide/nfs/pnfs-scsi-server.rst
+++ b/Documentation/admin-guide/nfs/pnfs-scsi-server.rst
@@ -9,7 +9,7 @@ which in addition to handling all the metadata access to the NFS export,
also hands out layouts to the clients so that they can directly access the
underlying SCSI LUNs that are shared with the client.
-To use pNFS SCSI layouts with with the Linux NFS server, the exported file
+To use pNFS SCSI layouts with the Linux NFS server, the exported file
system needs to support the pNFS SCSI layouts (currently just XFS), and the
file system must sit on a SCSI LUN that is accessible to the clients in
addition to the MDS. As of now the file system needs to sit directly on the
diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
index 1307b5274a0f..34aa334320ca 100644
--- a/Documentation/admin-guide/perf-security.rst
+++ b/Documentation/admin-guide/perf-security.rst
@@ -72,7 +72,7 @@ monitoring and observability operations, thus, bypass *scope* permissions
checks in the kernel. CAP_PERFMON implements the principle of least
privilege [13]_ (POSIX 1003.1e: 2.2.2.39) for performance monitoring and
observability operations in the kernel and provides a secure approach to
-perfomance monitoring and observability in the system.
+performance monitoring and observability in the system.
For backward compatibility reasons the access to perf_events monitoring and
observability operations is also open for CAP_SYS_ADMIN privileged
@@ -84,11 +84,14 @@ capabilities then providing the process with CAP_PERFMON capability singly
is recommended as the preferred secure approach to resolve double access
denial logging related to usage of performance monitoring and observability.
-Unprivileged processes using perf_events system call are also subject
-for PTRACE_MODE_READ_REALCREDS ptrace access mode check [7]_ , whose
-outcome determines whether monitoring is permitted. So unprivileged
-processes provided with CAP_SYS_PTRACE capability are effectively
-permitted to pass the check.
+Prior to Linux v5.9, unprivileged processes using the perf_events system
+call were also subject to the PTRACE_MODE_READ_REALCREDS ptrace access
+mode check [7]_ , whose outcome determined whether monitoring was
+permitted. So unprivileged processes provided with the CAP_SYS_PTRACE
+capability were effectively permitted to pass the check. Starting with
+Linux v5.9, the CAP_SYS_PTRACE capability is no longer required and
+CAP_PERFMON alone is enough for processes to carry out performance
+monitoring and observability operations.
Other capabilities being granted to unprivileged processes can
effectively enable capturing of additional data required for later
@@ -99,11 +102,11 @@ CAP_SYSLOG capability permits reading kernel space memory addresses from
Privileged Perf users groups
---------------------------------
-Mechanisms of capabilities, privileged capability-dumb files [6]_ and
-file system ACLs [10]_ can be used to create dedicated groups of
-privileged Perf users who are permitted to execute performance monitoring
-and observability without scope limits. The following steps can be
-taken to create such groups of privileged Perf users.
+Mechanisms of capabilities, privileged capability-dumb files [6]_,
+file system ACLs [10]_ and sudo [15]_ utility can be used to create
+dedicated groups of privileged Perf users who are permitted to execute
+performance monitoring and observability without limits. The following
+steps can be taken to create such groups of privileged Perf users.
1. Create perf_users group of privileged Perf users, assign perf_users
group to Perf tool executable and limit access to the executable for
@@ -133,7 +136,7 @@ taken to create such groups of privileged Perf users.
# getcap perf
perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
-If the libcap installed doesn't yet support "cap_perfmon", use "38" instead,
+If the libcap [16]_ installed doesn't yet support "cap_perfmon", use "38" instead,
i.e.:
::
@@ -159,6 +162,60 @@ performance monitoring and observability by using functionality of the
configured Perf tool executable that, when executes, passes perf_events
subsystem scope checks.
+If the Perf tool executable can't be assigned the required capabilities
+(e.g. the file system is mounted with the nosuid option or extended
+attributes are not supported by the file system), a capabilities-privileged
+environment, typically a shell, can be created instead. The shell provides
+its child processes with CAP_PERFMON and the other required capabilities so
+that performance monitoring and observability operations are available in
+the environment without limits. Access to the environment can be opened via
+the sudo utility for members of the perf_users group only. To create such
+an environment:
+
+1. Create a shell script that uses the capsh utility [16]_ to assign
+   CAP_PERFMON and the other required capabilities to the ambient capability
+   set of the shell process, lock the process security bits after enabling
+   the SECBIT_NO_SETUID_FIXUP, SECBIT_NOROOT and SECBIT_NO_CAP_AMBIENT_RAISE
+   bits, and then change the process identity to the sudo caller of the
+   script, who should be a member of the perf_users group:
+
+::
+
+ # ls -alh /usr/local/bin/perf.shell
+ -rwxr-xr-x. 1 root root 83 Oct 13 23:57 /usr/local/bin/perf.shell
+ # cat /usr/local/bin/perf.shell
+ exec /usr/sbin/capsh --iab=^cap_perfmon --secbits=239 --user=$SUDO_USER -- -l
+
+2. Extend sudo policy at /etc/sudoers file with a rule for perf_users group:
+
+::
+
+ # grep perf_users /etc/sudoers
+ %perf_users ALL=/usr/local/bin/perf.shell
+
+3. Check that members of the perf_users group have access to the privileged
+   shell and have CAP_PERFMON and the other required capabilities enabled
+   in the permitted, effective and ambient capability sets of the resulting
+   shell process:
+
+::
+
+ $ id
+ uid=1003(capsh_test) gid=1004(capsh_test) groups=1004(capsh_test),1000(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
+ $ sudo perf.shell
+ [sudo] password for capsh_test:
+ $ grep Cap /proc/self/status
+ CapInh: 0000004000000000
+ CapPrm: 0000004000000000
+ CapEff: 0000004000000000
+ CapBnd: 000000ffffffffff
+ CapAmb: 0000004000000000
+ $ capsh --decode=0000004000000000
+ 0x0000004000000000=cap_perfmon
+
+As a result, members of the perf_users group have access to the privileged
+environment where they can use tools employing the performance monitoring
+APIs governed by the CAP_PERFMON Linux capability.
+
This specific access control management is only available to superuser
or root running processes with CAP_SETPCAP, CAP_SETFCAP [6]_
capabilities.
@@ -264,3 +321,5 @@ Bibliography
.. [12] `<http://man7.org/linux/man-pages/man5/limits.conf.5.html>`_
.. [13] `<https://sites.google.com/site/fullycapable>`_
.. [14] `<http://man7.org/linux/man-pages/man8/auditd.8.html>`_
+.. [15] `<https://man7.org/linux/man-pages/man8/sudo.8.html>`_
+.. [16] `<https://git.kernel.org/pub/scm/libs/libcap/libcap.git/>`_
diff --git a/Documentation/admin-guide/perf/alibaba_pmu.rst b/Documentation/admin-guide/perf/alibaba_pmu.rst
new file mode 100644
index 000000000000..7d840023903f
--- /dev/null
+++ b/Documentation/admin-guide/perf/alibaba_pmu.rst
@@ -0,0 +1,105 @@
+=============================================================
+Alibaba's T-Head SoC Uncore Performance Monitoring Unit (PMU)
+=============================================================
+
+The Yitian 710, custom-built by Alibaba Group's chip development business,
+T-Head, implements uncore PMU for performance and functional debugging to
+facilitate system maintenance.
+
+DDR Sub-System Driveway (DRW) PMU Driver
+=========================================
+
+Yitian 710 employs eight DDR5/4 channels, four on each die. Each DDR5 channel
+is independent of others to service system memory requests. And one DDR5
+channel is split into two independent sub-channels. The DDR Sub-System Driveway
+implements separate PMUs for each sub-channel to monitor various performance
+metrics.
+
+The Driveway PMU devices are named ali_drw_<sys_base_addr> in perf.
+For example, ali_drw_21000 and ali_drw_21080 are two PMU devices for two
+sub-channels of the same channel in die 0. And the PMU device of die 1 is
+prefixed with ali_drw_400XXXXX, e.g. ali_drw_40021000.
+
+Each sub-channel has 36 PMU counters in total, which are classified into
+four groups:
+
+- Group 0: PMU Cycle Counter. This group has one pair of counters
+ pmu_cycle_cnt_low and pmu_cycle_cnt_high, that is used as the cycle count
+ based on DDRC core clock.
+
+- Group 1: PMU Bandwidth Counters. This group has 8 counters that are used
+  to count the total number of accesses, either of the eight bank groups in
+  a selected rank, or of the four ranks separately using the first 4
+  counters. The base transfer unit is 64B.
+
+- Group 2: PMU Retry Counters. This group has 10 counters that count the
+  total number of retries for each type of uncorrectable error.
+
+- Group 3: PMU Common Counters. This group has 16 counters that are used
+  to count common events.
+
+For now, the Driveway PMU driver only uses counters in group 0 and group 3.
+
+The DDR Controller (DDRCTL) and DDR PHY combine to create a complete solution
+for connecting an SoC application bus to DDR memory devices. The DDRCTL
+receives transactions over the Host Interface (HIF), which is custom-defined
+by Synopsys.
+These transactions are queued internally and scheduled for access while
+satisfying the SDRAM protocol timing requirements, transaction priorities, and
+dependencies between the transactions. The DDRCTL in turn issues commands on
+the DDR PHY Interface (DFI) to the PHY module, which launches and captures data
+to and from the SDRAM. The driveway PMUs have hardware logic to gather
+statistics and performance logging signals on HIF, DFI, etc.
+
+By counting the READ, WRITE and RMW commands sent to the DDRC through the HIF
+interface, we could calculate the bandwidth. Example usage of counting memory
+data bandwidth::
+
+ perf stat \
+ -e ali_drw_21000/hif_wr/ \
+ -e ali_drw_21000/hif_rd/ \
+ -e ali_drw_21000/hif_rmw/ \
+ -e ali_drw_21000/cycle/ \
+ -e ali_drw_21080/hif_wr/ \
+ -e ali_drw_21080/hif_rd/ \
+ -e ali_drw_21080/hif_rmw/ \
+ -e ali_drw_21080/cycle/ \
+ -e ali_drw_23000/hif_wr/ \
+ -e ali_drw_23000/hif_rd/ \
+ -e ali_drw_23000/hif_rmw/ \
+ -e ali_drw_23000/cycle/ \
+ -e ali_drw_23080/hif_wr/ \
+ -e ali_drw_23080/hif_rd/ \
+ -e ali_drw_23080/hif_rmw/ \
+ -e ali_drw_23080/cycle/ \
+ -e ali_drw_25000/hif_wr/ \
+ -e ali_drw_25000/hif_rd/ \
+ -e ali_drw_25000/hif_rmw/ \
+ -e ali_drw_25000/cycle/ \
+ -e ali_drw_25080/hif_wr/ \
+ -e ali_drw_25080/hif_rd/ \
+ -e ali_drw_25080/hif_rmw/ \
+ -e ali_drw_25080/cycle/ \
+ -e ali_drw_27000/hif_wr/ \
+ -e ali_drw_27000/hif_rd/ \
+ -e ali_drw_27000/hif_rmw/ \
+ -e ali_drw_27000/cycle/ \
+ -e ali_drw_27080/hif_wr/ \
+ -e ali_drw_27080/hif_rd/ \
+ -e ali_drw_27080/hif_rmw/ \
+ -e ali_drw_27080/cycle/ -- sleep 10
+
+Example usage of counting all memory read/write bandwidth by metric::
+
+ perf stat -M ddr_read_bandwidth.all -- sleep 10
+ perf stat -M ddr_write_bandwidth.all -- sleep 10
+
+The average DRAM bandwidth can be calculated as follows:
+
+- Read Bandwidth = perf_hif_rd * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle
+- Write Bandwidth = (perf_hif_wr + perf_hif_rmw) * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle
+
+Here, DDRC_WIDTH = 64 bytes.
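+
+As a purely illustrative helper (the counter names mirror the perf events
+above; ``ddrc_freq_hz`` is an assumed input, not something the driver
+exports), the formulas translate to::
+
+  #include <stdint.h>
+
+  #define DDRC_WIDTH 64.0   /* bytes per HIF transfer */
+
+  /* Average read bandwidth in bytes/s over the measured interval. */
+  static double read_bw(uint64_t hif_rd, uint64_t cycles, double ddrc_freq_hz)
+  {
+          return hif_rd * DDRC_WIDTH * ddrc_freq_hz / cycles;
+  }
+
+  /* Average write bandwidth in bytes/s; RMW commands also write. */
+  static double write_bw(uint64_t hif_wr, uint64_t hif_rmw, uint64_t cycles,
+                         double ddrc_freq_hz)
+  {
+          return (hif_wr + hif_rmw) * DDRC_WIDTH * ddrc_freq_hz / cycles;
+  }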
+
+The current driver does not support sampling, so "perf record" is
+unsupported. Attaching to a task is also unsupported as the events are all
+uncore.
diff --git a/Documentation/admin-guide/perf/ampere_cspmu.rst b/Documentation/admin-guide/perf/ampere_cspmu.rst
new file mode 100644
index 000000000000..94f93f5aee6c
--- /dev/null
+++ b/Documentation/admin-guide/perf/ampere_cspmu.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================
+Ampere SoC Performance Monitoring Unit (PMU)
+============================================
+
+Ampere SoC PMU is a generic PMU IP that follows Arm CoreSight PMU architecture.
+Therefore, the driver is implemented as a submodule of the arm_cspmu driver.
+In the first phase it's used for counting MCU events on AmpereOne.
+
+
+MCU PMU events
+--------------
+
+The PMU driver supports setting filters for "rank", "bank", and "threshold".
+Note that the filters are per PMU instance rather than per event.
+
+
+Example for perf tool use::
+
+ / # perf list ampere
+
+ ampere_mcu_pmu_0/act_sent/ [Kernel PMU event]
+ <...>
+ ampere_mcu_pmu_1/rd_sent/ [Kernel PMU event]
+ <...>
+
+ / # perf stat -a -e ampere_mcu_pmu_0/act_sent,bank=5,rank=3,threshold=2/,ampere_mcu_pmu_1/rd_sent/ \
+ sleep 1
diff --git a/Documentation/admin-guide/perf/arm-ccn.rst b/Documentation/admin-guide/perf/arm-ccn.rst
index 832b0c64023a..f62f7fe50eba 100644
--- a/Documentation/admin-guide/perf/arm-ccn.rst
+++ b/Documentation/admin-guide/perf/arm-ccn.rst
@@ -27,7 +27,7 @@ Crosspoint PMU events require "xp" (index), "bus" (bus number)
and "vc" (virtual channel ID).
Crosspoint watchpoint-based events (special "event" value 0xfe)
-require "xp" and "vc" as as above plus "port" (device port index),
+require "xp" and "vc" as above plus "port" (device port index),
"dir" (transmit/receive direction), comparator values ("cmp_l"
and "cmp_h") and "mask", being index of the comparator mask.
diff --git a/Documentation/admin-guide/perf/arm-cmn.rst b/Documentation/admin-guide/perf/arm-cmn.rst
new file mode 100644
index 000000000000..796e25b7027b
--- /dev/null
+++ b/Documentation/admin-guide/perf/arm-cmn.rst
@@ -0,0 +1,65 @@
+=============================
+Arm Coherent Mesh Network PMU
+=============================
+
+CMN-600 is a configurable mesh interconnect consisting of a rectangular
+grid of crosspoints (XPs), with each crosspoint supporting up to two
+device ports to which various AMBA CHI agents are attached.
+
+CMN implements a distributed PMU design as part of its debug and trace
+functionality. This consists of a local monitor (DTM) at every XP, which
+counts up to 4 event signals from the connected device nodes and/or the
+XP itself. Overflow from these local counters is accumulated in up to 8
+global counters implemented by the main controller (DTC), which provides
+overall PMU control and interrupts for global counter overflow.
+
+PMU events
+----------
+
+The PMU driver registers a single PMU device for the whole interconnect,
+see /sys/bus/event_source/devices/arm_cmn_0. Multi-chip systems may link
+more than one CMN together via external CCIX links - in this situation,
+each mesh counts its own events entirely independently, and additional
+PMU devices will be named arm_cmn_{1..n}.
+
+Most events are specified in a format based directly on the TRM
+definitions - "type" selects the respective node type, and "eventid" the
+event number. Some events require an additional occupancy ID, which is
+specified by "occupid".
+
+* Since RN-D nodes do not have any distinct events from RN-I nodes, they
+ are treated as the same type (0xa), and the common event templates are
+ named "rnid_*".
+
+* The cycle counter is treated as a synthetic event belonging to the DTC
+ node ("type" == 0x3, "eventid" is ignored).
+
+* XP events also encode the port and channel in the "eventid" field, to
+ match the underlying pmu_event0_id encoding for the pmu_event_sel
+ register. The event templates are named with prefixes to cover all
+ permutations.
+
+By default each event provides an aggregate count over all nodes of the
+given type. To target a specific node, "bynodeid" must be set to 1 and
+"nodeid" to the appropriate value derived from the CMN configuration
+(as defined in the "Node ID Mapping" section of the TRM).
+
+Watchpoints
+-----------
+
+The PMU can also count watchpoint events to monitor specific flit
+traffic. Watchpoints are treated as a synthetic event type, and like PMU
+events can be global or targeted with a particular XP's "nodeid" value.
+Since the watchpoint direction is otherwise implicit in the underlying
+register selection, separate events are provided for flit uploads and
+downloads.
+
+The flit match value and mask are passed in config1 and config2 ("val"
+and "mask" respectively). "wp_dev_sel", "wp_chn_sel", "wp_grp" and
+"wp_exclusive" are specified per the TRM definitions for dtm_wp_config0.
+Where a watchpoint needs to match fields from both match groups on the
+REQ or SNP channel, it can be specified as two events - one for each
+group - with the same nonzero "combine" value. The count for such a
+pair of combined events will be attributed to the primary match.
+Watchpoint events with a "combine" value of 0 are considered independent
+and will count individually.
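+
+A hedged sketch of a combined watchpoint pair (the event name
+"watchpoint_up" and all field values here are illustrative; consult the
+"events" directory for the real templates)::
+
+  $# perf stat -a -e arm_cmn_0/watchpoint_up,wp_grp=0,val=0x42,mask=0x0,combine=1/,arm_cmn_0/watchpoint_up,wp_grp=1,val=0x3,mask=0x0,combine=1/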
diff --git a/Documentation/admin-guide/perf/cxl.rst b/Documentation/admin-guide/perf/cxl.rst
new file mode 100644
index 000000000000..9233ea0d0b10
--- /dev/null
+++ b/Documentation/admin-guide/perf/cxl.rst
@@ -0,0 +1,68 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+CXL Performance Monitoring Unit (CPMU)
+======================================
+
+The CXL rev 3.0 specification provides a definition of CXL Performance
+Monitoring Unit in section 13.2: Performance Monitoring.
+
+CXL components (e.g. Root Port, Switch Upstream Port, End Point) may have
+any number of CPMU instances. CPMU capabilities are fully discoverable from
+the devices. The specification provides event definitions for all CXL protocol
+message types and a set of additional events for things commonly counted on
+CXL devices (e.g. DRAM events).
+
+CPMU driver
+===========
+
+The CPMU driver registers a perf PMU with the name pmu_mem<X>.<Y> on the CXL bus
+representing the Yth CPMU for memX.
+
+ /sys/bus/cxl/device/pmu_mem<X>.<Y>
+
+The associated PMU is registered as
+
+ /sys/bus/event_sources/devices/cxl_pmu_mem<X>.<Y>
+
+In common with other CXL bus devices, the id has no specific meaning and the
+relationship to a specific CXL device should be established via the parent of
+the device on the CXL bus.
+
+The PMU driver provides a description of the available events and filter
+options in sysfs.
+
+The "format" directory describes all formats of the config (event vendor id,
+group id and mask) config1 (threshold, filter enables) and config2 (filter
+parameters) fields of the perf_event_attr structure. The "events" directory
+describes all documented events show in perf list.
+
+The events shown in perf list are the most fine-grained events, with a single
+bit of the event mask set. More general events may be enabled by setting
+multiple mask bits in config. For example, all Device to Host Read Requests
+may be captured on a single counter by setting the bits for all of the
+following (a sketch follows the list):
+* d2h_req_rdcurr
+* d2h_req_rdown
+* d2h_req_rdshared
+* d2h_req_rdany
+* d2h_req_rdownnodata
+
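+A hedged sketch of such a combined counter (VID, GID and MASK are
+placeholders; the real values come from the "events" and "format" entries in
+sysfs, with MASK having the bit for each of the five events above set)::
+
+  $# perf stat -a -e cxl_pmu_mem0.0/vid=VID,gid=GID,mask=MASK/
+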
+Example of usage::
+
+ $# perf list
+ cxl_pmu_mem0.0/clock_ticks/ [Kernel PMU event]
+ cxl_pmu_mem0.0/d2h_req_rdshared/ [Kernel PMU event]
+ cxl_pmu_mem0.0/h2d_req_snpcur/ [Kernel PMU event]
+ cxl_pmu_mem0.0/h2d_req_snpdata/ [Kernel PMU event]
+ cxl_pmu_mem0.0/h2d_req_snpinv/ [Kernel PMU event]
+ -----------------------------------------------------------
+
+ $# perf stat -a -e cxl_pmu_mem0.0/clock_ticks/ -e cxl_pmu_mem0.0/d2h_req_rdshared/
+
+Vendor-specific events may also be available and, if so, can be used via::
+
+ $# perf stat -a -e cxl_pmu_mem0.0/vid=VID,gid=GID,mask=MASK/
+
+The driver does not support sampling, so "perf record" is unsupported.
+It only supports system-wide counting, so attaching to a task is
+unsupported.
diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
new file mode 100644
index 000000000000..d47cd229d710
--- /dev/null
+++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
@@ -0,0 +1,94 @@
+======================================================================
+Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
+======================================================================
+
+DesignWare Cores (DWC) PCIe PMU
+===============================
+
+The PMU is a PCIe configuration space register block provided by each PCIe Root
+Port in a Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error
+injection, and Statistics).
+
+As the name indicates, the RAS DES capability supports system level
+debugging, AER error injection, and collection of statistics. To facilitate
+collection of statistics, Synopsys DesignWare Cores PCIe controller
+provides the following two features:
+
+- one 64-bit counter for Time Based Analysis (RX/TX data throughput and
+ time spent in each low-power LTSSM state) and
+- one 32-bit counter for Event Counting (error and non-error events for
+ a specified lane)
+
+Note: There is no interrupt for counter overflow.
+
+Time Based Analysis
+-------------------
+
+Using this feature you can obtain information regarding RX/TX data
+throughput and time spent in each low-power LTSSM state by the controller.
+The PMU measures data in two categories:
+
+- Group#0: Percentage of time the controller stays in LTSSM states.
+- Group#1: Amount of data processed (Units of 16 bytes).
+
+Lane Event counters
+-------------------
+
+Using this feature you can obtain Error and Non-Error information for a
+specific lane from the controller. The PMU event is selected by all of:
+
+- Group i
+- Event j within the Group i
+- Lane k
+
+Some of the events only exist for specific configurations.
+
+DesignWare Cores (DWC) PCIe PMU Driver
+=======================================
+
+This driver adds a PMU device for each PCIe Root Port, named based on the BDF
+of the Root Port. For example,
+
+ 30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
+
+the PMU device name for this Root Port is dwc_rootport_3018.
+
+The DWC PCIe PMU driver registers a perf PMU driver, which provides
+description of available events and configuration options in sysfs, see
+/sys/bus/event_source/devices/dwc_rootport_{bdf}.
+
+The "format" directory describes format of the config fields of the
+perf_event_attr structure. The "events" directory provides configuration
+templates for all documented events. For example,
+"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
+
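+For instance, the raw encoding can be used in place of the named template;
+the following two invocations (a hedged sketch) should be equivalent::
+
+  $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
+  $# perf stat -a -e dwc_rootport_3018/eventid=0x22,type=0x1/
+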
+The "perf list" command shall list the available events from sysfs, e.g.::
+
+ $# perf list | grep dwc_rootport
+ <...>
+ dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event]
+ <...>
+ dwc_rootport_3018/rx_memory_read,lane=?/ [Kernel PMU event]
+
+Time Based Analysis Event Usage
+-------------------------------
+
+Example usage of counting PCIe RX TLP data payload (Units of bytes)::
+
+ $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
+
+The average RX/TX bandwidth can be calculated using the following formula::
+
+ PCIe RX Bandwidth = Rx_PCIe_TLP_Data_Payload / Measure_Time_Window
+ PCIe TX Bandwidth = Tx_PCIe_TLP_Data_Payload / Measure_Time_Window
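+
+For instance, counting over a fixed one-second window (a hedged sketch; the
+event name matches the example above)::
+
+  $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ sleep 1
+
+Dividing the reported count by the one-second measurement window gives the
+average RX bandwidth over that window.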
+
+Lane Event Usage
+-------------------------------
+
+Each lane has the same event set, and to avoid generating a list of hundreds
+of events, the user needs to specify the lane ID explicitly, e.g.::
+
+ $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/
+
+The driver does not support sampling, therefore "perf record" will not
+work. Per-task (without "-a") perf sessions are not supported.
diff --git a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst
new file mode 100644
index 000000000000..5541ff40e06a
--- /dev/null
+++ b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst
@@ -0,0 +1,146 @@
+================================================
+HiSilicon PCIe Performance Monitoring Unit (PMU)
+================================================
+
+On Hip09, the HiSilicon PCIe Performance Monitoring Unit (PMU) can monitor
+PCIe bandwidth, latency, bus utilization and buffer occupancy data.
+
+Each PCIe Core has a PMU that monitors the Root Ports of that PCIe Core and
+all Endpoints downstream of those Root Ports.
+
+
+HiSilicon PCIe PMU driver
+=========================
+
+The PCIe PMU driver registers a perf PMU named after its SICL ID and PCIe
+Core ID::
+
+ /sys/bus/event_source/hisi_pcie<sicl>_core<core>
+
+The PMU driver provides a description of the available events and filter
+options in sysfs,
+see /sys/bus/event_source/devices/hisi_pcie<sicl>_core<core>.
+
+The "format" directory describes all formats of the config (events) and config1
+(filter options) fields of the perf_event_attr structure. The "events" directory
+describes all documented events shown in perf list.
+
+The "identifier" sysfs file allows users to identify the version of the
+PMU hardware device.
+
+The "bus" sysfs file allows users to get the bus number of Root Ports
+monitored by PMU.
+
+Example usage of perf::
+
+ $# perf list
+ hisi_pcie0_core0/rx_mwr_latency/ [kernel PMU event]
+ hisi_pcie0_core0/rx_mwr_cnt/ [kernel PMU event]
+ ------------------------------------------
+
+ $# perf stat -e hisi_pcie0_core0/rx_mwr_latency,port=0xffff/
+ $# perf stat -e hisi_pcie0_core0/rx_mwr_cnt,port=0xffff/
+
+Related events are usually used to calculate bandwidth, latency or other
+metrics. They need to start and end counting at the same time, therefore
+related events are best placed in the same event group to get the expected
+values. There are two ways to know if events are related:
+
+a) By event name, such as the latency events "xxx_latency, xxx_cnt" or
+ bandwidth events "xxx_flux, xxx_time".
+b) By event type, such as "event=0xXXXX, event=0x1XXXX".
+
+Example usage of perf group::
+
+ $# perf stat -e "{hisi_pcie0_core0/rx_mwr_latency,port=0xffff/,hisi_pcie0_core0/rx_mwr_cnt,port=0xffff/}"
+
+The current driver does not support sampling, so "perf record" is
+unsupported. Also, attaching to a task is unsupported for the PCIe PMU.
+
+Filter options
+--------------
+
+1. Target filter
+
+   The PMU can only monitor the performance of traffic downstream of the
+   target Root Ports or of the target Endpoint. The PCIe PMU driver supports
+   "port" and "bdf" interfaces for users.
+   Please note that one of these two interfaces must be set, and they are not
+   supported at the same time. If both are set, only the "port" filter is
+   valid.
+   If the "port" filter is not set or is set explicitly to zero (default),
+   the "bdf" filter will be in effect, because "bdf=0" means 0000:00:00.0.
+
+ - port
+
+ "port" filter can be used in all PCIe PMU events, target Root Port can be
+ selected by configuring the 16-bits-bitmap "port". Multi ports can be
+ selected for AP-layer-events, and only one port can be selected for
+ TL/DL-layer-events.
+
+ For example, if target Root Port is 0000:00:00.0 (x8 lanes), bit0 of
+ bitmap should be set, port=0x1; if target Root Port is 0000:00:04.0 (x4
+ lanes), bit8 is set, port=0x100; if these two Root Ports are both
+ monitored, port=0x101.
+
+ Example usage of perf::
+
+ $# perf stat -e hisi_pcie0_core0/rx_mwr_latency,port=0x1/ sleep 5
+
+ - bdf
+
+ "bdf" filter can only be used in bandwidth events, target Endpoint is
+ selected by configuring BDF to "bdf". Counter only counts the bandwidth of
+ message requested by target Endpoint.
+
+ For example, "bdf=0x3900" means BDF of target Endpoint is 0000:39:00.0.
+
+ Example usage of perf::
+
+ $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,bdf=0x3900/ sleep 5
+
+2. Trigger filter
+
+   Event counting starts the first time a TLP length is greater/smaller than
+   the trigger condition. You can set the trigger condition by writing
+   "trig_len", and set the trigger mode by writing "trig_mode". This filter
+   can only be used with bandwidth events.
+
+ For example, "trig_len=4" means trigger condition is 2^4 DW, "trig_mode=0"
+ means statistics start when TLP length > trigger condition, "trig_mode=1"
+ means start when TLP length < condition.
+
+ Example usage of perf::
+
+ $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,trig_len=0x4,trig_mode=1/ sleep 5
+
+3. Threshold filter
+
+   The counter counts when the TLP length is within the specified range. You
+   can set the threshold by writing "thr_len", and set the threshold mode by
+   writing "thr_mode". This filter can only be used with bandwidth events.
+
+ For example, "thr_len=4" means threshold is 2^4 DW, "thr_mode=0" means
+ counter counts when TLP length >= threshold, and "thr_mode=1" means counts
+ when TLP length < threshold.
+
+ Example usage of perf::
+
+ $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,thr_len=0x4,thr_mode=1/ sleep 5
+
+4. TLP Length filter
+
+ When counting bandwidth, the data can be composed of certain parts of TLP
+ packets. You can specify it through "len_mode":
+
+ - 2'b00: Reserved (Do not use this since the behaviour is undefined)
+ - 2'b01: Bandwidth of TLP payloads
+ - 2'b10: Bandwidth of TLP headers
+ - 2'b11: Bandwidth of both TLP payloads and headers
+
+ For example, "len_mode=2" means only counting the bandwidth of TLP headers
+ and "len_mode=3" means the final bandwidth data is composed of both TLP
+ headers and payloads. Default value if not specified is 2'b11.
+
+ Example usage of perf::
+
+ $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,len_mode=0x1/ sleep 5
diff --git a/Documentation/admin-guide/perf/hisi-pmu.rst b/Documentation/admin-guide/perf/hisi-pmu.rst
index 404a5c3d9d00..e0174d20809a 100644
--- a/Documentation/admin-guide/perf/hisi-pmu.rst
+++ b/Documentation/admin-guide/perf/hisi-pmu.rst
@@ -53,6 +53,72 @@ Example usage of perf::
$# perf stat -a -e hisi_sccl3_l3c0/rd_hit_cpipe/ sleep 5
$# perf stat -a -e hisi_sccl3_l3c0/config=0x02/ sleep 5
+For HiSilicon uncore PMU v2 whose identifier is 0x30, the topology is the same
+as PMU v1, but some new functions are added to the hardware.
+
+1. The L3C PMU supports filtering by core/thread within the cluster, which
+can be specified as a bitmap::
+
+ $# perf stat -a -e hisi_sccl3_l3c0/config=0x02,tt_core=0x3/ sleep 5
+
+This will only count the operations from core/thread 0 and 1 in this cluster.
+
+2. Tracetag allows the user to choose to count only read, write or atomic
+operations via the tt_req parameter in perf. The default value counts all
+operations. tt_req is 3 bits: 3'b100 represents read operations, 3'b101
+represents write operations, 3'b110 represents atomic store operations and
+3'b111 represents atomic non-store operations; other values are reserved::
+
+ $# perf stat -a -e hisi_sccl3_l3c0/config=0x02,tt_req=0x4/ sleep 5
+
+This will only count the read operations in this cluster.
+
+3. Datasrc allows the user to check where the data comes from. It is 5 bits.
+Some important codes are as follows:
+
+- 5'b00001: comes from L3C in this die;
+- 5'b01000: comes from L3C in the cross-die;
+- 5'b01001: comes from L3C which is in another socket;
+- 5'b01110: comes from the local DDR;
+- 5'b01111: comes from the cross-die DDR;
+- 5'b10000: comes from cross-socket DDR;
+
+etc. It is mainly helpful for finding how near the data source is to the CPU
+cores. If datasrc_cfg is used on multi-chip systems, datasrc_skt shall also
+be configured in the perf command::
+
+ $# perf stat -a -e hisi_sccl3_l3c0/config=0xb9,datasrc_cfg=0xE/,
+ hisi_sccl3_l3c0/config=0xb9,datasrc_cfg=0xF/ sleep 5
+
+4. Some HiSilicon SoCs encapsulate multiple CPU and IO dies. Each CPU die
+contains several Compute Clusters (CCLs). The I/O dies are called Super I/O
+clusters (SICLs), each containing multiple I/O clusters (ICLs). Each CCL/ICL
+in the SoC has a unique ID. Each ID is 11 bits, including a 6-bit SCCL-ID and
+a 5-bit CCL/ICL-ID. For an I/O die, the ICL-ID is one of the following:
+
+- 5'b00000: I/O_MGMT_ICL;
+- 5'b00001: Network_ICL;
+- 5'b00011: HAC_ICL;
+- 5'b10000: PCIe_ICL;
+
+5. uring_channel: UC PMU events 0x47~0x59 support filtering by tx request
+uring channel. It is 2 bits. Some important codes are as follows (a sketch
+follows the list):
+
+- 2'b11: count the events which are sent to the uring_ext (MATA) channel;
+- 2'b01: the same as 2'b11;
+- 2'b10: count the events which are sent to the uring (non-MATA) channel;
+- 2'b00: default value, count the events which are sent to both the uring
+  and uring_ext channels;
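+
+A hypothetical example (the PMU instance name and the event code are
+illustrative)::
+
+  $# perf stat -a -e hisi_sccl1_uc0/config=0x47,uring_channel=0x3/ sleep 5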
+
+Users can configure IDs to count data coming from a specific CCL/ICL by
+setting srcid_cmd & srcid_msk, and data destined for a specific CCL/ICL by
+setting tgtid_cmd & tgtid_msk. A set bit in srcid_msk/tgtid_msk means the PMU
+will not check that bit when matching against the srcid_cmd/tgtid_cmd.
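+
+A hypothetical example (the tgtid values are purely illustrative; real IDs
+depend on the SoC topology)::
+
+  $# perf stat -a -e hisi_sccl3_l3c0/config=0x02,tgtid_cmd=0x10,tgtid_msk=0x0/ sleep 5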
+
+If all of these options are disabled, the PMU works with the default values:
+it does not distinguish filter conditions or ID information and returns the
+total counter values in the PMU counters.
+
The current driver does not support sampling. So "perf record" is unsupported.
Also attach to a task is unsupported as the events are all uncore.
diff --git a/Documentation/admin-guide/perf/hns3-pmu.rst b/Documentation/admin-guide/perf/hns3-pmu.rst
new file mode 100644
index 000000000000..75a40846d47f
--- /dev/null
+++ b/Documentation/admin-guide/perf/hns3-pmu.rst
@@ -0,0 +1,136 @@
+======================================
+HNS3 Performance Monitoring Unit (PMU)
+======================================
+
+HNS3 (HiSilicon Network System 3) Performance Monitoring Unit (PMU) is an
+End Point device that collects performance statistics of the HiSilicon SoC
+NIC. On Hip09, each SICL (Super I/O Cluster) has one PMU device.
+
+HNS3 PMU supports collection of performance statistics such as bandwidth,
+latency, packet rate and interrupt rate.
+
+Each HNS3 PMU supports 8 hardware events.
+
+HNS3 PMU driver
+===============
+
+The HNS3 PMU driver registers a perf PMU named after its SICL ID::
+
+ /sys/devices/hns3_pmu_sicl_<sicl_id>
+
+The PMU driver provides descriptions of the available events, filter modes,
+format, identifier and cpumask in sysfs.
+
+The "events" directory describes the event code of all supported events
+shown in perf list.
+
+The "filtermode" directory describes the supported filter modes of each
+event.
+
+The "format" directory describes all formats of the config (events) and
+config1 (filter options) fields of the perf_event_attr structure.
+
+The "identifier" file shows version of PMU hardware device.
+
+The "bdf_min" and "bdf_max" files show the supported bdf range of each
+pmu device.
+
+The "hw_clk_freq" file shows the hardware clock frequency of each pmu
+device.
+
+Example usage of checking event code and subevent code::
+
+ $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_time
+ config=0x00204
+ $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_packet_num
+ config=0x10204
+
+Each performance statistic has a pair of events to get two values to
+calculate real performance data in userspace.
+
+Bits 0~15 of config (here 0x0204) are the true hardware event code. If two
+events have the same value in bits 0~15 of config, they are an event pair.
+Bit 16 of config selects counter 0 or counter 1 of the hardware event.
+
+After getting the two values of an event pair in userspace, the real
+performance data is calculated as::
+
+ counter 0 / counter 1
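+
+For example, for the delay event pair shown above, dividing the value of
+``dly_tx_normal_to_mac_time`` by ``dly_tx_normal_to_mac_packet_num`` gives
+the average TX delay per packet.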
+
+Example usage of checking supported filter mode::
+
+ $# cat /sys/devices/hns3_pmu_sicl_0/filtermode/bw_ssu_rpu_byte_num
+ filter mode supported: global/port/port-tc/func/func-queue/
+
+Example usage of perf::
+
+ $# perf list
+ hns3_pmu_sicl_0/bw_ssu_rpu_byte_num/ [kernel PMU event]
+ hns3_pmu_sicl_0/bw_ssu_rpu_time/ [kernel PMU event]
+ ------------------------------------------
+
+ $# perf stat -g -e hns3_pmu_sicl_0/bw_ssu_rpu_byte_num,global=1/ -e hns3_pmu_sicl_0/bw_ssu_rpu_time,global=1/ -I 1000
+ or
+ $# perf stat -g -e hns3_pmu_sicl_0/config=0x00002,global=1/ -e hns3_pmu_sicl_0/config=0x10002,global=1/ -I 1000
+
+
+Filter modes
+--------------
+
+1. global mode
+The PMU collects performance statistics for all HNS3 PCIe functions of the
+IO die. Setting the "global" filter option to 1 enables this mode.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,global=1/ -I 1000
+
+2. port mode
+The PMU collects performance statistics of one whole physical port. The port
+id is the same as the mac id. The "tc" filter option must be set to 0xF in
+this mode; here tc stands for traffic class.
+
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,port=0,tc=0xF/ -I 1000
+
+3. port-tc mode
+The PMU collects performance statistics of one tc of a physical port. The
+port id is the same as the mac id. The "tc" filter option must be set to
+0 ~ 7 in this mode.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,port=0,tc=0/ -I 1000
+
+4. func mode
+The PMU collects performance statistics of one PF/VF. The function id is the
+BDF of the PF/VF; its conversion formula is::
+
+ func = (bus << 8) + (device << 3) + (function)
+
+for example::
+
+  BDF      func
+  35:00.0  0x3500
+  35:00.1  0x3501
+  35:01.0  0x3508
+
+In this mode, the "queue" filter option must be set to 0xFFFF.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,bdf=0x3500,queue=0xFFFF/ -I 1000
+
+5. func-queue mode
+The PMU collects performance statistics of one queue of a PF/VF. The
+function id is the BDF of the PF/VF; the "queue" filter option must be set
+to the exact queue id of the function.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,bdf=0x3500,queue=0/ -I 1000
+
+6. func-intr mode
+The PMU collects performance statistics of one interrupt of a PF/VF. The
+function id is the BDF of the PF/VF; the "intr" filter option must be set to
+the exact interrupt id of the function.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x00301,bdf=0x3500,intr=0/ -I 1000
diff --git a/Documentation/admin-guide/perf/imx-ddr.rst b/Documentation/admin-guide/perf/imx-ddr.rst
index f05f56c73b7d..77418ae5a290 100644
--- a/Documentation/admin-guide/perf/imx-ddr.rst
+++ b/Documentation/admin-guide/perf/imx-ddr.rst
@@ -4,7 +4,7 @@ Freescale i.MX8 DDR Performance Monitoring Unit (PMU)
There are no performance counters inside the DRAM controller, so performance
signals are brought out to the edge of the controller where a set of 4 x 32 bit
-counters is implemented. This is controlled by the CSV modes programed in counter
+counters is implemented. This is controlled by the CSV modes programmed in counter
control register which causes a large number of PERF signals to be generated.
Selection of the value for each counter is done via the config registers. There
@@ -13,8 +13,8 @@ is one register for each counter. Counter 0 is special in that it always counts
interrupt is raised. If any other counter overflows, it continues counting, and
no interrupt is raised.
-The "format" directory describes format of the config (event ID) and config1
-(AXI filtering) fields of the perf_event_attr structure, see /sys/bus/event_source/
+The "format" directory describes format of the config (event ID) and config1/2
+(AXI filter setting) fields of the perf_event_attr structure, see /sys/bus/event_source/
devices/imx8_ddr0/format/. The "events" directory describes the events types
hardware supported that can be used with perf tool, see /sys/bus/event_source/
devices/imx8_ddr0/events/. The "caps" directory describes filter features implemented
@@ -28,12 +28,11 @@ in DDR PMU, see /sys/bus/events_source/devices/imx8_ddr0/caps/.
AXI filtering is only used by CSV modes 0x41 (axid-read) and 0x42 (axid-write)
to count reading or writing matches filter setting. Filter setting is various
from different DRAM controller implementations, which is distinguished by quirks
-in the driver. You also can dump info from userspace, filter in "caps" directory
-indicates whether PMU supports AXI ID filter or not; enhanced_filter indicates
-whether PMU supports enhanced AXI ID filter or not. Value 0 for un-supported, and
-value 1 for supported.
+in the driver. You can also dump info from userspace; the "caps" directory
+shows the type of AXI filter (filter, enhanced_filter and super_filter).
+Value 0 means unsupported, and value 1 means supported.
-* With DDR_CAP_AXI_ID_FILTER quirk(filter: 1, enhanced_filter: 0).
+* With DDR_CAP_AXI_ID_FILTER quirk(filter: 1, enhanced_filter: 0, super_filter: 0).
Filter is defined with two configuration parts:
--AXI_ID defines AxID matching value.
--AXI_MASKING defines which bits of AxID are meaningful for the matching.
@@ -65,7 +64,37 @@ value 1 for supported.
perf stat -a -e imx8_ddr0/axid-read,axi_id=0x12/ cmd, which will monitor ARID=0x12
-* With DDR_CAP_AXI_ID_FILTER_ENHANCED quirk(filter: 1, enhanced_filter: 1).
+* With DDR_CAP_AXI_ID_FILTER_ENHANCED quirk(filter: 1, enhanced_filter: 1, super_filter: 0).
This is an extension to the DDR_CAP_AXI_ID_FILTER quirk which permits
counting the number of bytes (as opposed to the number of bursts) from DDR
read and write transactions concurrently with another set of data counters.
+
+* With DDR_CAP_AXI_ID_PORT_CHANNEL_FILTER quirk(filter: 0, enhanced_filter: 0, super_filter: 1).
+  There is a limitation in the previous AXI filter: it cannot filter
+  different IDs at the same time because the filter is shared between
+  counters. This quirk is an extension of the AXI ID filter. One improvement
+  is that counters 1-3 have their own filters, which means various IDs can
+  be filtered concurrently. Another improvement is that counters 1-3 support
+  AXI PORT and CHANNEL selection, i.e. selecting the address channel or the
+  data channel.
+
+ Filter is defined with 2 configuration registers per counter 1-3.
+ --Counter N MASK COMP register - including AXI_ID and AXI_MASKING.
+ --Counter N MUX CNTL register - including AXI CHANNEL and AXI PORT.
+
+ - 0: address channel
+ - 1: data channel
+
+  For the PMU in the DDR subsystem, only a single port0 exists, so axi_port
+  is reserved and should be 0.
+
+ .. code-block:: bash
+
+ perf stat -a -e imx8_ddr0/axid-read,axi_mask=0xMMMM,axi_id=0xDDDD,axi_channel=0xH/ cmd
+ perf stat -a -e imx8_ddr0/axid-write,axi_mask=0xMMMM,axi_id=0xDDDD,axi_channel=0xH/ cmd
+
+ .. note::
+
+      axi_channel is inverted in userspace and will be reverted in the driver
+      automatically, so users do not need to specify axi_channel if they want
+      to monitor the data channel of DDR transactions, since the data channel
+      is more meaningful.
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index 47c99f40cc16..7eb3dcd6f4da 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -8,10 +8,20 @@ Performance monitor support
:maxdepth: 1
hisi-pmu
+ hisi-pcie-pmu
+ hns3-pmu
imx-ddr
qcom_l2_pmu
qcom_l3_pmu
+ starfive_starlink_pmu
arm-ccn
+ arm-cmn
xgene-pmu
arm_dsu_pmu
thunderx2-pmu
+ alibaba_pmu
+ dwc_pcie_pmu
+ nvidia-pmu
+ meson-ddr-pmu
+ cxl
+ ampere_cspmu
diff --git a/Documentation/admin-guide/perf/meson-ddr-pmu.rst b/Documentation/admin-guide/perf/meson-ddr-pmu.rst
new file mode 100644
index 000000000000..8e71be1d6346
--- /dev/null
+++ b/Documentation/admin-guide/perf/meson-ddr-pmu.rst
@@ -0,0 +1,70 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================================
+Amlogic SoC DDR Bandwidth Performance Monitoring Unit (PMU)
+===========================================================
+
+The Amlogic Meson G12 SoC contains a bandwidth monitor inside the DRAM
+controller. The monitor includes 4 channels. Each channel can count the
+requests accessing DRAM, and each channel can count up to 3 AXI ports
+simultaneously. It can be helpful to show whether the performance bottleneck
+is on DDR bandwidth.
+
+Currently, this driver supports the following 5 perf events:
+
++ meson_ddr_bw/total_rw_bytes/
++ meson_ddr_bw/chan_1_rw_bytes/
++ meson_ddr_bw/chan_2_rw_bytes/
++ meson_ddr_bw/chan_3_rw_bytes/
++ meson_ddr_bw/chan_4_rw_bytes/
+
+meson_ddr_bw/chan_{1,2,3,4}_rw_bytes/ events are channel-specific events.
+Each channel supports filtering, which lets the channel monitor an individual
+IP module in the SoC.
+
+Below are DDR access request event filter keywords:
+
++ arm - from CPU
++ vpu_read1 - from OSD + VPP read
++ gpu - from 3D GPU
++ pcie - from PCIe controller
++ hdcp - from HDCP controller
++ hevc_front - from HEVC codec front end
++ usb3_0 - from USB3.0 controller
++ hevc_back - from HEVC codec back end
++ h265enc - from HEVC encoder
++ vpu_read2 - from DI read
++ vpu_write1 - from VDIN write
++ vpu_write2 - from di write
++ vdec - from legacy codec video decoder
++ hcodec - from H264 encoder
++ ge2d - from ge2d
++ spicc1 - from SPI controller 1
++ usb0 - from USB2.0 controller 0
++ dma - from system DMA controller 1
++ arb0 - from arb0
++ sd_emmc_b - from SD eMMC b controller
++ usb1 - from USB2.0 controller 1
++ audio - from Audio module
++ sd_emmc_c - from SD eMMC c controller
++ spicc2 - from SPI controller 2
++ ethernet - from Ethernet controller
+
+
+Examples:
+
+ + Show the total DDR bandwidth per second:
+
+ .. code-block:: bash
+
+ perf stat -a -e meson_ddr_bw/total_rw_bytes/ -I 1000 sleep 10
+
+
+ + Show the individual DDR bandwidth from the CPU and GPU respectively, as
+   well as the sum of them:
+
+ .. code-block:: bash
+
+ perf stat -a -e meson_ddr_bw/chan_1_rw_bytes,arm=1/ -I 1000 sleep 10
+ perf stat -a -e meson_ddr_bw/chan_2_rw_bytes,gpu=1/ -I 1000 sleep 10
+ perf stat -a -e meson_ddr_bw/chan_3_rw_bytes,arm=1,gpu=1/ -I 1000 sleep 10
+
diff --git a/Documentation/admin-guide/perf/nvidia-pmu.rst b/Documentation/admin-guide/perf/nvidia-pmu.rst
new file mode 100644
index 000000000000..2e0d47cfe7ea
--- /dev/null
+++ b/Documentation/admin-guide/perf/nvidia-pmu.rst
@@ -0,0 +1,299 @@
+=========================================================
+NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
+=========================================================
+
+The NVIDIA Tegra SoC includes various system PMUs to measure key performance
+metrics like memory bandwidth, latency, and utilization:
+
+* Scalable Coherency Fabric (SCF)
+* NVLink-C2C0
+* NVLink-C2C1
+* CNVLink
+* PCIE
+
+PMU Driver
+----------
+
+The PMUs in this document are based on the ARM CoreSight PMU Architecture as
+described in the document ARM IHI 0091. Since this is a standard
+architecture, the PMUs are managed by a common driver "arm-cs-arch-pmu". This
+driver describes the available events and configuration of each PMU in sysfs.
+Please see the sections below to get the sysfs path of each PMU. Like other
+uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to show
+the CPU id used to handle the PMU events. There is also an "associated_cpus"
+sysfs attribute, which contains a list of CPUs associated with the PMU
+instance.
+
+.. _SCF_PMU_Section:
+
+SCF PMU
+-------
+
+The SCF PMU monitors system level cache events, CPU traffic, and
+strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see
+:ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU
+traffic coverage.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_sources/devices/nvidia_scf_pmu_<socket-id>.
+
+Example usage:
+
+* Count event id 0x0 in socket 0::
+
+ perf stat -a -e nvidia_scf_pmu_0/event=0x0/
+
+* Count event id 0x0 in socket 1::
+
+ perf stat -a -e nvidia_scf_pmu_1/event=0x0/
+
+NVLink-C2C0 PMU
+--------------------
+
+The NVLink-C2C0 PMU monitors incoming traffic from a GPU/CPU connected with
+NVLink-C2C (Chip-2-Chip) interconnect. The type of traffic captured by this
+PMU varies depending on the chip configuration:
+
+* NVIDIA Grace Hopper Superchip: Hopper GPU is connected with Grace SoC.
+
+ In this config, the PMU captures GPU ATS translated or EGM traffic from the GPU.
+
+* NVIDIA Grace CPU Superchip: two Grace CPU SoCs are connected.
+
+ In this config, the PMU captures read and relaxed ordered (RO) writes from
+ PCIE device of the remote SoC.
+
+Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
+the PMU traffic coverage.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.
+
+Example usage:
+
+* Count event id 0x0 from the GPU/CPU connected with socket 0::
+
+ perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/
+
+* Count event id 0x0 from the GPU/CPU connected with socket 1::
+
+ perf stat -a -e nvidia_nvlink_c2c0_pmu_1/event=0x0/
+
+* Count event id 0x0 from the GPU/CPU connected with socket 2::
+
+ perf stat -a -e nvidia_nvlink_c2c0_pmu_2/event=0x0/
+
+* Count event id 0x0 from the GPU/CPU connected with socket 3::
+
+ perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/
+
+NVLink-C2C1 PMU
+-------------------
+
+The NVLink-C2C1 PMU monitors incoming traffic from a GPU connected with
+NVLink-C2C (Chip-2-Chip) interconnect. This PMU captures untranslated GPU
+traffic, in contrast with NvLink-C2C0 PMU that captures ATS translated traffic.
+Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
+the PMU traffic coverage.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.
+
+Example usage:
+
+* Count event id 0x0 from the GPU connected with socket 0::
+
+ perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0/
+
+* Count event id 0x0 from the GPU connected with socket 1::
+
+ perf stat -a -e nvidia_nvlink_c2c1_pmu_1/event=0x0/
+
+* Count event id 0x0 from the GPU connected with socket 2::
+
+ perf stat -a -e nvidia_nvlink_c2c1_pmu_2/event=0x0/
+
+* Count event id 0x0 from the GPU connected with socket 3::
+
+ perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/
+
+CNVLink PMU
+---------------
+
+The CNVLink PMU monitors traffic from GPU and PCIE device on remote sockets
+to local memory. For PCIE traffic, this PMU captures read and relaxed ordered
+(RO) write traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
+for more info about the PMU traffic coverage.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>.
+
+Each SoC socket can be connected to one or more sockets via CNVLink. The
+user can use the "rem_socket" bitmap parameter to select the remote socket(s)
+to monitor. Each bit represents a socket number, e.g. "rem_socket=0xE"
+corresponds to sockets 1 to 3.
+/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
+shows the valid bits that can be set in the "rem_socket" parameter.
+
+The PMU cannot distinguish the remote traffic initiator, therefore it does
+not provide a filter to select the traffic source to monitor. It reports
+combined traffic from remote GPU and PCIE devices.
+
+Example usage:
+
+* Count event id 0x0 for the traffic from remote socket 1, 2, and 3 to socket 0::
+
+ perf stat -a -e nvidia_cnvlink_pmu_0/event=0x0,rem_socket=0xE/
+
+* Count event id 0x0 for the traffic from remote socket 0, 2, and 3 to socket 1::
+
+ perf stat -a -e nvidia_cnvlink_pmu_1/event=0x0,rem_socket=0xD/
+
+* Count event id 0x0 for the traffic from remote socket 0, 1, and 3 to socket 2::
+
+ perf stat -a -e nvidia_cnvlink_pmu_2/event=0x0,rem_socket=0xB/
+
+* Count event id 0x0 for the traffic from remote socket 0, 1, and 2 to socket 3::
+
+ perf stat -a -e nvidia_cnvlink_pmu_3/event=0x0,rem_socket=0x7/
+
+
+PCIE PMU
+------------
+
+The PCIE PMU monitors all read/write traffic from PCIE root ports to
+local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
+for more info about the PMU traffic coverage.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>.
+
+Each SoC socket can support multiple root ports. The user can use the
+"root_port" bitmap parameter to select the port(s) to monitor, i.e.
+"root_port=0xF" corresponds to root ports 0 to 3.
+/sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
+shows the valid bits that can be set in the "root_port" parameter.
+
+Example usage:
+
+* Count event id 0x0 from root port 0 and 1 of socket 0::
+
+ perf stat -a -e nvidia_pcie_pmu_0/event=0x0,root_port=0x3/
+
+* Count event id 0x0 from root port 0 and 1 of socket 1::
+
+ perf stat -a -e nvidia_pcie_pmu_1/event=0x0,root_port=0x3/
+
+.. _NVIDIA_Uncore_PMU_Traffic_Coverage_Section:
+
+Traffic Coverage
+----------------
+
+The PMU traffic coverage may vary depending on the chip configuration:
+
+* **NVIDIA Grace Hopper Superchip**: Hopper GPU is connected with Grace SoC.
+
+ Example configuration with two Grace SoCs::
+
+ ********************************* *********************************
+ * SOCKET-A * * SOCKET-B *
+ * * * *
+ * :::::::: * * :::::::: *
+ * : PCIE : * * : PCIE : *
+ * :::::::: * * :::::::: *
+ * | * * | *
+ * | * * | *
+ * ::::::: ::::::::: * * ::::::::: ::::::: *
+ * : : : : * * : : : : *
+ * : GPU :<--NVLink-->: Grace :<---CNVLink--->: Grace :<--NVLink-->: GPU : *
+ * : : C2C : SoC : * * : SoC : C2C : : *
+ * ::::::: ::::::::: * * ::::::::: ::::::: *
+ * | | * * | | *
+ * | | * * | | *
+ * &&&&&&&& &&&&&&&& * * &&&&&&&& &&&&&&&& *
+ * & GMEM & & CMEM & * * & CMEM & & GMEM & *
+ * &&&&&&&& &&&&&&&& * * &&&&&&&& &&&&&&&& *
+ * * * *
+ ********************************* *********************************
+
+ GMEM = GPU Memory (e.g. HBM)
+ CMEM = CPU Memory (e.g. LPDDR5X)
+
+ |
+ | Following table contains traffic coverage of Grace SoC PMU in socket-A:
+
+ ::
+
+ +--------------+-------+-----------+-----------+-----+----------+----------+
+ | | Source |
+ + +-------+-----------+-----------+-----+----------+----------+
+ | Destination | |GPU ATS |GPU Not-ATS| | Socket-B | Socket-B |
+ | |PCI R/W|Translated,|Translated | CPU | CPU/PCIE1| GPU/PCIE2|
+ | | |EGM | | | | |
+ +==============+=======+===========+===========+=====+==========+==========+
+ | Local | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | SCF PMU | CNVLink |
+ | SYSRAM/CMEM | PMU |PMU |PMU | PMU | | PMU |
+ +--------------+-------+-----------+-----------+-----+----------+----------+
+ | Local GMEM | PCIE | N/A |NVLink-C2C1| SCF | SCF PMU | CNVLink |
+ | | PMU | |PMU | PMU | | PMU |
+ +--------------+-------+-----------+-----------+-----+----------+----------+
+ | Remote | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | | |
+ | SYSRAM/CMEM | PMU |PMU |PMU | PMU | N/A | N/A |
+ | over CNVLink | | | | | | |
+ +--------------+-------+-----------+-----------+-----+----------+----------+
+ | Remote GMEM | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | | |
+ | over CNVLink | PMU |PMU |PMU | PMU | N/A | N/A |
+ +--------------+-------+-----------+-----------+-----+----------+----------+
+
+ PCIE1 traffic represents strongly ordered (SO) writes.
+ PCIE2 traffic represents reads and relaxed ordered (RO) writes.
+
+* **NVIDIA Grace CPU Superchip**: two Grace CPU SoCs are connected.
+
+ Example configuration with two Grace SoCs::
+
+ ******************* *******************
+ * SOCKET-A * * SOCKET-B *
+ * * * *
+ * :::::::: * * :::::::: *
+ * : PCIE : * * : PCIE : *
+ * :::::::: * * :::::::: *
+ * | * * | *
+ * | * * | *
+ * ::::::::: * * ::::::::: *
+ * : : * * : : *
+ * : Grace :<--------NVLink------->: Grace : *
+ * : SoC : * C2C * : SoC : *
+ * ::::::::: * * ::::::::: *
+ * | * * | *
+ * | * * | *
+ * &&&&&&&& * * &&&&&&&& *
+ * & CMEM & * * & CMEM & *
+ * &&&&&&&& * * &&&&&&&& *
+ * * * *
+ ******************* *******************
+
+ GMEM = GPU Memory (e.g. HBM)
+ CMEM = CPU Memory (e.g. LPDDR5X)
+
+ |
+ | Following table contains traffic coverage of Grace SoC PMU in socket-A:
+
+ ::
+
+ +-----------------+-----------+---------+----------+-------------+
+ | | Source |
+ + +-----------+---------+----------+-------------+
+ | Destination | | | Socket-B | Socket-B |
+ | | PCI R/W | CPU | CPU/PCIE1| PCIE2 |
+ | | | | | |
+ +=================+===========+=========+==========+=============+
+ | Local | PCIE PMU | SCF PMU | SCF PMU | NVLink-C2C0 |
+ | SYSRAM/CMEM | | | | PMU |
+ +-----------------+-----------+---------+----------+-------------+
+ | Remote | | | | |
+ | SYSRAM/CMEM | PCIE PMU | SCF PMU | N/A | N/A |
+ | over NVLink-C2C | | | | |
+ +-----------------+-----------+---------+----------+-------------+
+
+ PCIE1 traffic represents strongly ordered (SO) writes.
+ PCIE2 traffic represents reads and relaxed ordered (RO) writes.
diff --git a/Documentation/admin-guide/perf/starfive_starlink_pmu.rst b/Documentation/admin-guide/perf/starfive_starlink_pmu.rst
new file mode 100644
index 000000000000..2932ddb4eb76
--- /dev/null
+++ b/Documentation/admin-guide/perf/starfive_starlink_pmu.rst
@@ -0,0 +1,46 @@
+================================================
+StarFive StarLink Performance Monitor Unit (PMU)
+================================================
+
+StarFive StarLink Performance Monitor Unit (PMU) exists within the
+StarLink Coherent Network on Chip (CNoC) that connects multiple CPU
+clusters with an L3 memory system.
+
+The uncore PMU supports an overflow interrupt, up to 16 programmable 64-bit
+event counters, and an independent 64-bit cycle counter.
+The PMU can only be accessed via Memory Mapped I/O and is common to the
+cores connected to it.
+
+The driver exposes the supported PMU events in the sysfs "events" directory
+under::
+
+ /sys/bus/event_source/devices/starfive_starlink_pmu/events/
+
+The driver exposes the CPU used to handle PMU events in the sysfs "cpumask"
+file under::
+
+ /sys/bus/event_source/devices/starfive_starlink_pmu/cpumask/
+
+The driver describes the format of config (event ID) in the sysfs "format"
+directory under::
+
+ /sys/bus/event_source/devices/starfive_starlink_pmu/format/
+
+Example of perf usage::
+
+ $ perf list
+
+ starfive_starlink_pmu/cycles/ [Kernel PMU event]
+ starfive_starlink_pmu/read_hit/ [Kernel PMU event]
+ starfive_starlink_pmu/read_miss/ [Kernel PMU event]
+ starfive_starlink_pmu/read_request/ [Kernel PMU event]
+ starfive_starlink_pmu/release_request/ [Kernel PMU event]
+ starfive_starlink_pmu/write_hit/ [Kernel PMU event]
+ starfive_starlink_pmu/write_miss/ [Kernel PMU event]
+ starfive_starlink_pmu/write_request/ [Kernel PMU event]
+ starfive_starlink_pmu/writeback/ [Kernel PMU event]
+
+
+ $ perf stat -a -e /starfive_starlink_pmu/cycles/ sleep 1
+
+Sampling is not supported. As a result, "perf record" is not supported.
+Attaching to a task is not supported, only system-wide counting is supported.
diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst
new file mode 100644
index 000000000000..1e0d101b020a
--- /dev/null
+++ b/Documentation/admin-guide/pm/amd-pstate.rst
@@ -0,0 +1,775 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============================================
+``amd-pstate`` CPU Performance Scaling Driver
+===============================================
+
+:Copyright: |copy| 2021 Advanced Micro Devices, Inc.
+
+:Author: Huang Rui <ray.huang@amd.com>
+
+
+Introduction
+===================
+
+``amd-pstate`` is the AMD CPU performance scaling driver that introduces a
+new CPU frequency control mechanism for modern AMD APU and CPU series in the
+Linux kernel. The new mechanism is based on Collaborative Processor
+Performance Control (CPPC), which provides finer-grained frequency management
+than the legacy ACPI hardware P-States. Current AMD CPU/APU platforms use the
+ACPI P-states driver to manage CPU frequency and clocks, switching between
+only 3 P-states. CPPC replaces the ACPI P-states controls and provides a
+flexible, low-latency interface for the Linux kernel to directly communicate
+performance hints to the hardware.
+
+``amd-pstate`` leverages the Linux kernel governors such as ``schedutil``,
+``ondemand``, etc. to manage the performance hints which are provided by
+CPPC hardware functionality that internally follows the hardware
+specification (for details refer to AMD64 Architecture Programmer's Manual
+Volume 2: System Programming [1]_). Currently, ``amd-pstate`` supports basic
+frequency control functions according to kernel governors on some of the
+Zen2 and Zen3 processors, and we will implement more AMD-specific functions
+in the future after we verify them on the hardware and SBIOS.
+
+
+AMD CPPC Overview
+=======================
+
+The Collaborative Processor Performance Control (CPPC) interface enumerates a
+continuous, abstract, and unit-less performance value on a scale that is
+not tied to a specific performance state / frequency. This is an ACPI
+standard [2]_ with which software can specify application performance goals
+and hints as a relative target to the infrastructure limits. AMD processors
+provide a low-latency register model (MSR) instead of an AML code
+interpreter for performance adjustments. ``amd-pstate`` will initialize a
+``struct cpufreq_driver`` instance, ``amd_pstate_driver``, with the callbacks
+to manage each performance update behavior. ::
+
+ Highest Perf ------>+-----------------------+ +-----------------------+
+ | | | |
+ | | | |
+ | | Max Perf ---->| |
+ | | | |
+ | | | |
+ Nominal Perf ------>+-----------------------+ +-----------------------+
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | Desired Perf ---->| |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ Lowest non- | | | |
+ linear perf ------>+-----------------------+ +-----------------------+
+ | | | |
+ | | Lowest perf ---->| |
+ | | | |
+ Lowest perf ------>+-----------------------+ +-----------------------+
+ | | | |
+ | | | |
+ | | | |
+ 0 ------>+-----------------------+ +-----------------------+
+
+ AMD P-States Performance Scale
+
+
+.. _perf_cap:
+
+AMD CPPC Performance Capability
+--------------------------------
+
+Highest Performance (RO)
+.........................
+
+This is the absolute maximum performance an individual processor may reach,
+assuming ideal conditions. This performance level may not be sustainable
+for long durations and may only be achievable if other platform components
+are in a specific state; for example, it may require other processors to be in
+an idle state. This would be equivalent to the highest frequencies
+supported by the processor.
+
+Nominal (Guaranteed) Performance (RO)
+......................................
+
+This is the maximum sustained performance level of the processor, assuming
+ideal operating conditions. In the absence of an external constraint (power,
+thermal, etc.), this is the performance level the processor is expected to
+be able to maintain continuously. All cores/processors are expected to be
+able to sustain their nominal performance state simultaneously.
+
+Lowest non-linear Performance (RO)
+...................................
+
+This is the lowest performance level at which nonlinear power savings are
+achieved, for example, due to the combined effects of voltage and frequency
+scaling. Above this threshold, lower performance levels should be generally
+more energy efficient than higher performance levels. This register
+effectively conveys the most efficient performance level to ``amd-pstate``.
+
+Lowest Performance (RO)
+........................
+
+This is the absolute lowest performance level of the processor. Selecting a
+performance level lower than the lowest nonlinear performance level may
+cause an efficiency penalty but should reduce the instantaneous power
+consumption of the processor.
+
+AMD CPPC Performance Control
+------------------------------
+
+``amd-pstate`` passes performance goals through these registers. The
+register drives the behavior of the desired performance target.
+
+Minimum requested performance (RW)
+...................................
+
+``amd-pstate`` specifies the minimum allowed performance level.
+
+Maximum requested performance (RW)
+...................................
+
+``amd-pstate`` specifies a limit on the maximum performance that is expected
+to be supplied by the hardware.
+
+Desired performance target (RW)
+...................................
+
+``amd-pstate`` specifies a desired target in the CPPC performance scale as
+a relative number. This can be expressed as percentage of nominal
+performance (infrastructure max). Below the nominal sustained performance
+level, desired performance expresses the average performance level of the
+processor subject to hardware. Above the nominal performance level,
+the processor must provide at least nominal performance requested and go higher
+if current operating conditions allow.
+
+Energy Performance Preference (EPP) (RW)
+.........................................
+
+This attribute provides a hint to the hardware if software wants to bias
+toward performance (0x0) or energy efficiency (0xff).
+
+
+Key Governors Support
+=======================
+
+``amd-pstate`` can be used with all the (generic) scaling governors listed
+by the ``scaling_available_governors`` policy attribute in ``sysfs``. Then,
+it is responsible for the configuration of policy objects corresponding to
+CPUs and provides the ``CPUFreq`` core (and the scaling governors attached
+to the policy objects) with accurate information on the maximum and minimum
+operating frequencies supported by the hardware. Users can check the
+``scaling_cur_freq`` information, which comes from the ``CPUFreq`` core.
+
+``amd-pstate`` mainly supports ``schedutil`` and ``ondemand`` for dynamic
+frequency control. The processor configuration of ``amd-pstate`` is
+fine-tuned for ``schedutil`` with the CFS CPU scheduler. ``amd-pstate``
+registers the adjust_perf callback to implement performance update behavior
+similar to CPPC. It is initialized by ``sugov_start`` and then populates the
+CPU's update_util_data pointer to assign ``sugov_update_single_perf`` as the
+utilization update callback function in the CPU scheduler. The CPU scheduler
+will call ``cpufreq_update_util`` and assign the target performance according
+to the ``struct sugov_cpu`` that the utilization update belongs to. Then,
+``amd-pstate`` updates the desired performance according to the target
+assigned by the CPU scheduler.
+
+.. _processor_support:
+
+Processor Support
+=======================
+
+The ``amd-pstate`` initialization will fail if the ``_CPC`` entry in the ACPI
+SBIOS does not exist in the detected processor. It uses ``acpi_cpc_valid``
+to check the existence of ``_CPC``. All Zen based processors support the legacy
+ACPI hardware P-States function, so when ``amd-pstate`` fails initialization,
+the kernel will fall back to initialize the ``acpi-cpufreq`` driver.
+
+There are two types of hardware implementations for ``amd-pstate``: one is
+`Full MSR Support <perf_cap_>`_ and another is `Shared Memory Support
+<perf_cap_>`_. It can use the :c:macro:`X86_FEATURE_CPPC` feature flag to
+indicate the different types. (For details, refer to the Processor Programming
+Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors [3]_.)
+``amd-pstate`` is to register different ``static_call`` instances for different
+hardware implementations.
+
+Currently, some of the Zen2 and Zen3 processors support ``amd-pstate``. In the
+future, it will be supported on more and more AMD processors.
+
+Full MSR Support
+-----------------
+
+Some new Zen3 processors such as Cezanne provide the MSR registers directly
+while the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is set.
+``amd-pstate`` can handle the MSR register to implement the fast switch
+function in ``CPUFreq`` that can reduce the latency of frequency control in
+interrupt context. The functions with a ``pstate_xxx`` prefix represent the
+operations on MSR registers.
+
+Shared Memory Support
+----------------------
+
+If the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is not set, the
+processor supports the shared memory solution. In this case, ``amd-pstate``
+uses the ``cppc_acpi`` helper methods to implement the callback functions
+that are defined on ``static_call``. The functions with the ``cppc_xxx`` prefix
+represent the operations of ACPI CPPC helpers for the shared memory solution.
+
+
+AMD P-States and ACPI hardware P-States can always both be supported in one
+processor. However, AMD P-States has the higher priority; if it is enabled
+with :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, the processor
+will respond to the requests from AMD P-States.
+
+
+User Space Interface in ``sysfs`` - Per-policy control
+======================================================
+
+``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
+control its functionality at the system level. They are located in the
+``/sys/devices/system/cpu/cpufreq/policyX/`` directory and affect all CPUs. ::
+
+ root@hr-test1:/home/ray# ls /sys/devices/system/cpu/cpufreq/policy0/*amd*
+ /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_highest_perf
+ /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_lowest_nonlinear_freq
+ /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_max_freq
+
+
+``amd_pstate_highest_perf / amd_pstate_max_freq``
+
+Maximum CPPC performance and CPU frequency that the driver is allowed to
+set, in percent of the maximum supported CPPC performance level (the highest
+performance supported in `AMD CPPC Performance Capability <perf_cap_>`_).
+In some ASICs, the highest CPPC performance is not the one in the ``_CPC``
+table, so we need to expose it to sysfs. If boost is not active, but
+still supported, this maximum frequency will be larger than the one in
+``cpuinfo``.
+This attribute is read-only.
+
+``amd_pstate_lowest_nonlinear_freq``
+
+The lowest non-linear CPPC CPU frequency that the driver is allowed to set,
+in percent of the maximum supported CPPC performance level. (Please see the
+lowest non-linear performance in `AMD CPPC Performance Capability
+<perf_cap_>`_.)
+This attribute is read-only.
+
+``energy_performance_available_preferences``
+
+A list of all the supported EPP preferences that could be used for
+``energy_performance_preference`` on this system.
+These profiles represent different hints that are provided
+to the low-level firmware about the user's desired energy vs performance
+trade-off. ``default`` represents the EPP value set by the platform
+firmware. This attribute is read-only.
+
+``energy_performance_preference``
+
+The current energy performance preference can be read from this attribute,
+and the user can change the current preference according to energy or
+performance needs. Please get the list of all supported profiles from the
+``energy_performance_available_preferences`` attribute. The profiles are
+integer values defined between 0 and 255 when the EPP feature is enabled by
+the platform firmware; if the EPP feature is disabled, the driver will ignore
+the written value. This attribute is read-write.
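+
+A hypothetical shell session (``policy0`` is illustrative; write one of the
+values reported by ``energy_performance_available_preferences``)::
+
+  $ cat /sys/devices/system/cpu/cpufreq/policy0/energy_performance_available_preferences
+  $ echo <one of the listed preferences> > /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference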
+
+Other performance and frequency values can be read back from
+``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`.
+
+
+``amd-pstate`` vs ``acpi-cpufreq``
+======================================
+
+On the majority of AMD platforms supported by ``acpi-cpufreq``, the ACPI
+tables provided by the platform firmware are used for CPU performance
+scaling, but they only provide 3 P-states on AMD processors.
+However, on modern AMD APU and CPU series, the hardware provides
+Collaborative Processor Performance Control according to the ACPI protocol,
+customized for AMD platforms, that is, fine-grained and continuous frequency
+ranges instead of the legacy hardware P-states. ``amd-pstate`` is the kernel
+module which supports the new AMD P-States mechanism on most future AMD
+platforms. The AMD P-States mechanism is the more performant and
+energy-efficient frequency management method on AMD processors.
+
+
+``amd-pstate`` Driver Operation Modes
+======================================
+
+``amd_pstate`` CPPC has 3 operation modes: autonomous (active) mode,
+non-autonomous (passive) mode and guided autonomous (guided) mode.
+Active/passive/guided mode can be chosen by different kernel parameters.
+
+- In autonomous mode, platform ignores the desired performance level request
+ and takes into account only the values set to the minimum, maximum and energy
+ performance preference registers.
+- In non-autonomous mode, platform gets desired performance level
+ from OS directly through Desired Performance Register.
+- In guided-autonomous mode, platform sets operating performance level
+ autonomously according to the current workload and within the limits set by
+ OS through min and max performance registers.
+
+Active Mode
+------------
+
+``amd_pstate=active``
+
+This is the low-level firmware control mode which is implemented by the
+``amd_pstate_epp`` driver, with ``amd_pstate=active`` passed to the kernel on
+the command line.
+In this mode, the ``amd_pstate_epp`` driver provides a hint to the CPPC
+firmware if software wants to bias toward performance (0x0) or energy
+efficiency (0xff). The CPPC power algorithm will then calculate the runtime
+workload and adjust the realtime core frequency according to the power
+supply, thermal, core voltage and some other hardware conditions.
+
+Passive Mode
+------------
+
+``amd_pstate=passive``
+
+This mode is enabled if ``amd_pstate=passive`` is passed to the kernel on the
+command line.
+In this mode, the ``amd_pstate`` driver software specifies a desired QoS
+target in the CPPC performance scale as a relative number. This can be
+expressed as a percentage of nominal performance (infrastructure max). Below
+the nominal sustained performance level, desired performance expresses the
+average performance level of the processor subject to the Performance
+Reduction Tolerance register. Above the nominal performance level, the
+processor must provide at least the nominal performance requested and go
+higher if current operating conditions allow.
+
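+In this mode the P-state selection is driven by a generic ``CPUFreq`` scaling
+governor. As a sketch, the governor attached to a policy can be inspected and
+changed via ``sysfs`` (governor availability depends on the kernel
+configuration) ::
+
+  # cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
+  schedutil
+  # echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
+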
+Guided Mode
+-----------
+
+``amd_pstate=guided``
+
+This mode is activated by passing ``amd_pstate=guided`` on the kernel command
+line. In this mode, the driver requests minimum and maximum performance
+levels and the platform autonomously selects a performance level in this range
+that is appropriate for the current workload.
+
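+As a sketch, the range available to the platform can be narrowed through the
+standard ``CPUFreq`` policy attributes (values in kHz; they are clamped to the
+hardware limits) ::
+
+  # echo 400000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
+  # echo 3300000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
+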
+``amd-pstate`` Preferred Core
+=================================
+
+The core frequency is subject to process variation in semiconductors.
+Not all cores are able to reach the maximum frequency while respecting the
+infrastructure limits. Consequently, AMD has redefined the concept of the
+maximum frequency of a part: only a fraction of the cores can reach the
+maximum frequency. To find the best process scheduling policy for a given
+scenario, the OS needs to know the core ordering, which the platform reports
+through the highest performance capability register of the CPPC interface.
+
+``amd-pstate`` preferred core enables the scheduler to prefer scheduling on
+cores that can achieve a higher frequency with lower voltage. The preferred
+core rankings can dynamically change based on the workload, platform conditions,
+thermals and ageing.
+
+The priority metric will be initialized by the ``amd-pstate`` driver. The ``amd-pstate``
+driver will also determine whether or not ``amd-pstate`` preferred core is
+supported by the platform.
+
+The ``amd-pstate`` driver provides an initial core ordering when the system boots.
+The platform uses the CPPC interfaces to communicate the core ranking to the
+operating system and scheduler so that the OS schedules processes on the
+highest-performance cores first. When the ``amd-pstate`` driver receives a
+message that the highest performance has changed, it updates the core ranking
+and sets the CPU's priority accordingly.
+
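+As a sketch, the per-core ranking reported by the platform can be inspected
+through the CPPC ``sysfs`` interface (see :ref:`cppc_sysfs`; the values below
+are hypothetical and differ per core on preferred core systems) ::
+
+  # grep . /sys/devices/system/cpu/cpu*/acpi_cppc/highest_perf
+  /sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf:196
+  /sys/devices/system/cpu/cpu1/acpi_cppc/highest_perf:191
+  /sys/devices/system/cpu/cpu2/acpi_cppc/highest_perf:186
+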
+``amd-pstate`` Preferred Core Switch
+=====================================
+Kernel Parameters
+-----------------
+
+``amd-pstate`` preferred core has two states: enabled and disabled.
+The state can be chosen with a kernel parameter.
+``amd-pstate`` preferred core is enabled by default.
+
+``amd_prefcore=disable``
+
+For systems that support ``amd-pstate`` preferred core, the core rankings will
+always be advertised by the platform. But the OS can choose to ignore them via the
+kernel parameter ``amd_prefcore=disable``.
+
+User Space Interface in ``sysfs`` - General
+===========================================
+
+Global Attributes
+-----------------
+
+``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
+control its functionality at the system level. They are located in the
+``/sys/devices/system/cpu/amd_pstate/`` directory and affect all CPUs.
+
+``status``
+ Operation mode of the driver: "active", "passive", "guided" or "disable".
+
+ "active"
+ The driver is functional and in the ``active mode``
+
+ "passive"
+ The driver is functional and in the ``passive mode``
+
+ "guided"
+ The driver is functional and in the ``guided mode``
+
+ "disable"
+ The driver is unregistered and not functional.
+
+ This attribute can be written to in order to change the driver's
+ operation mode or to unregister it. The string written to it must be
+ one of the possible values listed above and, if the write is successful,
+ the driver will switch over to the operation mode represented by that
+ string - or be unregistered in the "disable" case (see the example
+ after this list).
+
+``prefcore``
+ Preferred core state of the driver: "enabled" or "disabled".
+
+ "enabled"
+ Enable the ``amd-pstate`` preferred core.
+
+ "disabled"
+ Disable the ``amd-pstate`` preferred core.
+
+ This attribute is read-only; it reports the preferred core state selected
+ by the kernel parameter.
+
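+For example, the current operation mode and preferred core state can be read,
+and the operation mode changed, as follows (a sketch; switching may be rejected
+in configurations that do not support the target mode) ::
+
+  # cat /sys/devices/system/cpu/amd_pstate/status
+  active
+  # echo passive > /sys/devices/system/cpu/amd_pstate/status
+  # cat /sys/devices/system/cpu/amd_pstate/status
+  passive
+  # cat /sys/devices/system/cpu/amd_pstate/prefcore
+  enabled
+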
+``cpupower`` tool support for ``amd-pstate``
+===============================================
+
+``amd-pstate`` is supported by the ``cpupower`` tool, which can be used to dump
+frequency information. Development is in progress to support more
+operations for the new ``amd-pstate`` module with this tool. ::
+
+ root@hr-test1:/home/ray# cpupower frequency-info
+ analyzing CPU 0:
+ driver: amd-pstate
+ CPUs which run at the same hardware frequency: 0
+ CPUs which need to have their frequency coordinated by software: 0
+ maximum transition latency: 131 us
+ hardware limits: 400 MHz - 4.68 GHz
+ available cpufreq governors: ondemand conservative powersave userspace performance schedutil
+ current policy: frequency should be within 400 MHz and 4.68 GHz.
+ The governor "schedutil" may decide which speed to use
+ within this range.
+ current CPU frequency: Unable to call hardware
+ current CPU frequency: 4.02 GHz (asserted by call to kernel)
+ boost state support:
+ Supported: yes
+ Active: yes
+ AMD PSTATE Highest Performance: 166. Maximum Frequency: 4.68 GHz.
+ AMD PSTATE Nominal Performance: 117. Nominal Frequency: 3.30 GHz.
+ AMD PSTATE Lowest Non-linear Performance: 39. Lowest Non-linear Frequency: 1.10 GHz.
+ AMD PSTATE Lowest Performance: 15. Lowest Frequency: 400 MHz.
+
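+``cpupower`` can also be used to adjust the ``CPUFreq`` policy. As a sketch,
+the scaling governor can be set on all CPUs as follows (assuming the driver is
+in the passive or guided mode and the governor is available in the kernel
+configuration) ::
+
+  # cpupower -c all frequency-set -g schedutil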
+
+Diagnostics and Tuning
+=======================
+
+Trace Events
+--------------
+
+There are two static trace events that can be used for ``amd-pstate``
+diagnostics. One of them is the ``cpu_frequency`` trace event generally used
+by ``CPUFreq``, and the other one is the ``amd_pstate_perf`` trace event
+specific to ``amd-pstate``. The following sequence of shell commands can
+be used to enable them and see their output (if the kernel is
+configured to support event tracing). ::
+
+ root@hr-test1:/home/ray# cd /sys/kernel/tracing/
+ root@hr-test1:/sys/kernel/tracing# echo 1 > events/amd_cpu/enable
+ root@hr-test1:/sys/kernel/tracing# cat trace
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 47827/42233061 #P:2
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / delay
+ # TASK-PID CPU# |||| TIMESTAMP FUNCTION
+ # | | | |||| | |
+ <idle>-0 [015] dN... 4995.979886: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=15 changed=false fast_switch=true
+ <idle>-0 [007] d.h.. 4995.979893: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
+ cat-2161 [000] d.... 4995.980841: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=0 changed=false fast_switch=true
+ sshd-2125 [004] d.s.. 4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=4 changed=false fast_switch=true
+ <idle>-0 [007] d.s.. 4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
+ <idle>-0 [003] d.s.. 4995.980971: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=3 changed=false fast_switch=true
+ <idle>-0 [011] d.s.. 4995.980996: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=11 changed=false fast_switch=true
+
+The ``cpu_frequency`` trace event will be triggered either by the ``schedutil`` scaling
+governor (for the policies it is attached to), or by the ``CPUFreq`` core (for the
+policies with other scaling governors).
+
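+For example, the ``cpu_frequency`` trace event can be enabled from the same
+tracing directory (a sketch) ::
+
+  root@hr-test1:/sys/kernel/tracing# echo 1 > events/power/cpu_frequency/enable
+  root@hr-test1:/sys/kernel/tracing# grep cpu_frequency trace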
+
+Tracer Tool
+-------------
+
+``amd_pstate_tracer.py`` can record and parse the ``amd-pstate`` trace log, then
+generate performance plots. This utility can be used to debug and tune the
+performance of the ``amd-pstate`` driver. The tracer tool imports the
+``intel_pstate`` tracer.
+
+The tracer tool is located in ``linux/tools/power/x86/amd_pstate_tracer``. It can
+be used in two ways. If a trace file is available, parse it directly with the
+command ::
+
+ ./amd_pstate_trace.py [-c cpus] -t <trace_file> -n <test_name>
+
+Or generate trace file with root privilege, then parse and plot with command ::
+
+ sudo ./amd_pstate_trace.py [-c cpus] -n <test_name> -i <interval> [-m kbytes]
+
+The test results can be found in ``results/test_name``. The following is an
+example of part of the output. ::
+
+ common_cpu common_secs common_usecs min_perf des_perf max_perf freq mperf apef tsc load duration_ms sample_num elapsed_time common_comm
+ CPU_005 712 116384 39 49 166 0.7565 9645075 2214891 38431470 25.1 11.646 469 2.496 kworker/5:0-40
+ CPU_006 712 116408 39 49 166 0.6769 8950227 1839034 37192089 24.06 11.272 470 2.496 kworker/6:0-1264
+
+Unit Tests for amd-pstate
+-------------------------
+
+``amd-pstate-ut`` is a test module for testing the ``amd-pstate`` driver.
+
+ * It can help all users verify that their processor is supported (SBIOS/firmware or hardware).
+
+ * It gives the kernel a basic functional test to catch regressions during updates.
+
+ * More functional or performance tests can be introduced and their results compared, which benefits power and performance scaling optimization.
+
+1. Test case descriptions
+
+ 1). Basic tests
+
+ Test prerequisite and basic functions for the ``amd-pstate`` driver.
+
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | Index | Functions | Description |
+ +=========+================================+====================================================================================+
+ | 1 | amd_pstate_ut_acpi_cpc_valid || Check whether the _CPC object is present in SBIOS. |
+ | | || |
+ | | || For details, refer to `Processor Support <processor_support_>`_. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | 2 | amd_pstate_ut_check_enabled || Check whether AMD P-State is enabled. |
+ | | || |
+ | | || AMD P-States and ACPI hardware P-States can both be supported by one processor, |
+ | | | but AMD P-States has the higher priority; if it is enabled with |
+ | | | :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, the processor will respond |
+ | | | to requests from AMD P-States. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | 3 | amd_pstate_ut_check_perf || Check whether each performance value is reasonable, i.e. |
+ | | || highest_perf >= nominal_perf > lowest_nonlinear_perf > lowest_perf > 0. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | 4 | amd_pstate_ut_check_freq || Check whether each frequency value, and the maximum frequency when boost |
+ | | | mode is supported, are reasonable, i.e. |
+ | | || max_freq >= nominal_freq > lowest_nonlinear_freq > min_freq > 0 |
+ | | || If boost is not active but supported, this maximum frequency will be larger than |
+ | | | the one in ``cpuinfo``. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+
+ 2). Tbench test
+
+ Test and monitor the CPU changes when running the tbench benchmark under the specified governor.
+ These changes include desired performance, frequency, load, performance, energy, etc.
+ The specified governor is ondemand or schedutil.
+ Tbench can also be tested on the ``acpi-cpufreq`` kernel driver for comparison.
+
+ 3). Gitsource test
+
+ Test and monitor the CPU changes when running the gitsource benchmark under the specified governor.
+ These changes include desired performance, frequency, load, time, energy, etc.
+ The specified governor is ondemand or schedutil.
+ Gitsource can also be tested on the ``acpi-cpufreq`` kernel driver for comparison.
+
+#. How to execute the tests
+
+ We use a test module in the kselftest framework to implement it.
+ We create the ``amd-pstate-ut`` module and tie it into kselftest (for
+ details refer to Linux Kernel Selftests [4]_).
+
+ 1). Build
+
+ + enable the :c:macro:`CONFIG_X86_AMD_PSTATE` configuration option.
+ + set the :c:macro:`CONFIG_X86_AMD_PSTATE_UT` configuration option to M.
+ + build the kernel (``make``)
+ + build the selftests ::
+
+ $ cd linux
+ $ make -C tools/testing/selftests
+
+ + make perf ::
+
+ $ cd tools/perf/
+ $ make
+
+
+ 2). Installation & Steps ::
+
+ $ make -C tools/testing/selftests install INSTALL_PATH=~/kselftest
+ $ cp tools/perf/perf /usr/bin/perf
+ $ sudo ./kselftest/run_kselftest.sh -c amd-pstate
+
+ 3). Specified test case ::
+
+ $ cd ~/kselftest/amd-pstate
+ $ sudo ./run.sh -t basic
+ $ sudo ./run.sh -t tbench
+ $ sudo ./run.sh -t tbench -m acpi-cpufreq
+ $ sudo ./run.sh -t gitsource
+ $ sudo ./run.sh -t gitsource -m acpi-cpufreq
+ $ ./run.sh --help
+ ./run.sh: illegal option -- -
+ Usage: ./run.sh [OPTION...]
+ [-h <help>]
+ [-o <output-file-for-dump>]
+ [-c <all: All testing,
+ basic: Basic testing,
+ tbench: Tbench testing,
+ gitsource: Gitsource testing.>]
+ [-t <tbench time limit>]
+ [-p <tbench process number>]
+ [-l <loop times for tbench>]
+ [-i <amd tracer interval>]
+ [-m <comparative test: acpi-cpufreq>]
+
+
+ 4). Results
+
+ + basic
+
+ When the test finishes, you will get the following log info ::
+
+ $ dmesg | grep "amd_pstate_ut" | tee log.txt
+ [12977.570663] amd_pstate_ut: 1 amd_pstate_ut_acpi_cpc_valid success!
+ [12977.570673] amd_pstate_ut: 2 amd_pstate_ut_check_enabled success!
+ [12977.571207] amd_pstate_ut: 3 amd_pstate_ut_check_perf success!
+ [12977.571212] amd_pstate_ut: 4 amd_pstate_ut_check_freq success!
+
+ + tbench
+
+ When the test finishes, you will get selftest.tbench.csv and PNG images.
+ The selftest.tbench.csv file contains the raw data and the comparison results (in percent) of the comparative test.
+ The PNG images show the performance, energy and performance per watt of each test.
+ Open selftest.tbench.csv :
+
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + Governor | Round | Des-perf | Freq | Load | Performance | Energy | Performance Per Watt |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + Unit | | | GHz | | MB/s | J | MB/J |
+ +=================================================+==============+==========+=========+==========+=============+=========+======================+
+ + amd-pstate-ondemand | 1 | | | | 2504.05 | 1563.67 | 158.5378 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | 2 | | | | 2243.64 | 1430.32 | 155.2941 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | 3 | | | | 2183.88 | 1401.32 | 154.2860 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | Average | | | | 2310.52 | 1465.1 | 156.1268 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 1 | 165.329 | 1.62257 | 99.798 | 2136.54 | 1395.26 | 151.5971 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 2 | 166 | 1.49761 | 99.9993 | 2100.56 | 1380.5 | 150.6377 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 3 | 166 | 1.47806 | 99.9993 | 2084.12 | 1375.76 | 149.9737 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | Average | 165.776 | 1.53275 | 99.9322 | 2107.07 | 1383.84 | 150.7399 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 1 | | | | 2529.9 | 1564.4 | 160.0997 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 2 | | | | 2249.76 | 1432.97 | 155.4297 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 3 | | | | 2181.46 | 1406.88 | 153.5060 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | Average | | | | 2320.37 | 1468.08 | 156.4741 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 1 | | | | 2137.64 | 1385.24 | 152.7723 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 2 | | | | 2107.05 | 1372.23 | 152.0138 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 3 | | | | 2085.86 | 1365.35 | 151.2433 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | Average | | | | 2110.18 | 1374.27 | 152.0136 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand VS acpi-cpufreq-schedutil | Comprison(%) | | | | -9.0584 | -6.3899 | -2.8506 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand VS amd-pstate-schedutil | Comprison(%) | | | | 8.8053 | -5.5463 | -3.4503 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand VS amd-pstate-ondemand | Comprison(%) | | | | -0.4245 | -0.2029 | -0.2219 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil VS amd-pstate-schedutil | Comprison(%) | | | | -0.1473 | 0.6963 | -0.8378 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+
+ + gitsource
+
+ When the test finishes, you will get selftest.gitsource.csv and PNG images.
+ The selftest.gitsource.csv file contains the raw data and the comparison results (in percent) of the comparative test.
+ The PNG images show the performance, energy and performance per watt of each test.
+ Open selftest.gitsource.csv :
+
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + Governor | Round | Des-perf | Freq | Load | Time | Energy | Performance Per Watt |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + Unit | | | GHz | | s | J | 1/J |
+ +=================================================+==============+==========+==========+==========+=============+=========+======================+
+ + amd-pstate-ondemand | 1 | 50.119 | 2.10509 | 23.3076 | 475.69 | 865.78 | 0.001155027 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | 2 | 94.8006 | 1.98771 | 56.6533 | 467.1 | 839.67 | 0.001190944 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | 3 | 76.6091 | 2.53251 | 43.7791 | 467.69 | 855.85 | 0.001168429 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | Average | 73.8429 | 2.20844 | 41.2467 | 470.16 | 853.767 | 0.001171279 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 1 | 165.919 | 1.62319 | 98.3868 | 464.17 | 866.8 | 0.001153668 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 2 | 165.97 | 1.31309 | 99.5712 | 480.15 | 880.4 | 0.001135847 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 3 | 165.973 | 1.28448 | 99.9252 | 481.79 | 867.02 | 0.001153375 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | Average | 165.954 | 1.40692 | 99.2944 | 475.37 | 871.407 | 0.001147569 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 1 | | | | 2379.62 | 742.96 | 0.001345967 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 2 | | | | 441.74 | 817.49 | 0.001223256 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 3 | | | | 455.48 | 820.01 | 0.001219497 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | Average | | | | 425.613 | 793.487 | 0.001260260 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 1 | | | | 459.69 | 838.54 | 0.001192548 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 2 | | | | 466.55 | 830.89 | 0.001203528 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 3 | | | | 470.38 | 837.32 | 0.001194286 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | Average | | | | 465.54 | 835.583 | 0.001196769 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand VS acpi-cpufreq-schedutil | Comprison(%) | | | | 9.3810 | 5.3051 | -5.0379 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand VS amd-pstate-schedutil | Comprison(%) | 124.7392 | -36.2934 | 140.7329 | 1.1081 | 2.0661 | -2.0242 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand VS amd-pstate-ondemand | Comprison(%) | | | | 10.4665 | 7.5968 | -7.0605 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil VS amd-pstate-schedutil | Comprison(%) | | | | 2.1115 | 4.2873 | -4.1110 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+
+Reference
+===========
+
+.. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming,
+ https://www.amd.com/system/files/TechDocs/24593.pdf
+
+.. [2] Advanced Configuration and Power Interface Specification,
+ https://uefi.org/sites/default/files/resources/ACPI_Spec_6_4_Jan22.pdf
+
+.. [3] Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors
+ https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip
+
+.. [4] Linux Kernel Selftests,
+ https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index 0c74a7784964..6adb7988e0eb 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -1,7 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
-.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`
=======================
@@ -92,16 +91,16 @@ control the P-state of multiple CPUs at the same time and writing to it affects
all of those CPUs simultaneously.
Sets of CPUs sharing hardware P-state control interfaces are represented by
-``CPUFreq`` as |struct cpufreq_policy| objects. For consistency,
-|struct cpufreq_policy| is also used when there is only one CPU in the given
+``CPUFreq`` as struct cpufreq_policy objects. For consistency,
+struct cpufreq_policy is also used when there is only one CPU in the given
set.
-The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for
+The ``CPUFreq`` core maintains a pointer to a struct cpufreq_policy object for
every CPU in the system, including CPUs that are currently offline. If multiple
CPUs share the same hardware P-state control interface, all of the pointers
-corresponding to them point to the same |struct cpufreq_policy| object.
+corresponding to them point to the same struct cpufreq_policy object.
-``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design
+``CPUFreq`` uses struct cpufreq_policy as its basic data type and the design
of its user space interface is based on the policy concept.
@@ -147,9 +146,9 @@ CPUs in it.
The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
-determined by the kernel configuration, but it may be changed later
-via ``sysfs``). First, a pointer to the new policy object is passed to the
-governor's ``->init()`` callback which is expected to initialize all of the
+determined by the kernel command line or configuration, but it may be changed
+later via ``sysfs``). First, a pointer to the new policy object is passed to
+the governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it. Next, the governor is started by
invoking its ``->start()`` callback.
diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst
index a96a423e3779..19754beb5a4e 100644
--- a/Documentation/admin-guide/pm/cpuidle.rst
+++ b/Documentation/admin-guide/pm/cpuidle.rst
@@ -347,81 +347,8 @@ for tickless systems. It follows the same basic strategy as the ``menu`` `one
<menu-gov_>`_: it always tries to find the deepest idle state suitable for the
given conditions. However, it applies a different approach to that problem.
-First, it does not use sleep length correction factors, but instead it attempts
-to correlate the observed idle duration values with the available idle states
-and use that information to pick up the idle state that is most likely to
-"match" the upcoming CPU idle interval. Second, it does not take the tasks
-that were running on the given CPU in the past and are waiting on some I/O
-operations to complete now at all (there is no guarantee that they will run on
-the same CPU when they become runnable again) and the pattern detection code in
-it avoids taking timer wakeups into account. It also only uses idle duration
-values less than the current time till the closest timer (with the scheduler
-tick excluded) for that purpose.
-
-Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain
-the *sleep length*, which is the time until the closest timer event with the
-assumption that the scheduler tick will be stopped (that also is the upper bound
-on the time until the next CPU wakeup). That value is then used to preselect an
-idle state on the basis of three metrics maintained for each idle state provided
-by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``.
-
-The ``hits`` and ``misses`` metrics measure the likelihood that a given idle
-state will "match" the observed (post-wakeup) idle duration if it "matches" the
-sleep length. They both are subject to decay (after a CPU wakeup) every time
-the target residency of the idle state corresponding to them is less than or
-equal to the sleep length and the target residency of the next idle state is
-greater than the sleep length (that is, when the idle state corresponding to
-them "matches" the sleep length). The ``hits`` metric is increased if the
-former condition is satisfied and the target residency of the given idle state
-is less than or equal to the observed idle duration and the target residency of
-the next idle state is greater than the observed idle duration at the same time
-(that is, it is increased when the given idle state "matches" both the sleep
-length and the observed idle duration). In turn, the ``misses`` metric is
-increased when the given idle state "matches" the sleep length only and the
-observed idle duration is too short for its target residency.
-
-The ``early_hits`` metric measures the likelihood that a given idle state will
-"match" the observed (post-wakeup) idle duration if it does not "match" the
-sleep length. It is subject to decay on every CPU wakeup and it is increased
-when the idle state corresponding to it "matches" the observed (post-wakeup)
-idle duration and the target residency of the next idle state is less than or
-equal to the sleep length (i.e. the idle state "matching" the sleep length is
-deeper than the given one).
-
-The governor walks the list of idle states provided by the ``CPUIdle`` driver
-and finds the last (deepest) one with the target residency less than or equal
-to the sleep length. Then, the ``hits`` and ``misses`` metrics of that idle
-state are compared with each other and it is preselected if the ``hits`` one is
-greater (which means that that idle state is likely to "match" the observed idle
-duration after CPU wakeup). If the ``misses`` one is greater, the governor
-preselects the shallower idle state with the maximum ``early_hits`` metric
-(or if there are multiple shallower idle states with equal ``early_hits``
-metric which also is the maximum, the shallowest of them will be preselected).
-[If there is a wakeup latency constraint coming from the `PM QoS framework
-<cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the
-target residency within the sleep length, the deepest idle state with the exit
-latency within the constraint is preselected without consulting the ``hits``,
-``misses`` and ``early_hits`` metrics.]
-
-Next, the governor takes several idle duration values observed most recently
-into consideration and if at least a half of them are greater than or equal to
-the target residency of the preselected idle state, that idle state becomes the
-final candidate to ask for. Otherwise, the average of the most recent idle
-duration values below the target residency of the preselected idle state is
-computed and the governor walks the idle states shallower than the preselected
-one and finds the deepest of them with the target residency within that average.
-That idle state is then taken as the final candidate to ask for.
-
-Still, at this point the governor may need to refine the idle state selection if
-it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
-generally happens if the target residency of the idle state selected so far is
-less than the tick period and the tick has not been stopped already (in a
-previous iteration of the idle loop). Then, like in the ``menu`` governor
-`case <menu-gov_>`_, the sleep length used in the previous computations may not
-reflect the real time until the closest timer event and if it really is greater
-than that time, a shallower state with a suitable target residency may need to
-be selected.
-
+.. kernel-doc:: drivers/cpuidle/governors/teo.c
+ :doc: teo-description
.. _idle-states-representation:
@@ -478,7 +405,7 @@ order to ask the hardware to enter that state. Also, for each
statistics of the given idle state. That information is exposed by the kernel
via ``sysfs``.
-For each CPU in the system, there is a :file:`/sys/devices/system/cpu<N>/cpuidle/`
+For each CPU in the system, there is a :file:`/sys/devices/system/cpu/cpu<N>/cpuidle/`
directory in ``sysfs``, where the number ``<N>`` is assigned to the given
CPU at the initialization time. That directory contains a set of subdirectories
called :file:`state0`, :file:`state1` and so on, up to the number of idle state
@@ -494,7 +421,7 @@ object corresponding to it, as follows:
residency.
``below``
- Total number of times this idle state had been asked for, but cerainly
+ Total number of times this idle state had been asked for, but certainly
a deeper idle state would have been a better match for the observed idle
duration.
@@ -528,6 +455,10 @@ object corresponding to it, as follows:
Total number of times the hardware has been asked by the given CPU to
enter this idle state.
+``rejected``
+ Total number of times a request to enter this idle state on the given
+ CPU was rejected.
+
The :file:`desc` and :file:`name` files both contain strings. The difference
between them is that the name is expected to be more concise, while the
description may be longer and it may contain white space or special characters.
@@ -572,6 +503,11 @@ particular case. For these reasons, the only reliable way to find out how
much time has been spent by the hardware in different idle states supported by
it is to use idle state residency counters in the hardware, if available.
+Generally, an interrupt received when trying to enter an idle state causes the
+idle state entry request to be rejected, in which case the ``CPUIdle`` driver
+may return an error code to indicate that this was the case. The :file:`usage`
+and :file:`rejected` files report the number of times the given idle state
+was entered successfully or rejected, respectively.
.. _cpu-pm-qos:
@@ -676,8 +612,8 @@ the ``menu`` governor to be used on the systems that use the ``ladder`` governor
by default this way, for example.
The other kernel command line parameters controlling CPU idle time management
-described below are only relevant for the *x86* architecture and some of
-them affect Intel processors only.
+described below are only relevant for the *x86* architecture and references
+to ``intel_idle`` affect Intel processors only.
The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
@@ -690,7 +626,7 @@ which of the two parameters is added to the kernel command line. In the
instruction of the CPUs (which, as a rule, suspends the execution of the program
and causes the hardware to attempt to enter the shallowest available idle state)
for this purpose, and if ``idle=poll`` is used, idle CPUs will execute a
-more or less ``lightweight'' sequence of instructions in a tight loop. [Note
+more or less "lightweight" sequence of instructions in a tight loop. [Note
that using ``idle=poll`` is somewhat drastic in many cases, as preventing idle
CPUs from saving almost any energy at all may not be the only effect of it.
For example, on Intel hardware it effectively prevents CPUs from using
@@ -699,10 +635,13 @@ idle, so it very well may hurt single-thread computations performance as well as
energy-efficiency. Thus using it for performance reasons may not be a good idea
at all.]
-The ``idle=nomwait`` option disables the ``intel_idle`` driver and causes
-``acpi_idle`` to be used (as long as all of the information needed by it is
-there in the system's ACPI tables), but it is not allowed to use the
-``MWAIT`` instruction of the CPUs to ask the hardware to enter idle states.
+The ``idle=nomwait`` option prevents the use of ``MWAIT`` instruction of
+the CPU to enter idle states. When this option is used, the ``acpi_idle``
+driver will use the ``HLT`` instruction instead of ``MWAIT``. On systems
+running Intel processors, this option disables the ``intel_idle`` driver
+and forces the use of the ``acpi_idle`` driver instead. Note that in either
+case, ``acpi_idle`` driver will function only if all the information needed
+by it is in the system's ACPI tables.
In addition to the architecture-level kernel command line options affecting CPU
idle time management, there are parameters affecting individual ``CPUIdle``
diff --git a/Documentation/admin-guide/pm/intel-speed-select.rst b/Documentation/admin-guide/pm/intel-speed-select.rst
index b2ca601c21c6..a2bfb971654f 100644
--- a/Documentation/admin-guide/pm/intel-speed-select.rst
+++ b/Documentation/admin-guide/pm/intel-speed-select.rst
@@ -57,7 +57,7 @@ To get help on a command, another level of help is provided. For example for the
Summary of platform capability
------------------------------
-To check the current platform and driver capaibilities, execute::
+To check the current platform and driver capabilities, execute::
#intel-speed-select --info
@@ -114,7 +114,7 @@ base performance profile (which is performance level 0).
Lock/Unlock status
~~~~~~~~~~~~~~~~~~
-Even if there are multiple performance profiles, it is possible that that they
+Even if there are multiple performance profiles, it is possible that they
are locked. If they are locked, users cannot issue a command to change the
performance state. It is possible that there is a BIOS setup to unlock or check
with your system vendor.
@@ -262,6 +262,28 @@ Which shows that the base frequency now increased from 2600 MHz at performance
level 0 to 2800 MHz at performance level 4. As a result, any workload, which can
use fewer CPUs, can see a boost of 200 MHz compared to performance level 0.
+Changing performance level via BMC Interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is possible to change the SST-PP level using an out of band (OOB) agent (via
+some remote management console, through the BMC "Baseboard Management Controller"
+interface). This mode is supported from the Sapphire Rapids processor
+generation onward. The kernel and tool changes to support this mode were added in
+Linux kernel version 5.18. To enable this feature, the kernel config
+"CONFIG_INTEL_HFI_THERMAL" is required. The minimum version of the tool
+that supports this feature is "v1.12", which is part of Linux kernel version 5.18.
+
+To support such configuration, this tool can be used as a daemon. Add
+a command line option --oob::
+
+ # intel-speed-select --oob
+ Intel(R) Speed Select Technology
+ Executing on CPU model:143[0x8f]
+ OOB mode is enabled and will run as daemon
+
+In this mode the tool will online/offline CPUs based on the new performance
+level.
+
Check presence of other Intel(R) SST features
---------------------------------------------
@@ -658,7 +680,7 @@ If -a option is not used, then the following steps are required before enabling
Intel(R) SST-BF:
- Discover Intel(R) SST-BF and note low and high priority base frequency
-- Note the high prioity CPU list
+- Note the high priority CPU list
- Enable CLOS using core-power feature set
- Configure CLOS parameters. Use CLOS.min to set to minimum performance
- Subscribe desired CPUs to CLOS groups
@@ -883,7 +905,7 @@ To enable Intel(R) SST-TF, execute::
enable:success
In this case, the option "-a" is optional. If set, it enables Intel(R) SST-TF
-feature and also sets the CPUs to high and and low priority using Intel Speed
+feature and also sets the CPUs to high and low priority using Intel Speed
Select Technology Core Power (Intel(R) SST-CP) features. The CPU numbers passed
with "-c" arguments are marked as high priority, including its siblings.
diff --git a/Documentation/admin-guide/pm/intel_idle.rst b/Documentation/admin-guide/pm/intel_idle.rst
index 89309e1b0e48..39bd6ecce7de 100644
--- a/Documentation/admin-guide/pm/intel_idle.rst
+++ b/Documentation/admin-guide/pm/intel_idle.rst
@@ -20,8 +20,8 @@ Nehalem and later generations of Intel processors, but the level of support for
a particular processor model in it depends on whether or not it recognizes that
processor model and may also depend on information coming from the platform
firmware. [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
-works in general, so this is the time to get familiar with :doc:`cpuidle` if you
-have not done that yet.]
+works in general, so this is the time to get familiar with
+Documentation/admin-guide/pm/cpuidle.rst if you have not done that yet.]
``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
logical CPU executing it is idle and so it may be possible to put some of the
@@ -53,7 +53,8 @@ processor) corresponding to them depends on the processor model and it may also
depend on the configuration of the platform.
In order to create a list of available idle states required by the ``CPUIdle``
-subsystem (see :ref:`idle-states-representation` in :doc:`cpuidle`),
+subsystem (see :ref:`idle-states-representation` in
+Documentation/admin-guide/pm/cpuidle.rst),
``intel_idle`` can use two sources of information: static tables of idle states
for different processor models included in the driver itself and the ACPI tables
of the system. The former are always used if the processor model at hand is
@@ -98,7 +99,8 @@ states may not be enabled by default if there are no matching entries in the
preliminary list of idle states coming from the ACPI tables. In that case user
space still can enable them later (on a per-CPU basis) with the help of
the ``disable`` idle state attribute in ``sysfs`` (see
-:ref:`idle-states-representation` in :doc:`cpuidle`). This basically means that
+:ref:`idle-states-representation` in
+Documentation/admin-guide/pm/cpuidle.rst). This basically means that
the idle states "known" to the driver may not be enabled by default if they have
not been exposed by the platform firmware (through the ACPI tables).
@@ -168,7 +170,7 @@ and ``idle=nomwait``. If any of them is present in the kernel command line, the
``MWAIT`` instruction is not allowed to be used, so the initialization of
``intel_idle`` will fail.
-Apart from that there are four module parameters recognized by ``intel_idle``
+Apart from that there are five module parameters recognized by ``intel_idle``
itself that can be set via the kernel command line (they cannot be updated via
sysfs, so that is the only way to change their values).
@@ -186,7 +188,8 @@ be desirable. In practice, it is only really necessary to do that if the idle
states in question cannot be enabled during system startup, because in the
working state of the system the CPU power management quality of service (PM
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
-even if they have been enumerated (see :ref:`cpu-pm-qos` in :doc:`cpuidle`).
+even if they have been enumerated (see :ref:`cpu-pm-qos` in
+Documentation/admin-guide/pm/cpuidle.rst).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.
The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
@@ -202,7 +205,8 @@ Namely, the positions of the bits that are set in the ``states_off`` value are
the indices of idle states to be disabled by default (as reflected by the names
of the corresponding idle state directories in ``sysfs``, :file:`state0`,
:file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given
-idle state; see :ref:`idle-states-representation` in :doc:`cpuidle`).
+idle state; see :ref:`idle-states-representation` in
+Documentation/admin-guide/pm/cpuidle.rst).
For example, if ``states_off`` is equal to 3, the driver will disable idle
states 0 and 1 by default, and if it is equal to 8, idle state 3 will be
@@ -212,6 +216,21 @@ are ignored).
The idle states disabled this way can be enabled (on a per-CPU basis) from user
space via ``sysfs``.
+The ``ibrs_off`` module parameter is a boolean flag (defaults to
+false). If set, it is used to control if IBRS (Indirect Branch Restricted
+Speculation) should be turned off when the CPU enters an idle state.
+This flag does not affect CPUs that use Enhanced IBRS which can remain
+on with little performance impact.
+
+For some CPUs, IBRS will be selected as mitigation for Spectre v2 and Retbleed
+security vulnerabilities by default. Leaving the IBRS mode on while idling may
+have a performance impact on its sibling CPU. The IBRS mode will be turned off
+by default when the CPU enters into a deep idle state, but not in some
+shallower ones. Setting the ``ibrs_off`` module parameter will force the IBRS
+mode to off when the CPU is in any one of the available idle states. This may
+help performance of a sibling CPU at the expense of a slightly higher wakeup
+latency for the idle CPU.
+
.. _intel-idle-core-and-package-idle-states:
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index 39d80bc29ccd..bf13ad25a32f 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -18,8 +18,8 @@ General Information
(``CPUFreq``). It is a scaling driver for the Sandy Bridge and later
generations of Intel processors. Note, however, that some of those processors
may not be supported. [To understand ``intel_pstate`` it is necessary to know
-how ``CPUFreq`` works in general, so this is the time to read :doc:`cpufreq` if
-you have not done that yet.]
+how ``CPUFreq`` works in general, so this is the time to read
+Documentation/admin-guide/pm/cpufreq.rst if you have not done that yet.]
For the processors supported by ``intel_pstate``, the P-state concept is broader
than just an operating frequency or an operating performance point (see the
@@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_).
Operation Modes
===============
-``intel_pstate`` can operate in three different modes: in the active mode with
-or without hardware-managed P-states support and in the passive mode. Which of
-them will be in effect depends on what kernel command line options are used and
-on the capabilities of the processor.
+``intel_pstate`` can operate in two different modes, active or passive. In the
+active mode, it uses its own internal performance scaling governor algorithm or
+allows the hardware to do performance scaling by itself, while in the passive
+mode it responds to requests made by a generic ``CPUFreq`` governor implementing
+a certain performance scaling algorithm. Which of them will be in effect
+depends on what kernel command line options are used and on the capabilities of
+the processor.
Active Mode
-----------
@@ -120,7 +123,9 @@ Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's
internal P-state selection logic is expected to focus entirely on performance.
This will override the EPP/EPB setting coming from the ``sysfs`` interface
-(see `Energy vs Performance Hints`_ below).
+(see `Energy vs Performance Hints`_ below). Moreover, any attempts to change
+the EPP/EPB to a value different from 0 ("performance") via ``sysfs`` in this
+configuration will be rejected.
Also, in this configuration the range of P-states available to the processor's
internal P-state selection logic is always restricted to the upper boundary
@@ -194,10 +199,11 @@ This is the default operation mode of ``intel_pstate`` for processors without
hardware-managed P-states (HWP) support. It is always used if the
``intel_pstate=passive`` argument is passed to the kernel in the command line
regardless of whether or not the given processor supports HWP. [Note that the
-``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
-without ``intel_pstate=active``.] Like in the active mode without HWP support,
-in this mode ``intel_pstate`` may refuse to work with processors that are not
-recognized by it.
+``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
+if it is not combined with ``intel_pstate=active``.] Like in the active mode
+without HWP support, in this mode ``intel_pstate`` may refuse to work with
+processors that are not recognized by it if HWP is prevented from being enabled
+through the kernel command line.
If the driver works in this mode, the ``scaling_driver`` policy attribute in
``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
@@ -318,10 +324,9 @@ manuals need to be consulted to get to it too.
For this reason, there is a list of supported processors in ``intel_pstate`` and
the driver initialization will fail if the detected processor is not in that
-list, unless it supports the `HWP feature <Active Mode_>`_. [The interface to
-obtain all of the information listed above is the same for all of the processors
-supporting the HWP feature, which is why they all are supported by
-``intel_pstate``.]
+list, unless it supports the HWP feature. [The interface to obtain all of the
+information listed above is the same for all of the processors supporting the
+HWP feature, which is why ``intel_pstate`` works with all of them.]
User Space Interface in ``sysfs``
@@ -360,6 +365,9 @@ argument is passed to the kernel in the command line.
inclusive) including both turbo and non-turbo P-states (see
`Turbo P-states Support`_).
+ This attribute is present only if the value exposed by it is the same
+ for all of the CPUs in the system.
+
The value of this attribute is not affected by the ``no_turbo``
setting described `below <no_turbo_attr_>`_.
@@ -369,19 +377,22 @@ argument is passed to the kernel in the command line.
Ratio of the `turbo range <turbo_>`_ size to the size of the entire
range of supported P-states, in percent.
+ This attribute is present only if the value exposed by it is the same
+ for all of the CPUs in the system.
+
This attribute is read-only.
.. _no_turbo_attr:
``no_turbo``
If set (equal to 1), the driver is not allowed to set any turbo P-states
- (see `Turbo P-states Support`_). If unset (equalt to 0, which is the
+ (see `Turbo P-states Support`_). If unset (equal to 0, which is the
default), turbo P-states can be set by the driver.
[Note that ``intel_pstate`` does not support the general ``boost``
attribute (supported by some other scaling drivers) which is replaced
by this one.]
- This attrubute does not affect the maximum supported frequency value
+ This attribute does not affect the maximum supported frequency value
supplied to the ``CPUFreq`` core and exposed via the policy interface,
but it affects the maximum possible value of per-policy P-state limits
(see `Interpretation of Policy Attributes`_ below for details).
@@ -425,18 +436,24 @@ argument is passed to the kernel in the command line.
as well as the per-policy ones) are then reset to their default
values, possibly depending on the target operation mode.]
- That only is supported in some configurations, though (for example, if
- the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
- the operation mode of the driver cannot be changed), and if it is not
- supported in the current configuration, writes to this attribute will
- fail with an appropriate error.
+``energy_efficiency``
+ This attribute is only present on platforms with CPUs matching the Kaby
+ Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
+ optimizations are disabled on these CPU models if HWP is enabled.
+ Enabling energy-efficiency optimizations may limit maximum operating
+ frequency with or without the HWP feature. With HWP enabled, the
+ optimizations are done only in the turbo frequency range. Without it,
+ they are done in the entire available frequency range. Setting this
+ attribute to "1" enables the energy-efficiency optimizations and setting
+ to "0" disables them.
Interpretation of Policy Attributes
-----------------------------------
The interpretation of some ``CPUFreq`` policy attributes described in
-:doc:`cpufreq` is special with ``intel_pstate`` as the current scaling driver
-and it generally depends on the driver's `operation mode <Operation Modes_>`_.
+Documentation/admin-guide/pm/cpufreq.rst is special with ``intel_pstate``
+as the current scaling driver and it generally depends on the driver's
+`operation mode <Operation Modes_>`_.
First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and
``scaling_cur_freq`` attributes are produced by applying a processor-specific
@@ -473,8 +490,8 @@ Next, the following policy attributes have special meaning if
policy for the time interval between the last two invocations of the
driver's utilization update callback by the CPU scheduler for that CPU.
-One more policy attribute is present if the `HWP feature is enabled in the
-processor <Active Mode With HWP_>`_:
+One more policy attribute is present if the HWP feature is enabled in the
+processor:
``base_frequency``
Shows the base frequency of the CPU. Any frequency above this will be
@@ -515,11 +532,11 @@ on the following rules, regardless of the current operation mode of the driver:
3. The global and per-policy limits can be set independently.
-If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
-resulting effective values are written into its registers whenever the limits
-change in order to request its internal P-state selection logic to always set
-P-states within these limits. Otherwise, the limits are taken into account by
-scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
+resulting effective values are written into hardware registers whenever the
+limits change in order to request its internal P-state selection logic to always
+set P-states within these limits. Otherwise, the limits are taken into account
+by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
every time before setting a new P-state for a CPU.
Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
@@ -530,12 +547,11 @@ at all and the only way to set the limits is by using the policy attributes.
Energy vs Performance Hints
---------------------------
-If ``intel_pstate`` works in the `active mode with the HWP feature enabled
-<Active Mode With HWP_>`_ in the processor, additional attributes are present
-in every ``CPUFreq`` policy directory in ``sysfs``. They are intended to allow
-user space to help ``intel_pstate`` to adjust the processor's internal P-state
-selection logic by focusing it on performance or on energy-efficiency, or
-somewhere between the two extremes:
+If the hardware-managed P-states (HWP) is enabled in the processor, additional
+attributes, intended to allow user space to help ``intel_pstate`` to adjust the
+processor's internal P-state selection logic by focusing it on performance or on
+energy-efficiency, or somewhere between the two extremes, are present in every
+``CPUFreq`` policy directory in ``sysfs``. They are:
``energy_performance_preference``
Current value of the energy vs performance hint for the given policy
@@ -554,7 +570,11 @@ somewhere between the two extremes:
Strings written to the ``energy_performance_preference`` attribute are
internally translated to integer values written to the processor's
Energy-Performance Preference (EPP) knob (if supported) or its
-Energy-Performance Bias (EPB) knob.
+Energy-Performance Bias (EPB) knob. It is also possible to write an integer
+value between 0 and 255 to this attribute, if the EPP feature is present. If
+the EPP feature is not present, writing an integer value to this attribute is
+not supported. In this case, users can use the
+"/sys/devices/system/cpu/cpu*/power/energy_perf_bias" interface.
[Note that tasks may by migrated from one CPU to another by the scheduler's
load-balancing algorithm and if different energy vs performance hints are
@@ -635,12 +655,14 @@ of them have to be prepended with the ``intel_pstate=`` prefix.
Do not register ``intel_pstate`` as the scaling driver even if the
processor is supported by it.
+``active``
+ Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
+ with.
+
``passive``
Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
start with.
- This option implies the ``no_hwp`` one described below.
-
``force``
Register ``intel_pstate`` as the scaling driver instead of
``acpi-cpufreq`` even if the latter is preferred on the given system.
@@ -655,13 +677,12 @@ of them have to be prepended with the ``intel_pstate=`` prefix.
driver is used instead of ``acpi-cpufreq``.
``no_hwp``
- Do not enable the `hardware-managed P-states (HWP) feature
- <Active Mode With HWP_>`_ even if it is supported by the processor.
+ Do not enable the hardware-managed P-states (HWP) feature even if it is
+ supported by the processor.
``hwp_only``
Register ``intel_pstate`` as the scaling driver only if the
- `hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
- supported by the processor.
+ hardware-managed P-states (HWP) feature is supported by the processor.
``support_acpi_ppc``
Take ACPI ``_PPC`` performance limits into account.
@@ -691,7 +712,7 @@ it works in the `active mode <Active Mode_>`_.
The following sequence of shell commands can be used to enable them and see
their output (if the kernel is generally configured to support event tracing)::
- # cd /sys/kernel/debug/tracing/
+ # cd /sys/kernel/tracing/
# echo 1 > events/power/pstate_sample/enable
# echo 1 > events/power/cpu_frequency/enable
# cat trace
@@ -708,10 +729,10 @@ core (for the policies with other scaling governors).
The ``ftrace`` interface can be used for low-level diagnostics of
``intel_pstate``. For example, to check how often the function to set a
-P-state is called, the ``ftrace`` filter can be set to to
+P-state is called, the ``ftrace`` filter can be set to
:c:func:`intel_pstate_set_pstate`::
- # cd /sys/kernel/debug/tracing/
+ # cd /sys/kernel/tracing/
# cat available_filter_functions | grep -i pstate
intel_pstate_set_pstate
intel_pstate_cpu_init
diff --git a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
new file mode 100644
index 000000000000..5ab3440e6cee
--- /dev/null
+++ b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
@@ -0,0 +1,115 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+==============================
+Intel Uncore Frequency Scaling
+==============================
+
+:Copyright: |copy| 2022-2023 Intel Corporation
+
+:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
+
+Introduction
+------------
+
+Depending on the workload characteristics, the uncore can consume a significant
+amount of power in Intel's Xeon servers. To optimize the total power and improve
+overall performance, SoCs have internal algorithms for scaling the uncore
+frequency. These algorithms monitor workload usage of the uncore and set a
+desirable frequency.
+
+It is possible that users have different expectations of uncore performance and
+want to have control over it. The objective is similar to allowing users to set
+the scaling min/max frequencies via cpufreq sysfs to improve CPU performance.
+Users may have some latency sensitive workloads where they do not want any
+change to uncore frequency. Also, users may have workloads which require
+different core and uncore performance at distinct phases and they may want to
+use both cpufreq and the uncore scaling interface to distribute power and
+improve overall performance.
+
+Sysfs Interface
+---------------
+
+To control uncore frequency, a sysfs interface is provided in the directory:
+`/sys/devices/system/cpu/intel_uncore_frequency/`.
+
+There is one directory for each package and die combination, as the scope of
+uncore scaling control is per die in multi-die/package SoCs, or per
+package for single-die-per-package SoCs. The name represents the
+scope of control. For example: 'package_00_die_00' is for package id 0 and
+die 0.
+
+Each package_*_die_* contains the following attributes:
+
+``initial_max_freq_khz``
+ Out of reset, this attribute represents the maximum possible frequency.
+ This is a read-only attribute. If users adjust max_freq_khz,
+ they can always go back to maximum using the value from this attribute.
+
+``initial_min_freq_khz``
+ Out of reset, this attribute represents the minimum possible frequency.
+ This is a read-only attribute. If users adjust min_freq_khz,
+ they can always go back to minimum using the value from this attribute.
+
+``max_freq_khz``
+ This attribute is used to set the maximum uncore frequency.
+
+``min_freq_khz``
+ This attribute is used to set the minimum uncore frequency.
+
+``current_freq_khz``
+ This attribute is used to get the current uncore frequency.
+
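+For example, the limits of package 0 / die 0 could be inspected and lowered like
+this (just a sketch to be run as root; the exact values depend on the platform)::
+
+ cd /sys/devices/system/cpu/intel_uncore_frequency/package_00_die_00
+ cat initial_min_freq_khz initial_max_freq_khz current_freq_khz
+ # cap the uncore at 2.0 GHz
+ echo 2000000 > max_freq_khz
+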
+SoCs with TPMI (Topology Aware Register and PM Capsule Interface)
+-----------------------------------------------------------------
+
+An SoC can contain multiple power domains, each with an individual mesh
+partition or a collection of mesh partitions. Such a partition is called a
+fabric cluster.
+
+Certain types of meshes need to run at the same frequency, so they are placed
+in the same fabric cluster. The benefit of a fabric cluster is that it offers a
+scalable mechanism to deal with partitioned fabrics in an SoC.
+
+The current sysfs interface supports controls at package and die level.
+This interface is not enough to support more granular control at
+fabric cluster level.
+
+SoCs with the support of TPMI (Topology Aware Register and PM Capsule
+Interface), can have multiple power domains. Each power domain can
+contain one or more fabric clusters.
+
+To represent controls at the fabric cluster level in addition to the controls
+at the package and die level (as on systems without TPMI support), sysfs is
+enhanced. This granular interface is presented in sysfs with directory names
+prefixed with "uncore". For example: uncore00, uncore01, etc.
+
+The scope of control is specified by attributes "package_id", "domain_id"
+and "fabric_cluster_id" in the directory.
+
+Attributes in each directory:
+
+``domain_id``
+ This attribute is used to get the power domain id of this instance.
+
+``fabric_cluster_id``
+ This attribute is used to get the fabric cluster id of this instance.
+
+``package_id``
+ This attribute is used to get the package id of this instance.
+
+The other attributes are the same as presented at the package_*_die_* level.
+
+In most current use cases, "max_freq_khz" and "min_freq_khz" are updated at
+the "package_*_die_*" level. This model is still supported with the following
+approach:
+
+When a user uses controls at the "package_*_die_*" level, every fabric cluster
+in that package and die is affected. For example: if a user changes
+"max_freq_khz" in package_00_die_00, then "max_freq_khz" in every uncore*
+directory with the same package id is updated. The user can still update
+"max_freq_khz" at each uncore* level, which is more restrictive. Similarly, a
+user can update "min_freq_khz" at the "package_*_die_*" level to apply it at
+each uncore* level.
+
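+This propagation can be illustrated with a short example (a sketch; it assumes
+the uncore* directories appear next to package_*_die_* under the same
+intel_uncore_frequency directory and that uncore00 belongs to package 0)::
+
+ cd /sys/devices/system/cpu/intel_uncore_frequency
+ echo 2500000 > package_00_die_00/max_freq_khz
+ # the new limit is reflected in the fabric clusters of that package and die
+ cat uncore00/package_id uncore00/max_freq_khz
+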
+Support for "current_freq_khz" is available only at each fabric cluster
+level (i.e., in the uncore* directories).
diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst
index f40994c422dc..ee45887811ff 100644
--- a/Documentation/admin-guide/pm/working-state.rst
+++ b/Documentation/admin-guide/pm/working-state.rst
@@ -11,6 +11,8 @@ Working-State Power Management
intel_idle
cpufreq
intel_pstate
+ amd-pstate
cpufreq_drivers
intel_epb
intel-speed-select
+ intel_uncore_frequency_scaling
diff --git a/Documentation/admin-guide/pmf.rst b/Documentation/admin-guide/pmf.rst
new file mode 100644
index 000000000000..9ee729ffc19b
--- /dev/null
+++ b/Documentation/admin-guide/pmf.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Set udev rules for PMF Smart PC Builder
+---------------------------------------
+
+The AMD PMF (Platform Management Framework) Smart PC Solution builder has to set system
+states like S0i3, screen lock, hibernate, etc., based on the output actions provided by the
+PMF TA (Trusted Application).
+
+In order for this to work, the PMF driver generates a uevent for user space to react to. Below
+are sample udev rules that can facilitate this experience when a machine has the PMF Smart PC
+Solution builder enabled.
+
+Please add the following line(s) to
+``/etc/udev/rules.d/99-local.rules``::
+
+ DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="0", RUN+="/usr/bin/systemctl suspend"
+ DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="1", RUN+="/usr/bin/systemctl hibernate"
+ DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="2", RUN+="/bin/loginctl lock-sessions"
+
+EVENT_ID values:
+0 = Put the system into S0i3/S2Idle
+1 = Put the system into hibernate
+2 = Lock the screen
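+
+Newly added rules only act on future uevents; to make them take effect without a
+reboot, udev can typically be asked to reload its rules (a sketch, assuming a
+system with ``udevadm`` available)::
+
+ sudo udevadm control --reload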
diff --git a/Documentation/admin-guide/pnp.rst b/Documentation/admin-guide/pnp.rst
index bab2d10631f0..3eda08191d13 100644
--- a/Documentation/admin-guide/pnp.rst
+++ b/Documentation/admin-guide/pnp.rst
@@ -281,10 +281,6 @@ ISAPNP drivers. They should serve as a temporary solution only.
They are as follows::
- struct pnp_card *pnp_find_card(unsigned short vendor,
- unsigned short device,
- struct pnp_card *from)
-
struct pnp_dev *pnp_find_dev(struct pnp_card *card,
unsigned short vendor,
unsigned short function,
diff --git a/Documentation/admin-guide/pstore-blk.rst b/Documentation/admin-guide/pstore-blk.rst
index 296d5027787a..1bb2a1c292aa 100644
--- a/Documentation/admin-guide/pstore-blk.rst
+++ b/Documentation/admin-guide/pstore-blk.rst
@@ -35,7 +35,7 @@ module parameters have priority over Kconfig.
Here is an example for module parameters::
- pstore_blk.blkdev=179:7 pstore_blk.kmsg_size=64
+ pstore_blk.blkdev=/dev/mmcblk0p7 pstore_blk.kmsg_size=64 best_effort=y
The detail of each configurations may be of interest to you.
@@ -45,15 +45,18 @@ blkdev
The block device to use. Most of the time, it is a partition of block device.
It's required for pstore/blk. It is also used for MTD device.
-It accepts the following variants for block device:
+When pstore/blk is built as a module, "blkdev" accepts the following variants:
-1. <hex_major><hex_minor> device number in hexadecimal represents itself; no
- leading 0x, for example b302.
-#. /dev/<disk_name> represents the device number of disk
+1. /dev/<disk_name> represents the device number of disk
#. /dev/<disk_name><decimal> represents the device number of partition - device
number of disk plus the partition number
#. /dev/<disk_name>p<decimal> - same as the above; this form is used when disk
name of partitioned disk ends with a digit.
+
+When pstore/blk is built into the kernel, "blkdev" accepts the following variants:
+
+#. <hex_major><hex_minor> device number in hexadecimal representation,
+ with no leading 0x, for example b302.
#. PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF represents the unique id of
a partition if the partition table provides it. The UUID may be either an
EFI/GPT UUID, or refer to an MSDOS partition using the format SSSSSSSS-PP,
@@ -73,7 +76,7 @@ kmsg_size
~~~~~~~~~
The chunk size in KB for oops/panic front-end. It **MUST** be a multiple of 4.
-It's optional if you do not care oops/panic log.
+It's optional if you do not care about the oops/panic log.
There are multiple chunks for oops/panic front-end depending on the remaining
space except other pstore front-ends.
@@ -85,7 +88,7 @@ pmsg_size
~~~~~~~~~
The chunk size in KB for pmsg front-end. It **MUST** be a multiple of 4.
-It's optional if you do not care pmsg log.
+It's optional if you do not care about the pmsg log.
Unlike oops/panic front-end, there is only one chunk for pmsg front-end.
@@ -97,7 +100,7 @@ console_size
~~~~~~~~~~~~
The chunk size in KB for console front-end. It **MUST** be a multiple of 4.
-It's optional if you do not care console log.
+It's optional if you do not care about the console log.
Similar to pmsg front-end, there is only one chunk for console front-end.
@@ -108,7 +111,7 @@ ftrace_size
~~~~~~~~~~~
The chunk size in KB for ftrace front-end. It **MUST** be a multiple of 4.
-It's optional if you do not care console log.
+It's optional if you do not care about the ftrace log.
Similar to oops front-end, there are multiple chunks for ftrace front-end
depending on the count of cpu processors. Each chunk size is equal to
@@ -151,20 +154,11 @@ otherwise KMSG_DUMP_MAX.
Configurations for driver
-------------------------
-Only a block device driver cares about these configurations. A block device
-driver uses ``register_pstore_blk`` to register to pstore/blk.
-
-.. kernel-doc:: fs/pstore/blk.c
- :identifiers: register_pstore_blk
-
-A non-block device driver uses ``register_pstore_device`` with
+A device driver uses ``register_pstore_device`` with
``struct pstore_device_info`` to register to pstore/blk.
.. kernel-doc:: fs/pstore/blk.c
- :identifiers: register_pstore_device
-
-.. kernel-doc:: include/linux/pstore_blk.h
- :identifiers: pstore_device_info
+ :export:
Compression and header
----------------------
@@ -236,8 +230,5 @@ For developer reference, here are all the important structures and APIs:
.. kernel-doc:: include/linux/pstore_zone.h
:internal:
-.. kernel-doc:: fs/pstore/blk.c
- :export:
-
.. kernel-doc:: include/linux/pstore_blk.h
:internal:
diff --git a/Documentation/admin-guide/quickly-build-trimmed-linux.rst b/Documentation/admin-guide/quickly-build-trimmed-linux.rst
new file mode 100644
index 000000000000..f08149bc53f8
--- /dev/null
+++ b/Documentation/admin-guide/quickly-build-trimmed-linux.rst
@@ -0,0 +1,1097 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+.. [see the bottom of this file for redistribution information]
+
+===========================================
+How to quickly build a trimmed Linux kernel
+===========================================
+
+This guide explains how to swiftly build Linux kernels that are ideal for
+testing purposes, but perfectly fine for day-to-day use, too.
+
+The essence of the process (aka 'TL;DR')
+========================================
+
+*[If you are new to compiling Linux, ignore this TLDR and head over to the next
+section below: it contains a step-by-step guide, which is more detailed, but
+still brief and easy to follow; that guide and its accompanying reference
+section also mention alternatives, pitfalls, and additional aspects, all of
+which might be relevant for you.]*
+
+If your system uses techniques like Secure Boot, prepare it to permit starting
+self-compiled Linux kernels; install compilers and everything else needed for
+building Linux; make sure to have 12 Gigabytes of free space in your home directory.
+Now run the following commands to download fresh Linux mainline sources, which
+you then use to configure, build and install your own kernel::
+
+ git clone --depth 1 -b master \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git ~/linux/
+ cd ~/linux/
+ # Hint: if you want to apply patches, do it at this point. See below for details.
+ # Hint: it's recommended to tag your build at this point. See below for details.
+ yes "" | make localmodconfig
+ # Hint: at this point you might want to adjust the build configuration; you'll
+ # have to, if you are running Debian. See below for details.
+ make -j $(nproc --all)
+ # Note: on many commodity distributions the next command suffices, but on Arch
+ # Linux, its derivatives, and some others it does not. See below for details.
+ command -v installkernel && sudo make modules_install install
+ reboot
+
+If you later want to build a newer mainline snapshot, use these commands::
+
+ cd ~/linux/
+ git fetch --depth 1 origin
+ # Note: the next command will discard any changes you did to the code:
+ git checkout --force --detach origin/master
+ # Reminder: if you want to (re)apply patches, do it at this point.
+ # Reminder: you might want to add or modify a build tag at this point.
+ make olddefconfig
+ make -j $(nproc --all)
+ # Reminder: the next command on some distributions does not suffice.
+ command -v installkernel && sudo make modules_install install
+ reboot
+
+Step-by-step guide
+==================
+
+Compiling your own Linux kernel is easy in principle. There are various ways to
+do it. Which of them actually works and is the best depends on the circumstances.
+
+This guide describes a way perfectly suited for those who want to quickly
+install Linux from sources without being bothered by complicated details; the
+goal is to cover everything typically needed on mainstream Linux distributions
+running on commodity PC or server hardware.
+
+The described approach is great for testing purposes, for example to try a
+proposed fix or to check if a problem was already fixed in the latest codebase.
+Nonetheless, kernels built this way are also totally fine for day-to-day use
+while at the same time being easy to keep up to date.
+
+The following steps describe the important aspects of the process; a
+comprehensive reference section later explains each of them in more detail. It
+sometimes also describes alternative approaches, pitfalls, as well as errors
+that might occur at a particular point -- and how to then get things rolling
+again.
+
+..
+ Note: if you see this note, you are reading the text's source file. You
+ might want to switch to a rendered version, as it makes it a lot easier to
+ quickly look something up in the reference section and afterwards jump back
+ to where you left off. Find the latest rendered version here:
+ https://docs.kernel.org/admin-guide/quickly-build-trimmed-linux.html
+
+.. _backup_sbs:
+
+ * Create a fresh backup and put system repair and restore tools at hand, just
+ to be prepared for the unlikely case of something going sideways.
+
+ [:ref:`details<backup>`]
+
+.. _secureboot_sbs:
+
+ * On platforms with 'Secure Boot' or similar techniques, prepare everything to
+ ensure the system will permit your self-compiled kernel to boot later. The
+ quickest and easiest way to achieve this on commodity x86 systems is to
+ disable such techniques in the BIOS setup utility; alternatively, remove
+ their restrictions through a process initiated by
+ ``mokutil --disable-validation``.
+
+ [:ref:`details<secureboot>`]
+
+.. _buildrequires_sbs:
+
+ * Install all software required to build a Linux kernel. Often you will need:
+ 'bc', 'binutils' ('ld' et al.), 'bison', 'flex', 'gcc', 'git', 'openssl',
+ 'pahole', 'perl', and the development headers for 'libelf' and 'openssl'. The
+ reference section shows how to quickly install those on various popular Linux
+ distributions.
+
+ [:ref:`details<buildrequires>`]
+
+.. _diskspace_sbs:
+
+ * Ensure to have enough free space for building and installing Linux. For the
+ latter 150 Megabyte in /lib/ and 100 in /boot/ are a safe bet. For storing
+ sources and build artifacts 12 Gigabyte in your home directory should
+ typically suffice. If you have less available, be sure to check the reference
+ section for the step that explains adjusting your kernel's build
+ configuration: it mentions a trick that reduces the amount of required space
+ in /home/ to around 4 Gigabytes.
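+
+ To check how much space is currently available, a command like the following
+ might help (just a sketch; adjust the paths in case your setup differs)::
+
+ df -h ~/ /boot/ /lib/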
+
+ [:ref:`details<diskspace>`]
+
+.. _sources_sbs:
+
+ * Retrieve the sources of the Linux version you intend to build; then change
+ into the directory holding them, as all further commands in this guide are
+ meant to be executed from there.
+
+ *[Note: the following paragraphs describe how to retrieve the sources by
+ partially cloning the Linux stable git repository. This is called a shallow
+ clone. The reference section explains two alternatives:* :ref:`packaged
+ archives<sources_archive>` *and* :ref:`a full git clone<sources_full>` *;
+ prefer the latter, if downloading a lot of data does not bother you, as that
+ will avoid some* :ref:`peculiar characteristics of shallow clones the
+ reference section explains<sources_shallow>` *.]*
+
+ First, execute the following command to retrieve a fresh mainline codebase::
+
+ git clone --no-checkout --depth 1 -b master \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git ~/linux/
+ cd ~/linux/
+
+ If you want to access recent mainline releases and pre-releases, deepen your
+ clone's history to the oldest mainline version you are interested in::
+
+ git fetch --shallow-exclude=v6.0 origin
+
+ In case you want to access a stable/longterm release (say v6.1.5), simply add
+ the branch holding that series; afterwards fetch the history at least up to
+ the mainline version that started the series (v6.1)::
+
+ git remote set-branches --add origin linux-6.1.y
+ git fetch --shallow-exclude=v6.0 origin
+
+ Now checkout the code you are interested in. If you just performed the
+ initial clone, you will be able to check out a fresh mainline codebase, which
+ is ideal for checking whether developers already fixed an issue::
+
+ git checkout --detach origin/master
+
+ If you deepened your clone, you can specify the version you deepened to
+ (``v6.0`` above) instead of ``origin/master``; later releases like ``v6.1`` and
+ pre-releases like ``v6.2-rc1`` will work, too. Stable or longterm versions
+ like ``v6.1.5`` work just the same, if you added the appropriate
+ stable/longterm branch as described.
+
+ [:ref:`details<sources>`]
+
+.. _patching_sbs:
+
+ * In case you want to apply a kernel patch, do so now. Often a command like
+ this will do the trick::
+
+ patch -p1 < ../proposed-fix.patch
+
+ Whether the ``-p1`` is actually needed depends on how the patch was created;
+ in case the patch does not apply with it, try again without it.
+
+ If you cloned the sources with git and anything goes sideways, run ``git
+ reset --hard`` to undo any changes to the sources.
+
+ [:ref:`details<patching>`]
+
+.. _tagging_sbs:
+
+ * If you patched your kernel or have one of the same version installed already,
+ better add a unique tag to the one you are about to build::
+
+ echo "-proposed_fix" > localversion
+
+ Running ``uname -r`` under your kernel later will then print something like
+ '6.1-rc4-proposed_fix'.
+
+ [:ref:`details<tagging>`]
+
+.. _configuration_sbs:
+
+ * Create the build configuration for your kernel based on an existing
+ configuration.
+
+ If you already prepared such a '.config' file yourself, copy it to
+ ~/linux/ and run ``make olddefconfig``.
+
+ Use the same command, if your distribution or somebody else already tailored
+ your running kernel to your or your hardware's needs: the make target
+ 'olddefconfig' will then try to use that kernel's .config as base.
+
+ Using this make target is fine for everybody else, too -- but you often can
+ save a lot of time by using this command instead::
+
+ yes "" | make localmodconfig
+
+ This will try to pick your distribution's kernel as base, but then disable
+ modules for any features apparently superfluous for your setup. This will
+ reduce the compile time enormously, especially if you are running a
+ universal kernel from a commodity Linux distribution.
+
+ There is a catch: 'localmodconfig' is likely to disable kernel features you
+ have not used since you booted your Linux -- like drivers for currently
+ disconnected peripherals or virtualization software you haven't used yet.
+ You can reduce or nearly eliminate that risk with tricks the reference
+ section outlines; but when building a kernel just for quick testing purposes
+ it is often negligible if such features are missing. But you should keep that
+ aspect in mind when using a kernel built with this make target, as it might
+ be the reason why something you only use occasionally stopped working.
+
+ [:ref:`details<configuration>`]
+
+.. _configmods_sbs:
+
+ * Check if you might want to or have to adjust some kernel configuration
+ options:
+
+ * Evaluate how you want to handle debug symbols. Enable them, if you later
+ might need to decode a stack trace found for example in a 'panic', 'Oops',
+ 'warning', or 'BUG'; on the other hand disable them, if you are short on
+ storage space or prefer a smaller kernel binary. See the reference section
+ for details on how to do either. If neither applies, it will likely be fine
+ to simply not bother with this. [:ref:`details<configmods_debugsymbols>`]
+
+ * Are you running Debian? Then you will want to avoid known problems by
+ performing additional adjustments explained in the reference section.
+ [:ref:`details<configmods_distros>`].
+
+ * If you want to influence the other aspects of the configuration, do so now
+ by using make targets like 'menuconfig' or 'xconfig'.
+ [:ref:`details<configmods_individual>`].
+
+.. _build_sbs:
+
+ * Build the image and the modules of your kernel::
+
+ make -j $(nproc --all)
+
+ If you want your kernel packaged up as deb, rpm, or tar file, see the
+ reference section for alternatives.
+
+ [:ref:`details<build>`]
+
+.. _install_sbs:
+
+ * Now install your kernel::
+
+ command -v installkernel && sudo make modules_install install
+
+ Often all that is left for you to do afterwards is a ``reboot``, as many commodity
+ Linux distributions will then create an initramfs (also known as initrd) and
+ an entry for your kernel in your bootloader's configuration; but on some
+ distributions you have to take care of these two steps manually for reasons
+ the reference section explains.
+
+ On a few distributions like Arch Linux and its derivatives the above command
+ does nothing at all; in that case you have to manually install your kernel,
+ as outlined in the reference section.
+
+ If you are running an immutable Linux distribution, check its documentation
+ and the web to find out how to install your own kernel there.
+
+ [:ref:`details<install>`]
+
+.. _another_sbs:
+
+ * To later build another kernel you need similar steps, but sometimes slightly
+ different commands.
+
+ First, switch back into the sources tree::
+
+ cd ~/linux/
+
+ In case you want to build a version from a stable/longterm series you have
+ not used yet (say 6.2.y), tell git to track it::
+
+ git remote set-branches --add origin linux-6.2.y
+
+ Now fetch the latest upstream changes; you again need to specify the earliest
+ version you care about, as git otherwise might retrieve the entire commit
+ history::
+
+ git fetch --shallow-exclude=v6.0 origin
+
+ Now switch to the version you are interested in -- but be aware the command
+ used here will discard any modifications you performed, as they would
+ conflict with the sources you want to checkout::
+
+ git checkout --force --detach origin/master
+
+ At this point you might want to patch the sources again or set/modify a build
+ tag, as explained earlier. Afterwards adjust the build configuration to the
+ new codebase using olddefconfig, which will now adjust the configuration file
+ you prepared earlier using localmodconfig (~/linux/.config) for your next
+ kernel::
+
+ # reminder: if you want to apply patches, do it at this point
+ # reminder: you might want to update your build tag at this point
+ make olddefconfig
+
+ Now build your kernel::
+
+ make -j $(nproc --all)
+
+ Afterwards install the kernel as outlined above::
+
+ command -v installkernel && sudo make modules_install install
+
+ [:ref:`details<another>`]
+
+.. _uninstall_sbs:
+
+ * Your kernel is easy to remove later, as its parts are only stored in two
+ places and clearly identifiable by the kernel's release name. Just ensure to
+ not delete the kernel you are running, as that might render your system
+ unbootable.
+
+ Start by deleting the directory holding your kernel's modules, which is named
+ after its release name -- '6.0.1-foobar' in the following example::
+
+ sudo rm -rf /lib/modules/6.0.1-foobar
+
+ Now try the following command, which on some distributions will delete all
+ other kernel files installed while also removing the kernel's entry from the
+ bootloader configuration::
+
+ command -v kernel-install && sudo kernel-install -v remove 6.0.1-foobar
+
+ If that command does not output anything or fails, see the reference section;
+ do the same if any files named '*6.0.1-foobar*' remain in /boot/.
+
+ [:ref:`details<uninstall>`]
+
+.. _submit_improvements:
+
+Did you run into trouble following any of the above steps that is not cleared up
+by the reference section below? Or do you have ideas how to improve the text?
+Then please take a moment of your time and let the maintainer of this document
+know by email (Thorsten Leemhuis <linux@leemhuis.info>), ideally while CCing the
+Linux docs mailing list (linux-doc@vger.kernel.org). Such feedback is vital to
+improve this document further, which is in everybody's interest, as it will
+enable more people to master the task described here.
+
+Reference section for the step-by-step guide
+============================================
+
+This section holds additional information for each of the steps in the above
+guide.
+
+.. _backup:
+
+Prepare for emergencies
+-----------------------
+
+ *Create a fresh backup and put system repair and restore tools at hand*
+ [:ref:`... <backup_sbs>`]
+
+Remember, you are dealing with computers, which sometimes do unexpected things
+-- especially if you fiddle with crucial parts like the kernel of an operating
+system. That's what you are about to do in this process. Hence, better prepare
+for something going sideways, even if that should not happen.
+
+[:ref:`back to step-by-step guide <backup_sbs>`]
+
+.. _secureboot:
+
+Dealing with techniques like Secure Boot
+----------------------------------------
+
+ *On platforms with 'Secure Boot' or similar techniques, prepare everything to
+ ensure the system will permit your self-compiled kernel to boot later.*
+ [:ref:`... <secureboot_sbs>`]
+
+Many modern systems allow only certain operating systems to start; they thus by
+default will reject booting self-compiled kernels.
+
+You ideally deal with this by making your platform trust your self-built kernels
+with the help of a certificate and signing. How to do that is not described
+here, as it requires various steps that would take the text too far away from
+its purpose; 'Documentation/admin-guide/module-signing.rst' and various web
+sites already explain this in more detail.
+
+Temporarily disabling solutions like Secure Boot is another way to make your own
+Linux boot. On commodity x86 systems it is possible to do this in the BIOS Setup
+utility; the steps to do so are not described here, as they greatly vary between
+machines.
+
+On mainstream x86 Linux distributions there is a third and universal option:
+disable all Secure Boot restrictions for your Linux environment. You can
+initiate this process by running ``mokutil --disable-validation``; this will
+tell you to create a one-time password, which is safe to write down. Now
+restart; right after your BIOS performed all self-tests the bootloader Shim will
+show a blue box with a message 'Press any key to perform MOK management'. Hit
+some key before the countdown expires; this will open a menu. Choose 'Change
+Secure Boot state' there. Shim's 'MokManager' will now ask you to enter three
+randomly chosen characters from the one-time password specified earlier. Once
+you provided them, confirm that you really want to disable the validation.
+Afterwards, permit MokManager to reboot the machine.
+
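+To check the current state and to kick off the process described above, the
+commands involved look roughly like this (a sketch; ``mokutil`` is usually
+shipped in a package of the same name)::
+
+ # show whether Secure Boot is currently enabled
+ mokutil --sb-state
+ # ask shim's MokManager to disable validation on the next boot
+ sudo mokutil --disable-validation
+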
+[:ref:`back to step-by-step guide <secureboot_sbs>`]
+
+.. _buildrequires:
+
+Install build requirements
+--------------------------
+
+ *Install all software required to build a Linux kernel.*
+ [:ref:`...<buildrequires_sbs>`]
+
+The kernel is pretty stand-alone, but besides tools like the compiler you will
+sometimes need a few libraries to build one. How to install everything needed
+depends on your Linux distribution and the configuration of the kernel you are
+about to build.
+
+Here are a few examples of what you typically need on some mainstream
+distributions:
+
+ * Debian, Ubuntu, and derivatives::
+
+ sudo apt install bc binutils bison dwarves flex gcc git make openssl \
+ pahole perl-base libssl-dev libelf-dev
+
+ * Fedora and derivatives::
+
+ sudo dnf install binutils /usr/include/{libelf.h,openssl/pkcs7.h} \
+ /usr/bin/{bc,bison,flex,gcc,git,openssl,make,perl,pahole}
+
+ * openSUSE and derivatives::
+
+ sudo zypper install bc binutils bison dwarves flex gcc git make perl-base \
+ openssl openssl-devel libelf-devel
+
+In case you wonder why these lists include openssl and its development headers:
+they are needed for the Secure Boot support, which many distributions enable in
+their kernel configuration for x86 machines.
+
+Sometimes you will need tools for compression formats like bzip2, gzip, lz4,
+lzma, lzo, xz, or zstd as well.
+
+You might need additional libraries and their development headers in case you
+perform tasks not covered in this guide. For example, zlib will be needed when
+building kernel tools from the tools/ directory; adjusting the build
+configuration with make targets like 'menuconfig' or 'xconfig' will require
+development headers for ncurses or Qt5.
+
+[:ref:`back to step-by-step guide <buildrequires_sbs>`]
+
+.. _diskspace:
+
+Space requirements
+------------------
+
+ *Ensure to have enough free space for building and installing Linux.*
+ [:ref:`... <diskspace_sbs>`]
+
+The numbers mentioned are rough estimates with a big extra charge to be on the
+safe side, so often you will need less.
+
+If you have space constraints, remember to read the reference section when you
+reach the :ref:`section about configuration adjustments <configmods>`, as
+ensuring debug symbols are disabled will reduce the consumed disk space by quite
+a few gigabytes.
+
+[:ref:`back to step-by-step guide <diskspace_sbs>`]
+
+
+.. _sources:
+
+Download the sources
+--------------------
+
+ *Retrieve the sources of the Linux version you intend to build.*
+ [:ref:`...<sources_sbs>`]
+
+The step-by-step guide outlines how to retrieve Linux' sources using a shallow
+git clone. There is :ref:`more to tell about this method<sources_shallow>` and
+two alternate ways worth describing: :ref:`packaged archives<sources_archive>`
+and :ref:`a full git clone<sources_full>`. And the aspects ':ref:`wouldn't it
+be wiser to use a proper pre-release than the latest mainline code
+<sources_snapshot>`' and ':ref:`how to get an even fresher mainline codebase
+<sources_fresher>`' need elaboration, too.
+
+Note, to keep things simple the commands used in this guide store the build
+artifacts in the source tree. If you prefer to separate them, simply add
+something like ``O=~/linux-builddir/`` to all make calls; also adjust the path
+in all commands that add files or modify any generated ones (like your '.config').
+
+[:ref:`back to step-by-step guide <sources_sbs>`]
+
+.. _sources_shallow:
+
+Noteworthy characteristics of shallow clones
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The step-by-step guide uses a shallow clone, as it is the best solution for most
+of this document's target audience. There are a few aspects of this approach
+worth mentioning:
+
+ * This document in most places uses ``git fetch`` with ``--shallow-exclude=``
+ to specify the earliest version you care about (or to be precise: its git
+ tag). You alternatively can use the parameter ``--shallow-since=`` to specify
+ an absolute (say ``'2023-07-15'``) or relative (``'12 months'``) date to
+ define the depth of the history you want to download. As a second
+ alternative, you can also specify a certain depth explicitly with a parameter
+ like ``--depth=1``, unless you add branches for stable/longterm kernels.
+
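+ For illustration, this is roughly how the two alternatives mentioned above
+ would look (just a sketch, using an arbitrary date and depth)::
+
+ git fetch --shallow-since='12 months' origin
+ git fetch --depth=1 origin
+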
+ * When running ``git fetch``, remember to always specify the oldest version,
+ the time you care about, or an explicit depth as shown in the step-by-step
+ guide. Otherwise you will risk downloading nearly the entire git history,
+ which will consume quite a bit of time and bandwidth while also stressing the
+ servers.
+
+ Note, you do not have to use the same version or date all the time. But when
+ you change it over time, git will deepen or flatten the history to the
+ specified point. That allows you to retrieve versions you initially thought
+ you did not need -- or it will discard the sources of older versions, for
+ example in case you want to free up some disk space. The latter will happen
+ automatically when using ``--shallow-since=`` or
+ ``--depth=``.
+
+ * Be warned, when deepening your clone you might encounter an error like
+ 'fatal: error in object: unshallow cafecaca0c0dacafecaca0c0dacafecaca0c0da'.
+ In that case run ``git repack -d`` and try again.
+
+ * In case you want to revert changes from a certain version (say Linux 6.3) or
+ perform a bisection (v6.2..v6.3), better tell ``git fetch`` to retrieve
+ objects up to three versions earlier (e.g. 6.0): ``git describe`` will then
+ be able to describe most commits just like it would in a full git clone.
+
+[:ref:`back to step-by-step guide <sources_sbs>`] [:ref:`back to section intro <sources>`]
+
+.. _sources_archive:
+
+Downloading the sources using a packages archive
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+People new to compiling Linux often assume downloading an archive via the
+front-page of https://kernel.org is the best approach to retrieve Linux'
+sources. It actually can be, if you are certain to build just one particular
+kernel version without changing any code. Thing is: you might be sure this will
+be the case, but in practice it often will turn out to be a wrong assumption.
+
+That's because when reporting or debugging an issue developers will often ask to
+give another version a try. They also might suggest temporarily undoing a commit
+with ``git revert`` or might provide various patches to try. Sometimes reporters
+will also be asked to use ``git bisect`` to find the change causing a problem.
+These things rely on git or are a lot easier and quicker to handle with it.
+
+A shallow clone also does not add any significant overhead. For example, when
+you use ``git clone --depth=1`` to create a shallow clone of the latest mainline
+codebase git will only retrieve a little more data than downloading the latest
+mainline pre-release (aka 'rc') via the front-page of kernel.org would.
+
+A shallow clone therefore is often the better choice. If you nevertheless want
+to use a packaged source archive, download one via kernel.org; afterwards
+extract its content to some directory and change to the subdirectory created
+during extraction. The rest of the step-by-step guide will work just fine, apart
+from things that rely on git -- but this mainly concerns the section on
+successive builds of other versions.
+
+[:ref:`back to step-by-step guide <sources_sbs>`] [:ref:`back to section intro <sources>`]
+
+.. _sources_full:
+
+Downloading the sources using a full git clone
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If downloading and storing a lot of data (~4.4 Gigabyte as of early 2023) is
+nothing that bothers you, perform a full git clone instead of a shallow
+clone. You then will avoid the peculiarities mentioned above and will have all
+versions and individual commits at hand at any time::
+
+ curl -L \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/clone.bundle \
+ -o linux-stable.git.bundle
+ git clone linux-stable.git.bundle ~/linux/
+ rm linux-stable.git.bundle
+ cd ~/linux/
+ git remote set-url origin \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+ git fetch origin
+ git checkout --detach origin/master
+
+[:ref:`back to step-by-step guide <sources_sbs>`] [:ref:`back to section intro <sources>`]
+
+.. _sources_snapshot:
+
+Proper pre-releases (RCs) vs. latest mainline
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When cloning the sources using git and checking out origin/master, you often
+will retrieve a codebase that is somewhere between the latest and the next
+release or pre-release. This almost always is the code you want when giving
+mainline a shot: pre-releases like v6.1-rc5 are in no way special, as they do
+not get any significant extra testing before being published.
+
+There is one exception: you might want to stick to the latest mainline release
+(say v6.1) before its successor's first pre-release (v6.2-rc1) is out. That is
+because compiler errors and other problems are more likely to occur during this
+time, as mainline then is in its 'merge window': a usually two week long phase,
+in which the bulk of the changes for the next release is merged.
+
+[:ref:`back to step-by-step guide <sources_sbs>`] [:ref:`back to section intro <sources>`]
+
+.. _sources_fresher:
+
+Avoiding the mainline lag
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The explanations for the shallow clone and the full clone both retrieve the
+code from the Linux stable git repository. That makes things simpler for this
+document's audience, as it allows easy access to both mainline and
+stable/longterm releases. This approach has just one downside:
+
+Changes merged into the mainline repository are only synced to the master branch
+of the Linux stable repository every few hours. This lag most of the time is
+not something to worry about; but in case you really need the latest code, just
+add the mainline repo as additional remote and checkout the code from there::
+
+ git remote add mainline \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
+ git fetch mainline
+ git checkout --detach mainline/master
+
+When doing this with a shallow clone, remember to call ``git fetch`` with one
+of the parameters described earlier to limit the depth.
+
+[:ref:`back to step-by-step guide <sources_sbs>`] [:ref:`back to section intro <sources>`]
+
+.. _patching:
+
+Patch the sources (optional)
+----------------------------
+
+ *In case you want to apply a kernel patch, do so now.*
+ [:ref:`...<patching_sbs>`]
+
+This is the point where you might want to patch your kernel -- for example when
+a developer proposed a fix and asked you to check if it helps. The step-by-step
+guide already explains everything crucial here.
+
+[:ref:`back to step-by-step guide <patching_sbs>`]
+
+.. _tagging:
+
+Tagging this kernel build (optional, often wise)
+------------------------------------------------
+
+ *If you patched your kernel or already have that kernel version installed,
+ better tag your kernel by extending its release name:*
+ [:ref:`...<tagging_sbs>`]
+
+Tagging your kernel will help avoid confusion later, especially when you patched
+your kernel. Adding an individual tag will also ensure the kernel's image and
+its modules are installed in parallel to any existing kernels.
+
+There are various ways to add such a tag. The step-by-step guide realizes one by
+creating a 'localversion' file in your build directory from which the kernel
+build scripts will automatically pick up the tag. You can later change that file
+to use a different tag in subsequent builds or simply remove that file to dump
+the tag.
+
+[:ref:`back to step-by-step guide <tagging_sbs>`]
+
+.. _configuration:
+
+Define the build configuration for your kernel
+----------------------------------------------
+
+ *Create the build configuration for your kernel based on an existing
+ configuration.* [:ref:`... <configuration_sbs>`]
+
+There are various aspects of this step that require a more careful
+explanation:
+
+Pitfalls when using another configuration file as base
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Make targets like localmodconfig and olddefconfig share a few common snares you
+want to be aware of:
+
+ * These targets will reuse a kernel build configuration in your build directory
+ (e.g. '~/linux/.config'), if one exists. In case you want to start from
+ scratch you thus need to delete it.
+
+ * The make targets try to find the configuration for your running kernel
+ automatically, but might choose poorly. A line like '# using defaults found
+ in /boot/config-6.0.7-250.fc36.x86_64' or 'using config:
+ '/boot/config-6.0.7-250.fc36.x86_64' tells you which file they picked. If
+ that is not the intended one, simply store it as '~/linux/.config'
+ before using these make targets.
+
+ * Unexpected things might happen if you try to use a config file prepared for
+ one kernel (say v6.0) on an older generation (say v5.15). In that case you
+ might want to use a configuration as base which your distribution utilized
+ when it used that or a slightly older kernel version.
+
+Influencing the configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The make target olddefconfig and the ``yes "" |`` used when utilizing
+localmodconfig will set any undefined build options to their default value. This
+among other things will disable many kernel features that were introduced after your
+base kernel was released.
+
+If you want to set these configuration options manually, use ``oldconfig``
+instead of ``olddefconfig`` or omit the ``yes "" |`` when utilizing
+localmodconfig. Then for each undefined configuration option you will be asked
+how to proceed. In case you are unsure what to answer, simply hit 'enter' to
+apply the default value.
+
+Big pitfall when using localmodconfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As explained briefly in the step-by-step guide already: with localmodconfig it
+can easily happen that your self-built kernel will lack modules for tasks you
+did not perform before utilizing this make target. That's because those tasks
+require kernel modules that are normally autoloaded when you perform that task
+for the first time; if you didn't perform that task at least once before using
+localmodconfig, the latter will thus assume these modules are superfluous and
+disable them.
+
+You can try to avoid this by performing typical tasks that often will autoload
+additional kernel modules: start a VM, establish VPN connections, loop-mount a
+CD/DVD ISO, mount network shares (CIFS, NFS, ...), and connect all external
+devices (2FA keys, headsets, webcams, ...) as well as storage devices with file
+systems you otherwise do not utilize (btrfs, ext4, FAT, NTFS, XFS, ...). But it
+is hard to think of everything that might be needed -- even kernel developers
+often forget one thing or another at this point.
+
+Do not let that risk bother you, especially when compiling a kernel only for
+testing purposes: everything typically crucial will be there. And if you forget
+something important you can turn on a missing feature later and quickly run the
+commands to compile and install a better kernel.
+
+But if you plan to build and use self-built kernels regularly, you might want to
+reduce the risk by recording which modules your system loads over the course of
+a few weeks. You can automate this with `modprobed-db
+<https://github.com/graysky2/modprobed-db>`_. Afterwards use ``LSMOD=<path>`` to
+point localmodconfig to the list of modules modprobed-db noticed being used::
+
+ yes "" | make LSMOD="${HOME}"/.config/modprobed.db localmodconfig
+
+Remote building with localmodconfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you want to use localmodconfig to build a kernel for another machine, run
+``lsmod > lsmod_foo-machine`` on it and transfer that file to your build host.
+Now point the build scripts to the file like this: ``yes "" | make
+LSMOD=~/lsmod_foo-machine localmodconfig``. Note, in this case
+you likely want to copy a base kernel configuration from the other machine over
+as well and place it as .config in your build directory.
+
+[:ref:`back to step-by-step guide <configuration_sbs>`]
+
+.. _configmods:
+
+Adjust build configuration
+--------------------------
+
+ *Check if you might want to or have to adjust some kernel configuration
+ options:*
+
+Depending on your needs you at this point might want or have to adjust some
+kernel configuration options.
+
+.. _configmods_debugsymbols:
+
+Debug symbols
+~~~~~~~~~~~~~
+
+ *Evaluate how you want to handle debug symbols.*
+ [:ref:`...<configmods_sbs>`]
+
+Most users do not need to care about this; it's often fine to leave everything
+as it is; but you should take a closer look at this, if you might need to decode
+a stack trace or want to reduce space consumption.
+
+Having debug symbols available can be important when your kernel throws a
+'panic', 'Oops', 'warning', or 'BUG' later when running, as then you will be
+able to find the exact place where the problem occurred in the code. But
+collecting and embedding the needed debug information takes time and consumes
+quite a bit of space: in late 2022 the build artifacts for a typical x86 kernel
+configured with localmodconfig consumed around 5 Gigabyte of space with debug
+symbols, but less than 1 when they were disabled. The resulting kernel image and
+the modules are bigger as well, which increases load times.
+
+Hence, if you want a small kernel and are unlikely to decode a stack trace
+later, you might want to disable debug symbols to avoid above downsides::
+
+ ./scripts/config --file .config -d DEBUG_INFO \
+ -d DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT -d DEBUG_INFO_DWARF4 \
+ -d DEBUG_INFO_DWARF5 -e CONFIG_DEBUG_INFO_NONE
+ make olddefconfig
+
+You on the other hand definitely want to enable them, if there is a decent
+chance that you need to decode a stack trace later (as explained by 'Decode
+failure messages' in Documentation/admin-guide/tainted-kernels.rst in more
+detail)::
+
+ ./scripts/config --file .config -d DEBUG_INFO_NONE -e DEBUG_KERNEL \
+ -e DEBUG_INFO -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT -e KALLSYMS -e KALLSYMS_ALL
+ make olddefconfig
+
+Note, many mainstream distributions enable debug symbols in their kernel
+configurations -- make targets like localmodconfig and olddefconfig thus will
+often pick that setting up.
+
+[:ref:`back to step-by-step guide <configmods_sbs>`]
+
+.. _configmods_distros:
+
+Distro specific adjustments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Are you running* [:ref:`... <configmods_sbs>`]
+
+The following sections help you to avoid build problems that are known to occur
+when following this guide on a few commodity distributions.
+
+**Debian:**
+
+ * Remove a stale reference to a certificate file that would cause your build to
+ fail::
+
+ ./scripts/config --file .config --set-str SYSTEM_TRUSTED_KEYS ''
+
+ Alternatively, download the needed certificate and make that configuration
+ option point to it, as `the Debian handbook explains in more detail
+ <https://debian-handbook.info/browse/stable/sect.kernel-compilation.html>`_
+ -- or generate your own, as explained in
+ Documentation/admin-guide/module-signing.rst.
+
+[:ref:`back to step-by-step guide <configmods_sbs>`]
+
+.. _configmods_individual:
+
+Individual adjustments
+~~~~~~~~~~~~~~~~~~~~~~
+
+ *If you want to influence the other aspects of the configuration, do so
+ now* [:ref:`... <configmods_sbs>`]
+
+You at this point can use a command like ``make menuconfig`` to enable or
+disable certain features using a text-based user interface; to use a graphical
+configuration utility, use the make target ``xconfig`` or ``gconfig`` instead.
+All of them require development libraries from toolkits they are based on
+(ncurses, Qt5, Gtk2); an error message will tell you if something required is
+missing.
+
+[:ref:`back to step-by-step guide <configmods_sbs>`]
+
+.. _build:
+
+Build your kernel
+-----------------
+
+ *Build the image and the modules of your kernel* [:ref:`... <build_sbs>`]
+
+A lot can go wrong at this stage, but the instructions below will help you help
+yourself. Another subsection explains how to directly package your kernel up as
+deb, rpm or tar file.
+
+Dealing with build errors
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a build error occurs, it might be caused by some aspect of your machine's
+setup that often can be fixed quickly; other times though the problem lies in
+the code and can only be fixed by a developer. A close examination of the
+failure messages coupled with some research on the internet will often tell you
+which of the two it is. To perform such an investigation, restart the build
+process like this::
+
+ make V=1
+
+The ``V=1`` activates verbose output, which might be needed to see the actual
+error. To make it easier to spot, this command also omits the ``-j $(nproc
+--all)`` used earlier to utilize every CPU core in the system for the job -- but
+this parallelism also results in some clutter when failures occur.
+
+After a few seconds the build process should run into the error again. Now try
+to find the most crucial line describing the problem. Then search the internet
+for the most important and non-generic section of that line (say 4 to 8 words);
+avoid or remove anything that looks remotely system-specific, like your username
+or local path names like ``/home/username/linux/``. First try your regular
+internet search engine with that string, afterwards search Linux kernel mailing
+lists via `lore.kernel.org/all/ <https://lore.kernel.org/all/>`_.
+
+This most of the time will find something that will explain what is wrong; quite
+often one of the hits will provide a solution for your problem, too. If you
+do not find anything that matches your problem, try again from a different angle
+by modifying your search terms or using another line from the error messages.
+
+In the end, most trouble you are going to run into has likely been encountered
+and reported by others already. That includes issues where the cause is not your
+system, but lies in the code. If you run into one of those, you might thus find a
+solution (e.g. a patch) or workaround for your problem, too.
+
+Package your kernel up
+~~~~~~~~~~~~~~~~~~~~~~
+
+The step-by-step guide uses the default make targets (e.g. 'bzImage' and
+'modules' on x86) to build the image and the modules of your kernel, which later
+steps of the guide then install. You instead can also directly build everything
+and directly package it up by using one of the following targets:
+
+ * ``make -j $(nproc --all) bindeb-pkg`` to generate a deb package
+
+ * ``make -j $(nproc --all) binrpm-pkg`` to generate a rpm package
+
+ * ``make -j $(nproc --all) tarbz2-pkg`` to generate a bz2 compressed tarball
+
+This is just a selection of available make targets for this purpose, see
+``make help`` for others. You can also use these targets after running
+``make -j $(nproc --all)``, as they will pick up everything already built.
+
+If you employ the targets to generate deb or rpm packages, ignore the
+step-by-step guide's instructions on installing and removing your kernel;
+instead install and remove the packages using the package utility for the format
+(e.g. dpkg and rpm) or a package management utility built on top of them (apt,
+aptitude, dnf/yum, zypper, ...). Be aware that the packages generated using
+these two make targets are designed to work on various distributions utilizing
+those formats; they thus will sometimes behave differently than your
+distribution's kernel packages.
+
+[:ref:`back to step-by-step guide <build_sbs>`]
+
+.. _install:
+
+Install your kernel
+-------------------
+
+ *Now install your kernel* [:ref:`... <install_sbs>`]
+
+What you need to do after executing the command in the step-by-step guide
+depends on the existence and the implementation of an ``installkernel``
+executable. Many commodity Linux distributions ship such a kernel installer in
+``/sbin/`` that does everything needed, hence there is nothing left for you
+except rebooting. But some distributions contain an installkernel that does
+only part of the job -- and a few lack it completely and leave all the work to
+you.
+
+If ``installkernel`` is found, the kernel's build system will delegate the
+actual installation of your kernel's image and related files to this executable.
+On almost all Linux distributions it will store the image as '/boot/vmlinuz-
+<your kernel's release name>' and put a 'System.map-<your kernel's release
+name>' alongside it. Your kernel will thus be installed in parallel to any
+existing ones, unless you already have one with exactly the same release name.
+
+Installkernel on many distributions will afterwards generate an 'initramfs'
+(often also called 'initrd'), which commodity distributions rely on for booting;
+hence be sure to keep the order of the two make targets used in the step-by-step
+guide, as things will go sideways if you install your kernel's image before its
+modules. Often installkernel will then add your kernel to the bootloader
+configuration, too. You have to take care of one or both of these tasks
+yourself, if your distribution's installkernel doesn't handle them.
+
+A few distributions like Arch Linux and its derivatives totally lack an
+installkernel executable. On those just install the modules using the kernel's
+build system and then install the image and the System.map file manually::
+
+ sudo make modules_install
+ sudo install -m 0600 $(make -s image_name) /boot/vmlinuz-$(make -s kernelrelease)
+ sudo install -m 0600 System.map /boot/System.map-$(make -s kernelrelease)
+
+If your distribution boots with the help of an initramfs, now generate one for
+your kernel using the tools your distribution provides for this process.
+Afterwards add your kernel to your bootloader configuration and reboot.
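+
+How to do that differs between distributions; the following lines are thus just
+rough sketches of two common variants, so double-check the tool and options
+your distribution actually uses::
+
+ # Debian, Ubuntu and their derivatives
+ sudo update-initramfs -c -k $(make -s kernelrelease)
+ # Fedora and other distributions relying on dracut
+ sudo dracut --kver $(make -s kernelrelease)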
+
+[:ref:`back to step-by-step guide <install_sbs>`]
+
+.. _another:
+
+Another round later
+-------------------
+
+ *To later build another kernel you need similar, but sometimes slightly
+ different commands* [:ref:`... <another_sbs>`]
+
+The process to build later kernels is similar, but at some points slightly
+different. For example, you do not want to use 'localmodconfig' for subsequent
+kernel builds, as you already created a trimmed down configuration you want to
+use from now on. Hence instead just use ``oldconfig`` or ``olddefconfig`` to
+adjust your build configurations to the needs of the kernel version you are
+about to build.
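+
+A later build thus might start with something like this minimal sketch, run in
+your kernel sources after updating them::
+
+ make olddefconfig
+ make -j $(nproc --all)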
+
+If you created a shallow clone with git, remember what the :ref:`section that
+explained the setup <sources>` described in more detail: you need to use a
+slightly different ``git fetch`` command, and when switching to another series
+you need to add an additional remote branch.
+
+[:ref:`back to step-by-step guide <another_sbs>`]
+
+.. _uninstall:
+
+Uninstall the kernel later
+--------------------------
+
+ *All parts of your installed kernel are identifiable by its release name and
+ thus easy to remove later.* [:ref:`... <uninstall_sbs>`]
+
+Do not worry that installing your kernel manually and thus bypassing your
+distribution's packaging system will totally mess up your machine: all parts of
+your kernel are easy to remove later, as files are stored in two places only and
+normally identifiable by the kernel's release name.
+
+One of the two places is a directory in /lib/modules/, which holds the modules
+for each installed kernel. This directory is named after the kernel's release
+name; hence, to remove all modules for one of your kernels, simply remove its
+modules directory in /lib/modules/.
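+
+For a kernel with the release name '6.0.1-foobar' (the example name used in the
+next paragraphs) that for example would be::
+
+ sudo rm -rf /lib/modules/6.0.1-foobar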
+
+The other place is /boot/, where typically one to five files will be placed
+during installation of a kernel. All of them usually contain the release name in
+their file name, but how many files there are and what they are called depends
+somewhat on your
+distribution's installkernel executable (:ref:`see above <install>`) and its
+initramfs generator. On some distributions the ``kernel-install`` command
+mentioned in the step-by-step guide will remove all of these files for you --
+and the entry for your kernel in the bootloader configuration at the same time,
+too. On others you have to take care of these steps yourself. The following
+command should interactively remove the two main files of a kernel with the
+release name '6.0.1-foobar'::
+
+ rm -i /boot/{System.map,vmlinuz}-6.0.1-foobar
+
+Now remove the belonging initramfs, which often will be called something like
+``/boot/initramfs-6.0.1-foobar.img`` or ``/boot/initrd.img-6.0.1-foobar``.
+Afterwards check for other files in /boot/ that have '6.0.1-foobar' in their
+name and delete them as well. Now remove the kernel from your bootloader's
+configuration.
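+
+A rough sketch of these two steps for the '6.0.1-foobar' example; be aware the
+initramfs file name as well as the command and config path for refreshing the
+bootloader differ between distributions, the GRUB variant shown here is just
+one common case::
+
+ sudo rm -i /boot/initramfs-6.0.1-foobar.img
+ sudo grub-mkconfig -o /boot/grub/grub.cfg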
+
+Note, be very careful with wildcards like '*' when deleting files or directories
+for kernels manually: you might accidentally remove files of a 6.0.11 kernel
+when all you want is to remove 6.0 or 6.0.1.
+
+[:ref:`back to step-by-step guide <uninstall_sbs>`]
+
+.. _faq:
+
+FAQ
+===
+
+Why does this 'how-to' not work on my system?
+---------------------------------------------
+
+As initially stated, this guide is 'designed to cover everything typically
+needed [to build a kernel] on mainstream Linux distributions running on
+commodity PC or server hardware'. Despite this, the outlined approach should work
+on many other setups as well. But trying to cover every possible use-case in one
+guide would defeat its purpose, as without such a focus you would need dozens or
+hundreds of constructs along the lines of 'in case you are using <insert
+machine or distro>, you at this point have to do <this and that>
+<instead|additionally>', each of which would make the text longer, more
+complicated, and harder to follow.
+
+That being said: this of course is a balancing act. Hence, if you think an
+additional use-case is worth describing, suggest it to the maintainers of this
+document, as :ref:`described above <submit_improvements>`.
+
+
+..
+ end-of-content
+..
+ This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If
+ you spot a typo or small mistake, feel free to let him know directly and
+ he'll fix it. You are free to do the same in a mostly informal way if you
+ want to contribute changes to the text -- but for copyright reasons please CC
+ linux-doc@vger.kernel.org and 'sign-off' your contribution as
+ Documentation/process/submitting-patches.rst explains in the section 'Sign
+ your work - the Developer's Certificate of Origin'.
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use 'The Linux kernel development community' for author attribution
+ and link this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/quickly-build-trimmed-linux.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
+
diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst
index a60a96218ba9..e9f85142182d 100644
--- a/Documentation/admin-guide/ramoops.rst
+++ b/Documentation/admin-guide/ramoops.rst
@@ -3,7 +3,7 @@ Ramoops oops/panic logger
Sergiu Iordache <sergiu@chromium.org>
-Updated: 17 November 2011
+Updated: 10 Feb 2021
Introduction
------------
@@ -22,7 +22,7 @@ and type of the memory area are set using three variables:
* ``mem_address`` for the start
* ``mem_size`` for the size. The memory size will be rounded down to a
power of two.
- * ``mem_type`` to specifiy if the memory type (default is pgprot_writecombine).
+ * ``mem_type`` to specify the memory type (default is pgprot_writecombine).
Typically the default value of ``mem_type=0`` should be used as that sets the pstore
mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use
@@ -30,6 +30,8 @@ mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use
depends on atomic operations. At least on ARM, pgprot_noncached causes the
memory to be mapped strongly ordered, and atomic operations on strongly ordered
memory are implementation defined, and won't work on many ARMs such as omaps.
+Setting ``mem_type=2`` attempts to treat the memory region as normal memory,
+which enables full caching of it. This can improve performance.
The memory area is divided into ``record_size`` chunks (also rounded down to
power of two) and each kmesg dump writes a ``record_size`` chunk of
@@ -67,7 +69,7 @@ Setting the ramoops parameters can be done in several different manners:
mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1
B. Use Device Tree bindings, as described in
- ``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``.
+ ``Documentation/devicetree/bindings/reserved-memory/ramoops.yaml``.
For example::
reserved-memory {
diff --git a/Documentation/admin-guide/reporting-bugs.rst b/Documentation/admin-guide/reporting-bugs.rst
deleted file mode 100644
index 42481ea7b41d..000000000000
--- a/Documentation/admin-guide/reporting-bugs.rst
+++ /dev/null
@@ -1,182 +0,0 @@
-.. _reportingbugs:
-
-Reporting bugs
-++++++++++++++
-
-Background
-==========
-
-The upstream Linux kernel maintainers only fix bugs for specific kernel
-versions. Those versions include the current "release candidate" (or -rc)
-kernel, any "stable" kernel versions, and any "long term" kernels.
-
-Please see https://www.kernel.org/ for a list of supported kernels. Any
-kernel marked with [EOL] is "end of life" and will not have any fixes
-backported to it.
-
-If you've found a bug on a kernel version that isn't listed on kernel.org,
-contact your Linux distribution or embedded vendor for support.
-Alternatively, you can attempt to run one of the supported stable or -rc
-kernels, and see if you can reproduce the bug on that. It's preferable
-to reproduce the bug on the latest -rc kernel.
-
-
-How to report Linux kernel bugs
-===============================
-
-
-Identify the problematic subsystem
-----------------------------------
-
-Identifying which part of the Linux kernel might be causing your issue
-increases your chances of getting your bug fixed. Simply posting to the
-generic linux-kernel mailing list (LKML) may cause your bug report to be
-lost in the noise of a mailing list that gets 1000+ emails a day.
-
-Instead, try to figure out which kernel subsystem is causing the issue,
-and email that subsystem's maintainer and mailing list. If the subsystem
-maintainer doesn't answer, then expand your scope to mailing lists like
-LKML.
-
-
-Identify who to notify
-----------------------
-
-Once you know the subsystem that is causing the issue, you should send a
-bug report. Some maintainers prefer bugs to be reported via bugzilla
-(https://bugzilla.kernel.org), while others prefer that bugs be reported
-via the subsystem mailing list.
-
-To find out where to send an emailed bug report, find your subsystem or
-device driver in the MAINTAINERS file. Search in the file for relevant
-entries, and send your bug report to the person(s) listed in the "M:"
-lines, making sure to Cc the mailing list(s) in the "L:" lines. When the
-maintainer replies to you, make sure to 'Reply-all' in order to keep the
-public mailing list(s) in the email thread.
-
-If you know which driver is causing issues, you can pass one of the driver
-files to the get_maintainer.pl script::
-
- perl scripts/get_maintainer.pl -f <filename>
-
-If it is a security bug, please copy the Security Contact listed in the
-MAINTAINERS file. They can help coordinate bugfix and disclosure. See
-:ref:`Documentation/admin-guide/security-bugs.rst <securitybugs>` for more information.
-
-If you can't figure out which subsystem caused the issue, you should file
-a bug in kernel.org bugzilla and send email to
-linux-kernel@vger.kernel.org, referencing the bugzilla URL. (For more
-information on the linux-kernel mailing list see
-http://vger.kernel.org/lkml/).
-
-
-Tips for reporting bugs
------------------------
-
-If you haven't reported a bug before, please read:
-
- https://www.chiark.greenend.org.uk/~sgtatham/bugs.html
-
- http://www.catb.org/esr/faqs/smart-questions.html
-
-It's REALLY important to report bugs that seem unrelated as separate email
-threads or separate bugzilla entries. If you report several unrelated
-bugs at once, it's difficult for maintainers to tease apart the relevant
-data.
-
-
-Gather information
-------------------
-
-The most important information in a bug report is how to reproduce the
-bug. This includes system information, and (most importantly)
-step-by-step instructions for how a user can trigger the bug.
-
-If the failure includes an "OOPS:", take a picture of the screen, capture
-a netconsole trace, or type the message from your screen into the bug
-report. Please read "Documentation/admin-guide/bug-hunting.rst" before posting your
-bug report. This explains what you should do with the "Oops" information
-to make it useful to the recipient.
-
-This is a suggested format for a bug report sent via email or bugzilla.
-Having a standardized bug report form makes it easier for you not to
-overlook things, and easier for the developers to find the pieces of
-information they're really interested in. If some information is not
-relevant to your bug, feel free to exclude it.
-
-First run the ver_linux script included as scripts/ver_linux, which
-reports the version of some important subsystems. Run this script with
-the command ``awk -f scripts/ver_linux``.
-
-Use that information to fill in all fields of the bug report form, and
-post it to the mailing list with a subject of "PROBLEM: <one line
-summary from [1.]>" for easy identification by the developers::
-
- [1.] One line summary of the problem:
- [2.] Full description of the problem/report:
- [3.] Keywords (i.e., modules, networking, kernel):
- [4.] Kernel information
- [4.1.] Kernel version (from /proc/version):
- [4.2.] Kernel .config file:
- [5.] Most recent kernel version which did not have the bug:
- [6.] Output of Oops.. message (if applicable) with symbolic information
- resolved (see Documentation/admin-guide/bug-hunting.rst)
- [7.] A small shell script or example program which triggers the
- problem (if possible)
- [8.] Environment
- [8.1.] Software (add the output of the ver_linux script here)
- [8.2.] Processor information (from /proc/cpuinfo):
- [8.3.] Module information (from /proc/modules):
- [8.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
- [8.5.] PCI information ('lspci -vvv' as root)
- [8.6.] SCSI information (from /proc/scsi/scsi)
- [8.7.] Other information that might be relevant to the problem
- (please look in /proc and include all information that you
- think to be relevant):
- [X.] Other notes, patches, fixes, workarounds:
-
-
-Follow up
-=========
-
-Expectations for bug reporters
-------------------------------
-
-Linux kernel maintainers expect bug reporters to be able to follow up on
-bug reports. That may include running new tests, applying patches,
-recompiling your kernel, and/or re-triggering your bug. The most
-frustrating thing for maintainers is for someone to report a bug, and then
-never follow up on a request to try out a fix.
-
-That said, it's still useful for a kernel maintainer to know a bug exists
-on a supported kernel, even if you can't follow up with retests. Follow
-up reports, such as replying to the email thread with "I tried the latest
-kernel and I can't reproduce my bug anymore" are also helpful, because
-maintainers have to assume silence means things are still broken.
-
-Expectations for kernel maintainers
------------------------------------
-
-Linux kernel maintainers are busy, overworked human beings. Some times
-they may not be able to address your bug in a day, a week, or two weeks.
-If they don't answer your email, they may be on vacation, or at a Linux
-conference. Check the conference schedule at https://LWN.net for more info:
-
- https://lwn.net/Calendar/
-
-In general, kernel maintainers take 1 to 5 business days to respond to
-bugs. The majority of kernel maintainers are employed to work on the
-kernel, and they may not work on the weekends. Maintainers are scattered
-around the world, and they may not work in your time zone. Unless you
-have a high priority bug, please wait at least a week after the first bug
-report before sending the maintainer a reminder email.
-
-The exceptions to this rule are regressions, kernel crashes, security holes,
-or userspace breakage caused by new kernel behavior. Those bugs should be
-addressed by the maintainers ASAP. If you suspect a maintainer is not
-responding to these types of bugs in a timely manner (especially during a
-merge window), escalate the bug to LKML and Linus Torvalds.
-
-Thank you!
-
-[Some of this is taken from Frohwalt Egerer's original linux-kernel FAQ]
diff --git a/Documentation/admin-guide/reporting-issues.rst b/Documentation/admin-guide/reporting-issues.rst
new file mode 100644
index 000000000000..2fd5a030235a
--- /dev/null
+++ b/Documentation/admin-guide/reporting-issues.rst
@@ -0,0 +1,1764 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+.. See the bottom of this file for additional redistribution information.
+
+Reporting issues
+++++++++++++++++
+
+
+The short guide (aka TL;DR)
+===========================
+
+Are you facing a regression with vanilla kernels from the same stable or
+longterm series? One still supported? Then search the `LKML
+<https://lore.kernel.org/lkml/>`_ and the `Linux stable mailing list
+<https://lore.kernel.org/stable/>`_ archives for matching reports to join. If
+you don't find any, install `the latest release from that series
+<https://kernel.org/>`_. If it still shows the issue, report it to the stable
+mailing list (stable@vger.kernel.org) and CC the regressions list
+(regressions@lists.linux.dev); ideally also CC the maintainer and the mailing
+list for the subsystem in question.
+
+In all other cases try your best guess which kernel part might be causing the
+issue. Check the :ref:`MAINTAINERS <maintainers>` file for how its developers
+expect to be told about problems, which most of the time will be by email with a
+mailing list in CC. Check the destination's archives for matching reports;
+search the `LKML <https://lore.kernel.org/lkml/>`_ and the web, too. If you
+don't find any to join, install `the latest mainline kernel
+<https://kernel.org/>`_. If the issue is present there, send a report.
+
+The issue was fixed there, but you would like to see it resolved in a still
+supported stable or longterm series as well? Then install its latest release.
+If it shows the problem, search for the change that fixed it in mainline and
+check if backporting is in the works or was discarded; if it's neither, ask
+those who handled the change for it.
+
+**General remarks**: When installing and testing a kernel as outlined above,
+ensure it's vanilla (IOW: not patched and not using add-on modules). Also make
+sure it's built and running in a healthy environment and not already tainted
+before the issue occurs.
+
+If you are facing multiple issues with the Linux kernel at once, report each
+separately. While writing your report, include all information relevant to the
+issue, like the kernel and the distro used. In case of a regression, add the
+regressions mailing list (regressions@lists.linux.dev) to the CC of your
+report. Also try
+to pin-point the culprit with a bisection; if you succeed, include its
+commit-id and CC everyone in the sign-off-by chain.
+
+Once the report is out, answer any questions that come up and help where you
+can. That includes keeping the ball rolling by occasionally retesting with newer
+releases and sending a status update afterwards.
+
+Step-by-step guide how to report issues to the kernel maintainers
+=================================================================
+
+The above TL;DR outlines roughly how to report issues to the Linux kernel
+developers. It might be all that's needed for people already familiar with
+reporting issues to Free/Libre & Open Source Software (FLOSS) projects. For
+everyone else there is this section. It is more detailed and uses a
+step-by-step approach. It still tries to be brief for readability and leaves
+out a lot of details; those are described below the step-by-step guide in a
+reference section, which explains each of the steps in more detail.
+
+Note: this section covers a few more aspects than the TL;DR and does things in
+a slightly different order. That's in your interest, to make sure you notice
+early if an issue that looks like a Linux kernel problem is actually caused by
+something else. These steps thus help to ensure the time you invest in this
+process won't feel wasted in the end:
+
+ * Are you facing an issue with a Linux kernel a hardware or software vendor
+ provided? Then in almost all cases you are better off to stop reading this
+ document and reporting the issue to your vendor instead, unless you are
+ willing to install the latest Linux version yourself. Be aware the latter
+ will often be needed anyway to hunt down and fix issues.
+
+ * Perform a rough search for existing reports with your favorite internet
+ search engine; additionally, check the archives of the `Linux Kernel Mailing
+ List (LKML) <https://lore.kernel.org/lkml/>`_. If you find matching reports,
+ join the discussion instead of sending a new one.
+
+ * See if the issue you are dealing with qualifies as a regression, security
+ issue, or a really severe problem: those are 'issues of high priority' that
+ need special handling in some steps that are about to follow.
+
+ * Make sure it's not the kernel's surroundings that are causing the issue
+ you face.
+
+ * Create a fresh backup and put system repair and restore tools at hand.
+
+ * Ensure your system does not enhance its kernels by building additional
+ kernel modules on-the-fly, which solutions like DKMS might be doing locally
+ without your knowledge.
+
+ * Check if your kernel was 'tainted' when the issue occurred, as the event
+ that made the kernel set this flag might be causing the issue you face.
+
+ * Write down coarsely how to reproduce the issue. If you deal with multiple
+ issues at once, create separate notes for each of them and make sure they
+ work independently on a freshly booted system. That's needed, as each issue
+ needs to get reported to the kernel developers separately, unless they are
+ strongly entangled.
+
+ * If you are facing a regression within a stable or longterm version line
+ (say something broke when updating from 5.10.4 to 5.10.5), scroll down to
+ 'Dealing with regressions within a stable and longterm kernel line'.
+
+ * Locate the driver or kernel subsystem that seems to be causing the issue.
+ Find out how and where its developers expect reports. Note: most of the
+ time this won't be bugzilla.kernel.org, as issues typically need to be sent
+ by mail to a maintainer and a public mailing list.
+
+ * Search the archives of the bug tracker or mailing list in question
+ thoroughly for reports that might match your issue. If you find anything,
+ join the discussion instead of sending a new report.
+
+After these preparations you'll now enter the main part:
+
+ * Unless you are already running the latest 'mainline' Linux kernel, better
+ go and install it for the reporting process. Testing and reporting with
+ the latest 'stable' Linux can be an acceptable alternative in some
+ situations; during the merge window that actually might be even the best
+ approach, but in that development phase it can be an even better idea to
+ suspend your efforts for a few days anyway. Whatever version you choose,
+ ideally use a 'vanilla' build. Ignoring this advice will dramatically
+ increase the risk that your report will be rejected or ignored.
+
+ * Ensure the kernel you just installed does not 'taint' itself when
+ running.
+
+ * Reproduce the issue with the kernel you just installed. If it doesn't show
+ up there, scroll down to the instructions for issues only happening with
+ stable and longterm kernels.
+
+ * Optimize your notes: try to find and write the most straightforward way to
+ reproduce your issue. Make sure the end result has all the important
+ details, and at the same time is easy to read and understand for others
+ that hear about it for the first time. And if you learned something in this
+ process, consider searching again for existing reports about the issue.
+
+ * If your failure involves a 'panic', 'Oops', 'warning', or 'BUG', consider
+ decoding the kernel log to find the line of code that triggered the error.
+
+ * If your problem is a regression, try to narrow down when the issue was
+ introduced as much as possible.
+
+ * Start to compile the report by writing a detailed description about the
+ issue. Always mention a few things: the latest kernel version you installed
+ for reproducing, the Linux Distribution used, and your notes on how to
+ reproduce the issue. Ideally, make the kernel's build configuration
+ (.config) and the output from ``dmesg`` available somewhere on the net and
+ link to it. Include or upload all other information that might be relevant,
+ like the output/screenshot of an Oops or the output from ``lspci``. Once
+ you wrote this main part, insert a normal length paragraph on top of it
+ outlining the issue and the impact quickly. On top of this add one sentence
+ that briefly describes the problem and gets people to read on. Now give the
+ thing a descriptive title or subject that yet again is shorter. Then you're
+ ready to send or file the report like the MAINTAINERS file told you, unless
+ you are dealing with one of those 'issues of high priority': they need
+ special care which is explained in 'Special handling for high priority
+ issues' below.
+
+ * Wait for reactions and keep the thing rolling until you can accept the
+ outcome in one way or the other. Thus react publicly and in a timely manner
+ to any inquiries. Test proposed fixes. Do proactive testing: retest with at
+ least every first release candidate (RC) of a new mainline version and
+ report your results. Send friendly reminders if things stall. And try to
+ help yourself, if you don't get any help or if it's unsatisfying.
+
+
+Reporting regressions within a stable and longterm kernel line
+--------------------------------------------------------------
+
+This subsection is for you, if you followed the above process and got sent here at
+the point about regression within a stable or longterm kernel version line. You
+face one of those if something breaks when updating from 5.10.4 to 5.10.5 (a
+switch from 5.9.15 to 5.10.5 does not qualify). The developers want to fix such
+regressions as quickly as possible, hence there is a streamlined process to
+report them:
+
+ * Check if the kernel developers still maintain the Linux kernel version
+ line you care about: go to the `front page of kernel.org
+ <https://kernel.org/>`_ and make sure it mentions
+ the latest release of the particular version line without an '[EOL]' tag.
+
+ * Check the archives of the `Linux stable mailing list
+ <https://lore.kernel.org/stable/>`_ for existing reports.
+
+ * Install the latest release from the particular version line as a vanilla
+ kernel. Ensure this kernel is not tainted and still shows the problem, as
+ the issue might have already been fixed there. If you first noticed the
+ problem with a vendor kernel, check that a vanilla build of the last version
+ known to work performs fine as well.
+
+ * Send a short problem report to the Linux stable mailing list
+ (stable@vger.kernel.org) and CC the Linux regressions mailing list
+ (regressions@lists.linux.dev); if you suspect the cause in a particular
+ subsystem, CC its maintainer and its mailing list. Roughly describe the
+ issue and ideally explain how to reproduce it. Mention the first version
+ that shows the problem and the last version that's working fine. Then
+ wait for further instructions.
+
+The reference section below explains each of these steps in more detail.
+
+
+Reporting issues only occurring in older kernel version lines
+-------------------------------------------------------------
+
+This subsection is for you, if you tried the latest mainline kernel as outlined
+above, but failed to reproduce your issue there; at the same time you want to
+see the issue fixed in a still supported stable or longterm series or vendor
+kernels regularly rebased on those. If that is the case, follow these steps:
+
+ * Prepare yourself for the possibility that going through the next few steps
+ might not get the issue solved in older releases: the fix might be too big
+ or risky to get backported there.
+
+ * Perform the first three steps in the section "Dealing with regressions
+ within a stable and longterm kernel line" above.
+
+ * Search the Linux kernel version control system for the change that fixed
+ the issue in mainline, as its commit message might tell you if the fix is
+ scheduled for backporting already. If you don't find anything that way,
+ search the appropriate mailing lists for posts that discuss such an issue
+ or peer-review possible fixes; then check the discussions if the fix was
+ deemed unsuitable for backporting. If backporting was not considered at
+ all, join the newest discussion, asking if it's in the cards.
+
+ * One of the former steps should lead to a solution. If that doesn't work
+ out, ask the maintainers for the subsystem that seems to be causing the
+ issue for advice; CC the mailing list for the particular subsystem as well
+ as the stable mailing list.
+
+The reference section below explains each of these steps in more detail.
+
+
+Reference section: Reporting issues to the kernel maintainers
+=============================================================
+
+The detailed guides above outline all the major steps in brief fashion, which
+should be enough for most people. But sometimes there are situations where even
+experienced users might wonder how to actually do one of those steps. That's
+what this section is for, as it will provide a lot more details on each of the
+above steps. Consider this as reference documentation: it's possible to read it
+from top to bottom. But it's mainly meant to be skimmed over and used as a
+place to look up details on how to actually perform those steps.
+
+A few words of general advice before digging into the details:
+
+ * The Linux kernel developers are well aware this process is complicated and
+ demands more than other FLOSS projects. We'd love to make it simpler. But
+ that would require work in various places as well as some infrastructure,
+ which would need constant maintenance; nobody has stepped up to do that
+ work, so that's just how things are for now.
+
+ * A warranty or support contract with some vendor doesn't entitle you to
+ request fixes from developers in the upstream Linux kernel community: such
+ contracts are completely outside the scope of the Linux kernel, its
+ development community, and this document. That's why you can't demand
+ anything such a contract guarantees in this context, not even if the
+ developer handling the issue works for the vendor in question. If you want
+ to claim your rights, use the vendor's support channel instead. When doing
+ so, you might want to mention you'd like to see the issue fixed in the
+ upstream Linux kernel; motivate them by saying it's the only way to ensure
+ the fix in the end will get incorporated in all Linux distributions.
+
+ * If you never reported an issue to a FLOSS project before you should consider
+ reading `How to Report Bugs Effectively
+ <https://www.chiark.greenend.org.uk/~sgtatham/bugs.html>`_, `How To Ask
+ Questions The Smart Way
+ <http://www.catb.org/esr/faqs/smart-questions.html>`_, and `How to ask good
+ questions <https://jvns.ca/blog/good-questions/>`_.
+
+With that off the table, find below the details on how to properly report
+issues to the Linux kernel developers.
+
+
+Make sure you're using the upstream Linux kernel
+------------------------------------------------
+
+ *Are you facing an issue with a Linux kernel a hardware or software vendor
+ provided? Then in almost all cases you are better off to stop reading this
+ document and reporting the issue to your vendor instead, unless you are
+ willing to install the latest Linux version yourself. Be aware the latter
+ will often be needed anyway to hunt down and fix issues.*
+
+Like most programmers, Linux kernel developers don't like to spend time dealing
+with reports for issues that don't even happen with their current code. It's
+just a waste of everybody's time, especially yours. Unfortunately such
+situations easily happen when it comes to the kernel and often lead to
+frustration on both sides. That's because almost all Linux-based kernels
+pre-installed on devices (Computers, Laptops, Smartphones, Routers, …) and most
+shipped by Linux distributors are quite distant from the official Linux kernel
+as distributed by kernel.org: the kernels from these vendors are often ancient
+from the point of view of Linux development or heavily modified, often both.
+
+Most of these vendor kernels are quite unsuitable for reporting issues to the
+Linux kernel developers: an issue you face with one of them might have been
+fixed by the Linux kernel developers months or years ago already; additionally,
+the modifications and enhancements by the vendor might be causing the issue you
+face, even if they look small or totally unrelated. That's why you should report
+issues with these kernels to the vendor. Its developers should look into the
+report and, in case it turns out to be an upstream issue, fix it directly
+upstream or forward the report there. In practice that often does not work out
+or might not be what you want. You thus might want to consider circumventing
+the vendor by installing the very latest Linux kernel core yourself. If that's
+an option for you, move ahead in this process, as a later step in this guide will
+explain how to do that once it rules out other potential causes for your issue.
+
+Note, the previous paragraph starts with the word 'most', as sometimes
+developers in fact are willing to handle reports about issues occurring with
+vendor kernels. Whether they do in the end highly depends on the developers and
+the issue in question. Your chances are quite good if the distributor applied
+only small modifications to a kernel based on a recent Linux version; that for
+example often holds true for the mainline kernels shipped by Debian GNU/Linux
+Sid or Fedora Rawhide. Some developers will also accept reports about issues
+with kernels from distributions shipping the latest stable kernel, as long as
+it's only slightly modified; that for example is often the case for Arch Linux,
+regular Fedora releases, and openSUSE Tumbleweed. But keep in mind that you had
+better use a mainline Linux kernel and avoid using a stable kernel for this
+process, as outlined in the section 'Install a fresh kernel for testing' in more
+detail.
+
+Obviously you are free to ignore all this advice and report problems with an old
+or heavily modified vendor kernel to the upstream Linux developers. But note,
+those often get rejected or ignored, so consider yourself warned. But it's still
+better than not reporting the issue at all: sometimes such reports directly or
+indirectly will help to get the issue fixed over time.
+
+
+Search for existing reports, first run
+--------------------------------------
+
+ *Perform a rough search for existing reports with your favorite internet
+ search engine; additionally, check the archives of the Linux Kernel Mailing
+ List (LKML). If you find matching reports, join the discussion instead of
+ sending a new one.*
+
+Reporting an issue that someone else already brought forward is often a waste of
+time for everyone involved, especially you as the reporter. So it's in your own
+interest to thoroughly check if somebody reported the issue already. At this
+step of the process it's okay to just perform a rough search: a later step will
+tell you to perform a more detailed search once you know where your issue needs
+to be reported to. Nevertheless, do not hurry with this step of the reporting
+process, it can save you time and trouble.
+
+Simply search the internet with your favorite search engine first. Afterwards,
+search the `Linux Kernel Mailing List (LKML) archives
+<https://lore.kernel.org/lkml/>`_.
+
+If you get flooded with results consider telling your search engine to limit
+the search timeframe to the past month or year. And wherever you search, make sure
+to use good search terms; vary them a few times, too. While doing so try to
+look at the issue from the perspective of someone else: that will help you to
+come up with other words to use as search terms. Also make sure not to use too
+many search terms at once. Remember to search with and without information like
+the name of the kernel driver or the name of the affected hardware component.
+But its exact brand name (say 'ASUS Red Devil Radeon RX 5700 XT Gaming OC')
+often is not all that helpful, as it is too specific. Instead try search terms like
+the model line (Radeon 5700 or Radeon 5000) and the code name of the main chip
+('Navi' or 'Navi10') with and without its manufacturer ('AMD').
+
+In case you find an existing report about your issue, join the discussion, as
+you might be able to provide valuable additional information. That can be
+important even when a fix is prepared or in its final stages already, as
+developers might look for people that can provide additional information or
+test a proposed fix. Jump to the section 'Duties after the report went out' for
+details on how to get properly involved.
+
+Note, searching `bugzilla.kernel.org <https://bugzilla.kernel.org/>`_ might also
+be a good idea, as that might provide valuable insights or turn up matching
+reports. If you find the latter, just keep in mind: most subsystems expect
+reports in different places, as described below in the section "Check where you
+need to report your issue". The developers that should take care of the issue
+thus might not even be aware of the bugzilla ticket. Hence, check in the ticket
+if the issue was already reported as outlined in this document and, if not,
+consider doing so.
+
+
+Issue of high priority?
+-----------------------
+
+ *See if the issue you are dealing with qualifies as a regression, security
+ issue, or a really severe problem: those are 'issues of high priority' that
+ need special handling in some steps that are about to follow.*
+
+Linus Torvalds and the leading Linux kernel developers want to see some issues
+fixed as soon as possible, hence there are 'issues of high priority' that get
+handled slightly differently in the reporting process. Three types of cases
+qualify: regressions, security issues, and really severe problems.
+
+You deal with a regression if some application or practical use case running
+fine with one Linux kernel works worse or not at all with a newer version
+compiled using a similar configuration. The document
+Documentation/admin-guide/reporting-regressions.rst explains this in more
+detail. It also provides a good deal of other information about regressions you
+might want to be aware of; it for example explains how to add your issue to the
+list of tracked regressions, to ensure it won't fall through the cracks.
+
+What qualifies as a security issue is left to your judgment. Consider reading
+Documentation/process/security-bugs.rst before proceeding, as it
+provides additional details how to best handle security issues.
+
+An issue is a 'really severe problem' when something totally unacceptably bad
+happens. That's for example the case when a Linux kernel corrupts the data it's
+handling or damages hardware it's running on. You're also dealing with a severe
+issue when the kernel suddenly stops working with an error message ('kernel
+panic') or without any farewell note at all. Note: do not confuse a 'panic' (a
+fatal error where the kernel stops itself) with an 'Oops' (a recoverable error),
+as the kernel remains running after the latter.
+
+
+Ensure a healthy environment
+----------------------------
+
+ *Make sure it's not the kernel's surroundings that are causing the issue
+ you face.*
+
+Problems that look a lot like a kernel issue are sometimes caused by the build
+or runtime environment. It's hard to rule out such problems completely, but you
+should try to minimize them:
+
+ * Use proven tools when building your kernel, as bugs in the compiler or the
+ binutils can cause the resulting kernel to misbehave.
+
+ * Ensure your computer components run within their design specifications;
+ that's especially important for the main processor, the main memory, and the
+ motherboard. Therefore, stop undervolting or overclocking when facing a
+ potential kernel issue.
+
+ * Try to make sure it's not faulty hardware that is causing your issue. Bad
+ main memory for example can result in a multitude of issues that will
+ manifest themselves in problems looking like kernel issues.
+
+ * If you're dealing with a filesystem issue, you might want to check the file
+ system in question with ``fsck``, as it might be damaged in a way that leads
+ to unexpected kernel behavior (see the sketch right after this list).
+
+ * When dealing with a regression, make sure it's not something else that
+ changed in parallel to updating the kernel. The problem for example might be
+ caused by other software that was updated at the same time. It can also
+ happen that a hardware component coincidentally just broke when you rebooted
+ into a new kernel for the first time. Updating the system's BIOS or changing
+ something in the BIOS Setup can also lead to problems that look a lot
+ like a kernel regression.
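+
+A rough sketch of such a filesystem check, assuming the affected filesystem is
+the hypothetical ``/dev/sdb1`` currently mounted at ``/mnt/data``; never run
+``fsck`` on a mounted filesystem::
+
+ sudo umount /mnt/data
+ sudo fsck -f /dev/sdb1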
+
+
+Prepare for emergencies
+-----------------------
+
+ *Create a fresh backup and put system repair and restore tools at hand.*
+
+Reminder, you are dealing with computers, which sometimes do unexpected things,
+especially if you fiddle with crucial parts like the kernel of their operating
+system. That's what you are about to do in this process. Thus, make sure to
+create a fresh backup; also ensure you have all tools at hand to repair or
+reinstall the operating system as well as everything you need to restore the
+backup.
+
+
+Make sure your kernel doesn't get enhanced
+------------------------------------------
+
+ *Ensure your system does not enhance its kernels by building additional
+ kernel modules on-the-fly, which solutions like DKMS might be doing locally
+ without your knowledge.*
+
+The risk your issue report gets ignored or rejected dramatically increases if
+your kernel gets enhanced in any way. That's why you should remove or disable
+mechanisms like akmods and DKMS: those build add-on kernel modules
+automatically, for example when you install a new Linux kernel or boot it for
+the first time. Also remove any modules they might have installed. Then reboot
+before proceeding.
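+
+If you are unsure whether DKMS manages any modules on your system, the
+following command is a quick way to check; akmods based setups need to be
+inspected with your distribution's respective tools instead::
+
+ dkms status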
+
+Note, you might not be aware that your system is using one of these solutions:
+they often get set up silently when you install Nvidia's proprietary graphics
+driver, VirtualBox, or other software that requires some support from a
+module not part of the Linux kernel. That's why you might need to uninstall the
+packages with such software to get rid of any 3rd party kernel module.
+
+
+Check 'taint' flag
+------------------
+
+ *Check if your kernel was 'tainted' when the issue occurred, as the event
+ that made the kernel set this flag might be causing the issue you face.*
+
+The kernel marks itself with a 'taint' flag when something happens that might
+lead to follow-up errors that look totally unrelated. The issue you face might
+be such an error if your kernel is tainted. That's why it's in your interest to
+rule this out early before investing more time into this process. This is the
+only reason why this step is here, as this process later will tell you to
+install the latest mainline kernel; you will need to check the taint flag again
+then, as that's when it matters because it's the kernel the report will focus
+on.
+
+On a running system it is easy to check if the kernel tainted itself: if ``cat
+/proc/sys/kernel/tainted`` returns '0' then the kernel is not tainted and
+everything is fine. Checking that file is impossible in some situations; that's
+why the kernel also mentions the taint status when it reports an internal
+problem (a 'kernel bug'), a recoverable error (a 'kernel Oops') or a
+non-recoverable error before halting operation (a 'kernel panic'). Look near
+the top of the error messages printed when one of these occurs and search for a
+line starting with 'CPU:'. It should end with 'Not tainted' if the kernel was
+not tainted when it noticed the problem; it was tainted if you see 'Tainted:'
+followed by a few spaces and some letters.
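+
+A quick check on a running system thus might look like this; note the second
+command is just a sketch that only turns up something if the kernel already
+reported a problem::
+
+ cat /proc/sys/kernel/tainted
+ dmesg | grep -i 'taint'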
+
+If your kernel is tainted, study Documentation/admin-guide/tainted-kernels.rst
+to find out why. Try to eliminate the reason. Often it's caused by one of these
+three things:
+
+ 1. A recoverable error (a 'kernel Oops') occurred and the kernel tainted
+ itself, as the kernel knows it might misbehave in strange ways after that
+ point. In that case check your kernel or system log and look for a section
+ that starts with this::
+
+ Oops: 0000 [#1] SMP
+
+ That's the first Oops since boot-up, as the '#1' between the brackets shows.
+ Every Oops and any other problem that happens after that point might be a
+ follow-up problem to that first Oops, even if both look totally unrelated.
+ Rule this out by getting rid of the cause for the first Oops and reproducing
+ the issue afterwards. Sometimes simply restarting will be enough, sometimes
+ a change to the configuration followed by a reboot can eliminate the Oops.
+ But don't invest too much time into this at this point of the process, as
+ the cause for the Oops might already be fixed in the newer Linux kernel
+ version you are going to install later in this process.
+
+ 2. Your system uses a software that installs its own kernel modules, for
+ example Nvidia's proprietary graphics driver or VirtualBox. The kernel
+ taints itself when it loads such modules from external sources (even if
+ they are Open Source): they sometimes cause errors in unrelated kernel
+ areas and thus might be causing the issue you face. You therefore have to
+ prevent those modules from loading when you want to report an issue to the
+ Linux kernel developers. Most of the time the easiest way to do that is:
+ temporarily uninstall such software including any modules they might have
+ installed. Afterwards reboot.
+
+ 3. The kernel also taints itself when it's loading a module that resides in
+ the staging tree of the Linux kernel source. That's a special area for
+ code (mostly drivers) that does not yet fulfill the normal Linux kernel
+ quality standards. When you report an issue with such a module it's
+ obviously okay if the kernel is tainted; just make sure the module in
+ question is the only reason for the taint. If the issue happens in an
+ unrelated area reboot and temporarily block the module from being loaded
+ by specifying ``foo.blacklist=1`` as kernel parameter (replace 'foo' with
+ the name of the module in question).
+
+
+Document how to reproduce issue
+-------------------------------
+
+ *Write down coarsely how to reproduce the issue. If you deal with multiple
+ issues at once, create separate notes for each of them and make sure they
+ work independently on a freshly booted system. That's needed, as each issue
+ needs to get reported to the kernel developers separately, unless they are
+ strongly entangled.*
+
+If you deal with multiple issues at once, you'll have to report each of them
+separately, as they might be handled by different developers. Describing
+various issues in one report also makes it quite difficult for others to tear
+it apart. Hence, only combine issues in one report if they are very strongly
+entangled.
+
+Additionally, during the reporting process you will have to test if the issue
+happens with other kernel versions. Therefore, it will make your work easier if
+you know exactly how to reproduce an issue quickly on a freshly booted system.
+
+Note: it's often fruitless to report issues that only happened once, as they
+might be caused by a bit flip due to cosmic radiation. That's why you should
+try to rule that out by reproducing the issue before going further. Feel free
+to ignore this advice if you are experienced enough to tell a one-time error
+due to faulty hardware apart from a kernel issue that rarely happens and thus
+is hard to reproduce.
+
+
+Regression in stable or longterm kernel?
+----------------------------------------
+
+ *If you are facing a regression within a stable or longterm version line
+ (say something broke when updating from 5.10.4 to 5.10.5), scroll down to
+ 'Dealing with regressions within a stable and longterm kernel line'.*
+
+Regressions within a stable and longterm kernel version line are something the
+Linux developers want to fix badly, as such issues are even more unwanted than
+regressions in the main development branch, as they can quickly affect a lot of
+people. The developers thus want to learn about such issues as quickly as
+possible, hence there is a streamlined process to report them. Note,
+regressions within a newer kernel version line (say something broke when
+switching from 5.9.15 to 5.10.5) do not qualify.
+
+
+Check where you need to report your issue
+-----------------------------------------
+
+ *Locate the driver or kernel subsystem that seems to be causing the issue.
+ Find out how and where its developers expect reports. Note: most of the
+ time this won't be bugzilla.kernel.org, as issues typically need to be sent
+ by mail to a maintainer and a public mailing list.*
+
+It's crucial to send your report to the right people, as the Linux kernel is a
+big project and most of its developers are only familiar with a small subset of
+it. Quite a few programmers for example only care for just one driver, for
+example one for a WiFi chip; its developer likely will only have small or no
+knowledge about the internals of remote or unrelated "subsystems", like the TCP
+stack, the PCIe/PCI subsystem, memory management or file systems.
+
+Problem is: the Linux kernel lacks a central bug tracker where you can simply
+file your issue and make it reach the developers that need to know about it.
+That's why you have to find the right place and way to report issues yourself.
+You can do that with the help of a script (see below), but it mainly targets
+kernel developers and experts. For everybody else the MAINTAINERS file is the
+better place.
+
+How to read the MAINTAINERS file
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To illustrate how to use the :ref:`MAINTAINERS <maintainers>` file, let's assume
+the WiFi in your laptop suddenly misbehaves after updating the kernel. In that
+case it's likely an issue in the WiFi driver. Obviously it could also be some
+code it builds upon, but unless you suspect something like that stick to the
+driver. If it's really something else, the driver's developers will get the
+right people involved.
+
+Sadly, there is no universal and easy way to check which code is driving a
+particular hardware component.
+
+In case of a problem with the WiFi driver you for example might want to look at
+the output of ``lspci -k``, as it lists devices on the PCI/PCIe bus and the
+kernel modules driving them::
+
+ [user@something ~]$ lspci -k
+ [...]
+ 3a:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
+ Subsystem: Bigfoot Networks, Inc. Device 1535
+ Kernel driver in use: ath10k_pci
+ Kernel modules: ath10k_pci
+ [...]
+
+But this approach won't work if your WiFi chip is connected over USB or some
+other internal bus. In those cases you might want to check your WiFi manager or
+the output of ``ip link``. Look for the name of the problematic network
+interface, which might be something like 'wlp58s0'. This name can be used like
+this to find the module driving it::
+
+ [user@something ~]$ realpath --relative-to=/sys/module/ /sys/class/net/wlp58s0/device/driver/module
+ ath10k_pci
+
+In case tricks like these don't bring you any further, try to search the
+internet on how to narrow down the driver or subsystem in question. And if you
+are unsure which it is: just try your best guess, somebody will help you if you
+guessed poorly.
+
+Once you know the driver or subsystem, you want to search for it in the
+MAINTAINERS file. In the case of 'ath10k_pci' you won't find anything, as the
+name is too specific. Sometimes you will need to search on the net for help;
+but before doing so, try a somewhat shortened or modified name when searching the
+MAINTAINERS file, as then you might find something like this::
+
+ QUALCOMM ATHEROS ATH10K WIRELESS DRIVER
+ Mail: A. Some Human <shuman@example.com>
+ Mailing list: ath10k@lists.infradead.org
+ Status: Supported
+ Web-page: https://wireless.wiki.kernel.org/en/users/Drivers/ath10k
+ SCM: git git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git
+ Files: drivers/net/wireless/ath/ath10k/
+
+Note: the line descriptions will be abbreviated if you read the plain
+MAINTAINERS file found in the root of the Linux source tree. 'Mail:' for
+example will be 'M:', 'Mailing list:' will be 'L:', and 'Status:' will be 'S:'.
+A section near the top of the file explains these and other abbreviations.
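+
+When working with the plain MAINTAINERS file, a simple way to look at such a
+section is a quick grep that prints a bit of context around every match; a
+sketch only, adjust the search term and the amount of context as needed::
+
+ $ grep -i -B1 -A12 'ath10k' MAINTAINERS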
+
+First look at the line 'Status'. Ideally it should be 'Supported' or
+'Maintained'. If it states 'Obsolete' then you are using some outdated approach
+that was replaced by a newer solution you need to switch to. Sometimes the code
+only has someone who provides 'Odd Fixes' when feeling motivated. And with
+'Orphan' you are totally out of luck, as nobody takes care of the code anymore.
+That only leaves these options: arrange yourself to live with the issue, fix it
+yourself, or find a programmer somewhere willing to fix it.
+
+After checking the status, look for a line starting with 'bugs:': it will tell
+you where to find a subsystem specific bug tracker to file your issue. The
+example above does not have such a line. That is the case for most sections, as
+Linux kernel development is completely driven by mail. Very few subsystems use
+a bug tracker, and only some of those rely on bugzilla.kernel.org.
+
+In this and many other cases you thus have to look for lines starting with
+'Mail:' instead. Those mention the name and the email addresses for the
+maintainers of the particular code. Also look for a line starting with 'Mailing
+list:', which tells you the public mailing list where the code is developed.
+Your report later needs to go by mail to those addresses. Additionally, for all
+issue reports sent by email, make sure to add the Linux Kernel Mailing List
+(LKML) <linux-kernel@vger.kernel.org> to CC. Don't omit either of the mailing
+lists when sending your issue report by mail later! Maintainers are busy people
+and might leave some work for other developers on the subsystem specific list;
+and LKML is important to have one place where all issue reports can be found.
+
+
+Finding the maintainers with the help of a script
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For people that have the Linux sources at hand there is a second option to find
+the proper place to report: the script 'scripts/get_maintainer.pl' which tries
+to find all people to contact. It queries the MAINTAINERS file and needs to be
+called with a path to the source code in question. For drivers compiled as
+a module it often can be found with a command like this::
+
+ $ modinfo ath10k_pci | grep filename | sed 's!/lib/modules/.*/kernel/!!; s!filename:!!; s!\.ko\(\|\.xz\)!!'
+ drivers/net/wireless/ath/ath10k/ath10k_pci.ko
+
+Pass parts of this to the script::
+
+ $ ./scripts/get_maintainer.pl -f drivers/net/wireless/ath/ath10k*
+ Some Human <shuman@example.com> (supporter:QUALCOMM ATHEROS ATH10K WIRELESS DRIVER)
+ Another S. Human <asomehuman@example.com> (maintainer:NETWORKING DRIVERS)
+ ath10k@lists.infradead.org (open list:QUALCOMM ATHEROS ATH10K WIRELESS DRIVER)
+ linux-wireless@vger.kernel.org (open list:NETWORKING DRIVERS (WIRELESS))
+ netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
+ linux-kernel@vger.kernel.org (open list)
+
+Don't send your report to all of them. Send it to the maintainers, which the
+script calls "supporter:"; additionally CC the most specific mailing list for
+the code as well as the Linux Kernel Mailing List (LKML). In this case you thus
+would need to send the report to 'Some Human <shuman@example.com>' with
+'ath10k@lists.infradead.org' and 'linux-kernel@vger.kernel.org' in CC.
+
+Note: in case you cloned the Linux sources with git you might want to call
+``get_maintainer.pl`` a second time with ``--git``. The script then will look
+at the commit history to find which people recently worked on the code in
+question, as they might be able to help. But use these results with care, as it
+can easily send you in a wrong direction. That for example happens quickly in
+areas rarely changed (like old or unmaintained drivers): sometimes such code is
+modified during tree-wide cleanups by developers that do not care about the
+particular driver at all.
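+
+Such a second invocation might look like this sketch, which reuses the example
+path from above::
+
+ $ ./scripts/get_maintainer.pl --git -f drivers/net/wireless/ath/ath10k*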
+
+
+Search for existing reports, second run
+---------------------------------------
+
+ *Search the archives of the bug tracker or mailing list in question
+ thoroughly for reports that might match your issue. If you find anything,
+ join the discussion instead of sending a new report.*
+
+As mentioned earlier already: reporting an issue that someone else already
+brought forward is often a waste of time for everyone involved, especially you
+as the reporter. That's why you should search for existing reports again, now
+that you know where they need to be reported to. If it's a mailing list, you will
+often find its archives on `lore.kernel.org <https://lore.kernel.org/>`_.
+
+But some lists are hosted in different places. That for example is the case for
+the ath10k WiFi driver used as an example in the previous step. But you'll often
+find the archives for these lists easily on the net. Searching for 'archive
+ath10k@lists.infradead.org' for example will lead you to the `Info page for the
+ath10k mailing list <https://lists.infradead.org/mailman/listinfo/ath10k>`_,
+which at the top links to its
+`list archives <https://lists.infradead.org/pipermail/ath10k/>`_. Sadly, this and
+quite a few other lists lack a way to search their archives. In those cases use a
+regular internet search engine and add something like
+'site:lists.infradead.org/pipermail/ath10k/' to your search terms, which limits
+the results to the archives at that URL.
+
+It's also wise to check the internet, LKML and maybe bugzilla.kernel.org again
+at this point. If your report needs to be filed in a bug tracker, you may want
+to check the mailing list archives for the subsystem as well, as someone might
+have reported it only there.
+
+For details how to search and what to do if you find matching reports see
+"Search for existing reports, first run" above.
+
+Do not hurry with this step of the reporting process: spending 30 to 60 minutes
+or even more time can save you and others quite a lot of time and trouble.
+
+
+Install a fresh kernel for testing
+----------------------------------
+
+ *Unless you are already running the latest 'mainline' Linux kernel, better
+ go and install it for the reporting process. Testing and reporting with
+ the latest 'stable' Linux can be an acceptable alternative in some
+ situations; during the merge window that actually might be even the best
+ approach, but in that development phase it can be an even better idea to
+ suspend your efforts for a few days anyway. Whatever version you choose,
+ ideally use a 'vanilla' build. Ignoring this advice will dramatically
+ increase the risk that your report will be rejected or ignored.*
+
+As mentioned in the detailed explanation for the first step already: Like most
+programmers, Linux kernel developers don't like to spend time dealing with
+reports for issues that don't even happen with the current code. It's just a
+waste of everybody's time, especially yours. That's why it's in everybody's
+interest that you confirm the issue still exists with the latest upstream code
+before reporting it. You are free to ignore this advice, but as outlined
+earlier: doing so dramatically increases the risk that your issue report might
+get rejected or simply ignored.
+
+In the scope of the kernel "latest upstream" normally means:
+
+ * Install a mainline kernel; the latest stable kernel can be an option, but
+ most of the time it's better avoided. Longterm kernels (sometimes called 'LTS
+ kernels') are unsuitable at this point of the process. The next subsection
+ explains all of this in more detail.
+
+ * The subsection after the next one describes ways to obtain and install such
+ a kernel. It also outlines that using a pre-compiled kernel is fine, but a
+ vanilla one is better, which means: it was built using Linux sources taken straight `from
+ kernel.org <https://kernel.org/>`_ and not modified or enhanced in any way.
+
+Choosing the right version for testing
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Head over to `kernel.org <https://kernel.org/>`_ to find out which version you
+want to use for testing. Ignore the big yellow button that says 'Latest release'
+and look a little lower at the table. At its top you'll see a line starting with
+mainline, which most of the time will point to a pre-release with a version
+number like '5.8-rc2'. If that's the case, you'll want to use this mainline
+kernel for testing, as that's where all fixes have to be applied first. Do not let
+that 'rc' scare you, these 'development kernels' are pretty reliable — and you
+made a backup, as you were instructed above, didn't you?
+
+In about two out of every nine to ten weeks, mainline might point you to a
+proper release with a version number like '5.7'. If that happens, consider
+suspending the reporting process until the first pre-release of the next
+version (5.8-rc1) shows up on kernel.org. That's because the Linux development
+cycle then is in its two-week long 'merge window'. The bulk of the changes and
+all intrusive ones get merged for the next release during this time. It's a bit
+more risky to use mainline during this period. Kernel developers are also often
+quite busy then and might have no spare time to deal with issue reports. It's
+also quite possible that one of the many changes applied during the merge
+window fixes the issue you face; that's why you soon would have to retest with
+a newer kernel version anyway, as outlined below in the section 'Duties after
+the report went out'.
+
+That's why it might make sense to wait till the merge window is over. But don't
+do that if you're dealing with something that shouldn't wait. In that case
+consider obtaining the latest mainline kernel via git (see below) or use the
+latest stable version offered on kernel.org. Using that is also acceptable in
+case mainline for some reason currently does not work for you. And in general:
+using it for reproducing the issue is still better than not reporting the issue
+at all.
+
+Better avoid using the latest stable kernel outside merge windows, as all fixes
+must be applied to mainline first. That's why checking the latest mainline
+kernel is so important: any issue you want to see fixed in older version lines
+needs to be fixed in mainline first before it can get backported, which can
+take a few days or weeks. Another reason: the fix you hope for might be too
+hard or risky for backporting; reporting the issue again hence is unlikely to
+change anything.
+
+These aspects are also why longterm kernels (sometimes called "LTS kernels")
+are unsuitable for this part of the reporting process: they are too distant from
+the current code. Hence go and test mainline first and follow the process
+further: if the issue doesn't occur with mainline, it will guide you on how to get
+it fixed in older version lines, if that's in the cards for the fix in question.
+
+How to obtain a fresh Linux kernel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Using a pre-compiled kernel**: This is often the quickest, easiest, and safest
+way for testing — especially if you are unfamiliar with the Linux kernel. The
+problem: most of those shipped by distributors or add-on repositories are built
+from modified Linux sources. They are thus not vanilla and therefore often
+unsuitable for testing and issue reporting: the changes might cause the issue
+you face or influence it somehow.
+
+But you are in luck if you are using a popular Linux distribution: for quite a
+few of them you'll find repositories on the net that contain packages with the
+latest mainline or stable Linux built as vanilla kernel. It's totally okay to
+use these, just make sure from the repository's description they are vanilla or
+at least close to it. Additionally ensure the packages contain the latest
+versions as offered on kernel.org. The packages are likely unsuitable if they
+are older than a week, as new mainline and stable kernels typically get released
+at least once a week.
+
+Please note that you might need to build your own kernel manually later: that's
+sometimes needed for debugging or testing fixes, as described later in this
+document. Also be aware that pre-compiled kernels might lack debug symbols that
+are needed to decode messages the kernel prints when a panic, Oops, warning, or
+BUG occurs; if you plan to decode those, you might be better off compiling a
+kernel yourself (see the end of this subsection and the section titled 'Decode
+failure messages' for details).
+
+**Using git**: Developers and experienced Linux users familiar with git are
+often best served by obtaining the latest Linux kernel sources straight from the
+`official development repository on kernel.org
+<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/>`_.
+Those are likely a bit ahead of the latest mainline pre-release. Don't worry
+about it: they are as reliable as a proper pre-release, unless the kernel's
+development cycle is currently in the middle of a merge window. But even then
+they are quite reliable.
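+
+If you want to go this route, retrieving and switching to the latest mainline
+sources might look roughly like this; the ``--depth 1`` option is just an
+optional way to reduce the download size (leave it out if you might want to
+bisect later)::
+
+ $ git clone --depth 1 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
+ $ cd linux/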
+
+**Conventional**: People unfamiliar with git are often best served by
+downloading the sources as tarball from `kernel.org <https://kernel.org/>`_.
+
+How to actually build a kernel is not described here, as many websites explain
+the necessary steps already. If you are new to it, consider following one of
+those how-to's that suggest using ``make localmodconfig``, as that tries to
+pick up the configuration of your current kernel and then adjusts it
+somewhat for your system. That does not make the resulting kernel any better,
+but it will be quicker to compile.
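+
+Just to give a rough idea of what such how-to's describe, a minimal build and
+install sequence on many systems looks something like the lines below; treat
+this as a sketch only, as details like bootloader handling, module signing, or
+package-based installation differ between distributions::
+
+ $ make localmodconfig
+ $ make -j $(nproc)
+ $ sudo make modules_install
+ $ sudo make install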
+
+Note: If you are dealing with a panic, Oops, warning, or BUG from the kernel,
+please try to enable CONFIG_KALLSYMS when configuring your kernel.
+Additionally, enable CONFIG_DEBUG_KERNEL and CONFIG_DEBUG_INFO, too; the
+latter is the relevant one of those two, but can only be reached if you enable
+the former. Be aware CONFIG_DEBUG_INFO increases the storage space required to
+build a kernel by quite a bit. But that's worth it, as these options will allow
+you later to pinpoint the exact line of code that triggers your issue. The
+section 'Decode failure messages' below explains this in more detail.
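+
+One way to enable these options without navigating the configuration menus is
+the ``scripts/config`` helper shipped with the Linux sources; afterwards run
+``make olddefconfig`` to let the build system sort out dependent options. A
+sketch (double-check the result, as on some kernel versions additional options
+related to debug info might need to be selected)::
+
+ $ ./scripts/config -e CONFIG_KALLSYMS -e CONFIG_DEBUG_KERNEL -e CONFIG_DEBUG_INFO
+ $ make olddefconfig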
+
+But keep in mind: Always keep a record of the issue encountered in case it is
+hard to reproduce. Sending an undecoded report is better than not reporting
+the issue at all.
+
+
+Check 'taint' flag
+------------------
+
+ *Ensure the kernel you just installed does not 'taint' itself when
+ running.*
+
+As outlined above in more detail already: the kernel sets a 'taint' flag when
+something happens that can lead to follow-up errors that look totally
+unrelated. That's why you need to check that the kernel you just installed does
+not set this flag. And if it does, you in almost all cases need to
+eliminate the reason for it before reporting issues that occur with it. See
+the section above for details on how to do that.
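+
+As a quick reminder, the current taint state can be queried at runtime like
+this; a '0' means the kernel is not tainted::
+
+ $ cat /proc/sys/kernel/tainted
+ 0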
+
+
+Reproduce issue with the fresh kernel
+-------------------------------------
+
+ *Reproduce the issue with the kernel you just installed. If it doesn't show
+ up there, scroll down to the instructions for issues only happening with
+ stable and longterm kernels.*
+
+Check if the issue occurs with the fresh Linux kernel version you just
+installed. If it was fixed there already, consider sticking with this version
+line and abandoning your plan to report the issue. But keep in mind that other
+users might still be plagued by it, as long as it's not fixed in the stable
+and longterm versions from kernel.org (and thus the vendor kernels derived from
+those). If you prefer to use one of those or just want to help their users,
+head over to the section "Details about reporting issues only occurring in
+older kernel version lines" below.
+
+
+Optimize description to reproduce issue
+---------------------------------------
+
+ *Optimize your notes: try to find and write the most straightforward way to
+ reproduce your issue. Make sure the end result has all the important
+ details, and at the same time is easy to read and understand for others
+ that hear about it for the first time. And if you learned something in this
+ process, consider searching again for existing reports about the issue.*
+
+An unnecessarily complex report will make it hard for others to understand your
+report. Thus try to find a reproducer that's straightforward to describe and
+thus easy to understand in written form. Include all important details, but at
+the same time try to keep it as short as possible.
+
+During this and the previous steps you likely have learned a thing or two about
+the issue you face. Use this knowledge and search again for existing reports you
+could join instead.
+
+
+Decode failure messages
+-----------------------
+
+ *If your failure involves a 'panic', 'Oops', 'warning', or 'BUG', consider
+ decoding the kernel log to find the line of code that triggered the error.*
+
+When the kernel detects an internal problem, it will log some information about
+the executed code. This makes it possible to pinpoint the exact line in the
+source code that triggered the issue and shows how it was called. But that only
+works if you enabled CONFIG_DEBUG_INFO and CONFIG_KALLSYMS when configuring
+your kernel. If you did so, consider decoding the information from the
+kernel's log. That will make it a lot easier to understand what led to the
+'panic', 'Oops', 'warning', or 'BUG', which increases the chances that someone
+can provide a fix.
+
+Decoding can be done with a script you find in the Linux source tree. If you
+are running a kernel you compiled yourself earlier, call it like this::
+
+ [user@something ~]$ sudo dmesg | ./linux-5.10.5/scripts/decode_stacktrace.sh ./linux-5.10.5/vmlinux
+
+If you are running a packaged vanilla kernel, you will likely have to install
+the corresponding packages with debug symbols. Then call the script (which you
+might need to get from the Linux sources if your distro does not package it)
+like this::
+
+ [user@something ~]$ sudo dmesg | ./linux-5.10.5/scripts/decode_stacktrace.sh \
+ /usr/lib/debug/lib/modules/5.10.10-4.1.x86_64/vmlinux /usr/src/kernels/5.10.10-4.1.x86_64/
+
+The script will work on log lines like the following, which show the address of
+the code the kernel was executing when the error occurred::
+
+ [ 68.387301] RIP: 0010:test_module_init+0x5/0xffa [test_module]
+
+Once decoded, these lines will look like this::
+
+ [ 68.387301] RIP: 0010:test_module_init (/home/username/linux-5.10.5/test-module/test-module.c:16) test_module
+
+In this case the executed code was built from the file
+'~/linux-5.10.5/test-module/test-module.c' and the error was caused by the
+instructions found in line '16'.
+
+The script will similarly decode the addresses mentioned in the section
+starting with 'Call trace', which show the path to the function where the
+problem occurred. Additionally, the script will show the assembler output for
+the code section the kernel was executing.
+
+Note, if you can't get this to work, simply skip this step and mention the
+reason for it in the report. If you're lucky, it might not be needed. And if it
+is, someone might help you to get things going. Also be aware this is just one
+of several ways to decode kernel stack traces. Sometimes different steps will
+be required to retrieve the relevant details. Don't worry about that, if that's
+needed in your case, developers will tell you what to do.
+
+
+Special care for regressions
+----------------------------
+
+ *If your problem is a regression, try to narrow down when the issue was
+ introduced as much as possible.*
+
+Linux lead developer Linus Torvalds insists that the Linux kernel never
+worsens; that's why he deems regressions unacceptable and wants to see them
+fixed quickly. Hence, changes that introduced a regression are often
+promptly reverted if the issue they cause can't get solved quickly any other
+way. Reporting a regression is thus a bit like playing a kind of trump card to
+get something quickly fixed. But for that to happen the change that's causing
+the regression needs to be known. Normally it's up to the reporter to track
+down the culprit, as maintainers often won't have the time or setup at hand to
+reproduce it themselves.
+
+To find the change there is a process called 'bisection' which the document
+Documentation/admin-guide/bug-bisect.rst describes in detail. That process
+will often require you to build about ten to twenty kernel images, trying to
+reproduce the issue with each of them before building the next. Yes, that takes
+some time, but don't worry, it works a lot quicker than most people assume.
+Thanks to a 'binary search' this will lead you to the one commit in the source
+code management system that's causing the regression. Once you find it, search
+the net for the subject of the change, its commit id and the shortened commit id
+(the first 12 characters of the commit id). This will lead you to existing
+reports about it, if there are any.
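+
+To give an impression of what a bisection looks like on the command line, here
+is a brief sketch using made-up version numbers; the document mentioned above
+describes the full procedure::
+
+ $ git bisect start
+ $ git bisect good v5.7
+ $ git bisect bad v5.8-rc1
+ [build, install, and boot the kernel git checked out, then test it]
+ $ git bisect good    # or 'git bisect bad', depending on the outcome
+ [repeat building and testing until git names the first bad commit]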
+
+Note, a bisection needs a bit of know-how, which not everyone has, and quite a
+bit of effort, which not everyone is willing to invest. Nevertheless, it's
+highly recommended to perform a bisection yourself. If you really can't or
+don't want to go down that route at least find out which mainline kernel
+introduced the regression. If something for example breaks when switching from
+5.5.15 to 5.8.4, then try at least all the mainline releases in that area (5.6,
+5.7 and 5.8) to check when it first showed up. Unless you're trying to find a
+regression in a stable or longterm kernel, avoid testing versions whose number
+has three sections (5.6.12, 5.7.8), as that makes the outcome hard to
+interpret, which might render your testing useless. Once you have found the major
+version which introduced the regression, feel free to move on in the reporting
+process. But keep in mind: it depends on the issue at hand if the developers
+will be able to help without knowing the culprit. Sometimes they might
+recognize from the report what went wrong and can fix it; other times they will
+be unable to help unless you perform a bisection.
+
+When dealing with regressions make sure the issue you face is really caused by
+the kernel and not by something else, as outlined above already.
+
+In the whole process keep in mind: an issue only qualifies as regression if the
+older and the newer kernel got built with a similar configuration. This can be
+achieved by using ``make olddefconfig``, as explained in more detail by
+Documentation/admin-guide/reporting-regressions.rst; that document also
+provides a good deal of other information about regressions you might want to be
+aware of.
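+
+For example, if your distribution places the configuration of the running
+kernel under /boot (a common, but not universal location), reusing it for the
+kernels you build yourself might look like this::
+
+ $ cp /boot/config-$(uname -r) .config
+ $ make olddefconfig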
+
+
+Write and send the report
+-------------------------
+
+ *Start to compile the report by writing a detailed description about the
+ issue. Always mention a few things: the latest kernel version you installed
+ for reproducing, the Linux Distribution used, and your notes on how to
+ reproduce the issue. Ideally, make the kernel's build configuration
+ (.config) and the output from ``dmesg`` available somewhere on the net and
+ link to it. Include or upload all other information that might be relevant,
+ like the output/screenshot of an Oops or the output from ``lspci``. Once
+ you wrote this main part, insert a normal length paragraph on top of it
+ outlining the issue and the impact quickly. On top of this add one sentence
+ that briefly describes the problem and gets people to read on. Now give the
+ thing a descriptive title or subject that yet again is shorter. Then you're
+ ready to send or file the report like the MAINTAINERS file told you, unless
+ you are dealing with one of those 'issues of high priority': they need
+ special care which is explained in 'Special handling for high priority
+ issues' below.*
+
+Now that you have prepared everything it's time to write your report. How to do
+that is partly explained by the three documents linked to in the preface above.
+That's why this text will only mention a few of the essentials as well as
+things specific to the Linux kernel.
+
+There is one thing that fits both categories: the most crucial parts of your
+report are the title/subject, the first sentence, and the first paragraph.
+Developers often get quite a lot of mail. They thus often just take a few
+seconds to skim a mail before deciding to move on or look closer. Thus: the
+better the top section of your report, the higher are the chances that someone
+will look into it and help you. And that is why you should ignore them for now
+and write the detailed report first. ;-)
+
+Things each report should mention
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Describe in detail how your issue happens with the fresh vanilla kernel you
+installed. Try to include the step-by-step instructions you wrote and optimized
+earlier that outline how you and ideally others can reproduce the issue; in
+those rare cases where that's impossible try to describe what you did to
+trigger it.
+
+Also include all the relevant information others might need to understand the
+issue and its environment. What's actually needed depends a lot on the issue,
+but there are some things you should include always:
+
+ * the output from ``cat /proc/version``, which contains the Linux kernel
+ version number and the compiler it was built with.
+
+ * the Linux distribution the machine is running (``hostnamectl | grep
+ "Operating System"``)
+
+ * the architecture of the CPU and the operating system (``uname -mi``)
+
+ * if you are dealing with a regression and performed a bisection, mention the
+ subject and the commit-id of the change that is causing it.
+
+In a lot of cases it's also wise to make two more things available to those
+that read your report:
+
+ * the configuration used for building your Linux kernel (the '.config' file)
+
+ * the kernel's messages that you get from ``dmesg`` written to a file. Make
+ sure that it starts with a line like 'Linux version 5.8-1
+ (foobar@example.com) (gcc (GCC) 10.2.1, GNU ld version 2.34) #1 SMP Mon Aug
+ 3 14:54:37 UTC 2020'. If it's missing, then important messages from the first
+ boot phase already got discarded. In this case instead consider using
+ ``journalctl -b 0 -k``; alternatively you can also reboot, reproduce the
+ issue, and call ``dmesg`` right afterwards (one way to capture both files is
+ sketched after this list).
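+
+One way to capture both files might look like the following sketch; the target
+file names are just examples, and the second command assumes the configuration
+used for the tested kernel sits in the current directory as '.config' (adjust
+the path if it for example lives under /boot instead)::
+
+ $ sudo dmesg > ~/dmesg-$(uname -r).txt
+ $ cp .config ~/kernel-config-$(uname -r).txt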
+
+These two files are big; that's why it's a bad idea to put them directly into
+your report. If you are filing the issue in a bug tracker then attach them to
+the ticket. If you report the issue by mail do not attach them, as that makes
+the mail too large; instead do one of these things:
+
+ * Upload the files somewhere public (your website, a public file paste
+ service, a ticket created just for this purpose on `bugzilla.kernel.org
+ <https://bugzilla.kernel.org/>`_, ...) and include a link to them in your
+ report. Ideally use something where the files stay available for years, as
+ they could be useful to someone many years from now; this for example can
+ happen if five or ten years from now a developer works on some code that was
+ changed just to fix your issue.
+
+ * Put the files aside and mention you will send them later in individual
+ replies to your own mail. Just remember to actually do that once the report
+ went out. ;-)
+
+Things that might be wise to provide
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Depending on the issue you might need to add more background data. Here are a
+few suggestions for what is often good to provide:
+
+ * If you are dealing with a 'warning', an 'OOPS' or a 'panic' from the kernel,
+ include it. If you can't copy'n'paste it, try to capture a netconsole trace
+ or at least take a picture of the screen.
+
+ * If the issue might be related to your computer hardware, mention what kind
+ of system you use. If you for example have problems with your graphics card,
+ mention its manufacturer, the card's model, and what chip it uses. If it's a
+ laptop mention its name, but try to make sure it's meaningful. 'Dell XPS 13'
+ for example is not, because it might be the one from 2012; that one doesn't
+ look that different from the one sold today, but apart from that the two have
+ nothing in common. Hence, in such cases add the exact model number, which
+ for example is '9380' or '7390' for XPS 13 models introduced during 2019.
+ Names like 'Lenovo Thinkpad T590' are also somewhat ambiguous: there are
+ variants of this laptop with and without a dedicated graphics chip, so try
+ to find the exact model name or specify the main components.
+
+ * Mention the relevant software in use. If you have problems with loading
+ modules, you want to mention the versions of kmod, systemd, and udev in use.
+ If one of the DRM drivers misbehaves, you want to state the versions of
+ libdrm and Mesa; also specify your Wayland compositor or the X-Server and
+ its driver. If you have a filesystem issue, mention the version of
+ corresponding filesystem utilities (e2fsprogs, btrfs-progs, xfsprogs, ...).
+
+ * Gather additional information from the kernel that might be of interest. The
+ output from ``lspci -nn`` will for example help others to identify what
+ hardware you use. If you have a problem with hardware you even might want to
+ make the output from ``sudo lspci -vvv`` available, as that provides
+ insights how the components were configured. For some issues it might be
+ good to include the contents of files like ``/proc/cpuinfo``,
+ ``/proc/ioports``, ``/proc/iomem``, ``/proc/modules``, or
+ ``/proc/scsi/scsi``. Some subsystems also offer tools to collect relevant
+ information. One such tool is ``alsa-info.sh`` `which the audio/sound
+ subsystem developers provide <https://www.alsa-project.org/wiki/AlsaInfo>`_.
+
+Those examples should give you some ideas of what data might be wise to
+attach, but you have to decide yourself what will be helpful for others to know.
+Don't worry too much about forgetting something, as developers will ask for
+additional details they need. But making everything important available from
+the start increases the chance someone will take a closer look.
+
+
+The important part: the head of your report
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that you have the detailed part of the report prepared let's get to the
+most important section: the first few sentences. Thus go to the top, add
+something like 'The detailed description:' before the part you just wrote and
+insert two newlines at the top. Now write one normal length paragraph that
+describes the issue roughly. Leave out all boring details and focus on the
+crucial parts readers need to know to understand what this is all about; if you
+think this bug affects a lot of users, mention this to get people interested.
+
+Once you have done that, insert two more lines at the top and write a one-sentence
+summary that explains quickly what the report is about. After that you have to
+get even more abstract and write an even shorter subject/title for the report.
+
+Now that you have written this part, take some time to optimize it, as it is the
+most important part of your report: a lot of people will only read this before
+they decide if reading the rest is time well spent.
+
+Now send or file the report like the :ref:`MAINTAINERS <maintainers>` file told
+you, unless it's one of those 'issues of high priority' outlined earlier: in
+that case please read the next subsection first before sending the report on
+its way.
+
+Special handling for high priority issues
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Reports for high priority issues need special handling.
+
+**Severe issues**: make sure the subject or ticket title as well as the first
+paragraph makes the severity obvious.
+
+**Regressions**: make the report's subject start with '[REGRESSION]'.
+
+In case you performed a successful bisection, use the title of the change that
+introduced the regression as the second part of your subject. Make the report
+also mention the commit id of the culprit. In case of an unsuccessful bisection,
+make your report mention the latest tested version that's working fine (say 5.7)
+and the oldest where the issue occurs (say 5.8-rc1).
+
+When sending the report by mail, CC the Linux regressions mailing list
+(regressions@lists.linux.dev). In case the report needs to be filed to some web
+tracker, proceed to do so. Once filed, forward the report by mail to the
+regressions list; CC the maintainer and the mailing list for the subsystem in
+question. Make sure to inline the forwarded report, hence do not attach it.
+Also add a short note at the top where you mention the URL to the ticket.
+
+When mailing or forwarding the report, in case of a successful bisection add the
+author of the culprit to the recipients; also CC everyone in the signed-off-by
+chain, which you find at the end of its commit message.
+
+**Security issues**: for these issues you will have to evaluate whether a
+short-term risk to other users would arise if details were publicly disclosed.
+If that's not the case simply proceed with reporting the issue as described.
+For issues that bear such a risk you will need to adjust the reporting process
+slightly:
+
+ * If the MAINTAINERS file instructed you to report the issue by mail, do not
+ CC any public mailing lists.
+
+ * If you were supposed to file the issue in a bug tracker make sure to mark
+ the ticket as 'private' or 'security issue'. If the bug tracker does not
+ offer a way to keep reports private, forget about it and send your report as
+ a private mail to the maintainers instead.
+
+In both cases make sure to also mail your report to the addresses the
+MAINTAINERS file lists in the section 'security contact'. Ideally directly CC
+them when sending the report by mail. If you filed it in a bug tracker, forward
+the report's text to these addresses; but on top of it put a small note where
+you mention that you filed it, with a link to the ticket.
+
+See Documentation/process/security-bugs.rst for more information.
+
+
+Duties after the report went out
+--------------------------------
+
+ *Wait for reactions and keep the thing rolling until you can accept the
+ outcome in one way or the other. Thus react publicly and in a timely manner
+ to any inquiries. Test proposed fixes. Do proactive testing: retest with at
+ least every first release candidate (RC) of a new mainline version and
+ report your results. Send friendly reminders if things stall. And try to
+ help yourself, if you don't get any help or if it's unsatisfying.*
+
+If your report was good and you are really lucky then one of the developers
+might immediately spot what's causing the issue; they then might write a patch
+to fix it, test it, and send it straight for integration in mainline while
+tagging it for later backport to stable and longterm kernels that need it. Then
+all you need to do is reply with a 'Thank you very much' and switch to a version
+with the fix once it gets released.
+
+But this ideal scenario rarely happens. That's why the job is only starting
+once you got the report out. What you'll have to do depends on the situation,
+but often it will be the things listed below. But before digging into the
+details, here are a few important things you need to keep in mind for this part
+of the process.
+
+
+General advice for further interactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Always reply in public**: When you filed the issue in a bug tracker, always
+reply there and do not contact any of the developers privately about it. For
+mailed reports always use the 'Reply-all' function when replying to any mails
+you receive. That includes mails with any additional data you might want to add
+to your report: go to your mail application's 'Sent' folder and use 'reply-all'
+on your mail with the report. This approach will make sure the public mailing
+list(s) and everyone else that gets involved over time stays in the loop; it
+also keeps the mail thread intact, which among other things is really important
+for mailing lists to group all related mails together.
+
+There are just two situations where a comment in a bug tracker or a 'Reply-all'
+is unsuitable:
+
+ * Someone tells you to send something privately.
+
+ * You were told to send something, but noticed it contains sensitive
+ information that needs to be kept private. In that case it's okay to send it
+ in private to the developer that asked for it. But note in the ticket or a
+ mail that you did that, so everyone else knows you honored the request.
+
+**Do research before asking for clarifications or help**: In this part of the
+process someone might tell you to do something that requires a skill you might
+not have mastered yet. For example, you might be asked to use some test tools
+you never have heard of yet; or you might be asked to apply a patch to the
+Linux kernel sources to test if it helps. In some cases it will be fine sending
+a reply asking for instructions on how to do that. But before going that route,
+try to find the answer on your own by searching the internet; alternatively
+consider asking in other places for advice. For example, ask a friend or post
+about it in a chatroom or forum you normally hang out in.
+
+**Be patient**: If you are really lucky you might get a reply to your report
+within a few hours. But most of the time it will take longer, as maintainers
+are scattered around the globe and thus might be in a different time zone – one
+where they already enjoy their night away from keyboard.
+
+In general, kernel developers will take one to five business days to respond to
+reports. Sometimes it will take longer, as they might be busy with the merge
+windows, other work, visiting developer conferences, or simply enjoying a long
+summer holiday.
+
+The 'issues of high priority' (see above for an explanation) are an exception
+here: maintainers should address them as soon as possible; that's why you
+should wait a week at maximum (or just two days if it's something urgent)
+before sending a friendly reminder.
+
+Sometimes the maintainer might not be responding in a timely manner; other
+times there might be disagreements, for example about whether an issue qualifies
+as a regression or not. In such cases raise your concerns on the mailing list and
+ask others for public or private replies on how to move on. If that fails, it
+might be appropriate to get a higher authority involved. In case of a WiFi
+driver that would be the wireless maintainers; if there are no higher level
+maintainers or all else fails, it might be one of those rare situations where
+it's okay to get Linus Torvalds involved.
+
+**Proactive testing**: Every time the first pre-release (the 'rc1') of a new
+mainline kernel version gets released, go and check if the issue is fixed there
+or if anything of importance changed. Mention the outcome in the ticket or in a
+mail you send as a reply to your report (make sure it CCs all those that
+participated in the discussion up to that point). This will show your
+commitment and that you are willing to help. It also tells developers if the
+issue persists and makes sure they do not forget about it. A few other
+occasional retests (for example with rc3, rc5 and the final) are also a good
+idea, but only report your results if something relevant changed or if you are
+writing something anyway.
+
+With all these general things off the table let's get into the details of how
+to help to get issues resolved once they were reported.
+
+Inquiries and testing requests
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here are your duties in case you got replies to your report:
+
+**Check who you deal with**: Most of the time it will be the maintainer or a
+developer of the particular code area that will respond to your report. But as
+issues are normally reported in public it could be anyone that's replying —
+including people that want to help, but in the end might guide you totally off
+track with their questions or requests. That rarely happens, but it's one of
+many reasons why it's wise to quickly run an internet search to see who you're
+interacting with. Doing this also makes you aware whether your report was heard
+by the right people, as a reminder to the maintainer (see below) might be in order
+later if discussion fades out without leading to a satisfying solution for the
+issue.
+
+**Inquiries for data**: Often you will be asked to test something or provide
+additional details. Try to provide the requested information soon, as you have
+the attention of someone that might help and risk losing it the longer you
+wait; that outcome is even likely if you do not provide the information within
+a few business days.
+
+**Requests for testing**: When you are asked to test a diagnostic patch or a
+possible fix, try to test it in a timely manner, too. But do it properly and make
+sure to not rush it: mixing things up can happen easily and can lead to a lot
+of confusion for everyone involved. A common mistake for example is thinking a
+proposed patch with a fix was applied, but in fact wasn't. Things like that
+happen even to experienced testers occasionally, but they most of the time will
+notice when the kernel with the fix behaves just as one without it.
+
+What to do when nothing of substance happens
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some reports will not get any reaction from the responsible Linux kernel
+developers; or a discussion around the issue evolved, but faded out with
+nothing of substance coming out of it.
+
+In these cases wait two (better: three) weeks before sending a friendly
+reminder: maybe the maintainer was just away from keyboard for a while when
+your report arrived or had something more important to take care of. When
+writing the reminder, kindly ask if anything else from your side is needed to
+get the ball rolling again. If the report got out by mail, do that in the
+first lines of a mail that is a reply to your initial mail (see above) which
+includes a full quote of the original report below: that's one of those few
+situations where such a 'TOFU' (Text Over, Fullquote Under) is the right
+approach, as then all the recipients will have the details at hand immediately
+in the proper order.
+
+After the reminder wait three more weeks for replies. If you still don't get a
+proper reaction, you first should reconsider your approach. Did you maybe try
+to reach out to the wrong people? Was the report maybe offensive or so
+confusing that people decided to completely stay away from it? The best way to
+rule out such factors: show the report to one or two people familiar with FLOSS
+issue reporting and ask for their opinion. Also ask them for their advice on how
+to move forward. That might mean: prepare a better report and have those people
+review it before you send it out. Such an approach is totally fine; just
+mention that this is the second and improved report on the issue and include a
+link to the first report.
+
+If the report was proper you can send a second reminder; in it ask for advice
+why the report did not get any replies. A good moment for this second reminder
+mail is shortly after the first pre-release (the 'rc1') of a new Linux kernel
+version got published, as you should retest and provide a status update at that
+point anyway (see above).
+
+If the second reminder again results in no reaction within a week, try to
+contact a higher-level maintainer asking for advice: even busy maintainers by
+then should at least have sent some kind of acknowledgment.
+
+Remember to prepare yourself for a disappointment: maintainers ideally should
+react somehow to every issue report, but they are only obliged to fix those
+'issues of high priority' outlined earlier. So don't be too devastated if you
+get a reply along the lines of 'thanks for the report, I have more important
+issues to deal with currently and won't have time to look into this for the
+foreseeable future'.
+
+It's also possible that after some discussion in the bug tracker or on a list
+nothing happens anymore and reminders don't help to motivate anyone to work out
+a fix. Such situations can be devastating, but are within the cards when it
+comes to Linux kernel development. This and several other reasons for not
+getting help are explained in 'Why some issues won't get any reaction or remain
+unfixed after being reported' near the end of this document.
+
+Don't get devastated if you don't find any help or if the issue in the end does
+not get solved: the Linux kernel is FLOSS and thus you can still help yourself.
+You for example could try to find others that are affected and team up with
+them to get the issue resolved. Such a team could prepare a fresh report
+together that mentions how many of you there are and why this is something that
+in your opinion should get fixed. Maybe together you can also narrow down the root cause
+or the change that introduced a regression, which often makes developing a fix
+easier. And with a bit of luck there might be someone in the team that knows a
+bit about programming and might be able to write a fix.
+
+
+Reference for "Reporting regressions within a stable and longterm kernel line"
+------------------------------------------------------------------------------
+
+This subsection provides details for the steps you need to perform if you face
+a regression within a stable and longterm kernel line.
+
+Make sure the particular version line still gets support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Check if the kernel developers still maintain the Linux kernel version
+ line you care about: go to the front page of kernel.org and make sure it
+ mentions the latest release of the particular version line without an
+ '[EOL]' tag.*
+
+Most kernel version lines only get supported for about three months, as
+maintaining them longer is quite a lot of work. Hence, only one per year is
+chosen and gets supported for at least two years (often six). That's why you
+need to check if the kernel developers still support the version line you care
+for.
+
+Note, if kernel.org lists two stable version lines on the front page, you
+should consider switching to the newer one and forget about the older one:
+support for it is likely to be abandoned soon. Then it will get an "end-of-life"
+(EOL) stamp. Version lines that reached that point still get mentioned on the
+kernel.org front page for a week or two, but are unsuitable for testing and
+reporting.
+
+Search stable mailing list
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Check the archives of the Linux stable mailing list for existing reports.*
+
+Maybe the issue you face is already known and was fixed or is about to be. Hence,
+`search the archives of the Linux stable mailing list
+<https://lore.kernel.org/stable/>`_ for reports about an issue like yours. If
+you find any matches, consider joining the discussion, unless the fix is
+already finished and scheduled to get applied soon.
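+
+The archives there can be searched directly from the web interface; a query for
+a hypothetical driver name might look like this (just substitute a term that
+fits your issue)::
+
+ https://lore.kernel.org/stable/?q=ath10k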
+
+Reproduce issue with the newest release
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Install the latest release from the particular version line as a vanilla
+ kernel. Ensure this kernel is not tainted and still shows the problem, as
+ the issue might have already been fixed there. If you first noticed the
+ problem with a vendor kernel, check that a vanilla build of the last version
+ known to work performs fine as well.*
+
+Before investing any more time in this process you want to check if the issue
+was already fixed in the latest release of the version line you're interested in.
+This kernel needs to be vanilla and shouldn't be tainted before the issue
+happens, as outlined in detail above in the section "Install a fresh
+kernel for testing".
+
+Did you first notice the regression with a vendor kernel? Then changes the
+vendor applied might be interfering. You need to rule that out by performing
+a recheck. Say something broke when you updated from 5.10.4-vendor.42 to
+5.10.5-vendor.43. Then after testing the latest 5.10 release as outlined in
+the previous paragraph, check if a vanilla build of Linux 5.10.4 works fine as
+well. If things are broken there, the issue does not qualify as an upstream
+regression and you need to switch back to the main step-by-step guide to report
+the issue.
+
+Report the regression
+~~~~~~~~~~~~~~~~~~~~~
+
+ *Send a short problem report to the Linux stable mailing list
+ (stable@vger.kernel.org) and CC the Linux regressions mailing list
+ (regressions@lists.linux.dev); if you suspect the cause in a particular
+ subsystem, CC its maintainer and its mailing list. Roughly describe the
+ issue and ideally explain how to reproduce it. Mention the first version
+ that shows the problem and the last version that's working fine. Then
+ wait for further instructions.*
+
+When reporting a regression that happens within a stable or longterm kernel
+line (say when updating from 5.10.4 to 5.10.5) a brief report is enough for
+the start to get the issue reported quickly. Hence a rough description to the
+stable and regressions mailing list is all it takes; but in case you suspect
+the cause in a particular subsystem, CC its maintainers and its mailing list
+as well, because that will speed things up.
+
+And note, it helps developers a great deal if you can specify the exact version
+that introduced the problem. Hence if possible within a reasonable time frame,
+try to find that version using vanilla kernels. Let's assume something broke when
+your distributor released an update from Linux kernel 5.10.5 to 5.10.8. Then as
+instructed above go and check the latest kernel from that version line, say
+5.10.9. If it shows the problem, try a vanilla 5.10.5 to ensure that no patches
+the distributor applied interfere. If the issue doesn't manifest itself there,
+try 5.10.7 and then (depending on the outcome) 5.10.8 or 5.10.6 to find the
+first version where things broke. Mention it in the report and state that 5.10.9
+is still broken.
+
+What the previous paragraph outlines is basically a rough manual 'bisection'.
+Once your report is out you might get asked to do a proper one, as it allows
+developers to pinpoint the exact change that causes the issue (which then can
+easily get reverted to fix the issue quickly). Hence consider doing a proper bisection
+right away if time permits. See the section 'Special care for regressions' and
+the document Documentation/admin-guide/bug-bisect.rst for details how to
+perform one. In case of a successful bisection add the author of the culprit to
+the recipients; also CC everyone in the signed-off-by chain, which you find at
+the end of its commit message.
+
+
+Reference for "Reporting issues only occurring in older kernel version lines"
+-----------------------------------------------------------------------------
+
+This section provides details for the steps you need to take if you could not
+reproduce your issue with a mainline kernel, but want to see it fixed in older
+version lines (aka stable and longterm kernels).
+
+Some fixes are too complex
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Prepare yourself for the possibility that going through the next few steps
+ might not get the issue solved in older releases: the fix might be too big
+ or risky to get backported there.*
+
+Even small and seemingly obvious code-changes sometimes introduce new and
+totally unexpected problems. The maintainers of the stable and longterm kernels
+are very aware of that and thus only apply changes to these kernels that are
+within the rules outlined in Documentation/process/stable-kernel-rules.rst.
+
+Complex or risky changes for example do not qualify and thus only get applied
+to mainline. Other fixes are easy to get backported to the newest stable and
+longterm kernels, but too risky to integrate into older ones. So be aware the
+fix you are hoping for might be one of those that won't be backported to the
+version line you care about. In that case you'll have no other choice than to
+live with the issue or switch to a newer Linux version, unless you want to
+patch the fix into your kernels yourself.
+
+Common preparations
+~~~~~~~~~~~~~~~~~~~
+
+ *Perform the first three steps in the section "Reporting regressions within
+ a stable and longterm kernel line" above.*
+
+You need to carry out a few steps already described in another section of this
+guide. Those steps will let you:
+
+ * Check if the kernel developers still maintain the Linux kernel version line
+ you care about.
+
+ * Search the Linux stable mailing list for existing reports.
+
+ * Check with the latest release.
+
+
+Check code history and search for existing discussions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Search the Linux kernel version control system for the change that fixed
+ the issue in mainline, as its commit message might tell you if the fix is
+ scheduled for backporting already. If you don't find anything that way,
+ search the appropriate mailing lists for posts that discuss such an issue
+ or peer-review possible fixes; then check the discussions if the fix was
+ deemed unsuitable for backporting. If backporting was not considered at
+ all, join the newest discussion, asking if it's in the cards.*
+
+In a lot of cases the issue you deal with will have happened with mainline, but
+got fixed there. The commit that fixed it would need to get backported as well
+to get the issue solved. That's why you want to search for it or any
+discussions about it.
+
+ * First try to find the fix in the Git repository that holds the Linux kernel
+ sources. You can do this with the web interfaces `on kernel.org
+ <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/>`_
+ or its mirror `on GitHub <https://github.com/torvalds/linux>`_; if you have
+ a local clone you alternatively can search on the command line with ``git
+ log --grep=<pattern>``.
+
+ If you find the fix, look if the commit message near the end contains a
+ 'stable tag' that looks like this:
+
+ Cc: <stable@vger.kernel.org> # 5.4+
+
+ If that's the case, the developer marked the fix as safe for backporting to
+ version line 5.4 and later. Most of the time it gets applied there within two
+ weeks, but sometimes it takes a bit longer.
+
+ * If the commit doesn't tell you anything or if you can't find the fix, look
+ again for discussions about the issue. Search the net with your favorite
+ internet search engine as well as the archives for the `Linux kernel
+ developers mailing list <https://lore.kernel.org/lkml/>`_. Also read the
+ section 'Locate kernel area that causes the issue' above and follow the
+ instructions to find the subsystem in question: its bug tracker or mailing
+ list archive might have the answer you are looking for.
+
+ * If you see a proposed fix, search for it in the version control system as
+ outlined above, as the commit might tell you if a backport can be expected.
+
+ * Check the discussions for any indicators the fix might be too risky to get
+ backported to the version line you care about. If that's the case you have
+ to live with the issue or switch to the kernel version line where the fix
+ got applied.
+
+ * If the fix doesn't contain a stable tag and backporting was not discussed,
+ join the discussion: mention the version where you face the issue and that
+ you would like to see it fixed, if suitable.
+
+
+Ask for advice
+~~~~~~~~~~~~~~
+
+ *One of the former steps should lead to a solution. If that doesn't work
+ out, ask the maintainers for the subsystem that seems to be causing the
+ issue for advice; CC the mailing list for the particular subsystem as well
+ as the stable mailing list.*
+
+If the previous three steps didn't get you closer to a solution there is only
+one option left: ask for advice. Do that in a mail you send to the maintainers
+for the subsystem where the issue seems to have its roots; CC the mailing list
+for the subsystem as well as the stable mailing list (stable@vger.kernel.org).
+
+
+Why some issues won't get any reaction or remain unfixed after being reported
+=============================================================================
+
+When reporting a problem to the Linux developers, be aware only 'issues of high
+priority' (regressions, security issues, severe problems) are definitely going
+to get resolved. The maintainers or, if all else fails, Linus Torvalds himself
+will make sure of that. They and the other kernel developers will fix a lot of
+other issues as well. But be aware that sometimes they can't or won't help; and
+sometimes there isn't even anyone to send a report to.
+
+This is best illustrated by the kernel developers who contribute to the Linux
+kernel in their spare time. Quite a few of the drivers in the kernel were
+written by such programmers, often because they simply wanted to make their
+hardware usable on their favorite operating system.
+
+These programmers most of the time will happily fix problems other people
+report. But nobody can force them to do so, as they are contributing voluntarily.
+
+Then there are situations where such developers really want to fix an issue,
+but can't: sometimes they lack hardware programming documentation to do so.
+This often happens when the publicly available docs are superficial or the
+driver was written with the help of reverse engineering.
+
+Sooner or later spare time developers will also stop caring for the driver.
+Maybe their test hardware broke, got replaced by something more fancy, or is so
+old that it's something you don't find much outside of computer museums
+anymore. Sometimes a developer stops caring about their code and Linux altogether, as
+something different in their life became way more important. In some cases
+nobody is willing to take over the job as maintainer – and nobody can be forced
+to, as contributing to the Linux kernel is done on a voluntary basis. Abandoned
+drivers nevertheless remain in the kernel: they are still useful for people and
+removing them would be a regression.
+
+The situation is not that different with developers that are paid for their
+work on the Linux kernel. Those contribute most changes these days. But their
+employers sooner or later also stop caring for their code or make its
+programmer focus on other things. Hardware vendors for example earn their money
+mainly by selling new hardware; quite a few of them hence are not investing
+much time and energy in maintaining a Linux kernel driver for something they
+stopped selling years ago. Enterprise Linux distributors often care for a
+longer time period, but in new versions often leave support for old and rare
+hardware aside to limit the scope. Often spare time contributors take over once
+a company orphans some code, but as mentioned above: sooner or later they will
+leave the code behind, too.
+
+Priorities are another reason why some issues are not fixed: maintainers
+quite often are forced to set them, as the time to work on Linux is limited.
+That's true for spare time or the time employers grant their developers to
+spend on maintenance work on the upstream kernel. Sometimes maintainers also
+get overwhelmed with reports, even if a driver is working nearly perfectly. To
+not get completely stuck, the programmer thus might have no other choice than
+to prioritize issue reports and reject some of them.
+
+But don't worry too much about all of this: a lot of drivers have active
+maintainers who are quite interested in fixing as many issues as possible.
+
+
+Closing words
+=============
+
+Compared with other Free/Libre & Open Source Software it's hard to report
+issues to the Linux kernel developers: the length and complexity of this
+document and the implications between the lines illustrate that. But that's how
+it is for now. The main author of this text hopes documenting the state of the
+art will lay some groundwork to improve the situation over time.
+
+
+..
+ end-of-content
+..
+ This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If
+ you spot a typo or small mistake, feel free to let him know directly and
+ he'll fix it. You are free to do the same in a mostly informal way if you
+ want to contribute changes to the text, but for copyright reasons please CC
+ linux-doc@vger.kernel.org and "sign-off" your contribution as
+ Documentation/process/submitting-patches.rst outlines in the section "Sign
+ your work - the Developer's Certificate of Origin".
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use "The Linux kernel developers" for author attribution and link
+ this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-issues.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/admin-guide/reporting-regressions.rst b/Documentation/admin-guide/reporting-regressions.rst
new file mode 100644
index 000000000000..76b246ecf21b
--- /dev/null
+++ b/Documentation/admin-guide/reporting-regressions.rst
@@ -0,0 +1,451 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+.. [see the bottom of this file for redistribution information]
+
+Reporting regressions
++++++++++++++++++++++
+
+"*We don't cause regressions*" is the first rule of Linux kernel development;
+Linux founder and lead developer Linus Torvalds established it himself and
+ensures it's obeyed.
+
+This document describes what the rule means for users and how the Linux
+kernel's development model ensures that all reported regressions are addressed;
+aspects relevant for kernel developers are left to
+Documentation/process/handling-regressions.rst.
+
+
+The important bits (aka "TL;DR")
+================================
+
+#. It's a regression if something running fine with one Linux kernel works worse
+ or not at all with a newer version. Note, the newer kernel has to be compiled
+ using a similar configuration; the explanations below describe this and other
+ fine print in more detail.
+
+#. Report your issue as outlined in Documentation/admin-guide/reporting-issues.rst;
+ it already covers all aspects important for regressions, which are repeated
+ below for convenience. Two of them are important: start your report's subject
+ with "[REGRESSION]" and CC or forward it to `the regression mailing list
+ <https://lore.kernel.org/regressions/>`_ (regressions@lists.linux.dev).
+
+#. Optional, but recommended: when sending or forwarding your report, make the
+ Linux kernel regression tracking bot "regzbot" track the issue by specifying
+ when the regression started like this::
+
+ #regzbot introduced: v5.13..v5.14-rc1
+
+
+All the details on Linux kernel regressions relevant for users
+==============================================================
+
+
+The important basics
+--------------------
+
+
+What is a "regression" and what is the "no regressions rule"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's a regression if some application or practical use case running fine with
+one Linux kernel works worse or not at all with a newer version compiled using a
+similar configuration. The "no regressions rule" forbids this from happening; if
+it happens by accident, the developers that caused it are expected to quickly
+fix the issue.
+
+It thus is a regression when a WiFi driver from Linux 5.13 works fine, but with
+5.14 doesn't work at all, works significantly slower, or misbehaves somehow.
+It's also a regression if a perfectly working application suddenly shows erratic
+behavior with a newer kernel version; such issues can be caused by changes in
+procfs, sysfs, or one of the many other interfaces Linux provides to userland
+software. But keep in mind, as mentioned earlier: 5.14 in this example needs to
+be built from a configuration similar to the one from 5.13. This can be achieved
+using ``make olddefconfig``, as explained in more detail below.
+
+Note the "practical use case" in the first sentence of this section: developers
+despite the "no regressions" rule are free to change any aspect of the kernel
+and even APIs or ABIs to userland, as long as no existing application or use
+case breaks.
+
+Also be aware that the "no regressions" rule covers only interfaces the kernel
+provides to the userland. It thus does not apply to kernel-internal interfaces
+like the module API, which some externally developed drivers use to hook into
+the kernel.
+
+How do I report a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Just report the issue as outlined in
+Documentation/admin-guide/reporting-issues.rst; it already describes the
+important points. The following aspects outlined there are especially relevant
+for regressions:
+
+ * When checking for existing reports to join, also search the `archives of the
+ Linux regressions mailing list <https://lore.kernel.org/regressions/>`_ and
+ `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
+
+ * Start your report's subject with "[REGRESSION]".
+
+ * In your report, clearly mention the last kernel version that worked fine and
+ the first broken one. Ideally try to find the exact change causing the
+ regression using a bisection, as explained below in more detail.
+
+ * Remember to let the Linux regressions mailing list
+ (regressions@lists.linux.dev) know about your report:
+
+ * If you report the regression by mail, CC the regressions list.
+
+ * If you report your regression to some bug tracker, forward the submitted
+ report by mail to the regressions list while CCing the maintainer and the
+ mailing list for the subsystem in question.
+
+ If it's a regression within a stable or longterm series (e.g.
+ v5.15.3..v5.15.5), remember to CC the `Linux stable mailing list
+ <https://lore.kernel.org/stable/>`_ (stable@vger.kernel.org).
+
+ In case you performed a successful bisection, CC everyone the culprit's
+ commit message mentions in lines starting with "Signed-off-by:".
+
+When CCing or forwarding your report to the list, consider directly telling the
+aforementioned Linux kernel regression tracking bot about your report. To do
+that, include a paragraph like this in your mail::
+
+ #regzbot introduced: v5.13..v5.14-rc1
+
+Regzbot will then consider your mail a report for a regression introduced in the
+specified version range. In the above case, Linux v5.13 still worked fine and
+Linux v5.14-rc1 was the first version where you encountered the issue. If you
+performed a bisection to find the commit that caused the regression, specify the
+culprit's commit-id instead::
+
+ #regzbot introduced: 1f2e3d4c5d
+
+Placing such a "regzbot command" is in your interest, as it will ensure the
+report won't fall through the cracks unnoticed. If you omit this, the Linux
+kernel's regression tracker will take care of telling regzbot about your
+regression, as long as you send a copy to the regressions mailing list. But the
+regression tracker is just one human who sometimes has to rest or occasionally
+might even enjoy some time away from computers (as crazy as that might sound).
+Relying on this person thus will result in an unnecessary delay before the
+regression shows up `on the list of tracked and unresolved Linux
+kernel regressions <https://linux-regtracking.leemhuis.info/regzbot/>`_ and in
+the weekly regression reports sent by regzbot. Such delays can result in Linus
+Torvalds being unaware of important regressions when deciding between "continue
+development or call this finished and release the final?".
+
+Are all regressions really fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Nearly all of them are, as long as the change causing the regression (the
+"culprit commit") is reliably identified. Some regressions can be fixed without
+this, but often it's required.
+
+Who needs to find the root cause of a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Developers of the affected code area should try to locate the culprit on their
+own. But for them that's often impossible to do with reasonable effort, as quite
+a lot of issues only occur in a particular environment outside the developer's
+reach -- for example, a specific hardware platform, firmware, Linux distro,
+system's configuration, or application. That's why in the end it's often up to
+the reporter to locate the culprit commit; sometimes users might even need to
+run additional tests afterwards to pinpoint the exact root cause. Developers
+should offer advice and reasonably help where they can, to make this process
+relatively easy and achievable for typical users.
+
+How can I find the culprit?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Perform a bisection, as roughly outlined in
+Documentation/admin-guide/reporting-issues.rst and described in more detail by
+Documentation/admin-guide/bug-bisect.rst. It might sound like a lot of work, but
+in many cases it finds the culprit relatively quickly. If it's hard or
+time-consuming to reliably reproduce the issue, consider teaming up with other
+affected users to narrow down the search range together.
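+
+If you have never done this before, the following is a minimal sketch of such a
+bisection; the version numbers are just examples, use the last known-good and
+the first known-bad version from your case::
+
+ git bisect start
+ git bisect good v5.13
+ git bisect bad v5.14
+ # now build, install, and boot the kernel git checked out and test it; then
+ # run either "git bisect good" or "git bisect bad" and repeat until git
+ # prints the culprit ("first bad commit")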
+
+Who can I ask for advice when it comes to regressions?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
+CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
+issue might better be dealt with in private, feel free to omit the list.
+
+
+Additional details about regressions
+------------------------------------
+
+
+What is the goal of the "no regressions rule"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Users should feel safe when updating kernel versions and not have to worry that
+something might break. It is in the kernel developers' interest to make updating
+attractive: they don't want users to stay on stable or longterm Linux
+series that are either abandoned or more than one and a half years old. That's
+in everybody's interest, as `those series might have known bugs, security
+issues, or other problematic aspects already fixed in later versions
+<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
+Additionally, the kernel developers want to make it simple and appealing for
+users to test the latest pre-release or regular release. That's also in
+everybody's interest, as it's a lot easier to track down and fix problems, if
+they are reported shortly after being introduced.
+
+Is the "no regressions" rule really adhered in practice?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's taken really seriously, as can be seen by many mailing list posts from
+Linux creator and lead developer Linus Torvalds, some of which are quoted in
+Documentation/process/handling-regressions.rst.
+
+Exceptions to this rule are extremely rare; in the past developers almost always
+turned out to be wrong when they assumed a particular situation warranted
+an exception.
+
+Who ensures the "no regressions" is actually followed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The subsystem maintainers should take care of that; they are watched and
+supported by the tree maintainers -- e.g. Linus Torvalds for mainline and
+Greg Kroah-Hartman et al. for various stable/longterm series.
+
+All of them are helped by people trying to ensure no regression report falls
+through the cracks. One of them is Thorsten Leemhuis, who's currently acting as
+the Linux kernel's "regressions tracker"; to facilitate this work he relies on
+regzbot, the Linux kernel regression tracking bot. That's why you want to bring
+your report onto the radar of these people by CCing or forwarding each report to
+the regressions mailing list, ideally with a "regzbot command" in your mail to
+get it tracked immediately.
+
+How quickly are regressions normally fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Developers should fix any reported regression as quickly as possible, to provide
+affected users with a solution in a timely manner and prevent more users from
+running into the issue; nevertheless developers need to take enough time and
+care to ensure regression fixes do not cause additional damage.
+
+The answer thus depends on various factors like the impact of a regression, its
+age, or the Linux series in which it occurs. In the end though, most regressions
+should be fixed within two weeks.
+
+Is it a regression, if the issue can be avoided by updating some software?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Almost always: yes. If a developer tells you otherwise, ask the regression
+tracker for advice as outlined above.
+
+Is it a regression, if a newer kernel works slower or consumes more energy?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Yes, but the difference has to be significant. A five percent slow-down in a
+micro-benchmark thus is unlikely to qualify as a regression, unless it also
+influences the results of a broad benchmark by more than one percent. If in
+doubt, ask for advice.
+
+Is it a regression, if an external kernel module breaks when updating Linux?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+No, as the "no regression" rule is about interfaces and services the Linux
+kernel provides to the userland. It thus does not cover building or running
+externally developed kernel modules, as they run in kernel-space and hook into
+the kernel using internal interfaces that occasionally change.
+
+How are regressions handled that are caused by security fixes?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In extremely rare situations security issues can't be fixed without causing
+regressions; such fixes are allowed through, as they are the lesser evil in the
+end. Luckily this dilemma almost always can be avoided, as key developers for
+the affected area and often Linus Torvalds himself try very hard to fix security
+issues without causing regressions.
+
+If you nevertheless face such a case, check the mailing list archives to see
+whether people tried their best to avoid the regression. If not, report it; if
+in doubt, ask for advice as outlined above.
+
+What happens if fixing a regression is impossible without causing another?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Sadly these things happen, but luckily not very often; if they occur, expert
+developers of the affected code area should look into the issue to find a fix
+that avoids regressions or at least their impact. If you run into such a
+situation, do what was outlined already for regressions caused by security
+fixes: check earlier discussions if people already tried their best and ask for
+advice if in doubt.
+
+A quick note while at it: these situations could be avoided if people regularly
+gave mainline pre-releases (say v5.15-rc1 or -rc3) from each
+development cycle a test run. This is best explained by imagining a change
+integrated between Linux v5.14 and v5.15-rc1 which causes a regression, but at
+the same time is a hard requirement for some other improvement applied for
+5.15-rc1. All these changes often can simply be reverted and the regression thus
+solved, if someone finds and reports it before 5.15 is released. A few days or
+weeks later this solution can become impossible, as some software might have
+started to rely on aspects introduced by one of the follow-up changes: reverting
+all changes would then cause a regression for users of said software and thus is
+out of the question.
+
+Is it a regression, if some feature I relied on was removed months ago?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is, but often it's hard to fix such regressions due to the aspects outlined
+in the previous section. It hence needs to be dealt with on a case-by-case
+basis. This is another reason why it's in everybody's interest to regularly test
+mainline pre-releases.
+
+Does the "no regression" rule apply if I seem to be the only affected person?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It does, but only for practical usage: the Linux developers want to be free to
+remove support for hardware that can only be found in attics and museums anymore.
+
+Note, sometimes regressions can't be avoided to make progress -- and the latter
+is needed to prevent Linux from stagnating. Hence, if only very few users seem
+to be affected by a regression, it might, for the greater good, be in their and
+everyone else's interest to let things pass. Especially if there is an
+easy way to circumvent the regression somehow, for example by updating some
+software or using a kernel parameter created just for this purpose.
+
+Does the regression rule apply for code in the staging tree as well?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Not according to the `help text for the configuration option covering all
+staging code <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
+which since its early days states::
+
+ Please note that these drivers are under heavy development, may or
+ may not work, and may contain userspace interfaces that most likely
+ will be changed in the near future.
+
+The staging developers nevertheless often adhere to the "no regressions" rule,
+but sometimes bend it to make progress. That's for example why some users had to
+deal with (often negligible) regressions when a WiFi driver from the staging
+tree was replaced by a totally different one written from scratch.
+
+Why do later versions have to be "compiled with a similar configuration"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because the Linux kernel developers sometimes integrate changes known to cause
+regressions, but make them optional and disable them in the kernel's default
+configuration. This trick allows progress, as the "no regressions" rule
+otherwise would lead to stagnation.
+
+Consider for example a new security feature blocking access to some kernel
+interfaces often abused by malware, which at the same time are required to run a
+few rarely used applications. The outlined approach makes both camps happy:
+people using these applications can leave the new security feature off, while
+everyone else can enable it without running into trouble.
+
+How to create a configuration similar to the one of an older kernel?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Start your machine with a known-good kernel and configure the newer Linux
+version with ``make olddefconfig``. This makes the kernel's build scripts pick
+up the configuration file (the ".config" file) from the running kernel as base
+for the new one you are about to compile; afterwards they set all new
+configuration options to their default value, which should disable new features
+that might cause regressions.
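+
+A minimal sketch of this, assuming the sources of the newer kernel are in
+~/linux/ and your distribution stores the running kernel's configuration below
+/boot (as most do)::
+
+ cd ~/linux/
+ # copying the base configuration explicitly is strictly speaking optional, as
+ # the build scripts also look in places like /boot/config-$(uname -r)
+ cp /boot/config-$(uname -r) .config
+ make olddefconfig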
+
+Can I report a regression I found with pre-compiled vanilla kernels?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You need to ensure the newer kernel was compiled with a similar configuration
+file as the older one (see above), as those who built them might have enabled
+some known-to-be-incompatible feature for the newer kernel. If in doubt, report
+the matter to the kernel's provider and ask for advice.
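+
+If you have the configuration files of both kernels at hand (many distributions
+ship them as /boot/config-<version>), the ``scripts/diffconfig`` helper from the
+kernel sources can give a rough idea of how much they differ; the file names
+below are just placeholders::
+
+ ./scripts/diffconfig /boot/config-5.13.0 /boot/config-5.14.0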
+
+
+More about regression tracking with "regzbot"
+---------------------------------------------
+
+What is regression tracking and why should I care about it?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Rules like "no regressions" need someone to ensure they are followed, otherwise
+they are broken either accidentally or on purpose. History has shown this to be
+true for Linux kernel development as well. That's why Thorsten Leemhuis, the
+Linux Kernel's regression tracker, and some people try to ensure all regression
+are fixed by keeping an eye on them until they are resolved. Neither of them are
+paid for this, that's why the work is done on a best effort basis.
+
+Why and how are Linux kernel regressions tracked using a bot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Tracking regressions completely manually has proven to be quite hard due to the
+distributed and loosely structured nature of the Linux kernel development
+process. That's why the Linux kernel's regression tracker developed regzbot to
+facilitate the work, with the long term goal to automate regression tracking as
+much as possible for everyone involved.
+
+Regzbot works by watching for replies to reports of tracked regressions.
+Additionally, it's looking out for posted or committed patches referencing such
+reports with "Link:" tags; replies to such patch postings are tracked as well.
+Combined, this data provides good insights into the current state of the fixing
+process.
+
+How to see which regressions regzbot currently tracks?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Check out `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
+
+What kind of issues are supposed to be tracked by regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bot is meant to track regressions; hence, please don't involve regzbot for
+regular issues. But it's okay for the Linux kernel's regression tracker if you
+involve regzbot to track severe issues, like reports about hangs, corrupted
+data, or internal errors (Panic, Oops, BUG(), warning, ...).
+
+How to change aspects of a tracked regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By using a 'regzbot command' in a direct or indirect reply to the mail with the
+report. The easiest way to do that: find the report in your "Sent" folder or the
+mailing list archive and reply to it using your mailer's "Reply-all" function.
+In that mail, use one of the following commands in a stand-alone paragraph (IOW:
+use blank lines to separate one or multiple of these commands from the rest of
+the mail's text).
+
+ * Update when the regression started to happen, for example after performing a
+ bisection::
+
+ #regzbot introduced: 1f2e3d4c5d
+
+ * Set or update the title::
+
+ #regzbot title: foo
+
+ * Monitor a discussion or bugzilla.kernel.org ticket where additional aspects
+ of the issue or a fix are discussed::
+
+ #regzbot monitor: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
+ #regzbot monitor: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
+
+ * Point to a place with further details of interest, like a mailing list post
+ or a ticket in a bug tracker that is slightly related, but about a different
+ topic::
+
+ #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
+
+ * Mark a regression as invalid::
+
+ #regzbot invalid: wasn't a regression, problem has always existed
+
+Regzbot supports a few other commands primarily used by developers or people
+tracking regressions. They and more details about the aforementioned regzbot
+commands can be found in the `getting started guide
+<https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_ and
+the `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
+for regzbot.
+
+..
+ end-of-content
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use "The Linux kernel developers" for author attribution and link
+ this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-regressions.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/admin-guide/security-bugs.rst b/Documentation/admin-guide/security-bugs.rst
deleted file mode 100644
index dcd6c93c7aac..000000000000
--- a/Documentation/admin-guide/security-bugs.rst
+++ /dev/null
@@ -1,89 +0,0 @@
-.. _securitybugs:
-
-Security bugs
-=============
-
-Linux kernel developers take security very seriously. As such, we'd
-like to know when a security bug is found so that it can be fixed and
-disclosed as quickly as possible. Please report security bugs to the
-Linux kernel security team.
-
-Contact
--------
-
-The Linux kernel security team can be contacted by email at
-<security@kernel.org>. This is a private list of security officers
-who will help verify the bug report and develop and release a fix.
-If you already have a fix, please include it with your report, as
-that can speed up the process considerably. It is possible that the
-security team will bring in extra help from area maintainers to
-understand and fix the security vulnerability.
-
-As it is with any bug, the more information provided the easier it
-will be to diagnose and fix. Please review the procedure outlined in
-admin-guide/reporting-bugs.rst if you are unclear about what
-information is helpful. Any exploit code is very helpful and will not
-be released without consent from the reporter unless it has already been
-made public.
-
-Disclosure and embargoed information
-------------------------------------
-
-The security list is not a disclosure channel. For that, see Coordination
-below.
-
-Once a robust fix has been developed, the release process starts. Fixes
-for publicly known bugs are released immediately.
-
-Although our preference is to release fixes for publicly undisclosed bugs
-as soon as they become available, this may be postponed at the request of
-the reporter or an affected party for up to 7 calendar days from the start
-of the release process, with an exceptional extension to 14 calendar days
-if it is agreed that the criticality of the bug requires more time. The
-only valid reason for deferring the publication of a fix is to accommodate
-the logistics of QA and large scale rollouts which require release
-coordination.
-
-While embargoed information may be shared with trusted individuals in
-order to develop a fix, such information will not be published alongside
-the fix or on any other disclosure channel without the permission of the
-reporter. This includes but is not limited to the original bug report
-and followup discussions (if any), exploits, CVE information or the
-identity of the reporter.
-
-In other words our only interest is in getting bugs fixed. All other
-information submitted to the security list and any followup discussions
-of the report are treated confidentially even after the embargo has been
-lifted, in perpetuity.
-
-Coordination
-------------
-
-Fixes for sensitive bugs, such as those that might lead to privilege
-escalations, may need to be coordinated with the private
-<linux-distros@vs.openwall.org> mailing list so that distribution vendors
-are well prepared to issue a fixed kernel upon public disclosure of the
-upstream fix. Distros will need some time to test the proposed patch and
-will generally request at least a few days of embargo, and vendor update
-publication prefers to happen Tuesday through Thursday. When appropriate,
-the security team can assist with this coordination, or the reporter can
-include linux-distros from the start. In this case, remember to prefix
-the email Subject line with "[vs]" as described in the linux-distros wiki:
-<http://oss-security.openwall.org/wiki/mailing-lists/distros#how-to-use-the-lists>
-
-CVE assignment
---------------
-
-The security team does not normally assign CVEs, nor do we require them
-for reports or fixes, as this can needlessly complicate the process and
-may delay the bug handling. If a reporter wishes to have a CVE identifier
-assigned ahead of public disclosure, they will need to contact the private
-linux-distros list, described above. When such a CVE identifier is known
-before a patch is provided, it is desirable to mention it in the commit
-message if the reporter agrees.
-
-Non-disclosure agreements
--------------------------
-
-The Linux kernel security team is not a formal body and therefore unable
-to enter any non-disclosure agreements.
diff --git a/Documentation/admin-guide/serial-console.rst b/Documentation/admin-guide/serial-console.rst
index 58b32832e50a..a3dfc2c66e01 100644
--- a/Documentation/admin-guide/serial-console.rst
+++ b/Documentation/admin-guide/serial-console.rst
@@ -33,8 +33,11 @@ The format of this option is::
9600n8. The maximum baudrate is 115200.
You can specify multiple console= options on the kernel command line.
-Output will appear on all of them. The last device will be used when
-you open ``/dev/console``. So, for example::
+
+The behavior is well defined when each device type is mentioned only once.
+In this case, the output will appear on all requested consoles, and the
+last device will be used when you open ``/dev/console``.
+So, for example::
console=ttyS1,9600 console=tty0
@@ -42,7 +45,34 @@ defines that opening ``/dev/console`` will get you the current foreground
virtual console, and kernel messages will appear on both the VGA
console and the 2nd serial port (ttyS1 or COM2) at 9600 baud.
-Note that you can only define one console per device type (serial, video).
+The behavior is more complicated when the same device type is defined multiple
+times. In this case, there are the following two rules:
+
+1. The output will appear only on the first device of each defined type.
+
+2. ``/dev/console`` will be associated with the first registered device,
+ where the registration order depends on how the kernel initializes the
+ various subsystems.
+
+ This rule also applies when the last console= parameter is not usable for
+ other reasons, for example, because of a typo or because the hardware is
+ not available.
+
+The result might be surprising. For example, the following two command
+lines have the same result::
+
+ console=ttyS1,9600 console=tty0 console=tty1
+ console=tty0 console=ttyS1,9600 console=tty1
+
+The kernel messages are printed only on ``tty0`` and ``ttyS1``, and
+``/dev/console`` gets associated with ``tty0``. This is because the kernel
+tries to register graphical consoles before serial ones. It does so
+because of the default behavior when no console device is specified,
+see below.
+
+Note that the last ``console=tty1`` parameter still makes a difference.
+The kernel command line is also used by systemd, which would use the last
+defined ``tty1`` as the login console.
If no console device is specified, the first device found capable of
acting as a system console will be used. At this time, the system
diff --git a/Documentation/admin-guide/spkguide.txt b/Documentation/admin-guide/spkguide.txt
new file mode 100644
index 000000000000..0d5965138f8f
--- /dev/null
+++ b/Documentation/admin-guide/spkguide.txt
@@ -0,0 +1,1625 @@
+
+The Speakup User's Guide
+For Speakup 3.1.2 and Later
+By Gene Collins
+Updated by others
+Last modified on Mon Sep 27 14:26:31 2010
+Document version 1.3
+
+Copyright (c) 2005 Gene Collins
+Copyright (c) 2008, 2023 Samuel Thibault
+Copyright (c) 2009, 2010 the Speakup Team
+
+Permission is granted to copy, distribute and/or modify this document
+under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no
+Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
+copy of the license is included in the section entitled "GNU Free
+Documentation License".
+
+Preface
+
+The purpose of this document is to familiarize users with the user
+interface to Speakup, a Linux Screen Reader. If you need instructions
+for installing or obtaining Speakup, visit the web site at
+http://linux-speakup.org/. Speakup is a set of patches to the standard
+Linux kernel source tree. It can be built as a series of modules, or as
+a part of a monolithic kernel. These details are beyond the scope of
+this manual, but the user may need to be aware of the module
+capabilities, depending on how your system administrator has installed
+Speakup. If Speakup is built as a part of a monolithic kernel, and the
+user is using a hardware synthesizer, then Speakup will be able to
+provide speech access from the time the kernel is loaded, until the time
+the system is shut down. This means that if you have obtained Linux
+installation media for a distribution which includes Speakup as a part
+of its kernel, you will be able, as a blind person, to install Linux
+with speech access unaided by a sighted person. Again, these details
+are beyond the scope of this manual, but the user should be aware of
+them. See the web site mentioned above for further details.
+
+1. Starting Speakup
+
+If your system administrator has installed Speakup to work with your
+specific synthesizer by default, then all you need to do to use Speakup
+is to boot your system, and Speakup should come up talking. This
+assumes of course that your synthesizer is a supported hardware
+synthesizer, and that it is either installed in or connected to your
+system, and is if necessary powered on.
+
+It is possible, however, that Speakup may have been compiled into the
+kernel with no default synthesizer. It is even possible that your
+kernel has been compiled with support for some of the supported
+synthesizers and not others. If you find that this is the case, and
+your synthesizer is supported but not available, complain to the person
+who compiled and installed your kernel. Or better yet, go to the web
+site, and learn how to patch Speakup into your own kernel source, and
+build and install your own kernel.
+
+If your kernel has been compiled with Speakup, and has no default
+synthesizer set, or you would like to use a different synthesizer than
+the default one, then you may issue the following command at the boot
+prompt of your boot loader.
+
+linux speakup.synth=ltlk
+
+This command would tell Speakup to look for and use a LiteTalk or
+DoubleTalk LT at boot up. You may replace the ltlk synthesizer keyword
+with the keyword for whatever synthesizer you wish to use. The
+speakup.synth parameter will accept the following keywords, provided
+that support for the related synthesizers has been built into the
+kernel.
+
+acntsa -- Accent SA
+acntpc -- Accent PC
+apollo -- Apollo
+audptr -- Audapter
+bns -- Braille 'n Speak
+dectlk -- DecTalk Express (old and new, db9 serial only)
+decext -- DecTalk (old) External
+dtlk -- DoubleTalk PC
+keypc -- Keynote Gold PC
+ltlk -- DoubleTalk LT, LiteTalk, or external Tripletalk (db9 serial only)
+spkout -- Speak Out
+txprt -- Transport
+dummy -- Plain text terminal
+
+Note: Speakup does * NOT * support the internal Tripletalk!
+
+Speakup does support two other synthesizers, but because they work in
+conjunction with other software, they must be loaded as modules after
+their related software is loaded, and so are not available at boot up.
+These are as follows:
+
+decpc -- DecTalk PC (not available at boot up)
+soft -- One of several software synthesizers (not available at boot up)
+
+By default speakup looks for the synthesizer on the ttyS0 serial port. This can
+be changed with the device parameter of the modules, for instance for
+DoubleTalk LT:
+
+speakup_ltlk.dev=ttyUSB0
+
+See the sections on loading modules and software synthesizers later in
+this manual for further details. It should be noted here that the
+speakup.synth boot parameter will have no effect if Speakup has been
+compiled as modules. In order for Speakup modules to be loaded during
+the boot process, such action must be configured by your system
+administrator. This will mean that you will hear some, but not all, of
+the bootup messages.
+
+2. Basic operation
+
+Once you have booted the system, and if necessary, have supplied the
+proper bootup parameter for your synthesizer, Speakup will begin
+talking as soon as the kernel is loaded. In fact, it will talk a lot!
+It will speak all the boot up messages that the kernel prints on the
+screen during the boot process. This is because Speakup is not a
+separate screen reader, but is actually built into the operating
+system. Since almost all console applications must print text on the
+screen using the kernel, and must get their keyboard input through the
+kernel, they are automatically handled properly by Speakup. There are a
+few exceptions, but we'll come to those later.
+
+Note: In this guide I will refer to the numeric keypad as the keypad.
+This is done because the speakupmap.map file referred to later in this
+manual uses the term keypad instead of numeric keypad. Also I'm lazy
+and would rather only type one word. So keypad it is. Got it? Good.
+
+Most of the Speakup review keys are located on the keypad at the far
+right of the keyboard. The numlock key should be off, in order for these
+to work. If you toggle the numlock on, the keypad will produce numbers,
+which is exactly what you want for spreadsheets and such. For the
+purposes of this guide, you should have the numlock turned off, which is
+its default state at bootup.
+
+You probably won't want to listen to all the bootup messages every time
+you start your system, though it's a good idea to listen to them at
+least once, just so you'll know what kind of information is available to
+you during the boot process. You can always review these messages after
+bootup with the command:
+
+dmesg | more
+
+In order to speed the boot process, and to silence the speaking of the
+bootup messages, just press the keypad enter key. This key is located
+in the bottom right corner of the keypad. Speakup will shut up and stay
+that way, until you press another key.
+
+You can check to see if the boot process has completed by pressing the 8
+key on the keypad, which reads the current line. This also has the
+effect of starting Speakup talking again, so you can press keypad enter
+to silence it again if the boot process has not completed.
+
+When the boot process is complete, you will arrive at a "login" prompt.
+At this point, you'll need to type in your user id and password, as
+provided by your system administrator. You will hear Speakup speak the
+letters of your user id as you type it, but not the password. This is
+because the password is not displayed on the screen for security
+reasons. This has nothing to do with Speakup; it's a Linux security
+feature.
+
+Once you've logged in, you can run any Linux command or program which is
+allowed by your user id. Normal users will not be able to run programs
+which require root privileges.
+
+When you are running a program or command, Speakup will automatically
+speak new text as it arrives on the screen. You can at any time silence
+the speech with keypad enter, or use any of the Speakup review keys.
+
+Here are some basic Speakup review keys, and a short description of what
+they do.
+
+keypad 1 -- read previous character
+keypad 2 -- read current character (pressing keypad 2 twice rapidly will speak
+ the current character phonetically)
+keypad 3 -- read next character
+keypad 4 -- read previous word
+keypad 5 -- read current word (press twice rapidly to spell the current word)
+keypad 6 -- read next word
+keypad 7 -- read previous line
+keypad 8 -- read current line (press twice rapidly to hear how much the
+ text on the current line is indented)
+keypad 9 -- read next line
+keypad period -- speak current cursor position and announce current
+ virtual console
+
+It's also worth noting that the insert key on the keypad is mapped
+as the speakup key. Instead of pressing and releasing this key, as you
+do under DOS or Windows, you hold it like a shift key, and press other
+keys in combination with it. For example, repeatedly holding keypad
+insert, from now on called speakup, and keypad enter will toggle the
+speaking of new text on the screen on and off. This is not the same as
+just pressing keypad enter by itself, which just silences the speech
+until you hit another key. When you hit speakup plus keypad enter,
+Speakup will say, "You turned me off.", or "Hey, that's better." When
+Speakup is turned off, no new text on the screen will be spoken. You
+can still use the reading controls to review the screen however.
+
+3. Using the Speakup Help System
+
+In order to enter the Speakup help system, press and hold the speakup
+key (remember that this is the keypad insert key), and press the f1 key.
+You will hear the message:
+
+"Press space to leave help, cursor up or down to scroll, or a letter to
+go to commands in list."
+
+When you press the spacebar to leave the help system, you will hear:
+
+"Leaving help."
+
+While you are in the Speakup help system, you can scroll up or down
+through the list of available commands using the cursor keys. The list
+of commands is arranged in alphabetical order. If you wish to jump to
+commands in a specific part of the alphabet, you may press the letter of
+the alphabet you wish to jump to.
+
+You can also just explore by typing keyboard keys. Pressing keys will
+cause Speakup to speak the command associated with that key. For
+example, if you press the keypad 8 key, you will hear:
+
+"Keypad 8 is line, say current."
+
+You'll notice that some commands do not have keys assigned to them.
+This is because they are very infrequently used commands, and are also
+accessible through the sys system. We'll discuss the sys system later
+in this manual.
+
+You'll also notice that some commands have two keys assigned to them.
+This is because Speakup has a built in set of alternative key bindings
+for laptop users. The alternate speakup key is the caps lock key. You
+can press and hold the caps lock key, while pressing an alternate
+speakup command key to activate the command. On most laptops, the
+numeric keypad is defined as the keys in the j k l area of the keyboard.
+
+There is usually a function key which turns this keypad function on and
+off, and some other key which controls the numlock state. Toggling the
+keypad functionality on and off can become a royal pain. So, Speakup
+gives you a simple way to get at an alternative set of key mappings for
+your laptop. These are also available by default on desktop systems,
+because Speakup does not know whether it is running on a desktop or
+laptop. So you may choose which set of Speakup keys to use. Some
+system administrators may have chosen to compile Speakup for a desktop
+system without this set of alternate key bindings, but these details are
+beyond the scope of this manual. To use the caps lock for its normal
+purpose, hold the shift key while toggling the caps lock on and off. We
+should note here, that holding the caps lock key and pressing the z key
+will toggle the alternate j k l keypad on and off.
+
+4. Keys and Their Assigned Commands
+
+In this section, we'll go through a list of all the speakup keys and
+commands. You can also get a list of commands and assigned keys from
+the help system.
+
+The following list was taken from the speakupmap.map file. Key
+assignments are on the left of the equal sign, and the associated
+Speakup commands are on the right. The designation "spk" means to press
+and hold the speakup key, a.k.a. keypad insert, a.k.a. caps lock, while
+pressing the other specified key.
+
+spk key_f9 = punc_level_dec
+spk key_f10 = punc_level_inc
+spk key_f11 = reading_punc_dec
+spk key_f12 = reading_punc_inc
+spk key_1 = vol_dec
+spk key_2 = vol_inc
+spk key_3 = pitch_dec
+spk key_4 = pitch_inc
+spk key_5 = rate_dec
+spk key_6 = rate_inc
+key_kpasterisk = toggle_cursoring
+spk key_kpasterisk = speakup_goto
+spk key_f1 = speakup_help
+spk key_f2 = set_win
+spk key_f3 = clear_win
+spk key_f4 = enable_win
+spk key_f5 = edit_some
+spk key_f6 = edit_most
+spk key_f7 = edit_delim
+spk key_f8 = edit_repeat
+shift spk key_f9 = edit_exnum
+ key_kp7 = say_prev_line
+spk key_kp7 = left_edge
+ key_kp8 = say_line
+double key_kp8 = say_line_indent
+spk key_kp8 = say_from_top
+ key_kp9 = say_next_line
+spk key_kp9 = top_edge
+ key_kpminus = speakup_parked
+spk key_kpminus = say_char_num
+ key_kp4 = say_prev_word
+spk key_kp4 = say_from_left
+ key_kp5 = say_word
+double key_kp5 = spell_word
+spk key_kp5 = spell_phonetic
+ key_kp6 = say_next_word
+spk key_kp6 = say_to_right
+ key_kpplus = say_screen
+spk key_kpplus = say_win
+ key_kp1 = say_prev_char
+spk key_kp1 = right_edge
+ key_kp2 = say_char
+spk key_kp2 = say_to_bottom
+double key_kp2 = say_phonetic_char
+ key_kp3 = say_next_char
+spk key_kp3 = bottom_edge
+ key_kp0 = spk_key
+ key_kpdot = say_position
+spk key_kpdot = say_attributes
+key_kpenter = speakup_quiet
+spk key_kpenter = speakup_off
+key_sysrq = speech_kill
+ key_kpslash = speakup_cut
+spk key_kpslash = speakup_paste
+spk key_pageup = say_first_char
+spk key_pagedown = say_last_char
+key_capslock = spk_key
+ spk key_z = spk_lock
+key_leftmeta = spk_key
+ctrl spk key_0 = speakup_goto
+spk key_u = say_prev_line
+spk key_i = say_line
+double spk key_i = say_line_indent
+spk key_o = say_next_line
+spk key_minus = speakup_parked
+shift spk key_minus = say_char_num
+spk key_j = say_prev_word
+spk key_k = say_word
+double spk key_k = spell_word
+spk key_l = say_next_word
+spk key_m = say_prev_char
+spk key_comma = say_char
+double spk key_comma = say_phonetic_char
+spk key_dot = say_next_char
+spk key_n = say_position
+ ctrl spk key_m = left_edge
+ ctrl spk key_y = top_edge
+ ctrl spk key_dot = right_edge
+ctrl spk key_p = bottom_edge
+spk key_apostrophe = say_screen
+spk key_h = say_from_left
+spk key_y = say_from_top
+spk key_semicolon = say_to_right
+spk key_p = say_to_bottom
+spk key_slash = say_attributes
+ spk key_enter = speakup_quiet
+ ctrl spk key_enter = speakup_off
+ spk key_9 = speakup_cut
+spk key_8 = speakup_paste
+shift spk key_m = say_first_char
+ ctrl spk key_semicolon = say_last_char
+spk key_r = read_all_doc
+
+5. The Speakup Sys System
+
+The Speakup screen reader also creates a speakup subdirectory as a part
+of the sys system.
+
+As a convenience, run as root
+
+ln -s /sys/accessibility/speakup /speakup
+
+to directly access speakup parameters from /speakup.
+You can see these entries by typing the command:
+
+ls -1 /speakup/*
+
+If you issue the above ls command, you will get back something like
+this:
+
+/speakup/attrib_bleep
+/speakup/bell_pos
+/speakup/bleep_time
+/speakup/bleeps
+/speakup/cursor_time
+/speakup/delimiters
+/speakup/ex_num
+/speakup/key_echo
+/speakup/keymap
+/speakup/no_interrupt
+/speakup/punc_all
+/speakup/punc_level
+/speakup/punc_most
+/speakup/punc_some
+/speakup/reading_punc
+/speakup/repeats
+/speakup/say_control
+/speakup/say_word_ctl
+/speakup/silent
+/speakup/spell_delay
+/speakup/synth
+/speakup/synth_direct
+/speakup/version
+
+/speakup/i18n:
+announcements
+characters
+chartab
+colors
+ctl_keys
+formatted
+function_names
+key_names
+states
+
+/speakup/soft:
+caps_start
+caps_stop
+delay_time
+direct
+freq
+full_time
+jiffy_delta
+pitch
+inflection
+punct
+rate
+tone
+trigger_time
+voice
+vol
+
+Notice the two subdirectories of /speakup: /speakup/i18n and
+/speakup/soft.
+The i18n subdirectory is described in a later section.
+The files under /speakup/soft represent settings that are specific to the
+driver for the software synthesizer. If you use the LiteTalk, your
+synthesizer-specific settings would be found in /speakup/ltlk. In other words,
+a subdirectory named /speakup/KWD is created to hold parameters specific
+to the device whose keyword is KWD.
+These parameters include volume, rate, pitch, and others.
+
+In addition to using the Speakup hot keys to change such things as
+volume, pitch, and rate, you can also echo values to the appropriate
+entry in the /speakup directory. This is very useful, since it
+lets you control Speakup parameters from within a script. How you
+would write such scripts is somewhat beyond the scope of this manual,
+but I will include a couple of simple examples here to give you a
+general idea of what such scripts can do.
+
+Suppose for example, that you wanted to control both the punctuation
+level and the reading punctuation level at the same time. For
+simplicity, we'll call them punc0, punc1, punc2, and punc3. The scripts
+might look something like this:
+
+#!/bin/bash
+# punc0
+# set punc and reading punc levels to 0
+echo 0 >/speakup/punc_level
+echo 0 >/speakup/reading_punc
+echo Punctuation level set to 0.
+
+#!/bin/bash
+# punc1
+# set punc and reading punc levels to 1
+echo 1 >/speakup/punc_level
+echo 1 >/speakup/reading_punc
+echo Punctuation level set to 1.
+
+#!/bin/bash
+# punc2
+# set punc and reading punc levels to 2
+echo 2 >/speakup/punc_level
+echo 2 >/speakup/reading_punc
+echo Punctuation level set to 2.
+
+#!/bin/bash
+# punc3
+# set punc and reading punc levels to 3
+echo 3 >/speakup/punc_level
+echo 3 >/speakup/reading_punc
+echo Punctuation level set to 3.
+
+If you were to store these four small scripts in a directory in your
+path, perhaps /usr/local/bin, and set the permissions to 755 with the
+chmod command, then you could change the default reading punc and
+punctuation levels at the same time by issuing just one command. For
+example, if you were to execute the punc3 command at your shell prompt,
+then the reading punc and punc level would both get set to 3.
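+
+For example, assuming the four scripts are in your current directory, you
+could install them with commands like these, run as root:
+
+cp punc0 punc1 punc2 punc3 /usr/local/bin
+cd /usr/local/bin
+chmod 755 punc0 punc1 punc2 punc3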
+
+I should note that the above scripts were written to work with bash, but
+regardless of which shell you use, you should be able to do something
+similar.
+
+The Speakup sys system also has another interesting use. You can echo
+Speakup parameters into the sys system in a script during system
+startup, and speakup will return to your preferred parameters every time
+the system is rebooted.
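+
+Such a startup script is not shipped with Speakup; what follows is only a small
+example of what one might look like. The values and the KWD keyword are just
+placeholders, so adjust them to your synthesizer and taste:
+
+#!/bin/bash
+# restore preferred Speakup settings at boot
+# assumes the /speakup symlink described earlier in this section exists
+echo 2 >/speakup/punc_level
+echo 2 >/speakup/reading_punc
+echo 5 >/speakup/KWD/rate
+echo 5 >/speakup/KWD/vol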
+
+Most of the Speakup sys parameters can be manipulated by a regular user
+on the system. However, there are a few parameters that are dangerous
+enough that they should only be manipulated by the root user on your
+system. There are even some parameters that are read only, and cannot
+be written to at all. For example, the version entry in the Speakup
+sys system is read only. This is because there is no reason for a user
+to tamper with the version number which is reported by Speakup. Doing
+an ls -l on /speakup/version will return this:
+
+-r--r--r-- 1 root root 0 Mar 21 13:46 /speakup/version
+
+As you can see, the version entry in the Speakup sys system is read
+only, is owned by root, and belongs to the root group. Doing a cat of
+/speakup/version will display the Speakup version number, like
+this:
+
+cat /speakup/version
+Speakup v-2.00 CVS: Thu Oct 21 10:38:21 EDT 2004
+synth dtlk version 1.1
+
+The display shows the Speakup version number, along with the version
+number of the driver for the current synthesizer.
+
+Looking at entries in the Speakup sys system can be useful in many
+ways. For example, you might wish to know what level your volume is set
+at. You could type:
+
+cat /speakup/KWD/vol
+# Replace KWD with the keyword for your synthesizer, E.G., ltlk for LiteTalk.
+5
+
+The number five which comes back is the level at which the synthesizer
+volume is set.
+
+All the entries in the Speakup sys system are readable, some are
+writable by root only, and some are writable by everyone. Unless you
+know what you are doing, you should probably leave the ones that are
+writable by root only alone. Most of the names are self explanatory.
+Vol for controlling volume, pitch for pitch, inflection for pitch range, rate
+for controlling speaking rate, etc. If you find one you aren't sure about, you
+can post a query on the Speakup list.
+
+6. Changing Synthesizers
+
+It is possible to change to a different synthesizer while speakup is
+running. In other words, it is not necessary to reboot the system
+in order to use a different synthesizer. You can simply echo the
+synthesizer keyword to the /speakup/synth sys entry.
+Depending on your situation, you may wish to echo none to the synth
+sys entry, to disable speech while one synthesizer is disconnected and
+a second one is connected in its place. Then echo the keyword for the
+new synthesizer into the synth sys entry in order to start speech
+with the newly connected synthesizer. See the list of synthesizer
+keywords in section 1 to find the keyword which matches your synth.
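+
+For example, to switch from a DoubleTalk LT to a DecTalk Express, you might
+issue the following commands as root (the keywords here are just examples):
+
+echo none >/speakup/synth
+# disconnect the old synthesizer and connect the new one, then:
+echo dectlk >/speakup/synth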
+
+7. Loading modules
+
+As mentioned earlier, Speakup can either be completely compiled into the
+kernel, with the exception of the help module, or it can be compiled as
+a series of modules. When compiled as modules, Speakup will only be
+able to speak some of the bootup messages if your system administrator
+has configured the system to load the modules at boot time. The modules
+can be loaded after the file systems have been checked and mounted, or
+from an initrd. There is a third possibility. Speakup can be compiled
+with some components built into the kernel, and others as modules. As
+we'll see in the next section, this is particularly useful when you are
+working with software synthesizers.
+
+If Speakup is completely compiled as modules, then you must use the
+modprobe command to load Speakup. You do this by loading the module for
+the synthesizer driver you wish to use. The driver modules are all
+named speakup_<keyword>, where <keyword> is the keyword for the
+synthesizer you want. So, in order to load the driver for the DecTalk
+Express, you would type the following command:
+
+modprobe speakup_dectlk
+
+Issuing this command would load the DecTalk Express driver and all other
+related Speakup modules necessary to get Speakup up and running.
+
+To completely unload Speakup, again presuming that it is entirely built
+as modules, you would give the command:
+
+modprobe -r speakup_dectlk
+
+The above command assumes you were running a DecTalk Express. If you
+were using a different synth, then you would substitute its keyword in
+place of dectlk.
+
+If you have multiple drivers loaded, you need to unload all of them, in
+order to completely unload Speakup.
+For example, if you have loaded both the dectlk and ltlk drivers, use the
+command:
+modprobe -r speakup_dectlk speakup_ltlk
+
+You cannot unload the driver for software synthesizers when a user-space
+daemon is using /dev/softsynth. First, kill the daemon. Next, remove
+the driver with the command:
+modprobe -r speakup_soft
+
+Now, suppose we have a situation where the main Speakup component
+is built into the kernel, and some or all of the drivers are built as
+modules. Since the main part of Speakup is compiled into the kernel, a
+partial Speakup sys system has been created which we can take advantage
+of by simply echoing the synthesizer keyword into the
+/speakup/synth sys entry. This will cause the kernel to
+automatically load the appropriate driver module, and start Speakup
+talking. To switch to another synth, just echo a new keyword to the
+synth sys entry. For example, to load the DoubleTalk LT driver,
+you would type:
+
+echo ltlk >/speakup/synth
+
+You can use the modprobe -r command to unload driver modules, regardless
+of whether the main part of Speakup has been built into the kernel or
+not.
+
+8. Using Software Synthesizers
+
+Using a software synthesizer requires that some other software be
+installed and running on your system. For this reason, software
+synthesizers are not available for use at bootup, or during a system
+installation process.
+There are two freely-available solutions for software speech: Espeakup and
+Speech Dispatcher.
+These are described in subsections 8.1 and 8.2, respectively.
+
+During the rest of this section, we assume that speakup_soft is either
+built in to your kernel, or loaded as a module.
+
+If your system does not have udev installed, then before you can use a
+software synthesizer, you must have created the /dev/softsynth device.
+If you have not already done so, issue the following commands as root:
+
+cd /dev
+mknod softsynth c 10 26
+
+While we are at it, we might just as well create the /dev/synth device,
+which can be used to let user space programs send information to your
+synthesizer. To create /dev/synth, change to the /dev directory, and
+issue the following command as root:
+
+mknod synth c 10 25
+
+8.1. Espeakup
+
+Espeakup is a connector between Speakup and the eSpeak software synthesizer.
+Espeakup may already be available as a package for your distribution
+of Linux. If it is not packaged, you need to install it manually.
+You can find it in the contrib/ subdirectory of the Speakup sources.
+The filename is espeakup-$VERSION.tar.bz2, where $VERSION
+depends on the current release of Espeakup. The Speakup 3.1.2 source
+ships with version 0.71 of Espeakup.
+The README file included with the Espeakup sources describes the process
+of manual installation.
+
+Assuming that Espeakup is installed, either by the user or by the distributor,
+follow these steps to use it.
+
+Tell Speakup to use the "soft driver:
+echo soft > /speakup/synth
+
+Finally, start the espeakup program. There are two ways to do it.
+Both require root privileges.
+
+If Espeakup was installed as a package for your Linux distribution,
+you probably have a distribution-specific script that controls the operation
+of the daemon. Look for a file named espeakup under /etc/init.d or
+/etc/rc.d. Execute the following command with root privileges:
+/etc/init.d/espeakup start
+Replace init.d with rc.d, if your distribution uses scripts located under
+/etc/rc.d.
+Your distribution will also have a procedure for starting daemons at
+boot-time, so it is possible to have software speech as soon as user-space
+daemons are started by the bootup scripts.
+These procedures are not described in this document.
+
+If you built Espeakup manually, the "make install" step placed the binary
+under /usr/bin.
+Run the following command as root:
+/usr/bin/espeakup
+Espeakup should start speaking.
+
+8.2. Speech Dispatcher
+
+For this option, you must have a package called
+Speech Dispatcher running on your system, and it must be configured to
+work with one of its supported software synthesizers.
+
+Two open source synthesizers you might use are Flite and Festival. You
+might also choose to purchase the Software DecTalk from Fonix Sales Inc.
+If you run a Google search for Fonix, you'll find their web site.
+
+You can obtain a copy of Speech Dispatcher from free(b)soft at
+http://www.freebsoft.org/. Follow the installation instructions that
+come with Speech Dispatcher in order to install and configure Speech
+Dispatcher. You can check out the web site for your Linux distribution
+in order to get a copy of either Flite or Festival. Your Linux
+distribution may also have a precompiled Speech Dispatcher package.
+
+Once you've installed, configured, and tested Speech Dispatcher with your
+chosen software synthesizer, you still need one more piece of software
+in order to make things work. You need a package called speechd-up.
+You get it from the free(b)soft web site mentioned above. After you've
+compiled and installed speechd-up, you are almost ready to begin using
+your software synthesizer.
+
+Now you can begin using your software synthesizer. In order to do so,
+echo the soft keyword to the synth sys entry like this:
+
+echo soft >/speakup/synth
+
+Next run the speechd_up command like this:
+
+speechd_up &
+
+Your synth should now start talking, and you should be able to adjust
+the pitch, rate, etc.
+
+9. Using The DecTalk PC Card
+
+The DecTalk PC card is an ISA card that is inserted into one of the ISA
+slots in your computer. It requires that the DecTalk PC software be
+installed on your computer, and that the software be loaded onto the
+DecTalk PC card before it can be used.
+
+You can get the dec_pc.tgz file from the linux-speakup.org site. The
+dec_pc.tgz file is in the ~ftp/pub/linux/speakup directory.
+
+After you have downloaded the dec_pc.tgz file, untar it in your home
+directory, and read the Readme file in the newly created dec_pc
+directory.
+
+The easiest way to get the software working is to copy the entire dec_pc
+directory into /usr/local/lib. To do this, su to root in your home
+directory, and issue the command:
+
+cp -r dec_pc /usr/local/lib
+
+You will need to copy the dtload command from the dec_pc directory to a
+directory in your path. Either /usr/bin or /usr/local/bin is a good
+choice.
+
+You can now run the dtload command in order to load the DecTalk PC
+software onto the card. After you have done this, echo the decpc
+keyword to the synth entry in the sys system like this:
+
+echo decpc >/speakup/synth
+
+Your DecTalk PC should start talking, and then you can adjust the pitch,
+rate, volume, voice, etc. The voice entry in the Speakup sys system
+will accept a number from 0 through 7 for the DecTalk PC synthesizer,
+which will give you access to some of the DecTalk voices.
+
+10. Using Cursor Tracking
+
+In Speakup version 2.0 and later, cursor tracking is turned on by
+default. This means that when you are using an editor, Speakup will
+automatically speak characters as you move left and right with the
+cursor keys, and lines as you move up and down with the cursor keys.
+This is the traditional sort of cursor tracking.
+Recent versions of Speakup provide two additional ways to control the
+text that is spoken when the cursor is moved:
+"highlight tracking" and "read window."
+They are described later in this section.
+Sometimes, these modes get in your way, so you can disable cursor tracking
+altogether.
+
+You may select among the various forms of cursor tracking using the keypad
+asterisk key.
+Each time you press this key, a new mode is selected, and Speakup speaks
+the name of the new mode. The names for the four possible states of cursor
+tracking are: "cursoring on", "highlight tracking", "read window",
+and "cursoring off." The keypad asterisk key moves through the list of
+modes in a circular fashion.
+
+If highlight tracking is enabled, Speakup tracks highlighted text,
+rather than the cursor itself. When you move the cursor with the arrow keys,
+Speakup speaks the currently highlighted information.
+This is useful when moving through various menus and dialog boxes.
+If cursor tracking isn't helping you while navigating a menu,
+try highlight tracking.
+
+With the "read window" variety of cursor tracking, you can limit the text
+that Speakup speaks by specifying a window of interest on the screen.
+See section 15 for a description of the process of defining windows.
+When you move the cursor via the arrow keys, Speakup only speaks
+the contents of the window. This is especially helpful when you are hearing
+superfluous speech. Consider the following example.
+
+Suppose that you are at a shell prompt. You use bash, and you want to
+explore your command history using the up and down arrow keys. If you
+have enabled cursor tracking, you will hear two pieces of information.
+Speakup speaks both your shell prompt and the current entry from the
+command history. You may not want to hear the prompt repeated
+each time you move, so you can silence it by specifying a window. Find
+the last line of text on the screen. Clear the current window by pressing
+the key combination speakup f3. Use the review cursor to find the first
+character that follows your shell prompt. Press speakup + f2 twice, to
+define a one-line window. The boundaries of the window are the
+character following the shell prompt and the end of the line. Now, cycle
+through the cursor tracking modes using keypad asterisk, until Speakup
+says "read window." Move through your history using your arrow keys.
+You will notice that Speakup no longer speaks the redundant prompt.
+
+Some folks like to turn cursor tracking off while they are using the
+lynx web browser. You definitely want to turn cursor tracking off when
+you are using the alsamixer application. Otherwise, you won't be able
+to hear your mixer settings while you are using the arrow keys.
+
+11. Cut and Paste
+
+One of Speakup's more useful features is the ability to cut and paste
+text on the screen. This means that you can capture information from a
+program, and paste that captured text into a different place in the
+program, or into an entirely different program, which may even be
+running on a different console.
+
+For example, in this manual, we have made references to several web
+sites. It would be nice if you could cut and paste these urls into your
+web browser. Speakup does this quite nicely. Suppose you wanted to
+paste the following url into your browser:
+
+http://linux-speakup.org/
+
+Use the speakup review keys to position the reading cursor on the first
+character of the above url. When the reading cursor is in position,
+press the keypad slash key once. Speakup will say, "mark". Next,
+position the reading cursor on the rightmost character of the above
+url. Press the keypad slash key once again to actually cut the text
+from the screen. Speakup will say, "cut". Although we call this
+cutting, Speakup does not actually delete the cut text from the screen.
+It makes a copy of the text in a special buffer for later pasting.
+
+Now that you have the url cut from the screen, you can paste it into
+your browser, or even paste the url on a command line as an argument to
+your browser.
+
+Suppose you want to start lynx and go to the Speakup site.
+
+You can switch to a different console with the alt left and right
+arrows, or you can switch to a specific console by typing alt and a
+function key. These are not Speakup commands, just standard Linux
+console capabilities.
+
+Once you've changed to an appropriate console, and are at a shell prompt,
+type the word lynx, followed by a space. Now press and hold the speakup
+key, while you type the keypad slash character. The url will be pasted
+onto the command line, just as though you had typed it in. Press the
+enter key to execute the command.
+
+The paste buffer will continue to hold the cut information, until a new
+mark and cut operation is carried out. This means you can paste the cut
+information as many times as you like before doing another cut
+operation.
+
+You are not limited to cutting and pasting only one line on the screen.
+You can also cut and paste rectangular regions of the screen. Just
+position the reading cursor at the top left corner of the text to be
+cut, mark it with the keypad slash key, then position the reading cursor
+at the bottom right corner of the region to be cut, and cut it with the
+keypad slash key.
+
+12. Changing the Pronunciation of Characters
+
+Through the /speakup/i18n/characters sys entry, Speakup gives you the
+ability to change how Speakup pronounces a given character. You could,
+for example, change how some punctuation characters are spoken. You can
+even change how Speakup will pronounce certain letters.
+
+You may, for example, wish to change how Speakup pronounces the z
+character. The author of Speakup, Kirk Reiser, is Canadian, and thus
+believes that the z should be pronounced zed. If you are an American,
+you might wish to use the zee pronunciation instead of zed. You can
+change the pronunciation of both the upper and lower case z with the
+following two commands:
+
+echo 90 zee >/speakup/i18n/characters
+echo 122 zee >/speakup/i18n/characters
+
+Let's examine the parts of the two previous commands. They are issued
+at the shell prompt, and could be placed in a startup script.
+
+The word echo tells the shell that you want to have it display the
+string of characters that follow the word echo. If you were to just
+type:
+
+echo hello
+
+You would get the word hello printed on your screen as soon as you
+pressed the enter key. In this case, we are echoing strings that we
+want to be redirected into the sys system.
+
+The numbers 90 and 122 in the above echo commands are the ascii numeric
+values for the upper and lower case z, the characters we wish to change.
+
+The string zee is the pronunciation that we want Speakup to use for the
+upper and lower case z.
+
+The > symbol redirects the output of the echo command to a file, just
+like in DOS, or at the Windows command prompt.
+
+And finally, /speakup/i18n/characters is the file entry in the sys system
+where we want the output to be directed. Speakup looks at the numeric
+value of the character we want to change, and inserts the pronunciation
+string into an internal table.
+
+You can look at the whole table with the following command:
+
+cat /speakup/i18n/characters
+
+Speakup will then print out the entire character pronunciation table. I
+won't display it here, but leave you to look at it at your convenience.
+
+13. Mapping Keys
+
+Speakup has the capability of allowing you to assign or "map" keys to
+internal Speakup commands. This section necessarily assumes you have a
+Linux kernel source tree installed, and that it has been patched and
+configured with Speakup. How you do this is beyond the scope of this
+manual. For this information, visit the Speakup web site at
+http://linux-speakup.org/. The reason you'll need the kernel source
+tree patched with Speakup is that the genmap utility you'll need for
+processing keymaps is in the
+/usr/src/linux-<version_number>/drivers/char/speakup directory. The
+<version_number> in the above directory path is the version number of
+the Linux source tree you are working with.
+
+So ok, you've gone off and gotten your kernel source tree, and patched
+and configured it. Now you can start manipulating keymaps.
+
+You can either use the
+/usr/src/linux-<version_number>/drivers/char/speakup/speakupmap.map file
+included with the Speakup source, or you can cut and paste the copy in
+section 4 into a separate file. If you use the one in the Speakup
+source tree, make sure you make a backup of it before you start making
+changes. You have been warned!
+
+Suppose that you want to swap the key assignments for the Speakup
+say_last_char and the Speakup say_first_char commands. The
+speakupmap.map lists the key mappings for these two commands as follows:
+
+spk key_pageup = say_first_char
+spk key_pagedown = say_last_char
+
+You can edit your copy of the speakupmap.map file and swap the command
+names on the right side of the = (equals) sign. You did make a backup,
+right? The new keymap lines would look like this:
+
+spk key_pageup = say_last_char
+spk key_pagedown = say_first_char
+
+After you edit your copy of the speakupmap.map file, save it under a new
+file name, perhaps newmap.map. Then exit your editor and return to the
+shell prompt.
+
+You are now ready to load your keymap with your swapped key assignments.
+ Assuming that you saved your new keymap as the file newmap.map, you
+would load your keymap into the sys system like this:
+
+/usr/src/linux-<version_number>/drivers/char/speakup/genmap newmap.map
+>/speakup/keymap
+
+Remember to substitute your kernel version number for the
+<version_number> in the above command. Also note that although the
+above command wrapped onto two lines in this document, you should type
+it all on one line.
+
+Your say first and say last characters should now be swapped. Pressing
+speakup pagedown should read you the first non-whitespace character on
+the line your reading cursor is in, and pressing speakup pageup should
+read you the last character on the line your reading cursor is in.
+
+You should note that these new mappings will only stay in effect until
+you reboot, or until you load another keymap.
+
+One final warning. If you try to load a partial map, you will quickly
+find that all the mappings you didn't include in your file got deleted
+from the working map. Be extremely careful, and always make a backup!
+You have been warned!
+
+14. Internationalizing Speakup
+
+Speakup indicates various conditions to the user by speaking messages.
+For instance, when you move to the left edge of the screen with the
+review keys, Speakup says, "left."
+Prior to version 3.1.0 of Speakup, all of these messages were in English,
+and they could not be changed. If you used a non-English synthesizer,
+you still heard English messages, such as "left" and "cursoring on."
+In version 3.1.0 or higher, one may load translations for the various
+messages via the /sys filesystem.
+
+The directory /speakup/i18n contains several collections of messages.
+Each group of messages is stored in its own file.
+The following section lists all of these files, along with a brief description
+of each.
+
+14.1. Files Under the i18n Subdirectory
+
+* announcements:
+This file contains various general announcements, most of which cannot
+be categorized. You will find messages such as "You killed Speakup",
+"I'm alive", "leaving help", "parked", "unparked", and others.
+You will also find the names of the screen edges and cursor tracking modes
+here.
+
+* characters:
+See section 12 for a description of this file.
+
+* chartab:
+See section 12. Unlike the rest of the files in the i18n subdirectory,
+this one does not contain messages to be spoken.
+
+* colors:
+When you use the "say attributes" function, Speakup says the name of the
+foreground and background colors. These names come from the i18n/colors
+file.
+
+* ctl_keys:
+Here, you will find names of control keys. These are used with Speakup's
+say_control feature.
+
+* formatted:
+This group of messages contains embedded formatting codes, to specify
+the type and width of displayed data. If you change these, you must
+preserve all of the formatting codes, and they must appear in the order
+used by the default messages.
+
+* function_names:
+Here, you will find a list of names for Speakup functions. These are used
+by the help system. For example, suppose that you have activated help mode,
+and you pressed keypad 3. Speakup says:
+"keypad 3 is character, say next."
+The message "character, say next" names a Speakup function, and it
+comes from this function_names file.
+
+* key_names:
+Again, key_names is used by Speakup's help system. In the previous
+example, Speakup said that you pressed "keypad 3."
+This name came from the key_names file.
+
+* states:
+This file contains names for key states.
+Again, these are part of the help system. For instance, if you had pressed
+speakup + keypad 3, you would hear:
+"speakup keypad 3 is go to bottom edge."
+The speakup key is depressed, so the name of the key state is speakup.
+This part of the message comes from the states collection.
+
+14.2. Changing language
+
+14.2.1. Loading Your Own Messages
+
+The files under the i18n subdirectory all follow the same format.
+They consist of lines, with one message per line.
+Each message is represented by a number, followed by the text of the message.
+The number is the position of the message in the given collection.
+For example, if you view the file /speakup/i18n/colors, you will see the
+following list:
+
+0 black
+1 blue
+2 green
+3 cyan
+4 red
+5 magenta
+6 yellow
+7 white
+8 grey
+
+You can change one message, or you can change a whole group.
+To load a whole collection of messages from a new source, simply use
+the cp command:
+cp ~/my_colors /speakup/i18n/colors
+You can change an individual message with the echo command,
+as shown in the following example.
+
+The Spanish name for the color blue is azul.
+Looking at the colors file, we see that the name "blue" is at position 1
+within the colors group. Let's change blue to azul:
+echo '1 azul' > /speakup/i18n/colors
+The next time that Speakup says message 1 from the colors group, it will
+say "azul", rather than "blue."
+
+14.2.2. Choose a language
+
+In the future, translations into various languages will be made available,
+and most users will just load the files necessary for their language. So far,
+only French is available in addition to the native Canadian English.
+
+French is only available after you are logged in.
+
+Canadian English is the default language. To switch to another language,
+download the source of Speakup and untar it in your home directory. The
+following command should let you do this:
+
+tar xvjf speakup-<version>.tar.bz2
+
+where <version> is the version number of the application.
+
+Next, change to the newly created directory, then into its tools/ directory,
+and run the script speakup_setlocale. You will be asked which language you
+want to use. Type the code associated with your language (e.g. fr for French),
+then press Enter. The needed files are copied into the i18n directory.
+
+Note: speakupconf must be installed on your system for the settings to be saved.
+Otherwise, you will get an error: your language will still be loaded, but you
+will have to run the script again every time Speakup restarts.
+See section 16.1. for information about speakupconf.
+
+You will have to repeat these steps for any change of locale, i.e. if you wish
+to change Speakup's language or character set (ISO-8859-15 or UTF-8).
+
+If you wish to store the settings, note that at your next login, you will need
+to run:
+
+speakupconf load
+
+Alternatively, you can add the above line to your file
+~/.bashrc or ~/.bash_profile.
+
+If your system administrator ran the script himself, all users will be able
+to switch from English to the language chosen by root simply by running
+speakupconf load (or by adding that command to their ~/.bashrc or
+~/.bash_profile file). If several languages have to be handled, the
+administrator (or each user) will have to run the earlier steps, up to and
+including speakupconf save, choosing the appropriate language, in every user's
+home directory. Each user can then run speakupconf load, and Speakup will load
+his own settings.
+
+14.3. No Support for Non-Western-European Languages
+
+As of the current release, Speakup only supports Western European languages.
+Support for the extended characters used by languages outside of the Western
+European family of languages is a work in progress.
+
+15. Using Speakup's Windowing Capability
+
+Speakup has the capability of defining and manipulating windows on the
+screen. Speakup uses the term "Window", to mean a user defined area of
+the screen. The key strokes for defining and manipulating Speakup
+windows are as follows:
+
+speakup + f2 -- Set the bounds of the window.
+speakup + f3 -- Clear the current window definition.
+speakup + f4 -- Toggle window silence on and off.
+speakup + keypad plus -- Say the currently defined window.
+
+These capabilities are useful for tracking a certain part of the screen
+without rereading the whole screen, or for silencing a part of the
+screen that is constantly changing, such as a clock or status line.
+
+There is no way to save these window settings, and you can only have one
+window defined for each virtual console. There is also no way to have
+windows automatically defined for specific applications.
+
+In order to define a window, use the review keys to move your reading
+cursor to the beginning of the area you want to define. Then press
+speakup + f2. Speakup will tell you that the window starts at the
+indicated row and column position. Then move the reading cursor to the
+end of the area to be defined as a window, and press speakup + f2 again.
+ If there is more than one line in the window, Speakup will tell you
+that the window ends at the indicated row and column position. If there
+is only one line in the window, then Speakup will tell you that the
+window is the specified line on the screen. If you are only defining a
+one line window, you can just press speakup + f2 twice after placing the
+reading cursor on the line you want to define as a window. It is not
+necessary to position the reading cursor at the end of the line in order
+to define the whole line as a window.
+
+16. Tools for Controlling Speakup
+
+The speakup distribution includes extra tools (in the tools directory)
+which were written to make speakup easier to use. This section will
+briefly describe the use of these tools.
+
+16.1. Speakupconf
+
+speakupconf began life as a contribution from Steve Holmes, a member of
+the speakup community. We would like to thank him for his work on the
+early versions of this project.
+
+This script may be installed as part of your linux distribution, but if
+it isn't, the recommended places to put it are /usr/local/bin or
+/usr/bin. This script can be run by any user, so it does not require
+root privileges.
+
+Speakupconf allows you to save and load your Speakup settings. It works
+by reading and writing the /sys files described above.
+
+The directory that speakupconf uses to store your settings depends on
+whether it is run from the root account. If you execute speakupconf as
+root, it uses the directory /etc/speakup. Otherwise, it uses the directory
+~/.speakup, where ~ is your home directory.
+Anyone who needs to use Speakup from your console can load his own custom
+settings with this script.
+
+speakupconf takes one required argument: load or save.
+Use the command
+speakupconf save
+to save your Speakup settings, and
+speakupconf load
+to load them into Speakup.
+A second argument may be specified to use an alternate directory to
+load or save the speakup parameters.
+
+16.2. Talkwith
+
+Charles Hallenbeck, another member of the speakup community, wrote the
+initial versions of this script, and we would also like to thank him for
+his work on it.
+
+This script needs root privileges to run, so if it is not installed as
+part of your linux distribution, the recommended places to install it
+are /usr/local/sbin or /usr/sbin.
+
+Talkwith allows you to switch synthesizers on the fly. It takes a synthesizer
+name as an argument. For instance,
+talkwith dectlk
+causes Speakup to use the DecTalk Express. If you wish to switch to a
+software synthesizer, you must also indicate which daemon you wish to
+use. There are two possible choices:
+spd and espeakup. spd is an abbreviation for speechd-up.
+If you wish to use espeakup for software synthesis, give the command
+talkwith soft espeakup
+To use speechd-up, type:
+talkwith soft spd
+Any arguments that follow the name of the daemon are passed to the daemon
+when it is invoked. For instance:
+talkwith soft espeakup --default-voice=fr
+causes espeakup to use the French voice.
+Note that talkwith must always be executed with root privileges.
+
+Talkwith does not attempt to load your settings after the new
+synthesizer is activated. You can use speakupconf to load your settings
+if desired.
+
+ GNU Free Documentation License
+ Version 1.2, November 2002
+
+
+ Copyright (C) 2000,2001,2002 Free Software Foundation, Inc.
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+
+0. PREAMBLE
+
+The purpose of this License is to make a manual, textbook, or other
+functional and useful document "free" in the sense of freedom: to
+assure everyone the effective freedom to copy and redistribute it,
+with or without modifying it, either commercially or noncommercially.
+Secondarily, this License preserves for the author and publisher a way
+to get credit for their work, while not being considered responsible
+for modifications made by others.
+
+This License is a kind of "copyleft", which means that derivative
+works of the document must themselves be free in the same sense. It
+complements the GNU General Public License, which is a copyleft
+license designed for free software.
+
+We have designed this License in order to use it for manuals for free
+software, because free software needs free documentation: a free
+program should come with manuals providing the same freedoms that the
+software does. But this License is not limited to software manuals;
+it can be used for any textual work, regardless of subject matter or
+whether it is published as a printed book. We recommend this License
+principally for works whose purpose is instruction or reference.
+
+
+1. APPLICABILITY AND DEFINITIONS
+
+This License applies to any manual or other work, in any medium, that
+contains a notice placed by the copyright holder saying it can be
+distributed under the terms of this License. Such a notice grants a
+world-wide, royalty-free license, unlimited in duration, to use that
+work under the conditions stated herein. The "Document", below,
+refers to any such manual or work. Any member of the public is a
+licensee, and is addressed as "you". You accept the license if you
+copy, modify or distribute the work in a way requiring permission
+under copyright law.
+
+A "Modified Version" of the Document means any work containing the
+Document or a portion of it, either copied verbatim, or with
+modifications and/or translated into another language.
+
+A "Secondary Section" is a named appendix or a front-matter section of
+the Document that deals exclusively with the relationship of the
+publishers or authors of the Document to the Document's overall subject
+(or to related matters) and contains nothing that could fall directly
+within that overall subject. (Thus, if the Document is in part a
+textbook of mathematics, a Secondary Section may not explain any
+mathematics.) The relationship could be a matter of historical
+connection with the subject or with related matters, or of legal,
+commercial, philosophical, ethical or political position regarding
+them.
+
+The "Invariant Sections" are certain Secondary Sections whose titles
+are designated, as being those of Invariant Sections, in the notice
+that says that the Document is released under this License. If a
+section does not fit the above definition of Secondary then it is not
+allowed to be designated as Invariant. The Document may contain zero
+Invariant Sections. If the Document does not identify any Invariant
+Sections then there are none.
+
+The "Cover Texts" are certain short passages of text that are listed,
+as Front-Cover Texts or Back-Cover Texts, in the notice that says that
+the Document is released under this License. A Front-Cover Text may
+be at most 5 words, and a Back-Cover Text may be at most 25 words.
+
+A "Transparent" copy of the Document means a machine-readable copy,
+represented in a format whose specification is available to the
+general public, that is suitable for revising the document
+straightforwardly with generic text editors or (for images composed of
+pixels) generic paint programs or (for drawings) some widely available
+drawing editor, and that is suitable for input to text formatters or
+for automatic translation to a variety of formats suitable for input
+to text formatters. A copy made in an otherwise Transparent file
+format whose markup, or absence of markup, has been arranged to thwart
+or discourage subsequent modification by readers is not Transparent.
+An image format is not Transparent if used for any substantial amount
+of text. A copy that is not "Transparent" is called "Opaque".
+
+Examples of suitable formats for Transparent copies include plain
+ASCII without markup, Texinfo input format, LaTeX input format, SGML
+or XML using a publicly available DTD, and standard-conforming simple
+HTML, PostScript or PDF designed for human modification. Examples of
+transparent image formats include PNG, XCF and JPG. Opaque formats
+include proprietary formats that can be read and edited only by
+proprietary word processors, SGML or XML for which the DTD and/or
+processing tools are not generally available, and the
+machine-generated HTML, PostScript or PDF produced by some word
+processors for output purposes only.
+
+The "Title Page" means, for a printed book, the title page itself,
+plus such following pages as are needed to hold, legibly, the material
+this License requires to appear in the title page. For works in
+formats which do not have any title page as such, "Title Page" means
+the text near the most prominent appearance of the work's title,
+preceding the beginning of the body of the text.
+
+A section "Entitled XYZ" means a named subunit of the Document whose
+title either is precisely XYZ or contains XYZ in parentheses following
+text that translates XYZ in another language. (Here XYZ stands for a
+specific section name mentioned below, such as "Acknowledgements",
+"Dedications", "Endorsements", or "History".) To "Preserve the Title"
+of such a section when you modify the Document means that it remains a
+section "Entitled XYZ" according to this definition.
+
+The Document may include Warranty Disclaimers next to the notice which
+states that this License applies to the Document. These Warranty
+Disclaimers are considered to be included by reference in this
+License, but only as regards disclaiming warranties: any other
+implication that these Warranty Disclaimers may have is void and has
+no effect on the meaning of this License.
+
+
+2. VERBATIM COPYING
+
+You may copy and distribute the Document in any medium, either
+commercially or noncommercially, provided that this License, the
+copyright notices, and the license notice saying this License applies
+to the Document are reproduced in all copies, and that you add no other
+conditions whatsoever to those of this License. You may not use
+technical measures to obstruct or control the reading or further
+copying of the copies you make or distribute. However, you may accept
+compensation in exchange for copies. If you distribute a large enough
+number of copies you must also follow the conditions in section 3.
+
+You may also lend copies, under the same conditions stated above, and
+you may publicly display copies.
+
+
+3. COPYING IN QUANTITY
+
+If you publish printed copies (or copies in media that commonly have
+printed covers) of the Document, numbering more than 100, and the
+Document's license notice requires Cover Texts, you must enclose the
+copies in covers that carry, clearly and legibly, all these Cover
+Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
+the back cover. Both covers must also clearly and legibly identify
+you as the publisher of these copies. The front cover must present
+the full title with all words of the title equally prominent and
+visible. You may add other material on the covers in addition.
+Copying with changes limited to the covers, as long as they preserve
+the title of the Document and satisfy these conditions, can be treated
+as verbatim copying in other respects.
+
+If the required texts for either cover are too voluminous to fit
+legibly, you should put the first ones listed (as many as fit
+reasonably) on the actual cover, and continue the rest onto adjacent
+pages.
+
+If you publish or distribute Opaque copies of the Document numbering
+more than 100, you must either include a machine-readable Transparent
+copy along with each Opaque copy, or state in or with each Opaque copy
+a computer-network location from which the general network-using
+public has access to download using public-standard network protocols
+a complete Transparent copy of the Document, free of added material.
+If you use the latter option, you must take reasonably prudent steps,
+when you begin distribution of Opaque copies in quantity, to ensure
+that this Transparent copy will remain thus accessible at the stated
+location until at least one year after the last time you distribute an
+Opaque copy (directly or through your agents or retailers) of that
+edition to the public.
+
+It is requested, but not required, that you contact the authors of the
+Document well before redistributing any large number of copies, to give
+them a chance to provide you with an updated version of the Document.
+
+
+4. MODIFICATIONS
+
+You may copy and distribute a Modified Version of the Document under
+the conditions of sections 2 and 3 above, provided that you release
+the Modified Version under precisely this License, with the Modified
+Version filling the role of the Document, thus licensing distribution
+and modification of the Modified Version to whoever possesses a copy
+of it. In addition, you must do these things in the Modified Version:
+
+A. Use in the Title Page (and on the covers, if any) a title distinct
+ from that of the Document, and from those of previous versions
+ (which should, if there were any, be listed in the History section
+ of the Document). You may use the same title as a previous version
+ if the original publisher of that version gives permission.
+B. List on the Title Page, as authors, one or more persons or entities
+ responsible for authorship of the modifications in the Modified
+ Version, together with at least five of the principal authors of the
+ Document (all of its principal authors, if it has fewer than five),
+ unless they release you from this requirement.
+C. State on the Title page the name of the publisher of the
+ Modified Version, as the publisher.
+D. Preserve all the copyright notices of the Document.
+E. Add an appropriate copyright notice for your modifications
+ adjacent to the other copyright notices.
+F. Include, immediately after the copyright notices, a license notice
+ giving the public permission to use the Modified Version under the
+ terms of this License, in the form shown in the Addendum below.
+G. Preserve in that license notice the full lists of Invariant Sections
+ and required Cover Texts given in the Document's license notice.
+H. Include an unaltered copy of this License.
+I. Preserve the section Entitled "History", Preserve its Title, and add
+ to it an item stating at least the title, year, new authors, and
+ publisher of the Modified Version as given on the Title Page. If
+ there is no section Entitled "History" in the Document, create one
+ stating the title, year, authors, and publisher of the Document as
+ given on its Title Page, then add an item describing the Modified
+ Version as stated in the previous sentence.
+J. Preserve the network location, if any, given in the Document for
+ public access to a Transparent copy of the Document, and likewise
+ the network locations given in the Document for previous versions
+ it was based on. These may be placed in the "History" section.
+ You may omit a network location for a work that was published at
+ least four years before the Document itself, or if the original
+ publisher of the version it refers to gives permission.
+K. For any section Entitled "Acknowledgements" or "Dedications",
+ Preserve the Title of the section, and preserve in the section all
+ the substance and tone of each of the contributor acknowledgements
+ and/or dedications given therein.
+L. Preserve all the Invariant Sections of the Document,
+ unaltered in their text and in their titles. Section numbers
+ or the equivalent are not considered part of the section titles.
+M. Delete any section Entitled "Endorsements". Such a section
+ may not be included in the Modified Version.
+N. Do not retitle any existing section to be Entitled "Endorsements"
+ or to conflict in title with any Invariant Section.
+O. Preserve any Warranty Disclaimers.
+
+If the Modified Version includes new front-matter sections or
+appendices that qualify as Secondary Sections and contain no material
+copied from the Document, you may at your option designate some or all
+of these sections as invariant. To do this, add their titles to the
+list of Invariant Sections in the Modified Version's license notice.
+These titles must be distinct from any other section titles.
+
+You may add a section Entitled "Endorsements", provided it contains
+nothing but endorsements of your Modified Version by various
+parties--for example, statements of peer review or that the text has
+been approved by an organization as the authoritative definition of a
+standard.
+
+You may add a passage of up to five words as a Front-Cover Text, and a
+passage of up to 25 words as a Back-Cover Text, to the end of the list
+of Cover Texts in the Modified Version. Only one passage of
+Front-Cover Text and one of Back-Cover Text may be added by (or
+through arrangements made by) any one entity. If the Document already
+includes a cover text for the same cover, previously added by you or
+by arrangement made by the same entity you are acting on behalf of,
+you may not add another; but you may replace the old one, on explicit
+permission from the previous publisher that added the old one.
+
+The author(s) and publisher(s) of the Document do not by this License
+give permission to use their names for publicity for or to assert or
+imply endorsement of any Modified Version.
+
+
+5. COMBINING DOCUMENTS
+
+You may combine the Document with other documents released under this
+License, under the terms defined in section 4 above for modified
+versions, provided that you include in the combination all of the
+Invariant Sections of all of the original documents, unmodified, and
+list them all as Invariant Sections of your combined work in its
+license notice, and that you preserve all their Warranty Disclaimers.
+
+The combined work need only contain one copy of this License, and
+multiple identical Invariant Sections may be replaced with a single
+copy. If there are multiple Invariant Sections with the same name but
+different contents, make the title of each such section unique by
+adding at the end of it, in parentheses, the name of the original
+author or publisher of that section if known, or else a unique number.
+Make the same adjustment to the section titles in the list of
+Invariant Sections in the license notice of the combined work.
+
+In the combination, you must combine any sections Entitled "History"
+in the various original documents, forming one section Entitled
+"History"; likewise combine any sections Entitled "Acknowledgements",
+and any sections Entitled "Dedications". You must delete all sections
+Entitled "Endorsements".
+
+
+6. COLLECTIONS OF DOCUMENTS
+
+You may make a collection consisting of the Document and other documents
+released under this License, and replace the individual copies of this
+License in the various documents with a single copy that is included in
+the collection, provided that you follow the rules of this License for
+verbatim copying of each of the documents in all other respects.
+
+You may extract a single document from such a collection, and distribute
+it individually under this License, provided you insert a copy of this
+License into the extracted document, and follow this License in all
+other respects regarding verbatim copying of that document.
+
+
+7. AGGREGATION WITH INDEPENDENT WORKS
+
+A compilation of the Document or its derivatives with other separate
+and independent documents or works, in or on a volume of a storage or
+distribution medium, is called an "aggregate" if the copyright
+resulting from the compilation is not used to limit the legal rights
+of the compilation's users beyond what the individual works permit.
+When the Document is included in an aggregate, this License does not
+apply to the other works in the aggregate which are not themselves
+derivative works of the Document.
+
+If the Cover Text requirement of section 3 is applicable to these
+copies of the Document, then if the Document is less than one half of
+the entire aggregate, the Document's Cover Texts may be placed on
+covers that bracket the Document within the aggregate, or the
+electronic equivalent of covers if the Document is in electronic form.
+Otherwise they must appear on printed covers that bracket the whole
+aggregate.
+
+
+8. TRANSLATION
+
+Translation is considered a kind of modification, so you may
+distribute translations of the Document under the terms of section 4.
+Replacing Invariant Sections with translations requires special
+permission from their copyright holders, but you may include
+translations of some or all Invariant Sections in addition to the
+original versions of these Invariant Sections. You may include a
+translation of this License, and all the license notices in the
+Document, and any Warranty Disclaimers, provided that you also include
+the original English version of this License and the original versions
+of those notices and disclaimers. In case of a disagreement between
+the translation and the original version of this License or a notice
+or disclaimer, the original version will prevail.
+
+If a section in the Document is Entitled "Acknowledgements",
+"Dedications", or "History", the requirement (section 4) to Preserve
+its Title (section 1) will typically require changing the actual
+title.
+
+
+9. TERMINATION
+
+You may not copy, modify, sublicense, or distribute the Document except
+as expressly provided for under this License. Any other attempt to
+copy, modify, sublicense or distribute the Document is void, and will
+automatically terminate your rights under this License. However,
+parties who have received copies, or rights, from you under this
+License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+
+10. FUTURE REVISIONS OF THIS LICENSE
+
+The Free Software Foundation may publish new, revised versions
+of the GNU Free Documentation License from time to time. Such new
+versions will be similar in spirit to the present version, but may
+differ in detail to address new problems or concerns. See
+https://www.gnu.org/copyleft/.
+
+Each version of the License is given a distinguishing version number.
+If the Document specifies that a particular numbered version of this
+License "or any later version" applies to it, you have the option of
+following the terms and conditions either of that specified version or
+of any later version that has been published (not as a draft) by the
+Free Software Foundation. If the Document does not specify a version
+number of this License, you may choose any version ever published (not
+as a draft) by the Free Software Foundation.
+
+
+ADDENDUM: How to use this License for your documents
+
+To use this License in a document you have written, include a copy of
+the License in the document and put the following copyright and
+license notices just after the title page:
+
+ Copyright (c) YEAR YOUR NAME.
+ Permission is granted to copy, distribute and/or modify this document
+ under the terms of the GNU Free Documentation License, Version 1.2
+ or any later version published by the Free Software Foundation;
+ with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+ A copy of the license is included in the section entitled "GNU
+ Free Documentation License".
+
+If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
+replace the "with...Texts." line with this:
+
+ with the Invariant Sections being LIST THEIR TITLES, with the
+ Front-Cover Texts being LIST, and with the Back-Cover Texts being LIST.
+
+If you have Invariant Sections without Cover Texts, or some other
+combination of the three, merge those two alternatives to suit the
+situation.
+
+If your document contains nontrivial examples of program code, we
+recommend releasing these examples in parallel under your choice of
+free software license, such as the GNU General Public License,
+to permit their use in free software.
+
+The End.
diff --git a/Documentation/admin-guide/svga.rst b/Documentation/admin-guide/svga.rst
index b6c2f9acca92..9eb1e0738e84 100644
--- a/Documentation/admin-guide/svga.rst
+++ b/Documentation/admin-guide/svga.rst
@@ -12,7 +12,8 @@ Intro
This small document describes the "Video Mode Selection" feature which
allows the use of various special video modes supported by the video BIOS. Due
to usage of the BIOS, the selection is limited to boot time (before the
-kernel decompression starts) and works only on 80X86 machines.
+kernel decompression starts) and works only on 80X86 machines that are
+booted through BIOS firmware (as opposed to through UEFI, kexec, etc.).
.. note::
@@ -23,7 +24,7 @@ kernel decompression starts) and works only on 80X86 machines.
The video mode to be used is selected by a kernel parameter which can be
specified in the kernel Makefile (the SVGA_MODE=... line) or by the "vga=..."
-option of LILO (or some other boot loader you use) or by the "vidmode" utility
+option of LILO (or some other boot loader you use) or by the "vidmode" utility
(present in standard Linux utility packages). You can use the following values
of this parameter::
@@ -41,7 +42,7 @@ of this parameter::
better to use absolute mode numbers instead.
0x.... - Hexadecimal video mode ID (also displayed on the menu, see below
- for exact meaning of the ID). Warning: rdev and LILO don't support
+ for exact meaning of the ID). Warning: LILO doesn't support
hexadecimal numbers -- you have to convert it to decimal manually.
Menu
diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
new file mode 100644
index 000000000000..e3cfffef5a63
--- /dev/null
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -0,0 +1,94 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Syscall User Dispatch
+=====================
+
+Background
+----------
+
+Compatibility layers like Wine need a way to efficiently emulate system
+calls of only a part of their process - the part that has the
+incompatible code - while being able to execute native syscalls without
+a high performance penalty on the native part of the process. Seccomp
+falls short on this task, since it has limited support to efficiently
+filter syscalls based on memory regions, and it doesn't support removing
+filters. Therefore a new mechanism is necessary.
+
+Syscall User Dispatch brings the filtering of the syscall dispatcher
+address back to userspace. The application is in control of a flip
+switch, indicating the current personality of the process. A
+multiple-personality application can then flip the switch without
+invoking the kernel, when crossing the compatibility layer API
+boundaries, to enable/disable the syscall redirection and execute
+syscalls directly (disabled) or send them to be emulated in userspace
+through a SIGSYS.
+
+The goal of this design is to provide very quick compatibility layer
+boundary crosses, which is achieved by not executing a syscall to change
+personality every time the compatibility layer executes. Instead, a
+userspace memory region exposed to the kernel indicates the current
+personality, and the application simply modifies that variable to
+configure the mechanism.
+
+There is a relatively high cost associated with handling signals on most
+architectures, like x86, but at least for Wine, syscalls issued by
+native Windows code are currently not known to be a performance problem,
+since they are quite rare, at least for modern gaming applications.
+
+Since this mechanism is designed to capture syscalls issued by
+non-native applications, it must function on syscalls whose invocation
+ABI is completely unexpected to Linux. Syscall User Dispatch, therefore,
+doesn't rely on any of the syscall ABI to make the filtering. It uses
+only the syscall dispatcher address and the userspace key.
+
+As the ABI of these intercepted syscalls is unknown to Linux, these
+syscalls are not instrumentable via ptrace or the syscall tracepoints.
+
+Interface
+---------
+
+A thread can setup this mechanism on supported kernels by executing the
+following prctl:
+
+ prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
+
+<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
+disable the mechanism globally for that thread. When
+PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
+
+[<offset>, <offset>+<length>) delimit a memory region interval
+from which syscalls are always executed directly, regardless of the
+userspace selector. This provides a fast path for the C library, which
+includes the most common syscall dispatchers in the native code
+applications, and also provides a way for the signal handler to return
+without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this
+interface should make sure that at least the signal trampoline code is
+included in this region. In addition, for syscalls that implement the
+trampoline code on the vDSO, that trampoline is never intercepted.
+
+[selector] is a pointer to a char-sized region in the process memory
+region, that provides a quick way to enable or disable syscall redirection
+thread-wide, without the need to invoke the kernel directly. selector
+can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK.
+Any other value should terminate the program with a SIGSYS.
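+
+The following sketch illustrates one possible use of the interface. It is
+not taken from the kernel sources; the fallback constant definitions mirror
+those in include/uapi/linux/prctl.h, and a real compatibility layer would
+additionally emulate the trapped syscall and write its return value into
+the saved register context, which this example does not do::
+
+    #define _GNU_SOURCE
+    #include <signal.h>
+    #include <stdio.h>
+    #include <string.h>
+    #include <sys/prctl.h>
+    #include <sys/syscall.h>
+    #include <unistd.h>
+
+    #ifndef PR_SET_SYSCALL_USER_DISPATCH
+    #define PR_SET_SYSCALL_USER_DISPATCH  59
+    #define PR_SYS_DISPATCH_OFF           0
+    #define PR_SYS_DISPATCH_ON            1
+    #define SYSCALL_DISPATCH_FILTER_ALLOW 0
+    #define SYSCALL_DISPATCH_FILTER_BLOCK 1
+    #endif
+
+    static volatile char selector = SYSCALL_DISPATCH_FILTER_ALLOW;
+    static volatile int trapped_nr = -1;
+
+    static void sigsys_handler(int sig, siginfo_t *info, void *ucontext)
+    {
+        (void)sig; (void)ucontext;
+        /*
+         * Flip the switch back to "allow" so that further syscalls,
+         * including the rt_sigreturn used to leave this handler, are
+         * executed directly.  An emulator would decode info->si_syscall
+         * here and patch the emulated return value into ucontext.
+         */
+        selector = SYSCALL_DISPATCH_FILTER_ALLOW;
+        trapped_nr = info->si_syscall;
+    }
+
+    int main(void)
+    {
+        struct sigaction act;
+
+        memset(&act, 0, sizeof(act));
+        act.sa_sigaction = sigsys_handler;
+        act.sa_flags = SA_SIGINFO;
+        sigaction(SIGSYS, &act, NULL);
+
+        /* Empty always-allowed region; only the selector decides. */
+        if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
+                  0, 0, &selector)) {
+            perror("prctl");
+            return 1;
+        }
+
+        selector = SYSCALL_DISPATCH_FILTER_BLOCK;   /* "emulated" mode */
+        syscall(SYS_getpid);                        /* raises SIGSYS   */
+
+        /* The handler flipped back to "native" mode, so this is fine. */
+        printf("intercepted syscall number %d\n", trapped_nr);
+        return 0;
+    }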
+
+Additionally, a task's syscall user dispatch configuration can be peeked
+and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
+requests. This is useful for checkpoint/restart software.
+
+Security Notes
+--------------
+
+Syscall User Dispatch provides functionality for compatibility layers to
+quickly capture system calls issued by a non-native part of the
+application, while not impacting the Linux native regions of the
+process. It is not a mechanism for sandboxing system calls, and it
+should not be seen as a security mechanism, since it is trivial for a
+malicious application to subvert the mechanism by jumping to an allowed
+dispatcher region prior to executing the syscall, or to discover the
+address and modify the selector value. If the use case requires any
+kind of security sandboxing, Seccomp should be used instead.
+
+Any fork or exec of the existing process resets the mechanism to
+PR_SYS_DISPATCH_OFF.
diff --git a/Documentation/admin-guide/sysctl/abi.rst b/Documentation/admin-guide/sysctl/abi.rst
index 599bcde7f0b7..4e6db0a2a4c0 100644
--- a/Documentation/admin-guide/sysctl/abi.rst
+++ b/Documentation/admin-guide/sysctl/abi.rst
@@ -1,67 +1,34 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
================================
Documentation for /proc/sys/abi/
================================
-kernel version 2.6.0.test2
+.. See scripts/check-sysctl-docs to keep this up to date:
+.. scripts/check-sysctl-docs -vtable="abi" \
+.. Documentation/admin-guide/sysctl/abi.rst \
+.. $(git grep -l register_sysctl_)
-Copyright (c) 2003, Fabian Frederick <ffrederick@users.sourceforge.net>
+Copyright (c) 2020, Stephen Kitt
-For general info: index.rst.
+For general info, see Documentation/admin-guide/sysctl/index.rst.
------------------------------------------------------------------------------
-This path is binary emulation relevant aka personality types aka abi.
-When a process is executed, it's linked to an exec_domain whose
-personality is defined using values available from /proc/sys/abi.
-You can find further details about abi in include/linux/personality.h.
-
-Here are the files featuring in 2.6 kernel:
-
-- defhandler_coff
-- defhandler_elf
-- defhandler_lcall7
-- defhandler_libcso
-- fake_utsname
-- trace
-
-defhandler_coff
----------------
-
-defined value:
- PER_SCOSVR3::
-
- 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE
-
-defhandler_elf
---------------
-
-defined value:
- PER_LINUX::
-
- 0
-
-defhandler_lcall7
------------------
-
-defined value :
- PER_SVR4::
-
- 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
-
-defhandler_libsco
------------------
-
-defined value:
- PER_SVR4::
+The files in ``/proc/sys/abi`` can be used to see and modify
+ABI-related settings.
- 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
+Currently, these files might (depending on your configuration)
+show up in ``/proc/sys/abi``:
-fake_utsname
-------------
+.. contents:: :local:
-Unused
+vsyscall32 (x86)
+================
-trace
------
+Determines whether the kernel maps a vDSO page into 32-bit processes;
+can be set to 1 to enable, or 0 to disable. Defaults to enabled if
+``CONFIG_COMPAT_VDSO`` is set, disabled otherwise.
-Unused
+This controls the same setting as the ``vdso32`` kernel boot
+parameter.
diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 2a45119e3331..47499a1742bd 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -2,8 +2,6 @@
Documentation for /proc/sys/fs/
===============================
-kernel version 2.2.10
-
Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
Copyright (c) 2009, Shen Feng<shen@cn.fujitsu.com>
@@ -12,159 +10,130 @@ For general info and legal blurb, please look in intro.rst.
------------------------------------------------------------------------------
-This file contains documentation for the sysctl files in
-/proc/sys/fs/ and is valid for Linux kernel version 2.2.
+This file contains documentation for the sysctl files and directories
+in ``/proc/sys/fs/``.
The files in this directory can be used to tune and monitor
miscellaneous and general things in the operation of the Linux
-kernel. Since some of the files _can_ be used to screw up your
+kernel. Since some of the files *can* be used to screw up your
system, it is advisable to read both documentation and source
before actually making adjustments.
1. /proc/sys/fs
===============
-Currently, these files are in /proc/sys/fs:
-
-- aio-max-nr
-- aio-nr
-- dentry-state
-- dquot-max
-- dquot-nr
-- file-max
-- file-nr
-- inode-max
-- inode-nr
-- inode-state
-- nr_open
-- overflowuid
-- overflowgid
-- pipe-user-pages-hard
-- pipe-user-pages-soft
-- protected_fifos
-- protected_hardlinks
-- protected_regular
-- protected_symlinks
-- suid_dumpable
-- super-max
-- super-nr
+Currently, these files might (depending on your configuration)
+show up in ``/proc/sys/fs``:
+
+.. contents:: :local:
aio-nr & aio-max-nr
-------------------
-aio-nr is the running total of the number of events specified on the
-io_setup system call for all currently active aio contexts. If aio-nr
-reaches aio-max-nr then io_setup will fail with EAGAIN. Note that
-raising aio-max-nr does not result in the pre-allocation or re-sizing
-of any kernel data structures.
+``aio-nr`` shows the current system-wide number of asynchronous io
+requests. ``aio-max-nr`` allows you to change the maximum value
+``aio-nr`` can grow to. If ``aio-nr`` reaches ``aio-max-nr`` then
+``io_setup`` will fail with ``EAGAIN``. Note that raising
+``aio-max-nr`` does not result in the
+pre-allocation or re-sizing of any kernel data structures.
dentry-state
------------
-From linux/include/linux/dcache.h::
+This file shows the values in ``struct dentry_stat_t``, as defined in
+``fs/dcache.c``::
struct dentry_stat_t dentry_stat {
- int nr_dentry;
- int nr_unused;
- int age_limit; /* age in seconds */
- int want_pages; /* pages requested by system */
- int nr_negative; /* # of unused negative dentries */
- int dummy; /* Reserved for future use */
+ long nr_dentry;
+ long nr_unused;
+ long age_limit; /* age in seconds */
+ long want_pages; /* pages requested by system */
+ long nr_negative; /* # of unused negative dentries */
+ long dummy; /* Reserved for future use */
};
Dentries are dynamically allocated and deallocated.
-nr_dentry shows the total number of dentries allocated (active
-+ unused). nr_unused shows the number of dentries that are not
+``nr_dentry`` shows the total number of dentries allocated (active
++ unused). ``nr_unused`` shows the number of dentries that are not
actively used, but are saved in the LRU list for future reuse.
-Age_limit is the age in seconds after which dcache entries
-can be reclaimed when memory is short and want_pages is
-nonzero when shrink_dcache_pages() has been called and the
+``age_limit`` is the age in seconds after which dcache entries
+can be reclaimed when memory is short and ``want_pages`` is
+nonzero when ``shrink_dcache_pages()`` has been called and the
dcache isn't pruned yet.
-nr_negative shows the number of unused dentries that are also
+``nr_negative`` shows the number of unused dentries that are also
negative dentries which do not map to any files. Instead,
they help speed up the rejection of non-existing files provided
by the users.
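+
+The six fields can be read in one go; the values below are only
+illustrative::
+
+  $ cat /proc/sys/fs/dentry-state
+  1738298 1640680 45 0 559817 0
+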
-dquot-max & dquot-nr
---------------------
-
-The file dquot-max shows the maximum number of cached disk
-quota entries.
-
-The file dquot-nr shows the number of allocated disk quota
-entries and the number of free disk quota entries.
-
-If the number of free cached disk quotas is very low and
-you have some awesome number of simultaneous system users,
-you might want to raise the limit.
-
-
file-max & file-nr
------------------
-The value in file-max denotes the maximum number of file-
+The value in ``file-max`` denotes the maximum number of file-
handles that the Linux kernel will allocate. When you get lots
of error messages about running out of file handles, you might
want to increase this limit.
Historically, the kernel was able to allocate file handles
dynamically, but not to free them again. The three values in
-file-nr denote the number of allocated file handles, the number
+``file-nr`` denote the number of allocated file handles, the number
of allocated but unused file handles, and the maximum number of
-file handles. Linux 2.6 always reports 0 as the number of free
+file handles. Linux 2.6 and later always reports 0 as the number of free
file handles -- this is not an error, it just means that the
number of allocated file handles exactly matches the number of
used file handles.
-Attempts to allocate more file descriptors than file-max are
-reported with printk, look for "VFS: file-max limit <number>
-reached".
+Attempts to allocate more file descriptors than ``file-max`` are
+reported with ``printk``; look for::
+ VFS: file-max limit <number> reached
-nr_open
--------
-
-This denotes the maximum number of file-handles a process can
-allocate. Default value is 1024*1024 (1048576) which should be
-enough for most machines. Actual limit depends on RLIMIT_NOFILE
-resource limit.
+in the kernel logs.
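+
+For example, current usage can be checked and the limit raised (the new
+limit is only an illustrative value)::
+
+  # allocated, unused and maximum file handles
+  cat /proc/sys/fs/file-nr
+  # raise the system-wide maximum
+  echo 2097152 > /proc/sys/fs/file-max
+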
-inode-max, inode-nr & inode-state
----------------------------------
+inode-nr & inode-state
+----------------------
As with file handles, the kernel allocates the inode structures
dynamically, but can't free them yet.
-The value in inode-max denotes the maximum number of inode
-handlers. This value should be 3-4 times larger than the value
-in file-max, since stdin, stdout and network sockets also
-need an inode struct to handle them. When you regularly run
-out of inodes, you need to increase this value.
-
-The file inode-nr contains the first two items from
-inode-state, so we'll skip to that file...
+The file ``inode-nr`` contains the first two items from
+``inode-state``, so we'll skip to that file...
-Inode-state contains three actual numbers and four dummies.
-The actual numbers are, in order of appearance, nr_inodes,
-nr_free_inodes and preshrink.
+``inode-state`` contains three actual numbers and four dummies.
+The actual numbers are, in order of appearance, ``nr_inodes``,
+``nr_free_inodes`` and ``preshrink``.
-Nr_inodes stands for the number of inodes the system has
-allocated, this can be slightly more than inode-max because
-Linux allocates them one pageful at a time.
+``nr_inodes`` stands for the number of inodes the system has
+allocated.
-Nr_free_inodes represents the number of free inodes (?) and
-preshrink is nonzero when the nr_inodes > inode-max and the
+``nr_free_inodes`` represents the number of free inodes (?) and
+``preshrink`` is nonzero when the
system needs to prune the inode list instead of allocating
more.
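+
+Both files can simply be read; the values shown are only illustrative::
+
+  $ cat /proc/sys/fs/inode-nr
+  415543 273713
+  $ cat /proc/sys/fs/inode-state
+  415543 273713 0 0 0 0 0
+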
+mount-max
+---------
+
+This denotes the maximum number of mounts that may exist
+in a mount namespace.
+
+
+nr_open
+-------
+
+This denotes the maximum number of file-handles a process can
+allocate. Default value is 1024*1024 (1048576) which should be
+enough for most machines. Actual limit depends on ``RLIMIT_NOFILE``
+resource limit.
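+
+For example, this ceiling and the shell's current ``RLIMIT_NOFILE`` soft
+limit can be inspected with::
+
+  cat /proc/sys/fs/nr_open
+  ulimit -n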
+
+
overflowgid & overflowuid
-------------------------
@@ -192,7 +161,7 @@ pipe-user-pages-soft
Maximum total number of pages a non-privileged user may allocate for pipes
before the pipe size gets limited to a single page. Once this limit is reached,
new pipes will be limited to a single page in size for this user in order to
-limit total memory usage, and trying to increase them using fcntl() will be
+limit total memory usage, and trying to increase them using ``fcntl()`` will be
denied until usage goes below the limit again. The default value allows
allocating up to 1024 pipes at their default size. When set to 0, no limit is
applied.
@@ -207,7 +176,7 @@ file.
When set to "0", writing to FIFOs is unrestricted.
-When set to "1" don't allow O_CREAT open on FIFOs that we don't own
+When set to "1", don't allow ``O_CREAT`` open on FIFOs that we don't own
in world writable sticky directories, unless they are owned by the
owner of the directory.
@@ -221,7 +190,7 @@ protected_hardlinks
A long-standing class of security issues is the hardlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
-directories like /tmp. The common method of exploitation of this flaw
+directories like ``/tmp``. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given hardlink (i.e. a
root process follows a hardlink created by another user). Additionally,
on systems without separated partitions, this stops unauthorized users
@@ -239,13 +208,13 @@ This protection is based on the restrictions in Openwall and grsecurity.
protected_regular
-----------------
-This protection is similar to protected_fifos, but it
+This protection is similar to `protected_fifos`_, but it
avoids writes to an attacker-controlled regular file in a location where
a program expected to create one.
When set to "0", writing to regular files is unrestricted.
-When set to "1" don't allow O_CREAT open on regular files that we
+When set to "1", don't allow ``O_CREAT`` open on regular files that we
don't own in world writable sticky directories, unless they are
owned by the owner of the directory.
@@ -257,11 +226,11 @@ protected_symlinks
A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
-directories like /tmp. The common method of exploitation of this flaw
+directories like ``/tmp``. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
-http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
+https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
When set to "0", symlink following behavior is unrestricted.
@@ -272,23 +241,25 @@ follower match, or when the directory owner matches the symlink's owner.
This protection is based on the restrictions in Openwall and grsecurity.
-suid_dumpable:
---------------
+suid_dumpable
+-------------
This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are
= ========== ===============================================================
-0 (default) traditional behaviour. Any process which has changed
+0 (default) Traditional behaviour. Any process which has changed
privilege levels or is execute only will not be dumped.
-1 (debug) all processes dump core when possible. The core dump is
+1 (debug) All processes dump core when possible. The core dump is
owned by the current user and no security is applied. This is
intended for system debugging situations only.
Ptrace is unchecked.
This is insecure as it allows regular users to examine the
memory contents of privileged processes.
-2 (suidsafe) any binary which normally would not be dumped is dumped
- anyway, but only if the "core_pattern" kernel sysctl is set to
+2 (suidsafe) Any binary which normally would not be dumped is dumped
+ anyway, but only if the ``core_pattern`` kernel sysctl (see
+ :ref:`Documentation/admin-guide/sysctl/kernel.rst <core_pattern>`)
+ is set to
either a pipe handler or a fully qualified path. (For more
details on this limitation, see CVE-2006-2451.) This mode is
appropriate when administrators are attempting to debug
@@ -301,36 +272,11 @@ or otherwise protected/tainted binaries. The modes are
= ========== ===============================================================
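+
+For example, mode 2 can be combined with a piped ``core_pattern`` handler
+(the handler path below is hypothetical)::
+
+  echo 2 > /proc/sys/fs/suid_dumpable
+  echo '|/usr/local/sbin/core-handler %e %t' > /proc/sys/kernel/core_pattern
+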
-super-max & super-nr
---------------------
-
-These numbers control the maximum number of superblocks, and
-thus the maximum number of mounted filesystems the kernel
-can have. You only need to increase super-max if you need to
-mount more filesystems than the current value in super-max
-allows you to.
-
-
-aio-nr & aio-max-nr
--------------------
-
-aio-nr shows the current system-wide number of asynchronous io
-requests. aio-max-nr allows you to change the maximum value
-aio-nr can grow to.
-
-
-mount-max
----------
-
-This denotes the maximum number of mounts that may exist
-in a mount namespace.
-
-
2. /proc/sys/fs/binfmt_misc
===========================
-Documentation for the files in /proc/sys/fs/binfmt_misc is
+Documentation for the files in ``/proc/sys/fs/binfmt_misc`` is
in Documentation/admin-guide/binfmt-misc.rst.
@@ -343,28 +289,32 @@ creation of a user space library that implements the POSIX message queues
API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System
Interfaces specification.)
-The "mqueue" filesystem contains values for determining/setting the amount of
-resources used by the file system.
+The "mqueue" filesystem contains values for determining/setting the
+amount of resources used by the file system.
-/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the
-maximum number of message queues allowed on the system.
+``/proc/sys/fs/mqueue/queues_max`` is a read/write file for
+setting/getting the maximum number of message queues allowed on the
+system.
-/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the
-maximum number of messages in a queue value. In fact it is the limiting value
-for another (user) limit which is set in mq_open invocation. This attribute of
-a queue must be less or equal then msg_max.
+``/proc/sys/fs/mqueue/msg_max`` is a read/write file for
+setting/getting the maximum number of messages in a queue value. In
+fact it is the limiting value for another (user) limit which is set in
+``mq_open`` invocation. This attribute of a queue must be less than
+or equal to ``msg_max``.
-/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the
-maximum message size value (it is every message queue's attribute set during
-its creation).
+``/proc/sys/fs/mqueue/msgsize_max`` is a read/write file for
+setting/getting the maximum message size value (it is an attribute of
+every message queue, set during its creation).
-/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the
-default number of messages in a queue value if attr parameter of mq_open(2) is
-NULL. If it exceed msg_max, the default value is initialized msg_max.
+``/proc/sys/fs/mqueue/msg_default`` is a read/write file for
+setting/getting the default number of messages in a queue value if the
+``attr`` parameter of ``mq_open(2)`` is ``NULL``. If it exceeds
+``msg_max``, the default value is initialized to ``msg_max``.
-/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting
-the default message size value if attr parameter of mq_open(2) is NULL. If it
-exceed msgsize_max, the default value is initialized msgsize_max.
+``/proc/sys/fs/mqueue/msgsize_default`` is a read/write file for
+setting/getting the default message size value if the ``attr``
+parameter of ``mq_open(2)`` is ``NULL``. If it exceeds
+``msgsize_max``, the default value is initialized to ``msgsize_max``.
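+
+All of these limits can be adjusted by writing to the corresponding file,
+for example (the values are only illustrative)::
+
+  echo 512 > /proc/sys/fs/mqueue/queues_max
+  echo 64 > /proc/sys/fs/mqueue/msg_max
+  echo 16384 > /proc/sys/fs/mqueue/msgsize_max
+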
4. /proc/sys/fs/epoll - Configuration options for the epoll interface
=====================================================================
@@ -378,7 +328,7 @@ Every epoll file descriptor can store a number of files to be monitored
for event readiness. Each one of these monitored files constitutes a "watch".
This configuration option sets the maximum number of "watches" that are
allowed for each user.
-Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes
-on a 64bit one.
-The current default value for max_user_watches is the 1/32 of the available
-low memory, divided for the "watch" cost in bytes.
+Each "watch" costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes
+on a 64-bit one.
+The current default value for ``max_user_watches`` is 4% of the
+available low memory, divided by the "watch" cost in bytes.
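+
+The current limit can be read and, if needed, raised (the value below is
+only illustrative)::
+
+  cat /proc/sys/fs/epoll/max_user_watches
+  echo 4194304 > /proc/sys/fs/epoll/max_user_watches
+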
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 83acf5025488..7fd43947832f 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -9,12 +9,13 @@ Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
Copyright (c) 2009, Shen Feng<shen@cn.fujitsu.com>
-For general info and legal blurb, please look in :doc:`index`.
+For general info and legal blurb, please look in
+Documentation/admin-guide/sysctl/index.rst.
------------------------------------------------------------------------------
This file contains documentation for the sysctl files in
-``/proc/sys/kernel/`` and is valid for Linux kernel version 2.2.
+``/proc/sys/kernel/``.
The files in this directory can be used to tune and monitor
miscellaneous and general things in the operation of the Linux
@@ -37,8 +38,8 @@ acct
If BSD-style process accounting is enabled, these values control
its behaviour. If free space on the filesystem where the log lives
-goes below ``lowwater``% accounting suspends. If free space gets
-above ``highwater``% accounting resumes. ``frequency`` determines
+goes below ``lowwater``\ % accounting suspends. If free space gets
+above ``highwater``\ % accounting resumes. ``frequency`` determines
how often we check the amount of free space (value is in
seconds). Default:
@@ -54,7 +55,7 @@ free space valid for 30 seconds.
acpi_video_flags
================
-See :doc:`/power/video`. This allows the video resume mode to be set,
+See Documentation/power/video.rst. This allows the video resume mode to be set,
in a similar fashion to the ``acpi_sleep`` kernel parameter, by
combining the following values:
@@ -64,6 +65,11 @@ combining the following values:
4 s3_beep
= =======
+arch
+====
+
+The machine hardware name, the same output as ``uname -m``
+(e.g. ``x86_64`` or ``aarch64``).
auto_msgmni
===========
@@ -89,7 +95,7 @@ is 0x15 and the full version number is 0x234, this file will contain
the value 340 = 0x154.
See the ``type_of_loader`` and ``ext_loader_type`` fields in
-:doc:`/x86/boot` for additional information.
+Documentation/arch/x86/boot.rst for additional information.
bootloader_version (x86 only)
@@ -99,7 +105,7 @@ The complete bootloader version number. In the example above, this
file will contain the value 564 = 0x234.
See the ``type_of_loader`` and ``ext_loader_ver`` fields in
-:doc:`/x86/boot` for additional information.
+Documentation/arch/x86/boot.rst for additional information.
bpf_stats_enabled
@@ -133,6 +139,8 @@ Highest valid capability of the running kernel. Exports
``CAP_LAST_CAP`` from the kernel.
+.. _core_pattern:
+
core_pattern
============
@@ -164,9 +172,11 @@ core_pattern
%s signal number
%t UNIX time of dump
%h hostname
- %e executable filename (may be shortened)
+ %e executable filename (may be shortened, could be changed by prctl etc)
+ %f executable filename
%E executable path
%c maximum size of core file by resource limit RLIMIT_CORE
+ %C CPU the task ran on
%<OTHER> both are dropped
======== ==========================================
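+
+For example, cores could be written to a dedicated directory, named after
+the executable and the time of the dump (the directory is only
+illustrative)::
+
+  echo '/var/crash/core.%e.%t' > /proc/sys/kernel/core_pattern
+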
@@ -235,7 +245,7 @@ This toggle indicates whether unprivileged users are prevented
from using ``dmesg(8)`` to view messages from the kernel's log
buffer.
When ``dmesg_restrict`` is set to 0 there are no restrictions.
-When ``dmesg_restrict`` is set set to 1, users must have
+When ``dmesg_restrict`` is set to 1, users must have
``CAP_SYSLOG`` to use ``dmesg(8)``.
The kernel config option ``CONFIG_SECURITY_DMESG_RESTRICT`` sets the
@@ -268,7 +278,7 @@ see the ``hostname(1)`` man page.
firmware_config
===============
-See :doc:`/driver-api/firmware/fallback-mechanisms`.
+See Documentation/driver-api/firmware/fallback-mechanisms.rst.
The entries in this directory allow the firmware loader helper
fallback to be controlled:
@@ -286,17 +296,35 @@ kernel panic). This will output the contents of the ftrace buffers to
the console. This is very useful for capturing traces that lead to
crashes and outputting them to a serial console.
-= ===================================================
-0 Disabled (default).
-1 Dump buffers of all CPUs.
-2 Dump the buffer of the CPU that triggered the oops.
-= ===================================================
+======================= ===========================================
+0 Disabled (default).
+1 Dump buffers of all CPUs.
+2(orig_cpu) Dump the buffer of the CPU that triggered the
+ oops.
+<instance> Dump the specific instance buffer on all CPUs.
+<instance>=2(orig_cpu) Dump the specific instance buffer on the CPU
+ that triggered the oops.
+======================= ===========================================
+
+Multiple instance dump is also supported, and instances are separated
+by commas. If the global buffer also needs to be dumped, please specify
+the dump mode (1/2/orig_cpu) first for the global buffer.
+So for example to dump the "foo" and "bar" instance buffers on all CPUs,
+the user can::
+
+ echo "foo,bar" > /proc/sys/kernel/ftrace_dump_on_oops
+
+To dump the global buffer and the "foo" instance buffer on all
+CPUs along with the "bar" instance buffer on the CPU that triggered the
+oops, the user can::
+
+ echo "1,foo,bar=2" > /proc/sys/kernel/ftrace_dump_on_oops
ftrace_enabled, stack_tracer_enabled
====================================
-See :doc:`/trace/ftrace`.
+See Documentation/trace/ftrace.rst.
hardlockup_all_cpu_backtrace
@@ -324,7 +352,7 @@ when a hard lockup is detected.
1 Panic on hard lockup.
= ===========================
-See :doc:`/admin-guide/lockup-watchdogs` for more information.
+See Documentation/admin-guide/lockup-watchdogs.rst for more information.
This can also be set using the nmi_watchdog kernel parameter.
@@ -332,11 +360,16 @@ hotplug
=======
Path for the hotplug policy agent.
-Default value is "``/sbin/hotplug``".
+Default value is ``CONFIG_UEVENT_HELPER_PATH``, which in turn defaults
+to the empty string.
+This file only exists when ``CONFIG_UEVENT_HELPER`` is enabled. Most
+modern systems rely exclusively on the netlink-based uevent source and
+don't need this.
-hung_task_all_cpu_backtrace:
-================
+
+hung_task_all_cpu_backtrace
+===========================
If this option is set, the kernel will send an NMI to all CPUs to dump
their backtraces when a hung task is detected. This file shows up if
@@ -421,8 +454,8 @@ ignore-unaligned-usertrap
On architectures where unaligned accesses cause traps, and where this
feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN``;
-currently, ``arc`` and ``ia64``), controls whether all unaligned traps
-are logged.
+currently, ``arc`` and ``loongarch``), controls whether all
+unaligned traps are logged.
= =============================================================
0 Log all unaligned accesses.
@@ -430,17 +463,44 @@ are logged.
setting.
= =============================================================
-See also `unaligned-trap`_ and `unaligned-dump-stack`_. On ``ia64``,
-this allows system administrators to override the
-``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded.
+See also `unaligned-trap`_.
+
+io_uring_disabled
+=================
+
+Prevents all processes from creating new io_uring instances. Enabling this
+shrinks the kernel's attack surface.
+
+= ======================================================================
+0 All processes can create io_uring instances as normal. This is the
+ default setting.
+1 io_uring creation is disabled (io_uring_setup() will fail with
+ -EPERM) for unprivileged processes not in the io_uring_group group.
+ Existing io_uring instances can still be used. See the
+ documentation for io_uring_group for more information.
+2 io_uring creation is disabled for all processes. io_uring_setup()
+ always fails with -EPERM. Existing io_uring instances can still be
+ used.
+= ======================================================================
+
+
+io_uring_group
+==============
+
+When io_uring_disabled is set to 1, a process must either be
+privileged (CAP_SYS_ADMIN) or be in the io_uring_group group in order
+to create an io_uring instance. If io_uring_group is set to -1 (the
+default), only processes with the CAP_SYS_ADMIN capability may create
+io_uring instances.
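+
+For example, to restrict io_uring creation to members of one group (the
+group ID used here is only illustrative)::
+
+  # only members of GID 1024 may create io_uring instances
+  echo 1024 > /proc/sys/kernel/io_uring_group
+  echo 1 > /proc/sys/kernel/io_uring_disabled
+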
kexec_load_disabled
===================
-A toggle indicating if the ``kexec_load`` syscall has been disabled.
-This value defaults to 0 (false: ``kexec_load`` enabled), but can be
-set to 1 (true: ``kexec_load`` disabled).
+A toggle indicating if the syscalls ``kexec_load`` and
+``kexec_file_load`` have been disabled.
+This value defaults to 0 (false: ``kexec_*load`` enabled), but can be
+set to 1 (true: ``kexec_*load`` disabled).
Once true, kexec can no longer be used, and the toggle cannot be set
back to false.
This allows a kexec image to be loaded before disabling the syscall,
@@ -448,6 +508,24 @@ allowing a system to set up (and later use) an image without it being
altered.
Generally used together with the `modules_disabled`_ sysctl.
+kexec_load_limit_panic
+======================
+
+This parameter specifies a limit to the number of times the syscalls
+``kexec_load`` and ``kexec_file_load`` can be called with a crash
+image. It can only be set with a more restrictive value than the
+current one.
+
+== ======================================================
+-1 Unlimited calls to kexec. This is the default setting.
+N Number of calls left.
+== ======================================================
+
+kexec_load_limit_reboot
+=======================
+
+Similar functionality as ``kexec_load_limit_panic``, but for a normal
+image.
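+
+For example, once the required images have been loaded, further loads can
+be restricted or disabled altogether::
+
+  # allow only one further crash-image load
+  echo 1 > /proc/sys/kernel/kexec_load_limit_panic
+  # or disable the kexec syscalls entirely
+  echo 1 > /proc/sys/kernel/kexec_load_disabled
+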
kptr_restrict
=============
@@ -482,10 +560,11 @@ modprobe
========
The full path to the usermode helper for autoloading kernel modules,
-by default "/sbin/modprobe". This binary is executed when the kernel
-requests a module. For example, if userspace passes an unknown
-filesystem type to mount(), then the kernel will automatically request
-the corresponding filesystem module by executing this usermode helper.
+by default ``CONFIG_MODPROBE_PATH``, which in turn defaults to
+"/sbin/modprobe". This binary is executed when the kernel requests a
+module. For example, if userspace passes an unknown filesystem type
+to mount(), then the kernel will automatically request the
+corresponding filesystem module by executing this usermode helper.
This usermode helper should insert the needed module into the kernel.
This sysctl only affects module autoloading. It has no effect on the
@@ -533,6 +612,9 @@ default (``MSGMNB``).
``msgmni`` is the maximum number of IPC queues. 32000 by default
(``MSGMNI``).
+All of these parameters are set per ipc namespace. The maximum number of bytes
+in POSIX message queues is limited by ``RLIMIT_MSGQUEUE``. This limit is
+respected hierarchically in each user namespace.
msg_next_id, sem_next_id, and shm_next_id (System V IPC)
========================================================
@@ -580,74 +662,66 @@ in a KVM virtual machine. This default can be overridden by adding::
nmi_watchdog=1
-to the guest kernel command line (see :doc:`/admin-guide/kernel-parameters`).
+to the guest kernel command line (see
+Documentation/admin-guide/kernel-parameters.rst).
-numa_balancing
-==============
+nmi_wd_lpm_factor (PPC only)
+============================
-Enables/disables automatic page fault based NUMA memory
-balancing. Memory is moved automatically to nodes
-that access it often.
+Factor to apply to the NMI watchdog timeout (only when ``nmi_watchdog`` is
+set to 1). This factor represents the percentage added to
+``watchdog_thresh`` when calculating the NMI watchdog timeout during a
+live partition migration (LPM). The soft lockup timeout is not impacted.
-Enables/disables automatic NUMA memory balancing. On NUMA machines, there
-is a performance penalty if remote memory is accessed by a CPU. When this
-feature is enabled the kernel samples what task thread is accessing memory
-by periodically unmapping pages and later trapping a page fault. At the
-time of the page fault, it is determined if the data being accessed should
-be migrated to a local memory node.
-
-The unmapping of pages and trapping faults incur additional overhead that
-ideally is offset by improved memory locality but there is no universal
-guarantee. If the target workload is already bound to NUMA nodes then this
-feature should be disabled. Otherwise, if the system overhead from the
-feature is too high then the rate the kernel samples for NUMA hinting
-faults may be controlled by the `numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
-numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls.
+A value of 0 means no change. The default value is 200 meaning the NMI
+watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10).
-numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
-===============================================================================================================================
-
+numa_balancing
+==============
-Automatic NUMA balancing scans tasks address space and unmaps pages to
-detect if pages are properly placed or if the data should be migrated to a
-memory node local to where the task is running. Every "scan delay" the task
-scans the next "scan size" number of pages in its address space. When the
-end of the address space is reached the scanner restarts from the beginning.
+Enables/disables and configures automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes that access it often.
+The value to set can be the result of ORing the following:
-In combination, the "scan delay" and "scan size" determine the scan rate.
-When "scan delay" decreases, the scan rate increases. The scan delay and
-hence the scan rate of every task is adaptive and depends on historical
-behaviour. If pages are properly placed then the scan delay increases,
-otherwise the scan delay decreases. The "scan size" is not adaptive but
-the higher the "scan size", the higher the scan rate.
+= =================================
+0 NUMA_BALANCING_DISABLED
+1 NUMA_BALANCING_NORMAL
+2 NUMA_BALANCING_MEMORY_TIERING
+= =================================
-Higher scan rates incur higher system overhead as page faults must be
-trapped and potentially data must be migrated. However, the higher the scan
-rate, the more quickly a tasks memory is migrated to a local node if the
-workload pattern changes and minimises performance impact due to remote
-memory accesses. These sysctls control the thresholds for scan delays and
-the number of pages scanned.
+Or NUMA_BALANCING_NORMAL to optimize page placement among different
+NUMA nodes to reduce remote accesses. On NUMA machines, there is a
+performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing
+memory by periodically unmapping pages and later trapping a page
+fault. At the time of the page fault, it is determined if the data
+being accessed should be migrated to a local memory node.
-``numa_balancing_scan_period_min_ms`` is the minimum time in milliseconds to
-scan a tasks virtual memory. It effectively controls the maximum scanning
-rate for each task.
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled.
-``numa_balancing_scan_delay_ms`` is the starting "scan delay" used for a task
-when it initially forks.
+Or NUMA_BALANCING_MEMORY_TIERING to optimize page placement among
+different types of memory (represented as different NUMA nodes) to
+place the hot pages in the fast memory. This is likewise implemented
+by unmapping pages and trapping page faults.
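+
+For example, both modes can be enabled at once by writing the ORed value::
+
+  # NUMA_BALANCING_NORMAL | NUMA_BALANCING_MEMORY_TIERING = 1 | 2 = 3
+  echo 3 > /proc/sys/kernel/numa_balancing
+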
-``numa_balancing_scan_period_max_ms`` is the maximum time in milliseconds to
-scan a tasks virtual memory. It effectively controls the minimum scanning
-rate for each task.
+numa_balancing_promote_rate_limit_MBps
+======================================
-``numa_balancing_scan_size_mb`` is how many megabytes worth of pages are
-scanned for a given scan.
+Too high promotion/demotion throughput between different memory types
+may hurt application latency. This can be used to rate limit the
+promotion throughput. The per-node max promotion throughput in MB/s
+will be limited to be no more than the set value.
+A rule of thumb is to set this to less than 1/10 of the PMEM node
+write bandwidth.
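+
+For example, assuming a PMEM node with roughly 2000 MB/s of write bandwidth
+(an illustrative figure), a limit comfortably below one tenth of that
+would be::
+
+  echo 150 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps
+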
-oops_all_cpu_backtrace:
-================
+oops_all_cpu_backtrace
+======================
If this option is set, the kernel will send an NMI to all CPUs to dump
their backtraces when an oops event occurs. It should be used as a last
@@ -662,6 +736,15 @@ This is the default behavior.
an oops event is detected.
+oops_limit
+==========
+
+Number of kernel oopses after which the kernel should panic when
+``panic_on_oops`` is not set. Setting this to 0 disables checking
+the count. Setting this to 1 has the same effect as setting
+``panic_on_oops=1``. The default value is 10000.
+
+
osrelease, ostype & version
===========================
@@ -786,6 +869,9 @@ bit 1 print system memory info
bit 2 print timer info
bit 3 print locks info if ``CONFIG_LOCKDEP`` is on
bit 4 print ftrace buffer
+bit 5 print all printk messages in buffer
+bit 6 print all CPUs backtrace (if available in the arch)
+bit 7 print only tasks in uninterruptible (blocked) state
===== ============================================
So for example to print tasks and memory info on panic, user can::
@@ -804,6 +890,13 @@ is useful to define the root cause of RCU stalls using a vmcore.
1 panic() after printing RCU stall messages.
= ============================================================
+max_rcu_stall_to_panic
+======================
+
+When ``panic_on_rcu_stall`` is set to 1, this value determines the
+number of times that RCU can stall before panic() is called.
+
+When ``panic_on_rcu_stall`` is set to 0, this value has no effect.
perf_cpu_time_max_percent
=========================
@@ -878,7 +971,7 @@ The default value is 127.
perf_event_mlock_kb
===================
-Control size of per-cpu ring buffer not counted agains mlock limit.
+Control size of per-cpu ring buffer not counted against mlock limit.
The default value is 512 + 1 page
@@ -896,6 +989,36 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
The default value is 8.
+perf_user_access (arm64 and riscv only)
+=======================================
+
+Controls user space access for reading perf event counters.
+
+arm64
+=====
+
+The default value is 0 (access disabled).
+
+When set to 1, user space can read performance monitor counter registers
+directly.
+
+See Documentation/arch/arm64/perf.rst for more information.
+
+riscv
+=====
+
+When set to 0, user space access is disabled.
+
+The default value is 1: user space can read performance monitor counter
+registers through perf; any direct access without perf intervention will
+trigger an illegal instruction exception.
+
+Setting it to 2 enables legacy mode (user space has direct access to the
+cycle and instret CSRs only). Note that this legacy value is deprecated
+and will be removed once all user space applications are fixed.
+
+Note that the time CSR is always directly accessible to all modes.
+
pid_max
=======
@@ -996,6 +1119,32 @@ pty
See Documentation/filesystems/devpts.rst.
+random
+======
+
+This is a directory, with the following entries:
+
+* ``boot_id``: a UUID generated the first time this is retrieved, and
+ unvarying after that;
+
+* ``uuid``: a UUID generated every time this is retrieved (this can
+ thus be used to generate UUIDs at will);
+
+* ``entropy_avail``: the pool's entropy count, in bits;
+
+* ``poolsize``: the entropy pool size, in bits;
+
+* ``urandom_min_reseed_secs``: obsolete (used to determine the minimum
+ number of seconds between urandom pool reseeding). This file is
+ writable for compatibility purposes, but writing to it has no effect
+ on any RNG behavior;
+
+* ``write_wakeup_threshold``: when the entropy count drops below this
+ (as a number of bits), processes waiting to write to ``/dev/random``
+ are woken up. This file is writable for compatibility purposes, but
+ writing to it has no effect on any RNG behavior.
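+
+For example, a fresh UUID can be generated and the pool's entropy count
+read (the output shown is only illustrative)::
+
+  $ cat /proc/sys/kernel/random/uuid
+  5deb3dae-9a24-48a5-8a30-3c1cf9aa0a47
+  $ cat /proc/sys/kernel/random/entropy_avail
+  256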
+
+
randomize_va_space
==================
@@ -1033,7 +1182,7 @@ that support this feature.
real-root-dev
=============
-See :doc:`/admin-guide/initrd`.
+See Documentation/admin-guide/initrd.rst.
reboot-cmd (SPARC only)
@@ -1052,8 +1201,16 @@ automatically on platforms where it can run (that is,
platforms with asymmetric CPU topologies and having an Energy
Model available). If your platform happens to meet the
requirements for EAS but you do not want to use it, change
-this value to 0.
+this value to 0. On non-EAS platforms, the write operation fails and
+reads don't return anything.
+
+task_delayacct
+===============
+
+Enables/disables task delay accounting (see
+Documentation/accounting/delay-accounting.rst). Enabling this feature incurs
+a small amount of overhead in the scheduler but is useful for debugging
+and performance tuning. It is required by some tools such as iotop.
sched_schedstats
================
@@ -1062,11 +1219,65 @@ Enables/disables scheduler statistics. Enabling this feature
incurs a small amount of overhead in the scheduler but is
useful for debugging and performance tuning.
+sched_util_clamp_min
+====================
+
+Max allowed *minimum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.min value cannot be greater than
+sched_util_clamp_min, i.e., it is restricted to the range
+[0:sched_util_clamp_min].
+
+sched_util_clamp_max
+====================
+
+Max allowed *maximum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.max value cannot be greater than
+sched_util_clamp_max, i.e., it is restricted to the range
+[0:sched_util_clamp_max].
+
+sched_util_clamp_min_rt_default
+===============================
+
+By default Linux is tuned for performance, which means that RT tasks always
+run at the highest frequency and on the most capable (highest capacity) CPU
+(in heterogeneous systems).
+
+Uclamp achieves this by setting the requested uclamp.min of all RT tasks to
+1024 by default, which effectively boosts the tasks to run at the highest
+frequency and biases them to run on the biggest CPU.
+
+This knob allows admins to change the default behavior when uclamp is being
+used. In battery powered devices particularly, running at the maximum
+capacity and frequency will increase energy consumption and shorten the battery
+life.
+
+This knob only affects RT tasks whose requested uclamp.min value has not
+been modified by the user via the sched_setattr() syscall.
+
+This knob will not escape the range constraint imposed by sched_util_clamp_min
+defined above.
+
+For example, if::
+
+ sched_util_clamp_min_rt_default = 800
+ sched_util_clamp_min = 600
+
+Then the boost will be clamped to 600 because 800 is outside of the permissible
+range of [0:600]. This could happen for instance if a powersave mode
+temporarily restricts all boosts by modifying sched_util_clamp_min. As soon as
+this restriction is lifted, the requested sched_util_clamp_min_rt_default
+will take effect.
seccomp
=======
-See :doc:`/userspace-api/seccomp_filter`.
+See Documentation/userspace-api/seccomp_filter.rst.
sg-big-buff
@@ -1085,15 +1296,20 @@ are doing anyway :)
shmall
======
-This parameter sets the total amount of shared memory pages that
-can be used system wide. Hence, ``shmall`` should always be at least
-``ceil(shmmax/PAGE_SIZE)``.
+This parameter sets the total number of shared memory pages that can be used
+inside an ipc namespace. Shared memory pages are counted for each ipc
+namespace separately and are not inherited. Hence, ``shmall`` should always be
+at least ``ceil(shmmax/PAGE_SIZE)``.
If you are not sure what the default ``PAGE_SIZE`` is on your Linux
system, you can run the following command::
# getconf PAGE_SIZE
+To reduce or disable the ability to allocate shared memory, you must create a
+new ipc namespace, set this parameter to the required value and prohibit the
+creation of new ipc namespaces in the current user namespace; alternatively,
+cgroups can be used.
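+
+As an arithmetic example (the ``shmmax`` value is only illustrative), with
+4096-byte pages and ``shmmax`` set to 17179869184 (16 GiB), ``shmall`` must
+be at least 17179869184 / 4096 = 4194304 pages::
+
+  echo 4194304 > /proc/sys/kernel/shmall
+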
shmmax
======
@@ -1195,11 +1411,34 @@ This parameter can be used to control the soft lockup detector.
= =================================
The soft lockup detector monitors CPUs for threads that are hogging the CPUs
-without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads
-from running. The mechanism depends on the CPUs ability to respond to timer
-interrupts which are needed for the 'watchdog/N' threads to be woken up by
-the watchdog timer function, otherwise the NMI watchdog — if enabled — can
-detect a hard lockup condition.
+without rescheduling voluntarily, and thus prevent the 'migration/N' threads
+from running, causing the watchdog work to fail to execute. The mechanism depends
+on the CPU's ability to respond to timer interrupts, which are needed for the
+watchdog work to be queued by the watchdog timer function, otherwise the NMI
+watchdog — if enabled — can detect a hard lockup condition.
+
+
+split_lock_mitigate (x86 only)
+==============================
+
+On x86, each "split lock" imposes a system-wide performance penalty. On larger
+systems, large numbers of split locks from unprivileged users can result in
+denials of service to well-behaved and potentially more important users.
+
+The kernel mitigates these bad users by detecting split locks and imposing
+penalties: forcing them to wait and only allowing one core to execute split
+locks at a time.
+
+These mitigations can make those bad applications unbearably slow. Setting
+split_lock_mitigate=0 may restore some application performance, but will also
+increase system exposure to denial of service attacks from split lock users.
+
+= ===================================================================
+0 Disable the mitigation mode - just warns about the split lock in the
+  kernel log and exposes the system to denials of service from the split
+  lockers.
+1 Enable the mitigation mode (this is the default) - penalizes the split
+  lockers with intentional performance degradation.
+= ===================================================================
stack_erasing
@@ -1237,7 +1476,7 @@ the boot PROM.
sysrq
=====
-See :doc:`/admin-guide/sysrq`.
+See Documentation/admin-guide/sysrq.rst.
tainted
@@ -1249,7 +1488,7 @@ ORed together. The letters are seen in "Tainted" line of Oops reports.
====== ===== ==============================================================
1 `(P)` proprietary module was loaded
2 `(F)` module was force loaded
- 4 `(S)` SMP kernel oops on an officially SMP incapable processor
+ 4 `(S)` kernel running on an out of specification system
8 `(R)` module was force unloaded
16 `(M)` processor reported a Machine Check Exception (MCE)
32 `(B)` bad page referenced or some unexpected page flags
@@ -1267,15 +1506,16 @@ ORed together. The letters are seen in "Tainted" line of Oops reports.
131072 `(T)` The kernel was built with the struct randomization plugin
====== ===== ==============================================================
-See :doc:`/admin-guide/tainted-kernels` for more information.
+See Documentation/admin-guide/tainted-kernels.rst for more information.
Note:
writes to this sysctl interface will fail with ``EINVAL`` if the kernel is
booted with the command line option ``panic_on_taint=<bitmask>,nousertaint``
and any of the ORed together values being written to ``tainted`` match with
the bitmask declared on panic_on_taint.
- See :doc:`/admin-guide/kernel-parameters` for more details on that particular
- kernel command line option and its optional ``nousertaint`` switch.
+ See Documentation/admin-guide/kernel-parameters.rst for more details on
+ that particular kernel command line option and its optional
+ ``nousertaint`` switch.
threads-max
===========
@@ -1299,7 +1539,7 @@ If a value outside of this range is written to ``threads-max`` an
traceoff_on_warning
===================
-When set, disables tracing (see :doc:`/trace/ftrace`) when a
+When set, disables tracing (see Documentation/trace/ftrace.rst) when a
``WARN()`` is hit.
@@ -1319,24 +1559,8 @@ will send them to printk() again.
This only works if the kernel was booted with ``tp_printk`` enabled.
-See :doc:`/admin-guide/kernel-parameters` and
-:doc:`/trace/boottime-trace`.
-
-
-.. _unaligned-dump-stack:
-
-unaligned-dump-stack (ia64)
-===========================
-
-When logging unaligned accesses, controls whether the stack is
-dumped.
-
-= ===================================================
-0 Do not dump the stack. This is the default setting.
-1 Dump the stack.
-= ===================================================
-
-See also `ignore-unaligned-usertrap`_.
+See Documentation/admin-guide/kernel-parameters.rst and
+Documentation/trace/boottime-trace.rst.
unaligned-trap
@@ -1344,8 +1568,8 @@ unaligned-trap
On architectures where unaligned accesses cause traps, and where this
feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW``; currently,
-``arc`` and ``parisc``), controls whether unaligned traps are caught
-and emulated (instead of failing).
+``arc``, ``parisc`` and ``loongarch``), controls whether unaligned traps
+are caught and emulated (instead of failing).
= ========================================================
0 Do not emulate unaligned accesses.
@@ -1370,10 +1594,31 @@ unprivileged_bpf_disabled
=========================
Writing 1 to this entry will disable unprivileged calls to ``bpf()``;
-once disabled, calling ``bpf()`` without ``CAP_SYS_ADMIN`` will return
-``-EPERM``.
+once disabled, calling ``bpf()`` without ``CAP_SYS_ADMIN`` or ``CAP_BPF``
+will return ``-EPERM``. Once set to 1, this can't be cleared from the
+running kernel anymore.
+
+Writing 2 to this entry will also disable unprivileged calls to ``bpf()``,
+however, an admin can still change this setting later on, if needed, by
+writing 0 or 1 to this entry.
+
+If ``BPF_UNPRIV_DEFAULT_OFF`` is enabled in the kernel config, then this
+entry will default to 2 instead of 0.
+
+= =============================================================
+0 Unprivileged calls to ``bpf()`` are enabled
+1 Unprivileged calls to ``bpf()`` are disabled without recovery
+2 Unprivileged calls to ``bpf()`` are disabled
+= =============================================================
+
+
+warn_limit
+==========
-Once set, this can't be cleared.
+Number of kernel warnings after which the kernel should panic when
+``panic_on_warn`` is not set. Setting this to 0 disables checking
+the warning count. Setting this to 1 has the same effect as setting
+``panic_on_warn=1``. The default value is 0.
watchdog
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 42cd04bca548..7250c0542828 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -31,17 +31,18 @@ see only some of them, depending on your kernel's configuration.
Table : Subdirectories in /proc/sys/net
- ========= =================== = ========== ==================
+ ========= =================== = ========== ===================
Directory Content Directory Content
- ========= =================== = ========== ==================
- core General parameter appletalk Appletalk protocol
- unix Unix domain sockets netrom NET/ROM
- 802 E802 protocol ax25 AX25
- ethernet Ethernet protocol rose X.25 PLP layer
+ ========= =================== = ========== ===================
+ 802 E802 protocol mptcp Multipath TCP
+ appletalk Appletalk protocol netfilter Network Filter
+ ax25 AX25 netrom NET/ROM
+ bridge Bridging rose X.25 PLP layer
+ core General parameter tipc TIPC
+ ethernet Ethernet protocol unix Unix domain sockets
ipv4 IP version 4 x25 X.25 protocol
- bridge Bridging decnet DEC net
- ipv6 IP version 6 tipc TIPC
- ========= =================== = ========== ==================
+ ipv6 IP version 6
+ ========= =================== = ========== ===================
1. /proc/sys/net/core - Network core options
============================================
@@ -64,16 +65,17 @@ two flavors of JITs, the newer eBPF JIT currently supported on:
- arm64
- arm32
- ppc64
+ - ppc32
- sparc64
- mips64
- s390x
- riscv64
- riscv32
+ - loongarch64
And the older cBPF JIT supported on the following archs:
- mips
- - ppc
- sparc
eBPF JITs are a superset of cBPF JITs, meaning the kernel will
@@ -101,6 +103,9 @@ Values:
- 1 - enable JIT hardening for unprivileged users only
- 2 - enable JIT hardening for all users
+where "privileged user" in this context means a process having
+CAP_BPF or CAP_SYS_ADMIN in the root user namespace.
+
bpf_jit_kallsyms
----------------
@@ -201,6 +206,11 @@ Will increase power usage.
Default: 0 (off)
+mem_pcpu_rsv
+------------
+
+Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
+
rmem_default
------------
@@ -211,6 +221,12 @@ rmem_max
The maximum receive socket buffer size in bytes.
+rps_default_mask
+----------------
+
+The default RPS CPU mask used on newly created network devices. An empty
+mask means RPS is disabled by default.
+
tstamp_allow_data
-----------------
Allow processes to receive tx timestamps looped together with the original
@@ -271,7 +287,7 @@ poll cycle or the number of packets processed reaches netdev_budget.
netdev_max_backlog
------------------
-Maximum number of packets, queued on the INPUT side, when the interface
+Maximum number of packets, queued on the INPUT side, when the interface
receives packets faster than kernel can process them.
netdev_rss_key
@@ -311,21 +327,52 @@ permit to distribute the load on several cpus.
If set to 1 (default), timestamps are sampled as soon as possible, before
queueing.
+netdev_unregister_timeout_secs
+------------------------------
+
+Unregister network device timeout in seconds.
+This option controls the timeout (in seconds) used to issue a warning while
+waiting for a network device refcount to drop to 0 during device
+unregistration. A lower value may be useful during bisection to detect
+a leaked reference faster. A larger value may be useful to prevent false
+warnings on slow/loaded systems.
+Default value is 10, minimum 1, maximum 3600.
+
+skb_defer_max
+-------------
+
+Max size (in skbs) of the per-cpu list of skbs being freed
+by the CPU that allocated them. Used by the TCP stack so far.
+
+Default: 64
+
optmem_max
----------
Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence
-of struct cmsghdr structures with appended data.
+of struct cmsghdr structures with appended data. TCP tx zerocopy also uses
+optmem_max as a limit for its internal structures.
+
+Default : 128 KB
fb_tunnels_only_for_init_net
----------------------------
Controls if fallback tunnels (like tunl0, gre0, gretap0, erspan0,
-sit0, ip6tnl0, ip6gre0) are automatically created when a new
-network namespace is created, if corresponding tunnel is present
-in initial network namespace.
-If set to 1, these devices are not automatically created, and
-user space is responsible for creating them if needed.
+sit0, ip6tnl0, ip6gre0) are automatically created. There are 3 possibilities:
+(a) value = 0; respective fallback tunnels are created when the module is
+loaded in every net namespace (backward compatible behavior).
+(b) value = 1; [kcmd value: initns] respective fallback tunnels are
+created only in the init net namespace and every other net namespace will
+not have them.
+(c) value = 2; [kcmd value: none] fallback tunnels are not created
+when a module is loaded in any of the net namespaces. Setting the value to
+"2" is pointless after boot if these modules are built-in, so there is
+a kernel command-line option that can change this default. Please refer to
+Documentation/admin-guide/kernel-parameters.txt for additional details.
+
+Not creating fallback tunnels gives control to userspace to create
+whatever is needed only and avoid creating devices which are redundant.
Default : 0 (for compatibility reasons)
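+
+For example, to create fallback tunnels only in the initial network
+namespace::
+
+  echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net
+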
@@ -345,6 +392,36 @@ new netns has been created.
Default : 0 (for compatibility reasons)
+txrehash
+--------
+
+Controls the default hash rethink behaviour on a socket when the SO_TXREHASH
+option is set to SOCK_TXREHASH_DEFAULT (i.e. not overridden by setsockopt).
+
+If set to 1 (default), hash rethink is performed on the listening socket.
+If set to 0, hash rethink is not performed.
+
+gro_normal_batch
+----------------
+
+Maximum number of the segments to batch up on output of GRO. When a packet
+exits GRO, either as a coalesced superframe or as an original packet which
+GRO has decided not to coalesce, it is placed on a per-NAPI list. This
+list is then passed to the stack when the number of segments reaches the
+gro_normal_batch limit.
+
+high_order_alloc_disable
+------------------------
+
+By default the allocator for page frags tries to use high order pages (order-3
+on x86). While the default behavior gives good results in most cases, some users
+might have hit a contention in page allocations/freeing. This was especially
+true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
+lists. This allows opting in to order-0 allocations instead but is now mostly
+of historical importance.
+
+Default: 0
+
2. /proc/sys/net/unix - Parameters for Unix domain sockets
----------------------------------------------------------
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index d46d5b7013c6..c59889de122b 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -25,8 +25,8 @@ files can be found in mm/swap.c.
Currently, these files are in /proc/sys/vm:
- admin_reserve_kbytes
-- block_dump
- compact_memory
+- compaction_proactiveness
- compact_unevictable_allowed
- dirty_background_bytes
- dirty_background_ratio
@@ -37,6 +37,7 @@ Currently, these files are in /proc/sys/vm:
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
+- highmem_is_dirtyable
- hugetlb_shm_group
- laptop_mode
- legacy_va_layout
@@ -61,8 +62,9 @@ Currently, these files are in /proc/sys/vm:
- overcommit_memory
- overcommit_ratio
- page-cluster
+- page_lock_unfairness
- panic_on_oom
-- percpu_pagelist_fraction
+- percpu_pagelist_high_fraction
- stat_interval
- stat_refresh
- numa_stat
@@ -104,13 +106,6 @@ On x86_64 this is about 128MB.
Changing this takes effect whenever an application requests memory.
-block_dump
-==========
-
-block_dump enables block I/O debugging when set to a nonzero value. More
-information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.
-
-
compact_memory
==============
@@ -119,6 +114,22 @@ all zones are compacted such that free memory is available in contiguous
blocks where possible. This can be important for example in the allocation of
huge pages although processes will also directly compact memory as required.
+compaction_proactiveness
+========================
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Writing a non-zero value to this tunable will immediately
+trigger proactive compaction. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
+
+Be careful when setting it to extreme values like 100, as that may
+cause excessive background compaction activity.
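+
+For example, background compaction can be made somewhat more aggressive
+than the default (the value is only illustrative)::
+
+  echo 30 > /proc/sys/vm/compaction_proactiveness
+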
compact_unevictable_allowed
===========================
@@ -129,7 +140,7 @@ This should be used on systems where stalls for minor page faults are an
acceptable trade for large contiguous free memory. Set to 0 to prevent
compaction from moving pages that are unevictable. Default value is 1.
On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
-to compaction, which would block the task from becomming active until the fault
+to compaction, which would block the task from becoming active until the fault
is resolved.
@@ -345,7 +356,7 @@ The lowmem_reserve_ratio is an array. You can see them by reading this file::
But, these values are not used directly. The kernel calculates # of protection
pages for each zones from them. These are shown as array of protection pages
-in /proc/zoneinfo like followings. (This is an example of x86-64 box).
+in /proc/zoneinfo like the following. (This is an example of x86-64 box).
Each zone has an array of protection pages like this::
Node 0, zone DMA
@@ -411,7 +422,7 @@ While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.
-The default value is 65536.
+The default value is 65530.
memory_failure_early_kill:
@@ -422,7 +433,7 @@ a 2bit error in a memory module) is detected in the background by hardware
that cannot be handled by the kernel. In some cases (like the page
still having a valid copy on disk) the kernel will handle the failure
transparently without affecting any applications. But if there is
-no other uptodate copy of the data it will kill to prevent any data
+no other up-to-date copy of the data it will kill to prevent any data
corruptions from propagating.
1: Kill all processes that have the corrupted and not reloadable page mapped
@@ -551,6 +562,43 @@ Change the minimum size of the hugepage pool.
See Documentation/admin-guide/mm/hugetlbpage.rst
+hugetlb_optimize_vmemmap
+========================
+
+This knob is not available when the size of 'struct page' (a structure defined
+in include/linux/mm_types.h) is not a power of two (an unusual system config
+could result in this).
+
+Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO).
+
+Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
+buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
+per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not be
+optimized. When those optimized HugeTLB pages are freed from the HugeTLB pool
+to the buddy allocator, the vmemmap pages representing that range need to be
+remapped again and the vmemmap pages discarded earlier need to be reallocated
+again. If your use case is that HugeTLB pages are allocated 'on the fly' (e.g.
+never explicitly allocating HugeTLB pages with 'nr_hugepages' but only set
+'nr_overcommit_hugepages', those overcommitted HugeTLB pages are allocated 'on
+the fly') instead of being pulled from the HugeTLB pool, you should weigh the
+benefits of memory savings against the extra overhead (~2x slower than before)
+of allocating or freeing HugeTLB pages between the HugeTLB pool and the buddy
+allocator. Another behavior to note is that if the system is under heavy memory
+pressure, it could prevent the user from freeing HugeTLB pages from the HugeTLB
+pool to the buddy allocator since the allocation of vmemmap pages could
+fail; you have to retry later if your system encounters this situation.
+
+Once disabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
+buddy allocator will not be optimized, meaning the extra overhead at allocation
+time from the buddy allocator disappears, whereas already optimized HugeTLB pages
+will not be affected. If you want to make sure there are no optimized HugeTLB
+pages, you can set "nr_hugepages" to 0 first and then disable this. Note that
+writing 0 to nr_hugepages will make any "in use" HugeTLB pages become surplus
+pages. So, those surplus pages are still optimized until they are no longer
+in use. You would need to wait for those surplus pages to be released before
+there are no optimized pages in the system.
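+
+For example, the sequence just described can be carried out from a shell like
+this (a minimal sketch; both knobs live under /proc/sys/vm/)::
+
+    # make in-use HugeTLB pages surplus and stop handing out new ones
+    echo 0 > /proc/sys/vm/nr_hugepages
+    # disable HVO for HugeTLB pages allocated from now on
+    echo 0 > /proc/sys/vm/hugetlb_optimize_vmemmap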
+
+
nr_hugepages_mempolicy
======================
@@ -583,7 +631,7 @@ trimming of allocations is initiated.
The default value is 1.
-See Documentation/nommu-mmap.txt for more information.
+See Documentation/admin-guide/mm/nommu-mmap.rst for more information.
numa_zonelist_order
@@ -694,8 +742,8 @@ overcommit_memory
This value contains a flag that enables memory overcommitment.
-When this flag is 0, the kernel attempts to estimate the amount
-of free memory left when userspace requests more memory.
+When this flag is 0, the kernel compares the userspace memory request
+size against total memory plus swap and rejects obvious overcommits.
When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.
@@ -710,7 +758,7 @@ and don't use much of it.
The default value is 0.
-See Documentation/vm/overcommit-accounting.rst and
+See Documentation/mm/overcommit-accounting.rst and
mm/util.c::__vm_enough_memory() for more information.
@@ -744,6 +792,14 @@ extra faults and I/O delays for following faults if they would have been part of
that consecutive pages readahead would have brought in.
+page_lock_unfairness
+====================
+
+This value determines the number of times that the page lock can be
+stolen from under a waiter. After the lock is stolen the number of times
+specified in this file (default is 5), the "fair lock handoff" semantics
+will apply, and the waiter will only be awakened if the lock can be taken.
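+
+For example, to read the current value or to disable lock stealing entirely
+(a minimal sketch; the knob lives under /proc/sys/vm/)::
+
+    # show the current number of allowed steals (default: 5)
+    cat /proc/sys/vm/page_lock_unfairness
+    # 0 means the fair lock handoff semantics apply right away
+    echo 0 > /proc/sys/vm/page_lock_unfairness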
+
panic_on_oom
============
@@ -773,22 +829,24 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
why oom happens. You can get snapshot.
-percpu_pagelist_fraction
-========================
+percpu_pagelist_high_fraction
+=============================
-This is the fraction of pages at most (high mark pcp->high) in each zone that
-are allocated for each per cpu page list. The min value for this is 8. It
-means that we don't allow more than 1/8th of pages in each zone to be
-allocated in any single per_cpu_pagelist. This entry only changes the value
-of hot per cpu pagelists. User can specify a number like 100 to allocate
-1/100th of each zone to each per cpu page list.
+This is the fraction of pages in each zone that can be stored on
+per-cpu page lists. It is an upper boundary that is divided depending
+on the number of online CPUs. The min value for this is 8 which means
+that we do not allow more than 1/8th of pages in each zone to be stored
+on per-cpu page lists. This entry only changes the value of hot per-cpu
+page lists. A user can specify a number like 100 to allocate 1/100th of
+each zone between per-cpu lists.
-The batch value of each per cpu pagelist is also updated as a result. It is
-set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
+The batch value of each per-cpu page list remains the same regardless of
+the value of the high fraction so allocation latencies are unaffected.
-The initial value is zero. Kernel does not use this value at boot time to set
-the high water marks for each per cpu page list. If the user writes '0' to this
-sysctl, it will revert to this default behavior.
+The initial value is zero. The kernel uses this value to set the pcp->high
+mark based on the low watermark for the zone and the number of local
+online CPUs. If the user writes '0' to this sysctl, it will revert to
+this default behavior.
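+
+For example, to let per-cpu page lists hold at most 1/100th of each zone, or
+to return to the default sizing (a minimal sketch)::
+
+    # cap per-cpu page lists at 1/100th of each zone
+    echo 100 > /proc/sys/vm/percpu_pagelist_high_fraction
+    # revert to the default heuristic based on the low watermark
+    echo 0 > /proc/sys/vm/percpu_pagelist_high_fraction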
stat_interval
@@ -856,13 +914,21 @@ file-backed pages is less than the high watermark in a zone.
unprivileged_userfaultfd
========================
-This flag controls whether unprivileged users can use the userfaultfd
-system calls. Set this to 1 to allow unprivileged users to use the
-userfaultfd system calls, or set this to 0 to restrict userfaultfd to only
-privileged users (with SYS_CAP_PTRACE capability).
+This flag controls the mode in which unprivileged users can use the
+userfaultfd system calls. Set this to 0 to restrict unprivileged users
+to handle page faults in user mode only. In this case, users without
+CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to
+succeed. Prohibiting use of userfaultfd for handling faults from kernel
+mode may make certain vulnerabilities more difficult to exploit.
-The default value is 1.
+Set this to 1 to allow unprivileged users to use the userfaultfd system
+calls without any restrictions.
+
+The default value is 0.
+Another way to control permissions for userfaultfd is to use
+/dev/userfaultfd instead of userfaultfd(2). See
+Documentation/admin-guide/mm/userfaultfd.rst.
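+
+For example, to allow unprivileged users unrestricted access to the
+userfaultfd system calls (a minimal sketch)::
+
+    echo 1 > /proc/sys/vm/unprivileged_userfaultfd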
user_reserve_kbytes
===================
@@ -914,12 +980,12 @@ allocations, THP and hugetlbfs pages.
To make it sensible with respect to the watermark_scale_factor
parameter, the unit is in fractions of 10,000. The default value of
-15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
-watermark will be reclaimed in the event of a pageblock being mixed due
-to fragmentation. The level of reclaim is determined by the number of
-fragmentation events that occurred in the recent past. If this value is
-smaller than a pageblock then a pageblocks worth of pages will be reclaimed
-(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature.
+15,000 means that up to 150% of the high watermark will be reclaimed in the
+event of a pageblock being mixed due to fragmentation. The level of reclaim
+is determined by the number of fragmentation events that occurred in the
+recent past. If this value is smaller than a pageblock then a pageblock's
+worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
+of 0 will disable the feature.
watermark_scale_factor
@@ -931,7 +997,7 @@ how much memory needs to be free before kswapd goes back to sleep.
The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
-node/system. The maximum value is 1000, or 10% of memory.
+node/system. The maximum value is 3000, or 30% of memory.
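+
+For example, raising the value to 100 widens the gap between the watermarks to
+1% of the available memory in the node/system (a minimal sketch)::
+
+    echo 100 > /proc/sys/vm/watermark_scale_factor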
A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
@@ -961,11 +1027,11 @@ that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.
-zone_reclaim may be enabled if it's known that the workload is partitioned
-such that each partition fits within a NUMA node and that accessing remote
-memory would cause a measurable performance reduction. The page allocator
-will then reclaim easily reusable pages (those page cache pages that are
-currently not used) before allocating off node pages.
+Consider enabling one or more zone_reclaim mode bits if it's known that the
+workload is partitioned such that each partition fits within a NUMA node
+and that accessing remote memory would cause a measurable performance
+reduction. The page allocator will take additional actions before
+allocating off node pages.
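+
+For example, to turn on the basic zone reclaim bit, or to switch the feature
+off again (a minimal sketch; the individual mode bits are listed earlier in
+this section)::
+
+    echo 1 > /proc/sys/vm/zone_reclaim_mode
+    echo 0 > /proc/sys/vm/zone_reclaim_mode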
Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst
index e6424d8c5846..2f2e5bd440f9 100644
--- a/Documentation/admin-guide/sysrq.rst
+++ b/Documentation/admin-guide/sysrq.rst
@@ -72,13 +72,24 @@ On PowerPC
On other
If you know of the key combos for other architectures, please
- let me know so I can add them to this section.
+ submit a patch to be included in this section.
On all
- Write a character to /proc/sysrq-trigger. e.g.::
+ Write a single character to /proc/sysrq-trigger.
+ Only the first character is processed; the rest of the string is
+ ignored. However, it is not recommended to write any extra characters,
+ as the behavior is undefined and might change in future versions.
+ E.g.::
echo t > /proc/sysrq-trigger
+ Alternatively, write multiple characters prepended by an underscore.
+ This way, all characters will be processed. E.g.::
+
+ echo _reisub > /proc/sysrq-trigger
+
+The :kbd:`<command key>` is case sensitive.
+
What are the 'command' keys?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -88,8 +99,8 @@ Command Function
``b`` Will immediately reboot the system without syncing or unmounting
your disks.
-``c`` Will perform a system crash by a NULL pointer dereference.
- A crashdump will be taken if configured.
+``c`` Will perform a system crash and a crashdump will be taken
+ if configured.
``d`` Shows all locks that are held.
@@ -136,7 +147,7 @@ Command Function
``v`` Forcefully restores framebuffer console
``v`` Causes ETM buffer dump [ARM-specific]
-``w`` Dumps tasks that are in uninterruptable (blocked) state.
+``w`` Dumps tasks that are in uninterruptible (blocked) state.
``x`` Used by xmon interface on ppc/powerpc platforms.
Show global PMU Registers on sparc64.
@@ -203,10 +214,12 @@ frozen (probably root) filesystem via the FIFREEZE ioctl.
Sometimes SysRq seems to get 'stuck' after using it, what can I do?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-That happens to me, also. I've found that tapping shift, alt, and control
-on both sides of the keyboard, and hitting an invalid sysrq sequence again
-will fix the problem. (i.e., something like :kbd:`alt-sysrq-z`). Switching to
-another virtual console (:kbd:`ALT+Fn`) and then back again should also help.
+When this happens, try tapping shift, alt and control on both sides of the
+keyboard, and hitting an invalid sysrq sequence again. (i.e., something like
+:kbd:`alt-sysrq-z`).
+
+Switching to another virtual console (:kbd:`ALT+Fn`) and then back again
+should also help.
I hit SysRq, but nothing seems to happen, what's wrong?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index 71e9184a9079..f92551539e8a 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -34,11 +34,11 @@ name of the command ('Comm:') that triggered the event::
You'll find a 'Not tainted: ' there if the kernel was not tainted at the
time of the event; if it was, then it will print 'Tainted: ' and characters
-either letters or blanks. In above example it looks like this::
+either letters or blanks. In the example above it looks like this::
Tainted: P W O
-The meaning of those characters is explained in the table below. In tis case
+The meaning of those characters is explained in the table below. In this case
the kernel got tainted earlier because a proprietary Module (``P``) was loaded,
a warning occurred (``W``), and an externally-built module was loaded (``O``).
To decode other letters use the table below.
@@ -52,7 +52,7 @@ At runtime, you can query the tainted state by reading
tainted; any other number indicates the reasons why it is. The easiest way to
decode that number is the script ``tools/debugging/kernel-chktaint``, which your
distribution might ship as part of a package called ``linux-tools`` or
-``kernel-tools``; if it doesn't you can download the script from
+``kernel-tools``; if it doesn't, you can download the script from
`git.kernel.org <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/tools/debugging/kernel-chktaint>`_
and execute it with ``sh kernel-chktaint``, which would print something like
this on the machine that had the statements in the logs that were quoted earlier::
@@ -61,7 +61,7 @@ this on the machine that had the statements in the logs that were quoted earlier
* Proprietary module was loaded (#0)
* Kernel issued warning (#9)
* Externally-built ('out-of-tree') module was loaded (#12)
- See Documentation/admin-guide/tainted-kernels.rst in the the Linux kernel or
+ See Documentation/admin-guide/tainted-kernels.rst in the Linux kernel or
https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html for
a more details explanation of the various taint flags.
Raw taint value as int/string: 4609/'P W O '
@@ -84,7 +84,7 @@ Bit Log Number Reason that got the kernel tainted
=== === ====== ========================================================
0 G/P 1 proprietary module was loaded
1 _/F 2 module was force loaded
- 2 _/S 4 SMP kernel oops on an officially SMP incapable processor
+ 2 _/S 4 kernel running on an out of specification system
3 _/R 8 module was force unloaded
4 _/M 16 processor reported a Machine Check Exception (MCE)
5 _/B 32 bad page referenced or some unexpected page flags
@@ -100,6 +100,7 @@ Bit Log Number Reason that got the kernel tainted
15 _/K 32768 kernel has been live patched
16 _/X 65536 auxiliary taint, defined for and used by distros
17 _/T 131072 kernel was built with the struct randomization plugin
+ 18 _/N 262144 an in-kernel test has been run
=== === ====== ========================================================
Note: The character ``_`` is representing a blank in this table to make reading
@@ -116,10 +117,29 @@ More detailed explanation for tainting
1) ``F`` if any module was force loaded by ``insmod -f``, ``' '`` if all
modules were loaded normally.
- 2) ``S`` if the oops occurred on an SMP kernel running on hardware that
- hasn't been certified as safe to run multiprocessor.
- Currently this occurs only on various Athlons that are not
- SMP capable.
+ 2) ``S`` if the kernel is running on a processor or system that is out of
+ specification: hardware has been put into an unsupported configuration,
+ therefore proper execution cannot be guaranteed.
+ The kernel will be tainted if, for example:
+
+ - on x86: PAE is forced through forcepae on Intel CPUs (such as Pentium M)
+ which do not report PAE but may have a functional implementation, an SMP
+ kernel is running on Athlon CPUs that are not officially SMP capable, MSRs are
+ being poked at from userspace.
+ - on arm: kernel running on certain CPUs (such as Keystone 2) without
+ having certain kernel features enabled.
+ - on arm64: there are mismatched hardware features between CPUs, the
+ bootloader has booted CPUs in different modes.
+ - certain drivers are being used on unsupported architectures (such as
+ scsi/snic on something other than x86_64, scsi/ips on non
+ x86/x86_64/itanium, or broken firmware settings for the
+ irqchip/irq-gic on arm64, ...).
+ - x86/x86_64: Microcode late loading is dangerous and will result in
+ tainting the kernel. It requires that all CPUs rendezvous to make sure
+ the update happens when the system is as quiescent as possible. However,
+ a higher priority MCE/SMI/NMI can move control flow away from that
+ rendezvous and interrupt the update, which can be detrimental to the
+ machine.
3) ``R`` if a module was force unloaded by ``rmmod -f``, ``' '`` if all
modules were unloaded normally.
@@ -130,7 +150,7 @@ More detailed explanation for tainting
5) ``B`` If a page-release function has found a bad page reference or some
unexpected page flags. This indicates a hardware problem or a kernel bug;
there should be other information in the log indicating why this tainting
- occured.
+ occurred.
6) ``U`` if a user or user application specifically requested that the
Tainted flag be set, ``' '`` otherwise.
diff --git a/Documentation/admin-guide/thermal/index.rst b/Documentation/admin-guide/thermal/index.rst
new file mode 100644
index 000000000000..193b7b01a87d
--- /dev/null
+++ b/Documentation/admin-guide/thermal/index.rst
@@ -0,0 +1,8 @@
+=================
+Thermal Subsystem
+=================
+
+.. toctree::
+ :maxdepth: 1
+
+ intel_powerclamp
diff --git a/Documentation/admin-guide/thermal/intel_powerclamp.rst b/Documentation/admin-guide/thermal/intel_powerclamp.rst
new file mode 100644
index 000000000000..08509b978af4
--- /dev/null
+++ b/Documentation/admin-guide/thermal/intel_powerclamp.rst
@@ -0,0 +1,345 @@
+=======================
+Intel Powerclamp Driver
+=======================
+
+By:
+ - Arjan van de Ven <arjan@linux.intel.com>
+ - Jacob Pan <jacob.jun.pan@linux.intel.com>
+
+.. Contents:
+
+ (*) Introduction
+ - Goals and Objectives
+
+ (*) Theory of Operation
+ - Idle Injection
+ - Calibration
+
+ (*) Performance Analysis
+ - Effectiveness and Limitations
+ - Power vs Performance
+ - Scalability
+ - Calibration
+ - Comparison with Alternative Techniques
+
+ (*) Usage and Interfaces
+ - Generic Thermal Layer (sysfs)
+ - Kernel APIs (TBD)
+
+ (*) Module Parameters
+
+INTRODUCTION
+============
+
+Consider the situation where a system’s power consumption must be
+reduced at runtime, due to power budget, thermal constraint, or noise
+level, and where active cooling is not preferred. Software managed
+passive power reduction must be performed to prevent the hardware
+actions that are designed for catastrophic scenarios.
+
+Currently, P-states, T-states (clock modulation), and CPU offlining
+are used for CPU throttling.
+
+On Intel CPUs, C-states provide effective power reduction, but so far
+they’re only used opportunistically, based on workload. With the
+development of intel_powerclamp driver, the method of synchronizing
+idle injection across all online CPU threads was introduced. The goal
+is to achieve forced and controllable C-state residency.
+
+Test/Analysis has been made in the areas of power, performance,
+scalability, and user experience. In many cases, clear advantage is
+shown over taking the CPU offline or modulating the CPU clock.
+
+
+THEORY OF OPERATION
+===================
+
+Idle Injection
+--------------
+
+On modern Intel processors (Nehalem or later), package level C-state
+residency is available in MSRs, thus also available to the kernel.
+
+These MSRs are::
+
+ #define MSR_PKG_C2_RESIDENCY 0x60D
+ #define MSR_PKG_C3_RESIDENCY 0x3F8
+ #define MSR_PKG_C6_RESIDENCY 0x3F9
+ #define MSR_PKG_C7_RESIDENCY 0x3FA
+
+If the kernel can also inject idle time to the system, then a
+closed-loop control system can be established that manages package
+level C-state. The intel_powerclamp driver is conceived as such a
+control system, where the target set point is a user-selected idle
+ratio (based on power reduction), and the error is the difference
+between the actual package level C-state residency ratio and the target idle
+ratio.
+
+Injection is controlled by high priority kernel threads, spawned for
+each online CPU.
+
+These kernel threads, with SCHED_FIFO class, are created to perform
+clamping actions of controlled duty ratio and duration. Each per-CPU
+thread synchronizes its idle time and duration, based on the rounding
+of jiffies, so accumulated errors can be prevented, avoiding a jittery
+effect. Threads are also bound to the CPU such that they cannot be
+migrated, unless the CPU is taken offline. In this case, threads
+belonging to the offlined CPUs will be terminated immediately.
+
+Running as SCHED_FIFO at relatively high priority also allows such a
+scheme to work for both preemptible and non-preemptible kernels.
+Alignment of idle time around jiffies ensures scalability for HZ
+values. This effect can be better visualized using a Perf timechart.
+The following diagram shows the behavior of kernel thread
+kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
+for a given "duration", then relinquishes the CPU to other tasks,
+until the next time interval.
+
+The NOHZ schedule tick is disabled during idle time, but interrupts
+are not masked. Tests show that the extra wakeups from scheduler tick
+have a dramatic impact on the effectiveness of the powerclamp driver
+on large scale systems (Westmere system with 80 processors).
+
+::
+
+ CPU0
+ ____________ ____________
+ kidle_inject/0 | sleep | mwait | sleep |
+ _________| |________| |_______
+ duration
+ CPU1
+ ____________ ____________
+ kidle_inject/1 | sleep | mwait | sleep |
+ _________| |________| |_______
+ ^
+ |
+ |
+ roundup(jiffies, interval)
+
+Only one CPU is allowed to collect statistics and update global
+control parameters. This CPU is referred to as the controlling CPU in
+this document. The controlling CPU is elected at runtime, with a
+policy that favors BSP, taking into account the possibility of a CPU
+hot-plug.
+
+In terms of dynamics of the idle control system, package level idle
+time is considered largely as a non-causal system where its behavior
+cannot be based on the past or current input. Therefore, the
+intel_powerclamp driver attempts to enforce the desired idle time
+instantly as given input (target idle ratio). After injection,
+powerclamp monitors the actual idle for a given time window and adjusts
+the next injection accordingly to avoid over/under correction.
+
+When used in a causal control system, such as a temperature control,
+it is up to the user of this driver to implement algorithms where
+past samples and outputs are included in the feedback. For example, a
+PID-based thermal controller can use the powerclamp driver to
+maintain a desired target temperature, based on integral and
+derivative gains of the past samples.
+
+
+
+Calibration
+-----------
+During scalability testing, it is observed that synchronized actions
+among CPUs become challenging as the number of cores grows. This is
+also true for the ability of a system to enter package level C-states.
+
+To make sure the intel_powerclamp driver scales well, online
+calibration is implemented. The goals for doing such a calibration
+are:
+
+a) determine the effective range of idle injection ratio
+b) determine the amount of compensation needed at each target ratio
+
+Compensation to each target ratio consists of two parts:
+
+ a) steady state error compensation
+
+ This is to offset the error occurring when the system can
+ enter idle without extra wakeups (such as external interrupts).
+
+ b) dynamic error compensation
+
+ When an excessive amount of wakeups occurs during idle, an
+ additional idle ratio can be added to quiet interrupts, by
+ slowing down CPU activities.
+
+A debugfs file is provided for the user to examine compensation
+progress and results, such as on a Westmere system::
+
+ [jacob@nex01 ~]$ cat
+ /sys/kernel/debug/intel_powerclamp/powerclamp_calib
+ controlling cpu: 0
+ pct confidence steady dynamic (compensation)
+ 0 0 0 0
+ 1 1 0 0
+ 2 1 1 0
+ 3 3 1 0
+ 4 3 1 0
+ 5 3 1 0
+ 6 3 1 0
+ 7 3 1 0
+ 8 3 1 0
+ ...
+ 30 3 2 0
+ 31 3 2 0
+ 32 3 1 0
+ 33 3 2 0
+ 34 3 1 0
+ 35 3 2 0
+ 36 3 1 0
+ 37 3 2 0
+ 38 3 1 0
+ 39 3 2 0
+ 40 3 3 0
+ 41 3 1 0
+ 42 3 2 0
+ 43 3 1 0
+ 44 3 1 0
+ 45 3 2 0
+ 46 3 3 0
+ 47 3 0 0
+ 48 3 2 0
+ 49 3 3 0
+
+Calibration occurs during runtime. No offline method is available.
+Steady state compensation is used only when confidence levels of all
+adjacent ratios have reached a satisfactory level. A confidence level
+is accumulated based on clean data collected at runtime. Data
+collected during a period without extra interrupts is considered
+clean.
+
+To compensate for excessive amounts of wakeup during idle, additional
+idle time is injected when such a condition is detected. Currently,
+we have a simple algorithm to double the injection ratio. A possible
+enhancement might be to throttle the offending IRQ, such as delaying
+EOI for level triggered interrupts. But it is a challenge to be
+non-intrusive to the scheduler or the IRQ core code.
+
+
+CPU Online/Offline
+------------------
+Per-CPU kernel threads are started/stopped upon receiving
+notifications of CPU hotplug activities. The intel_powerclamp driver
+keeps track of clamping kernel threads, even after they are migrated
+to other CPUs, after a CPU offline event.
+
+
+Performance Analysis
+====================
+This section describes the general performance data collected on
+multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).
+
+Effectiveness and Limitations
+-----------------------------
+The maximum idle injection ratio allowed is capped at 50
+percent. As mentioned earlier, since interrupts are allowed during
+forced idle time, excessive interrupts could result in less
+effectiveness. The extreme case would be doing a ping -f to generate a
+flood of network interrupts without much CPU acknowledgement. In this
+case, little can be done from the idle injection threads. In most
+normal cases, such as using scp on a large file, applications can be throttled
+by the powerclamp driver, since slowing down the CPU also slows down
+network protocol processing, which in turn reduces interrupts.
+
+When control parameters change at runtime by the controlling CPU, it
+may take an additional period for the rest of the CPUs to catch up
+with the changes. During this time, idle injection is out of sync,
+thus not able to enter package C-states at the expected ratio. But
+this effect is minor, in that in most cases the target ratio is
+updated much less frequently than the idle injection frequency.
+
+Scalability
+-----------
+Tests also show a minor, but measurable, difference between the 4P/8P
+Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
+More compensation is needed on Westmere for the same amount of
+target idle ratio. The compensation also increases as the idle ratio
+gets larger. This is the reason the calibration code is needed.
+
+On the IVB 8P system, compared to an offline CPU, powerclamp can
+achieve up to 40% better performance per watt. (measured by a spin
+counter summed over per CPU counting threads spawned for all running
+CPUs).
+
+Usage and Interfaces
+====================
+The powerclamp driver is registered to the generic thermal layer as a
+cooling device. Currently, it’s not bound to any thermal zones::
+
+ jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
+ cur_state:0
+ max_state:50
+ type:intel_powerclamp
+
+cur_state allows the user to set the desired idle percentage. Writing 0 to
+cur_state will stop idle injection. Writing a value between 1 and
+max_state will start the idle injection. Reading cur_state returns the
+actual and current idle percentage. This may not be the same value
+set by the user, in that the current idle percentage depends on workload
+and includes natural idle. When idle injection is disabled, reading
+cur_state returns -1 instead of 0, which avoids confusing the
+100% busy state with the disabled state.
+
+Example usage:
+
+- To inject 25% idle time::
+
+ $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"
+
+If the system is not busy and has more than 25% idle time already,
+then the powerclamp driver will not start idle injection. Running top
+will not show the idle injection kernel threads.
+
+If the system is busy (spin test below) and has less than 25% natural
+idle time, powerclamp kernel threads will do idle injection. Forced
+idle time is accounted as normal idle, in that the common idle task
+code path is taken.
+
+In this example, 24.1% idle is shown. This helps the system admin or
+user determine the cause of the slowdown when the powerclamp driver is in action::
+
+
+ Tasks: 197 total, 1 running, 196 sleeping, 0 stopped, 0 zombie
+ Cpu(s): 71.2%us, 4.7%sy, 0.0%ni, 24.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
+ Mem: 3943228k total, 1689632k used, 2253596k free, 74960k buffers
+ Swap: 4087804k total, 0k used, 4087804k free, 945336k cached
+
+ PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
+ 3352 jacob 20 0 262m 644 428 S 286 0.0 0:17.16 spin
+ 3341 root -51 0 0 0 0 D 25 0.0 0:01.62 kidle_inject/0
+ 3344 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/3
+ 3342 root -51 0 0 0 0 D 25 0.0 0:01.61 kidle_inject/1
+ 3343 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/2
+ 2935 jacob 20 0 696m 125m 35m S 5 3.3 0:31.11 firefox
+ 1546 root 20 0 158m 20m 6640 S 3 0.5 0:26.97 Xorg
+ 2100 jacob 20 0 1223m 88m 30m S 3 2.3 0:23.68 compiz
+
+Tests have shown that by using the powerclamp driver as a cooling
+device, a PID based userspace thermal controller can manage to
+control CPU temperature effectively, when no other thermal influence
+is added. For example, an Ultrabook user can compile the kernel while
+staying under a certain temperature (below most active trip points).
+
+Module Parameters
+=================
+
+``cpumask`` (RW)
+ A bit mask of CPUs to inject idle. The format of the bitmask is the same as
+ used in other subsystems like in /proc/irq/\*/smp_affinity. The mask consists
+ of comma separated 32 bit groups. Each CPU is one bit. For example, for a 256
+ CPU system the full mask is:
+ ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
+
+ The rightmost group is for CPUs 0-31.
+
+``max_idle`` (RW)
+ Maximum injected idle time to the total CPU time ratio in percent range
+ from 1 to 100. Even if the cooling device max_state is always 100 (100%),
+ this parameter allows adding a maximum idle percent limit. The default is 50,
+ matching the current implementation of the powerclamp driver. Values above
+ 75 are not allowed if the cpumask includes every CPU present in the system.
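+
+For example, a sketch of restricting idle injection to CPUs 0-3 and capping it
+at 40 percent through the module parameter files (assuming the usual
+/sys/module/<module>/parameters/ location and that runtime writes are
+accepted)::
+
+    # only inject idle on CPUs 0-3
+    echo f > /sys/module/intel_powerclamp/parameters/cpumask
+    # never inject more than 40 percent idle time
+    echo 40 > /sys/module/intel_powerclamp/parameters/max_idle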
diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst
index 10c4f0ce2ad0..2ed79f41a411 100644
--- a/Documentation/admin-guide/thunderbolt.rst
+++ b/Documentation/admin-guide/thunderbolt.rst
@@ -47,6 +47,9 @@ be DMA masters and thus read contents of the host memory without CPU and OS
knowing about it. There are ways to prevent this by setting up an IOMMU but
it is not always available for various reasons.
+Some USB4 systems have a BIOS setting to disable PCIe tunneling. This is
+treated as another security level (nopcie).
+
The security levels are as follows:
none
@@ -77,6 +80,10 @@ The security levels are as follows:
Display Port in a dock. All PCIe links downstream of the dock are
removed.
+ nopcie
+ PCIe tunneling is disabled/forbidden from the BIOS. Available in some
+ USB4 systems.
+
The current security level can be read from
``/sys/bus/thunderbolt/devices/domainX/security`` where ``domainX`` is
the Thunderbolt domain the host controller manages. There is typically
@@ -153,6 +160,22 @@ If the user still wants to connect the device they can either approve
the device without a key or write a new key and write 1 to the
``authorized`` file to get the new key stored on the device NVM.
+De-authorizing devices
+----------------------
+It is possible to de-authorize devices by writing ``0`` to their
+``authorized`` attribute. This requires support from the connection
+manager implementation and can be checked by reading domain
+``deauthorization`` attribute. If it reads ``1`` then the feature is
+supported.
+
+When a device is de-authorized the PCIe tunnel from the parent device
+PCIe downstream (or root) port to the device PCIe upstream port is torn
+down. This is essentially the same thing as PCIe hot-remove and the PCIe
+topology in question will not be accessible anymore until the device is
+authorized again. If there is storage such as NVMe or similar involved,
+there is a risk of data loss if the filesystem on that storage is not
+properly shut down. You have been warned!
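+
+For example, assuming the connection manager supports it and the device in
+question shows up as ``0-1`` (a hypothetical name), de-authorization could
+look like this::
+
+    # check whether de-authorization is supported at all
+    cat /sys/bus/thunderbolt/devices/domain0/deauthorization
+    # tear down the PCIe tunnel to the device
+    echo 0 > /sys/bus/thunderbolt/devices/0-1/authorized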
+
DMA protection utilizing IOMMU
------------------------------
Recent systems from 2018 and forward with Thunderbolt ports may natively
@@ -173,8 +196,8 @@ following ``udev`` rule::
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTRS{iommu_dma_protection}=="1", ATTR{authorized}=="0", ATTR{authorized}="1"
-Upgrading NVM on Thunderbolt device or host
--------------------------------------------
+Upgrading NVM on Thunderbolt device, host or retimer
+----------------------------------------------------
Since most of the functionality is handled in firmware running on a
host controller or a device, it is important that the firmware can be
upgraded to the latest where possible bugs in it have been fixed.
@@ -185,9 +208,10 @@ for some machines:
`Thunderbolt Updates <https://thunderbolttechnology.net/updates>`_
-Before you upgrade firmware on a device or host, please make sure it is a
-suitable upgrade. Failing to do that may render the device (or host) in a
-state where it cannot be used properly anymore without special tools!
+Before you upgrade firmware on a device, host or retimer, please make
+sure it is a suitable upgrade. Failing to do that may render the device
+in a state where it cannot be used properly anymore without special
+tools!
Host NVM upgrade on Apple Macs is not supported.
@@ -232,6 +256,35 @@ Note names of the NVMem devices ``nvm_activeN`` and ``nvm_non_activeN``
depend on the order they are registered in the NVMem subsystem. N in
the name is the identifier added by the NVMem subsystem.
+Upgrading on-board retimer NVM when there is no cable connected
+---------------------------------------------------------------
+If the platform supports it, it may be possible to upgrade the retimer NVM
+firmware even when there is nothing connected to the USB4
+ports. When this is the case the ``usb4_portX`` devices have two special
+attributes: ``offline`` and ``rescan``. The way to upgrade the firmware
+is to first put the USB4 port into offline mode::
+
+ # echo 1 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/offline
+
+This step makes sure the port does not respond to any hotplug events,
+and also ensures the retimers are powered on. The next step is to scan
+for the retimers::
+
+ # echo 1 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/rescan
+
+This enumerates and adds the on-board retimers. Now retimer NVM can be
+upgraded in the same way as with a cable connected (see the previous
+section). However, the retimer is not disconnected (as we are in offline
+mode), so after writing ``1`` to ``nvm_authenticate`` one should wait for
+5 or more seconds before running rescan again::
+
+ # echo 1 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/rescan
+
+At this point, if everything went fine, the port can be put back to
+functional state again::
+
+ # echo 0 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/offline
+
Upgrading NVM when host controller is in safe mode
--------------------------------------------------
If the existing NVM is not properly authenticated (or is missing) the
diff --git a/Documentation/admin-guide/unicode.rst b/Documentation/admin-guide/unicode.rst
index 290fe83ebe82..cba7e5017d36 100644
--- a/Documentation/admin-guide/unicode.rst
+++ b/Documentation/admin-guide/unicode.rst
@@ -3,11 +3,10 @@ Unicode support
Last update: 2005-01-17, version 1.4
-This file is maintained by H. Peter Anvin <unicode@lanana.org> as part
-of the Linux Assigned Names And Numbers Authority (LANANA) project.
-The current version can be found at:
-
- http://www.lanana.org/docs/unicode/admin-guide/unicode.rst
+Note: The original version of this document, which was maintained at
+lanana.org as part of the Linux Assigned Names And Numbers Authority
+(LANANA) project, no longer exists. So, this version in the
+mainline Linux kernel is now the maintained main document.
Introduction
------------
diff --git a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst
new file mode 100644
index 000000000000..d3504826f401
--- /dev/null
+++ b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst
@@ -0,0 +1,1985 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+.. [see the bottom of this file for redistribution information]
+
+=========================================
+How to verify bugs and bisect regressions
+=========================================
+
+This document describes how to check if some Linux kernel problem occurs in code
+currently supported by developers -- to then explain how to locate the change
+causing the issue, if it is a regression (e.g. did not happen with earlier
+versions).
+
+The text aims at people running kernels from mainstream Linux distributions on
+commodity hardware who want to report a kernel bug to the upstream Linux
+developers. Despite this intent, the instructions work just as well for users
+who are already familiar with building their own kernels: they help avoid
+mistakes occasionally made even by experienced developers.
+
+..
+ Note: if you see this note, you are reading the text's source file. You
+ might want to switch to a rendered version: it makes it a lot easier to
+ read and navigate this document -- especially when you want to look something
+ up in the reference section, then jump back to where you left off.
+..
+ Find the latest rendered version of this text here:
+ https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.rst.html
+
+The essence of the process (aka 'TL;DR')
+========================================
+
+*[If you are new to building or bisecting Linux, ignore this section and head
+over to the* ":ref:`step-by-step guide<introguide_bissbs>`" *below. It utilizes
+the same commands as this section while describing them in brief fashion. The
+steps are nevertheless easy to follow and together with accompanying entries
+in a reference section mention many alternatives, pitfalls, and additional
+aspects, all of which might be essential in your present case.]*
+
+**In case you want to check if a bug is present in code currently supported by
+developers**, execute just the *preparations* and *segment 1*; while doing so,
+consider the newest Linux kernel you regularly use to be the 'working' kernel.
+In the following example that's assumed to be 6.0.13, which is why the sources
+of 6.0 will be used to prepare the .config file.
+
+**In case you face a regression**, follow the steps at least till the end of
+*segment 2*. Then you can submit a preliminary report -- or continue with
+*segment 3*, which describes how to perform a bisection needed for a
+full-fledged regression report. In the following example 6.0.13 is assumed to be
+the 'working' kernel and 6.1.5 to be the first 'broken', which is why 6.0
+will be considered the 'good' release and used to prepare the .config file.
+
+* **Preparations**: set up everything to build your own kernels::
+
+ # * Remove any software that depends on externally maintained kernel modules
+ # or builds any automatically during bootup.
+ # * Ensure Secure Boot permits booting self-compiled Linux kernels.
+ # * If you are not already running the 'working' kernel, reboot into it.
+ # * Install compilers and everything else needed for building Linux.
+ # * Ensure you have 15 Gigabytes of free space in your home directory.
+ git clone -o mainline --no-checkout \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git ~/linux/
+ cd ~/linux/
+ git remote add -t master stable \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+ git checkout --detach v6.0
+ # * Hint: if you used an existing clone, ensure no stale .config is around.
+ make olddefconfig
+ # * Ensure the former command picked the .config of the 'working' kernel.
+ # * Connect external hardware (USB keys, tokens, ...), start a VM, bring up
+ # VPNs, mount network shares, and briefly try the feature that is broken.
+ yes '' | make localmodconfig
+ ./scripts/config --set-str CONFIG_LOCALVERSION '-local'
+ ./scripts/config -e CONFIG_LOCALVERSION_AUTO
+ # * Note, when short on storage space, check the guide for an alternative:
+ ./scripts/config -d DEBUG_INFO_NONE -e KALLSYMS_ALL -e DEBUG_KERNEL \
+ -e DEBUG_INFO -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT -e KALLSYMS
+ # * Hint: at this point you might want to adjust the build configuration;
+ # you'll have to, if you are running Debian.
+ make olddefconfig
+ cp .config ~/kernel-config-working
+
+* **Segment 1**: build a kernel from the latest mainline codebase.
+
+ This among others checks if the problem was fixed already and which developers
+ later need to be told about the problem; in case of a regression, this rules
+ out a .config change as root of the problem.
+
+ a) Checking out latest mainline code::
+
+ cd ~/linux/
+ git checkout --force --detach mainline/master
+
+ b) Build, install, and boot a kernel::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig
+ make -j $(nproc --all)
+ # * Make sure there is enough disk space to hold another kernel:
+ df -h /boot/ /lib/modules/
+ # * Note: on Arch Linux, its derivatives and a few other distributions
+ # the following commands will do nothing at all or only part of the
+ # job. See the step-by-step guide for further details.
+ sudo make modules_install
+ command -v installkernel && sudo make install
+ # * Check how much space your self-built kernel actually needs, which
+ # enables you to make better estimates later:
+ du -ch /boot/*$(make -s kernelrelease)* | tail -n 1
+ du -sh /lib/modules/$(make -s kernelrelease)/
+ # * Hint: the output of the following command will help you pick the
+ # right kernel from the boot menu:
+ make -s kernelrelease | tee -a ~/kernels-built
+ reboot
+ # * Once booted, ensure you are running the kernel you just built by
+ # checking if the output of the next two commands matches:
+ tail -n 1 ~/kernels-built
+ uname -r
+ cat /proc/sys/kernel/tainted
+
+ c) Check if the problem occurs with this kernel as well.
+
+* **Segment 2**: ensure the 'good' kernel is also a 'working' kernel.
+
+ This among others verifies the trimmed .config file actually works well, as
+ bisecting with it otherwise would be a waste of time:
+
+ a) Start by checking out the sources of the 'good' version::
+
+ cd ~/linux/
+ git checkout --force --detach v6.0
+
+ b) Build, install, and boot a kernel as described earlier in *segment 1,
+ section b* -- just feel free to skip the 'du' commands, as you have a rough
+ estimate already.
+
+ c) Ensure the feature that regressed with the 'broken' kernel actually works
+ with this one.
+
+* **Segment 3**: perform and validate the bisection.
+
+ a) In case your 'broken' version is a stable/longterm release, add the Git
+ branch holding it::
+
+ git remote set-branches --add stable linux-6.1.y
+ git fetch stable
+
+ b) Initialize the bisection::
+
+ cd ~/linux/
+ git bisect start
+ git bisect good v6.0
+ git bisect bad v6.1.5
+
+ c) Build, install, and boot a kernel as described earlier in *segment 1,
+ section b*.
+
+ In case building or booting the kernel fails for unrelated reasons, run
+ ``git bisect skip``. In all other outcomes, check if the regressed feature
+ works with the newly built kernel. If it does, tell Git by executing
+ ``git bisect good``; if it does not, run ``git bisect bad`` instead.
+
+ All three commands will make Git checkout another commit; then re-execute
+ this step (e.g. build, install, boot, and test a kernel to then tell Git
+ the outcome). Do so again and again until Git shows which commit broke
+ things. If you run short of disk space during this process, check the
+ "Supplementary tasks" section below.
+
+ d) Once you have finished the bisection, put a few things away::
+
+ cd ~/linux/
+ git bisect log > ~/bisect-log
+ cp .config ~/bisection-config-culprit
+ git bisect reset
+
+ e) Try to verify the bisection result::
+
+ git checkout --force --detach mainline/master
+ git revert --no-edit cafec0cacaca0
+
+ This is optional, as some commits are impossible to revert. But if the
+ second command worked flawlessly, build, install, and boot one more
+ kernel, which should not show the regression.
+
+* **Supplementary tasks**: cleanup during and after the process.
+
+ a) To avoid running out of disk space during a bisection, you might need to
+ remove some kernels you built earlier. You most likely want to keep those
+ you built during segment 1 and 2 around for a while, but you will most
+ likely no longer need kernels tested during the actual bisection
+ (Segment 3 c). You can list them in build order using::
+
+ ls -ltr /lib/modules/*-local*
+
+ To then for example erase a kernel that identifies itself as
+ '6.0-rc1-local-gcafec0cacaca0', use this::
+
+ sudo rm -rf /lib/modules/6.0-rc1-local-gcafec0cacaca0
+ sudo kernel-install -v remove 6.0-rc1-local-gcafec0cacaca0
+ # * Note, on some distributions kernel-install is missing
+ # or does only part of the job.
+
+ b) If you performed a bisection and successfully validated the result, feel
+ free to remove all kernels built during the actual bisection (Segment 3 c);
+ the kernels you built earlier and later you might want to keep around for
+ a week or two.
+
+.. _introguide_bissbs:
+
+Step-by-step guide on how to verify bugs and bisect regressions
+===============================================================
+
+This guide describes how to set up your own Linux kernels for investigating bugs
+or regressions you intend to report. How far you want to follow the instructions
+depends on your issue:
+
+Execute all steps till the end of *segment 1* to **verify if your kernel problem
+is present in code supported by Linux kernel developers**. If it is, you are all
+set to report the bug -- unless it did not happen with earlier kernel versions,
+as then you want to at least continue with *segment 2* to **check if the issue
+qualifies as a regression**, which receives priority treatment. Depending on the
+outcome you then are ready to report a bug or submit a preliminary regression
+report; instead of the latter you could also head straight on and follow
+*segment 3* to **perform a bisection** for a full-fledged regression report
+developers are obliged to act upon.
+
+ :ref:`Preparations: set up everything to build your own kernels.<introprep_bissbs>`
+
+ :ref:`Segment 1: try to reproduce the problem with the latest codebase.<introlatestcheck_bissbs>`
+
+ :ref:`Segment 2: check if the kernels you build work fine.<introworkingcheck_bissbs>`
+
+ :ref:`Segment 3: perform a bisection and validate the result.<introbisect_bissbs>`
+
+ :ref:`Supplementary tasks: cleanup during and after following this guide.<introclosure_bissbs>`
+
+The steps in each segment illustrate the important aspects of the process, while
+a comprehensive reference section holds additional details for almost all of the
+steps. The reference section sometimes also outlines alternative approaches,
+pitfalls, as well as problems that might occur at the particular step -- and how
+to get things rolling again.
+
+For further details on how to report Linux kernel issues or regressions check
+out Documentation/admin-guide/reporting-issues.rst, which works in conjunction
+with this document. It among others explains why you need to verify bugs with
+the latest 'mainline' kernel, even if you face a problem with a kernel from a
+'stable/longterm' series; for users facing a regression it also explains that
+sending a preliminary report after finishing segment 2 might be wise, as the
+regression and its culprit might be known already. For further details on
+what actually qualifies as a regression check out
+Documentation/admin-guide/reporting-regressions.rst.
+
+.. _introprep_bissbs:
+
+Preparations: set up everything to build your own kernels
+---------------------------------------------------------
+
+.. _backup_bissbs:
+
+* Create a fresh backup and put system repair and restore tools at hand, just
+ to be prepared for the unlikely case of something going sideways.
+
+ [:ref:`details<backup_bisref>`]
+
+.. _vanilla_bissbs:
+
+* Remove all software that depends on externally developed kernel drivers or
+ builds them automatically. That includes but is not limited to DKMS, openZFS,
+ VirtualBox, and Nvidia's graphics drivers (including the GPLed kernel module).
+
+ [:ref:`details<vanilla_bisref>`]
+
+.. _secureboot_bissbs:
+
+* On platforms with 'Secure Boot' or similar solutions, prepare everything to
+ ensure the system will permit your self-compiled kernel to boot. The
+ quickest and easiest way to achieve this on commodity x86 systems is to
+ disable such techniques in the BIOS setup utility; alternatively, remove
+ their restrictions through a process initiated by
+ ``mokutil --disable-validation``.
+
+ [:ref:`details<secureboot_bisref>`]
+
+.. _rangecheck_bissbs:
+
+* Determine the kernel versions considered 'good' and 'bad' throughout this
+ guide.
+
+ Do you follow this guide to verify if a bug is present in the code developers
+ care for? Then consider the mainline release your 'working' kernel (the newest
+ one you regularly use) is based on to be the 'good' version; if your 'working'
+ kernel for example is 6.0.11, then your 'good' kernel is 6.0.
+
+ In case you face a regression, it depends on the version range where the
+ regression was introduced:
+
+ * Something which used to work in Linux 6.0 broke when switching to Linux
+ 6.1-rc1? Then henceforth regard 6.0 as the last known 'good' version
+ and 6.1-rc1 as the first 'bad' one.
+
+ * Some function stopped working when updating from 6.0.11 to 6.1.4? Then for
+ the time being consider 6.0 as the last 'good' version and 6.1.4 as
+ the 'bad' one. Note, at this point it is merely assumed that 6.0 is fine;
+ this assumption will be checked in segment 2.
+
+ * A feature you used in 6.0.11 does not work at all or worse in 6.1.13? In
+ that case you want to bisect within a stable/longterm series: consider
+ 6.0.11 as the last known 'good' version and 6.0.13 as the first 'bad'
+ one. Note, in this case you still want to compile and test a mainline kernel
+ as explained in segment 1: the outcome will determine if you need to report
+ your issue to the regular developers or the stable team.
+
+ *Note, do not confuse 'good' version with 'working' kernel; the latter term
+ throughout this guide will refer to the last kernel that has been working
+ fine.*
+
+ [:ref:`details<rangecheck_bisref>`]
+
+.. _bootworking_bissbs:
+
+* Boot into the 'working' kernel and briefly use the apparently broken feature.
+
+ [:ref:`details<bootworking_bisref>`]
+
+.. _diskspace_bissbs:
+
+* Ensure you have enough free space for building Linux. 15 Gigabyte in your home
+ directory should typically suffice. If you have less available, be sure to pay
+ attention to later steps about retrieving the Linux sources and handling of
+ debug symbols: both explain approaches reducing the amount of space, which
+ should allow you to master these tasks with about 4 Gigabytes free space.
+
+ [:ref:`details<diskspace_bisref>`]
+
+.. _buildrequires_bissbs:
+
+* Install all software required to build a Linux kernel. Often you will need:
+ 'bc', 'binutils' ('ld' et al.), 'bison', 'flex', 'gcc', 'git', 'openssl',
+ 'pahole', 'perl', and the development headers for 'libelf' and 'openssl'. The
+ reference section shows how to quickly install those on various popular Linux
+ distributions.
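+
+ On Debian or Ubuntu based systems, an installation command along these lines
+ is a reasonable starting point (the package names are assumptions and differ
+ between distributions; pahole for instance ships in the 'dwarves' package)::
+
+   sudo apt install bc binutils bison dwarves flex gcc git make openssl \
+     perl libelf-dev libssl-dev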
+
+ [:ref:`details<buildrequires_bisref>`]
+
+.. _sources_bissbs:
+
+* Retrieve the mainline Linux sources; then change into the directory holding
+ them, as all further commands in this guide are meant to be executed from
+ there.
+
+ *Note, the following describes how to retrieve the sources using a full
+ mainline clone, which downloads about 2.75 GByte as of early 2024. The*
+ :ref:`reference section describes two alternatives <sources_bisref>` *:
+ one downloads less than 500 MByte, the other works better with unreliable
+ internet connections.*
+
+ Execute the following command to retrieve a fresh mainline codebase while
+ preparing things to add branches for stable/longterm series later::
+
+ git clone -o mainline --no-checkout \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git ~/linux/
+ cd ~/linux/
+ git remote add -t master stable \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+
+ [:ref:`details<sources_bisref>`]
+
+.. _oldconfig_bissbs:
+
+* Start preparing a kernel build configuration (the '.config' file).
+
+ Before doing so, ensure you are still running the 'working' kernel an earlier
+ step told you to boot; if you are unsure, check the current kernel release
+ identifier using ``uname -r``.
+
+ Afterwards check out the source code for the version earlier established as
+ 'good'. In the following example command this is assumed to be 6.0; note that
+ the version number in this and all later Git commands needs to be prefixed
+ with a 'v'::
+
+ git checkout --detach v6.0
+
+ Now create a build configuration file::
+
+ make olddefconfig
+
+ The kernel build scripts then will try to locate the build configuration file
+ for the running kernel and then adjust it for the needs of the kernel sources
+ you checked out. While doing so, it will print a few lines you need to check.
+
+ Look out for a line starting with '# using defaults found in'. It should be
+ followed by a path to a file in '/boot/' that contains the release identifier
+ of your currently working kernel. If the line instead continues with something
+ like 'arch/x86/configs/x86_64_defconfig', then the build infra failed to find
+ the .config file for your running kernel -- in which case you have to put one
+ there manually, as explained in the reference section.
+
+ In case you can not find such a line, look for one containing '# configuration
+ written to .config'. If that's the case you have a stale build configuration
+ lying around. Unless you intend to use it, delete it; afterwards run
+ 'make olddefconfig' again and check if it now picked up the right config file
+ as base.
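+
+ A successful run might, for instance, print a line like the following (the
+ file name here is purely hypothetical and only illustrates what to look
+ for)::
+
+   # using defaults found in /boot/config-6.0.13-generic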
+
+ [:ref:`details<oldconfig_bisref>`]
+
+.. _localmodconfig_bissbs:
+
+* Disable any kernel modules apparently superfluous for your setup. This is
+ optional, but especially wise for bisections, as it speeds up the build
+ process enormously -- at least unless the .config file picked up in the
+ previous step was already tailored to your and your hardware needs, in which
+ case you should skip this step.
+
+ To prepare the trimming, connect external hardware you occasionally use (USB
+ keys, tokens, ...), quickly start a VM, and bring up VPNs. And if you rebooted
+ since you started this guide, ensure that you tried using the feature causing
+ trouble since you started the system. Only then trim your .config::
+
+ yes '' | make localmodconfig
+
+ There is a catch to this, as the 'apparently' in the initial sentence of this step
+ and the preparation instructions already hinted at:
+
+ The 'localmodconfig' target easily disables kernel modules for features only
+ used occasionally -- like modules for external peripherals not yet connected
+ since booting, virtualization software not yet utilized, VPN tunnels, and a
+ few other things. That's because some tasks rely on kernel modules Linux only
+ loads when you execute tasks like the aforementioned ones for the first time.
+
+ This drawback of localmodconfig is nothing you should lose sleep over, but
+ something to keep in mind: if something is misbehaving with the kernels built
+ during this guide, this is most likely the reason. You can reduce or nearly
+ eliminate the risk with tricks outlined in the reference section; but when
+ building a kernel just for quick testing purposes this is usually not worth
+ spending much effort on, as long as it boots and allows you to properly test the
+ feature that causes trouble.
+
+ [:ref:`details<localmodconfig_bisref>`]
+
+.. _tagging_bissbs:
+
+* Ensure all the kernels you will build are clearly identifiable using a special
+ tag and a unique version number::
+
+ ./scripts/config --set-str CONFIG_LOCALVERSION '-local'
+ ./scripts/config -e CONFIG_LOCALVERSION_AUTO
+
+ [:ref:`details<tagging_bisref>`]
+
+.. _debugsymbols_bissbs:
+
+* Decide how to handle debug symbols.
+
+ In the context of this document it is often wise to enable them, as there is a
+ decent chance you will need to decode a stack trace from a 'panic', 'Oops',
+ 'warning', or 'BUG'::
+
+ ./scripts/config -d DEBUG_INFO_NONE -e KALLSYMS_ALL -e DEBUG_KERNEL \
+ -e DEBUG_INFO -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT -e KALLSYMS
+
+ But if you are extremely short on storage space, you might want to disable
+ debug symbols instead::
+
+ ./scripts/config -d DEBUG_INFO -d DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT \
+ -d DEBUG_INFO_DWARF4 -d DEBUG_INFO_DWARF5 -e CONFIG_DEBUG_INFO_NONE
+
+ [:ref:`details<debugsymbols_bisref>`]
+
+.. _configmods_bissbs:
+
+* Check if you may want or need to adjust some other kernel configuration
+ options:
+
+ * Are you running Debian? Then you want to avoid known problems by performing
+ additional adjustments explained in the reference section.
+
+ [:ref:`details<configmods_distros_bisref>`].
+
+ * If you want to influence other aspects of the configuration, do so now using
+ your preferred tool. Note, to use make targets like 'menuconfig' or
+ 'nconfig', you will need to install the development files of ncurses; for
+ 'xconfig' you likewise need the Qt5 or Qt6 headers.
+
+ [:ref:`details<configmods_individual_bisref>`].
+
+.. _saveconfig_bissbs:
+
+* Reprocess the .config after the latest adjustments and store it in a safe
+ place::
+
+ make olddefconfig
+ cp .config ~/kernel-config-working
+
+ [:ref:`details<saveconfig_bisref>`]
+
+.. _introlatestcheck_bissbs:
+
+Segment 1: try to reproduce the problem with the latest codebase
+----------------------------------------------------------------
+
+The following steps verify whether the problem occurs with the code currently
+supported by developers. In case you face a regression, they also check that the
+problem is not caused by some .config change, as reporting the issue then would
+be a waste of time. [:ref:`details<introlatestcheck_bisref>`]
+
+.. _checkoutmaster_bissbs:
+
+* Check out the latest Linux codebase::
+
+ cd ~/linux/
+ git checkout --force --detach mainline/master
+
+ [:ref:`details<checkoutmaster_bisref>`]
+
+.. _build_bissbs:
+
+* Build the image and the modules of your first kernel using the config file you
+ prepared::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig
+ make -j $(nproc --all)
+
+ If you want your kernel packaged up as deb, rpm, or tar file, see the
+ reference section for alternatives, which obviously will require other
+ steps to install as well.
+
+ [:ref:`details<build_bisref>`]
+
+.. _install_bissbs:
+
+* Install your newly built kernel.
+
+ Before doing so, consider checking if there is still enough space for it::
+
+ df -h /boot/ /lib/modules/
+
+ For now assume that 150 MByte of free space in /boot/ and 200 MByte in
+ /lib/modules/ will suffice; how much your kernels actually require will be
+ determined later during this guide.
+
+ Now install the kernel's modules and its image, which will be stored in
+ parallel to your Linux distribution's kernels::
+
+ sudo make modules_install
+ command -v installkernel && sudo make install
+
+ The second command ideally will take care of three steps required at this
+ point: copying the kernel's image to /boot/, generating an initramfs, and
+ adding an entry for both to the boot loader's configuration.
+
+ Sadly some distributions (among them Arch Linux, its derivatives, and many
+ immutable Linux distributions) will perform none or only some of those tasks.
+ You therefore want to check if all of them were taken care of and manually
+ perform those that were not. The reference section provides further details on
+ that; your distribution's documentation might help, too.
+
+ Once you have figured out the steps needed at this point, consider writing
+ them down: if you build more kernels as described in segments 2 and 3, you
+ will have to perform those steps again after executing
+ ``command -v installkernel [...]``.
+
+ [:ref:`details<install_bisref>`]
+
+.. _storagespace_bissbs:
+
+* In case you plan to follow this guide further, check how much storage space
+ the kernel, its modules, and other related files like the initramfs consume::
+
+ du -ch /boot/*$(make -s kernelrelease)* | tail -n 1
+ du -sh /lib/modules/$(make -s kernelrelease)/
+
+ Write down or remember those two values for later: they enable you to prevent
+ running out of disk space accidentally during a bisection.
+
+ [:ref:`details<storagespace_bisref>`]
+
+.. _kernelrelease_bissbs:
+
+* Show and store the kernelrelease identifier of the kernel you just built::
+
+ make -s kernelrelease | tee -a ~/kernels-built
+
+ Remember the identifier momentarily, as it will help you pick the right kernel
+ from the boot menu upon restarting.
+
+* Reboot into your newly built kernel. To ensure you actually started the one
+ you just built, you might want to verify if the output of these commands
+ matches::
+
+ tail -n 1 ~/kernels-built
+ uname -r
+
+.. _tainted_bissbs:
+
+* Check if the kernel marked itself as 'tainted'::
+
+ cat /proc/sys/kernel/tainted
+
+ If that command does not return '0', check the reference section, as the cause
+ for this might interfere with your testing.
+
+ [:ref:`details<tainted_bisref>`]
+
+.. _recheckbroken_bissbs:
+
+* Verify if your bug occurs with the newly built kernel. If it does not, check
+ out the instructions in the reference section to ensure nothing went sideways
+ during your tests.
+
+ [:ref:`details<recheckbroken_bisref>`]
+
+.. _recheckstablebroken_bissbs:
+
+* Are you facing a problem within a stable/longterm series, but failed to
+ reproduce it with the mainline kernel you just built? One that according to
+ the `front page of kernel.org <https://kernel.org/>`_ is still supported? Then
+ check if the latest codebase for the particular series might already fix the
+ problem. To do so, add the stable series Git branch for your 'good' kernel
+ (again, this here is assumed to be 6.0) and check out the latest version::
+
+ cd ~/linux/
+ git remote set-branches --add stable linux-6.0.y
+ git fetch stable
+ git checkout --force --detach linux-6.0.y
+
+ Now use the checked out code to build and install another kernel using the
+ commands the earlier steps already described in more detail::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig
+ make -j $(nproc --all)
+ # * Check if the free space suffices for holding another kernel:
+ df -h /boot/ /lib/modules/
+ sudo make modules_install
+ command -v installkernel && sudo make install
+ make -s kernelrelease | tee -a ~/kernels-built
+ reboot
+
+ Confirm you booted the kernel you intended to start and check its tainted
+ status::
+
+ tail -n 1 ~/kernels-built
+ uname -r
+ cat /proc/sys/kernel/tainted
+
+ Now verify if this kernel is showing the problem.
+
+ [:ref:`details<recheckstablebroken_bisref>`]
+
+Do you follow this guide to verify if a problem is present in the code
+currently supported by Linux kernel developers? Then you are done at this
+point. If you later want to remove the kernel you just built, check out
+:ref:`Supplementary tasks: cleanup during and after the bisection<introclosure_bissbs>`.
+
+In case you face a regression, move on and execute at least the next segment
+as well.
+
+.. _introworkingcheck_bissbs:
+
+Segment 2: check if the kernels you build work fine
+---------------------------------------------------
+
+In case of a regression, you now want to ensure the trimmed configuration file
+you created earlier works as expected; a bisection with the .config file
+otherwise would be a waste of time. [:ref:`details<introworkingcheck_bisref>`]
+
+.. _recheckworking_bissbs:
+
+* Build your own variant of the 'working' kernel and check if the feature that
+ regressed works as expected with it.
+
+ Start by checking out the sources for the version earlier established as
+ 'good' (once again assumed to be 6.0 here)::
+
+ cd ~/linux/
+ git checkout --detach v6.0
+
+ Now use the checked out code to configure, build, and install another kernel
+ using the commands the previous subsection explained in more detail::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig
+ make -j $(nproc --all)
+ # * Check if the free space suffices for holding another kernel:
+ df -h /boot/ /lib/modules/
+ sudo make modules_install
+ command -v installkernel && sudo make install
+ make -s kernelrelease | tee -a ~/kernels-built
+ reboot
+
+ Once the system has booted, you may want to verify once again that the
+ kernel you started is the one you just built::
+
+ tail -n 1 ~/kernels-built
+ uname -r
+
+ Now check if this kernel works as expected; if not, consult the reference
+ section for further instructions.
+
+ [:ref:`details<recheckworking_bisref>`]
+
+.. _introbisect_bissbs:
+
+Segment 3: perform the bisection and validate the result
+--------------------------------------------------------
+
+With all the preparations and precaution builds taken care of, you are now ready
+to begin the bisection. This will make you build quite a few kernels -- usually
+about 15 in case you encountered a regression when updating to a newer series
+(say from 6.0.11 to 6.1.3). But do not worry: thanks to the trimmed build
+configuration created earlier, this works a lot faster than many people assume;
+on average it will often take just about 10 to 15 minutes to compile each
+kernel on commodity x86 machines.
+
+* In case your 'bad' version is a stable/longterm release (say 6.1.5), add its
+ stable branch, unless you already did so earlier::
+
+ cd ~/linux/
+ git remote set-branches --add stable linux-6.1.y
+ git fetch stable
+
+.. _bisectstart_bissbs:
+
+* Start the bisection and tell Git about the versions earlier established as
+ 'good' (6.0 in the following example command) and 'bad' (6.1.5)::
+
+ cd ~/linux/
+ git bisect start
+ git bisect good v6.0
+ git bisect bad v6.1.5
+
+ [:ref:`details<bisectstart_bisref>`]
+
+.. _bisectbuild_bissbs:
+
+* Now use the code Git checked out to build, install, and boot a kernel using
+ the commands introduced earlier::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig
+ make -j $(nproc --all)
+ # * Check if the free space suffices for holding another kernel:
+ df -h /boot/ /lib/modules/
+ sudo make modules_install
+ command -v installkernel && sudo make install
+ make -s kernelrelease | tee -a ~/kernels-built
+ reboot
+
+ If compilation fails for some reason, run ``git bisect skip`` and restart
+ executing the stack of commands from the beginning.
+
+ In case you skipped the 'test latest codebase' step in the guide, check its
+ description to understand why the 'df [...]' and 'make -s kernelrelease [...]'
+ commands are here.
+
+ Important note: the latter command from this point on will print release
+ identifiers that might look odd or wrong to you -- which they are not, as it's
+ totally normal to see release identifiers like '6.0-rc1-local-gcafec0cacaca0'
+ if you bisect between versions 6.1 and 6.2 for example.
+
+ [:ref:`details<bisectbuild_bisref>`]
+
+.. _bisecttest_bissbs:
+
+* Now check if the feature that regressed works in the kernel you just built.
+
+ You again might want to start by making sure the kernel you booted is the one
+ you just built::
+
+ cd ~/linux/
+ tail -n 1 ~/kernels-built
+ uname -r
+
+ Now verify if the feature that regressed works at this kernel bisection point.
+ If it does, run this::
+
+ git bisect good
+
+ If it does not, run this::
+
+ git bisect bad
+
+ Be sure about what you tell Git, as getting this wrong just once will send the
+ rest of the bisection totally off course.
+
+ While the bisection is ongoing, Git will use the information you provided to
+ find and check out another bisection point for you to test. While doing so, it
+ will print something like 'Bisecting: 675 revisions left to test after this
+ (roughly 10 steps)' to indicate how many further changes it expects to be
+ tested. Now build and install another kernel using the instructions from the
+ previous step; afterwards follow the instructions in this step again.
+
+ Repeat this again and again until you finish the bisection -- that's the case
+ when Git after tagging a change as 'good' or 'bad' prints something like
+ 'cafecaca0c0dacafecaca0c0dacafecaca0c0da is the first bad commit'; right
+ afterwards it will show some details about the culprit including the patch
+ description of the change. The latter might fill your terminal screen, so you
+ might need to scroll up to see the message mentioning the culprit;
+ alternatively, run ``git bisect log > ~/bisection-log``.
+
+ [:ref:`details<bisecttest_bisref>`]
+
+.. _bisectlog_bissbs:
+
+* Store Git's bisection log and the current .config file in a safe place before
+ telling Git to reset the sources to the state before the bisection::
+
+ cd ~/linux/
+ git bisect log > ~/bisection-log
+ cp .config ~/bisection-config-culprit
+ git bisect reset
+
+ [:ref:`details<bisectlog_bisref>`]
+
+.. _revert_bissbs:
+
+* Try reverting the culprit on top of latest mainline to see if this fixes your
+ regression.
+
+ This is optional, as it might be impossible or hard to realize. The former is
+ the case if the bisection determined a merge commit as the culprit; the
+ latter happens if other changes depend on the culprit. But if the revert
+ succeeds, it is worth building another kernel, as it validates the result of
+ a bisection, which can easily go sideways; it furthermore will let kernel
+ developers know whether they can resolve the regression with a quick revert.
+
+ Begin by checking out the latest codebase depending on the range you bisected:
+
+ * Did you face a regression within a stable/longterm series (say between
+ 6.0.11 and 6.0.13) that does not happen in mainline? Then check out the
+ latest codebase for the affected series like this::
+
+ git fetch stable
+ git checkout --force --detach linux-6.0.y
+
+ * In all other cases check out latest mainline::
+
+ git fetch mainline
+ git checkout --force --detach mainline/master
+
+ If you bisected a regression within a stable/longterm series that also
+ happens in mainline, there is one more thing to do: look up the mainline
+ commit-id. To do so, use a command like ``git show abcdcafecabcd`` to
+ view the patch description of the culprit. There will be a line near
+ the top which looks like 'commit cafec0cacaca0 upstream.' or
+ 'Upstream commit cafec0cacaca0'; use that commit-id in the next command
+ and not the one the bisection blamed.
+
+ Now try reverting the culprit by specifying its commit id::
+
+ git revert --no-edit cafec0cacaca0
+
+ If that fails, give up trying and move on to the next step. But if it works,
+ build a kernel again using the familiar command sequence::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig
+ make -j $(nproc --all)
+ # * Check if the free space suffices for holding another kernel:
+ df -h /boot/ /lib/modules/
+ sudo make modules_install
+ command -v installkernel && sudo make install
+ make -s kernelrelease | tee -a ~/kernels-built
+ reboot
+
+ Now check one last time if the feature that made you perform a bisection
+ works with that kernel.
+
+ [:ref:`details<revert_bisref>`]
+
+.. _introclosure_bissbs:
+
+Supplementary tasks: cleanup during and after the bisection
+-----------------------------------------------------------
+
+During and after following this guide you might want or need to remove some of
+the kernels you installed: the boot menu otherwise will become confusing or
+space might run out.
+
+.. _makeroom_bissbs:
+
+* To remove one of the kernels you installed, look up its 'kernelrelease'
+ identifier. This guide stores them in '~/kernels-built', but the following
+ command will print them as well::
+
+ ls -ltr /lib/modules/*-local*
+
+ In most situations you want to remove the oldest kernels built during the
+ actual bisection (e.g. in segment 3 of this guide). The two kernels you
+ created beforehand (to test the latest codebase and the version considered
+ 'good') might come in handy to verify something later -- thus better keep
+ them around, unless you are really short on storage space.
+
+ To remove the modules of a kernel with the kernelrelease identifier
+ '*6.0-rc1-local-gcafec0cacaca0*', start by removing the directory holding its
+ modules::
+
+ sudo rm -rf /lib/modules/6.0-rc1-local-gcafec0cacaca0
+
+ Afterwards try the following command::
+
+ sudo kernel-install -v remove 6.0-rc1-local-gcafec0cacaca0
+
+ On quite a few distributions this will delete all other kernel files installed
+ while also removing the kernel's entry from the boot menu. But on some
+ distributions kernel-install does not exist or leaves boot-loader entries or
+ kernel image and related files behind; in that case remove them as described
+ in the reference section.
+
+ [:ref:`details<makeroom_bisref>`]
+
+.. _finishingtouch_bissbs:
+
+* Once you have finished the bisection, do not immediately remove anything you
+ set up, as you might need a few things again. What is safe to remove depends
+ on the outcome of the bisection:
+
+ * Could you initially reproduce the regression with the latest codebase and,
+ after the bisection, fix the problem by reverting the culprit on top of the
+ latest codebase? Then you want to keep those two kernels around
+ for a while, but safely remove all others with a '-local' in the release
+ identifier.
+
+ * Did the bisection end on a merge commit or seem questionable for other
+ reasons? Then you want to keep as many kernels as possible around for a few
+ days: it's pretty likely that you will be asked to recheck something.
+
+ * In other cases it likely is a good idea to keep the following kernels around
+ for some time: the one built from the latest codebase, the one created from
+ the version considered 'good', and the last three or four you compiled
+ during the actual bisection process.
+
+ [:ref:`details<finishingtouch_bisref>`]
+
+.. _submit_improvements:
+
+This concludes the step-by-step guide.
+
+Did you run into trouble following any of the above steps not cleared up by the
+reference section below? Did you spot errors? Or do you have ideas how to
+improve the guide? Then please take a moment and let the maintainer of this
+document know by email (Thorsten Leemhuis <linux@leemhuis.info>), ideally while
+CCing the Linux docs mailing list (linux-doc@vger.kernel.org). Such feedback is
+vital to improve this document further, which is in everybody's interest, as it
+will enable more people to master the task described here -- and hopefully also
+improve similar guides inspired by this one.
+
+
+Reference section for the step-by-step guide
+============================================
+
+This section holds additional information for almost all the items in the above
+step-by-step guide.
+
+.. _backup_bisref:
+
+Prepare for emergencies
+-----------------------
+
+ *Create a fresh backup and put system repair and restore tools at hand.*
+ [:ref:`... <backup_bissbs>`]
+
+Remember, you are dealing with computers, which sometimes do unexpected things
+-- especially if you fiddle with crucial parts like the kernel of an operating
+system. That's what you are about to do in this process. Hence, better prepare
+for something going sideways, even if that should not happen.
+
+[:ref:`back to step-by-step guide <backup_bissbs>`]
+
+.. _vanilla_bisref:
+
+Remove anything related to externally maintained kernel modules
+---------------------------------------------------------------
+
+ *Remove all software that depends on externally developed kernel drivers or
+ builds them automatically.* [:ref:`...<vanilla_bissbs>`]
+
+Externally developed kernel modules can easily cause trouble during a bisection.
+
+But there is a more important reason why this guide contains this step: most
+kernel developers will not care about reports about regressions occurring with
+kernels that utilize such modules. That's because such kernels are not
+considered 'vanilla' anymore, as Documentation/admin-guide/reporting-issues.rst
+explains in more detail.
+
+[:ref:`back to step-by-step guide <vanilla_bissbs>`]
+
+.. _secureboot_bisref:
+
+Deal with techniques like Secure Boot
+-------------------------------------
+
+ *On platforms with 'Secure Boot' or similar techniques, prepare everything to
+ ensure the system will permit your self-compiled kernel to boot later.*
+ [:ref:`... <secureboot_bissbs>`]
+
+Many modern systems allow only certain operating systems to start; that's why
+they reject booting self-compiled kernels by default.
+
+You ideally deal with this by making your platform trust your self-built kernels
+with the help of a certificate. How to do that is not described
+here, as it requires various steps that would take the text too far away from
+its purpose; 'Documentation/admin-guide/module-signing.rst' and various web
+sites already explain everything needed in more detail.
+
+Temporarily disabling solutions like Secure Boot is another way to make your own
+Linux boot. On commodity x86 systems it is possible to do this in the BIOS Setup
+utility; the required steps vary a lot between machines and therefore cannot be
+described here.
+
+On mainstream x86 Linux distributions there is a third and universal option:
+disable all Secure Boot restrictions for your Linux environment. You can
+initiate this process by running ``mokutil --disable-validation``; this will
+tell you to create a one-time password, which is safe to write down. Now
+restart; right after your BIOS performed all self-tests the bootloader Shim will
+show a blue box with a message 'Press any key to perform MOK management'. Hit
+some key before the countdown expires, which will open a menu. Choose 'Change
+Secure Boot state'. Shim's 'MokManager' will now ask you to enter three
+randomly chosen characters from the one-time password specified earlier. Once
+you have provided them, confirm that you really want to disable the validation.
+Afterwards, permit MokManager to reboot the machine.
+
+[:ref:`back to step-by-step guide <secureboot_bissbs>`]
+
+.. _bootworking_bisref:
+
+Boot the last kernel that was working
+-------------------------------------
+
+ *Boot into the last working kernel and briefly recheck if the feature that
+ regressed really works.* [:ref:`...<bootworking_bissbs>`]
+
+This will make later steps that cover creating and trimming the configuration do
+the right thing.
+
+[:ref:`back to step-by-step guide <bootworking_bissbs>`]
+
+.. _diskspace_bisref:
+
+Space requirements
+------------------
+
+ *Ensure to have enough free space for building Linux.*
+ [:ref:`... <diskspace_bissbs>`]
+
+The numbers mentioned are rough estimates with a big extra charge to be on the
+safe side, so often you will need less.
+
+If you have space constraints, be sure to pay attention to the :ref:`step about
+debug symbols <debugsymbols_bissbs>` and its :ref:`accompanying reference
+section <debugsymbols_bisref>`, as disabling them will reduce the consumed disk
+space by quite a few gigabytes.
+
+[:ref:`back to step-by-step guide <diskspace_bissbs>`]
+
+.. _rangecheck_bisref:
+
+Bisection range
+---------------
+
+ *Determine the kernel versions considered 'good' and 'bad' throughout this
+ guide.* [:ref:`...<rangecheck_bissbs>`]
+
+Establishing the range of commits to be checked is mostly straightforward,
+except when a regression occurred when switching from a release of one stable
+series to a release of a later series (e.g. from 6.0.11 to 6.1.4). In that case
+Git will need some hand holding, as there is no straight line of descent.
+
+That's because with the release of 6.0 mainline carried on to 6.1 while the
+stable series 6.0.y branched to the side. It's therefore theoretically possible
+that the issue you face with 6.1.4 only worked in 6.0.11, as it was fixed by a
+commit that went into one of the 6.0.y releases, but never hit mainline or the
+6.1.y series. Thankfully that normally should not happen due to the way the
+stable/longterm maintainers maintain the code. It's thus pretty safe to assume
+6.0 as a 'good' kernel. That assumption will be tested anyway, as that kernel
+will be built and tested in segment 2 of this guide; Git would force you to do
+this as well if you tried bisecting between 6.0.11 and 6.1.4.
+
+[:ref:`back to step-by-step guide <rangecheck_bissbs>`]
+
+.. _buildrequires_bisref:
+
+Install build requirements
+--------------------------
+
+ *Install all software required to build a Linux kernel.*
+ [:ref:`...<buildrequires_bissbs>`]
+
+The kernel is pretty stand-alone, but besides tools like the compiler you will
+sometimes need a few libraries to build one. How to install everything needed
+depends on your Linux distribution and the configuration of the kernel you are
+about to build.
+
+Here are a few examples of what you typically need on some mainstream
+distributions:
+
+* Arch Linux and derivatives::
+
+ sudo pacman --needed -S bc binutils bison flex gcc git kmod libelf openssl \
+ pahole perl zlib ncurses qt6-base
+
+* Debian, Ubuntu, and derivatives::
+
+ sudo apt install bc binutils bison dwarves flex gcc git kmod libelf-dev \
+ libssl-dev make openssl pahole perl-base pkg-config zlib1g-dev \
+ libncurses-dev qt6-base-dev g++
+
+* Fedora and derivatives::
+
+ sudo dnf install binutils \
+ /usr/bin/{bc,bison,flex,gcc,git,openssl,make,perl,pahole,rpmbuild} \
+ /usr/include/{libelf.h,openssl/pkcs7.h,zlib.h,ncurses.h,qt6/QtGui/QAction}
+
+* openSUSE and derivatives::
+
+ sudo zypper install bc binutils bison dwarves flex gcc git \
+ kernel-install-tools libelf-devel make modutils openssl openssl-devel \
+ perl-base zlib-devel rpm-build ncurses-devel qt6-base-devel
+
+These commands install a few packages that are often, but not always, needed.
+You for example might want to skip installing the development headers for
+ncurses, which you will only need in case you later want to adjust the kernel
+build configuration using the make targets 'menuconfig' or 'nconfig'; likewise
+omit the headers of Qt6 if you do not plan to adjust the .config using
+'xconfig'.
+
+You furthermore might need additional libraries and their development headers
+for tasks not covered in this guide -- for example when building utilities from
+the kernel's tools/ directory.
+
+[:ref:`back to step-by-step guide <buildrequires_bissbs>`]
+
+.. _sources_bisref:
+
+Download the sources using Git
+------------------------------
+
+ *Retrieve the Linux mainline sources.*
+ [:ref:`...<sources_bissbs>`]
+
+The step-by-step guide outlines how to download the Linux sources using a full
+Git clone of Linus' mainline repository. There is nothing more to say about
+that -- but there are two alternative ways to retrieve the sources that might
+work better for you:
+
+* If you have an unreliable internet connection, consider
+ :ref:`using a 'Git bundle'<sources_bundle_bisref>`.
+
+* If downloading the complete repository would take too long or requires too
+ much storage space, consider :ref:`using a 'shallow
+ clone'<sources_shallow_bisref>`.
+
+.. _sources_bundle_bisref:
+
+Downloading Linux mainline sources using a bundle
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use the following commands to retrieve the Linux mainline sources using a
+bundle::
+
+ wget -c \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/clone.bundle
+ git clone --no-checkout clone.bundle ~/linux/
+ cd ~/linux/
+ git remote remove origin
+ git remote add mainline \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
+ git fetch mainline
+ git remote add -t master stable \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+
+In case the 'wget' command fails, just re-execute it; it will pick up where
+it left off.
+
+[:ref:`back to step-by-step guide <sources_bissbs>`]
+[:ref:`back to section intro <sources_bisref>`]
+
+.. _sources_shallow_bisref:
+
+Downloading Linux mainline sources using a shallow clone
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First, execute the following command to retrieve the latest mainline codebase::
+
+ git clone -o mainline --no-checkout --depth 1 -b master \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git ~/linux/
+ cd ~/linux/
+ git remote add -t master stable \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+
+Now deepen your clone's history to the second predecessor of the mainline
+release of your 'good' version. In case the latter is 6.0 or 6.0.11, 5.19 would
+be the first predecessor and 5.18 the second -- hence deepen the history up to
+that version::
+
+ git fetch --shallow-exclude=v5.18 mainline
+
+Afterwards add the stable Git repository as remote and all required stable
+branches as explained in the step-by-step guide.
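+
+For example, if your 'good' version is 6.0 or a release from the 6.0.y series
+(just an assumption here, adjust the branch name to your case), that boils down
+to::
+
+ git remote set-branches --add stable linux-6.0.y
+ git fetch stable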
+
+Note, shallow clones have a few peculiar characteristics:
+
+* For bisections the history needs to be deepened a few mainline versions
+ farther than it seems necessary, as explained above already. That's because
+ Git otherwise will be unable to revert or describe most of the commits within
+ a range (say 6.1..6.2), as they are internally based on earlier kernel
+ releases (like 6.0-rc2 or 5.19-rc3).
+
+* This document in most places uses ``git fetch`` with ``--shallow-exclude=``
+ to specify the earliest version you care about (or to be precise: its git
+ tag). You alternatively can use the parameter ``--shallow-since=`` to specify
+ an absolute (say ``'2023-07-15'``) or relative (``'12 months'``) date to
+ define the depth of the history you want to download. When using them while
+ bisecting mainline, ensure to deepen the history to at least 7 months before
+ the release of the mainline release your 'good' kernel is based on.
+
+* Be warned, when deepening your clone you might encounter an error like
+ 'fatal: error in object: unshallow cafecaca0c0dacafecaca0c0dacafecaca0c0da'.
+ In that case run ``git repack -d`` and try again.
+
+[:ref:`back to step-by-step guide <sources_bissbs>`]
+[:ref:`back to section intro <sources_bisref>`]
+
+.. _oldconfig_bisref:
+
+Start defining the build configuration for your kernel
+------------------------------------------------------
+
+ *Start preparing a kernel build configuration (the '.config' file).*
+ [:ref:`... <oldconfig_bissbs>`]
+
+*Note, this is the first of multiple steps in this guide that create or modify
+build artifacts. The commands used in this guide store them right in the source
+tree to keep things simple. In case you prefer storing the build artifacts
+separately, create a directory like '~/linux-builddir/' and add the parameter
+``O=~/linux-builddir/`` to all make calls used throughout this guide. You will
+have to point other commands there as well -- among them the ``./scripts/config
+[...]`` commands, which will require ``--file ~/linux-builddir/.config`` to
+locate the right build configuration.*
+
+Two things can easily go wrong when creating a .config file as advised:
+
+* The olddefconfig target will use a .config file from your build directory, if
+ one is already present there (e.g. '~/linux/.config'). That's totally fine if
+ that's what you intend (see next step), but in all other cases you want to
+ delete it. This for example is important in case you followed this guide
+ further, but due to problems come back here to redo the configuration from
+ scratch.
+
+* Sometimes olddefconfig is unable to locate the .config file for your running
+ kernel and will use defaults, as briefly outlined in the guide. In that case
+ check if your distribution ships the configuration somewhere and manually put
+ it in the right place (e.g. '~/linux/.config') if it does. On distributions
+ where /proc/config.gz exists this can be achieved using this command::
+
+ zcat /proc/config.gz > .config
+
+ Once you put it there, run ``make olddefconfig`` again to adjust it to the
+ needs of the kernel about to be built.
+
+Note, the olddefconfig target will set any undefined build options to their
+default value. If you prefer to set such configuration options manually, use
+``make oldconfig`` instead. Then for each undefined configuration option you
+will be asked how to proceed; in case you are unsure what to answer, simply hit
+'enter' to apply the default value. Note though that for bisections you normally
+want to go with the defaults, as you otherwise might enable a new feature that
+causes a problem that looks like a regression (for example due to security
+restrictions).
+
+Occasionally odd things happen when trying to use a config file prepared for one
+kernel (say 6.1) on an older mainline release -- especially if it is much older
+(say 5.15). That's one of the reasons why the previous step in the guide told
+you to boot the kernel where everything works. If you manually add a .config
+file you thus want to ensure it's from the working kernel and not from one
+that shows the regression.
+
+In case you want to build kernels for another machine, locate its kernel build
+configuration; usually ``ls /boot/config-$(uname -r)`` will print its name. Copy
+that file to the build machine and store it as ~/linux/.config; afterwards run
+``make olddefconfig`` to adjust it.
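+
+A rough sketch of that process; the file name 'config-othermachine' is just a
+placeholder used here for illustration::
+
+ # on the other machine: put its configuration into a file with a neutral name
+ cp "/boot/config-$(uname -r)" ~/config-othermachine
+ # transfer that file to the build host (e.g. with scp), then continue there:
+ cp ~/config-othermachine ~/linux/.config
+ cd ~/linux/ && make olddefconfig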
+
+[:ref:`back to step-by-step guide <oldconfig_bissbs>`]
+
+.. _localmodconfig_bisref:
+
+Trim the build configuration for your kernel
+--------------------------------------------
+
+ *Disable any kernel modules apparently superfluous for your setup.*
+ [:ref:`... <localmodconfig_bissbs>`]
+
+As explained briefly in the step-by-step guide already: with localmodconfig it
+can easily happen that your self-built kernels will lack modules for tasks you
+did not perform at least once before utilizing this make target. That happens
+when a task requires kernel modules which are only autoloaded when you execute
+it for the first time. So if you have never performed that task since starting
+your kernel, the modules will not have been loaded -- and from localmodconfig's
+point of view they look superfluous, so it disables them to reduce the amount
+of code to be compiled.
+
+You can try to avoid this by performing typical tasks that often will autoload
+additional kernel modules: start a VM, establish VPN connections, loop-mount a
+CD/DVD ISO, mount network shares (CIFS, NFS, ...), and connect all external
+devices (2FA keys, headsets, webcams, ...) as well as storage devices with file
+systems you otherwise do not utilize (btrfs, ext4, FAT, NTFS, XFS, ...). But it
+is hard to think of everything that might be needed -- even kernel developers
+often forget one thing or another at this point.
+
+Do not let that risk bother you, especially when compiling a kernel only for
+testing purposes: everything typically crucial will be there. And if you forget
+something important you can turn on a missing feature manually later and quickly
+run the commands again to compile and install a kernel that has everything you
+need.
+
+But if you plan to build and use self-built kernels regularly, you might want to
+reduce the risk by recording which modules your system loads over the course of
+a few weeks. You can automate this with `modprobed-db
+<https://github.com/graysky2/modprobed-db>`_. Afterwards use ``LSMOD=<path>`` to
+point localmodconfig to the list of modules modprobed-db noticed being used::
+
+ yes '' | make LSMOD='${HOME}'/.config/modprobed.db localmodconfig
+
+That parameter also allows you to build trimmed kernels for another machine in
+case you copied a suitable .config over to use as base (see previous step). Just
+run ``lsmod > lsmod_foo-machine`` on that system and copy the generated file to
+your build host's home directory. Then run these commands instead of the one the
+step-by-step guide mentions::
+
+ yes '' | make LSMOD=~/lsmod_foo-machine localmodconfig
+
+[:ref:`back to step-by-step guide <localmodconfig_bissbs>`]
+
+.. _tagging_bisref:
+
+Tag the kernels about to be built
+---------------------------------
+
+ *Ensure all the kernels you will build are clearly identifiable using a
+ special tag and a unique version identifier.* [:ref:`... <tagging_bissbs>`]
+
+This allows you to differentiate your distribution's kernels from those created
+during this process, as the file or directories for the latter will contain
+'-local' in the name; it also helps you pick the right entry in the boot menu
+and not lose track of your kernels, as their version numbers will look slightly
+confusing during the bisection.
+
+[:ref:`back to step-by-step guide <tagging_bissbs>`]
+
+.. _debugsymbols_bisref:
+
+Decide to enable or disable debug symbols
+-----------------------------------------
+
+ *Decide how to handle debug symbols.* [:ref:`... <debugsymbols_bissbs>`]
+
+Having debug symbols available can be important when your kernel throws a
+'panic', 'Oops', 'warning', or 'BUG' later when running, as then you will be
+able to find the exact place where the problem occurred in the code. But
+collecting and embedding the needed debug information takes time and consumes
+quite a bit of space: in late 2022 the build artifacts for a typical x86 kernel
+trimmed with localmodconfig consumed around 5 Gigabyte of space with debug
+symbols, but less than 1 when they were disabled. The resulting kernel image and
+modules are bigger as well, which increases storage requirements for /boot/ and
+load times.
+
+In case you want a small kernel and are unlikely to decode a stack trace later,
+you thus might want to disable debug symbols to avoid those downsides. If it
+later turns out that you need them, just enable them as shown and rebuild the
+kernel.
+
+You on the other hand definitely want to enable them for this process, if there
+is a decent chance that you need to decode a stack trace later. The section
+'Decode failure messages' in Documentation/admin-guide/reporting-issues.rst
+explains this process in more detail.
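+
+Decoding later usually involves a helper script shipped with the kernel
+sources; a rough sketch, assuming you saved the failure message to a
+hypothetical file '~/oops.txt' and still have the build artifacts around::
+
+ ./scripts/decode_stacktrace.sh vmlinux < ~/oops.txt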
+
+[:ref:`back to step-by-step guide <debugsymbols_bissbs>`]
+
+.. _configmods_bisref:
+
+Adjust build configuration
+--------------------------
+
+ *Check if you may want or need to adjust some other kernel configuration
+ options:*
+
+Depending on your needs you at this point might want or have to adjust some
+kernel configuration options.
+
+.. _configmods_distros_bisref:
+
+Distro specific adjustments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Are you running* [:ref:`... <configmods_bissbs>`]
+
+The following sections help you to avoid build problems that are known to occur
+when following this guide on a few commodity distributions.
+
+**Debian:**
+
+* Remove a stale reference to a certificate file that would cause your build to
+ fail::
+
+ ./scripts/config --set-str SYSTEM_TRUSTED_KEYS ''
+
+ Alternatively, download the needed certificate and make that configuration
+ option point to it, as `the Debian handbook explains in more detail
+ <https://debian-handbook.info/browse/stable/sect.kernel-compilation.html>`_
+ -- or generate your own, as explained in
+ Documentation/admin-guide/module-signing.rst.
+
+[:ref:`back to step-by-step guide <configmods_bissbs>`]
+
+.. _configmods_individual_bisref:
+
+Individual adjustments
+~~~~~~~~~~~~~~~~~~~~~~
+
+ *If you want to influence the other aspects of the configuration, do so
+ now.* [:ref:`... <configmods_bissbs>`]
+
+At this point you can use a command like ``make menuconfig`` or ``make nconfig``
+to enable or disable certain features using a text-based user interface; to use
+a graphical configuration utility, run ``make xconfig`` instead. Both of them
+require development libraries from the toolkits they rely on (ncurses and
+Qt5 or Qt6, respectively); an error message will tell you if something required
+is missing.
+
+[:ref:`back to step-by-step guide <configmods_bissbs>`]
+
+.. _saveconfig_bisref:
+
+Put the .config file aside
+--------------------------
+
+ *Reprocess the .config after the latest changes and store it in a safe place.*
+ [:ref:`... <saveconfig_bissbs>`]
+
+Put the .config you prepared aside, as you want to copy it back to the build
+directory every time during this guide before you start building another
+kernel. That's because going back and forth between different versions can alter
+.config files in odd ways; those occasionally cause side effects that could
+confuse testing or in some cases render the result of your bisection
+meaningless.
+
+[:ref:`back to step-by-step guide <saveconfig_bissbs>`]
+
+.. _introlatestcheck_bisref:
+
+Try to reproduce the regression
+-------------------------------
+
+ *Verify the regression is not caused by some .config change and check if it
+ still occurs with the latest codebase.* [:ref:`... <introlatestcheck_bissbs>`]
+
+For some readers it might seem unnecessary to check the latest codebase at this
+point, especially if you did that already with a kernel prepared by your
+distributor or face a regression within a stable/longterm series. But it's
+highly recommended for these reasons:
+
+* You will run into any problems caused by your setup before you actually begin
+ a bisection. That will make it a lot easier to differentiate between 'this
+ most likely is some problem in my setup' and 'this change needs to be skipped
+ during the bisection, as the kernel sources at that stage contain an unrelated
+ problem that causes building or booting to fail'.
+
+* These steps will rule out if your problem is caused by some change in the
+ build configuration between the 'working' and the 'broken' kernel. This for
+ example can happen when your distributor enabled an additional security
+ feature in the newer kernel which was disabled or not yet supported by the
+ older kernel. That security feature might get into the way of something you
+ do -- in which case your problem from the perspective of the Linux kernel
+ upstream developers is not a regression, as
+ Documentation/admin-guide/reporting-regressions.rst explains in more detail.
+ You thus would waste your time if you'd try to bisect this.
+
+* If the cause for your regression was already fixed in the latest mainline
+ codebase, you'd perform the bisection for nothing. This holds true for a
+ regression you encountered with a stable/longterm release as well, as they are
+ often caused by problems in mainline changes that were backported -- in which
+ case the problem will have to be fixed in mainline first. Maybe it already was
+ fixed there and the fix is already in the process of being backported.
+
+* For regressions within a stable/longterm series it's furthermore crucial to
+ know if the issue is specific to that series or also happens in the mainline
+ kernel, as the report needs to be sent to different people:
+
+ * Regressions specific to a stable/longterm series are the stable team's
+ responsibility; mainline Linux developers might or might not care.
+
+ * Regressions also happening in mainline are something the regular Linux
+ developers and maintainers have to handle; the stable team does not care
+ and does not need to be involved in the report, they just should be told
+ to backport the fix once it's ready.
+
+ Your report might be ignored if you send it to the wrong party -- and even
+ when you get a reply there is a decent chance that developers tell you to
+ evaluate which of the two cases it is before they take a closer look.
+
+[:ref:`back to step-by-step guide <introlatestcheck_bissbs>`]
+
+.. _checkoutmaster_bisref:
+
+Check out the latest Linux codebase
+-----------------------------------
+
+ *Check out the latest Linux codebase.*
+ [:ref:`... <introlatestcheck_bissbs>`]
+
+In case you later want to recheck if an ever newer codebase might fix the
+problem, remember to run that ``git fetch --shallow-exclude [...]`` command
+again mentioned earlier to update your local Git repository.
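+
+A minimal sketch, assuming you went with the shallow clone approach and used
+v5.18 as the history boundary earlier (adjust both to your situation); with a
+full clone a plain ``git fetch mainline`` is all it takes::
+
+ git fetch --shallow-exclude=v5.18 mainline
+ git checkout --force --detach mainline/master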
+
+[:ref:`back to step-by-step guide <introlatestcheck_bissbs>`]
+
+.. _build_bisref:
+
+Build your kernel
+-----------------
+
+ *Build the image and the modules of your first kernel using the config file
+ you prepared.* [:ref:`... <build_bissbs>`]
+
+A lot can go wrong at this stage, but the instructions below will help you help
+yourself. Another subsection explains how to directly package your kernel up as
+deb, rpm or tar file.
+
+Dealing with build errors
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a build error occurs, it might be caused by some aspect of your machine's
+setup that often can be fixed quickly; other times though the problem lies in
+the code and can only be fixed by a developer. A close examination of the
+failure messages coupled with some research on the internet will often tell you
+which of the two it is. To perform such investigation, restart the build
+process like this::
+
+ make V=1
+
+The ``V=1`` activates verbose output, which might be needed to see the actual
+error. To make it easier to spot, this command also omits the ``-j $(nproc
+--all)`` used earlier to utilize every CPU core in the system for the job -- but
+this parallelism also results in some clutter when failures occur.
+
+After a few seconds the build process should run into the error again. Now try
+to find the most crucial line describing the problem. Then search the internet
+for the most important and non-generic section of that line (say 4 to 8 words);
+avoid or remove anything that looks remotely system-specific, like your username
+or local path names like ``/home/username/linux/``. First try your regular
+internet search engine with that string, afterwards search Linux kernel mailing
+lists via `lore.kernel.org/all/ <https://lore.kernel.org/all/>`_.
+
+This most of the time will find something that will explain what is wrong; quite
+often one of the hits will provide a solution for your problem, too. If you
+do not find anything that matches your problem, try again from a different angle
+by modifying your search terms or using another line from the error messages.
+
+In the end, most issues you run into have likely been encountered and
+reported by others already. That includes issues where the cause is not your
+system, but lies in the code. If you run into one of those, you might thus find a
+solution (e.g. a patch) or workaround for your issue, too.
+
+Package your kernel up
+~~~~~~~~~~~~~~~~~~~~~~
+
+The step-by-step guide uses the default make targets (e.g. 'bzImage' and
+'modules' on x86) to build the image and the modules of your kernel, which later
+steps of the guide then install. You can instead build everything and directly
+package it up by using one of the following targets:
+
+* ``make -j $(nproc --all) bindeb-pkg`` to generate a deb package
+
+* ``make -j $(nproc --all) binrpm-pkg`` to generate a rpm package
+
+* ``make -j $(nproc --all) tarbz2-pkg`` to generate a bz2 compressed tarball
+
+This is just a selection of available make targets for this purpose, see
+``make help`` for others. You can also use these targets after running
+``make -j $(nproc --all)``, as they will pick up everything already built.
+
+If you employ the targets to generate deb or rpm packages, ignore the
+step-by-step guide's instructions on installing and removing your kernel;
+instead install and remove the packages using the package utility for the format
+(e.g. dpkg and rpm) or a package management utility built on top of them (apt,
+aptitude, dnf/yum, zypper, ...). Be aware that the packages generated using
+these two make targets are designed to work on various distributions utilizing
+those formats; they thus will sometimes behave differently from your
+distribution's kernel packages.
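+
+If you for example generated a deb package, it will usually be placed in the
+directory above the one you built in; a sketch of installing it (the exact file
+name will differ, the one used here is just an illustration)::
+
+ sudo dpkg -i ../linux-image-6.1.5-local_*.deb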
+
+[:ref:`back to step-by-step guide <build_bissbs>`]
+
+.. _install_bisref:
+
+Put the kernel in place
+-----------------------
+
+ *Install the kernel you just built.* [:ref:`... <install_bissbs>`]
+
+What you need to do after executing the command in the step-by-step guide
+depends on the existence and the implementation of ``/sbin/installkernel``
+executable on your distribution.
+
+If installkernel is found, the kernel's build system will delegate the actual
+installation of your kernel image to this executable, which then performs some
+or all of these tasks:
+
+* On almost all Linux distributions installkernel will store your kernel's
+ image in /boot/, usually as '/boot/vmlinuz-<kernelrelease_id>'; often it will
+ put a 'System.map-<kernelrelease_id>' alongside it.
+
+* On most distributions installkernel will then generate an 'initramfs'
+ (sometimes also called 'initrd'), which usually are stored as
+ '/boot/initramfs-<kernelrelease_id>.img' or
+ '/boot/initrd-<kernelrelease_id>'. Commodity distributions rely on this file
+ for booting, hence ensure you execute the make target 'modules_install' first,
+ as your distribution's initramfs generator otherwise will be unable to find
+ the modules that go into the image.
+
+* On some distributions installkernel will then add an entry for your kernel
+ to your bootloader's configuration.
+
+You have to take care of some or all of these tasks yourself if your
+distribution lacks an installkernel script or only handles some of them.
+Consult your distribution's documentation for details. If in doubt, install the
+kernel manually::
+
+ sudo install -m 0600 $(make -s image_name) /boot/vmlinuz-$(make -s kernelrelease)
+ sudo install -m 0600 System.map /boot/System.map-$(make -s kernelrelease)
+
+Now generate your initramfs using the tools your distribution provides for this
+process. Afterwards add your kernel to your bootloader configuration and reboot.
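+
+Which tools those are differs between distributions; a rough sketch for a
+system that uses dracut and GRUB (both merely assumptions here, consult your
+distribution's documentation if it ships something else)::
+
+ # assumes dracut is the initramfs generator your distribution uses
+ sudo dracut --force /boot/initramfs-$(make -s kernelrelease).img $(make -s kernelrelease)
+ # assumes GRUB; some distributions instead use 'grub2-mkconfig -o /boot/grub2/grub.cfg'
+ sudo grub-mkconfig -o /boot/grub/grub.cfg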
+
+[:ref:`back to step-by-step guide <install_bissbs>`]
+
+.. _storagespace_bisref:
+
+Storage requirements per kernel
+-------------------------------
+
+ *Check how much storage space the kernel, its modules, and other related files
+ like the initramfs consume.* [:ref:`... <storagespace_bissbs>`]
+
+The kernels built during a bisection consume quite a bit of space in /boot/ and
+/lib/modules/, especially if you enabled debug symbols. That makes it easy to
+fill up volumes during a bisection -- and due to that even kernels which used to
+work earlier might fail to boot. To prevent that you will need to know how much
+space each installed kernel typically requires.
+
+Note, most of the time the pattern '/boot/*$(make -s kernelrelease)*' used in
+the guide will match all files needed to boot your kernel -- but neither the
+path nor the naming scheme are mandatory. On some distributions you thus will
+need to look in different places.
+
+[:ref:`back to step-by-step guide <storagespace_bissbs>`]
+
+.. _tainted_bisref:
+
+Check if your newly built kernel considers itself 'tainted'
+-----------------------------------------------------------
+
+ *Check if the kernel marked itself as 'tainted'.*
+ [:ref:`... <tainted_bissbs>`]
+
+Linux marks itself as tainted when something happens that potentially leads to
+follow-up errors that look totally unrelated. That is why developers might
+ignore or react only hesitantly to reports from tainted kernels -- unless of
+course the kernel set the flag right when the reported bug occurred.
+
+That's why you want to check why a kernel is tainted as explained in
+Documentation/admin-guide/tainted-kernels.rst; doing so is also in your own
+interest, as your testing might be flawed otherwise.
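+
+To quickly decode the taint value, you can use a small helper script shipped
+with the kernel sources (assuming it is present in the version you checked
+out)::
+
+ ./tools/debugging/kernel-chktaint
+
+It prints a short explanation for each reason that contributed to the value.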
+
+[:ref:`back to step-by-step guide <tainted_bissbs>`]
+
+.. _recheckbroken_bisref:
+
+Check the kernel built from a recent mainline codebase
+------------------------------------------------------
+
+ *Verify if your bug occurs with the newly built kernel.*
+ [:ref:`... <recheckbroken_bissbs>`]
+
+There are a couple of reasons why your bug or regression might not show up with
+the kernel you built from the latest codebase. These are the most frequent:
+
+* The bug was fixed meanwhile.
+
+* What you suspected to be a regression was caused by a change in the build
+ configuration the provider of your kernel carried out.
+
+* Your problem might be a race condition that does not show up with your kernel;
+ the trimmed build configuration, a different setting for debug symbols, the
+ compiler used, and various other things can cause this.
+
+* In case you encountered the regression with a stable/longterm kernel it might
+ be a problem that is specific to that series; the next step in this guide will
+ check this.
+
+[:ref:`back to step-by-step guide <recheckbroken_bissbs>`]
+
+.. _recheckstablebroken_bisref:
+
+Check the kernel built from the latest stable/longterm codebase
+---------------------------------------------------------------
+
+ *Are you facing a regression within a stable/longterm release, but failed to
+ reproduce it with the kernel you just built using the latest mainline sources?
+ Then check if the latest codebase for the particular series might already fix
+ the problem.* [:ref:`... <recheckstablebroken_bissbs>`]
+
+If this kernel does not show the regression either, there most likely is no need
+for a bisection.
+
+[:ref:`back to step-by-step guide <recheckstablebroken_bissbs>`]
+
+.. _introworkingcheck_bisref:
+
+Ensure the 'good' version is really working well
+------------------------------------------------
+
+ *Check if the kernels you build work fine.*
+ [:ref:`... <introworkingcheck_bissbs>`]
+
+This section will reestablish a known working base. Skipping it might be
+appealing, but is usually a bad idea, as it does something important:
+
+It will ensure the .config file you prepared earlier actually works as expected.
+That is in your own interest, as trimming the configuration is not foolproof --
+and you might be building and testing ten or more kernels for nothing before
+starting to suspect something might be wrong with the build configuration.
+
+That alone is reason enough to spend the time on this, but not the only reason.
+
+Many readers of this guide normally run kernels that are patched, use add-on
+modules, or both. Those kernels thus are not considered 'vanilla' -- therefore
+it's possible that the thing that regressed might never have worked in vanilla
+builds of the 'good' version in the first place.
+
+There is a third reason for those that noticed a regression between
+stable/longterm kernels of different series (e.g. 6.0.13..6.1.5): it will
+ensure the kernel version you assumed to be 'good' earlier in the process (e.g.
+6.0) actually is working.
+
+[:ref:`back to step-by-step guide <introworkingcheck_bissbs>`]
+
+.. _recheckworking_bisref:
+
+Build your own version of the 'good' kernel
+-------------------------------------------
+
+ *Build your own variant of the working kernel and check if the feature that
+ regressed works as expected with it.* [:ref:`... <recheckworking_bissbs>`]
+
+In case the feature that broke with newer kernels does not work with your first
+self-built kernel, find and resolve the cause before moving on. There are a
+multitude of reasons why this might happen. Some ideas where to look:
+
+* Check the taint status and the output of ``dmesg``, maybe something unrelated
+ went wrong.
+
+* Maybe localmodconfig did something odd and disabled the module required to
+ test the feature? Then you might want to recreate a .config file based on the
+ one from the last working kernel and skip trimming it down; manually disabling
+ some features in the .config might work as well to reduce the build time.
+
+* Maybe it's not a kernel regression and something that is caused by some fluke,
+ a broken initramfs (also known as initrd), new firmware files, or an updated
+ userland software?
+
+* Maybe it was a feature added to your distributor's kernel which vanilla Linux
+ at that point never supported?
+
+Note, if you found and fixed problems with the .config file, you want to use it
+to build another kernel from the latest codebase, as your earlier tests with
+mainline and the latest version from an affected stable/longterm series were most
+likely flawed.
+
+[:ref:`back to step-by-step guide <recheckworking_bissbs>`]
+
+.. _bisectstart_bisref:
+
+Start the bisection
+-------------------
+
+ *Start the bisection and tell Git about the versions earlier established as
+ 'good' and 'bad'.* [:ref:`... <bisectstart_bissbs>`]
+
+This will start the bisection process; the last of the commands will make Git
+check out a commit round about half-way between the 'good' and the 'bad' changes
+for you to test.
+
+[:ref:`back to step-by-step guide <bisectstart_bissbs>`]
+
+.. _bisectbuild_bisref:
+
+Build a kernel from the bisection point
+---------------------------------------
+
+ *Build, install, and boot a kernel from the code Git checked out using the
+ same commands you used earlier.* [:ref:`... <bisectbuild_bissbs>`]
+
+There are two things worth of note here:
+
+* Occasionally building the kernel will fail or it might not boot due to some
+ problem in the code at the bisection point. In that case run this command::
+
+ git bisect skip
+
+ Git will then check out another commit nearby which with a bit of luck should
+ work better. Afterwards restart executing this step.
+
+* Those slightly odd looking version identifiers come up during bisections,
+ because the Linux kernel subsystems prepare their changes for a new mainline
+ release (say 6.2) before its predecessor (e.g. 6.1) is finished. They thus
+ base them on a somewhat earlier point like 6.1-rc1 or even 6.0 -- and those
+ changes then get merged for 6.2 without being rebased or squashed once 6.1
+ is out.
+
+[:ref:`back to step-by-step guide <bisectbuild_bissbs>`]
+
+.. _bisecttest_bisref:
+
+Bisection checkpoint
+--------------------
+
+ *Check if the feature that regressed works in the kernel you just built.*
+ [:ref:`... <bisecttest_bissbs>`]
+
+Ensure what you tell Git is accurate: getting it wrong just one time will bring
+the rest of the bisection totally off course, hence all testing after that point
+will be for nothing.
+
+[:ref:`back to step-by-step guide <bisecttest_bissbs>`]
+
+.. _bisectlog_bisref:
+
+Put the bisection log away
+--------------------------
+
+ *Store Git's bisection log and the current .config file in a safe place.*
+ [:ref:`... <bisectlog_bissbs>`]
+
+As indicated above: declaring just one kernel wrongly as 'good' or 'bad' will
+render the end result of a bisection useless. In that case you'd normally have
+to restart the bisection from scratch. The log can prevent that, as it might
+allow someone to point out where a bisection likely went sideways -- and then
+instead of testing ten or more kernels you might only have to build a few to
+resolve things.
+
+The .config file is put aside, as there is a decent chance that developers might
+ask for it after you report the regression.
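+
+A sketch of putting both aside (the target file names are arbitrary)::
+
+ git bisect log > ~/bisection-log
+ cp .config ~/bisection-config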
+
+[:ref:`back to step-by-step guide <bisectlog_bissbs>`]
+
+.. _revert_bisref:
+
+Try reverting the culprit
+-------------------------
+
+ *Try reverting the culprit on top of the latest codebase to see if this fixes
+ your regression.* [:ref:`... <revert_bissbs>`]
+
+This is an optional step, but whenever possible one you should try: there is a
+decent chance that developers will ask you to perform this step when you bring
+the bisection result up. So give it a try: you are in the flow already, and
+building one more kernel shouldn't be a big deal at this point.
+
+The step-by-step guide covers everything relevant already except one slightly
+rare thing: did you bisect a regression that also happens with mainline using
+a stable/longterm series, but Git failed to revert the culprit in mainline?
+Then try to revert it in the affected stable/longterm series -- and if that
+succeeds, test that kernel version instead.
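+
+As a sketch, with 'cafec0cacaca0' merely standing in for the commit id of the
+actual culprit::
+
+ cd ~/linux/
+ git checkout mainline/master
+ git revert --no-edit cafec0cacaca0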
+
+[:ref:`back to step-by-step guide <revert_bissbs>`]
+
+
+Supplementary tasks: cleanup during and after the bisection
+-----------------------------------------------------------
+
+.. _makeroom_bisref:
+
+Cleaning up during the bisection
+--------------------------------
+
+ *To remove one of the kernels you installed, look up its 'kernelrelease'
+ identifier.* [:ref:`... <makeroom_bissbs>`]
+
+The kernels you install during this process are easy to remove later, as their
+parts are stored in only two places and are clearly identifiable. You thus do
+not need to worry about messing up your machine when you install a kernel
+manually (and thus bypass your distribution's packaging system).
+
+One of the two places is a directory in /lib/modules/, which holds the modules
+for each installed kernel. This directory is named after the kernel's release
+identifier; hence, to remove all modules for one of the kernels you built,
+simply remove its modules directory in /lib/modules/.
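+
+For a kernel with the example release name used below, that boils down to
+something like this::
+
+ sudo rm -rf /lib/modules/6.0-rc1-local-gcafec0cacaca0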
+
+The other place is /boot/, where typically two to five files will be placed
+during installation of a kernel. All of them usually contain the release name
+in their file name, but how many files and their exact names depend somewhat on
+your distribution's installkernel executable and its initramfs generator. On
+some distributions the ``kernel-install remove...`` command mentioned in the
+step-by-step guide will delete all of these files for you while also removing
+the menu entry for the kernel from your bootloader configuration. On others you
+have to take care of these two tasks yourself. The following command should
+interactively remove the three main files of a kernel with the release name
+'6.0-rc1-local-gcafec0cacaca0'::
+
+ rm -i /boot/{System.map,vmlinuz,initr}-6.0-rc1-local-gcafec0cacaca0
+
+Afterwards check for other files in /boot/ that have
+'6.0-rc1-local-gcafec0cacaca0' in their name and consider deleting them as well.
+Now remove the boot entry for the kernel from your bootloader's configuration;
+the steps to do that vary quite a bit between Linux distributions.
+
+Note, be careful with wildcards like '*' when deleting files or directories
+for kernels manually: you might accidentally remove files of a 6.0.11 kernel
+when all you want is to remove 6.0 or 6.0.1.
+
+[:ref:`back to step-by-step guide <makeroom_bissbs>`]
+
+.. _finishingtouch_bisref:
+
+Cleaning up after the bisection
+-------------------------------
+
+ *Once you have finished the bisection, do not immediately remove anything
+ you set up, as you might need a few things again.*
+ [:ref:`... <finishingtouch_bissbs>`]
+
+When you are really short of storage space, removing the kernels as described
+in the step-by-step guide might not free as much space as you would like. In
+that case, consider running ``rm -rf ~/linux/*`` as well now. This will remove
+the build artifacts and the Linux sources, but will leave the Git repository
+(~/linux/.git/) behind -- a simple ``git reset --hard`` will thus bring the
+sources back.
+
+Removing the repository as well would likely be unwise at this point: there is
+a decent chance developers will ask you to build another kernel to perform
+additional tests. This is often required to debug an issue or check proposed
+fixes. Before doing so, you want to run the ``git fetch mainline`` command
+again followed by ``git checkout mainline/master`` to bring your clone up to
+date and check out the latest codebase. Then apply the patch using
+``git apply <filename>`` or ``git am <filename>`` and build yet another kernel
+using the familiar commands.
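+
+As a rough sketch, with 'proposed-fix.patch' merely a placeholder for whatever
+file the developers sent you (here assumed to sit in your home directory)::
+
+ cd ~/linux/
+ git fetch mainline
+ git checkout mainline/master
+ git am ~/proposed-fix.patch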
+
+Additional tests are also the reason why you want to keep the
+~/kernel-config-working file around for a few weeks.
+
+[:ref:`back to step-by-step guide <finishingtouch_bissbs>`]
+
+
+Additional reading material
+===========================
+
+Further sources
+---------------
+
+* The `man page for 'git bisect' <https://git-scm.com/docs/git-bisect>`_ and
+ `fighting regressions with 'git bisect' <https://git-scm.com/docs/git-bisect-lk2009.html>`_
+ in the Git documentation.
+* `Working with git bisect <https://nathanchance.dev/posts/working-with-git-bisect/>`_
+ from kernel developer Nathan Chancellor.
+* `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_.
+* `Fully automated bisecting with 'git bisect run' <https://lwn.net/Articles/317154>`_.
+
+..
+ end-of-content
+..
+ This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If
+ you spot a typo or small mistake, feel free to let him know directly and
+ he'll fix it. You are free to do the same in a mostly informal way if you
+ want to contribute changes to the text -- but for copyright reasons please CC
+ linux-doc@vger.kernel.org and 'sign-off' your contribution as
+ Documentation/process/submitting-patches.rst explains in the section 'Sign
+ your work - the Developer's Certificate of Origin'.
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use 'The Linux kernel development community' for author attribution
+ and link this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst
+
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/admin-guide/wimax/i2400m.rst b/Documentation/admin-guide/wimax/i2400m.rst
deleted file mode 100644
index 194388c0c351..000000000000
--- a/Documentation/admin-guide/wimax/i2400m.rst
+++ /dev/null
@@ -1,283 +0,0 @@
-.. include:: <isonum.txt>
-
-====================================================
-Driver for the Intel Wireless Wimax Connection 2400m
-====================================================
-
-:Copyright: |copy| 2008 Intel Corporation < linux-wimax@intel.com >
-
- This provides a driver for the Intel Wireless WiMAX Connection 2400m
- and a basic Linux kernel WiMAX stack.
-
-1. Requirements
-===============
-
- * Linux installation with Linux kernel 2.6.22 or newer (if building
- from a separate tree)
- * Intel i2400m Echo Peak or Baxter Peak; this includes the Intel
- Wireless WiMAX/WiFi Link 5x50 series.
- * build tools:
-
- + Linux kernel development package for the target kernel; to
- build against your currently running kernel, you need to have
- the kernel development package corresponding to the running
- image installed (usually if your kernel is named
- linux-VERSION, the development package is called
- linux-dev-VERSION or linux-headers-VERSION).
- + GNU C Compiler, make
-
-2. Compilation and installation
-===============================
-
-2.1. Compilation of the drivers included in the kernel
-------------------------------------------------------
-
- Configure the kernel; to enable the WiMAX drivers select Drivers >
- Networking Drivers > WiMAX device support. Enable all of them as
- modules (easier).
-
- If USB or SDIO are not enabled in the kernel configuration, the options
- to build the i2400m USB or SDIO drivers will not show. Enable said
- subsystems and go back to the WiMAX menu to enable the drivers.
-
- Compile and install your kernel as usual.
-
-2.2. Compilation of the drivers distributed as an standalone module
--------------------------------------------------------------------
-
- To compile::
-
- $ cd source/directory
- $ make
-
- Once built you can load and unload using the provided load.sh script;
- load.sh will load the modules, load.sh u will unload them.
-
- To install in the default kernel directories (and enable auto loading
- when the device is plugged)::
-
- $ make install
- $ depmod -a
-
- If your kernel development files are located in a non standard
- directory or if you want to build for a kernel that is not the
- currently running one, set KDIR to the right location::
-
- $ make KDIR=/path/to/kernel/dev/tree
-
- For more information, please contact linux-wimax@intel.com.
-
-3. Installing the firmware
---------------------------
-
- The firmware can be obtained from http://linuxwimax.org or might have
- been supplied with your hardware.
-
- It has to be installed in the target system::
-
- $ cp FIRMWAREFILE.sbcf /lib/firmware/i2400m-fw-BUSTYPE-1.3.sbcf
-
- * NOTE: if your firmware came in an .rpm or .deb file, just install
- it as normal, with the rpm (rpm -i FIRMWARE.rpm) or dpkg
- (dpkg -i FIRMWARE.deb) commands. No further action is needed.
- * BUSTYPE will be usb or sdio, depending on the hardware you have.
- Each hardware type comes with its own firmware and will not work
- with other types.
-
-4. Design
-=========
-
- This package contains two major parts: a WiMAX kernel stack and a
- driver for the Intel i2400m.
-
- The WiMAX stack is designed to provide for common WiMAX control
- services to current and future WiMAX devices from any vendor; please
- see README.wimax for details.
-
- The i2400m kernel driver is broken up in two main parts: the bus
- generic driver and the bus-specific drivers. The bus generic driver
- forms the drivercore and contain no knowledge of the actual method we
- use to connect to the device. The bus specific drivers are just the
- glue to connect the bus-generic driver and the device. Currently only
- USB and SDIO are supported. See drivers/net/wimax/i2400m/i2400m.h for
- more information.
-
- The bus generic driver is logically broken up in two parts: OS-glue and
- hardware-glue. The OS-glue interfaces with Linux. The hardware-glue
- interfaces with the device on using an interface provided by the
- bus-specific driver. The reason for this breakup is to be able to
- easily reuse the hardware-glue to write drivers for other OSes; note
- the hardware glue part is written as a native Linux driver; no
- abstraction layers are used, so to port to another OS, the Linux kernel
- API calls should be replaced with the target OS's.
-
-5. Usage
-========
-
- To load the driver, follow the instructions in the install section;
- once the driver is loaded, plug in the device (unless it is permanently
- plugged in). The driver will enumerate the device, upload the firmware
- and output messages in the kernel log (dmesg, /var/log/messages or
- /var/log/kern.log) such as::
-
- ...
- i2400m_usb 5-4:1.0: firmware interface version 8.0.0
- i2400m_usb 5-4:1.0: WiMAX interface wmx0 (00:1d:e1:01:94:2c) ready
-
- At this point the device is ready to work.
-
- Current versions require the Intel WiMAX Network Service in userspace
- to make things work. See the network service's README for instructions
- on how to scan, connect and disconnect.
-
-5.1. Module parameters
-----------------------
-
- Module parameters can be set at kernel or module load time or by
- echoing values::
-
- $ echo VALUE > /sys/module/MODULENAME/parameters/PARAMETERNAME
-
- To make changes permanent, for example, for the i2400m module, you can
- also create a file named /etc/modprobe.d/i2400m containing::
-
- options i2400m idle_mode_disabled=1
-
- To find which parameters are supported by a module, run::
-
- $ modinfo path/to/module.ko
-
- During kernel bootup (if the driver is linked in the kernel), specify
- the following to the kernel command line::
-
- i2400m.PARAMETER=VALUE
-
-5.1.1. i2400m: idle_mode_disabled
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- The i2400m module supports a parameter to disable idle mode. This
- parameter, once set, will take effect only when the device is
- reinitialized by the driver (eg: following a reset or a reconnect).
-
-5.2. Debug operations: debugfs entries
---------------------------------------
-
- The driver will register debugfs entries that allow the user to tweak
- debug settings. There are three main container directories where
- entries are placed, which correspond to the three blocks a i2400m WiMAX
- driver has:
-
- * /sys/kernel/debug/wimax:DEVNAME/ for the generic WiMAX stack
- controls
- * /sys/kernel/debug/wimax:DEVNAME/i2400m for the i2400m generic
- driver controls
- * /sys/kernel/debug/wimax:DEVNAME/i2400m-usb (or -sdio) for the
- bus-specific i2400m-usb or i2400m-sdio controls).
-
- Of course, if debugfs is mounted in a directory other than
- /sys/kernel/debug, those paths will change.
-
-5.2.1. Increasing debug output
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- The files named *dl_* indicate knobs for controlling the debug output
- of different submodules::
-
- # find /sys/kernel/debug/wimax\:wmx0 -name \*dl_\*
- /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_tx
- /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_rx
- /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_notif
- /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_fw
- /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_usb
- /sys/kernel/debug/wimax:wmx0/i2400m/dl_tx
- /sys/kernel/debug/wimax:wmx0/i2400m/dl_rx
- /sys/kernel/debug/wimax:wmx0/i2400m/dl_rfkill
- /sys/kernel/debug/wimax:wmx0/i2400m/dl_netdev
- /sys/kernel/debug/wimax:wmx0/i2400m/dl_fw
- /sys/kernel/debug/wimax:wmx0/i2400m/dl_debugfs
- /sys/kernel/debug/wimax:wmx0/i2400m/dl_driver
- /sys/kernel/debug/wimax:wmx0/i2400m/dl_control
- /sys/kernel/debug/wimax:wmx0/wimax_dl_stack
- /sys/kernel/debug/wimax:wmx0/wimax_dl_op_rfkill
- /sys/kernel/debug/wimax:wmx0/wimax_dl_op_reset
- /sys/kernel/debug/wimax:wmx0/wimax_dl_op_msg
- /sys/kernel/debug/wimax:wmx0/wimax_dl_id_table
- /sys/kernel/debug/wimax:wmx0/wimax_dl_debugfs
-
- By reading the file you can obtain the current value of said debug
- level; by writing to it, you can set it.
-
- To increase the debug level of, for example, the i2400m's generic TX
- engine, just write::
-
- $ echo 3 > /sys/kernel/debug/wimax:wmx0/i2400m/dl_tx
-
- Increasing numbers yield increasing debug information; for details of
- what is printed and the available levels, check the source. The code
- uses 0 for disabled and increasing values until 8.
-
-5.2.2. RX and TX statistics
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- The i2400m/rx_stats and i2400m/tx_stats provide statistics about the
- data reception/delivery from the device::
-
- $ cat /sys/kernel/debug/wimax:wmx0/i2400m/rx_stats
- 45 1 3 34 3104 48 480
-
- The numbers reported are:
-
- * packets/RX-buffer: total, min, max
- * RX-buffers: total RX buffers received, accumulated RX buffer size
- in bytes, min size received, max size received
-
- Thus, to find the average buffer size received, divide accumulated
- RX-buffer / total RX-buffers.
-
- To clear the statistics back to 0, write anything to the rx_stats file::
-
- $ echo 1 > /sys/kernel/debug/wimax:wmx0/i2400m_rx_stats
-
- Likewise for TX.
-
- Note the packets this debug file refers to are not network packet, but
- packets in the sense of the device-specific protocol for communication
- to the host. See drivers/net/wimax/i2400m/tx.c.
-
-5.2.3. Tracing messages received from user space
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- To echo messages received from user space into the trace pipe that the
- i2400m driver creates, set the debug file i2400m/trace_msg_from_user to
- 1::
-
- $ echo 1 > /sys/kernel/debug/wimax:wmx0/i2400m/trace_msg_from_user
-
-5.2.4. Performing a device reset
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- By writing a 0, a 1 or a 2 to the file
- /sys/kernel/debug/wimax:wmx0/reset, the driver performs a warm (without
- disconnecting from the bus), cold (disconnecting from the bus) or bus
- (bus specific) reset on the device.
-
-5.2.5. Asking the device to enter power saving mode
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- By writing any value to the /sys/kernel/debug/wimax:wmx0 file, the
- device will attempt to enter power saving mode.
-
-6. Troubleshooting
-==================
-
-6.1. Driver complains about ``i2400m-fw-usb-1.2.sbcf: request failed``
-----------------------------------------------------------------------
-
- If upon connecting the device, the following is output in the kernel
- log::
-
- i2400m_usb 5-4:1.0: fw i2400m-fw-usb-1.3.sbcf: request failed: -2
-
- This means that the driver cannot locate the firmware file named
- /lib/firmware/i2400m-fw-usb-1.2.sbcf. Check that the file is present in
- the right location.
diff --git a/Documentation/admin-guide/wimax/index.rst b/Documentation/admin-guide/wimax/index.rst
deleted file mode 100644
index fdf7c1f99ff5..000000000000
--- a/Documentation/admin-guide/wimax/index.rst
+++ /dev/null
@@ -1,19 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===============
-WiMAX subsystem
-===============
-
-.. toctree::
- :maxdepth: 2
-
- wimax
-
- i2400m
-
-.. only:: subproject and html
-
- Indices
- =======
-
- * :ref:`genindex`
diff --git a/Documentation/admin-guide/wimax/wimax.rst b/Documentation/admin-guide/wimax/wimax.rst
deleted file mode 100644
index 817ee8ba2732..000000000000
--- a/Documentation/admin-guide/wimax/wimax.rst
+++ /dev/null
@@ -1,89 +0,0 @@
-.. include:: <isonum.txt>
-
-========================
-Linux kernel WiMAX stack
-========================
-
-:Copyright: |copy| 2008 Intel Corporation < linux-wimax@intel.com >
-
- This provides a basic Linux kernel WiMAX stack to provide a common
- control API for WiMAX devices, usable from kernel and user space.
-
-1. Design
-=========
-
- The WiMAX stack is designed to provide for common WiMAX control
- services to current and future WiMAX devices from any vendor.
-
- Because currently there is only one and we don't know what would be the
- common services, the APIs it currently provides are very minimal.
- However, it is done in such a way that it is easily extensible to
- accommodate future requirements.
-
- The stack works by embedding a struct wimax_dev in your device's
- control structures. This provides a set of callbacks that the WiMAX
- stack will call in order to implement control operations requested by
- the user. As well, the stack provides API functions that the driver
- calls to notify about changes of state in the device.
-
- The stack exports the API calls needed to control the device to user
- space using generic netlink as a marshalling mechanism. You can access
- them using your own code or use the wrappers provided for your
- convenience in libwimax (in the wimax-tools package).
-
- For detailed information on the stack, please see
- include/linux/wimax.h.
-
-2. Usage
-========
-
- For usage in a driver (registration, API, etc) please refer to the
- instructions in the header file include/linux/wimax.h.
-
- When a device is registered with the WiMAX stack, a set of debugfs
- files will appear in /sys/kernel/debug/wimax:wmxX can tweak for
- control.
-
-2.1. Obtaining debug information: debugfs entries
--------------------------------------------------
-
- The WiMAX stack is compiled, by default, with debug messages that can
- be used to diagnose issues. By default, said messages are disabled.
-
- The drivers will register debugfs entries that allow the user to tweak
- debug settings.
-
- Each driver, when registering with the stack, will cause a debugfs
- directory named wimax:DEVICENAME to be created; optionally, it might
- create more subentries below it.
-
-2.1.1. Increasing debug output
-------------------------------
-
- The files named *dl_* indicate knobs for controlling the debug output
- of different submodules of the WiMAX stack::
-
- # find /sys/kernel/debug/wimax\:wmx0 -name \*dl_\*
- /sys/kernel/debug/wimax:wmx0/wimax_dl_stack
- /sys/kernel/debug/wimax:wmx0/wimax_dl_op_rfkill
- /sys/kernel/debug/wimax:wmx0/wimax_dl_op_reset
- /sys/kernel/debug/wimax:wmx0/wimax_dl_op_msg
- /sys/kernel/debug/wimax:wmx0/wimax_dl_id_table
- /sys/kernel/debug/wimax:wmx0/wimax_dl_debugfs
- /sys/kernel/debug/wimax:wmx0/.... # other driver specific files
-
- NOTE:
- Of course, if debugfs is mounted in a directory other than
- /sys/kernel/debug, those paths will change.
-
- By reading the file you can obtain the current value of said debug
- level; by writing to it, you can set it.
-
- To increase the debug level of, for example, the id-table submodule,
- just write:
-
- $ echo 3 > /sys/kernel/debug/wimax:wmx0/wimax_dl_id_table
-
- Increasing numbers yield increasing debug information; for details of
- what is printed and the available levels, check the source. The code
- uses 0 for disabled and increasing values until 8.
diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst
new file mode 100644
index 000000000000..b2e254ec8ee8
--- /dev/null
+++ b/Documentation/admin-guide/workload-tracing.rst
@@ -0,0 +1,606 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+
+======================================================
+Discovering Linux kernel subsystems used by a workload
+======================================================
+
+:Authors: - Shuah Khan <skhan@linuxfoundation.org>
+ - Shefali Sharma <sshefali021@gmail.com>
+:maintained-by: Shuah Khan <skhan@linuxfoundation.org>
+
+Key Points
+==========
+
+ * Understanding system resources necessary to build and run a workload
+ is important.
+ * Linux tracing and strace can be used to discover the system resources
+ in use by a workload. The completeness of the system usage information
+ depends on the completeness of coverage of a workload.
+ * Performance and security of the operating system can be analyzed with
+ the help of tools such as:
+ `perf <https://man7.org/linux/man-pages/man1/perf.1.html>`_,
+ `stress-ng <https://www.mankier.com/1/stress-ng>`_,
+ `paxtest <https://github.com/opntr/paxtest-freebsd>`_.
+ * Once we discover and understand the needs of a workload, we can focus on
+ them to avoid regressions and use this knowledge to evaluate safety
+ considerations.
+
+Methodology
+===========
+
+`strace <https://man7.org/linux/man-pages/man1/strace.1.html>`_ is a
+diagnostic, instructional, and debugging tool and can be used to discover
+the system resources in use by a workload. Once we discover and understand
+the needs of a workload, we can focus on them to avoid regressions and use
+this knowledge to evaluate safety considerations. We use the strace tool to
+trace workloads.
+
+This method of tracing with strace tells us only the system calls that the
+workload actually invoked during the traced run, not all the system calls it
+could invoke. It likewise tells us just the code paths within these system
+calls that were exercised. As an example, if a workload opens a file and reads
+from it successfully, then the success path is the one that is traced; any
+error paths in that system call will not be. If a test run provides full
+coverage of a workload, then the method outlined here will trace and find all
+possible code paths. The completeness of the system usage information thus
+depends on the completeness of coverage of a workload.
+
+The goal is tracing a workload on a system running a default kernel without
+requiring custom kernel installs.
+
+How do we gather fine-grained system information?
+=================================================
+
+The strace tool can be used to trace the system calls made by a process and
+the signals it receives. System calls are the fundamental interface between an
+application and the operating system kernel. They enable a program to
+request services from the kernel. For instance, the open() system call in
+Linux is used to provide access to a file in the file system. strace enables
+us to track all the system calls made by an application. It lists all the
+system calls made by a process and their resulting output.
+
+You can generate profiling data by combining strace and perf record to
+record the events and information associated with a process. This provides
+insight into the process. The "perf annotate" tool generates statistics for
+each instruction of the program. This document goes over the details of how
+to gather fine-grained information on a workload's usage of system resources.
+
+We used strace to trace the perf, stress-ng, paxtest workloads to illustrate
+our methodology to discover resources used by a workload. This process can
+be applied to trace other workloads.
+
+Getting the system ready for tracing
+====================================
+
+Before we can get started, we will show you how to get your system ready.
+We assume that you have a Linux distribution running on a physical system
+or a virtual machine. Most distributions include the strace command. Let’s
+install the other tools that aren’t usually included but are needed to build
+the Linux kernel. Please note that the following works on Debian based
+distributions. You might have to find equivalent packages on other Linux
+distributions.
+
+Install the tools needed to build the Linux kernel and the tools in the
+kernel repository. scripts/ver_linux is a good way to check if your system
+already has the necessary tools::
+
+ sudo apt-get install build-essential flex bison
+ sudo apt install libelf-dev systemtap-sdt-dev libaudit-dev libslang2-dev libperl-dev libdw-dev
+
+cscope is a good tool to browse kernel sources. Let's install it now::
+
+ sudo apt-get install cscope
+
+Install stress-ng and paxtest::
+
+ sudo apt-get install stress-ng
+ sudo apt-get install paxtest
+
+Workload overview
+=================
+
+As mentioned earlier, we used strace to trace perf bench, stress-ng and
+paxtest workloads to show how to analyze a workload and identify Linux
+subsystems used by these workloads. Let's start with an overview of these
+three workloads to get a better understanding of what they do and how to
+use them.
+
+perf bench (all) workload
+-------------------------
+
+The perf bench command contains multiple multi-threaded microkernel
+benchmarks that exercise different subsystems of the Linux kernel and its
+system calls. This allows us to easily measure the impact of changes,
+which can help mitigate performance regressions. It also acts as a common
+benchmarking framework, enabling developers to easily create test cases,
+integrate transparently, and use performance-rich tooling subsystems.
+
+Stress-ng netdev stressor workload
+----------------------------------
+
+stress-ng is used for performing stress testing on the kernel. It allows
+you to exercise various physical subsystems of the computer, as well as
+interfaces of the OS kernel, using "stressor-s". They are available for
+CPU, CPU cache, devices, I/O, interrupts, file system, memory, network,
+operating system, pipelines, schedulers, and virtual machines. Please refer
+to the `stress-ng man-page <https://www.mankier.com/1/stress-ng>`_ to
+find the description of all the available stressor-s. The netdev stressor
+starts a specified number (N) of workers that exercise various netdevice
+ioctl commands across all the available network devices.
+
+paxtest kiddie workload
+-----------------------
+
+paxtest is a program that tests buffer overflows in the kernel. It tests
+kernel enforcements over memory usage. Generally, execution in some memory
+segments makes buffer overflows possible. It runs a set of programs that
+attempt to subvert memory usage. It is used as a regression test suite for
+PaX, but might be useful to test other memory protection patches for the
+kernel. We used paxtest kiddie mode which looks for simple vulnerabilities.
+
+What is strace and how do we use it?
+====================================
+
+As mentioned earlier, strace is a useful diagnostic, instructional, and
+debugging tool that can be used to discover the system resources in use
+by a workload. It can be used:
+
+ * To see how a process interacts with the kernel.
+ * To see why a process is failing or hanging.
+ * For reverse engineering a process.
+ * To find the files on which a program depends.
+ * For analyzing the performance of an application.
+ * For troubleshooting various problems related to the operating system.
+
+In addition, strace can generate run-time statistics on times, calls, and
+errors for each system call and report a summary when the program exits,
+suppressing the regular output. This attempts to show system time (CPU time
+spent running in the kernel) independent of wall clock time. We plan to use
+these features to get information on workload system usage.
+
+The strace command supports basic, verbose, and stats modes. When run in
+verbose mode, strace gives more detailed information about the system calls
+invoked by a process.
+
+Running strace -c generates a report of the percentage of time spent in each
+system call, the total time in seconds, the microseconds per call, the total
+number of calls, the count of each system call that has failed with an error
+and the type of system call made.
+
+ * Usage: strace <command we want to trace>
+ * Verbose mode usage: strace -v <command>
+ * Gather statistics: strace -c <command>
+
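+For example, gathering per-system-call statistics for a simple command could
+look like this, with "ls /tmp" merely a stand-in for any command of interest::
+
+ strace -c ls /tmp
+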
+We used the “-c” option to gather fine-grained run-time statistics for the
+three workloads we chose for this analysis:
+
+ * perf
+ * stress-ng
+ * paxtest
+
+What is cscope and how do we use it?
+====================================
+
+Now let’s look at `cscope <https://cscope.sourceforge.net/>`_, a command
+line tool for browsing C, C++ or Java code-bases. We can use it to find
+all the references to a symbol, global definitions, functions called by a
+function, functions calling a function, text strings, regular expression
+patterns, and files including a file.
+
+We can use cscope to find which system call belongs to which subsystem.
+This way we can find the kernel subsystems used by a process when it is
+executed.
+
+Let’s check out the latest Linux repository and build the cscope database::
+
+ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux
+ cd linux
+ cscope -R -p10 # builds cscope.out database before starting browse session
+ cscope -d -p10 # starts browse session on cscope.out database
+
+Note: Run "cscope -R -p10" to build the database and c"scope -d -p10" to
+enter into the browsing session. cscope by default cscope.out database.
+To get out of this mode press ctrl+d. -p option is used to specify the
+number of file path components to display. -p10 is optimal for browsing
+kernel sources.
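+
+For example, a single line-oriented query (instead of the interactive browse
+session) for the global definition of a symbol might look like this, where
+ksys_read is just an illustrative symbol name::
+
+ cscope -d -L -1 ksys_read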
+
+What is perf and how do we use it?
+==================================
+
+Perf is an analysis tool for Linux 2.6+ systems, which abstracts away CPU
+hardware differences in performance measurement in Linux and provides a
+simple command line interface. Perf is based on the perf_events interface
+exported by the kernel. It is very useful for profiling the system and
+finding performance bottlenecks in an application.
+
+If you haven't already checked out the Linux mainline repository, you can do
+so and then build kernel and perf tool::
+
+ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux
+ cd linux
+ make -j3 all
+ cd tools/perf
+ make
+
+Note: The perf command can be built without building the kernel in the
+repository and can be run on older kernels. However, matching the kernel
+and perf revisions gives more accurate information on the subsystem usage.
+
+We used "perf stat" and "perf bench" options. For a detailed information on
+the perf tool, run "perf -h".
+
+perf stat
+---------
+The perf stat command generates a report of various hardware and software
+events. It does so with the help of hardware counter registers found in
+modern CPUs that keep the count of these activities. "perf stat cal" shows
+stats for the cal command.
+
+Perf bench
+----------
+The perf bench command contains multiple multi-threaded microkernel
+benchmarks that exercise different subsystems of the Linux kernel and its
+system calls. This allows us to easily measure the impact of changes,
+which can help mitigate performance regressions. It also acts as a common
+benchmarking framework, enabling developers to easily create test cases,
+integrate transparently, and use performance-rich tooling.
+
+"perf bench all" command runs the following benchmarks:
+
+ * sched/messaging
+ * sched/pipe
+ * syscall/basic
+ * mem/memcpy
+ * mem/memset
+
+What is stress-ng and how do we use it?
+=======================================
+
+As mentioned earlier, stress-ng is used for performing stress testing on
+the kernel. It allows you to exercise various physical subsystems of the
+computer, as well as interfaces of the OS kernel, using stressor-s. They
+are available for CPU, CPU cache, devices, I/O, interrupts, file system,
+memory, network, operating system, pipelines, schedulers, and virtual
+machines.
+
+The netdev stressor starts N workers that exercise various netdevice ioctl
+commands across all the available network devices. The following ioctls are
+exercised:
+
+ * SIOCGIFCONF, SIOCGIFINDEX, SIOCGIFNAME, SIOCGIFFLAGS
+ * SIOCGIFADDR, SIOCGIFNETMASK, SIOCGIFMETRIC, SIOCGIFMTU
+ * SIOCGIFHWADDR, SIOCGIFMAP, SIOCGIFTXQLEN
+
+The following command runs the stressor::
+
+ stress-ng --netdev 1 -t 60 --metrics
+
+We can use the perf record command to record the events and information
+associated with a process. This command records the profiling data in the
+perf.data file in the same directory.
+
+Using the following commands, you can record the events associated with the
+netdev stressor, view the generated perf.data report, and annotate it to
+view the statistics of each instruction of the program::
+
+ perf record stress-ng --netdev 1 -t 60 --metrics
+ perf report
+ perf annotate
+
+What is paxtest and how do we use it?
+=====================================
+
+paxtest is a program that tests buffer overflows in the kernel. It tests
+kernel enforcements over memory usage. Generally, execution in some memory
+segments makes buffer overflows possible. It runs a set of programs that
+attempt to subvert memory usage. It is used as a regression test suite for
+PaX, and will be useful to test other memory protection patches for the
+kernel.
+
+paxtest provides kiddie and blackhat modes. The paxtest kiddie mode runs
+in normal mode, whereas the blackhat mode tries to get around the protection
+of the kernel while testing for vulnerabilities. We focus on the kiddie mode
+here and combine a "paxtest kiddie" run with "perf record" to collect CPU stack
+traces for the paxtest kiddie run to see which function is calling other
+functions in the performance profile. Then the "dwarf" (DWARF's Call Frame
+Information) mode can be used to unwind the stack.
+
+The following commands can be used to record the run and view the resulting
+report in call-graph format::
+
+ perf record --call-graph dwarf paxtest kiddie
+ perf report --stdio
+
+Tracing workloads
+=================
+
+Now that we understand the workloads, let's start tracing them.
+
+Tracing perf bench all workload
+-------------------------------
+
+Run the following command to trace perf bench all workload::
+
+ strace -c perf bench all
+
+**System Calls made by the workload**
+
+The below table shows the system calls invoked by the workload, number of
+times each system call is invoked, and the corresponding Linux subsystem.
+
++-------------------+-----------+-----------------+-------------------------+
+| System Call | # calls | Linux Subsystem | System Call (API) |
++===================+===========+=================+=========================+
+| getppid | 10000001 | Process Mgmt. | sys_getppid() |
++-------------------+-----------+-----------------+-------------------------+
+| clone | 1077 | Process Mgmt. | sys_clone() |
++-------------------+-----------+-----------------+-------------------------+
+| prctl | 23 | Process Mgmt. | sys_prctl() |
++-------------------+-----------+-----------------+-------------------------+
+| prlimit64 | 7 | Process Mgmt. | sys_prlimit64() |
++-------------------+-----------+-----------------+-------------------------+
+| getpid | 10 | Process Mgmt. | sys_getpid() |
++-------------------+-----------+-----------------+-------------------------+
+| uname | 3 | Process Mgmt. | sys_uname() |
++-------------------+-----------+-----------------+-------------------------+
+| sysinfo | 1 | Process Mgmt. | sys_sysinfo() |
++-------------------+-----------+-----------------+-------------------------+
+| getuid | 1 | Process Mgmt. | sys_getuid() |
++-------------------+-----------+-----------------+-------------------------+
+| getgid | 1 | Process Mgmt. | sys_getgid() |
++-------------------+-----------+-----------------+-------------------------+
+| geteuid | 1 | Process Mgmt. | sys_geteuid() |
++-------------------+-----------+-----------------+-------------------------+
+| getegid | 1 | Process Mgmt. | sys_getegid() |
++-------------------+-----------+-----------------+-------------------------+
+| close | 49951 | Filesystem | sys_close() |
++-------------------+-----------+-----------------+-------------------------+
+| pipe | 604 | Filesystem | sys_pipe() |
++-------------------+-----------+-----------------+-------------------------+
+| openat | 48560 | Filesystem | sys_openat() |
++-------------------+-----------+-----------------+-------------------------+
+| fstat | 8338 | Filesystem | sys_fstat() |
++-------------------+-----------+-----------------+-------------------------+
+| stat | 1573 | Filesystem | sys_stat() |
++-------------------+-----------+-----------------+-------------------------+
+| pread64 | 9646 | Filesystem | sys_pread64() |
++-------------------+-----------+-----------------+-------------------------+
+| getdents64 | 1873 | Filesystem | sys_getdents64() |
++-------------------+-----------+-----------------+-------------------------+
+| access | 3 | Filesystem | sys_access() |
++-------------------+-----------+-----------------+-------------------------+
+| lstat | 1880 | Filesystem | sys_lstat() |
++-------------------+-----------+-----------------+-------------------------+
+| lseek | 6 | Filesystem | sys_lseek() |
++-------------------+-----------+-----------------+-------------------------+
+| ioctl | 3 | Filesystem | sys_ioctl() |
++-------------------+-----------+-----------------+-------------------------+
+| dup2 | 1 | Filesystem | sys_dup2() |
++-------------------+-----------+-----------------+-------------------------+
+| execve | 2 | Filesystem | sys_execve() |
++-------------------+-----------+-----------------+-------------------------+
+| fcntl | 8779 | Filesystem | sys_fcntl() |
++-------------------+-----------+-----------------+-------------------------+
+| statfs | 1 | Filesystem | sys_statfs() |
++-------------------+-----------+-----------------+-------------------------+
+| epoll_create | 2 | Filesystem | sys_epoll_create() |
++-------------------+-----------+-----------------+-------------------------+
+| epoll_ctl | 64 | Filesystem | sys_epoll_ctl() |
++-------------------+-----------+-----------------+-------------------------+
+| newfstatat | 8318 | Filesystem | sys_newfstatat() |
++-------------------+-----------+-----------------+-------------------------+
+| eventfd2 | 192 | Filesystem | sys_eventfd2() |
++-------------------+-----------+-----------------+-------------------------+
+| mmap | 243 | Memory Mgmt. | sys_mmap() |
++-------------------+-----------+-----------------+-------------------------+
+| mprotect | 32 | Memory Mgmt. | sys_mprotect() |
++-------------------+-----------+-----------------+-------------------------+
+| brk | 21 | Memory Mgmt. | sys_brk() |
++-------------------+-----------+-----------------+-------------------------+
+| munmap | 128 | Memory Mgmt. | sys_munmap() |
++-------------------+-----------+-----------------+-------------------------+
+| set_mempolicy | 156 | Memory Mgmt. | sys_set_mempolicy() |
++-------------------+-----------+-----------------+-------------------------+
+| set_tid_address | 1 | Process Mgmt. | sys_set_tid_address() |
++-------------------+-----------+-----------------+-------------------------+
+| set_robust_list | 1 | Futex | sys_set_robust_list() |
++-------------------+-----------+-----------------+-------------------------+
+| futex | 341 | Futex | sys_futex() |
++-------------------+-----------+-----------------+-------------------------+
+| sched_getaffinity | 79 | Scheduler | sys_sched_getaffinity() |
++-------------------+-----------+-----------------+-------------------------+
+| sched_setaffinity | 223 | Scheduler | sys_sched_setaffinity() |
++-------------------+-----------+-----------------+-------------------------+
+| socketpair | 202 | Network | sys_socketpair() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigprocmask | 21 | Signal | sys_rt_sigprocmask() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigaction | 36 | Signal | sys_rt_sigaction() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigreturn | 2 | Signal | sys_rt_sigreturn() |
++-------------------+-----------+-----------------+-------------------------+
+| wait4 | 889 | Time | sys_wait4() |
++-------------------+-----------+-----------------+-------------------------+
+| clock_nanosleep | 37 | Time | sys_clock_nanosleep() |
++-------------------+-----------+-----------------+-------------------------+
+| capget | 4 | Capability | sys_capget() |
++-------------------+-----------+-----------------+-------------------------+
+
+Tracing stress-ng netdev stressor workload
+------------------------------------------
+
+Run the following command to trace stress-ng netdev stressor workload::
+
+ strace -c stress-ng --netdev 1 -t 60 --metrics
+
+**System Calls made by the workload**
+
+The below table shows the system calls invoked by the workload, number of
+times each system call is invoked, and the corresponding Linux subsystem.
+
++-------------------+-----------+-----------------+-------------------------+
+| System Call | # calls | Linux Subsystem | System Call (API) |
++===================+===========+=================+=========================+
+| openat | 74 | Filesystem | sys_openat() |
++-------------------+-----------+-----------------+-------------------------+
+| close | 75 | Filesystem | sys_close() |
++-------------------+-----------+-----------------+-------------------------+
+| read | 58 | Filesystem | sys_read() |
++-------------------+-----------+-----------------+-------------------------+
+| fstat | 20 | Filesystem | sys_fstat() |
++-------------------+-----------+-----------------+-------------------------+
+| flock | 10 | Filesystem | sys_flock() |
++-------------------+-----------+-----------------+-------------------------+
+| write | 7 | Filesystem | sys_write() |
++-------------------+-----------+-----------------+-------------------------+
+| getdents64 | 8 | Filesystem | sys_getdents64() |
++-------------------+-----------+-----------------+-------------------------+
+| pread64 | 8 | Filesystem | sys_pread64() |
++-------------------+-----------+-----------------+-------------------------+
+| lseek | 1 | Filesystem | sys_lseek() |
++-------------------+-----------+-----------------+-------------------------+
+| access | 2 | Filesystem | sys_access() |
++-------------------+-----------+-----------------+-------------------------+
+| getcwd | 1 | Filesystem | sys_getcwd() |
++-------------------+-----------+-----------------+-------------------------+
+| execve | 1 | Filesystem | sys_execve() |
++-------------------+-----------+-----------------+-------------------------+
+| mmap | 61 | Memory Mgmt. | sys_mmap() |
++-------------------+-----------+-----------------+-------------------------+
+| munmap | 3 | Memory Mgmt. | sys_munmap() |
++-------------------+-----------+-----------------+-------------------------+
+| mprotect | 20 | Memory Mgmt. | sys_mprotect() |
++-------------------+-----------+-----------------+-------------------------+
+| mlock | 2 | Memory Mgmt. | sys_mlock() |
++-------------------+-----------+-----------------+-------------------------+
+| brk | 3 | Memory Mgmt. | sys_brk() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigaction | 21 | Signal | sys_rt_sigaction() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigprocmask | 1 | Signal | sys_rt_sigprocmask() |
++-------------------+-----------+-----------------+-------------------------+
+| sigaltstack | 1 | Signal | sys_sigaltstack() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigreturn | 1 | Signal | sys_rt_sigreturn() |
++-------------------+-----------+-----------------+-------------------------+
+| getpid | 8 | Process Mgmt. | sys_getpid() |
++-------------------+-----------+-----------------+-------------------------+
+| prlimit64 | 5 | Process Mgmt. | sys_prlimit64() |
++-------------------+-----------+-----------------+-------------------------+
+| arch_prctl | 2 | Process Mgmt. | sys_arch_prctl() |
++-------------------+-----------+-----------------+-------------------------+
+| sysinfo | 2 | Process Mgmt. | sys_sysinfo() |
++-------------------+-----------+-----------------+-------------------------+
+| getuid | 2 | Process Mgmt. | sys_getuid() |
++-------------------+-----------+-----------------+-------------------------+
+| uname | 1 | Process Mgmt. | sys_uname() |
++-------------------+-----------+-----------------+-------------------------+
+| setpgid | 1 | Process Mgmt. | sys_setpgid() |
++-------------------+-----------+-----------------+-------------------------+
+| getrusage | 1 | Process Mgmt. | sys_getrusage() |
++-------------------+-----------+-----------------+-------------------------+
+| geteuid | 1 | Process Mgmt. | sys_geteuid() |
++-------------------+-----------+-----------------+-------------------------+
+| getppid | 1 | Process Mgmt. | sys_getppid() |
++-------------------+-----------+-----------------+-------------------------+
+| sendto | 3 | Network | sys_sendto() |
++-------------------+-----------+-----------------+-------------------------+
+| connect | 1 | Network | sys_connect() |
++-------------------+-----------+-----------------+-------------------------+
+| socket | 1 | Network | sys_socket() |
++-------------------+-----------+-----------------+-------------------------+
+| clone | 1 | Process Mgmt. | sys_clone() |
++-------------------+-----------+-----------------+-------------------------+
+| set_tid_address | 1 | Process Mgmt. | sys_set_tid_address() |
++-------------------+-----------+-----------------+-------------------------+
+| wait4 | 2 | Time | sys_wait4() |
++-------------------+-----------+-----------------+-------------------------+
+| alarm | 1 | Time | sys_alarm() |
++-------------------+-----------+-----------------+-------------------------+
+| set_robust_list | 1 | Futex | sys_set_robust_list() |
++-------------------+-----------+-----------------+-------------------------+
+
+Tracing paxtest kiddie workload
+-------------------------------
+
+Run the following command to trace paxtest kiddie workload::
+
+ strace -c paxtest kiddie
+
+**System Calls made by the workload**
+
+The below table shows the system calls invoked by the workload, number of
+times each system call is invoked, and the corresponding Linux subsystem.
+
++-------------------+-----------+-----------------+----------------------+
+| System Call | # calls | Linux Subsystem | System Call (API) |
++===================+===========+=================+======================+
+| read | 3 | Filesystem | sys_read() |
++-------------------+-----------+-----------------+----------------------+
+| write | 11 | Filesystem | sys_write() |
++-------------------+-----------+-----------------+----------------------+
+| close | 41 | Filesystem | sys_close() |
++-------------------+-----------+-----------------+----------------------+
+| stat | 24 | Filesystem | sys_stat() |
++-------------------+-----------+-----------------+----------------------+
+| fstat | 2 | Filesystem | sys_fstat() |
++-------------------+-----------+-----------------+----------------------+
+| pread64 | 6 | Filesystem | sys_pread64() |
++-------------------+-----------+-----------------+----------------------+
+| access | 1 | Filesystem | sys_access() |
++-------------------+-----------+-----------------+----------------------+
+| pipe | 1 | Filesystem | sys_pipe() |
++-------------------+-----------+-----------------+----------------------+
+| dup2 | 24 | Filesystem | sys_dup2() |
++-------------------+-----------+-----------------+----------------------+
+| execve | 1 | Filesystem | sys_execve() |
++-------------------+-----------+-----------------+----------------------+
+| fcntl | 26 | Filesystem | sys_fcntl() |
++-------------------+-----------+-----------------+----------------------+
+| openat | 14 | Filesystem | sys_openat() |
++-------------------+-----------+-----------------+----------------------+
+| rt_sigaction | 7 | Signal | sys_rt_sigaction() |
++-------------------+-----------+-----------------+----------------------+
+| rt_sigreturn | 38 | Signal | sys_rt_sigreturn() |
++-------------------+-----------+-----------------+----------------------+
+| clone | 38 | Process Mgmt. | sys_clone() |
++-------------------+-----------+-----------------+----------------------+
+| wait4 | 44 | Time | sys_wait4() |
++-------------------+-----------+-----------------+----------------------+
+| mmap | 7 | Memory Mgmt. | sys_mmap() |
++-------------------+-----------+-----------------+----------------------+
+| mprotect | 3 | Memory Mgmt. | sys_mprotect() |
++-------------------+-----------+-----------------+----------------------+
+| munmap | 1 | Memory Mgmt. | sys_munmap() |
++-------------------+-----------+-----------------+----------------------+
+| brk | 3 | Memory Mgmt. | sys_brk() |
++-------------------+-----------+-----------------+----------------------+
+| getpid | 1 | Process Mgmt. | sys_getpid() |
++-------------------+-----------+-----------------+----------------------+
+| getuid | 1 | Process Mgmt. | sys_getuid() |
++-------------------+-----------+-----------------+----------------------+
+| getgid | 1 | Process Mgmt. | sys_getgid() |
++-------------------+-----------+-----------------+----------------------+
+| geteuid | 2 | Process Mgmt. | sys_geteuid() |
++-------------------+-----------+-----------------+----------------------+
+| getegid | 1 | Process Mgmt. | sys_getegid() |
++-------------------+-----------+-----------------+----------------------+
+| getppid | 1 | Process Mgmt. | sys_getppid() |
++-------------------+-----------+-----------------+----------------------+
+| arch_prctl | 2 | Process Mgmt. | sys_arch_prctl() |
++-------------------+-----------+-----------------+----------------------+
+
+Conclusion
+==========
+
+This document is intended to be used as a guide on how to gather fine-grained
+information on the resources in use by workloads using strace.
+
+References
+==========
+
+ * `Discovery Linux Kernel Subsystems used by OpenAPS <https://elisa.tech/blog/2022/02/02/discovery-linux-kernel-subsystems-used-by-openaps>`_
+ * `ELISA-White-Papers-Discovering Linux kernel subsystems used by a workload <https://github.com/elisa-tech/ELISA-White-Papers/blob/master/Processes/Discovering_Linux_kernel_subsystems_used_by_a_workload.md>`_
+ * `strace <https://man7.org/linux/man-pages/man1/strace.1.html>`_
+ * `perf <https://man7.org/linux/man-pages/man1/perf.1.html>`_
+ * `paxtest README <https://github.com/opntr/paxtest-freebsd/blob/hardenedbsd/0.9.14-hbsd/README>`_
+ * `stress-ng <https://www.mankier.com/1/stress-ng>`_
+ * `Monitoring and managing system status and performance <https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance/index>`_
diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index ad911be5b5e9..b67772cf36d6 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -133,7 +133,7 @@ When mounting an XFS filesystem, the following options are accepted.
logbsize must be an integer multiple of the log
stripe unit configured at **mkfs(8)** time.
- The default value for for version 1 logs is 32768, while the
+ The default value for version 1 logs is 32768, while the
default value for version 2 logs is MAX(32768, log_sunit).
logdev=device and rtdev=device
@@ -192,7 +192,7 @@ When mounting an XFS filesystem, the following options are accepted.
are any integer multiple of a valid ``sunit`` value.
Typically the only time these mount options are necessary if
- after an underlying RAID device has had it's geometry
+ after an underlying RAID device has had its geometry
modified, such as adding a new disk to a RAID5 lun and
reshaping it.
@@ -210,14 +210,40 @@ When mounting an XFS filesystem, the following options are accepted.
inconsistent namespace presentation during or after a
failover event.
+Deprecation of V4 Format
+========================
+
+The V4 filesystem format lacks certain features that are supported by
+the V5 format, such as metadata checksumming, strengthened metadata
+verification, and the ability to store timestamps past the year 2038.
+Because of this, the V4 format is deprecated. All users should upgrade
+by backing up their files, reformatting, and restoring from the backup.
+
+Administrators and users can detect a V4 filesystem by running xfs_info
+against a filesystem mountpoint and checking for a string containing
+"crc=". If no such string is found, please upgrade xfsprogs to the
+latest version and try again.
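+
+For example (with /mnt/data merely a placeholder for your own mountpoint), a
+V4 filesystem will typically report "crc=0" here::
+
+ xfs_info /mnt/data | grep 'crc='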
+
+The deprecation will take place in two parts. Support for mounting V4
+filesystems can now be disabled at kernel build time via a Kconfig option.
+The option will default to yes until September 2025, at which time it
+will be changed to default to no. In September 2030, support will be
+removed from the codebase entirely.
+
+Note: Distributors may choose to withdraw V4 format support earlier than
+the dates listed above.
Deprecated Mount Options
========================
-=========================== ================
+============================ ================
Name Removal Schedule
-=========================== ================
-=========================== ================
+============================ ================
+Mounting with V4 filesystem September 2030
+Mounting ascii-ci filesystem September 2030
+ikeep/noikeep September 2025
+attr2/noattr2 September 2025
+============================ ================
Removed Mount Options
@@ -259,6 +285,9 @@ The following sysctls are available for the XFS filesystem:
removes unused preallocation from clean inodes and releases
the unused space back to the free pool.
+ fs.xfs.speculative_cow_prealloc_lifetime
+ This is an alias for speculative_prealloc_lifetime.
+
fs.xfs.error_level (Min: 0 Default: 3 Max: 11)
A volume knob for error reporting when internal errors occur.
This will generate detailed messages & backtraces for filesystem
@@ -268,7 +297,7 @@ The following sysctls are available for the XFS filesystem:
XFS_ERRLEVEL_LOW: 1
XFS_ERRLEVEL_HIGH: 5
- fs.xfs.panic_mask (Min: 0 Default: 0 Max: 256)
+ fs.xfs.panic_mask (Min: 0 Default: 0 Max: 511)
Causes certain error conditions to call BUG(). Value is a bitmask;
OR together the tags which represent errors which should cause panics:
@@ -331,7 +360,13 @@ The following sysctls are available for the XFS filesystem:
Deprecated Sysctls
==================
-None at present.
+=========================================== ================
+ Name Removal Schedule
+=========================================== ================
+fs.xfs.irix_sgid_inherit September 2025
+fs.xfs.irix_symlink_mode September 2025
+fs.xfs.speculative_cow_prealloc_lifetime September 2025
+=========================================== ================
Removed Sysctls
@@ -465,3 +500,45 @@ the class and error context. For example, the default values for
"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
to "fail immediately" behaviour. This is done because ENODEV is a fatal,
unrecoverable error no matter how many times the metadata IO is retried.
+
+Workqueue Concurrency
+=====================
+
+XFS uses kernel workqueues to parallelize metadata update processes. This
+enables it to take advantage of storage hardware that can service many IO
+operations simultaneously. This interface exposes internal implementation
+details of XFS, and as such is explicitly not part of any userspace API/ABI
+guarantee the kernel may give userspace. These are undocumented features of
+the generic workqueue implementation XFS uses for concurrency, and they are
+provided here purely for diagnostic and tuning purposes and may change at any
+time in the future.
+
+The control knobs for a filesystem's workqueues are organized by task at hand
+and the short name of the data device. They all can be found in:
+
+ /sys/bus/workqueue/devices/${task}!${device}
+
+================ ===========
+ Task Description
+================ ===========
+ xfs_iwalk-$pid Inode scans of the entire filesystem. Currently limited to
+ mount time quotacheck.
+ xfs-gc Background garbage collection of disk space that has been
+ speculatively allocated beyond EOF or for staging copy on
+ write operations.
+================ ===========
+
+For example, the knobs for the quotacheck workqueue for /dev/nvme0n1 would be
+found in /sys/bus/workqueue/devices/xfs_iwalk-1111!nvme0n1/.
+
+The interesting knobs for XFS workqueues are as follows:
+
+============ ===========
+ Knob Description
+============ ===========
+ max_active Maximum number of background threads that can be started to
+ run the work.
+ cpumask CPUs upon which the threads are allowed to run.
+ nice Relative priority of scheduling the threads. These are the
+ same nice levels that can be applied to userspace processes.
+============ ===========
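+
+For example, the number of concurrent background garbage collection workers
+for the hypothetical nvme0n1 filesystem above could be capped like this when
+run as root (a sketch; the exact path depends on your device name)::
+
+ echo 4 > /sys/bus/workqueue/devices/xfs-gc!nvme0n1/max_active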