aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/admin-guide
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r--Documentation/admin-guide/acpi/cppc_sysfs.rst76
-rw-r--r--Documentation/admin-guide/acpi/dsdt-override.rst13
-rw-r--r--Documentation/admin-guide/acpi/index.rst14
-rw-r--r--Documentation/admin-guide/acpi/initrd_table_override.rst115
-rw-r--r--Documentation/admin-guide/acpi/ssdt-overlays.rst180
-rw-r--r--Documentation/admin-guide/cgroup-v2.rst36
-rw-r--r--Documentation/admin-guide/ext4.rst38
-rw-r--r--Documentation/admin-guide/hw-vuln/index.rst13
-rw-r--r--Documentation/admin-guide/hw-vuln/l1tf.rst (renamed from Documentation/admin-guide/l1tf.rst)1
-rw-r--r--Documentation/admin-guide/hw-vuln/mds.rst308
-rw-r--r--Documentation/admin-guide/index.rst7
-rw-r--r--Documentation/admin-guide/kernel-parameters.rst1
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt123
-rw-r--r--Documentation/admin-guide/mm/index.rst1
-rw-r--r--Documentation/admin-guide/mm/numaperf.rst169
-rw-r--r--Documentation/admin-guide/pm/cpufreq.rst18
-rw-r--r--Documentation/admin-guide/pm/cpuidle.rst8
-rw-r--r--Documentation/admin-guide/pm/index.rst2
-rw-r--r--Documentation/admin-guide/pm/intel_epb.rst41
-rw-r--r--Documentation/admin-guide/pm/intel_pstate.rst32
-rw-r--r--Documentation/admin-guide/pm/sleep-states.rst8
-rw-r--r--Documentation/admin-guide/pm/strategies.rst8
-rw-r--r--Documentation/admin-guide/pm/system-wide.rst2
-rw-r--r--Documentation/admin-guide/pm/working-state.rst3
24 files changed, 1177 insertions, 40 deletions
diff --git a/Documentation/admin-guide/acpi/cppc_sysfs.rst b/Documentation/admin-guide/acpi/cppc_sysfs.rst
new file mode 100644
index 000000000000..a4b99afbe331
--- /dev/null
+++ b/Documentation/admin-guide/acpi/cppc_sysfs.rst
@@ -0,0 +1,76 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Collaborative Processor Performance Control (CPPC)
+==================================================
+
+CPPC
+====
+
+CPPC defined in the ACPI spec describes a mechanism for the OS to manage the
+performance of a logical processor on a contigious and abstract performance
+scale. CPPC exposes a set of registers to describe abstract performance scale,
+to request performance levels and to measure per-cpu delivered performance.
+
+For more details on CPPC please refer to the ACPI specification at:
+
+http://uefi.org/specifications
+
+Some of the CPPC registers are exposed via sysfs under::
+
+ /sys/devices/system/cpu/cpuX/acpi_cppc/
+
+for each cpu X::
+
+ $ ls -lR /sys/devices/system/cpu/cpu0/acpi_cppc/
+ /sys/devices/system/cpu/cpu0/acpi_cppc/:
+ total 0
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 feedback_ctrs
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 highest_perf
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_freq
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_nonlinear_perf
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_perf
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_freq
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_perf
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 reference_perf
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 wraparound_time
+
+* highest_perf : Highest performance of this processor (abstract scale).
+* nominal_perf : Highest sustained performance of this processor
+ (abstract scale).
+* lowest_nonlinear_perf : Lowest performance of this processor with nonlinear
+ power savings (abstract scale).
+* lowest_perf : Lowest performance of this processor (abstract scale).
+
+* lowest_freq : CPU frequency corresponding to lowest_perf (in MHz).
+* nominal_freq : CPU frequency corresponding to nominal_perf (in MHz).
+ The above frequencies should only be used to report processor performance in
+ freqency instead of abstract scale. These values should not be used for any
+ functional decisions.
+
+* feedback_ctrs : Includes both Reference and delivered performance counter.
+ Reference counter ticks up proportional to processor's reference performance.
+ Delivered counter ticks up proportional to processor's delivered performance.
+* wraparound_time: Minimum time for the feedback counters to wraparound
+ (seconds).
+* reference_perf : Performance level at which reference performance counter
+ accumulates (abstract scale).
+
+
+Computing Average Delivered Performance
+=======================================
+
+Below describes the steps to compute the average performance delivered by
+taking two different snapshots of feedback counters at time T1 and T2.
+
+ T1: Read feedback_ctrs as fbc_t1
+ Wait or run some workload
+
+ T2: Read feedback_ctrs as fbc_t2
+
+::
+
+ delivered_counter_delta = fbc_t2[del] - fbc_t1[del]
+ reference_counter_delta = fbc_t2[ref] - fbc_t1[ref]
+
+ delivered_perf = (refernce_perf x delivered_counter_delta) / reference_counter_delta
diff --git a/Documentation/admin-guide/acpi/dsdt-override.rst b/Documentation/admin-guide/acpi/dsdt-override.rst
new file mode 100644
index 000000000000..50bd7f194bf4
--- /dev/null
+++ b/Documentation/admin-guide/acpi/dsdt-override.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Overriding DSDT
+===============
+
+Linux supports a method of overriding the BIOS DSDT:
+
+CONFIG_ACPI_CUSTOM_DSDT - builds the image into the kernel.
+
+When to use this method is described in detail on the
+Linux/ACPI home page:
+https://01.org/linux-acpi/documentation/overriding-dsdt
diff --git a/Documentation/admin-guide/acpi/index.rst b/Documentation/admin-guide/acpi/index.rst
new file mode 100644
index 000000000000..4d13eeea1eca
--- /dev/null
+++ b/Documentation/admin-guide/acpi/index.rst
@@ -0,0 +1,14 @@
+============
+ACPI Support
+============
+
+Here we document in detail how to interact with various mechanisms in
+the Linux ACPI support.
+
+.. toctree::
+ :maxdepth: 1
+
+ initrd_table_override
+ dsdt-override
+ ssdt-overlays
+ cppc_sysfs
diff --git a/Documentation/admin-guide/acpi/initrd_table_override.rst b/Documentation/admin-guide/acpi/initrd_table_override.rst
new file mode 100644
index 000000000000..cbd768207631
--- /dev/null
+++ b/Documentation/admin-guide/acpi/initrd_table_override.rst
@@ -0,0 +1,115 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+Upgrading ACPI tables via initrd
+================================
+
+What is this about
+==================
+
+If the ACPI_TABLE_UPGRADE compile option is true, it is possible to
+upgrade the ACPI execution environment that is defined by the ACPI tables
+via upgrading the ACPI tables provided by the BIOS with an instrumented,
+modified, more recent version one, or installing brand new ACPI tables.
+
+When building initrd with kernel in a single image, option
+ACPI_TABLE_OVERRIDE_VIA_BUILTIN_INITRD should also be true for this
+feature to work.
+
+For a full list of ACPI tables that can be upgraded/installed, take a look
+at the char `*table_sigs[MAX_ACPI_SIGNATURE];` definition in
+drivers/acpi/tables.c.
+
+All ACPI tables iasl (Intel's ACPI compiler and disassembler) knows should
+be overridable, except:
+
+ - ACPI_SIG_RSDP (has a signature of 6 bytes)
+ - ACPI_SIG_FACS (does not have an ordinary ACPI table header)
+
+Both could get implemented as well.
+
+
+What is this for
+================
+
+Complain to your platform/BIOS vendor if you find a bug which is so severe
+that a workaround is not accepted in the Linux kernel. And this facility
+allows you to upgrade the buggy tables before your platform/BIOS vendor
+releases an upgraded BIOS binary.
+
+This facility can be used by platform/BIOS vendors to provide a Linux
+compatible environment without modifying the underlying platform firmware.
+
+This facility also provides a powerful feature to easily debug and test
+ACPI BIOS table compatibility with the Linux kernel by modifying old
+platform provided ACPI tables or inserting new ACPI tables.
+
+It can and should be enabled in any kernel because there is no functional
+change with not instrumented initrds.
+
+
+How does it work
+================
+::
+
+ # Extract the machine's ACPI tables:
+ cd /tmp
+ acpidump >acpidump
+ acpixtract -a acpidump
+ # Disassemble, modify and recompile them:
+ iasl -d *.dat
+ # For example add this statement into a _PRT (PCI Routing Table) function
+ # of the DSDT:
+ Store("HELLO WORLD", debug)
+ # And increase the OEM Revision. For example, before modification:
+ DefinitionBlock ("DSDT.aml", "DSDT", 2, "INTEL ", "TEMPLATE", 0x00000000)
+ # After modification:
+ DefinitionBlock ("DSDT.aml", "DSDT", 2, "INTEL ", "TEMPLATE", 0x00000001)
+ iasl -sa dsdt.dsl
+ # Add the raw ACPI tables to an uncompressed cpio archive.
+ # They must be put into a /kernel/firmware/acpi directory inside the cpio
+ # archive. Note that if the table put here matches a platform table
+ # (similar Table Signature, and similar OEMID, and similar OEM Table ID)
+ # with a more recent OEM Revision, the platform table will be upgraded by
+ # this table. If the table put here doesn't match a platform table
+ # (dissimilar Table Signature, or dissimilar OEMID, or dissimilar OEM Table
+ # ID), this table will be appended.
+ mkdir -p kernel/firmware/acpi
+ cp dsdt.aml kernel/firmware/acpi
+ # A maximum of "NR_ACPI_INITRD_TABLES (64)" tables are currently allowed
+ # (see osl.c):
+ iasl -sa facp.dsl
+ iasl -sa ssdt1.dsl
+ cp facp.aml kernel/firmware/acpi
+ cp ssdt1.aml kernel/firmware/acpi
+ # The uncompressed cpio archive must be the first. Other, typically
+ # compressed cpio archives, must be concatenated on top of the uncompressed
+ # one. Following command creates the uncompressed cpio archive and
+ # concatenates the original initrd on top:
+ find kernel | cpio -H newc --create > /boot/instrumented_initrd
+ cat /boot/initrd >>/boot/instrumented_initrd
+ # reboot with increased acpi debug level, e.g. boot params:
+ acpi.debug_level=0x2 acpi.debug_layer=0xFFFFFFFF
+ # and check your syslog:
+ [ 1.268089] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
+ [ 1.272091] [ACPI Debug] String [0x0B] "HELLO WORLD"
+
+iasl is able to disassemble and recompile quite a lot different,
+also static ACPI tables.
+
+
+Where to retrieve userspace tools
+=================================
+
+iasl and acpixtract are part of Intel's ACPICA project:
+http://acpica.org/
+
+and should be packaged by distributions (for example in the acpica package
+on SUSE).
+
+acpidump can be found in Len Browns pmtools:
+ftp://kernel.org/pub/linux/kernel/people/lenb/acpi/utils/pmtools/acpidump
+
+This tool is also part of the acpica package on SUSE.
+Alternatively, used ACPI tables can be retrieved via sysfs in latest kernels:
+/sys/firmware/acpi/tables
diff --git a/Documentation/admin-guide/acpi/ssdt-overlays.rst b/Documentation/admin-guide/acpi/ssdt-overlays.rst
new file mode 100644
index 000000000000..da37455f96c9
--- /dev/null
+++ b/Documentation/admin-guide/acpi/ssdt-overlays.rst
@@ -0,0 +1,180 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+SSDT Overlays
+=============
+
+In order to support ACPI open-ended hardware configurations (e.g. development
+boards) we need a way to augment the ACPI configuration provided by the firmware
+image. A common example is connecting sensors on I2C / SPI buses on development
+boards.
+
+Although this can be accomplished by creating a kernel platform driver or
+recompiling the firmware image with updated ACPI tables, neither is practical:
+the former proliferates board specific kernel code while the latter requires
+access to firmware tools which are often not publicly available.
+
+Because ACPI supports external references in AML code a more practical
+way to augment firmware ACPI configuration is by dynamically loading
+user defined SSDT tables that contain the board specific information.
+
+For example, to enumerate a Bosch BMA222E accelerometer on the I2C bus of the
+Minnowboard MAX development board exposed via the LSE connector [1], the
+following ASL code can be used::
+
+ DefinitionBlock ("minnowmax.aml", "SSDT", 1, "Vendor", "Accel", 0x00000003)
+ {
+ External (\_SB.I2C6, DeviceObj)
+
+ Scope (\_SB.I2C6)
+ {
+ Device (STAC)
+ {
+ Name (_ADR, Zero)
+ Name (_HID, "BMA222E")
+
+ Method (_CRS, 0, Serialized)
+ {
+ Name (RBUF, ResourceTemplate ()
+ {
+ I2cSerialBus (0x0018, ControllerInitiated, 0x00061A80,
+ AddressingMode7Bit, "\\_SB.I2C6", 0x00,
+ ResourceConsumer, ,)
+ GpioInt (Edge, ActiveHigh, Exclusive, PullDown, 0x0000,
+ "\\_SB.GPO2", 0x00, ResourceConsumer, , )
+ { // Pin list
+ 0
+ }
+ })
+ Return (RBUF)
+ }
+ }
+ }
+ }
+
+which can then be compiled to AML binary format::
+
+ $ iasl minnowmax.asl
+
+ Intel ACPI Component Architecture
+ ASL Optimizing Compiler version 20140214-64 [Mar 29 2014]
+ Copyright (c) 2000 - 2014 Intel Corporation
+
+ ASL Input: minnomax.asl - 30 lines, 614 bytes, 7 keywords
+ AML Output: minnowmax.aml - 165 bytes, 6 named objects, 1 executable opcodes
+
+[1] http://wiki.minnowboard.org/MinnowBoard_MAX#Low_Speed_Expansion_Connector_.28Top.29
+
+The resulting AML code can then be loaded by the kernel using one of the methods
+below.
+
+Loading ACPI SSDTs from initrd
+==============================
+
+This option allows loading of user defined SSDTs from initrd and it is useful
+when the system does not support EFI or when there is not enough EFI storage.
+
+It works in a similar way with initrd based ACPI tables override/upgrade: SSDT
+aml code must be placed in the first, uncompressed, initrd under the
+"kernel/firmware/acpi" path. Multiple files can be used and this will translate
+in loading multiple tables. Only SSDT and OEM tables are allowed. See
+initrd_table_override.txt for more details.
+
+Here is an example::
+
+ # Add the raw ACPI tables to an uncompressed cpio archive.
+ # They must be put into a /kernel/firmware/acpi directory inside the
+ # cpio archive.
+ # The uncompressed cpio archive must be the first.
+ # Other, typically compressed cpio archives, must be
+ # concatenated on top of the uncompressed one.
+ mkdir -p kernel/firmware/acpi
+ cp ssdt.aml kernel/firmware/acpi
+
+ # Create the uncompressed cpio archive and concatenate the original initrd
+ # on top:
+ find kernel | cpio -H newc --create > /boot/instrumented_initrd
+ cat /boot/initrd >>/boot/instrumented_initrd
+
+Loading ACPI SSDTs from EFI variables
+=====================================
+
+This is the preferred method, when EFI is supported on the platform, because it
+allows a persistent, OS independent way of storing the user defined SSDTs. There
+is also work underway to implement EFI support for loading user defined SSDTs
+and using this method will make it easier to convert to the EFI loading
+mechanism when that will arrive.
+
+In order to load SSDTs from an EFI variable the efivar_ssdt kernel command line
+parameter can be used. The argument for the option is the variable name to
+use. If there are multiple variables with the same name but with different
+vendor GUIDs, all of them will be loaded.
+
+In order to store the AML code in an EFI variable the efivarfs filesystem can be
+used. It is enabled and mounted by default in /sys/firmware/efi/efivars in all
+recent distribution.
+
+Creating a new file in /sys/firmware/efi/efivars will automatically create a new
+EFI variable. Updating a file in /sys/firmware/efi/efivars will update the EFI
+variable. Please note that the file name needs to be specially formatted as
+"Name-GUID" and that the first 4 bytes in the file (little-endian format)
+represent the attributes of the EFI variable (see EFI_VARIABLE_MASK in
+include/linux/efi.h). Writing to the file must also be done with one write
+operation.
+
+For example, you can use the following bash script to create/update an EFI
+variable with the content from a given file::
+
+ #!/bin/sh -e
+
+ while ! [ -z "$1" ]; do
+ case "$1" in
+ "-f") filename="$2"; shift;;
+ "-g") guid="$2"; shift;;
+ *) name="$1";;
+ esac
+ shift
+ done
+
+ usage()
+ {
+ echo "Syntax: ${0##*/} -f filename [ -g guid ] name"
+ exit 1
+ }
+
+ [ -n "$name" -a -f "$filename" ] || usage
+
+ EFIVARFS="/sys/firmware/efi/efivars"
+
+ [ -d "$EFIVARFS" ] || exit 2
+
+ if stat -tf $EFIVARFS | grep -q -v de5e81e4; then
+ mount -t efivarfs none $EFIVARFS
+ fi
+
+ # try to pick up an existing GUID
+ [ -n "$guid" ] || guid=$(find "$EFIVARFS" -name "$name-*" | head -n1 | cut -f2- -d-)
+
+ # use a randomly generated GUID
+ [ -n "$guid" ] || guid="$(cat /proc/sys/kernel/random/uuid)"
+
+ # efivarfs expects all of the data in one write
+ tmp=$(mktemp)
+ /bin/echo -ne "\007\000\000\000" | cat - $filename > $tmp
+ dd if=$tmp of="$EFIVARFS/$name-$guid" bs=$(stat -c %s $tmp)
+ rm $tmp
+
+Loading ACPI SSDTs from configfs
+================================
+
+This option allows loading of user defined SSDTs from userspace via the configfs
+interface. The CONFIG_ACPI_CONFIGFS option must be select and configfs must be
+mounted. In the following examples, we assume that configfs has been mounted in
+/config.
+
+New tables can be loading by creating new directories in /config/acpi/table/ and
+writing the SSDT aml code in the aml attribute::
+
+ cd /config/acpi/table
+ mkdir my_ssdt
+ cat ~/ssdt.aml > my_ssdt/aml
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 20f92c16ffbf..cf88c1f98270 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -177,6 +177,15 @@ cgroup v2 currently supports the following mount options.
ignored on non-init namespace mounts. Please refer to the
Delegation section for details.
+ memory_localevents
+
+ Only populate memory.events with data for the current cgroup,
+ and not any subtrees. This is legacy behaviour, the default
+ behaviour without this option is to include subtree counts.
+ This option is system wide and can only be set on mount or
+ modified through remount from the init namespace. The mount
+ option is ignored on non-init namespace mounts.
+
Organizing Processes and Threads
--------------------------------
@@ -864,6 +873,8 @@ All cgroup core files are prefixed with "cgroup."
populated
1 if the cgroup or its descendants contains any live
processes; otherwise, 0.
+ frozen
+ 1 if the cgroup is frozen; otherwise, 0.
cgroup.max.descendants
A read-write single value files. The default is "max".
@@ -897,6 +908,31 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
+ cgroup.freeze
+ A read-write single value file which exists on non-root cgroups.
+ Allowed values are "0" and "1". The default is "0".
+
+ Writing "1" to the file causes freezing of the cgroup and all
+ descendant cgroups. This means that all belonging processes will
+ be stopped and will not run until the cgroup will be explicitly
+ unfrozen. Freezing of the cgroup may take some time; when this action
+ is completed, the "frozen" value in the cgroup.events control file
+ will be updated to "1" and the corresponding notification will be
+ issued.
+
+ A cgroup can be frozen either by its own settings, or by settings
+ of any ancestor cgroups. If any of ancestor cgroups is frozen, the
+ cgroup will remain frozen.
+
+ Processes in the frozen cgroup can be killed by a fatal signal.
+ They also can enter and leave a frozen cgroup: either by an explicit
+ move by a user, or if freezing of the cgroup races with fork().
+ If a process is moved to a frozen cgroup, it stops. If a process is
+ moved out of a frozen cgroup, it becomes running.
+
+ Frozen status of a cgroup doesn't affect any cgroup tree operations:
+ it's possible to delete a frozen (and empty) cgroup, as well as
+ create new sub-cgroups.
Controllers
===========
diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
index e506d3dae510..059ddcbe769d 100644
--- a/Documentation/admin-guide/ext4.rst
+++ b/Documentation/admin-guide/ext4.rst
@@ -91,10 +91,48 @@ Currently Available
* large block (up to pagesize) support
* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
the ordering)
+* Case-insensitive file name lookups
[1] Filesystems with a block size of 1k may see a limit imposed by the
directory hash tree having a maximum depth of two.
+case-insensitive file name lookups
+======================================================
+
+The case-insensitive file name lookup feature is supported on a
+per-directory basis, allowing the user to mix case-insensitive and
+case-sensitive directories in the same filesystem. It is enabled by
+flipping the +F inode attribute of an empty directory. The
+case-insensitive string match operation is only defined when we know how
+text in encoded in a byte sequence. For that reason, in order to enable
+case-insensitive directories, the filesystem must have the
+casefold feature, which stores the filesystem-wide encoding
+model used. By default, the charset adopted is the latest version of
+Unicode (12.1.0, by the time of this writing), encoded in the UTF-8
+form. The comparison algorithm is implemented by normalizing the
+strings to the Canonical decomposition form, as defined by Unicode,
+followed by a byte per byte comparison.
+
+The case-awareness is name-preserving on the disk, meaning that the file
+name provided by userspace is a byte-per-byte match to what is actually
+written in the disk. The Unicode normalization format used by the
+kernel is thus an internal representation, and not exposed to the
+userspace nor to the disk, with the important exception of disk hashes,
+used on large case-insensitive directories with DX feature. On DX
+directories, the hash must be calculated using the casefolded version of
+the filename, meaning that the normalization format used actually has an
+impact on where the directory entry is stored.
+
+When we change from viewing filenames as opaque byte sequences to seeing
+them as encoded strings we need to address what happens when a program
+tries to create a file with an invalid name. The Unicode subsystem
+within the kernel leaves the decision of what to do in this case to the
+filesystem, which select its preferred behavior by enabling/disabling
+the strict mode. When Ext4 encounters one of those strings and the
+filesystem did not require strict mode, it falls back to considering the
+entire string as an opaque byte sequence, which still allows the user to
+operate on that file, but the case-insensitive lookups won't work.
+
Options
=======
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
new file mode 100644
index 000000000000..ffc064c1ec68
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -0,0 +1,13 @@
+========================
+Hardware vulnerabilities
+========================
+
+This section describes CPU vulnerabilities and provides an overview of the
+possible mitigations along with guidance for selecting mitigations if they
+are configurable at compile, boot or run time.
+
+.. toctree::
+ :maxdepth: 1
+
+ l1tf
+ mds
diff --git a/Documentation/admin-guide/l1tf.rst b/Documentation/admin-guide/hw-vuln/l1tf.rst
index 9af977384168..31653a9f0e1b 100644
--- a/Documentation/admin-guide/l1tf.rst
+++ b/Documentation/admin-guide/hw-vuln/l1tf.rst
@@ -445,6 +445,7 @@ The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.
+.. _mitigation_selection:
Mitigation selection guide
--------------------------
diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst
new file mode 100644
index 000000000000..e3a796c0d3a2
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/mds.rst
@@ -0,0 +1,308 @@
+MDS - Microarchitectural Data Sampling
+======================================
+
+Microarchitectural Data Sampling is a hardware vulnerability which allows
+unprivileged speculative access to data which is available in various CPU
+internal buffers.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+ - Processors from AMD, Centaur and other non Intel vendors
+
+ - Older processor models, where the CPU family is < 6
+
+ - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus)
+
+ - Intel processors which have the ARCH_CAP_MDS_NO bit set in the
+ IA32_ARCH_CAPABILITIES MSR.
+
+Whether a processor is affected or not can be read out from the MDS
+vulnerability file in sysfs. See :ref:`mds_sys_info`.
+
+Not all processors are affected by all variants of MDS, but the mitigation
+is identical for all of them so the kernel treats them as a single
+vulnerability.
+
+Related CVEs
+------------
+
+The following CVE entries are related to the MDS vulnerability:
+
+ ============== ===== ===================================================
+ CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling
+ CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling
+ CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling
+ CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory
+ ============== ===== ===================================================
+
+Problem
+-------
+
+When performing store, load, L1 refill operations, processors write data
+into temporary microarchitectural structures (buffers). The data in the
+buffer can be forwarded to load operations as an optimization.
+
+Under certain conditions, usually a fault/assist caused by a load
+operation, data unrelated to the load memory address can be speculatively
+forwarded from the buffers. Because the load operation causes a fault or
+assist and its result will be discarded, the forwarded data will not cause
+incorrect program execution or state changes. But a malicious operation
+may be able to forward this speculative data to a disclosure gadget which
+allows in turn to infer the value via a cache side channel attack.
+
+Because the buffers are potentially shared between Hyper-Threads cross
+Hyper-Thread attacks are possible.
+
+Deeper technical information is available in the MDS specific x86
+architecture section: :ref:`Documentation/x86/mds.rst <mds>`.
+
+
+Attack scenarios
+----------------
+
+Attacks against the MDS vulnerabilities can be mounted from malicious non
+priviledged user space applications running on hosts or guest. Malicious
+guest OSes can obviously mount attacks as well.
+
+Contrary to other speculation based vulnerabilities the MDS vulnerability
+does not allow the attacker to control the memory target address. As a
+consequence the attacks are purely sampling based, but as demonstrated with
+the TLBleed attack samples can be postprocessed successfully.
+
+Web-Browsers
+^^^^^^^^^^^^
+
+ It's unclear whether attacks through Web-Browsers are possible at
+ all. The exploitation through Java-Script is considered very unlikely,
+ but other widely used web technologies like Webassembly could possibly be
+ abused.
+
+
+.. _mds_sys_info:
+
+MDS system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current MDS
+status of the system: whether the system is vulnerable, and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/mds
+
+The possible values in this file are:
+
+ .. list-table::
+
+ * - 'Not affected'
+ - The processor is not vulnerable
+ * - 'Vulnerable'
+ - The processor is vulnerable, but no mitigation enabled
+ * - 'Vulnerable: Clear CPU buffers attempted, no microcode'
+ - The processor is vulnerable but microcode is not updated.
+
+ The mitigation is enabled on a best effort basis. See :ref:`vmwerv`
+ * - 'Mitigation: Clear CPU buffers'
+ - The processor is vulnerable and the CPU buffer clearing mitigation is
+ enabled.
+
+If the processor is vulnerable then the following information is appended
+to the above information:
+
+ ======================== ============================================
+ 'SMT vulnerable' SMT is enabled
+ 'SMT mitigated' SMT is enabled and mitigated
+ 'SMT disabled' SMT is disabled
+ 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown
+ ======================== ============================================
+
+.. _vmwerv:
+
+Best effort mitigation mode
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ If the processor is vulnerable, but the availability of the microcode based
+ mitigation mechanism is not advertised via CPUID the kernel selects a best
+ effort mitigation mode. This mode invokes the mitigation instructions
+ without a guarantee that they clear the CPU buffers.
+
+ This is done to address virtualization scenarios where the host has the
+ microcode update applied, but the hypervisor is not yet updated to expose
+ the CPUID to the guest. If the host has updated microcode the protection
+ takes effect otherwise a few cpu cycles are wasted pointlessly.
+
+ The state in the mds sysfs file reflects this situation accordingly.
+
+
+Mitigation mechanism
+-------------------------
+
+The kernel detects the affected CPUs and the presence of the microcode
+which is required.
+
+If a CPU is affected and the microcode is available, then the kernel
+enables the mitigation by default. The mitigation can be controlled at boot
+time via a kernel command line option. See
+:ref:`mds_mitigation_control_command_line`.
+
+.. _cpu_buffer_clear:
+
+CPU buffer clearing
+^^^^^^^^^^^^^^^^^^^
+
+ The mitigation for MDS clears the affected CPU buffers on return to user
+ space and when entering a guest.
+
+ If SMT is enabled it also clears the buffers on idle entry when the CPU
+ is only affected by MSBDS and not any other MDS variant, because the
+ other variants cannot be protected against cross Hyper-Thread attacks.
+
+ For CPUs which are only affected by MSBDS the user space, guest and idle
+ transition mitigations are sufficient and SMT is not affected.
+
+.. _virt_mechanism:
+
+Virtualization mitigation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The protection for host to guest transition depends on the L1TF
+ vulnerability of the CPU:
+
+ - CPU is affected by L1TF:
+
+ If the L1D flush mitigation is enabled and up to date microcode is
+ available, the L1D flush mitigation is automatically protecting the
+ guest transition.
+
+ If the L1D flush mitigation is disabled then the MDS mitigation is
+ invoked explicit when the host MDS mitigation is enabled.
+
+ For details on L1TF and virtualization see:
+ :ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <mitigation_control_kvm>`.
+
+ - CPU is not affected by L1TF:
+
+ CPU buffers are flushed before entering the guest when the host MDS
+ mitigation is enabled.
+
+ The resulting MDS protection matrix for the host to guest transition:
+
+ ============ ===== ============= ============ =================
+ L1TF MDS VMX-L1FLUSH Host MDS MDS-State
+
+ Don't care No Don't care N/A Not affected
+
+ Yes Yes Disabled Off Vulnerable
+
+ Yes Yes Disabled Full Mitigated
+
+ Yes Yes Enabled Don't care Mitigated
+
+ No Yes N/A Off Vulnerable
+
+ No Yes N/A Full Mitigated
+ ============ ===== ============= ============ =================
+
+ This only covers the host to guest transition, i.e. prevents leakage from
+ host to guest, but does not protect the guest internally. Guests need to
+ have their own protections.
+
+.. _xeon_phi:
+
+XEON PHI specific considerations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The XEON PHI processor family is affected by MSBDS which can be exploited
+ cross Hyper-Threads when entering idle states. Some XEON PHI variants allow
+ to use MWAIT in user space (Ring 3) which opens an potential attack vector
+ for malicious user space. The exposure can be disabled on the kernel
+ command line with the 'ring3mwait=disable' command line option.
+
+ XEON PHI is not affected by the other MDS variants and MSBDS is mitigated
+ before the CPU enters a idle state. As XEON PHI is not affected by L1TF
+ either disabling SMT is not required for full protection.
+
+.. _mds_smt_control:
+
+SMT control
+^^^^^^^^^^^
+
+ All MDS variants except MSBDS can be attacked cross Hyper-Threads. That
+ means on CPUs which are affected by MFBDS or MLPDS it is necessary to
+ disable SMT for full protection. These are most of the affected CPUs; the
+ exception is XEON PHI, see :ref:`xeon_phi`.
+
+ Disabling SMT can have a significant performance impact, but the impact
+ depends on the type of workloads.
+
+ See the relevant chapter in the L1TF mitigation documentation for details:
+ :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
+
+
+.. _mds_mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows to control the MDS mitigations at boot
+time with the option "mds=". The valid arguments for this option are:
+
+ ============ =============================================================
+ full If the CPU is vulnerable, enable all available mitigations
+ for the MDS vulnerability, CPU buffer clearing on exit to
+ userspace and when entering a VM. Idle transitions are
+ protected as well if SMT is enabled.
+
+ It does not automatically disable SMT.
+
+ full,nosmt The same as mds=full, with SMT disabled on vulnerable
+ CPUs. This is the complete mitigation.
+
+ off Disables MDS mitigations completely.
+
+ ============ =============================================================
+
+Not specifying this option is equivalent to "mds=full".
+
+
+Mitigation selection guide
+--------------------------
+
+1. Trusted userspace
+^^^^^^^^^^^^^^^^^^^^
+
+ If all userspace applications are from a trusted source and do not
+ execute untrusted code which is supplied externally, then the mitigation
+ can be disabled.
+
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The same considerations as above versus trusted user space apply.
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The protection depends on the state of the L1TF mitigations.
+ See :ref:`virt_mechanism`.
+
+ If the MDS mitigation is enabled and SMT is disabled, guest to host and
+ guest to guest attacks are prevented.
+
+.. _mds_default_mitigations:
+
+Default mitigations
+-------------------
+
+ The kernel default mitigations for vulnerable processors are:
+
+ - Enable CPU buffer clearing
+
+ The kernel does not by default enforce the disabling of SMT, which leaves
+ SMT systems vulnerable when running untrusted code. The same rationale as
+ for L1TF applies.
+ See :ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <default_mitigations>`.
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 0a491676685e..8001917ee012 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -17,14 +17,12 @@ etc.
kernel-parameters
devices
-This section describes CPU vulnerabilities and provides an overview of the
-possible mitigations along with guidance for selecting mitigations if they
-are configurable at compile, boot or run time.
+This section describes CPU vulnerabilities and their mitigations.
.. toctree::
:maxdepth: 1
- l1tf
+ hw-vuln/index
Here is a set of documents aimed at users who are trying to track down
problems and bugs in particular.
@@ -77,6 +75,7 @@ configure specific aspects of kernel behavior to your liking.
LSM/index
mm/index
perf-security
+ acpi/index
.. only:: subproject and html
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index b8d0bc07ed0a..0124980dca2d 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -88,6 +88,7 @@ parameter is applicable::
APIC APIC support is enabled.
APM Advanced Power Management support is enabled.
ARM ARM architecture is enabled.
+ ARM64 ARM64 architecture is enabled.
AX25 Appropriate AX.25 support is enabled.
CLK Common clock infrastructure is enabled.
CMA Contiguous Memory Area support is enabled.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 2b8ee90bb644..138f6664b2e2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -704,8 +704,11 @@
upon panic. This parameter reserves the physical
memory region [offset, offset + size] for that kernel
image. If '@offset' is omitted, then a suitable offset
- is selected automatically. Check
- Documentation/kdump/kdump.txt for further details.
+ is selected automatically.
+ [KNL, x86_64] select a region under 4G first, and
+ fall back to reserve region above 4G when '@offset'
+ hasn't been specified.
+ See Documentation/kdump/kdump.txt for further details.
crashkernel=range1:size1[,range2:size2,...][@offset]
[KNL] Same as above, but depends on the memory
@@ -1585,7 +1588,7 @@
Format: { "off" | "enforce" | "fix" | "log" }
default: "enforce"
- ima_appraise_tcb [IMA]
+ ima_appraise_tcb [IMA] Deprecated. Use ima_policy= instead.
The builtin appraise policy appraises all files
owned by uid=0.
@@ -1612,8 +1615,7 @@
uid=0.
The "appraise_tcb" policy appraises the integrity of
- all files owned by root. (This is the equivalent
- of ima_appraise_tcb.)
+ all files owned by root.
The "secure_boot" policy appraises the integrity
of files (eg. kexec kernel image, kernel modules,
@@ -1828,6 +1830,9 @@
ip= [IP_PNP]
See Documentation/filesystems/nfs/nfsroot.txt.
+ ipcmni_extend [KNL] Extend the maximum number of unique System V
+ IPC identifiers from 32,768 to 16,777,216.
+
irqaffinity= [SMP] Set the default irq affinity mask
The argument is a cpu list, as described above.
@@ -2141,7 +2146,7 @@
Default is 'flush'.
- For details see: Documentation/admin-guide/l1tf.rst
+ For details see: Documentation/admin-guide/hw-vuln/l1tf.rst
l2cr= [PPC]
@@ -2387,6 +2392,32 @@
Format: <first>,<last>
Specifies range of consoles to be captured by the MDA.
+ mds= [X86,INTEL]
+ Control mitigation for the Micro-architectural Data
+ Sampling (MDS) vulnerability.
+
+ Certain CPUs are vulnerable to an exploit against CPU
+ internal buffers which can forward information to a
+ disclosure gadget under certain conditions.
+
+ In vulnerable processors, the speculatively
+ forwarded data can be used in a cache side channel
+ attack, to access data to which the attacker does
+ not have direct access.
+
+ This parameter controls the MDS mitigation. The
+ options are:
+
+ full - Enable MDS mitigation on vulnerable CPUs
+ full,nosmt - Enable MDS mitigation and disable
+ SMT on vulnerable CPUs
+ off - Unconditionally disable MDS mitigation
+
+ Not specifying this option is equivalent to
+ mds=full.
+
+ For details see: Documentation/admin-guide/hw-vuln/mds.rst
+
mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory
Amount of memory to be used when the kernel is not able
to see the whole system memory or for test.
@@ -2544,6 +2575,42 @@
in the "bleeding edge" mini2440 support kernel at
http://repo.or.cz/w/linux-2.6/mini2440.git
+ mitigations=
+ [X86,PPC,S390,ARM64] Control optional mitigations for
+ CPU vulnerabilities. This is a set of curated,
+ arch-independent options, each of which is an
+ aggregation of existing arch-specific options.
+
+ off
+ Disable all optional CPU mitigations. This
+ improves system performance, but it may also
+ expose users to several CPU vulnerabilities.
+ Equivalent to: nopti [X86,PPC]
+ kpti=0 [ARM64]
+ nospectre_v1 [PPC]
+ nobp=0 [S390]
+ nospectre_v2 [X86,PPC,S390,ARM64]
+ spectre_v2_user=off [X86]
+ spec_store_bypass_disable=off [X86,PPC]
+ ssbd=force-off [ARM64]
+ l1tf=off [X86]
+ mds=off [X86]
+
+ auto (default)
+ Mitigate all CPU vulnerabilities, but leave SMT
+ enabled, even if it's vulnerable. This is for
+ users who don't want to be surprised by SMT
+ getting disabled across kernel upgrades, or who
+ have other ways of avoiding SMT-based attacks.
+ Equivalent to: (default behavior)
+
+ auto,nosmt
+ Mitigate all CPU vulnerabilities, disabling SMT
+ if needed. This is for users who always want to
+ be fully mitigated, even if it means losing SMT.
+ Equivalent to: l1tf=flush,nosmt [X86]
+ mds=full,nosmt [X86]
+
mminit_loglevel=
[KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
parameter allows control of the logging verbosity for
@@ -2839,11 +2906,11 @@
noexec=on: enable non-executable mappings (default)
noexec=off: disable non-executable mappings
- nosmap [X86]
+ nosmap [X86,PPC]
Disable SMAP (Supervisor Mode Access Prevention)
even if it is supported by processor.
- nosmep [X86]
+ nosmep [X86,PPC]
Disable SMEP (Supervisor Mode Execution Prevention)
even if it is supported by processor.
@@ -2873,10 +2940,10 @@
check bypass). With this option data leaks are possible
in the system.
- nospectre_v2 [X86,PPC_FSL_BOOK3E] Disable all mitigations for the Spectre variant 2
- (indirect branch prediction) vulnerability. System may
- allow data leaks with this option, which is equivalent
- to spectre_v2=off.
+ nospectre_v2 [X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for
+ the Spectre variant 2 (indirect branch prediction)
+ vulnerability. System may allow data leaks with this
+ option.
nospec_store_bypass_disable
[HW] Disable all mitigations for the Speculative Store Bypass vulnerability
@@ -3110,6 +3177,16 @@
This will also cause panics on machine check exceptions.
Useful together with panic=30 to trigger a reboot.
+ page_alloc.shuffle=
+ [KNL] Boolean flag to control whether the page allocator
+ should randomize its free lists. The randomization may
+ be automatically enabled if the kernel detects it is
+ running on a platform with a direct-mapped memory-side
+ cache, and this parameter can be used to
+ override/disable that behavior. The state of the flag
+ can be read from sysfs at:
+ /sys/module/page_alloc/parameters/shuffle.
+
page_owner= [KNL] Boot-time page_owner enabling option.
Storage of the information about who allocated
each page is disabled in default. With this switch,
@@ -3135,6 +3212,7 @@
bit 2: print timer info
bit 3: print locks info if CONFIG_LOCKDEP is on
bit 4: print ftrace buffer
+ bit 5: print all printk messages in buffer
panic_on_warn panic() instead of WARN(). Useful to cause kdump
on a WARN().
@@ -3394,6 +3472,8 @@
bridges without forcing it upstream. Note:
this removes isolation between devices and
may put more devices in an IOMMU group.
+ force_floating [S390] Force usage of floating interrupts.
+ nomio [S390] Do not use MIO instructions.
pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power
Management.
@@ -3623,7 +3703,9 @@
see CONFIG_RAS_CEC help text.
rcu_nocbs= [KNL]
- The argument is a cpu list, as described above.
+ The argument is a cpu list, as described above,
+ except that the string "all" can be used to
+ specify every CPU on the system.
In kernels built with CONFIG_RCU_NOCB_CPU=y, set
the specified list of CPUs to be no-callback CPUs.
@@ -3986,7 +4068,9 @@
[[,]s[mp]#### \
[[,]b[ios] | a[cpi] | k[bd] | t[riple] | e[fi] | p[ci]] \
[[,]f[orce]
- Where reboot_mode is one of warm (soft) or cold (hard) or gpio,
+ Where reboot_mode is one of warm (soft) or cold (hard) or gpio
+ (prefix with 'panic_' to set mode for panic
+ reboot only),
reboot_type is one of bios, acpi, kbd, triple, efi, or pci,
reboot_force is either force or not specified,
reboot_cpu is s[mp]#### with #### being the processor
@@ -4703,6 +4787,10 @@
[x86] unstable: mark the TSC clocksource as unstable, this
marks the TSC unconditionally unstable at bootup and
avoids any further wobbles once the TSC watchdog notices.
+ [x86] nowatchdog: disable clocksource watchdog. Used
+ in situations with strict latency requirements (where
+ interruptions from clocksource watchdog are not
+ acceptable).
turbografx.map[2|3]= [HW,JOY]
TurboGraFX parallel port interface
@@ -5173,6 +5261,13 @@
with /sys/devices/system/xen_memory/xen_memory0/scrub_pages.
Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT.
+ xen_timer_slop= [X86-64,XEN]
+ Set the timer slop (in nanoseconds) for the virtual Xen
+ timers (default is 100000). This adjusts the minimum
+ delta of virtualized Xen timers, where lower values
+ improve timer resolution at the expense of processing
+ more timer interrupts.
+
xirc2ps_cs= [NET,PCMCIA]
Format:
<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 8edb35f11317..ddf8d8d33377 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -31,6 +31,7 @@ the Linux memory management.
ksm
memory-hotplug
numa_memory_policy
+ numaperf
pagemap
soft-dirty
transhuge
diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
new file mode 100644
index 000000000000..c067ed145158
--- /dev/null
+++ b/Documentation/admin-guide/mm/numaperf.rst
@@ -0,0 +1,169 @@
+.. _numaperf:
+
+=============
+NUMA Locality
+=============
+
+Some platforms may have multiple types of memory attached to a compute
+node. These disparate memory ranges may share some characteristics, such
+as CPU cache coherence, but may have different performance. For example,
+different media types and buses affect bandwidth and latency.
+
+A system supports such heterogeneous memory by grouping each memory type
+under different domains, or "nodes", based on locality and performance
+characteristics. Some memory may share the same node as a CPU, and others
+are provided as memory only nodes. While memory only nodes do not provide
+CPUs, they may still be local to one or more compute nodes relative to
+other nodes. The following diagram shows one such example of two compute
+nodes with local memory and a memory only node for each of compute node::
+
+ +------------------+ +------------------+
+ | Compute Node 0 +-----+ Compute Node 1 |
+ | Local Node0 Mem | | Local Node1 Mem |
+ +--------+---------+ +--------+---------+
+ | |
+ +--------+---------+ +--------+---------+
+ | Slower Node2 Mem | | Slower Node3 Mem |
+ +------------------+ +--------+---------+
+
+A "memory initiator" is a node containing one or more devices such as
+CPUs or separate memory I/O devices that can initiate memory requests.
+A "memory target" is a node containing one or more physical address
+ranges accessible from one or more memory initiators.
+
+When multiple memory initiators exist, they may not all have the same
+performance when accessing a given memory target. Each initiator-target
+pair may be organized into different ranked access classes to represent
+this relationship. The highest performing initiator to a given target
+is considered to be one of that target's local initiators, and given
+the highest access class, 0. Any given target may have one or more
+local initiators, and any given initiator may have multiple local
+memory targets.
+
+To aid applications matching memory targets with their initiators, the
+kernel provides symlinks to each other. The following example lists the
+relationship for the access class "0" memory initiators and targets::
+
+ # symlinks -v /sys/devices/system/node/nodeX/access0/targets/
+ relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
+
+ # symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
+ relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
+
+A memory initiator may have multiple memory targets in the same access
+class. The target memory's initiators in a given class indicate the
+nodes' access characteristics share the same performance relative to other
+linked initiator nodes. Each target within an initiator's access class,
+though, do not necessarily perform the same as each other.
+
+================
+NUMA Performance
+================
+
+Applications may wish to consider which node they want their memory to
+be allocated from based on the node's performance characteristics. If
+the system provides these attributes, the kernel exports them under the
+node sysfs hierarchy by appending the attributes directory under the
+memory node's access class 0 initiators as follows::
+
+ /sys/devices/system/node/nodeY/access0/initiators/
+
+These attributes apply only when accessed from nodes that have the
+are linked under the this access's inititiators.
+
+The performance characteristics the kernel provides for the local initiators
+are exported are as follows::
+
+ # tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
+ /sys/devices/system/node/nodeY/access0/initiators/
+ |-- read_bandwidth
+ |-- read_latency
+ |-- write_bandwidth
+ `-- write_latency
+
+The bandwidth attributes are provided in MiB/second.
+
+The latency attributes are provided in nanoseconds.
+
+The values reported here correspond to the rated latency and bandwidth
+for the platform.
+
+==========
+NUMA Cache
+==========
+
+System memory may be constructed in a hierarchy of elements with various
+performance characteristics in order to provide large address space of
+slower performing memory cached by a smaller higher performing memory. The
+system physical addresses memory initiators are aware of are provided
+by the last memory level in the hierarchy. The system meanwhile uses
+higher performing memory to transparently cache access to progressively
+slower levels.
+
+The term "far memory" is used to denote the last level memory in the
+hierarchy. Each increasing cache level provides higher performing
+initiator access, and the term "near memory" represents the fastest
+cache provided by the system.
+
+This numbering is different than CPU caches where the cache level (ex:
+L1, L2, L3) uses the CPU-side view where each increased level is lower
+performing. In contrast, the memory cache level is centric to the last
+level memory, so the higher numbered cache level corresponds to memory
+nearer to the CPU, and further from far memory.
+
+The memory-side caches are not directly addressable by software. When
+software accesses a system address, the system will return it from the
+near memory cache if it is present. If it is not present, the system
+accesses the next level of memory until there is either a hit in that
+cache level, or it reaches far memory.
+
+An application does not need to know about caching attributes in order
+to use the system. Software may optionally query the memory cache
+attributes in order to maximize the performance out of such a setup.
+If the system provides a way for the kernel to discover this information,
+for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
+the kernel will append these attributes to the NUMA node memory target.
+
+When the kernel first registers a memory cache with a node, the kernel
+will create the following directory::
+
+ /sys/devices/system/node/nodeX/memory_side_cache/
+
+If that directory is not present, the system either does not not provide
+a memory-side cache, or that information is not accessible to the kernel.
+
+The attributes for each level of cache is provided under its cache
+level index::
+
+ /sys/devices/system/node/nodeX/memory_side_cache/indexA/
+ /sys/devices/system/node/nodeX/memory_side_cache/indexB/
+ /sys/devices/system/node/nodeX/memory_side_cache/indexC/
+
+Each cache level's directory provides its attributes. For example, the
+following shows a single cache level and the attributes available for
+software to query::
+
+ # tree sys/devices/system/node/node0/memory_side_cache/
+ /sys/devices/system/node/node0/memory_side_cache/
+ |-- index1
+ | |-- indexing
+ | |-- line_size
+ | |-- size
+ | `-- write_policy
+
+The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
+for any other indexed based, multi-way associativity.
+
+The "line_size" is the number of bytes accessed from the next cache
+level on a miss.
+
+The "size" is the number of bytes provided by this cache level.
+
+The "write_policy" will be 0 for write-back, and non-zero for
+write-through caching.
+
+========
+See Also
+========
+.. [1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
+ Section 5.2.27
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index 7eca9026a9ed..0c74a7784964 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`
@@ -5,9 +8,10 @@
CPU Performance Scaling
=======================
-::
+:Copyright: |copy| 2017 Intel Corporation
+
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Concept of CPU Performance Scaling
======================================
@@ -396,8 +400,8 @@ RT or deadline scheduling classes, the governor will increase the frequency to
the allowed maximum (that is, the ``scaling_max_freq`` policy limit). In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
-given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_
-LWN.net article for a description of the PELT mechanism). Then, the new
+given CPU as the CPU utilization estimate (see the *Per-entity load tracking*
+LWN.net article [1]_ for a description of the PELT mechanism). Then, the new
CPU frequency to apply is computed in accordance with the formula
f = 1.25 * ``f_0`` * ``util`` / ``max``
@@ -698,4 +702,8 @@ hardware feature (e.g. all Intel ones), even if the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.
-.. _Per-entity load tracking: https://lwn.net/Articles/531853/
+References
+==========
+
+.. [1] Jonathan Corbet, *Per-entity load tracking*,
+ https://lwn.net/Articles/531853/
diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst
index 9c58b35a81cb..e70b365dbc60 100644
--- a/Documentation/admin-guide/pm/cpuidle.rst
+++ b/Documentation/admin-guide/pm/cpuidle.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>`
.. |cpufreq| replace:: :doc:`CPU Performance Scaling <cpufreq>`
@@ -5,9 +8,10 @@
CPU Idle Time Management
========================
-::
+:Copyright: |copy| 2018 Intel Corporation
+
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Copyright (c) 2018 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Concepts
========
diff --git a/Documentation/admin-guide/pm/index.rst b/Documentation/admin-guide/pm/index.rst
index 49237ac73442..39f8f9f81e7a 100644
--- a/Documentation/admin-guide/pm/index.rst
+++ b/Documentation/admin-guide/pm/index.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
================
Power Management
================
diff --git a/Documentation/admin-guide/pm/intel_epb.rst b/Documentation/admin-guide/pm/intel_epb.rst
new file mode 100644
index 000000000000..005121167af7
--- /dev/null
+++ b/Documentation/admin-guide/pm/intel_epb.rst
@@ -0,0 +1,41 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+======================================
+Intel Performance and Energy Bias Hint
+======================================
+
+:Copyright: |copy| 2019 Intel Corporation
+
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+
+.. kernel-doc:: arch/x86/kernel/cpu/intel_epb.c
+ :doc: overview
+
+Intel Performance and Energy Bias Attribute in ``sysfs``
+========================================================
+
+The Intel Performance and Energy Bias Hint (EPB) value for a given (logical) CPU
+can be checked or updated through a ``sysfs`` attribute (file) under
+:file:`/sys/devices/system/cpu/cpu<N>/power/`, where the CPU number ``<N>``
+is allocated at the system initialization time:
+
+``energy_perf_bias``
+ Shows the current EPB value for the CPU in a sliding scale 0 - 15, where
+ a value of 0 corresponds to a hint preference for highest performance
+ and a value of 15 corresponds to the maximum energy savings.
+
+ In order to update the EPB value for the CPU, this attribute can be
+ written to, either with a number in the 0 - 15 sliding scale above, or
+ with one of the strings: "performance", "balance-performance", "normal",
+ "balance-power", "power" that represent values reflected by their
+ meaning.
+
+ This attribute is present for all online CPUs supporting the EPB
+ feature.
+
+Note that while the EPB interface to the processor is defined at the logical CPU
+level, the physical register backing it may be shared by multiple CPUs (for
+example, SMT siblings or cores in one package). For this reason, updating the
+EPB value for one CPU may cause the EPB values for other CPUs to change.
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index ec0f7c111f65..67e414e34f37 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -1,10 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
===============================================
``intel_pstate`` CPU Performance Scaling Driver
===============================================
-::
+:Copyright: |copy| 2017 Intel Corporation
- Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
General Information
@@ -20,11 +23,10 @@ you have not done that yet.]
For the processors supported by ``intel_pstate``, the P-state concept is broader
than just an operating frequency or an operating performance point (see the
-`LinuxCon Europe 2015 presentation by Kristen Accardi <LCEU2015_>`_ for more
+LinuxCon Europe 2015 presentation by Kristen Accardi [1]_ for more
information about that). For this reason, the representation of P-states used
by ``intel_pstate`` internally follows the hardware specification (for details
-refer to `Intel® 64 and IA-32 Architectures Software Developer’s Manual
-Volume 3: System Programming Guide <SDM_>`_). However, the ``CPUFreq`` core
+refer to Intel Software Developer’s Manual [2]_). However, the ``CPUFreq`` core
uses frequencies for identifying operating performance points of CPUs and
frequencies are involved in the user space interface exposed by it, so
``intel_pstate`` maps its internal representation of P-states to frequencies too
@@ -561,9 +563,9 @@ or to pin every task potentially sensitive to them to a specific CPU.]
On the majority of systems supported by ``intel_pstate``, the ACPI tables
provided by the platform firmware contain ``_PSS`` objects returning information
-that can be used for CPU performance scaling (refer to the `ACPI specification`_
-for details on the ``_PSS`` objects and the format of the information returned
-by them).
+that can be used for CPU performance scaling (refer to the ACPI specification
+[3]_ for details on the ``_PSS`` objects and the format of the information
+returned by them).
The information returned by the ACPI ``_PSS`` objects is used by the
``acpi-cpufreq`` scaling driver. On systems supported by ``intel_pstate``
@@ -728,6 +730,14 @@ P-state is called, the ``ftrace`` filter can be set to to
<idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
-.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
-.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
-.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
+References
+==========
+
+.. [1] Kristen Accardi, *Balancing Power and Performance in the Linux Kernel*,
+ http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
+
+.. [2] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide*,
+ http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
+
+.. [3] *Advanced Configuration and Power Interface Specification*,
+ https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
diff --git a/Documentation/admin-guide/pm/sleep-states.rst b/Documentation/admin-guide/pm/sleep-states.rst
index dbf5acd49f35..cd3a28cb81f4 100644
--- a/Documentation/admin-guide/pm/sleep-states.rst
+++ b/Documentation/admin-guide/pm/sleep-states.rst
@@ -1,10 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
===================
System Sleep States
===================
-::
+:Copyright: |copy| 2017 Intel Corporation
+
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Sleep states are global low-power states of the entire system in which user
space code cannot be executed and the overall system activity is significantly
diff --git a/Documentation/admin-guide/pm/strategies.rst b/Documentation/admin-guide/pm/strategies.rst
index afe4d3f831fe..dd0362e32fa5 100644
--- a/Documentation/admin-guide/pm/strategies.rst
+++ b/Documentation/admin-guide/pm/strategies.rst
@@ -1,10 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
===========================
Power Management Strategies
===========================
-::
+:Copyright: |copy| 2017 Intel Corporation
+
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Linux kernel supports two major high-level power management strategies.
diff --git a/Documentation/admin-guide/pm/system-wide.rst b/Documentation/admin-guide/pm/system-wide.rst
index 0c81e4c5de39..2b1f987b34f0 100644
--- a/Documentation/admin-guide/pm/system-wide.rst
+++ b/Documentation/admin-guide/pm/system-wide.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
============================
System-Wide Power Management
============================
diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst
index b6cef9b5e961..fc298eb1234b 100644
--- a/Documentation/admin-guide/pm/working-state.rst
+++ b/Documentation/admin-guide/pm/working-state.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
==============================
Working-State Power Management
==============================
@@ -8,3 +10,4 @@ Working-State Power Management
cpuidle
cpufreq
intel_pstate
+ intel_epb