aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/obsolete/procfs-i8k10
-rw-r--r--Documentation/ABI/stable/sysfs-block49
-rw-r--r--Documentation/ABI/stable/sysfs-devices-system-cpu4
-rw-r--r--Documentation/ABI/testing/debugfs-hisi-hpre178
-rw-r--r--Documentation/ABI/testing/debugfs-hisi-sec146
-rw-r--r--Documentation/ABI/testing/debugfs-hisi-zip146
-rw-r--r--Documentation/ABI/testing/sysfs-class-hwmon8
-rw-r--r--Documentation/ABI/testing/sysfs-class-thermal2
-rw-r--r--Documentation/ABI/testing/sysfs-devices-system-cpu7
-rw-r--r--Documentation/ABI/testing/sysfs-fs-f2fs54
-rw-r--r--Documentation/Makefile2
-rw-r--r--Documentation/admin-guide/acpi/fan_performance_states.rst28
-rw-r--r--Documentation/admin-guide/blockdev/zram.rst20
-rw-r--r--Documentation/admin-guide/index.rst1
-rw-r--r--Documentation/admin-guide/iostats.rst6
-rw-r--r--Documentation/admin-guide/kdump/vmcoreinfo.rst8
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt55
-rw-r--r--Documentation/admin-guide/perf/index.rst1
-rw-r--r--Documentation/admin-guide/pm/amd-pstate.rst26
-rw-r--r--Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst60
-rw-r--r--Documentation/admin-guide/pm/working-state.rst1
-rw-r--r--Documentation/admin-guide/reporting-issues.rst73
-rw-r--r--Documentation/admin-guide/reporting-regressions.rst451
-rw-r--r--Documentation/admin-guide/sysctl/kernel.rst16
-rw-r--r--Documentation/arm64/booting.rst10
-rw-r--r--Documentation/arm64/elf_hwcaps.rst5
-rw-r--r--Documentation/arm64/memory-tagging-extension.rst54
-rw-r--r--Documentation/arm64/silicon-errata.rst2
-rw-r--r--Documentation/asm-annotations.rst11
-rw-r--r--Documentation/block/biodoc.rst1164
-rw-r--r--Documentation/block/capability.rst2
-rw-r--r--Documentation/block/index.rst1
-rw-r--r--Documentation/cdrom/packet-writing.rst4
-rw-r--r--Documentation/conf.py131
-rw-r--r--Documentation/core-api/entry.rst279
-rw-r--r--Documentation/core-api/index.rst8
-rw-r--r--Documentation/dev-tools/ktap.rst49
-rw-r--r--Documentation/devicetree/bindings/arm/pmu.yaml2
-rw-r--r--Documentation/devicetree/bindings/extcon/maxim,max77843.yaml40
-rw-r--r--Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml68
-rw-r--r--Documentation/devicetree/bindings/hwmon/national,lm90.yaml4
-rw-r--r--Documentation/devicetree/bindings/hwmon/ti,tmp464.yaml114
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/amlogic,meson-gpio-intc.txt1
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/apple,aic.yaml31
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/apple,aic2.yaml98
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/qcom,mpm.yaml96
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/st,stm32-exti.yaml1
-rw-r--r--Documentation/devicetree/bindings/mfd/google,cros-ec.yaml31
-rw-r--r--Documentation/devicetree/bindings/mfd/max14577.txt147
-rw-r--r--Documentation/devicetree/bindings/mfd/max77802.txt25
-rw-r--r--Documentation/devicetree/bindings/mfd/maxim,max14577.yaml195
-rw-r--r--Documentation/devicetree/bindings/mfd/maxim,max77802.yaml194
-rw-r--r--Documentation/devicetree/bindings/mfd/maxim,max77843.yaml144
-rw-r--r--Documentation/devicetree/bindings/mtd/jedec,spi-nor.yaml3
-rw-r--r--Documentation/devicetree/bindings/perf/marvell-cn10k-ddr.yaml37
-rw-r--r--Documentation/devicetree/bindings/power/supply/maxim,max14577.yaml84
-rw-r--r--Documentation/devicetree/bindings/regulator/max77802.txt111
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max14577.yaml78
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max77802.yaml85
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max77843.yaml65
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max8973.yaml5
-rw-r--r--Documentation/devicetree/bindings/regulator/pfuze100.yaml6
-rw-r--r--Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.yaml2
-rw-r--r--Documentation/devicetree/bindings/regulator/richtek,rt5190a-regulator.yaml141
-rw-r--r--Documentation/devicetree/bindings/regulator/ti,tps62360.yaml98
-rw-r--r--Documentation/devicetree/bindings/regulator/ti,tps62864.yaml63
-rw-r--r--Documentation/devicetree/bindings/regulator/tps62360-regulator.txt44
-rw-r--r--Documentation/devicetree/bindings/soc/samsung/exynos-usi.yaml2
-rw-r--r--Documentation/devicetree/bindings/spi/mediatek,spi-mt65xx.yaml107
-rw-r--r--Documentation/devicetree/bindings/spi/mediatek,spi-mtk-nor.yaml4
-rw-r--r--Documentation/devicetree/bindings/spi/mediatek,spi-slave-mt27xx.yaml58
-rw-r--r--Documentation/devicetree/bindings/spi/microchip,mpfs-spi.yaml52
-rw-r--r--Documentation/devicetree/bindings/spi/nvidia,tegra210-quad.yaml3
-rw-r--r--Documentation/devicetree/bindings/spi/renesas,rspi.yaml4
-rw-r--r--Documentation/devicetree/bindings/spi/samsung,spi-peripheral-props.yaml33
-rw-r--r--Documentation/devicetree/bindings/spi/samsung,spi.yaml188
-rw-r--r--Documentation/devicetree/bindings/spi/spi-controller.yaml7
-rw-r--r--Documentation/devicetree/bindings/spi/spi-mt65xx.txt68
-rw-r--r--Documentation/devicetree/bindings/spi/spi-nxp-fspi.yaml3
-rw-r--r--Documentation/devicetree/bindings/spi/spi-peripheral-props.yaml26
-rw-r--r--Documentation/devicetree/bindings/spi/spi-pl022.yaml4
-rw-r--r--Documentation/devicetree/bindings/spi/spi-samsung.txt122
-rw-r--r--Documentation/devicetree/bindings/spi/spi-slave-mt27xx.txt33
-rw-r--r--Documentation/devicetree/bindings/spi/spi-sunplus-sp7021.yaml78
-rw-r--r--Documentation/devicetree/bindings/thermal/exynos-thermal.txt106
-rw-r--r--Documentation/devicetree/bindings/thermal/qcom-lmh.yaml1
-rw-r--r--Documentation/devicetree/bindings/thermal/qcom-tsens.yaml1
-rw-r--r--Documentation/devicetree/bindings/thermal/samsung,exynos-thermal.yaml184
-rw-r--r--Documentation/devicetree/bindings/timer/nvidia,tegra-timer.yaml150
-rw-r--r--Documentation/devicetree/bindings/timer/nvidia,tegra20-timer.txt24
-rw-r--r--Documentation/devicetree/bindings/timer/nvidia,tegra210-timer.txt36
-rw-r--r--Documentation/devicetree/bindings/timer/nvidia,tegra30-timer.txt28
-rw-r--r--Documentation/devicetree/bindings/trivial-devices.yaml5
-rw-r--r--Documentation/devicetree/bindings/vendor-prefixes.yaml2
-rw-r--r--Documentation/driver-api/gpio/board.rst21
-rw-r--r--Documentation/driver-api/mtd/index.rst2
-rw-r--r--Documentation/driver-api/mtd/spi-intel.rst (renamed from Documentation/driver-api/mtd/intel-spi.rst)8
-rw-r--r--Documentation/driver-api/serial/driver.rst2
-rw-r--r--Documentation/driver-api/thermal/index.rst1
-rw-r--r--Documentation/driver-api/thermal/intel_dptf.rst272
-rw-r--r--Documentation/filesystems/dax.rst6
-rw-r--r--Documentation/filesystems/erofs.rst2
-rw-r--r--Documentation/filesystems/ext4/blocks.rst2
-rw-r--r--Documentation/filesystems/fscrypt.rst25
-rw-r--r--Documentation/filesystems/locking.rst6
-rw-r--r--Documentation/firmware-guide/acpi/enumeration.rst111
-rw-r--r--Documentation/firmware-guide/acpi/gpio-properties.rst26
-rw-r--r--Documentation/hwmon/aquacomputer_d5next.rst49
-rw-r--r--Documentation/hwmon/asus_ec_sensors.rst54
-rw-r--r--Documentation/hwmon/dell-smm-hwmon.rst180
-rw-r--r--Documentation/hwmon/index.rst3
-rw-r--r--Documentation/hwmon/lm70.rst7
-rw-r--r--Documentation/hwmon/max6639.rst2
-rw-r--r--Documentation/hwmon/pli1209bc.rst75
-rw-r--r--Documentation/hwmon/sch5627.rst4
-rw-r--r--Documentation/hwmon/sysfs-interface.rst4
-rw-r--r--Documentation/hwmon/tmp464.rst73
-rw-r--r--Documentation/hwmon/xdpe12284.rst12
-rw-r--r--Documentation/locking/locktypes.rst2
-rw-r--r--Documentation/process/applying-patches.rst28
-rw-r--r--Documentation/process/deprecated.rst20
-rw-r--r--Documentation/process/handling-regressions.rst746
-rw-r--r--Documentation/process/index.rst2
-rw-r--r--Documentation/process/researcher-guidelines.rst143
-rw-r--r--Documentation/process/submitting-patches.rst3
-rw-r--r--Documentation/scheduler/index.rst1
-rw-r--r--Documentation/scheduler/sched-domains.rst8
-rw-r--r--Documentation/scheduler/schedutil.rst (renamed from Documentation/scheduler/schedutil.txt)30
-rw-r--r--Documentation/security/SCTP.rst26
-rw-r--r--Documentation/security/keys/trusted-encrypted.rst25
-rw-r--r--Documentation/sphinx/kerneldoc-preamble.sty226
-rw-r--r--Documentation/sphinx/kfigure.py134
-rw-r--r--Documentation/spi/pxa2xx.rst3
-rw-r--r--Documentation/trace/osnoise-tracer.rst4
-rw-r--r--Documentation/translations/conf.py12
-rw-r--r--Documentation/translations/ja_JP/index.rst4
-rw-r--r--Documentation/translations/ko_KR/index.rst5
-rw-r--r--Documentation/translations/zh_CN/accounting/delay-accounting.rst62
-rw-r--r--Documentation/translations/zh_CN/admin-guide/index.rst124
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/damon/index.rst28
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/damon/reclaim.rst232
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/damon/start.rst132
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst286
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/index.rst49
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/ksm.rst148
-rw-r--r--Documentation/translations/zh_CN/core-api/index.rst2
-rw-r--r--Documentation/translations/zh_CN/core-api/rbtree.rst391
-rw-r--r--Documentation/translations/zh_CN/devicetree/index.rst50
-rw-r--r--Documentation/translations/zh_CN/devicetree/of_unittest.rst189
-rw-r--r--Documentation/translations/zh_CN/devicetree/usage-model.rst330
-rw-r--r--Documentation/translations/zh_CN/index.rst21
-rw-r--r--Documentation/translations/zh_CN/peci/index.rst26
-rw-r--r--Documentation/translations/zh_CN/peci/peci.rst54
-rw-r--r--Documentation/translations/zh_CN/power/energy-model.rst190
-rw-r--r--Documentation/translations/zh_CN/power/index.rst56
-rw-r--r--Documentation/translations/zh_CN/power/opp.rst341
-rw-r--r--Documentation/translations/zh_CN/riscv/index.rst1
-rw-r--r--Documentation/translations/zh_CN/riscv/vm-layout.rst67
-rw-r--r--Documentation/translations/zh_CN/scheduler/index.rst9
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-energy.rst351
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-nice-design.rst99
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-stats.rst156
-rw-r--r--Documentation/translations/zh_CN/vm/active_mm.rst85
-rw-r--r--Documentation/translations/zh_CN/vm/balance.rst81
-rw-r--r--Documentation/translations/zh_CN/vm/damon/api.rst32
-rw-r--r--Documentation/translations/zh_CN/vm/damon/design.rst139
-rw-r--r--Documentation/translations/zh_CN/vm/damon/faq.rst48
-rw-r--r--Documentation/translations/zh_CN/vm/damon/index.rst33
-rw-r--r--Documentation/translations/zh_CN/vm/free_page_reporting.rst38
-rw-r--r--Documentation/translations/zh_CN/vm/highmem.rst128
-rw-r--r--Documentation/translations/zh_CN/vm/index.rst53
-rw-r--r--Documentation/translations/zh_CN/vm/ksm.rst70
-rw-r--r--Documentation/translations/zh_TW/index.rst4
-rw-r--r--Documentation/userspace-api/ioctl/ioctl-number.rst2
-rw-r--r--Documentation/virt/uml/user_mode_linux_howto_v2.rst6
-rw-r--r--Documentation/vm/page_owner.rst10
-rw-r--r--Documentation/x86/index.rst1
-rw-r--r--Documentation/x86/intel-hfi.rst72
-rw-r--r--Documentation/x86/sva.rst53
179 files changed, 10284 insertions, 2748 deletions
diff --git a/Documentation/ABI/obsolete/procfs-i8k b/Documentation/ABI/obsolete/procfs-i8k
new file mode 100644
index 000000000000..32df4d5bdd15
--- /dev/null
+++ b/Documentation/ABI/obsolete/procfs-i8k
@@ -0,0 +1,10 @@
+What: /proc/i8k
+Date: November 2001
+KernelVersion: 2.4.14
+Contact: Pali Rohár <pali@kernel.org>
+Description: Legacy interface for getting/setting sensor information like
+ fan speed, temperature, serial number, hotkey status etc
+ on Dell Laptops.
+ Since the driver is now using the standard hwmon sysfs interface,
+ the procfs interface is deprecated.
+Users: https://github.com/vitorafsr/i8kutils
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 8dd3e84a8aad..e8797cd09aff 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,55 @@ Description:
last zone of the device which may be smaller.
+What: /sys/block/<disk>/queue/crypto/
+Date: February 2022
+Contact: linux-block@vger.kernel.org
+Description:
+ The presence of this subdirectory of /sys/block/<disk>/queue/
+ indicates that the device supports inline encryption. This
+ subdirectory contains files which describe the inline encryption
+ capabilities of the device. For more information about inline
+ encryption, refer to Documentation/block/inline-encryption.rst.
+
+
+What: /sys/block/<disk>/queue/crypto/max_dun_bits
+Date: February 2022
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] This file shows the maximum length, in bits, of data unit
+ numbers accepted by the device in inline encryption requests.
+
+
+What: /sys/block/<disk>/queue/crypto/modes/<mode>
+Date: February 2022
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] For each crypto mode (i.e., encryption/decryption
+ algorithm) the device supports with inline encryption, a file
+ will exist at this location. It will contain a hexadecimal
+ number that is a bitmask of the supported data unit sizes, in
+ bytes, for that crypto mode.
+
+ Currently, the crypto modes that may be supported are:
+
+ * AES-256-XTS
+ * AES-128-CBC-ESSIV
+ * Adiantum
+
+ For example, if a device supports AES-256-XTS inline encryption
+ with data unit sizes of 512 and 4096 bytes, the file
+ /sys/block/<disk>/queue/crypto/modes/AES-256-XTS will exist and
+ will contain "0x1200".
+
+
+What: /sys/block/<disk>/queue/crypto/num_keyslots
+Date: February 2022
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] This file shows the number of keyslots the device has for
+ use with inline encryption.
+
+
What: /sys/block/<disk>/queue/dax
Date: June 2016
Contact: linux-block@vger.kernel.org
diff --git a/Documentation/ABI/stable/sysfs-devices-system-cpu b/Documentation/ABI/stable/sysfs-devices-system-cpu
index 3965ce504484..902392d7eddf 100644
--- a/Documentation/ABI/stable/sysfs-devices-system-cpu
+++ b/Documentation/ABI/stable/sysfs-devices-system-cpu
@@ -86,6 +86,10 @@ What: /sys/devices/system/cpu/cpuX/topology/die_cpus
Description: internal kernel map of CPUs within the same die.
Values: hexadecimal bitmask.
+What: /sys/devices/system/cpu/cpuX/topology/ppin
+Description: per-socket protected processor inventory number
+Values: hexadecimal.
+
What: /sys/devices/system/cpu/cpuX/topology/die_cpus_list
Description: human-readable list of CPUs within the same die.
The format is like 0-3, 8-11, 14,17.
diff --git a/Documentation/ABI/testing/debugfs-hisi-hpre b/Documentation/ABI/testing/debugfs-hisi-hpre
index b4be5f1db4b7..396de7bc735d 100644
--- a/Documentation/ABI/testing/debugfs-hisi-hpre
+++ b/Documentation/ABI/testing/debugfs-hisi-hpre
@@ -1,140 +1,150 @@
-What: /sys/kernel/debug/hisi_hpre/<bdf>/cluster[0-3]/regs
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Dump debug registers from the HPRE cluster.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/cluster[0-3]/regs
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump debug registers from the HPRE cluster.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/cluster[0-3]/cluster_ctrl
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Write the HPRE core selection in the cluster into this file,
+What: /sys/kernel/debug/hisi_hpre/<bdf>/cluster[0-3]/cluster_ctrl
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Write the HPRE core selection in the cluster into this file,
and then we can read the debug information of the core.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/rdclr_en
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: HPRE cores debug registers read clear control. 1 means enable
+What: /sys/kernel/debug/hisi_hpre/<bdf>/rdclr_en
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: HPRE cores debug registers read clear control. 1 means enable
register read clear, otherwise 0. Writing to this file has no
functional effect, only enable or disable counters clear after
reading of these registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/current_qm
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: One HPRE controller has one PF and multiple VFs, each function
+What: /sys/kernel/debug/hisi_hpre/<bdf>/current_qm
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One HPRE controller has one PF and multiple VFs, each function
has a QM. Select the QM which below qm refers to.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/regs
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Dump debug registers from the HPRE.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/alg_qos
+Date: Jun 2021
+Contact: linux-crypto@vger.kernel.org
+Description: The <bdf> is related the function for PF and VF.
+ HPRE driver supports to configure each function's QoS, the driver
+ supports to write <bdf> value to alg_qos in the host. Such as
+ "echo <bdf> value > alg_qos". The qos value is 1~1000, means
+ 1/1000~1000/1000 of total QoS. The driver reading alg_qos to
+ get related QoS in the host and VM, Such as "cat alg_qos".
+
+What: /sys/kernel/debug/hisi_hpre/<bdf>/regs
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump debug registers from the HPRE.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/regs
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Dump debug registers from the QM.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/regs
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump debug registers from the QM.
Available for PF and VF in host. VF in guest currently only
has one debug register.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/current_q
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: One QM may contain multiple queues. Select specific queue to
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/current_q
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One QM may contain multiple queues. Select specific queue to
show its debug registers in above regs.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/clear_enable
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: QM debug registers(regs) read clear control. 1 means enable
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/clear_enable
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: QM debug registers(regs) read clear control. 1 means enable
register read clear, otherwise 0.
Writing to this file has no functional effect, only enable or
disable counters clear after reading of these registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/err_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of invalid interrupts for
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/err_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of invalid interrupts for
QM task completion.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/aeq_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of QM async event queue interrupts.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/aeq_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of QM async event queue interrupts.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/abnormal_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of interrupts for QM abnormal event.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/abnormal_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of interrupts for QM abnormal event.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/create_qp_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of queue allocation errors.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/create_qp_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of queue allocation errors.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/mb_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of failed QM mailbox commands.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/mb_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of failed QM mailbox commands.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/status
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the status of the QM.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/status
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the status of the QM.
Four states: initiated, started, stopped and closed.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of sent requests.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of sent requests.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/recv_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of received requests.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/recv_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of received requests.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_busy_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of requests sent
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_busy_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of requests sent
with returning busy.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_fail_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of completed but error requests.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_fail_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of completed but error requests.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/invalid_req_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of invalid requests being received.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/invalid_req_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of invalid requests being received.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/overtime_thrhld
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Set the threshold time for counting the request which is
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/overtime_thrhld
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Set the threshold time for counting the request which is
processed longer than the threshold.
0: disable(default), 1: 1 microsecond.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/over_thrhld_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of time out requests.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/over_thrhld_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of time out requests.
Available for both PF and VF, and take no other effect on HPRE.
diff --git a/Documentation/ABI/testing/debugfs-hisi-sec b/Documentation/ABI/testing/debugfs-hisi-sec
index 85feb4408e0f..2bf84ced484b 100644
--- a/Documentation/ABI/testing/debugfs-hisi-sec
+++ b/Documentation/ABI/testing/debugfs-hisi-sec
@@ -1,113 +1,123 @@
-What: /sys/kernel/debug/hisi_sec2/<bdf>/clear_enable
-Date: Oct 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Enabling/disabling of clear action after reading
+What: /sys/kernel/debug/hisi_sec2/<bdf>/clear_enable
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Enabling/disabling of clear action after reading
the SEC debug registers.
0: disable, 1: enable.
Only available for PF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/current_qm
-Date: Oct 2019
-Contact: linux-crypto@vger.kernel.org
-Description: One SEC controller has one PF and multiple VFs, each function
+What: /sys/kernel/debug/hisi_sec2/<bdf>/current_qm
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One SEC controller has one PF and multiple VFs, each function
has a QM. This file can be used to select the QM which below
qm refers to.
Only available for PF.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/qm_regs
-Date: Oct 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Dump of QM related debug registers.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/alg_qos
+Date: Jun 2021
+Contact: linux-crypto@vger.kernel.org
+Description: The <bdf> is related the function for PF and VF.
+ SEC driver supports to configure each function's QoS, the driver
+ supports to write <bdf> value to alg_qos in the host. Such as
+ "echo <bdf> value > alg_qos". The qos value is 1~1000, means
+ 1/1000~1000/1000 of total QoS. The driver reading alg_qos to
+ get related QoS in the host and VM, Such as "cat alg_qos".
+
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/qm_regs
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump of QM related debug registers.
Available for PF and VF in host. VF in guest currently only
has one debug register.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/current_q
-Date: Oct 2019
-Contact: linux-crypto@vger.kernel.org
-Description: One QM of SEC may contain multiple queues. Select specific
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/current_q
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One QM of SEC may contain multiple queues. Select specific
queue to show its debug registers in above 'regs'.
Only available for PF.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/clear_enable
-Date: Oct 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Enabling/disabling of clear action after reading
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/clear_enable
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Enabling/disabling of clear action after reading
the SEC's QM debug registers.
0: disable, 1: enable.
Only available for PF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/err_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of invalid interrupts for
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/err_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of invalid interrupts for
QM task completion.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/aeq_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of QM async event queue interrupts.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/aeq_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of QM async event queue interrupts.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/abnormal_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of interrupts for QM abnormal event.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/abnormal_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of interrupts for QM abnormal event.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/create_qp_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of queue allocation errors.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/create_qp_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of queue allocation errors.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/mb_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of failed QM mailbox commands.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/mb_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of failed QM mailbox commands.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/status
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the status of the QM.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/status
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the status of the QM.
Four states: initiated, started, stopped and closed.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/send_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of sent requests.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/send_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of sent requests.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/recv_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of received requests.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/recv_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of received requests.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/send_busy_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of requests sent with returning busy.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/send_busy_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of requests sent with returning busy.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/err_bd_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of BD type error requests
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/err_bd_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of BD type error requests
to be received.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/invalid_req_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of invalid requests being received.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/invalid_req_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of invalid requests being received.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/done_flag_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of completed but marked error requests
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/done_flag_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of completed but marked error requests
to be received.
Available for both PF and VF, and take no other effect on SEC.
diff --git a/Documentation/ABI/testing/debugfs-hisi-zip b/Documentation/ABI/testing/debugfs-hisi-zip
index 3034a2bf99ca..bf1258bc6495 100644
--- a/Documentation/ABI/testing/debugfs-hisi-zip
+++ b/Documentation/ABI/testing/debugfs-hisi-zip
@@ -1,114 +1,124 @@
-What: /sys/kernel/debug/hisi_zip/<bdf>/comp_core[01]/regs
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: Dump of compression cores related debug registers.
+What: /sys/kernel/debug/hisi_zip/<bdf>/comp_core[01]/regs
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: Dump of compression cores related debug registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/decomp_core[0-5]/regs
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: Dump of decompression cores related debug registers.
+What: /sys/kernel/debug/hisi_zip/<bdf>/decomp_core[0-5]/regs
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: Dump of decompression cores related debug registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/clear_enable
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: Compression/decompression core debug registers read clear
+What: /sys/kernel/debug/hisi_zip/<bdf>/clear_enable
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: Compression/decompression core debug registers read clear
control. 1 means enable register read clear, otherwise 0.
Writing to this file has no functional effect, only enable or
disable counters clear after reading of these registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/current_qm
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: One ZIP controller has one PF and multiple VFs, each function
+What: /sys/kernel/debug/hisi_zip/<bdf>/current_qm
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: One ZIP controller has one PF and multiple VFs, each function
has a QM. Select the QM which below qm refers to.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/regs
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: Dump of QM related debug registers.
+What: /sys/kernel/debug/hisi_zip/<bdf>/alg_qos
+Date: Jun 2021
+Contact: linux-crypto@vger.kernel.org
+Description: The <bdf> is related the function for PF and VF.
+ ZIP driver supports to configure each function's QoS, the driver
+ supports to write <bdf> value to alg_qos in the host. Such as
+ "echo <bdf> value > alg_qos". The qos value is 1~1000, means
+ 1/1000~1000/1000 of total QoS. The driver reading alg_qos to
+ get related QoS in the host and VM, Such as "cat alg_qos".
+
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/regs
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: Dump of QM related debug registers.
Available for PF and VF in host. VF in guest currently only
has one debug register.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/current_q
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: One QM may contain multiple queues. Select specific queue to
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/current_q
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: One QM may contain multiple queues. Select specific queue to
show its debug registers in above regs.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/clear_enable
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: QM debug registers(regs) read clear control. 1 means enable
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/clear_enable
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: QM debug registers(regs) read clear control. 1 means enable
register read clear, otherwise 0.
Writing to this file has no functional effect, only enable or
disable counters clear after reading of these registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/err_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of invalid interrupts for
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/err_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of invalid interrupts for
QM task completion.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/aeq_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of QM async event queue interrupts.
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/aeq_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of QM async event queue interrupts.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/abnormal_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of interrupts for QM abnormal event.
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/abnormal_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of interrupts for QM abnormal event.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/create_qp_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of queue allocation errors.
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/create_qp_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of queue allocation errors.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/mb_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of failed QM mailbox commands.
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/mb_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of failed QM mailbox commands.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/status
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the status of the QM.
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/status
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the status of the QM.
Four states: initiated, started, stopped and closed.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/send_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of sent requests.
+What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/send_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of sent requests.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/recv_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of received requests.
+What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/recv_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of received requests.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/send_busy_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of requests received
+What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/send_busy_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of requests received
with returning busy.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/err_bd_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of BD type error requests
+What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/err_bd_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of BD type error requests
to be received.
Available for both PF and VF, and take no other effect on ZIP.
diff --git a/Documentation/ABI/testing/sysfs-class-hwmon b/Documentation/ABI/testing/sysfs-class-hwmon
index 1f20687def44..653d4c75eddb 100644
--- a/Documentation/ABI/testing/sysfs-class-hwmon
+++ b/Documentation/ABI/testing/sysfs-class-hwmon
@@ -9,6 +9,14 @@ Description:
RO
+What: /sys/class/hwmon/hwmonX/label
+Description:
+ A descriptive label that allows to uniquely identify a
+ device within the system.
+ The contents of the label are free-form.
+
+ RO
+
What: /sys/class/hwmon/hwmonX/update_interval
Description:
The interval at which the chip will update readings.
diff --git a/Documentation/ABI/testing/sysfs-class-thermal b/Documentation/ABI/testing/sysfs-class-thermal
index 2c52bb1f864c..8eee37982b2a 100644
--- a/Documentation/ABI/testing/sysfs-class-thermal
+++ b/Documentation/ABI/testing/sysfs-class-thermal
@@ -203,7 +203,7 @@ Description:
- for generic ACPI: should be "Fan", "Processor" or "LCD"
- for memory controller device on intel_menlow platform:
- should be "Memory controller".
+ should be "Memory controller".
RO, Required
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 61f5676a7429..2ad01cad7f1c 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -73,6 +73,7 @@ What: /sys/devices/system/cpu/cpuX/topology/core_id
/sys/devices/system/cpu/cpuX/topology/physical_package_id
/sys/devices/system/cpu/cpuX/topology/thread_siblings
/sys/devices/system/cpu/cpuX/topology/thread_siblings_list
+ /sys/devices/system/cpu/cpuX/topology/ppin
Date: December 2008
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description: CPU topology files that describe a logical CPU's relationship
@@ -103,6 +104,11 @@ Description: CPU topology files that describe a logical CPU's relationship
thread_siblings_list: human-readable list of cpuX's hardware
threads within the same core as cpuX
+ ppin: human-readable Protected Processor Identification
+ Number of the socket the cpu# belongs to. There should be
+ one per physical_package_id. File is readable only to
+ admin.
+
See Documentation/admin-guide/cputopology.rst for more information.
@@ -662,6 +668,7 @@ Description: Preferred MTE tag checking mode
================ ==============================================
"sync" Prefer synchronous mode
+ "asymm" Prefer asymmetric mode
"async" Prefer asynchronous mode
================ ==============================================
diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs
index 2416b03ff283..9b583dd0298b 100644
--- a/Documentation/ABI/testing/sysfs-fs-f2fs
+++ b/Documentation/ABI/testing/sysfs-fs-f2fs
@@ -55,8 +55,9 @@ Description: Controls the in-place-update policy.
0x04 F2FS_IPU_UTIL
0x08 F2FS_IPU_SSR_UTIL
0x10 F2FS_IPU_FSYNC
- 0x20 F2FS_IPU_ASYNC,
+ 0x20 F2FS_IPU_ASYNC
0x40 F2FS_IPU_NOCACHE
+ 0x80 F2FS_IPU_HONOR_OPU_WRITE
==== =================
Refer segment.h for details.
@@ -98,6 +99,33 @@ Description: Controls the issue rate of discard commands that consist of small
checkpoint is triggered, and issued during the checkpoint.
By default, it is disabled with 0.
+What: /sys/fs/f2fs/<disk>/max_discard_request
+Date: December 2021
+Contact: "Konstantin Vyshetsky" <vkon@google.com>
+Description: Controls the number of discards a thread will issue at a time.
+ Higher number will allow the discard thread to finish its work
+ faster, at the cost of higher latency for incomming I/O.
+
+What: /sys/fs/f2fs/<disk>/min_discard_issue_time
+Date: December 2021
+Contact: "Konstantin Vyshetsky" <vkon@google.com>
+Description: Controls the interval the discard thread will wait between
+ issuing discard requests when there are discards to be issued and
+ no I/O aware interruptions occur.
+
+What: /sys/fs/f2fs/<disk>/mid_discard_issue_time
+Date: December 2021
+Contact: "Konstantin Vyshetsky" <vkon@google.com>
+Description: Controls the interval the discard thread will wait between
+ issuing discard requests when there are discards to be issued and
+ an I/O aware interruption occurs.
+
+What: /sys/fs/f2fs/<disk>/max_discard_issue_time
+Date: December 2021
+Contact: "Konstantin Vyshetsky" <vkon@google.com>
+Description: Controls the interval the discard thread will wait when there are
+ no discard operations to be issued.
+
What: /sys/fs/f2fs/<disk>/discard_granularity
Date: July 2017
Contact: "Chao Yu" <yuchao0@huawei.com>
@@ -269,11 +297,16 @@ Description: Shows current reserved blocks in system, it may be temporarily
What: /sys/fs/f2fs/<disk>/gc_urgent
Date: August 2017
Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
-Description: Do background GC aggressively when set. When gc_urgent = 1,
- background thread starts to do GC by given gc_urgent_sleep_time
- interval. When gc_urgent = 2, F2FS will lower the bar of
- checking idle in order to process outstanding discard commands
- and GC a little bit aggressively. It is set to 0 by default.
+Description: Do background GC aggressively when set. Set to 0 by default.
+ gc urgent high(1): does GC forcibly in a period of given
+ gc_urgent_sleep_time and ignores I/O idling check. uses greedy
+ GC approach and turns SSR mode on.
+ gc urgent low(2): lowers the bar of checking I/O idling in
+ order to process outstanding discard commands and GC a
+ little bit aggressively. uses cost benefit GC approach.
+ gc urgent mid(3): does GC forcibly in a period of given
+ gc_urgent_sleep_time and executes a mid level of I/O idling check.
+ uses cost benefit GC approach.
What: /sys/fs/f2fs/<disk>/gc_urgent_sleep_time
Date: August 2017
@@ -430,6 +463,7 @@ Description: Show status of f2fs superblock in real time.
0x800 SBI_QUOTA_SKIP_FLUSH skip flushing quota in current CP
0x1000 SBI_QUOTA_NEED_REPAIR quota file may be corrupted
0x2000 SBI_IS_RESIZEFS resizefs is in process
+ 0x4000 SBI_IS_FREEZING freefs is in process
====== ===================== =================================
What: /sys/fs/f2fs/<disk>/ckpt_thread_ioprio
@@ -503,7 +537,7 @@ Date: July 2021
Contact: "Daeho Jeong" <daehojeong@google.com>
Description: Show how many segments have been reclaimed by GC during a specific
GC mode (0: GC normal, 1: GC idle CB, 2: GC idle greedy,
- 3: GC idle AT, 4: GC urgent high, 5: GC urgent low)
+ 3: GC idle AT, 4: GC urgent high, 5: GC urgent low 6: GC urgent mid)
You can re-initialize this value to "0".
What: /sys/fs/f2fs/<disk>/gc_segment_mode
@@ -540,3 +574,9 @@ Contact: "Daeho Jeong" <daehojeong@google.com>
Description: You can set the trial count limit for GC urgent high mode with this value.
If GC thread gets to the limit, the mode will turn back to GC normal mode.
By default, the value is zero, which means there is no limit like before.
+
+What: /sys/fs/f2fs/<disk>/max_roll_forward_node_blocks
+Date: January 2022
+Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
+Description: Controls max # of node block writes to be used for roll forward
+ recovery. This can limit the roll forward recovery time.
diff --git a/Documentation/Makefile b/Documentation/Makefile
index 9f4bd42cef18..64d44c1ecad3 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -26,7 +26,7 @@ SPHINX_CONF = conf.py
PAPER =
BUILDDIR = $(obj)/output
PDFLATEX = xelatex
-LATEXOPTS = -interaction=batchmode
+LATEXOPTS = -interaction=batchmode -no-shell-escape
ifeq ($(KBUILD_VERBOSE),0)
SPHINXOPTS += "-q"
diff --git a/Documentation/admin-guide/acpi/fan_performance_states.rst b/Documentation/admin-guide/acpi/fan_performance_states.rst
index 98fe5c333121..b9e4b4d146c1 100644
--- a/Documentation/admin-guide/acpi/fan_performance_states.rst
+++ b/Documentation/admin-guide/acpi/fan_performance_states.rst
@@ -60,3 +60,31 @@ For example::
When a given field is not populated or its value provided by the platform
firmware is invalid, the "not-defined" string is shown instead of the value.
+
+ACPI Fan Fine Grain Control
+=============================
+
+When _FIF object specifies support for fine grain control, then fan speed
+can be set from 0 to 100% with the recommended minimum "step size" via
+_FSL object. User can adjust fan speed using thermal sysfs cooling device.
+
+Here use can look at fan performance states for a reference speed (speed_rpm)
+and set it by changing cooling device cur_state. If the fine grain control
+is supported then user can also adjust to some other speeds which are
+not defined in the performance states.
+
+The support of fine grain control is presented via sysfs attribute
+"fine_grain_control". If fine grain control is present, this attribute
+will show "1" otherwise "0".
+
+This sysfs attribute is presented in the same directory as performance states.
+
+ACPI Fan Performance Feedback
+=============================
+
+The optional _FST object provides status information for the fan device.
+This includes field to provide current fan speed in revolutions per minute
+at which the fan is rotating.
+
+This speed is presented in the sysfs using the attribute "fan_speed_rpm",
+in the same directory as performance states.
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index 3e11926a4df9..54fe63745ed8 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -315,8 +315,8 @@ To use the feature, admin should set up backing device via::
echo /dev/sda5 > /sys/block/zramX/backing_dev
-before disksize setting. It supports only partition at this moment.
-If admin wants to use incompressible page writeback, they could do via::
+before disksize setting. It supports only partitions at this moment.
+If admin wants to use incompressible page writeback, they could do it via::
echo huge > /sys/block/zramX/writeback
@@ -341,9 +341,9 @@ Admin can request writeback of those idle pages at right timing via::
echo idle > /sys/block/zramX/writeback
-With the command, zram writeback idle pages from memory to the storage.
+With the command, zram will writeback idle pages from memory to the storage.
-If admin want to write a specific page in zram device to backing device,
+If an admin wants to write a specific page in zram device to the backing device,
they could write a page index into the interface.
echo "page_index=1251" > /sys/block/zramX/writeback
@@ -354,7 +354,7 @@ to guarantee storage health for entire product life.
To overcome the concern, zram supports "writeback_limit" feature.
The "writeback_limit_enable"'s default value is 0 so that it doesn't limit
-any writeback. IOW, if admin wants to apply writeback budget, he should
+any writeback. IOW, if admin wants to apply writeback budget, they should
enable writeback_limit_enable via::
$ echo 1 > /sys/block/zramX/writeback_limit_enable
@@ -365,7 +365,7 @@ until admin sets the budget via /sys/block/zramX/writeback_limit.
(If admin doesn't enable writeback_limit_enable, writeback_limit's value
assigned via /sys/block/zramX/writeback_limit is meaningless.)
-If admin want to limit writeback as per-day 400M, he could do it
+If admin wants to limit writeback as per-day 400M, they could do it
like below::
$ MB_SHIFT=20
@@ -375,16 +375,16 @@ like below::
$ echo 1 > /sys/block/zram0/writeback_limit_enable
If admins want to allow further write again once the budget is exhausted,
-he could do it like below::
+they could do it like below::
$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit
-If admin wants to see remaining writeback budget since last set::
+If an admin wants to see the remaining writeback budget since last set::
$ cat /sys/block/zramX/writeback_limit
-If admin want to disable writeback limit, he could do::
+If an admin wants to disable writeback limit, they could do::
$ echo 0 > /sys/block/zramX/writeback_limit_enable
@@ -393,7 +393,7 @@ system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of
writeback happened until you reset the zram to allocate extra writeback
budget in next setting is user's job.
-If admin wants to measure writeback count in a certain period, he could
+If admin wants to measure writeback count in a certain period, they could
know it via /sys/block/zram0/bd_stat's 3rd column.
memory tracking
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 1bedab498104..5bfafcbb9562 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -35,6 +35,7 @@ problems and bugs in particular.
:maxdepth: 1
reporting-issues
+ reporting-regressions
security-bugs
bug-hunting
bug-bisect
diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
index 9b14b0c2c9c4..609a3201fd4e 100644
--- a/Documentation/admin-guide/iostats.rst
+++ b/Documentation/admin-guide/iostats.rst
@@ -76,7 +76,7 @@ Field 3 -- # of sectors read (unsigned long)
Field 4 -- # of milliseconds spent reading (unsigned int)
This is the total number of milliseconds spent by all reads (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 5 -- # of writes completed (unsigned long)
This is the total number of writes completed successfully.
@@ -89,7 +89,7 @@ Field 7 -- # of sectors written (unsigned long)
Field 8 -- # of milliseconds spent writing (unsigned int)
This is the total number of milliseconds spent by all writes (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 9 -- # of I/Os currently in progress (unsigned int)
The only field that should go to zero. Incremented as requests are
@@ -120,7 +120,7 @@ Field 14 -- # of sectors discarded (unsigned long)
Field 15 -- # of milliseconds spent discarding (unsigned int)
This is the total number of milliseconds spent by all discards (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 16 -- # of flush requests completed
This is the total number of flush requests completed successfully.
diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 3861a25faae1..8419019b6a88 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -494,6 +494,14 @@ architecture which is used to lookup the page-tables for the Virtual
addresses in the higher VA range (refer to ARMv8 ARM document for
more details).
+MODULES_VADDR|MODULES_END|VMALLOC_START|VMALLOC_END|VMEMMAP_START|VMEMMAP_END
+-----------------------------------------------------------------------------
+
+Used to get the correct ranges:
+ MODULES_VADDR ~ MODULES_END-1 : Kernel module space.
+ VMALLOC_START ~ VMALLOC_END-1 : vmalloc() / ioremap() space.
+ VMEMMAP_START ~ VMEMMAP_END-1 : vmemmap region, used for struct page array.
+
arm
===
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 7123524a86b8..486023f7209a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -944,6 +944,30 @@
dump out devices still on the deferred probe list after
retrying.
+ dell_smm_hwmon.ignore_dmi=
+ [HW] Continue probing hardware even if DMI data
+ indicates that the driver is running on unsupported
+ hardware.
+
+ dell_smm_hwmon.force=
+ [HW] Activate driver even if SMM BIOS signature does
+ not match list of supported models and enable otherwise
+ blacklisted features.
+
+ dell_smm_hwmon.power_status=
+ [HW] Report power status in /proc/i8k
+ (disabled by default).
+
+ dell_smm_hwmon.restricted=
+ [HW] Allow controlling fans only if SYS_ADMIN
+ capability is set.
+
+ dell_smm_hwmon.fan_mult=
+ [HW] Factor to multiply fan speed with.
+
+ dell_smm_hwmon.fan_max=
+ [HW] Maximum configurable fan speed.
+
dfltcc= [HW,S390]
Format: { on | off | def_only | inf_only | always }
on: s390 zlib hardware support for compression on
@@ -1703,17 +1727,6 @@
i810= [HW,DRM]
- i8k.ignore_dmi [HW] Continue probing hardware even if DMI data
- indicates that the driver is running on unsupported
- hardware.
- i8k.force [HW] Activate i8k driver even if SMM BIOS signature
- does not match list of supported models.
- i8k.power_status
- [HW] Report power status in /proc/i8k
- (disabled by default)
- i8k.restricted [HW] Allow controlling fans only if SYS_ADMIN
- capability is set.
-
i915.invert_brightness=
[DRM] Invert the sense of the variable that is used to
set the brightness of the panel backlight. Normally a
@@ -2827,6 +2840,9 @@
For details see: Documentation/admin-guide/hw-vuln/mds.rst
+ mem=nn[KMG] [HEXAGON] Set the memory size.
+ Must be specified, otherwise memory size will be 0.
+
mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory
Amount of memory to be used in cases as follows:
@@ -2834,6 +2850,13 @@
2 when the kernel is not able to see the whole system memory;
3 memory that lies after 'mem=' boundary is excluded from
the hypervisor, then assigned to KVM guests.
+ 4 to limit the memory available for kdump kernel.
+
+ [ARC,MICROBLAZE] - the limit applies only to low memory,
+ high memory is not affected.
+
+ [ARM64] - only limits memory covered by the linear
+ mapping. The NOMAP regions are not affected.
[X86] Work as limiting max address. Use together
with memmap= to avoid physical address space collisions.
@@ -2844,6 +2867,14 @@
in above case 3, memory may need be hot added after boot
if system memory of hypervisor is not sufficient.
+ mem=nn[KMG]@ss[KMG]
+ [ARM,MIPS] - override the memory layout reported by
+ firmware.
+ Define a memory region of size nn[KMG] starting at
+ ss[KMG].
+ Multiple different regions can be specified with
+ multiple mem= parameters on the command line.
+
mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel
memory.
@@ -4504,6 +4535,8 @@
(the least-favored priority). Otherwise, when
RCU_BOOST is not set, valid values are 0-99 and
the default is zero (non-realtime operation).
+ When RCU_NOCB_CPU is set, also adjust the
+ priority of NOCB callback kthreads.
rcutree.rcu_nocb_gp_stride= [KNL]
Set the number of NOCB callback kthreads in
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index 5a8f2529a033..69b23f087c05 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -8,6 +8,7 @@ Performance monitor support
:maxdepth: 1
hisi-pmu
+ hisi-pcie-pmu
imx-ddr
qcom_l2_pmu
qcom_l3_pmu
diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst
index 2f066df4ee9c..1923cb25073b 100644
--- a/Documentation/admin-guide/pm/amd-pstate.rst
+++ b/Documentation/admin-guide/pm/amd-pstate.rst
@@ -369,6 +369,32 @@ governor (for the policies it is attached to), or by the ``CPUFreq`` core (for t
policies with other scaling governors).
+Tracer Tool
+-------------
+
+``amd_pstate_tracer.py`` can record and parse ``amd-pstate`` trace log, then
+generate performance plots. This utility can be used to debug and tune the
+performance of ``amd-pstate`` driver. The tracer tool needs to import intel
+pstate tracer.
+
+Tracer tool located in ``linux/tools/power/x86/amd_pstate_tracer``. It can be
+used in two ways. If trace file is available, then directly parse the file
+with command ::
+
+ ./amd_pstate_trace.py [-c cpus] -t <trace_file> -n <test_name>
+
+Or generate trace file with root privilege, then parse and plot with command ::
+
+ sudo ./amd_pstate_trace.py [-c cpus] -n <test_name> -i <interval> [-m kbytes]
+
+The test result can be found in ``results/test_name``. Following is the example
+about part of the output. ::
+
+ common_cpu common_secs common_usecs min_perf des_perf max_perf freq mperf apef tsc load duration_ms sample_num elapsed_time common_comm
+ CPU_005 712 116384 39 49 166 0.7565 9645075 2214891 38431470 25.1 11.646 469 2.496 kworker/5:0-40
+ CPU_006 712 116408 39 49 166 0.6769 8950227 1839034 37192089 24.06 11.272 470 2.496 kworker/6:0-1264
+
+
Reference
===========
diff --git a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
new file mode 100644
index 000000000000..09169d935835
--- /dev/null
+++ b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
@@ -0,0 +1,60 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+==============================
+Intel Uncore Frequency Scaling
+==============================
+
+:Copyright: |copy| 2022 Intel Corporation
+
+:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
+
+Introduction
+------------
+
+The uncore can consume significant amount of power in Intel's Xeon servers based
+on the workload characteristics. To optimize the total power and improve overall
+performance, SoCs have internal algorithms for scaling uncore frequency. These
+algorithms monitor workload usage of uncore and set a desirable frequency.
+
+It is possible that users have different expectations of uncore performance and
+want to have control over it. The objective is similar to allowing users to set
+the scaling min/max frequencies via cpufreq sysfs to improve CPU performance.
+Users may have some latency sensitive workloads where they do not want any
+change to uncore frequency. Also, users may have workloads which require
+different core and uncore performance at distinct phases and they may want to
+use both cpufreq and the uncore scaling interface to distribute power and
+improve overall performance.
+
+Sysfs Interface
+---------------
+
+To control uncore frequency, a sysfs interface is provided in the directory:
+`/sys/devices/system/cpu/intel_uncore_frequency/`.
+
+There is one directory for each package and die combination as the scope of
+uncore scaling control is per die in multiple die/package SoCs or per
+package for single die per package SoCs. The name represents the
+scope of control. For example: 'package_00_die_00' is for package id 0 and
+die 0.
+
+Each package_*_die_* contains the following attributes:
+
+``initial_max_freq_khz``
+ Out of reset, this attribute represent the maximum possible frequency.
+ This is a read-only attribute. If users adjust max_freq_khz,
+ they can always go back to maximum using the value from this attribute.
+
+``initial_min_freq_khz``
+ Out of reset, this attribute represent the minimum possible frequency.
+ This is a read-only attribute. If users adjust min_freq_khz,
+ they can always go back to minimum using the value from this attribute.
+
+``max_freq_khz``
+ This attribute is used to set the maximum uncore frequency.
+
+``min_freq_khz``
+ This attribute is used to set the minimum uncore frequency.
+
+``current_freq_khz``
+ This attribute is used to get the current uncore frequency.
diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst
index 5d2757e2de65..ee45887811ff 100644
--- a/Documentation/admin-guide/pm/working-state.rst
+++ b/Documentation/admin-guide/pm/working-state.rst
@@ -15,3 +15,4 @@ Working-State Power Management
cpufreq_drivers
intel_epb
intel-speed-select
+ intel_uncore_frequency_scaling
diff --git a/Documentation/admin-guide/reporting-issues.rst b/Documentation/admin-guide/reporting-issues.rst
index d7ac13f789cc..ec62151fe672 100644
--- a/Documentation/admin-guide/reporting-issues.rst
+++ b/Documentation/admin-guide/reporting-issues.rst
@@ -1,14 +1,5 @@
.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
-..
- If you want to distribute this text under CC-BY-4.0 only, please use 'The
- Linux kernel developers' for author attribution and link this as source:
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-issues.rst
-..
- Note: Only the content of this RST file as found in the Linux kernel sources
- is available under CC-BY-4.0, as versions of this text that were processed
- (for example by the kernel's build system) might contain content taken from
- files which use a more restrictive license.
-
+.. See the bottom of this file for additional redistribution information.
Reporting issues
++++++++++++++++
@@ -395,22 +386,16 @@ fixed as soon as possible, hence there are 'issues of high priority' that get
handled slightly differently in the reporting process. Three type of cases
qualify: regressions, security issues, and really severe problems.
-You deal with a 'regression' if something that worked with an older version of
-the Linux kernel does not work with a newer one or somehow works worse with it.
-It thus is a regression when a WiFi driver that did a fine job with Linux 5.7
-somehow misbehaves with 5.8 or doesn't work at all. It's also a regression if
-an application shows erratic behavior with a newer kernel, which might happen
-due to incompatible changes in the interface between the kernel and the
-userland (like procfs and sysfs). Significantly reduced performance or
-increased power consumption also qualify as regression. But keep in mind: the
-new kernel needs to be built with a configuration that is similar to the one
-from the old kernel (see below how to achieve that). That's because the kernel
-developers sometimes can not avoid incompatibilities when implementing new
-features; but to avoid regressions such features have to be enabled explicitly
-during build time configuration.
+You deal with a regression if some application or practical use case running
+fine with one Linux kernel works worse or not at all with a newer version
+compiled using a similar configuration. The document
+Documentation/admin-guide/reporting-regressions.rst explains this in more
+detail. It also provides a good deal of other information about regressions you
+might want to be aware of; it for example explains how to add your issue to the
+list of tracked regressions, to ensure it won't fall through the cracks.
What qualifies as security issue is left to your judgment. Consider reading
-'Documentation/admin-guide/security-bugs.rst' before proceeding, as it
+Documentation/admin-guide/security-bugs.rst before proceeding, as it
provides additional details how to best handle security issues.
An issue is a 'really severe problem' when something totally unacceptably bad
@@ -517,7 +502,7 @@ line starting with 'CPU:'. It should end with 'Not tainted' if the kernel was
not tainted when it noticed the problem; it was tainted if you see 'Tainted:'
followed by a few spaces and some letters.
-If your kernel is tainted, study 'Documentation/admin-guide/tainted-kernels.rst'
+If your kernel is tainted, study Documentation/admin-guide/tainted-kernels.rst
to find out why. Try to eliminate the reason. Often it's caused by one these
three things:
@@ -1043,7 +1028,7 @@ down the culprit, as maintainers often won't have the time or setup at hand to
reproduce it themselves.
To find the change there is a process called 'bisection' which the document
-'Documentation/admin-guide/bug-bisect.rst' describes in detail. That process
+Documentation/admin-guide/bug-bisect.rst describes in detail. That process
will often require you to build about ten to twenty kernel images, trying to
reproduce the issue with each of them before building the next. Yes, that takes
some time, but don't worry, it works a lot quicker than most people assume.
@@ -1073,10 +1058,11 @@ When dealing with regressions make sure the issue you face is really caused by
the kernel and not by something else, as outlined above already.
In the whole process keep in mind: an issue only qualifies as regression if the
-older and the newer kernel got built with a similar configuration. The best way
-to archive this: copy the configuration file (``.config``) from the old working
-kernel freshly to each newer kernel version you try. Afterwards run ``make
-olddefconfig`` to adjust it for the needs of the new version.
+older and the newer kernel got built with a similar configuration. This can be
+achieved by using ``make olddefconfig``, as explained in more detail by
+Documentation/admin-guide/reporting-regressions.rst; that document also
+provides a good deal of other information about regressions you might want to be
+aware of.
Write and send the report
@@ -1283,7 +1269,7 @@ them when sending the report by mail. If you filed it in a bug tracker, forward
the report's text to these addresses; but on top of it put a small note where
you mention that you filed it with a link to the ticket.
-See 'Documentation/admin-guide/security-bugs.rst' for more information.
+See Documentation/admin-guide/security-bugs.rst for more information.
Duties after the report went out
@@ -1571,7 +1557,7 @@ Once your report is out your might get asked to do a proper one, as it allows to
pinpoint the exact change that causes the issue (which then can easily get
reverted to fix the issue quickly). Hence consider to do a proper bisection
right away if time permits. See the section 'Special care for regressions' and
-the document 'Documentation/admin-guide/bug-bisect.rst' for details how to
+the document Documentation/admin-guide/bug-bisect.rst for details how to
perform one. In case of a successful bisection add the author of the culprit to
the recipients; also CC everyone in the signed-off-by chain, which you find at
the end of its commit message.
@@ -1594,7 +1580,7 @@ Some fixes are too complex
Even small and seemingly obvious code-changes sometimes introduce new and
totally unexpected problems. The maintainers of the stable and longterm kernels
are very aware of that and thus only apply changes to these kernels that are
-within rules outlined in 'Documentation/process/stable-kernel-rules.rst'.
+within rules outlined in Documentation/process/stable-kernel-rules.rst.
Complex or risky changes for example do not qualify and thus only get applied
to mainline. Other fixes are easy to get backported to the newest stable and
@@ -1756,10 +1742,23 @@ art will lay some groundwork to improve the situation over time.
..
- This text is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If you
- spot a typo or small mistake, feel free to let him know directly and he'll
- fix it. You are free to do the same in a mostly informal way if you want
- to contribute changes to the text, but for copyright reasons please CC
+ end-of-content
+..
+ This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If
+ you spot a typo or small mistake, feel free to let him know directly and
+ he'll fix it. You are free to do the same in a mostly informal way if you
+ want to contribute changes to the text, but for copyright reasons please CC
linux-doc@vger.kernel.org and "sign-off" your contribution as
Documentation/process/submitting-patches.rst outlines in the section "Sign
your work - the Developer's Certificate of Origin".
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use "The Linux kernel developers" for author attribution and link
+ this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-issues.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/admin-guide/reporting-regressions.rst b/Documentation/admin-guide/reporting-regressions.rst
new file mode 100644
index 000000000000..d8adccdae23f
--- /dev/null
+++ b/Documentation/admin-guide/reporting-regressions.rst
@@ -0,0 +1,451 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+.. [see the bottom of this file for redistribution information]
+
+Reporting regressions
++++++++++++++++++++++
+
+"*We don't cause regressions*" is the first rule of Linux kernel development;
+Linux founder and lead developer Linus Torvalds established it himself and
+ensures it's obeyed.
+
+This document describes what the rule means for users and how the Linux kernel's
+development model ensures to address all reported regressions; aspects relevant
+for kernel developers are left to Documentation/process/handling-regressions.rst.
+
+
+The important bits (aka "TL;DR")
+================================
+
+#. It's a regression if something running fine with one Linux kernel works worse
+ or not at all with a newer version. Note, the newer kernel has to be compiled
+ using a similar configuration; the detailed explanations below describes this
+ and other fine print in more detail.
+
+#. Report your issue as outlined in Documentation/admin-guide/reporting-issues.rst,
+ it already covers all aspects important for regressions and repeated
+ below for convenience. Two of them are important: start your report's subject
+ with "[REGRESSION]" and CC or forward it to `the regression mailing list
+ <https://lore.kernel.org/regressions/>`_ (regressions@lists.linux.dev).
+
+#. Optional, but recommended: when sending or forwarding your report, make the
+ Linux kernel regression tracking bot "regzbot" track the issue by specifying
+ when the regression started like this::
+
+ #regzbot introduced v5.13..v5.14-rc1
+
+
+All the details on Linux kernel regressions relevant for users
+==============================================================
+
+
+The important basics
+--------------------
+
+
+What is a "regression" and what is the "no regressions rule"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's a regression if some application or practical use case running fine with
+one Linux kernel works worse or not at all with a newer version compiled using a
+similar configuration. The "no regressions rule" forbids this to take place; if
+it happens by accident, developers that caused it are expected to quickly fix
+the issue.
+
+It thus is a regression when a WiFi driver from Linux 5.13 works fine, but with
+5.14 doesn't work at all, works significantly slower, or misbehaves somehow.
+It's also a regression if a perfectly working application suddenly shows erratic
+behavior with a newer kernel version; such issues can be caused by changes in
+procfs, sysfs, or one of the many other interfaces Linux provides to userland
+software. But keep in mind, as mentioned earlier: 5.14 in this example needs to
+be built from a configuration similar to the one from 5.13. This can be achieved
+using ``make olddefconfig``, as explained in more detail below.
+
+Note the "practical use case" in the first sentence of this section: developers
+despite the "no regressions" rule are free to change any aspect of the kernel
+and even APIs or ABIs to userland, as long as no existing application or use
+case breaks.
+
+Also be aware the "no regressions" rule covers only interfaces the kernel
+provides to the userland. It thus does not apply to kernel-internal interfaces
+like the module API, which some externally developed drivers use to hook into
+the kernel.
+
+How do I report a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Just report the issue as outlined in
+Documentation/admin-guide/reporting-issues.rst, it already describes the
+important points. The following aspects outlined there are especially relevant
+for regressions:
+
+ * When checking for existing reports to join, also search the `archives of the
+ Linux regressions mailing list <https://lore.kernel.org/regressions/>`_ and
+ `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
+
+ * Start your report's subject with "[REGRESSION]".
+
+ * In your report, clearly mention the last kernel version that worked fine and
+ the first broken one. Ideally try to find the exact change causing the
+ regression using a bisection, as explained below in more detail.
+
+ * Remember to let the Linux regressions mailing list
+ (regressions@lists.linux.dev) know about your report:
+
+ * If you report the regression by mail, CC the regressions list.
+
+ * If you report your regression to some bug tracker, forward the submitted
+ report by mail to the regressions list while CCing the maintainer and the
+ mailing list for the subsystem in question.
+
+ If it's a regression within a stable or longterm series (e.g.
+ v5.15.3..v5.15.5), remember to CC the `Linux stable mailing list
+ <https://lore.kernel.org/stable/>`_ (stable@vger.kernel.org).
+
+ In case you performed a successful bisection, add everyone to the CC the
+ culprit's commit message mentions in lines starting with "Signed-off-by:".
+
+When CCing for forwarding your report to the list, consider directly telling the
+aforementioned Linux kernel regression tracking bot about your report. To do
+that, include a paragraph like this in your mail::
+
+ #regzbot introduced: v5.13..v5.14-rc1
+
+Regzbot will then consider your mail a report for a regression introduced in the
+specified version range. In above case Linux v5.13 still worked fine and Linux
+v5.14-rc1 was the first version where you encountered the issue. If you
+performed a bisection to find the commit that caused the regression, specify the
+culprit's commit-id instead::
+
+ #regzbot introduced: 1f2e3d4c5d
+
+Placing such a "regzbot command" is in your interest, as it will ensure the
+report won't fall through the cracks unnoticed. If you omit this, the Linux
+kernel's regressions tracker will take care of telling regzbot about your
+regression, as long as you send a copy to the regressions mailing lists. But the
+regression tracker is just one human which sometimes has to rest or occasionally
+might even enjoy some time away from computers (as crazy as that might sound).
+Relying on this person thus will result in an unnecessary delay before the
+regressions becomes mentioned `on the list of tracked and unresolved Linux
+kernel regressions <https://linux-regtracking.leemhuis.info/regzbot/>`_ and the
+weekly regression reports sent by regzbot. Such delays can result in Linus
+Torvalds being unaware of important regressions when deciding between "continue
+development or call this finished and release the final?".
+
+Are really all regressions fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Nearly all of them are, as long as the change causing the regression (the
+"culprit commit") is reliably identified. Some regressions can be fixed without
+this, but often it's required.
+
+Who needs to find the root cause of a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Developers of the affected code area should try to locate the culprit on their
+own. But for them that's often impossible to do with reasonable effort, as quite
+a lot of issues only occur in a particular environment outside the developer's
+reach -- for example, a specific hardware platform, firmware, Linux distro,
+system's configuration, or application. That's why in the end it's often up to
+the reporter to locate the culprit commit; sometimes users might even need to
+run additional tests afterwards to pinpoint the exact root cause. Developers
+should offer advice and reasonably help where they can, to make this process
+relatively easy and achievable for typical users.
+
+How can I find the culprit?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Perform a bisection, as roughly outlined in
+Documentation/admin-guide/reporting-issues.rst and described in more detail by
+Documentation/admin-guide/bug-bisect.rst. It might sound like a lot of work, but
+in many cases finds the culprit relatively quickly. If it's hard or
+time-consuming to reliably reproduce the issue, consider teaming up with other
+affected users to narrow down the search range together.
+
+Who can I ask for advice when it comes to regressions?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
+CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
+issue might better be dealt with in private, feel free to omit the list.
+
+
+Additional details about regressions
+------------------------------------
+
+
+What is the goal of the "no regressions rule"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Users should feel safe when updating kernel versions and not have to worry
+something might break. This is in the interest of the kernel developers to make
+updating attractive: they don't want users to stay on stable or longterm Linux
+series that are either abandoned or more than one and a half years old. That's
+in everybody's interest, as `those series might have known bugs, security
+issues, or other problematic aspects already fixed in later versions
+<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
+Additionally, the kernel developers want to make it simple and appealing for
+users to test the latest pre-release or regular release. That's also in
+everybody's interest, as it's a lot easier to track down and fix problems, if
+they are reported shortly after being introduced.
+
+Is the "no regressions" rule really adhered in practice?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's taken really seriously, as can be seen by many mailing list posts from
+Linux creator and lead developer Linus Torvalds, some of which are quoted in
+Documentation/process/handling-regressions.rst.
+
+Exceptions to this rule are extremely rare; in the past developers almost always
+turned out to be wrong when they assumed a particular situation was warranting
+an exception.
+
+Who ensures the "no regressions" is actually followed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The subsystem maintainers should take care of that, which are watched and
+supported by the tree maintainers -- e.g. Linus Torvalds for mainline and
+Greg Kroah-Hartman et al. for various stable/longterm series.
+
+All of them are helped by people trying to ensure no regression report falls
+through the cracks. One of them is Thorsten Leemhuis, who's currently acting as
+the Linux kernel's "regressions tracker"; to facilitate this work he relies on
+regzbot, the Linux kernel regression tracking bot. That's why you want to bring
+your report on the radar of these people by CCing or forwarding each report to
+the regressions mailing list, ideally with a "regzbot command" in your mail to
+get it tracked immediately.
+
+How quickly are regressions normally fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Developers should fix any reported regression as quickly as possible, to provide
+affected users with a solution in a timely manner and prevent more users from
+running into the issue; nevertheless developers need to take enough time and
+care to ensure regression fixes do not cause additional damage.
+
+The answer thus depends on various factors like the impact of a regression, its
+age, or the Linux series in which it occurs. In the end though, most regressions
+should be fixed within two weeks.
+
+Is it a regression, if the issue can be avoided by updating some software?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Almost always: yes. If a developer tells you otherwise, ask the regression
+tracker for advice as outlined above.
+
+Is it a regression, if a newer kernel works slower or consumes more energy?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Yes, but the difference has to be significant. A five percent slow-down in a
+micro-benchmark thus is unlikely to qualify as regression, unless it also
+influences the results of a broad benchmark by more than one percent. If in
+doubt, ask for advice.
+
+Is it a regression, if an external kernel module breaks when updating Linux?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+No, as the "no regression" rule is about interfaces and services the Linux
+kernel provides to the userland. It thus does not cover building or running
+externally developed kernel modules, as they run in kernel-space and hook into
+the kernel using internal interfaces occasionally changed.
+
+How are regressions handled that are caused by security fixes?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In extremely rare situations security issues can't be fixed without causing
+regressions; those fixes are given way, as they are the lesser evil in the end.
+Luckily this middling almost always can be avoided, as key developers for the
+affected area and often Linus Torvalds himself try very hard to fix security
+issues without causing regressions.
+
+If you nevertheless face such a case, check the mailing list archives if people
+tried their best to avoid the regression. If not, report it; if in doubt, ask
+for advice as outlined above.
+
+What happens if fixing a regression is impossible without causing another?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Sadly these things happen, but luckily not very often; if they occur, expert
+developers of the affected code area should look into the issue to find a fix
+that avoids regressions or at least their impact. If you run into such a
+situation, do what was outlined already for regressions caused by security
+fixes: check earlier discussions if people already tried their best and ask for
+advice if in doubt.
+
+A quick note while at it: these situations could be avoided, if people would
+regularly give mainline pre-releases (say v5.15-rc1 or -rc3) from each
+development cycle a test run. This is best explained by imagining a change
+integrated between Linux v5.14 and v5.15-rc1 which causes a regression, but at
+the same time is a hard requirement for some other improvement applied for
+5.15-rc1. All these changes often can simply be reverted and the regression thus
+solved, if someone finds and reports it before 5.15 is released. A few days or
+weeks later this solution can become impossible, as some software might have
+started to rely on aspects introduced by one of the follow-up changes: reverting
+all changes would then cause a regression for users of said software and thus is
+out of the question.
+
+Is it a regression, if some feature I relied on was removed months ago?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is, but often it's hard to fix such regressions due to the aspects outlined
+in the previous section. It hence needs to be dealt with on a case-by-case
+basis. This is another reason why it's in everybody's interest to regularly test
+mainline pre-releases.
+
+Does the "no regression" rule apply if I seem to be the only affected person?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It does, but only for practical usage: the Linux developers want to be free to
+remove support for hardware only to be found in attics and museums anymore.
+
+Note, sometimes regressions can't be avoided to make progress -- and the latter
+is needed to prevent Linux from stagnation. Hence, if only very few users seem
+to be affected by a regression, it for the greater good might be in their and
+everyone else's interest to lettings things pass. Especially if there is an
+easy way to circumvent the regression somehow, for example by updating some
+software or using a kernel parameter created just for this purpose.
+
+Does the regression rule apply for code in the staging tree as well?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Not according to the `help text for the configuration option covering all
+staging code <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
+which since its early days states::
+
+ Please note that these drivers are under heavy development, may or
+ may not work, and may contain userspace interfaces that most likely
+ will be changed in the near future.
+
+The staging developers nevertheless often adhere to the "no regressions" rule,
+but sometimes bend it to make progress. That's for example why some users had to
+deal with (often negligible) regressions when a WiFi driver from the staging
+tree was replaced by a totally different one written from scratch.
+
+Why do later versions have to be "compiled with a similar configuration"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because the Linux kernel developers sometimes integrate changes known to cause
+regressions, but make them optional and disable them in the kernel's default
+configuration. This trick allows progress, as the "no regressions" rule
+otherwise would lead to stagnation.
+
+Consider for example a new security feature blocking access to some kernel
+interfaces often abused by malware, which at the same time are required to run a
+few rarely used applications. The outlined approach makes both camps happy:
+people using these applications can leave the new security feature off, while
+everyone else can enable it without running into trouble.
+
+How to create a configuration similar to the one of an older kernel?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Start your machine with a known-good kernel and configure the newer Linux
+version with ``make olddefconfig``. This makes the kernel's build scripts pick
+up the configuration file (the ".config" file) from the running kernel as base
+for the new one you are about to compile; afterwards they set all new
+configuration options to their default value, which should disable new features
+that might cause regressions.
+
+Can I report a regression I found with pre-compiled vanilla kernels?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You need to ensure the newer kernel was compiled with a similar configuration
+file as the older one (see above), as those that built them might have enabled
+some known-to-be incompatible feature for the newer kernel. If in doubt, report
+the matter to the kernel's provider and ask for advice.
+
+
+More about regression tracking with "regzbot"
+---------------------------------------------
+
+What is regression tracking and why should I care about it?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Rules like "no regressions" need someone to ensure they are followed, otherwise
+they are broken either accidentally or on purpose. History has shown this to be
+true for Linux kernel development as well. That's why Thorsten Leemhuis, the
+Linux Kernel's regression tracker, and some people try to ensure all regression
+are fixed by keeping an eye on them until they are resolved. Neither of them are
+paid for this, that's why the work is done on a best effort basis.
+
+Why and how are Linux kernel regressions tracked using a bot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Tracking regressions completely manually has proven to be quite hard due to the
+distributed and loosely structured nature of Linux kernel development process.
+That's why the Linux kernel's regression tracker developed regzbot to facilitate
+the work, with the long term goal to automate regression tracking as much as
+possible for everyone involved.
+
+Regzbot works by watching for replies to reports of tracked regressions.
+Additionally, it's looking out for posted or committed patches referencing such
+reports with "Link:" tags; replies to such patch postings are tracked as well.
+Combined this data provides good insights into the current state of the fixing
+process.
+
+How to see which regressions regzbot tracks currently?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Check out `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
+
+What kind of issues are supposed to be tracked by regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bot is meant to track regressions, hence please don't involve regzbot for
+regular issues. But it's okay for the Linux kernel's regression tracker if you
+involve regzbot to track severe issues, like reports about hangs, corrupted
+data, or internal errors (Panic, Oops, BUG(), warning, ...).
+
+How to change aspects of a tracked regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By using a 'regzbot command' in a direct or indirect reply to the mail with the
+report. The easiest way to do that: find the report in your "Sent" folder or the
+mailing list archive and reply to it using your mailer's "Reply-all" function.
+In that mail, use one of the following commands in a stand-alone paragraph (IOW:
+use blank lines to separate one or multiple of these commands from the rest of
+the mail's text).
+
+ * Update when the regression started to happen, for example after performing a
+ bisection::
+
+ #regzbot introduced: 1f2e3d4c5d
+
+ * Set or update the title::
+
+ #regzbot title: foo
+
+ * Monitor a discussion or bugzilla.kernel.org ticket where additions aspects of
+ the issue or a fix are discussed:::
+
+ #regzbot monitor: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
+ #regzbot monitor: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
+
+ * Point to a place with further details of interest, like a mailing list post
+ or a ticket in a bug tracker that are slightly related, but about a different
+ topic::
+
+ #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
+
+ * Mark a regression as invalid::
+
+ #regzbot invalid: wasn't a regression, problem has always existed
+
+Regzbot supports a few other commands primarily used by developers or people
+tracking regressions. They and more details about the aforementioned regzbot
+commands can be found in the `getting started guide
+<https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_ and
+the `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
+for regzbot.
+
+..
+ end-of-content
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use "The Linux kernel developers" for author attribution and link
+ this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-regressions.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index d359bcfadd39..5dd660aac0ae 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1029,23 +1029,17 @@ This is a directory, with the following entries:
* ``poolsize``: the entropy pool size, in bits;
* ``urandom_min_reseed_secs``: obsolete (used to determine the minimum
- number of seconds between urandom pool reseeding).
+ number of seconds between urandom pool reseeding). This file is
+ writable for compatibility purposes, but writing to it has no effect
+ on any RNG behavior.
* ``uuid``: a UUID generated every time this is retrieved (this can
thus be used to generate UUIDs at will);
* ``write_wakeup_threshold``: when the entropy count drops below this
(as a number of bits), processes waiting to write to ``/dev/random``
- are woken up.
-
-If ``drivers/char/random.c`` is built with ``ADD_INTERRUPT_BENCH``
-defined, these additional entries are present:
-
-* ``add_interrupt_avg_cycles``: the average number of cycles between
- interrupts used to feed the pool;
-
-* ``add_interrupt_avg_deviation``: the standard deviation seen on the
- number of cycles between interrupts used to feed the pool.
+ are woken up. This file is writable for compatibility purposes, but
+ writing to it has no effect on any RNG behavior.
randomize_va_space
diff --git a/Documentation/arm64/booting.rst b/Documentation/arm64/booting.rst
index 52d060caf8bb..29884b261aa9 100644
--- a/Documentation/arm64/booting.rst
+++ b/Documentation/arm64/booting.rst
@@ -10,9 +10,9 @@ This document is based on the ARM booting document by Russell King and
is relevant to all public releases of the AArch64 Linux kernel.
The AArch64 exception model is made up of a number of exception levels
-(EL0 - EL3), with EL0 and EL1 having a secure and a non-secure
-counterpart. EL2 is the hypervisor level and exists only in non-secure
-mode. EL3 is the highest priority level and exists only in secure mode.
+(EL0 - EL3), with EL0, EL1 and EL2 having a secure and a non-secure
+counterpart. EL2 is the hypervisor level, EL3 is the highest priority
+level and exists only in secure mode. Both are architecturally optional.
For the purposes of this document, we will use the term `boot loader`
simply to define all software that executes on the CPU(s) before control
@@ -167,8 +167,8 @@ Before jumping into the kernel, the following conditions must be met:
All forms of interrupts must be masked in PSTATE.DAIF (Debug, SError,
IRQ and FIQ).
- The CPU must be in either EL2 (RECOMMENDED in order to have access to
- the virtualisation extensions) or non-secure EL1.
+ The CPU must be in non-secure state, either in EL2 (RECOMMENDED in order
+ to have access to the virtualisation extensions), or in EL1.
- Caches, MMUs
diff --git a/Documentation/arm64/elf_hwcaps.rst b/Documentation/arm64/elf_hwcaps.rst
index b72ff17d600a..a8f30963e550 100644
--- a/Documentation/arm64/elf_hwcaps.rst
+++ b/Documentation/arm64/elf_hwcaps.rst
@@ -259,6 +259,11 @@ HWCAP2_RPRES
Functionality implied by ID_AA64ISAR2_EL1.RPRES == 0b0001.
+HWCAP2_MTE3
+
+ Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0011, as described
+ by Documentation/arm64/memory-tagging-extension.rst.
+
4. Unused AT_HWCAP bits
-----------------------
diff --git a/Documentation/arm64/memory-tagging-extension.rst b/Documentation/arm64/memory-tagging-extension.rst
index 7b99c8f428eb..dd27f78d7608 100644
--- a/Documentation/arm64/memory-tagging-extension.rst
+++ b/Documentation/arm64/memory-tagging-extension.rst
@@ -76,6 +76,9 @@ configurable behaviours:
with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0`` (the faulting
address is unknown).
+- *Asymmetric* - Reads are handled as for synchronous mode while writes
+ are handled as for asynchronous mode.
+
The user can select the above modes, per thread, using the
``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where ``flags``
contains any number of the following values in the ``PR_MTE_TCF_MASK``
@@ -91,8 +94,9 @@ mode is specified, the program will run in that mode. If multiple
modes are specified, the mode is selected as described in the "Per-CPU
preferred tag checking modes" section below.
-The current tag check fault mode can be read using the
-``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.
+The current tag check fault configuration can be read using the
+``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call. If
+multiple modes were requested then all will be reported.
Tag checking can also be disabled for a user thread by setting the
``PSTATE.TCO`` bit with ``MSR TCO, #1``.
@@ -139,18 +143,25 @@ tag checking mode as the CPU's preferred tag checking mode.
The preferred tag checking mode for each CPU is controlled by
``/sys/devices/system/cpu/cpu<N>/mte_tcf_preferred``, to which a
-privileged user may write the value ``async`` or ``sync``. The default
-preferred mode for each CPU is ``async``.
+privileged user may write the value ``async``, ``sync`` or ``asymm``. The
+default preferred mode for each CPU is ``async``.
To allow a program to potentially run in the CPU's preferred tag
checking mode, the user program may set multiple tag check fault mode
bits in the ``flags`` argument to the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
-flags, 0, 0, 0)`` system call. If the CPU's preferred tag checking
-mode is in the task's set of provided tag checking modes (this will
-always be the case at present because the kernel only supports two
-tag checking modes, but future kernels may support more modes), that
-mode will be selected. Otherwise, one of the modes in the task's mode
-set will be selected in a currently unspecified manner.
+flags, 0, 0, 0)`` system call. If both synchronous and asynchronous
+modes are requested then asymmetric mode may also be selected by the
+kernel. If the CPU's preferred tag checking mode is in the task's set
+of provided tag checking modes, that mode will be selected. Otherwise,
+one of the modes in the task's mode will be selected by the kernel
+from the task's mode set using the preference order:
+
+ 1. Asynchronous
+ 2. Asymmetric
+ 3. Synchronous
+
+Note that there is no way for userspace to request multiple modes and
+also disable asymmetric mode.
Initial process state
---------------------
@@ -213,6 +224,29 @@ address ABI control and MTE configuration of a process as per the
Documentation/arm64/tagged-address-abi.rst and above. The corresponding
``regset`` is 1 element of 8 bytes (``sizeof(long))``).
+Core dump support
+-----------------
+
+The allocation tags for user memory mapped with ``PROT_MTE`` are dumped
+in the core file as additional ``PT_ARM_MEMTAG_MTE`` segments. The
+program header for such segment is defined as:
+
+:``p_type``: ``PT_ARM_MEMTAG_MTE``
+:``p_flags``: 0
+:``p_offset``: segment file offset
+:``p_vaddr``: segment virtual address, same as the corresponding
+ ``PT_LOAD`` segment
+:``p_paddr``: 0
+:``p_filesz``: segment size in file, calculated as ``p_mem_sz / 32``
+ (two 4-bit tags cover 32 bytes of memory)
+:``p_memsz``: segment size in memory, same as the corresponding
+ ``PT_LOAD`` segment
+:``p_align``: 0
+
+The tags are stored in the core file at ``p_offset`` as two 4-bit tags
+in a byte. With the tag granule of 16 bytes, a 4K page requires 128
+bytes in the core file.
+
Example of correct usage
========================
diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst
index ea281dd75517..466cb9e89047 100644
--- a/Documentation/arm64/silicon-errata.rst
+++ b/Documentation/arm64/silicon-errata.rst
@@ -136,7 +136,7 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| Cavium | ThunderX ITS | #23144 | CAVIUM_ERRATUM_23144 |
+----------------+-----------------+-----------------+-----------------------------+
-| Cavium | ThunderX GICv3 | #23154 | CAVIUM_ERRATUM_23154 |
+| Cavium | ThunderX GICv3 | #23154,38545 | CAVIUM_ERRATUM_23154 |
+----------------+-----------------+-----------------+-----------------------------+
| Cavium | ThunderX GICv3 | #38539 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
diff --git a/Documentation/asm-annotations.rst b/Documentation/asm-annotations.rst
index f4bf0f6395fb..a64f2ca469d4 100644
--- a/Documentation/asm-annotations.rst
+++ b/Documentation/asm-annotations.rst
@@ -130,14 +130,13 @@ denoting a range of code via ``SYM_*_START/END`` annotations.
In fact, this kind of annotation corresponds to the now deprecated ``ENTRY``
and ``ENDPROC`` macros.
-* ``SYM_FUNC_START_ALIAS`` and ``SYM_FUNC_START_LOCAL_ALIAS`` serve for those
- who decided to have two or more names for one function. The typical use is::
+* ``SYM_FUNC_ALIAS``, ``SYM_FUNC_ALIAS_LOCAL``, and ``SYM_FUNC_ALIAS_WEAK`` can
+ be used to define multiple names for a function. The typical use is::
- SYM_FUNC_START_ALIAS(__memset)
- SYM_FUNC_START(memset)
+ SYM_FUNC_START(__memset)
... asm insns ...
- SYM_FUNC_END(memset)
- SYM_FUNC_END_ALIAS(__memset)
+ SYN_FUNC_END(__memset)
+ SYM_FUNC_ALIAS(memset, __memset)
In this example, one can call ``__memset`` or ``memset`` with the same
result, except the debug information for the instructions is generated to
diff --git a/Documentation/block/biodoc.rst b/Documentation/block/biodoc.rst
deleted file mode 100644
index 2098477851a4..000000000000
--- a/Documentation/block/biodoc.rst
+++ /dev/null
@@ -1,1164 +0,0 @@
-=====================================================
-Notes on the Generic Block Layer Rewrite in Linux 2.5
-=====================================================
-
-.. note::
-
- It seems that there are lot of outdated stuff here. This seems
- to be written somewhat as a task list. Yet, eventually, something
- here might still be useful.
-
-Notes Written on Jan 15, 2002:
-
- - Jens Axboe <jens.axboe@oracle.com>
- - Suparna Bhattacharya <suparna@in.ibm.com>
-
-Last Updated May 2, 2002
-
-September 2003: Updated I/O Scheduler portions
- - Nick Piggin <npiggin@kernel.dk>
-
-Introduction
-============
-
-These are some notes describing some aspects of the 2.5 block layer in the
-context of the bio rewrite. The idea is to bring out some of the key
-changes and a glimpse of the rationale behind those changes.
-
-Please mail corrections & suggestions to suparna@in.ibm.com.
-
-Credits
-=======
-
-2.5 bio rewrite:
- - Jens Axboe <jens.axboe@oracle.com>
-
-Many aspects of the generic block layer redesign were driven by and evolved
-over discussions, prior patches and the collective experience of several
-people. See sections 8 and 9 for a list of some related references.
-
-The following people helped with review comments and inputs for this
-document:
-
- - Christoph Hellwig <hch@infradead.org>
- - Arjan van de Ven <arjanv@redhat.com>
- - Randy Dunlap <rdunlap@xenotime.net>
- - Andre Hedrick <andre@linux-ide.org>
-
-The following people helped with fixes/contributions to the bio patches
-while it was still work-in-progress:
-
- - David S. Miller <davem@redhat.com>
-
-
-.. Description of Contents:
-
- 1. Scope for tuning of logic to various needs
- 1.1 Tuning based on device or low level driver capabilities
- - Per-queue parameters
- - Highmem I/O support
- - I/O scheduler modularization
- 1.2 Tuning based on high level requirements/capabilities
- 1.2.1 Request Priority/Latency
- 1.3 Direct access/bypass to lower layers for diagnostics and special
- device operations
- 1.3.1 Pre-built commands
- 2. New flexible and generic but minimalist i/o structure or descriptor
- (instead of using buffer heads at the i/o layer)
- 2.1 Requirements/Goals addressed
- 2.2 The bio struct in detail (multi-page io unit)
- 2.3 Changes in the request structure
- 3. Using bios
- 3.1 Setup/teardown (allocation, splitting)
- 3.2 Generic bio helper routines
- 3.2.1 Traversing segments and completion units in a request
- 3.2.2 Setting up DMA scatterlists
- 3.2.3 I/O completion
- 3.2.4 Implications for drivers that do not interpret bios (don't handle
- multiple segments)
- 3.3 I/O submission
- 4. The I/O scheduler
- 5. Scalability related changes
- 5.1 Granular locking: Removal of io_request_lock
- 5.2 Prepare for transition to 64 bit sector_t
- 6. Other Changes/Implications
- 6.1 Partition re-mapping handled by the generic block layer
- 7. A few tips on migration of older drivers
- 8. A list of prior/related/impacted patches/ideas
- 9. Other References/Discussion Threads
-
-
-Bio Notes
-=========
-
-Let us discuss the changes in the context of how some overall goals for the
-block layer are addressed.
-
-1. Scope for tuning the generic logic to satisfy various requirements
-=====================================================================
-
-The block layer design supports adaptable abstractions to handle common
-processing with the ability to tune the logic to an appropriate extent
-depending on the nature of the device and the requirements of the caller.
-One of the objectives of the rewrite was to increase the degree of tunability
-and to enable higher level code to utilize underlying device/driver
-capabilities to the maximum extent for better i/o performance. This is
-important especially in the light of ever improving hardware capabilities
-and application/middleware software designed to take advantage of these
-capabilities.
-
-1.1 Tuning based on low level device / driver capabilities
-----------------------------------------------------------
-
-Sophisticated devices with large built-in caches, intelligent i/o scheduling
-optimizations, high memory DMA support, etc may find some of the
-generic processing an overhead, while for less capable devices the
-generic functionality is essential for performance or correctness reasons.
-Knowledge of some of the capabilities or parameters of the device should be
-used at the generic block layer to take the right decisions on
-behalf of the driver.
-
-How is this achieved ?
-
-Tuning at a per-queue level:
-
-i. Per-queue limits/values exported to the generic layer by the driver
-
-Various parameters that the generic i/o scheduler logic uses are set at
-a per-queue level (e.g maximum request size, maximum number of segments in
-a scatter-gather list, logical block size)
-
-Some parameters that were earlier available as global arrays indexed by
-major/minor are now directly associated with the queue. Some of these may
-move into the block device structure in the future. Some characteristics
-have been incorporated into a queue flags field rather than separate fields
-in themselves. There are blk_queue_xxx functions to set the parameters,
-rather than update the fields directly
-
-Some new queue property settings:
-
- blk_queue_bounce_limit(q, u64 dma_address)
- Enable I/O to highmem pages, dma_address being the
- limit. No highmem default.
-
- blk_queue_max_sectors(q, max_sectors)
- Sets two variables that limit the size of the request.
-
- - The request queue's max_sectors, which is a soft size in
- units of 512 byte sectors, and could be dynamically varied
- by the core kernel.
-
- - The request queue's max_hw_sectors, which is a hard limit
- and reflects the maximum size request a driver can handle
- in units of 512 byte sectors.
-
- The default for both max_sectors and max_hw_sectors is
- 255. The upper limit of max_sectors is 1024.
-
- blk_queue_max_phys_segments(q, max_segments)
- Maximum physical segments you can handle in a request. 128
- default (driver limit). (See 3.2.2)
-
- blk_queue_max_hw_segments(q, max_segments)
- Maximum dma segments the hardware can handle in a request. 128
- default (host adapter limit, after dma remapping).
- (See 3.2.2)
-
- blk_queue_max_segment_size(q, max_seg_size)
- Maximum size of a clustered segment, 64kB default.
-
- blk_queue_logical_block_size(q, logical_block_size)
- Lowest possible sector size that the hardware can operate
- on, 512 bytes default.
-
-New queue flags:
-
- - QUEUE_FLAG_CLUSTER (see 3.2.2)
- - QUEUE_FLAG_QUEUED (see 3.2.4)
-
-
-ii. High-mem i/o capabilities are now considered the default
-
-The generic bounce buffer logic, present in 2.4, where the block layer would
-by default copyin/out i/o requests on high-memory buffers to low-memory buffers
-assuming that the driver wouldn't be able to handle it directly, has been
-changed in 2.5. The bounce logic is now applied only for memory ranges
-for which the device cannot handle i/o. A driver can specify this by
-setting the queue bounce limit for the request queue for the device
-(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
-where a device is capable of handling high memory i/o.
-
-In order to enable high-memory i/o where the device is capable of supporting
-it, the pci dma mapping routines and associated data structures have now been
-modified to accomplish a direct page -> bus translation, without requiring
-a virtual address mapping (unlike the earlier scheme of virtual address
--> bus translation). So this works uniformly for high-memory pages (which
-do not have a corresponding kernel virtual address space mapping) and
-low-memory pages.
-
-Note: Please refer to Documentation/core-api/dma-api-howto.rst for a discussion
-on PCI high mem DMA aspects and mapping of scatter gather lists, and support
-for 64 bit PCI.
-
-Special handling is required only for cases where i/o needs to happen on
-pages at physical memory addresses beyond what the device can support. In these
-cases, a bounce bio representing a buffer from the supported memory range
-is used for performing the i/o with copyin/copyout as needed depending on
-the type of the operation. For example, in case of a read operation, the
-data read has to be copied to the original buffer on i/o completion, so a
-callback routine is set up to do this, while for write, the data is copied
-from the original buffer to the bounce buffer prior to issuing the
-operation. Since an original buffer may be in a high memory area that's not
-mapped in kernel virtual addr, a kmap operation may be required for
-performing the copy, and special care may be needed in the completion path
-as it may not be in irq context. Special care is also required (by way of
-GFP flags) when allocating bounce buffers, to avoid certain highmem
-deadlock possibilities.
-
-It is also possible that a bounce buffer may be allocated from high-memory
-area that's not mapped in kernel virtual addr, but within the range that the
-device can use directly; so the bounce page may need to be kmapped during
-copy operations. [Note: This does not hold in the current implementation,
-though]
-
-There are some situations when pages from high memory may need to
-be kmapped, even if bounce buffers are not necessary. For example a device
-may need to abort DMA operations and revert to PIO for the transfer, in
-which case a virtual mapping of the page is required. For SCSI it is also
-done in some scenarios where the low level driver cannot be trusted to
-handle a single sg entry correctly. The driver is expected to perform the
-kmaps as needed on such occasions as appropriate. A driver could also use
-the blk_queue_bounce() routine on its own to bounce highmem i/o to low
-memory for specific requests if so desired.
-
-iii. The i/o scheduler algorithm itself can be replaced/set as appropriate
-
-As in 2.4, it is possible to plugin a brand new i/o scheduler for a particular
-queue or pick from (copy) existing generic schedulers and replace/override
-certain portions of it. The 2.5 rewrite provides improved modularization
-of the i/o scheduler. There are more pluggable callbacks, e.g for init,
-add request, extract request, which makes it possible to abstract specific
-i/o scheduling algorithm aspects and details outside of the generic loop.
-It also makes it possible to completely hide the implementation details of
-the i/o scheduler from block drivers.
-
-I/O scheduler wrappers are to be used instead of accessing the queue directly.
-See section 4. The I/O scheduler for details.
-
-1.2 Tuning Based on High level code capabilities
-------------------------------------------------
-
-i. Application capabilities for raw i/o
-
-This comes from some of the high-performance database/middleware
-requirements where an application prefers to make its own i/o scheduling
-decisions based on an understanding of the access patterns and i/o
-characteristics
-
-ii. High performance filesystems or other higher level kernel code's
-capabilities
-
-Kernel components like filesystems could also take their own i/o scheduling
-decisions for optimizing performance. Journalling filesystems may need
-some control over i/o ordering.
-
-What kind of support exists at the generic block layer for this ?
-
-The flags and rw fields in the bio structure can be used for some tuning
-from above e.g indicating that an i/o is just a readahead request, or priority
-settings (currently unused). As far as user applications are concerned they
-would need an additional mechanism either via open flags or ioctls, or some
-other upper level mechanism to communicate such settings to block.
-
-1.2.1 Request Priority/Latency
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Todo/Under discussion::
-
- Arjan's proposed request priority scheme allows higher levels some broad
- control (high/med/low) over the priority of an i/o request vs other pending
- requests in the queue. For example it allows reads for bringing in an
- executable page on demand to be given a higher priority over pending write
- requests which haven't aged too much on the queue. Potentially this priority
- could even be exposed to applications in some manner, providing higher level
- tunability. Time based aging avoids starvation of lower priority
- requests. Some bits in the bi_opf flags field in the bio structure are
- intended to be used for this priority information.
-
-
-1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode)
------------------------------------------------------------------------
-
-(e.g Diagnostics, Systems Management)
-
-There are situations where high-level code needs to have direct access to
-the low level device capabilities or requires the ability to issue commands
-to the device bypassing some of the intermediate i/o layers.
-These could, for example, be special control commands issued through ioctl
-interfaces, or could be raw read/write commands that stress the drive's
-capabilities for certain kinds of fitness tests. Having direct interfaces at
-multiple levels without having to pass through upper layers makes
-it possible to perform bottom up validation of the i/o path, layer by
-layer, starting from the media.
-
-The normal i/o submission interfaces, e.g submit_bio, could be bypassed
-for specially crafted requests which such ioctl or diagnostics
-interfaces would typically use, and the elevator add_request routine
-can instead be used to directly insert such requests in the queue or preferably
-the blk_do_rq routine can be used to place the request on the queue and
-wait for completion. Alternatively, sometimes the caller might just
-invoke a lower level driver specific interface with the request as a
-parameter.
-
-If the request is a means for passing on special information associated with
-the command, then such information is associated with the request->special
-field (rather than misuse the request->buffer field which is meant for the
-request data buffer's virtual mapping).
-
-For passing request data, the caller must build up a bio descriptor
-representing the concerned memory buffer if the underlying driver interprets
-bio segments or uses the block layer end*request* functions for i/o
-completion. Alternatively one could directly use the request->buffer field to
-specify the virtual address of the buffer, if the driver expects buffer
-addresses passed in this way and ignores bio entries for the request type
-involved. In the latter case, the driver would modify and manage the
-request->buffer, request->sector and request->nr_sectors or
-request->current_nr_sectors fields itself rather than using the block layer
-end_request or end_that_request_first completion interfaces.
-(See 2.3 or Documentation/block/request.rst for a brief explanation of
-the request structure fields)
-
-::
-
- [TBD: end_that_request_last should be usable even in this case;
- Perhaps an end_that_direct_request_first routine could be implemented to make
- handling direct requests easier for such drivers; Also for drivers that
- expect bios, a helper function could be provided for setting up a bio
- corresponding to a data buffer]
-
- <JENS: I dont understand the above, why is end_that_request_first() not
- usable? Or _last for that matter. I must be missing something>
-
- <SUP: What I meant here was that if the request doesn't have a bio, then
- end_that_request_first doesn't modify nr_sectors or current_nr_sectors,
- and hence can't be used for advancing request state settings on the
- completion of partial transfers. The driver has to modify these fields
- directly by hand.
- This is because end_that_request_first only iterates over the bio list,
- and always returns 0 if there are none associated with the request.
- _last works OK in this case, and is not a problem, as I mentioned earlier
- >
-
-1.3.1 Pre-built Commands
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-A request can be created with a pre-built custom command to be sent directly
-to the device. The cmd block in the request structure has room for filling
-in the command bytes. (i.e rq->cmd is now 16 bytes in size, and meant for
-command pre-building, and the type of the request is now indicated
-through rq->flags instead of via rq->cmd)
-
-The request structure flags can be set up to indicate the type of request
-in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC:
-packet command issued via blk_do_rq, REQ_SPECIAL: special request).
-
-It can help to pre-build device commands for requests in advance.
-Drivers can now specify a request prepare function (q->prep_rq_fn) that the
-block layer would invoke to pre-build device commands for a given request,
-or perform other preparatory processing for the request. This is routine is
-called by elv_next_request(), i.e. typically just before servicing a request.
-(The prepare function would not be called for requests that have RQF_DONTPREP
-enabled)
-
-Aside:
- Pre-building could possibly even be done early, i.e before placing the
- request on the queue, rather than construct the command on the fly in the
- driver while servicing the request queue when it may affect latencies in
- interrupt context or responsiveness in general. One way to add early
- pre-building would be to do it whenever we fail to merge on a request.
- Now REQ_NOMERGE is set in the request flags to skip this one in the future,
- which means that it will not change before we feed it to the device. So
- the pre-builder hook can be invoked there.
-
-
-2. Flexible and generic but minimalist i/o structure/descriptor
-===============================================================
-
-2.1 Reason for a new structure and requirements addressed
----------------------------------------------------------
-
-Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
-layer, and the low level request structure was associated with a chain of
-buffer heads for a contiguous i/o request. This led to certain inefficiencies
-when it came to large i/o requests and readv/writev style operations, as it
-forced such requests to be broken up into small chunks before being passed
-on to the generic block layer, only to be merged by the i/o scheduler
-when the underlying device was capable of handling the i/o in one shot.
-Also, using the buffer head as an i/o structure for i/os that didn't originate
-from the buffer cache unnecessarily added to the weight of the descriptors
-which were generated for each such chunk.
-
-The following were some of the goals and expectations considered in the
-redesign of the block i/o data structure in 2.5.
-
-1. Should be appropriate as a descriptor for both raw and buffered i/o -
- avoid cache related fields which are irrelevant in the direct/page i/o path,
- or filesystem block size alignment restrictions which may not be relevant
- for raw i/o.
-2. Ability to represent high-memory buffers (which do not have a virtual
- address mapping in kernel address space).
-3. Ability to represent large i/os w/o unnecessarily breaking them up (i.e
- greater than PAGE_SIZE chunks in one shot)
-4. At the same time, ability to retain independent identity of i/os from
- different sources or i/o units requiring individual completion (e.g. for
- latency reasons)
-5. Ability to represent an i/o involving multiple physical memory segments
- (including non-page aligned page fragments, as specified via readv/writev)
- without unnecessarily breaking it up, if the underlying device is capable of
- handling it.
-6. Preferably should be based on a memory descriptor structure that can be
- passed around different types of subsystems or layers, maybe even
- networking, without duplication or extra copies of data/descriptor fields
- themselves in the process
-7. Ability to handle the possibility of splits/merges as the structure passes
- through layered drivers (lvm, md, evms), with minimal overhead.
-
-The solution was to define a new structure (bio) for the block layer,
-instead of using the buffer head structure (bh) directly, the idea being
-avoidance of some associated baggage and limitations. The bio structure
-is uniformly used for all i/o at the block layer ; it forms a part of the
-bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
-mapped to bio structures.
-
-2.2 The bio struct
-------------------
-
-The bio structure uses a vector representation pointing to an array of tuples
-of <page, offset, len> to describe the i/o buffer, and has various other
-fields describing i/o parameters and state that needs to be maintained for
-performing the i/o.
-
-Notice that this representation means that a bio has no virtual address
-mapping at all (unlike buffer heads).
-
-::
-
- struct bio_vec {
- struct page *bv_page;
- unsigned short bv_len;
- unsigned short bv_offset;
- };
-
- /*
- * main unit of I/O for the block layer and lower layers (ie drivers)
- */
- struct bio {
- struct bio *bi_next; /* request queue link */
- struct block_device *bi_bdev; /* target device */
- unsigned long bi_flags; /* status, command, etc */
- unsigned long bi_opf; /* low bits: r/w, high: priority */
-
- unsigned int bi_vcnt; /* how may bio_vec's */
- struct bvec_iter bi_iter; /* current index into bio_vec array */
-
- unsigned int bi_size; /* total size in bytes */
- unsigned short bi_hw_segments; /* segments after DMA remapping */
- unsigned int bi_max; /* max bio_vecs we can hold
- used as index into pool */
- struct bio_vec *bi_io_vec; /* the actual vec list */
- bio_end_io_t *bi_end_io; /* bi_end_io (bio) */
- atomic_t bi_cnt; /* pin count: free when it hits zero */
- void *bi_private;
- };
-
-With this multipage bio design:
-
-- Large i/os can be sent down in one go using a bio_vec list consisting
- of an array of <page, offset, len> fragments (similar to the way fragments
- are represented in the zero-copy network code)
-- Splitting of an i/o request across multiple devices (as in the case of
- lvm or raid) is achieved by cloning the bio (where the clone points to
- the same bi_io_vec array, but with the index and size accordingly modified)
-- A linked list of bios is used as before for unrelated merges [#]_ - this
- avoids reallocs and makes independent completions easier to handle.
-- Code that traverses the req list can find all the segments of a bio
- by using rq_for_each_segment. This handles the fact that a request
- has multiple bios, each of which can have multiple segments.
-- Drivers which can't process a large bio in one shot can use the bi_iter
- field to keep track of the next bio_vec entry to process.
- (e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
- [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
- bi_offset an len fields]
-
-.. [#]
-
- unrelated merges -- a request ends up containing two or more bios that
- didn't originate from the same place.
-
-bi_end_io() i/o callback gets called on i/o completion of the entire bio.
-
-At a lower level, drivers build a scatter gather list from the merged bios.
-The scatter gather list is in the form of an array of <page, offset, len>
-entries with their corresponding dma address mappings filled in at the
-appropriate time. As an optimization, contiguous physical pages can be
-covered by a single entry where <page> refers to the first page and <len>
-covers the range of pages (up to 16 contiguous pages could be covered this
-way). There is a helper routine (blk_rq_map_sg) which drivers can use to build
-the sg list.
-
-Note: Right now the only user of bios with more than one page is ll_rw_kio,
-which in turn means that only raw I/O uses it (direct i/o may not work
-right now). The intent however is to enable clustering of pages etc to
-become possible. The pagebuf abstraction layer from SGI also uses multi-page
-bios, but that is currently not included in the stock development kernels.
-The same is true of Andrew Morton's work-in-progress multipage bio writeout
-and readahead patches.
-
-2.3 Changes in the Request Structure
-------------------------------------
-
-The request structure is the structure that gets passed down to low level
-drivers. The block layer make_request function builds up a request structure,
-places it on the queue and invokes the drivers request_fn. The driver makes
-use of block layer helper routine elv_next_request to pull the next request
-off the queue. Control or diagnostic functions might bypass block and directly
-invoke underlying driver entry points passing in a specially constructed
-request structure.
-
-Only some relevant fields (mainly those which changed or may be referred
-to in some of the discussion here) are listed below, not necessarily in
-the order in which they occur in the structure (see include/linux/blkdev.h)
-Refer to Documentation/block/request.rst for details about all the request
-structure fields and a quick reference about the layers which are
-supposed to use or modify those fields::
-
- struct request {
- struct list_head queuelist; /* Not meant to be directly accessed by
- the driver.
- Used by q->elv_next_request_fn
- rq->queue is gone
- */
- .
- .
- unsigned char cmd[16]; /* prebuilt command data block */
- unsigned long flags; /* also includes earlier rq->cmd settings */
- .
- .
- sector_t sector; /* this field is now of type sector_t instead of int
- preparation for 64 bit sectors */
- .
- .
-
- /* Number of scatter-gather DMA addr+len pairs after
- * physical address coalescing is performed.
- */
- unsigned short nr_phys_segments;
-
- /* Number of scatter-gather addr+len pairs after
- * physical and DMA remapping hardware coalescing is performed.
- * This is the number of scatter-gather entries the driver
- * will actually have to deal with after DMA mapping is done.
- */
- unsigned short nr_hw_segments;
-
- /* Various sector counts */
- unsigned long nr_sectors; /* no. of sectors left: driver modifiable */
- unsigned long hard_nr_sectors; /* block internal copy of above */
- unsigned int current_nr_sectors; /* no. of sectors left in the
- current segment:driver modifiable */
- unsigned long hard_cur_sectors; /* block internal copy of the above */
- .
- .
- int tag; /* command tag associated with request */
- void *special; /* same as before */
- char *buffer; /* valid only for low memory buffers up to
- current_nr_sectors */
- .
- .
- struct bio *bio, *biotail; /* bio list instead of bh */
- struct request_list *rl;
- }
-
-See the req_ops and req_flag_bits definitions for an explanation of the various
-flags available. Some bits are used by the block layer or i/o scheduler.
-
-The behaviour of the various sector counts are almost the same as before,
-except that since we have multi-segment bios, current_nr_sectors refers
-to the numbers of sectors in the current segment being processed which could
-be one of the many segments in the current bio (i.e i/o completion unit).
-The nr_sectors value refers to the total number of sectors in the whole
-request that remain to be transferred (no change). The purpose of the
-hard_xxx values is for block to remember these counts every time it hands
-over the request to the driver. These values are updated by block on
-end_that_request_first, i.e. every time the driver completes a part of the
-transfer and invokes block end*request helpers to mark this. The
-driver should not modify these values. The block layer sets up the
-nr_sectors and current_nr_sectors fields (based on the corresponding
-hard_xxx values and the number of bytes transferred) and updates it on
-every transfer that invokes end_that_request_first. It does the same for the
-buffer, bio, bio->bi_iter fields too.
-
-The buffer field is just a virtual address mapping of the current segment
-of the i/o buffer in cases where the buffer resides in low-memory. For high
-memory i/o, this field is not valid and must not be used by drivers.
-
-Code that sets up its own request structures and passes them down to
-a driver needs to be careful about interoperation with the block layer helper
-functions which the driver uses. (Section 1.3)
-
-3. Using bios
-=============
-
-3.1 Setup/Teardown
-------------------
-
-There are routines for managing the allocation, and reference counting, and
-freeing of bios (bio_alloc, bio_get, bio_put).
-
-This makes use of Ingo Molnar's mempool implementation, which enables
-subsystems like bio to maintain their own reserve memory pools for guaranteed
-deadlock-free allocations during extreme VM load. For example, the VM
-subsystem makes use of the block layer to writeout dirty pages in order to be
-able to free up memory space, a case which needs careful handling. The
-allocation logic draws from the preallocated emergency reserve in situations
-where it cannot allocate through normal means. If the pool is empty and it
-can wait, then it would trigger action that would help free up memory or
-replenish the pool (without deadlocking) and wait for availability in the pool.
-If it is in IRQ context, and hence not in a position to do this, allocation
-could fail if the pool is empty. In general mempool always first tries to
-perform allocation without having to wait, even if it means digging into the
-pool as long it is not less that 50% full.
-
-On a free, memory is released to the pool or directly freed depending on
-the current availability in the pool. The mempool interface lets the
-subsystem specify the routines to be used for normal alloc and free. In the
-case of bio, these routines make use of the standard slab allocator.
-
-The caller of bio_alloc is expected to taken certain steps to avoid
-deadlocks, e.g. avoid trying to allocate more memory from the pool while
-already holding memory obtained from the pool.
-
-::
-
- [TBD: This is a potential issue, though a rare possibility
- in the bounce bio allocation that happens in the current code, since
- it ends up allocating a second bio from the same pool while
- holding the original bio ]
-
-Memory allocated from the pool should be released back within a limited
-amount of time (in the case of bio, that would be after the i/o is completed).
-This ensures that if part of the pool has been used up, some work (in this
-case i/o) must already be in progress and memory would be available when it
-is over. If allocating from multiple pools in the same code path, the order
-or hierarchy of allocation needs to be consistent, just the way one deals
-with multiple locks.
-
-The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc())
-for a non-clone bio. There are the 6 pools setup for different size biovecs,
-so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
-given size from these slabs.
-
-The bio_get() routine may be used to hold an extra reference on a bio prior
-to i/o submission, if the bio fields are likely to be accessed after the
-i/o is issued (since the bio may otherwise get freed in case i/o completion
-happens in the meantime).
-
-The bio_clone_fast() routine may be used to duplicate a bio, where the clone
-shares the bio_vec_list with the original bio (i.e. both point to the
-same bio_vec_list). This would typically be used for splitting i/o requests
-in lvm or md.
-
-3.2 Generic bio helper Routines
--------------------------------
-
-3.2.1 Traversing segments and completion units in a request
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The macro rq_for_each_segment() should be used for traversing the bios
-in the request list (drivers should avoid directly trying to do it
-themselves). Using these helpers should also make it easier to cope
-with block changes in the future.
-
-::
-
- struct req_iterator iter;
- rq_for_each_segment(bio_vec, rq, iter)
- /* bio_vec is now current segment */
-
-I/O completion callbacks are per-bio rather than per-segment, so drivers
-that traverse bio chains on completion need to keep that in mind. Drivers
-which don't make a distinction between segments and completion units would
-need to be reorganized to support multi-segment bios.
-
-3.2.2 Setting up DMA scatterlists
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The blk_rq_map_sg() helper routine would be used for setting up scatter
-gather lists from a request, so a driver need not do it on its own.
-
- nr_segments = blk_rq_map_sg(q, rq, scatterlist);
-
-The helper routine provides a level of abstraction which makes it easier
-to modify the internals of request to scatterlist conversion down the line
-without breaking drivers. The blk_rq_map_sg routine takes care of several
-things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER
-is set) and correct segment accounting to avoid exceeding the limits which
-the i/o hardware can handle, based on various queue properties.
-
-- Prevents a clustered segment from crossing a 4GB mem boundary
-- Avoids building segments that would exceed the number of physical
- memory segments that the driver can handle (phys_segments) and the
- number that the underlying hardware can handle at once, accounting for
- DMA remapping (hw_segments) (i.e. IOMMU aware limits).
-
-Routines which the low level driver can use to set up the segment limits:
-
-blk_queue_max_hw_segments() : Sets an upper limit of the maximum number of
-hw data segments in a request (i.e. the maximum number of address/length
-pairs the host adapter can actually hand to the device at once)
-
-blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
-of physical data segments in a request (i.e. the largest sized scatter list
-a driver could handle)
-
-3.2.3 I/O completion
-^^^^^^^^^^^^^^^^^^^^
-
-The existing generic block layer helper routines end_request,
-end_that_request_first and end_that_request_last can be used for i/o
-completion (and setting things up so the rest of the i/o or the next
-request can be kicked of) as before. With the introduction of multi-page
-bio support, end_that_request_first requires an additional argument indicating
-the number of sectors completed.
-
-3.2.4 Implications for drivers that do not interpret bios
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-(don't handle multiple segments)
-
-Drivers that do not interpret bios e.g those which do not handle multiple
-segments and do not support i/o into high memory addresses (require bounce
-buffers) and expect only virtually mapped buffers, can access the rq->buffer
-field. As before the driver should use current_nr_sectors to determine the
-size of remaining data in the current segment (that is the maximum it can
-transfer in one go unless it interprets segments), and rely on the block layer
-end_request, or end_that_request_first/last to take care of all accounting
-and transparent mapping of the next bio segment when a segment boundary
-is crossed on completion of a transfer. (The end*request* functions should
-be used if only if the request has come down from block/bio path, not for
-direct access requests which only specify rq->buffer without a valid rq->bio)
-
-3.3 I/O Submission
-------------------
-
-The routine submit_bio() is used to submit a single io. Higher level i/o
-routines make use of this:
-
-(a) Buffered i/o:
-
-The routine submit_bh() invokes submit_bio() on a bio corresponding to the
-bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.
-
-(b) Kiobuf i/o (for raw/direct i/o):
-
-The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and
-maps the array to one or more multi-page bios, issuing submit_bio() to
-perform the i/o on each of these.
-
-The embedded bh array in the kiobuf structure has been removed and no
-preallocation of bios is done for kiobufs. [The intent is to remove the
-blocks array as well, but it's currently in there to kludge around direct i/o.]
-Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc.
-
-Todo/Observation:
-
- A single kiobuf structure is assumed to correspond to a contiguous range
- of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
- So right now it wouldn't work for direct i/o on non-contiguous blocks.
- This is to be resolved. The eventual direction is to replace kiobuf
- by kvec's.
-
- Badari Pulavarty has a patch to implement direct i/o correctly using
- bio and kvec.
-
-
-(c) Page i/o:
-
-Todo/Under discussion:
-
- Andrew Morton's multi-page bio patches attempt to issue multi-page
- writeouts (and reads) from the page cache, by directly building up
- large bios for submission completely bypassing the usage of buffer
- heads. This work is still in progress.
-
- Christoph Hellwig had some code that uses bios for page-io (rather than
- bh). This isn't included in bio as yet. Christoph was also working on a
- design for representing virtual/real extents as an entity and modifying
- some of the address space ops interfaces to utilize this abstraction rather
- than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
- abstraction, but intended to be as lightweight as possible).
-
-(d) Direct access i/o:
-
-Direct access requests that do not contain bios would be submitted differently
-as discussed earlier in section 1.3.
-
-Aside:
-
- Kvec i/o:
-
- Ben LaHaise's aio code uses a slightly different structure instead
- of kiobufs, called a kvec_cb. This contains an array of <page, offset, len>
- tuples (very much like the networking code), together with a callback function
- and data pointer. This is embedded into a brw_cb structure when passed
- to brw_kvec_async().
-
- Now it should be possible to directly map these kvecs to a bio. Just as while
- cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec
- array pointer to point to the veclet array in kvecs.
-
- TBD: In order for this to work, some changes are needed in the way multi-page
- bios are handled today. The values of the tuples in such a vector passed in
- from higher level code should not be modified by the block layer in the course
- of its request processing, since that would make it hard for the higher layer
- to continue to use the vector descriptor (kvec) after i/o completes. Instead,
- all such transient state should either be maintained in the request structure,
- and passed on in some way to the endio completion routine.
-
-
-4. The I/O scheduler
-====================
-
-I/O scheduler, a.k.a. elevator, is implemented in two layers. Generic dispatch
-queue and specific I/O schedulers. Unless stated otherwise, elevator is used
-to refer to both parts and I/O scheduler to specific I/O schedulers.
-
-Block layer implements generic dispatch queue in `block/*.c`.
-The generic dispatch queue is responsible for requeueing, handling non-fs
-requests and all other subtleties.
-
-Specific I/O schedulers are responsible for ordering normal filesystem
-requests. They can also choose to delay certain requests to improve
-throughput or whatever purpose. As the plural form indicates, there are
-multiple I/O schedulers. They can be built as modules but at least one should
-be built inside the kernel. Each queue can choose different one and can also
-change to another one dynamically.
-
-A block layer call to the i/o scheduler follows the convention elv_xxx(). This
-calls elevator_xxx_fn in the elevator switch (block/elevator.c). Oh, xxx
-and xxx might not match exactly, but use your imagination. If an elevator
-doesn't implement a function, the switch does nothing or some minimal house
-keeping work.
-
-4.1. I/O scheduler API
-----------------------
-
-The functions an elevator may implement are: (* are mandatory)
-
-=============================== ================================================
-elevator_merge_fn called to query requests for merge with a bio
-
-elevator_merge_req_fn called when two requests get merged. the one
- which gets merged into the other one will be
- never seen by I/O scheduler again. IOW, after
- being merged, the request is gone.
-
-elevator_merged_fn called when a request in the scheduler has been
- involved in a merge. It is used in the deadline
- scheduler for example, to reposition the request
- if its sorting order has changed.
-
-elevator_allow_merge_fn called whenever the block layer determines
- that a bio can be merged into an existing
- request safely. The io scheduler may still
- want to stop a merge at this point if it
- results in some sort of conflict internally,
- this hook allows it to do that. Note however
- that two *requests* can still be merged at later
- time. Currently the io scheduler has no way to
- prevent that. It can only learn about the fact
- from elevator_merge_req_fn callback.
-
-elevator_dispatch_fn* fills the dispatch queue with ready requests.
- I/O schedulers are free to postpone requests by
- not filling the dispatch queue unless @force
- is non-zero. Once dispatched, I/O schedulers
- are not allowed to manipulate the requests -
- they belong to generic dispatch queue.
-
-elevator_add_req_fn* called to add a new request into the scheduler
-
-elevator_former_req_fn
-elevator_latter_req_fn These return the request before or after the
- one specified in disk sort order. Used by the
- block layer to find merge possibilities.
-
-elevator_completed_req_fn called when a request is completed.
-
-elevator_set_req_fn
-elevator_put_req_fn Must be used to allocate and free any elevator
- specific storage for a request.
-
-elevator_activate_req_fn Called when device driver first sees a request.
- I/O schedulers can use this callback to
- determine when actual execution of a request
- starts.
-elevator_deactivate_req_fn Called when device driver decides to delay
- a request by requeueing it.
-
-elevator_init_fn*
-elevator_exit_fn Allocate and free any elevator specific storage
- for a queue.
-=============================== ================================================
-
-4.2 Request flows seen by I/O schedulers
-----------------------------------------
-
-All requests seen by I/O schedulers strictly follow one of the following three
-flows.
-
- set_req_fn ->
-
- i. add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
- (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
- ii. add_req_fn -> (merged_fn ->)* -> merge_req_fn
- iii. [none]
-
- -> put_req_fn
-
-4.3 I/O scheduler implementation
---------------------------------
-
-The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
-optimal disk scan and request servicing performance (based on generic
-principles and device capabilities), optimized for:
-
-i. improved throughput
-ii. improved latency
-iii. better utilization of h/w & CPU time
-
-Characteristics:
-
-i. Binary tree
-AS and deadline i/o schedulers use red black binary trees for disk position
-sorting and searching, and a fifo linked list for time-based searching. This
-gives good scalability and good availability of information. Requests are
-almost always dispatched in disk sort order, so a cache is kept of the next
-request in sort order to prevent binary tree lookups.
-
-This arrangement is not a generic block layer characteristic however, so
-elevators may implement queues as they please.
-
-ii. Merge hash
-AS and deadline use a hash table indexed by the last sector of a request. This
-enables merging code to quickly look up "back merge" candidates, even when
-multiple I/O streams are being performed at once on one disk.
-
-"Front merges", a new request being merged at the front of an existing request,
-are far less common than "back merges" due to the nature of most I/O patterns.
-Front merges are handled by the binary trees in AS and deadline schedulers.
-
-iii. Plugging the queue to batch requests in anticipation of opportunities for
- merge/sort optimizations
-
-Plugging is an approach that the current i/o scheduling algorithm resorts to so
-that it collects up enough requests in the queue to be able to take
-advantage of the sorting/merging logic in the elevator. If the
-queue is empty when a request comes in, then it plugs the request queue
-(sort of like plugging the bath tub of a vessel to get fluid to build up)
-till it fills up with a few more requests, before starting to service
-the requests. This provides an opportunity to merge/sort the requests before
-passing them down to the device. There are various conditions when the queue is
-unplugged (to open up the flow again), either through a scheduled task or
-could be on demand. For example wait_on_buffer sets the unplugging going
-through sync_buffer() running blk_run_address_space(mapping). Or the caller
-can do it explicity through blk_unplug(bdev). So in the read case,
-the queue gets explicitly unplugged as part of waiting for completion on that
-buffer.
-
-Aside:
- This is kind of controversial territory, as it's not clear if plugging is
- always the right thing to do. Devices typically have their own queues,
- and allowing a big queue to build up in software, while letting the device be
- idle for a while may not always make sense. The trick is to handle the fine
- balance between when to plug and when to open up. Also now that we have
- multi-page bios being queued in one shot, we may not need to wait to merge
- a big request from the broken up pieces coming by.
-
-4.4 I/O contexts
-----------------
-
-I/O contexts provide a dynamically allocated per process data area. They may
-be used in I/O schedulers, and in the block layer (could be used for IO statis,
-priorities for example). See `*io_context` in block/ll_rw_blk.c, and as-iosched.c
-for an example of usage in an i/o scheduler.
-
-
-5. Scalability related changes
-==============================
-
-5.1 Granular Locking: io_request_lock replaced by a per-queue lock
-------------------------------------------------------------------
-
-The global io_request_lock has been removed as of 2.5, to avoid
-the scalability bottleneck it was causing, and has been replaced by more
-granular locking. The request queue structure has a pointer to the
-lock to be used for that queue. As a result, locking can now be
-per-queue, with a provision for sharing a lock across queues if
-necessary (e.g the scsi layer sets the queue lock pointers to the
-corresponding adapter lock, which results in a per host locking
-granularity). The locking semantics are the same, i.e. locking is
-still imposed by the block layer, grabbing the lock before
-request_fn execution which it means that lots of older drivers
-should still be SMP safe. Drivers are free to drop the queue
-lock themselves, if required. Drivers that explicitly used the
-io_request_lock for serialization need to be modified accordingly.
-Usually it's as easy as adding a global lock::
-
- static DEFINE_SPINLOCK(my_driver_lock);
-
-and passing the address to that lock to blk_init_queue().
-
-5.2 64 bit sector numbers (sector_t prepares for 64 bit support)
-----------------------------------------------------------------
-
-The sector number used in the bio structure has been changed to sector_t,
-which could be defined as 64 bit in preparation for 64 bit sector support.
-
-6. Other Changes/Implications
-=============================
-
-6.1 Partition re-mapping handled by the generic block layer
------------------------------------------------------------
-
-In 2.5 some of the gendisk/partition related code has been reorganized.
-Now the generic block layer performs partition-remapping early and thus
-provides drivers with a sector number relative to whole device, rather than
-having to take partition number into account in order to arrive at the true
-sector number. The routine blk_partition_remap() is invoked by
-submit_bio_noacct even before invoking the queue specific ->submit_bio,
-so the i/o scheduler also gets to operate on whole disk sector numbers. This
-should typically not require changes to block drivers, it just never gets
-to invoke its own partition sector offset calculations since all bios
-sent are offset from the beginning of the device.
-
-
-7. A Few Tips on Migration of older drivers
-===========================================
-
-Old-style drivers that just use CURRENT and ignores clustered requests,
-may not need much change. The generic layer will automatically handle
-clustered requests, multi-page bios, etc for the driver.
-
-For a low performance driver or hardware that is PIO driven or just doesn't
-support scatter-gather changes should be minimal too.
-
-The following are some points to keep in mind when converting old drivers
-to bio.
-
-Drivers should use elv_next_request to pick up requests and are no longer
-supposed to handle looping directly over the request list.
-(struct request->queue has been removed)
-
-Now end_that_request_first takes an additional number_of_sectors argument.
-It used to handle always just the first buffer_head in a request, now
-it will loop and handle as many sectors (on a bio-segment granularity)
-as specified.
-
-Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
-right thing to use is bio_endio(bio) instead.
-
-If the driver is dropping the io_request_lock from its request_fn strategy,
-then it just needs to replace that with q->queue_lock instead.
-
-As described in Sec 1.1, drivers can set max sector size, max segment size
-etc per queue now. Drivers that used to define their own merge functions i
-to handle things like this can now just use the blk_queue_* functions at
-blk_init_queue time.
-
-Drivers no longer have to map a {partition, sector offset} into the
-correct absolute location anymore, this is done by the block layer, so
-where a driver received a request ala this before::
-
- rq->rq_dev = mk_kdev(3, 5); /* /dev/hda5 */
- rq->sector = 0; /* first sector on hda5 */
-
-it will now see::
-
- rq->rq_dev = mk_kdev(3, 0); /* /dev/hda */
- rq->sector = 123128; /* offset from start of disk */
-
-As mentioned, there is no virtual mapping of a bio. For DMA, this is
-not a problem as the driver probably never will need a virtual mapping.
-Instead it needs a bus mapping (dma_map_page for a single segment or
-use dma_map_sg for scatter gather) to be able to ship it to the driver. For
-PIO drivers (or drivers that need to revert to PIO transfer once in a
-while (IDE for example)), where the CPU is doing the actual data
-transfer a virtual mapping is needed. If the driver supports highmem I/O,
-(Sec 1.1, (ii) ) it needs to use kmap_atomic or similar to temporarily map
-a bio into the virtual address space.
-
-
-8. Prior/Related/Impacted patches
-=================================
-
-8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
------------------------------------------------------
-
-- orig kiobuf & raw i/o patches (now in 2.4 tree)
-- direct kiobuf based i/o to devices (no intermediate bh's)
-- page i/o using kiobuf
-- kiobuf splitting for lvm (mkp)
-- elevator support for kiobuf request merging (axboe)
-
-8.2. Zero-copy networking (Dave Miller)
----------------------------------------
-
-8.3. SGI XFS - pagebuf patches - use of kiobufs
------------------------------------------------
-8.4. Multi-page pioent patch for bio (Christoph Hellwig)
---------------------------------------------------------
-8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
---------------------------------------------------------------------
-8.6. Async i/o implementation patch (Ben LaHaise)
--------------------------------------------------
-8.7. EVMS layering design (IBM EVMS team)
------------------------------------------
-8.8. Larger page cache size patch (Ben LaHaise) and Large page size (Daniel Phillips)
--------------------------------------------------------------------------------------
-
- => larger contiguous physical memory buffers
-
-8.9. VM reservations patch (Ben LaHaise)
-----------------------------------------
-8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
-----------------------------------------------------------
-8.11. Block device in page cache patch (Andrea Archangeli) - now in 2.4.10+
----------------------------------------------------------------------------
-8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar, Badari)
--------------------------------------------------------------------------------
-8.13 Priority based i/o scheduler - prepatches (Arjan van de Ven)
-------------------------------------------------------------------
-8.14 IDE Taskfile i/o patch (Andre Hedrick)
---------------------------------------------
-8.15 Multi-page writeout and readahead patches (Andrew Morton)
----------------------------------------------------------------
-8.16 Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarthy)
------------------------------------------------------------------------
-
-9. Other References
-===================
-
-9.1 The Splice I/O Model
-------------------------
-
-Larry McVoy (and subsequent discussions on lkml, and Linus' comments - Jan 2001
-
-9.2 Discussions about kiobuf and bh design
-------------------------------------------
-
-On lkml between sct, linus, alan et al - Feb-March 2001 (many of the
-initial thoughts that led to bio were brought up in this discussion thread)
-
-9.3 Discussions on mempool on lkml - Dec 2001.
-----------------------------------------------
diff --git a/Documentation/block/capability.rst b/Documentation/block/capability.rst
index 160a5148b915..2ae7f064736a 100644
--- a/Documentation/block/capability.rst
+++ b/Documentation/block/capability.rst
@@ -7,4 +7,4 @@ This file documents the sysfs file ``block/<disk>/capability``.
``capability`` is a bitfield, printed in hexadecimal, indicating which
capabilities a specific block device supports:
-.. kernel-doc:: include/linux/genhd.h
+.. kernel-doc:: include/linux/blkdev.h
diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst
index 3a41495dd77b..68f115f2b1c6 100644
--- a/Documentation/block/index.rst
+++ b/Documentation/block/index.rst
@@ -8,7 +8,6 @@ Block
:maxdepth: 1
bfq-iosched
- biodoc
biovecs
blk-mq
capability
diff --git a/Documentation/cdrom/packet-writing.rst b/Documentation/cdrom/packet-writing.rst
index c5c957195a5a..43db58c50d29 100644
--- a/Documentation/cdrom/packet-writing.rst
+++ b/Documentation/cdrom/packet-writing.rst
@@ -11,7 +11,7 @@ Getting started quick
- Compile and install kernel and modules, reboot.
- You need the udftools package (pktsetup, mkudffs, cdrwtool).
- Download from http://sourceforge.net/projects/linux-udf/
+ Download from https://github.com/pali/udftools
- Grab a new CD-RW disc and format it (assuming CD-RW is hdc, substitute
as appropriate)::
@@ -102,7 +102,7 @@ Using the pktcdvd sysfs interface
Since Linux 2.6.20, the pktcdvd module has a sysfs interface
and can be controlled by it. For example the "pktcdvd" tool uses
-this interface. (see http://tom.ist-im-web.de/download/pktcdvd )
+this interface. (see http://tom.ist-im-web.de/linux/software/pktcdvd )
"pktcdvd" works similar to "pktsetup", e.g.::
diff --git a/Documentation/conf.py b/Documentation/conf.py
index f07f2e9b9f2c..072ee31a301d 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -409,135 +409,25 @@ latex_elements = {
# Additional stuff for the LaTeX preamble.
'preamble': '''
- % Prevent column squeezing of tabulary.
- \\setlength{\\tymin}{20em}
% Use some font with UTF-8 support with XeLaTeX
\\usepackage{fontspec}
\\setsansfont{DejaVu Sans}
\\setromanfont{DejaVu Serif}
\\setmonofont{DejaVu Sans Mono}
- % Adjust \\headheight for fancyhdr
- \\addtolength{\\headheight}{1.6pt}
- \\addtolength{\\topmargin}{-1.6pt}
- ''',
+ ''',
}
-# Translations have Asian (CJK) characters which are only displayed if
-# xeCJK is used
-
-latex_elements['preamble'] += '''
- \\IfFontExistsTF{Noto Sans CJK SC}{
- % This is needed for translations
- \\usepackage{xeCJK}
- \\IfFontExistsTF{Noto Serif CJK SC}{
- \\setCJKmainfont{Noto Serif CJK SC}[AutoFakeSlant]
- }{
- \\setCJKmainfont{Noto Sans CJK SC}[AutoFakeSlant]
- }
- \\setCJKsansfont{Noto Sans CJK SC}[AutoFakeSlant]
- \\setCJKmonofont{Noto Sans Mono CJK SC}[AutoFakeSlant]
- % CJK Language-specific font choices
- \\IfFontExistsTF{Noto Serif CJK SC}{
- \\newCJKfontfamily[SCmain]\\scmain{Noto Serif CJK SC}[AutoFakeSlant]
- \\newCJKfontfamily[SCserif]\\scserif{Noto Serif CJK SC}[AutoFakeSlant]
- }{
- \\newCJKfontfamily[SCmain]\\scmain{Noto Sans CJK SC}[AutoFakeSlant]
- \\newCJKfontfamily[SCserif]\\scserif{Noto Sans CJK SC}[AutoFakeSlant]
- }
- \\newCJKfontfamily[SCsans]\\scsans{Noto Sans CJK SC}[AutoFakeSlant]
- \\newCJKfontfamily[SCmono]\\scmono{Noto Sans Mono CJK SC}[AutoFakeSlant]
- \\IfFontExistsTF{Noto Serif CJK TC}{
- \\newCJKfontfamily[TCmain]\\tcmain{Noto Serif CJK TC}[AutoFakeSlant]
- \\newCJKfontfamily[TCserif]\\tcserif{Noto Serif CJK TC}[AutoFakeSlant]
- }{
- \\newCJKfontfamily[TCmain]\\tcmain{Noto Sans CJK TC}[AutoFakeSlant]
- \\newCJKfontfamily[TCserif]\\tcserif{Noto Sans CJK TC}[AutoFakeSlant]
- }
- \\newCJKfontfamily[TCsans]\\tcsans{Noto Sans CJK TC}[AutoFakeSlant]
- \\newCJKfontfamily[TCmono]\\tcmono{Noto Sans Mono CJK TC}[AutoFakeSlant]
- \\IfFontExistsTF{Noto Serif CJK KR}{
- \\newCJKfontfamily[KRmain]\\krmain{Noto Serif CJK KR}[AutoFakeSlant]
- \\newCJKfontfamily[KRserif]\\krserif{Noto Serif CJK KR}[AutoFakeSlant]
- }{
- \\newCJKfontfamily[KRmain]\\krmain{Noto Sans CJK KR}[AutoFakeSlant]
- \\newCJKfontfamily[KRserif]\\krserif{Noto Sans CJK KR}[AutoFakeSlant]
- }
- \\newCJKfontfamily[KRsans]\\krsans{Noto Sans CJK KR}[AutoFakeSlant]
- \\newCJKfontfamily[KRmono]\\krmono{Noto Sans Mono CJK KR}[AutoFakeSlant]
- \\IfFontExistsTF{Noto Serif CJK JP}{
- \\newCJKfontfamily[JPmain]\\jpmain{Noto Serif CJK JP}[AutoFakeSlant]
- \\newCJKfontfamily[JPserif]\\jpserif{Noto Serif CJK JP}[AutoFakeSlant]
- }{
- \\newCJKfontfamily[JPmain]\\jpmain{Noto Sans CJK JP}[AutoFakeSlant]
- \\newCJKfontfamily[JPserif]\\jpserif{Noto Sans CJK JP}[AutoFakeSlant]
- }
- \\newCJKfontfamily[JPsans]\\jpsans{Noto Sans CJK JP}[AutoFakeSlant]
- \\newCJKfontfamily[JPmono]\\jpmono{Noto Sans Mono CJK JP}[AutoFakeSlant]
- % Dummy commands for Sphinx < 2.3 (no 'extrapackages' support)
- \\providecommand{\\onehalfspacing}{}
- \\providecommand{\\singlespacing}{}
- % Define custom macros to on/off CJK
- \\newcommand{\\kerneldocCJKon}{\\makexeCJKactive\\onehalfspacing}
- \\newcommand{\\kerneldocCJKoff}{\\makexeCJKinactive\\singlespacing}
- \\newcommand{\\kerneldocBeginSC}{%
- \\begingroup%
- \\scmain%
- }
- \\newcommand{\\kerneldocEndSC}{\\endgroup}
- \\newcommand{\\kerneldocBeginTC}{%
- \\begingroup%
- \\tcmain%
- \\renewcommand{\\CJKrmdefault}{TCserif}%
- \\renewcommand{\\CJKsfdefault}{TCsans}%
- \\renewcommand{\\CJKttdefault}{TCmono}%
- }
- \\newcommand{\\kerneldocEndTC}{\\endgroup}
- \\newcommand{\\kerneldocBeginKR}{%
- \\begingroup%
- \\xeCJKDeclareCharClass{HalfLeft}{`“,`‘}%
- \\xeCJKDeclareCharClass{HalfRight}{`”,`’}%
- \\krmain%
- \\renewcommand{\\CJKrmdefault}{KRserif}%
- \\renewcommand{\\CJKsfdefault}{KRsans}%
- \\renewcommand{\\CJKttdefault}{KRmono}%
- \\xeCJKsetup{CJKspace = true} % For inter-phrase space
- }
- \\newcommand{\\kerneldocEndKR}{\\endgroup}
- \\newcommand{\\kerneldocBeginJP}{%
- \\begingroup%
- \\xeCJKDeclareCharClass{HalfLeft}{`“,`‘}%
- \\xeCJKDeclareCharClass{HalfRight}{`”,`’}%
- \\jpmain%
- \\renewcommand{\\CJKrmdefault}{JPserif}%
- \\renewcommand{\\CJKsfdefault}{JPsans}%
- \\renewcommand{\\CJKttdefault}{JPmono}%
- }
- \\newcommand{\\kerneldocEndJP}{\\endgroup}
- % Single spacing in literal blocks
- \\fvset{baselinestretch=1}
- % To customize \\sphinxtableofcontents
- \\usepackage{etoolbox}
- % Inactivate CJK after tableofcontents
- \\apptocmd{\\sphinxtableofcontents}{\\kerneldocCJKoff}{}{}
- }{ % No CJK font found
- % Custom macros to on/off CJK (Dummy)
- \\newcommand{\\kerneldocCJKon}{}
- \\newcommand{\\kerneldocCJKoff}{}
- \\newcommand{\\kerneldocBeginSC}{}
- \\newcommand{\\kerneldocEndSC}{}
- \\newcommand{\\kerneldocBeginTC}{}
- \\newcommand{\\kerneldocEndTC}{}
- \\newcommand{\\kerneldocBeginKR}{}
- \\newcommand{\\kerneldocEndKR}{}
- \\newcommand{\\kerneldocBeginJP}{}
- \\newcommand{\\kerneldocEndJP}{}
- }
-'''
-
# Fix reference escape troubles with Sphinx 1.4.x
if major == 1:
latex_elements['preamble'] += '\\renewcommand*{\\DUrole}[2]{ #2 }\n'
+
+# Load kerneldoc specific LaTeX settings
+latex_elements['preamble'] += '''
+ % Load kerneldoc specific LaTeX settings
+ \\input{kerneldoc-preamble.sty}
+'''
+
# With Sphinx 1.6, it is possible to change the Bg color directly
# by using:
# \definecolor{sphinxnoteBgColor}{RGB}{204,255,255}
@@ -599,6 +489,11 @@ for fn in os.listdir('.'):
# If false, no module index is generated.
#latex_domain_indices = True
+# Additional LaTeX stuff to be copied to build directory
+latex_additional_files = [
+ 'sphinx/kerneldoc-preamble.sty',
+]
+
# -- Options for manual page output ---------------------------------------
diff --git a/Documentation/core-api/entry.rst b/Documentation/core-api/entry.rst
new file mode 100644
index 000000000000..e12f22ab33c7
--- /dev/null
+++ b/Documentation/core-api/entry.rst
@@ -0,0 +1,279 @@
+Entry/exit handling for exceptions, interrupts, syscalls and KVM
+================================================================
+
+All transitions between execution domains require state updates which are
+subject to strict ordering constraints. State updates are required for the
+following:
+
+ * Lockdep
+ * RCU / Context tracking
+ * Preemption counter
+ * Tracing
+ * Time accounting
+
+The update order depends on the transition type and is explained below in
+the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
+exceptions`_, `NMI and NMI-like exceptions`_.
+
+Non-instrumentable code - noinstr
+---------------------------------
+
+Most instrumentation facilities depend on RCU, so intrumentation is prohibited
+for entry code before RCU starts watching and exit code after RCU stops
+watching. In addition, many architectures must save and restore register state,
+which means that (for example) a breakpoint in the breakpoint entry code would
+overwrite the debug registers of the initial breakpoint.
+
+Such code must be marked with the 'noinstr' attribute, placing that code into a
+special section inaccessible to instrumentation and debug facilities. Some
+functions are partially instrumentable, which is handled by marking them
+noinstr and using instrumentation_begin() and instrumentation_end() to flag the
+instrumentable ranges of code:
+
+.. code-block:: c
+
+ noinstr void entry(void)
+ {
+ handle_entry(); // <-- must be 'noinstr' or '__always_inline'
+ ...
+
+ instrumentation_begin();
+ handle_context(); // <-- instrumentable code
+ instrumentation_end();
+
+ ...
+ handle_exit(); // <-- must be 'noinstr' or '__always_inline'
+ }
+
+This allows verification of the 'noinstr' restrictions via objtool on
+supported architectures.
+
+Invoking non-instrumentable functions from instrumentable context has no
+restrictions and is useful to protect e.g. state switching which would
+cause malfunction if instrumented.
+
+All non-instrumentable entry/exit code sections before and after the RCU
+state transitions must run with interrupts disabled.
+
+Syscalls
+--------
+
+Syscall-entry code starts in assembly code and calls out into low-level C code
+after establishing low-level architecture-specific state and stack frames. This
+low-level C code must not be instrumented. A typical syscall handling function
+invoked from low-level assembly code looks like this:
+
+.. code-block:: c
+
+ noinstr void syscall(struct pt_regs *regs, int nr)
+ {
+ arch_syscall_enter(regs);
+ nr = syscall_enter_from_user_mode(regs, nr);
+
+ instrumentation_begin();
+ if (!invoke_syscall(regs, nr) && nr != -1)
+ result_reg(regs) = __sys_ni_syscall(regs);
+ instrumentation_end();
+
+ syscall_exit_to_user_mode(regs);
+ }
+
+syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
+establishes state in the following order:
+
+ * Lockdep
+ * RCU / Context tracking
+ * Tracing
+
+and then invokes the various entry work functions like ptrace, seccomp, audit,
+syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
+function can be invoked. The instrumentable code section then ends, after which
+syscall_exit_to_user_mode() is invoked.
+
+syscall_exit_to_user_mode() handles all work which needs to be done before
+returning to user space like tracing, audit, signals, task work etc. After
+that it invokes exit_to_user_mode() which again handles the state
+transition in the reverse order:
+
+ * Tracing
+ * RCU / Context tracking
+ * Lockdep
+
+syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
+available as fine grained subfunctions in cases where the architecture code
+has to do extra work between the various steps. In such cases it has to
+ensure that enter_from_user_mode() is called first on entry and
+exit_to_user_mode() is called last on exit.
+
+Do not nest syscalls. Nested systcalls will cause RCU and/or context tracking
+to print a warning.
+
+KVM
+---
+
+Entering or exiting guest mode is very similar to syscalls. From the host
+kernel point of view the CPU goes off into user space when entering the
+guest and returns to the kernel on exit.
+
+kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
+and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
+The state operations have the same ordering.
+
+Task work handling is done separately for guest at the boundary of the
+vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
+the work handled on return to user space.
+
+Do not nest KVM entry/exit transitions because doing so is nonsensical.
+
+Interrupts and regular exceptions
+---------------------------------
+
+Interrupts entry and exit handling is slightly more complex than syscalls
+and KVM transitions.
+
+If an interrupt is raised while the CPU executes in user space, the entry
+and exit handling is exactly the same as for syscalls.
+
+If the interrupt is raised while the CPU executes in kernel space the entry and
+exit handling is slightly different. RCU state is only updated when the
+interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
+already be watching. Lockdep and tracing have to be updated unconditionally.
+
+irqentry_enter() and irqentry_exit() provide the implementation for this.
+
+The architecture-specific part looks similar to syscall handling:
+
+.. code-block:: c
+
+ noinstr void interrupt(struct pt_regs *regs, int nr)
+ {
+ arch_interrupt_enter(regs);
+ state = irqentry_enter(regs);
+
+ instrumentation_begin();
+
+ irq_enter_rcu();
+ invoke_irq_handler(regs, nr);
+ irq_exit_rcu();
+
+ instrumentation_end();
+
+ irqentry_exit(regs, state);
+ }
+
+Note that the invocation of the actual interrupt handler is within a
+irq_enter_rcu() and irq_exit_rcu() pair.
+
+irq_enter_rcu() updates the preemption count which makes in_hardirq()
+return true, handles NOHZ tick state and interrupt time accounting. This
+means that up to the point where irq_enter_rcu() is invoked in_hardirq()
+returns false.
+
+irq_exit_rcu() handles interrupt time accounting, undoes the preemption
+count update and eventually handles soft interrupts and NOHZ tick state.
+
+In theory, the preemption count could be updated in irqentry_enter(). In
+practice, deferring this update to irq_enter_rcu() allows the preemption-count
+code to be traced, while also maintaining symmetry with irq_exit_rcu() and
+irqentry_exit(), which are described in the next paragraph. The only downside
+is that the early entry code up to irq_enter_rcu() must be aware that the
+preemption count has not yet been updated with the HARDIRQ_OFFSET state.
+
+Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
+before it handles soft interrupts, whose handlers must run in BH context rather
+than irq-disabled context. In addition, irqentry_exit() might schedule, which
+also requires that HARDIRQ_OFFSET has been removed from the preemption count.
+
+Even though interrupt handlers are expected to run with local interrupts
+disabled, interrupt nesting is common from an entry/exit perspective. For
+example, softirq handling happens within an irqentry_{enter,exit}() block with
+local interrupts enabled. Also, although uncommon, nothing prevents an
+interrupt handler from re-enabling interrupts.
+
+Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
+runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
+the entry code is shared between the two.
+
+NMI and NMI-like exceptions
+---------------------------
+
+NMIs and NMI-like exceptions (machine checks, double faults, debug
+interrupts, etc.) can hit any context and must be extra careful with
+the state.
+
+State changes for debug exceptions and machine-check exceptions depend on
+whether these exceptions happened in user-space (breakpoints or watchpoints) or
+in kernel mode (code patching). From user-space, they are treated like
+interrupts, while from kernel mode they are treated like NMIs.
+
+NMIs and other NMI-like exceptions handle state transitions without
+distinguishing between user-mode and kernel-mode origin.
+
+The state update on entry is handled in irqentry_nmi_enter() which updates
+state in the following order:
+
+ * Preemption counter
+ * Lockdep
+ * RCU / Context tracking
+ * Tracing
+
+The exit counterpart irqentry_nmi_exit() does the reverse operation in the
+reverse order.
+
+Note that the update of the preemption counter has to be the first
+operation on enter and the last operation on exit. The reason is that both
+lockdep and RCU rely on in_nmi() returning true in this case. The
+preemption count modification in the NMI entry/exit case must not be
+traced.
+
+Architecture-specific code looks like this:
+
+.. code-block:: c
+
+ noinstr void nmi(struct pt_regs *regs)
+ {
+ arch_nmi_enter(regs);
+ state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+ nmi_handler(regs);
+ instrumentation_end();
+
+ irqentry_nmi_exit(regs);
+ }
+
+and for e.g. a debug exception it can look like this:
+
+.. code-block:: c
+
+ noinstr void debug(struct pt_regs *regs)
+ {
+ arch_nmi_enter(regs);
+
+ debug_regs = save_debug_regs();
+
+ if (user_mode(regs)) {
+ state = irqentry_enter(regs);
+
+ instrumentation_begin();
+ user_mode_debug_handler(regs, debug_regs);
+ instrumentation_end();
+
+ irqentry_exit(regs, state);
+ } else {
+ state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+ kernel_mode_debug_handler(regs, debug_regs);
+ instrumentation_end();
+
+ irqentry_nmi_exit(regs, state);
+ }
+ }
+
+There is no combined irqentry_nmi_if_kernel() function available as the
+above cannot be handled in an exception-agnostic way.
+
+NMIs can happen in any context. For example, an NMI-like exception triggered
+while handling an NMI. So NMI entry code has to be reentrant and state updates
+need to handle nesting.
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 5de2c7a4b1b3..972d46a5ddf6 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -44,6 +44,14 @@ Library functionality that is used throughout the kernel.
timekeeping
errseq
+Low level entry and exit
+========================
+
+.. toctree::
+ :maxdepth: 1
+
+ entry
+
Concurrency primitives
======================
diff --git a/Documentation/dev-tools/ktap.rst b/Documentation/dev-tools/ktap.rst
index 878530cb9c27..5ee735c6687f 100644
--- a/Documentation/dev-tools/ktap.rst
+++ b/Documentation/dev-tools/ktap.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-========================================
-The Kernel Test Anything Protocol (KTAP)
-========================================
+===================================================
+The Kernel Test Anything Protocol (KTAP), version 1
+===================================================
TAP, or the Test Anything Protocol is a format for specifying test results used
by a number of projects. It's website and specification are found at this `link
@@ -68,7 +68,7 @@ Test case result lines
Test case result lines indicate the final status of a test.
They are required and must have the format:
-.. code-block::
+.. code-block:: none
<result> <number> [<description>][ # [<directive>] [<diagnostic data>]]
@@ -117,32 +117,32 @@ separator.
Example result lines include:
-.. code-block::
+.. code-block:: none
ok 1 test_case_name
The test "test_case_name" passed.
-.. code-block::
+.. code-block:: none
not ok 1 test_case_name
The test "test_case_name" failed.
-.. code-block::
+.. code-block:: none
ok 1 test # SKIP necessary dependency unavailable
The test "test" was SKIPPED with the diagnostic message "necessary dependency
unavailable".
-.. code-block::
+.. code-block:: none
not ok 1 test # TIMEOUT 30 seconds
The test "test" timed out, with diagnostic data "30 seconds".
-.. code-block::
+.. code-block:: none
ok 5 check return code # rcode=0
@@ -174,6 +174,13 @@ There may be lines within KTAP output that do not follow the format of one of
the four formats for lines described above. This is allowed, however, they will
not influence the status of the tests.
+This is an important difference from TAP. Kernel tests may print messages
+to the system console or a log file. Both of these destinations may contain
+messages either from unrelated kernel or userspace activity, or kernel
+messages from non-test code that is invoked by the test. The kernel code
+invoked by the test likely is not aware that a test is in progress and
+thus can not print the message as a diagnostic message.
+
Nested tests
------------
@@ -186,13 +193,16 @@ starting with another KTAP version line and test plan, and end with the overall
result. If one of the subtests fail, for example, the parent test should also
fail.
-Additionally, all result lines in a subtest should be indented. One level of
+Additionally, all lines in a subtest should be indented. One level of
indentation is two spaces: " ". The indentation should begin at the version
line and should end before the parent test's result line.
+"Unknown lines" are not considered to be lines in a subtest and thus are
+allowed to be either indented or not indented.
+
An example of a test with two nested subtests:
-.. code-block::
+.. code-block:: none
KTAP version 1
1..1
@@ -205,7 +215,7 @@ An example of a test with two nested subtests:
An example format with multiple levels of nested testing:
-.. code-block::
+.. code-block:: none
KTAP version 1
1..2
@@ -224,10 +234,15 @@ An example format with multiple levels of nested testing:
Major differences between TAP and KTAP
--------------------------------------
-Note the major differences between the TAP and KTAP specification:
-- yaml and json are not recommended in diagnostic messages
-- TODO directive not recognized
-- KTAP allows for an arbitrary number of tests to be nested
+================================================== ========= ===============
+Feature TAP KTAP
+================================================== ========= ===============
+yaml and json in diagnosic message ok not recommended
+TODO directive ok not recognized
+allows an arbitrary number of tests to be nested no yes
+"Unknown lines" are in category of "Anything else" yes no
+"Unknown lines" are incorrect allowed
+================================================== ========= ===============
The TAP14 specification does permit nested tests, but instead of using another
nested version line, uses a line of the form
@@ -235,7 +250,7 @@ nested version line, uses a line of the form
Example KTAP output
--------------------
-.. code-block::
+.. code-block:: none
KTAP version 1
1..1
diff --git a/Documentation/devicetree/bindings/arm/pmu.yaml b/Documentation/devicetree/bindings/arm/pmu.yaml
index 981bac451698..7a04b8aaaec3 100644
--- a/Documentation/devicetree/bindings/arm/pmu.yaml
+++ b/Documentation/devicetree/bindings/arm/pmu.yaml
@@ -20,6 +20,8 @@ properties:
items:
- enum:
- apm,potenza-pmu
+ - apple,firestorm-pmu
+ - apple,icestorm-pmu
- arm,armv8-pmuv3 # Only for s/w models
- arm,arm1136-pmu
- arm,arm1176-pmu
diff --git a/Documentation/devicetree/bindings/extcon/maxim,max77843.yaml b/Documentation/devicetree/bindings/extcon/maxim,max77843.yaml
new file mode 100644
index 000000000000..f9ffe3d6f957
--- /dev/null
+++ b/Documentation/devicetree/bindings/extcon/maxim,max77843.yaml
@@ -0,0 +1,40 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/extcon/maxim,max77843.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Maxim MAX77843 MicroUSB and Companion Power Management IC Extcon
+
+maintainers:
+ - Chanwoo Choi <cw00.choi@samsung.com>
+ - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+
+description: |
+ This is a part of device tree bindings for Maxim MAX77843 MicroUSB
+ Integrated Circuit (MUIC).
+
+ See also Documentation/devicetree/bindings/mfd/maxim,max77843.yaml for
+ additional information and example.
+
+properties:
+ compatible:
+ const: maxim,max77843-muic
+
+ connector:
+ $ref: /schemas/connector/usb-connector.yaml#
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Any connector to the data bus of this controller should be modelled using
+ the OF graph bindings specified
+ properties:
+ port:
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - connector
+
+additionalProperties: false
diff --git a/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml b/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
index 223393d7cafd..ab87f51c5aef 100644
--- a/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
+++ b/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
@@ -37,6 +37,72 @@ properties:
description:
Shunt resistor value in micro-Ohm.
+ adi,volt-curr-sample-average:
+ description: |
+ Number of samples to be used to report voltage and current values.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [1, 2, 4, 8, 16, 32, 64, 128]
+
+ adi,power-sample-average:
+ description: |
+ Number of samples to be used to report power values.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [1, 2, 4, 8, 16, 32, 64, 128]
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - adi,adm1075
+ - adi,adm1276
+ then:
+ properties:
+ adi,volt-curr-sample-average:
+ default: 128
+ adi,power-sample-average: false
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - adi,adm1275
+ then:
+ properties:
+ adi,volt-curr-sample-average:
+ default: 16
+ adi,power-sample-average: false
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - adi,adm1272
+ then:
+ properties:
+ adi,volt-curr-sample-average:
+ default: 128
+ adi,power-sample-average:
+ default: 128
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - adi,adm1278
+ - adi,adm1293
+ - adi,adm1294
+ then:
+ properties:
+ adi,volt-curr-sample-average:
+ default: 128
+ adi,power-sample-average:
+ default: 1
+
required:
- compatible
- reg
@@ -53,5 +119,7 @@ examples:
compatible = "adi,adm1272";
reg = <0x10>;
shunt-resistor-micro-ohms = <500>;
+ adi,volt-curr-sample-average = <128>;
+ adi,power-sample-average = <128>;
};
};
diff --git a/Documentation/devicetree/bindings/hwmon/national,lm90.yaml b/Documentation/devicetree/bindings/hwmon/national,lm90.yaml
index 6e1d54ff5d5b..30db92977937 100644
--- a/Documentation/devicetree/bindings/hwmon/national,lm90.yaml
+++ b/Documentation/devicetree/bindings/hwmon/national,lm90.yaml
@@ -60,7 +60,6 @@ additionalProperties: false
examples:
- |
- #include <dt-bindings/gpio/tegra-gpio.h>
#include <dt-bindings/interrupt-controller/irq.h>
i2c {
@@ -71,8 +70,7 @@ examples:
compatible = "onnn,nct1008";
reg = <0x4c>;
vcc-supply = <&palmas_ldo6_reg>;
- interrupt-parent = <&gpio>;
- interrupts = <TEGRA_GPIO(O, 4) IRQ_TYPE_LEVEL_LOW>;
+ interrupts = <4 IRQ_TYPE_LEVEL_LOW>;
#thermal-sensor-cells = <1>;
};
};
diff --git a/Documentation/devicetree/bindings/hwmon/ti,tmp464.yaml b/Documentation/devicetree/bindings/hwmon/ti,tmp464.yaml
new file mode 100644
index 000000000000..801ca9ba7d34
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/ti,tmp464.yaml
@@ -0,0 +1,114 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/ti,tmp464.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TMP464 and TMP468 temperature sensors
+
+maintainers:
+ - Agathe Porte <agathe.porte@nokia.com>
+
+description: |
+ ±0.0625°C Remote and Local temperature sensor
+ https://www.ti.com/lit/ds/symlink/tmp464.pdf
+ https://www.ti.com/lit/ds/symlink/tmp468.pdf
+
+properties:
+ compatible:
+ enum:
+ - ti,tmp464
+ - ti,tmp468
+
+ reg:
+ maxItems: 1
+
+ '#address-cells':
+ const: 1
+
+ '#size-cells':
+ const: 0
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+patternProperties:
+ "^channel@([0-8])$":
+ type: object
+ description: |
+ Represents channels of the device and their specific configuration.
+
+ properties:
+ reg:
+ description: |
+ The channel number. 0 is local channel, 1-8 are remote channels.
+ items:
+ minimum: 0
+ maximum: 8
+
+ label:
+ description: |
+ A descriptive name for this channel, like "ambient" or "psu".
+
+ ti,n-factor:
+ description: |
+ The value (two's complement) to be programmed in the channel specific N correction register.
+ For remote channels only.
+ $ref: /schemas/types.yaml#/definitions/int32
+ items:
+ minimum: -128
+ maximum: 127
+
+ required:
+ - reg
+
+ additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@4b {
+ compatible = "ti,tmp464";
+ reg = <0x4b>;
+ };
+ };
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@4b {
+ compatible = "ti,tmp464";
+ reg = <0x4b>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ channel@0 {
+ reg = <0x0>;
+ label = "local";
+ };
+
+ channel@1 {
+ reg = <0x1>;
+ ti,n-factor = <(-10)>;
+ label = "external";
+ };
+
+ channel@2 {
+ reg = <0x2>;
+ ti,n-factor = <0x10>;
+ label = "somelabel";
+ };
+
+ channel@3 {
+ reg = <0x3>;
+ status = "disabled";
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/interrupt-controller/amlogic,meson-gpio-intc.txt b/Documentation/devicetree/bindings/interrupt-controller/amlogic,meson-gpio-intc.txt
index 23b18b92c558..bde63f8f090e 100644
--- a/Documentation/devicetree/bindings/interrupt-controller/amlogic,meson-gpio-intc.txt
+++ b/Documentation/devicetree/bindings/interrupt-controller/amlogic,meson-gpio-intc.txt
@@ -18,6 +18,7 @@ Required properties:
"amlogic,meson-g12a-gpio-intc" for G12A SoCs (S905D2, S905X2, S905Y2)
"amlogic,meson-sm1-gpio-intc" for SM1 SoCs (S905D3, S905X3, S905Y3)
"amlogic,meson-a1-gpio-intc" for A1 SoCs (A113L)
+ "amlogic,meson-s4-gpio-intc" for S4 SoCs (S802X2, S905Y4, S805X2G, S905W2)
- reg : Specifies base physical address and size of the registers.
- interrupt-controller : Identifies the node as an interrupt controller.
- #interrupt-cells : Specifies the number of cells needed to encode an
diff --git a/Documentation/devicetree/bindings/interrupt-controller/apple,aic.yaml b/Documentation/devicetree/bindings/interrupt-controller/apple,aic.yaml
index 97359024709a..85c85b694217 100644
--- a/Documentation/devicetree/bindings/interrupt-controller/apple,aic.yaml
+++ b/Documentation/devicetree/bindings/interrupt-controller/apple,aic.yaml
@@ -56,6 +56,8 @@ properties:
- 1: virtual HV timer
- 2: physical guest timer
- 3: virtual guest timer
+ - 4: 'efficient' CPU PMU
+ - 5: 'performance' CPU PMU
The 3rd cell contains the interrupt flags. This is normally
IRQ_TYPE_LEVEL_HIGH (4).
@@ -68,6 +70,35 @@ properties:
power-domains:
maxItems: 1
+ affinities:
+ type: object
+ additionalProperties: false
+ description:
+ FIQ affinity can be expressed as a single "affinities" node,
+ containing a set of sub-nodes, one per FIQ with a non-default
+ affinity.
+ patternProperties:
+ "^.+-affinity$":
+ type: object
+ additionalProperties: false
+ properties:
+ apple,fiq-index:
+ description:
+ The interrupt number specified as a FIQ, and for which
+ the affinity is not the default.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ maximum: 5
+
+ cpus:
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ description:
+ Should be a list of phandles to CPU nodes (as described in
+ Documentation/devicetree/bindings/arm/cpus.yaml).
+
+ required:
+ - fiq-index
+ - cpus
+
required:
- compatible
- '#interrupt-cells'
diff --git a/Documentation/devicetree/bindings/interrupt-controller/apple,aic2.yaml b/Documentation/devicetree/bindings/interrupt-controller/apple,aic2.yaml
new file mode 100644
index 000000000000..47a78a167aba
--- /dev/null
+++ b/Documentation/devicetree/bindings/interrupt-controller/apple,aic2.yaml
@@ -0,0 +1,98 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/interrupt-controller/apple,aic2.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Apple Interrupt Controller 2
+
+maintainers:
+ - Hector Martin <marcan@marcan.st>
+
+description: |
+ The Apple Interrupt Controller 2 is a simple interrupt controller present on
+ Apple ARM SoC platforms starting with t600x (M1 Pro and Max).
+
+ It provides the following features:
+
+ - Level-triggered hardware IRQs wired to SoC blocks
+ - Single mask bit per IRQ
+ - Automatic masking on event delivery (auto-ack)
+ - Software triggering (ORed with hw line)
+ - Automatic prioritization (single event/ack register per CPU, lower IRQs =
+ higher priority)
+ - Automatic masking on ack
+ - Support for multiple dies
+
+ This device also represents the FIQ interrupt sources on platforms using AIC,
+ which do not go through a discrete interrupt controller. It also handles
+ FIQ-based Fast IPIs.
+
+properties:
+ compatible:
+ items:
+ - const: apple,t6000-aic
+ - const: apple,aic2
+
+ interrupt-controller: true
+
+ '#interrupt-cells':
+ const: 4
+ description: |
+ The 1st cell contains the interrupt type:
+ - 0: Hardware IRQ
+ - 1: FIQ
+
+ The 2nd cell contains the die ID.
+
+ The next cell contains the interrupt number.
+ - HW IRQs: interrupt number
+ - FIQs:
+ - 0: physical HV timer
+ - 1: virtual HV timer
+ - 2: physical guest timer
+ - 3: virtual guest timer
+
+ The last cell contains the interrupt flags. This is normally
+ IRQ_TYPE_LEVEL_HIGH (4).
+
+ reg:
+ items:
+ - description: Address and size of the main AIC2 registers.
+ - description: Address and size of the AIC2 Event register.
+
+ reg-names:
+ items:
+ - const: core
+ - const: event
+
+ power-domains:
+ maxItems: 1
+
+required:
+ - compatible
+ - '#interrupt-cells'
+ - interrupt-controller
+ - reg
+ - reg-names
+
+additionalProperties: false
+
+allOf:
+ - $ref: /schemas/interrupt-controller.yaml#
+
+examples:
+ - |
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ aic: interrupt-controller@28e100000 {
+ compatible = "apple,t6000-aic", "apple,aic2";
+ #interrupt-cells = <4>;
+ interrupt-controller;
+ reg = <0x2 0x8e100000 0x0 0xc000>,
+ <0x2 0x8e10c000 0x0 0x4>;
+ reg-names = "core", "event";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/interrupt-controller/qcom,mpm.yaml b/Documentation/devicetree/bindings/interrupt-controller/qcom,mpm.yaml
new file mode 100644
index 000000000000..509d20c091af
--- /dev/null
+++ b/Documentation/devicetree/bindings/interrupt-controller/qcom,mpm.yaml
@@ -0,0 +1,96 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/interrupt-controller/qcom,mpm.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcom MPM Interrupt Controller
+
+maintainers:
+ - Shawn Guo <shawn.guo@linaro.org>
+
+description:
+ Qualcomm Technologies Inc. SoCs based on the RPM architecture have a
+ MSM Power Manager (MPM) that is in always-on domain. In addition to managing
+ resources during sleep, the hardware also has an interrupt controller that
+ monitors the interrupts when the system is asleep, wakes up the APSS when
+ one of these interrupts occur and replays it to GIC interrupt controller
+ after GIC becomes operational.
+
+allOf:
+ - $ref: /schemas/interrupt-controller.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: qcom,mpm
+
+ reg:
+ maxItems: 1
+ description:
+ Specifies the base address and size of vMPM registers in RPM MSG RAM.
+
+ interrupts:
+ maxItems: 1
+ description:
+ Specify the IRQ used by RPM to wakeup APSS.
+
+ mboxes:
+ maxItems: 1
+ description:
+ Specify the mailbox used to notify RPM for writing vMPM registers.
+
+ interrupt-controller: true
+
+ '#interrupt-cells':
+ const: 2
+ description:
+ The first cell is the MPM pin number for the interrupt, and the second
+ is the trigger type.
+
+ qcom,mpm-pin-count:
+ description:
+ Specify the total MPM pin count that a SoC supports.
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+ qcom,mpm-pin-map:
+ description:
+ A set of MPM pin numbers and the corresponding GIC SPIs.
+ $ref: /schemas/types.yaml#/definitions/uint32-matrix
+ items:
+ items:
+ - description: MPM pin number
+ - description: GIC SPI number for the MPM pin
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - mboxes
+ - interrupt-controller
+ - '#interrupt-cells'
+ - qcom,mpm-pin-count
+ - qcom,mpm-pin-map
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ mpm: interrupt-controller@45f01b8 {
+ compatible = "qcom,mpm";
+ interrupts = <GIC_SPI 197 IRQ_TYPE_EDGE_RISING>;
+ reg = <0x45f01b8 0x1000>;
+ mboxes = <&apcs_glb 1>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ interrupt-parent = <&intc>;
+ qcom,mpm-pin-count = <96>;
+ qcom,mpm-pin-map = <2 275>,
+ <5 296>,
+ <12 422>,
+ <24 79>,
+ <86 183>,
+ <90 260>,
+ <91 260>;
+ };
diff --git a/Documentation/devicetree/bindings/interrupt-controller/st,stm32-exti.yaml b/Documentation/devicetree/bindings/interrupt-controller/st,stm32-exti.yaml
index d19c881b4abc..e44daa09b137 100644
--- a/Documentation/devicetree/bindings/interrupt-controller/st,stm32-exti.yaml
+++ b/Documentation/devicetree/bindings/interrupt-controller/st,stm32-exti.yaml
@@ -20,6 +20,7 @@ properties:
- items:
- enum:
- st,stm32mp1-exti
+ - st,stm32mp13-exti
- const: syscon
"#interrupt-cells":
diff --git a/Documentation/devicetree/bindings/mfd/google,cros-ec.yaml b/Documentation/devicetree/bindings/mfd/google,cros-ec.yaml
index d1f53bd449f7..4caadf73fc4a 100644
--- a/Documentation/devicetree/bindings/mfd/google,cros-ec.yaml
+++ b/Documentation/devicetree/bindings/mfd/google,cros-ec.yaml
@@ -31,7 +31,7 @@ properties:
controller-data:
description:
- SPI controller data, see bindings/spi/spi-samsung.txt
+ SPI controller data, see bindings/spi/samsung,spi-peripheral-props.yaml
type: object
google,cros-ec-spi-pre-delay:
@@ -148,18 +148,21 @@ patternProperties:
required:
- compatible
-if:
- properties:
- compatible:
- contains:
- enum:
- - google,cros-ec-i2c
- - google,cros-ec-rpmsg
-then:
- properties:
- google,cros-ec-spi-pre-delay: false
- google,cros-ec-spi-msg-delay: false
- spi-max-frequency: false
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - google,cros-ec-i2c
+ - google,cros-ec-rpmsg
+ then:
+ properties:
+ google,cros-ec-spi-pre-delay: false
+ google,cros-ec-spi-msg-delay: false
+ spi-max-frequency: false
+ else:
+ $ref: /schemas/spi/spi-peripheral-props.yaml
additionalProperties: false
@@ -200,7 +203,7 @@ examples:
spi-max-frequency = <5000000>;
proximity {
- compatible = "google,cros-ec-mkbp-proximity";
+ compatible = "google,cros-ec-mkbp-proximity";
};
cbas {
diff --git a/Documentation/devicetree/bindings/mfd/max14577.txt b/Documentation/devicetree/bindings/mfd/max14577.txt
deleted file mode 100644
index be11943a0560..000000000000
--- a/Documentation/devicetree/bindings/mfd/max14577.txt
+++ /dev/null
@@ -1,147 +0,0 @@
-Maxim MAX14577/77836 Multi-Function Device
-
-MAX14577 is a Multi-Function Device with Micro-USB Interface Circuit, Li+
-Battery Charger and SFOUT LDO output for powering USB devices. It is
-interfaced to host controller using I2C.
-
-MAX77836 additionally contains PMIC (with two LDO regulators) and Fuel Gauge.
-For the description of Fuel Gauge low SOC alert interrupt see:
-../power/supply/max17040_battery.txt
-
-
-Required properties:
-- compatible : Must be "maxim,max14577" or "maxim,max77836".
-- reg : I2C slave address for the max14577 chip (0x25 for max14577/max77836)
-- interrupts : IRQ line for the chip.
-
-
-Required nodes:
- - charger :
- Node for configuring the charger driver.
- Required properties:
- - compatible : "maxim,max14577-charger"
- or "maxim,max77836-charger"
- - maxim,fast-charge-uamp : Current in uA for Fast Charge;
- Valid values:
- - for max14577: 90000 - 950000;
- - for max77836: 45000 - 475000;
- - maxim,eoc-uamp : Current in uA for End-Of-Charge mode;
- Valid values:
- - for max14577: 50000 - 200000;
- - for max77836: 5000 - 100000;
- - maxim,ovp-uvolt : OverVoltage Protection Threshold in uV;
- In an overvoltage condition, INT asserts and charging
- stops. Valid values:
- - 6000000, 6500000, 7000000, 7500000;
- - maxim,constant-uvolt : Battery Constant Voltage in uV;
- Valid values:
- - 4000000 - 4280000 (step by 20000);
- - 4350000;
-
-
-Optional nodes:
-- max14577-muic/max77836-muic :
- Node used only by extcon consumers.
- Required properties:
- - compatible : "maxim,max14577-muic" or "maxim,max77836-muic"
-
-- regulators :
- Required properties:
- - compatible : "maxim,max14577-regulator"
- or "maxim,max77836-regulator"
-
- May contain a sub-node per regulator from the list below. Each
- sub-node should contain the constraints and initialization information
- for that regulator. See regulator.txt for a description of standard
- properties for these sub-nodes.
-
- List of valid regulator names:
- - for max14577: CHARGER, SAFEOUT.
- - for max77836: CHARGER, SAFEOUT, LDO1, LDO2.
-
- The SAFEOUT is a fixed voltage regulator so there is no need to specify
- voltages for it.
-
-
-Example:
-
-#include <dt-bindings/interrupt-controller/irq.h>
-
-max14577@25 {
- compatible = "maxim,max14577";
- reg = <0x25>;
- interrupt-parent = <&gpx1>;
- interrupts = <5 IRQ_TYPE_LEVEL_LOW>;
-
- muic: max14577-muic {
- compatible = "maxim,max14577-muic";
- };
-
- regulators {
- compatible = "maxim,max14577-regulator";
-
- SAFEOUT {
- regulator-name = "SAFEOUT";
- };
- CHARGER {
- regulator-name = "CHARGER";
- regulator-min-microamp = <90000>;
- regulator-max-microamp = <950000>;
- regulator-boot-on;
- };
- };
-
- charger {
- compatible = "maxim,max14577-charger";
-
- maxim,constant-uvolt = <4350000>;
- maxim,fast-charge-uamp = <450000>;
- maxim,eoc-uamp = <50000>;
- maxim,ovp-uvolt = <6500000>;
- };
-};
-
-
-max77836@25 {
- compatible = "maxim,max77836";
- reg = <0x25>;
- interrupt-parent = <&gpx1>;
- interrupts = <5 IRQ_TYPE_LEVEL_LOW>;
-
- muic: max77836-muic {
- compatible = "maxim,max77836-muic";
- };
-
- regulators {
- compatible = "maxim,max77836-regulator";
-
- SAFEOUT {
- regulator-name = "SAFEOUT";
- };
- CHARGER {
- regulator-name = "CHARGER";
- regulator-min-microamp = <90000>;
- regulator-max-microamp = <950000>;
- regulator-boot-on;
- };
- LDO1 {
- regulator-name = "LDO1";
- regulator-min-microvolt = <2700000>;
- regulator-max-microvolt = <2700000>;
- };
- LDO2 {
- regulator-name = "LDO2";
- regulator-min-microvolt = <800000>;
- regulator-max-microvolt = <3950000>;
- };
- };
-
- charger {
- compatible = "maxim,max77836-charger";
-
- maxim,constant-uvolt = <4350000>;
- maxim,fast-charge-uamp = <225000>;
- maxim,eoc-uamp = <7500>;
- maxim,ovp-uvolt = <6500000>;
- };
-};
diff --git a/Documentation/devicetree/bindings/mfd/max77802.txt b/Documentation/devicetree/bindings/mfd/max77802.txt
deleted file mode 100644
index 09decac20d91..000000000000
--- a/Documentation/devicetree/bindings/mfd/max77802.txt
+++ /dev/null
@@ -1,25 +0,0 @@
-Maxim MAX77802 multi-function device
-
-The Maxim MAX77802 is a Power Management IC (PMIC) that contains 10 high
-efficiency Buck regulators, 32 Low-DropOut (LDO) regulators used to power
-up application processors and peripherals, a 2-channel 32kHz clock outputs,
-a Real-Time-Clock (RTC) and a I2C interface to program the individual
-regulators, clocks outputs and the RTC.
-
-Bindings for the built-in 32k clock generator block and
-regulators are defined in ../clk/maxim,max77802.txt and
-../regulator/max77802.txt respectively.
-
-Required properties:
-- compatible : Must be "maxim,max77802"
-- reg : Specifies the I2C slave address of PMIC block.
-- interrupts : I2C device IRQ line connected to the main SoC.
-
-Example:
-
- max77802: pmic@9 {
- compatible = "maxim,max77802";
- interrupt-parent = <&intc>;
- interrupts = <26 IRQ_TYPE_NONE>;
- reg = <0x09>;
- };
diff --git a/Documentation/devicetree/bindings/mfd/maxim,max14577.yaml b/Documentation/devicetree/bindings/mfd/maxim,max14577.yaml
new file mode 100644
index 000000000000..27870b8760a6
--- /dev/null
+++ b/Documentation/devicetree/bindings/mfd/maxim,max14577.yaml
@@ -0,0 +1,195 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/mfd/maxim,max14577.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Maxim MAX14577/MAX77836 MicroUSB and Companion Power Management IC
+
+maintainers:
+ - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+
+description: |
+ This is a part of device tree bindings for Maxim MAX14577/MAX77836 MicroUSB
+ Integrated Circuit (MUIC).
+
+ The Maxim MAX14577 is a MicroUSB and Companion Power Management IC which
+ includes voltage safeout regulators, charger and MicroUSB management IC.
+
+ The Maxim MAX77836 is a MicroUSB and Companion Power Management IC which
+ includes voltage safeout and LDO regulators, charger, fuel-gauge and MicroUSB
+ management IC.
+
+properties:
+ compatible:
+ enum:
+ - maxim,max14577
+ - maxim,max77836
+
+ interrupts:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+ wakeup-source: true
+
+ charger:
+ $ref: /schemas/power/supply/maxim,max14577.yaml
+
+ extcon:
+ type: object
+ properties:
+ compatible:
+ enum:
+ - maxim,max14577-muic
+ - maxim,max77836-muic
+
+ required:
+ - compatible
+
+ regulators:
+ $ref: /schemas/regulator/maxim,max14577.yaml
+
+required:
+ - compatible
+ - interrupts
+ - reg
+ - charger
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: maxim,max14577
+ then:
+ properties:
+ charger:
+ properties:
+ compatible:
+ const: maxim,max14577-charger
+ extcon:
+ properties:
+ compatible:
+ const: maxim,max14577-muic
+ regulator:
+ properties:
+ compatible:
+ const: maxim,max14577-regulator
+ else:
+ properties:
+ charger:
+ properties:
+ compatible:
+ const: maxim,max77836-charger
+ extcon:
+ properties:
+ compatible:
+ const: maxim,max77836-muic
+ regulator:
+ properties:
+ compatible:
+ const: maxim,max77836-regulator
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ pmic@25 {
+ compatible = "maxim,max14577";
+ reg = <0x25>;
+ interrupt-parent = <&gpx1>;
+ interrupts = <5 IRQ_TYPE_LEVEL_LOW>;
+
+ extcon {
+ compatible = "maxim,max14577-muic";
+ };
+
+ regulators {
+ compatible = "maxim,max14577-regulator";
+
+ SAFEOUT {
+ regulator-name = "SAFEOUT";
+ };
+
+ CHARGER {
+ regulator-name = "CHARGER";
+ regulator-min-microamp = <90000>;
+ regulator-max-microamp = <950000>;
+ regulator-boot-on;
+ };
+ };
+
+ charger {
+ compatible = "maxim,max14577-charger";
+
+ maxim,constant-uvolt = <4350000>;
+ maxim,fast-charge-uamp = <450000>;
+ maxim,eoc-uamp = <50000>;
+ maxim,ovp-uvolt = <6500000>;
+ };
+ };
+ };
+
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ pmic@25 {
+ compatible = "maxim,max77836";
+ interrupt-parent = <&gpx1>;
+ interrupts = <5 IRQ_TYPE_NONE>;
+ reg = <0x25>;
+ wakeup-source;
+
+ extcon {
+ compatible = "maxim,max77836-muic";
+ };
+
+ regulators {
+ compatible = "maxim,max77836-regulator";
+
+ SAFEOUT {
+ regulator-name = "SAFEOUT";
+ };
+
+ CHARGER {
+ regulator-name = "CHARGER";
+ regulator-min-microamp = <45000>;
+ regulator-max-microamp = <475000>;
+ regulator-boot-on;
+ };
+
+ LDO1 {
+ regulator-name = "MOT_2.7V";
+ regulator-min-microvolt = <1100000>;
+ regulator-max-microvolt = <2700000>;
+ };
+
+ LDO2 {
+ regulator-name = "UNUSED_LDO2";
+ regulator-min-microvolt = <800000>;
+ regulator-max-microvolt = <3950000>;
+ };
+ };
+
+ charger {
+ compatible = "maxim,max77836-charger";
+
+ maxim,constant-uvolt = <4350000>;
+ maxim,fast-charge-uamp = <225000>;
+ maxim,eoc-uamp = <7500>;
+ maxim,ovp-uvolt = <6500000>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/mfd/maxim,max77802.yaml b/Documentation/devicetree/bindings/mfd/maxim,max77802.yaml
new file mode 100644
index 000000000000..baa1346ac5d5
--- /dev/null
+++ b/Documentation/devicetree/bindings/mfd/maxim,max77802.yaml
@@ -0,0 +1,194 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/mfd/maxim,max77802.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Maxim MAX77802 Power Management IC
+
+maintainers:
+ - Javier Martinez Canillas <javier@dowhile0.org>
+ - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+
+description: |
+ This is a part of device tree bindings for Maxim MAX77802 Power Management
+ Integrated Circuit (PMIC).
+
+ The Maxim MAX77802 is a Power Management IC which includes voltage and
+ current regulators (10 high efficiency Buck regulators and 32 Low-DropOut
+ (LDO)), RTC and clock outputs.
+
+ The MAX77802 provides two 32.768khz clock outputs that can be controlled
+ (gated/ungated) over I2C. The clock IDs are defined as preprocessor macros
+ in dt-bindings/clock/maxim,max77802.h.
+
+properties:
+ compatible:
+ const: maxim,max77802
+
+ '#clock-cells':
+ const: 1
+
+ interrupts:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+ regulators:
+ $ref: /schemas/regulator/maxim,max77802.yaml
+ description:
+ List of child nodes that specify the regulators.
+
+ inb1-supply:
+ description: Power supply for buck1
+ inb2-supply:
+ description: Power supply for buck2
+ inb3-supply:
+ description: Power supply for buck3
+ inb4-supply:
+ description: Power supply for buck4
+ inb5-supply:
+ description: Power supply for buck5
+ inb6-supply:
+ description: Power supply for buck6
+ inb7-supply:
+ description: Power supply for buck7
+ inb8-supply:
+ description: Power supply for buck8
+ inb9-supply:
+ description: Power supply for buck9
+ inb10-supply:
+ description: Power supply for buck10
+
+ inl1-supply:
+ description: Power supply for LDO8, LDO15
+ inl2-supply:
+ description: Power supply for LDO17, LDO27, LDO30, LDO35
+ inl3-supply:
+ description: Power supply for LDO3, LDO5, LDO7, LDO7
+ inl4-supply:
+ description: Power supply for LDO10, LDO11, LDO13, LDO14
+ inl5-supply:
+ description: Power supply for LDO9, LDO19
+ inl6-supply:
+ description: Power supply for LDO4, LDO21, LDO24, LDO33
+ inl7-supply:
+ description: Power supply for LDO18, LDO20, LDO28, LDO29
+ inl9-supply:
+ description: Power supply for LDO12, LDO23, LDO25, LDO26, LDO32, LDO34
+ inl10-supply:
+ description: Power supply for LDO1, LDO2
+
+ wakeup-source: true
+
+required:
+ - compatible
+ - '#clock-cells'
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/regulator/maxim,max77802.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ pmic@9 {
+ compatible = "maxim,max77802";
+ interrupt-parent = <&gpx3>;
+ interrupts = <1 IRQ_TYPE_NONE>;
+ pinctrl-names = "default";
+ pinctrl-0 = <&max77802_irq>, <&pmic_selb>,
+ <&pmic_dvs_1>, <&pmic_dvs_2>, <&pmic_dvs_3>;
+ wakeup-source;
+ reg = <0x9>;
+ #clock-cells = <1>;
+
+ inb1-supply = <&tps65090_dcdc2>;
+ inb2-supply = <&tps65090_dcdc1>;
+ inb3-supply = <&tps65090_dcdc2>;
+ inb4-supply = <&tps65090_dcdc2>;
+ inb5-supply = <&tps65090_dcdc1>;
+ inb6-supply = <&tps65090_dcdc2>;
+ inb7-supply = <&tps65090_dcdc1>;
+ inb8-supply = <&tps65090_dcdc1>;
+ inb9-supply = <&tps65090_dcdc1>;
+ inb10-supply = <&tps65090_dcdc1>;
+
+ inl1-supply = <&buck5_reg>;
+ inl2-supply = <&buck7_reg>;
+ inl3-supply = <&buck9_reg>;
+ inl4-supply = <&buck9_reg>;
+ inl5-supply = <&buck9_reg>;
+ inl6-supply = <&tps65090_dcdc2>;
+ inl7-supply = <&buck9_reg>;
+ inl9-supply = <&tps65090_dcdc2>;
+ inl10-supply = <&buck7_reg>;
+
+ regulators {
+ BUCK1 {
+ regulator-name = "vdd_mif";
+ regulator-min-microvolt = <800000>;
+ regulator-max-microvolt = <1300000>;
+ regulator-always-on;
+ regulator-boot-on;
+ regulator-ramp-delay = <12500>;
+ regulator-state-mem {
+ regulator-off-in-suspend;
+ };
+ };
+
+ BUCK2 {
+ regulator-name = "vdd_arm";
+ regulator-min-microvolt = <800000>;
+ regulator-max-microvolt = <1500000>;
+ regulator-always-on;
+ regulator-boot-on;
+ regulator-ramp-delay = <12500>;
+ regulator-coupled-with = <&buck3_reg>;
+ regulator-coupled-max-spread = <300000>;
+ regulator-state-mem {
+ regulator-off-in-suspend;
+ };
+ };
+
+ // ...
+
+ BUCK10 {
+ regulator-name = "vdd_1v8";
+ regulator-min-microvolt = <1800000>;
+ regulator-max-microvolt = <1800000>;
+ regulator-always-on;
+ regulator-boot-on;
+ regulator-state-mem {
+ regulator-on-in-suspend;
+ };
+ };
+
+ LDO1 {
+ regulator-name = "vdd_1v0";
+ regulator-min-microvolt = <1000000>;
+ regulator-max-microvolt = <1000000>;
+ regulator-always-on;
+ regulator-initial-mode = <MAX77802_OPMODE_NORMAL>;
+ regulator-state-mem {
+ regulator-on-in-suspend;
+ regulator-mode = <MAX77802_OPMODE_LP>;
+ };
+ };
+
+ // ...
+
+ LDO35 {
+ regulator-name = "ldo_35";
+ regulator-min-microvolt = <1200000>;
+ regulator-max-microvolt = <1200000>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/mfd/maxim,max77843.yaml b/Documentation/devicetree/bindings/mfd/maxim,max77843.yaml
new file mode 100644
index 000000000000..61a0f9dcb983
--- /dev/null
+++ b/Documentation/devicetree/bindings/mfd/maxim,max77843.yaml
@@ -0,0 +1,144 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/mfd/maxim,max77843.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Maxim MAX77843 MicroUSB and Companion Power Management IC
+
+maintainers:
+ - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+
+description: |
+ This is a part of device tree bindings for Maxim MAX77843 MicroUSB
+ Integrated Circuit (MUIC).
+
+ The Maxim MAX77843 is a MicroUSB and Companion Power Management IC which
+ includes voltage current regulators, charger, fuel-gauge, haptic motor driver
+ and MicroUSB management IC.
+
+properties:
+ compatible:
+ const: maxim,max77843
+
+ interrupts:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+ extcon:
+ $ref: /schemas/extcon/maxim,max77843.yaml
+
+ motor-driver:
+ type: object
+ properties:
+ compatible:
+ const: maxim,max77843-haptic
+
+ haptic-supply:
+ description: Power supply to the haptic motor
+
+ pwms:
+ maxItems: 1
+
+ required:
+ - compatible
+ - haptic-supply
+ - pwms
+
+ regulators:
+ $ref: /schemas/regulator/maxim,max77843.yaml
+
+required:
+ - compatible
+ - interrupts
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ pmic@66 {
+ compatible = "maxim,max77843";
+ interrupt-parent = <&gpa1>;
+ interrupts = <5 IRQ_TYPE_EDGE_FALLING>;
+ reg = <0x66>;
+
+ extcon {
+ compatible = "maxim,max77843-muic";
+
+ connector {
+ compatible = "samsung,usb-connector-11pin",
+ "usb-b-connector";
+ label = "micro-USB";
+ type = "micro";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ /*
+ * TODO: The DTS this is based on does not have
+ * port@0 which is a required property. The ports
+ * look incomplete and need fixing.
+ * Add a disabled port just to satisfy dtschema.
+ */
+ reg = <0>;
+ status = "disabled";
+ };
+
+ port@3 {
+ reg = <3>;
+ endpoint {
+ remote-endpoint = <&mhl_to_musb_con>;
+ };
+ };
+ };
+ };
+
+ ports {
+ port {
+ endpoint {
+ remote-endpoint = <&usb_to_muic>;
+ };
+ };
+ };
+ };
+
+ regulators {
+ compatible = "maxim,max77843-regulator";
+
+ SAFEOUT1 {
+ regulator-name = "SAFEOUT1";
+ regulator-min-microvolt = <3300000>;
+ regulator-max-microvolt = <4950000>;
+ };
+
+ SAFEOUT2 {
+ regulator-name = "SAFEOUT2";
+ regulator-min-microvolt = <3300000>;
+ regulator-max-microvolt = <4950000>;
+ };
+
+ CHARGER {
+ regulator-name = "CHARGER";
+ regulator-min-microamp = <100000>;
+ regulator-max-microamp = <3150000>;
+ };
+ };
+
+ motor-driver {
+ compatible = "maxim,max77843-haptic";
+ haptic-supply = <&ldo38_reg>;
+ pwms = <&pwm 0 33670 0>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/mtd/jedec,spi-nor.yaml b/Documentation/devicetree/bindings/mtd/jedec,spi-nor.yaml
index 39421f7233e4..4abfb4cfc157 100644
--- a/Documentation/devicetree/bindings/mtd/jedec,spi-nor.yaml
+++ b/Documentation/devicetree/bindings/mtd/jedec,spi-nor.yaml
@@ -47,7 +47,8 @@ properties:
identified by the JEDEC READ ID opcode (0x9F).
reg:
- maxItems: 1
+ minItems: 1
+ maxItems: 2
spi-max-frequency: true
spi-rx-bus-width: true
diff --git a/Documentation/devicetree/bindings/perf/marvell-cn10k-ddr.yaml b/Documentation/devicetree/bindings/perf/marvell-cn10k-ddr.yaml
new file mode 100644
index 000000000000..a18dd0a8c43a
--- /dev/null
+++ b/Documentation/devicetree/bindings/perf/marvell-cn10k-ddr.yaml
@@ -0,0 +1,37 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/perf/marvell-cn10k-ddr.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Marvell CN10K DDR performance monitor
+
+maintainers:
+ - Bharat Bhushan <bbhushan2@marvell.com>
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - marvell,cn10k-ddr-pmu
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ bus {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ pmu@87e1c0000000 {
+ compatible = "marvell,cn10k-ddr-pmu";
+ reg = <0x87e1 0xc0000000 0x0 0x10000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/power/supply/maxim,max14577.yaml b/Documentation/devicetree/bindings/power/supply/maxim,max14577.yaml
new file mode 100644
index 000000000000..3978b48299de
--- /dev/null
+++ b/Documentation/devicetree/bindings/power/supply/maxim,max14577.yaml
@@ -0,0 +1,84 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/power/supply/maxim,max14577.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Maxim MAX14577/MAX77836 MicroUSB and Companion Power Management IC Charger
+
+maintainers:
+ - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+
+description: |
+ This is a part of device tree bindings for Maxim MAX14577/MAX77836 MicroUSB
+ Integrated Circuit (MUIC).
+
+ See also Documentation/devicetree/bindings/mfd/maxim,max14577.yaml for
+ additional information and example.
+
+properties:
+ compatible:
+ enum:
+ - maxim,max14577-charger
+ - maxim,max77836-charger
+
+ maxim,constant-uvolt:
+ description:
+ Battery Constant Voltage in uV
+ $ref: /schemas/types.yaml#/definitions/uint32
+ minimum: 4000000
+ maximum: 4350000
+
+ maxim,eoc-uamp:
+ description: |
+ Current in uA for End-Of-Charge mode.
+ MAX14577: 50000-20000
+ MAX77836: 5000-100000
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+ maxim,fast-charge-uamp:
+ description: |
+ Current in uA for Fast Charge
+ MAX14577: 90000-950000
+ MAX77836: 45000-475000
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+ maxim,ovp-uvolt:
+ description:
+ OverVoltage Protection Threshold in uV; In an overvoltage condition, INT
+ asserts and charging stops.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [6000000, 6500000, 7000000, 7500000]
+
+required:
+ - compatible
+ - maxim,constant-uvolt
+ - maxim,eoc-uamp
+ - maxim,fast-charge-uamp
+ - maxim,ovp-uvolt
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: maxim,max14577-charger
+ then:
+ properties:
+ maxim,eoc-uamp:
+ minimum: 50000
+ maximum: 200000
+ maxim,fast-charge-uamp:
+ minimum: 90000
+ maximum: 950000
+ else:
+ # max77836
+ properties:
+ maxim,eoc-uamp:
+ minimum: 5000
+ maximum: 100000
+ maxim,fast-charge-uamp:
+ minimum: 45000
+ maximum: 475000
+
+additionalProperties: false
diff --git a/Documentation/devicetree/bindings/regulator/max77802.txt b/Documentation/devicetree/bindings/regulator/max77802.txt
deleted file mode 100644
index b82943d83677..000000000000
--- a/Documentation/devicetree/bindings/regulator/max77802.txt
+++ /dev/null
@@ -1,111 +0,0 @@
-Binding for Maxim MAX77802 regulators
-
-This is a part of device tree bindings of MAX77802 multi-function device.
-More information can be found in bindings/mfd/max77802.txt file.
-
-The MAX77802 PMIC has 10 high-efficiency Buck and 32 Low-dropout (LDO)
-regulators that can be controlled over I2C.
-
-Following properties should be present in main device node of the MFD chip.
-
-Optional properties:
-- inb1-supply: The input supply for BUCK1
-- inb2-supply: The input supply for BUCK2
-- inb3-supply: The input supply for BUCK3
-- inb4-supply: The input supply for BUCK4
-- inb5-supply: The input supply for BUCK5
-- inb6-supply: The input supply for BUCK6
-- inb7-supply: The input supply for BUCK7
-- inb8-supply: The input supply for BUCK8
-- inb9-supply: The input supply for BUCK9
-- inb10-supply: The input supply for BUCK10
-- inl1-supply: The input supply for LDO8 and LDO15
-- inl2-supply: The input supply for LDO17, LDO27, LDO30 and LDO35
-- inl3-supply: The input supply for LDO3, LDO5, LDO6 and LDO7
-- inl4-supply: The input supply for LDO10, LDO11, LDO13 and LDO14
-- inl5-supply: The input supply for LDO9 and LDO19
-- inl6-supply: The input supply for LDO4, LDO21, LDO24 and LDO33
-- inl7-supply: The input supply for LDO18, LDO20, LDO28 and LDO29
-- inl9-supply: The input supply for LDO12, LDO23, LDO25, LDO26, LDO32 and LDO34
-- inl10-supply: The input supply for LDO1 and LDO2
-
-Optional nodes:
-- regulators : The regulators of max77802 have to be instantiated
- under subnode named "regulators" using the following format.
-
- regulator-name {
- standard regulator constraints....
- };
- refer Documentation/devicetree/bindings/regulator/regulator.txt
-
-The regulator node name should be initialized with a string to get matched
-with their hardware counterparts as follow. The valid names are:
-
- -LDOn : for LDOs, where n can lie in ranges 1-15, 17-21, 23-30
- and 32-35.
- example: LDO1, LDO2, LDO35.
- -BUCKn : for BUCKs, where n can lie in range 1 to 10.
- example: BUCK1, BUCK5, BUCK10.
-
-The max77802 regulator supports two different operating modes: Normal and Low
-Power Mode. Some regulators support the modes to be changed at startup or by
-the consumers during normal operation while others only support to change the
-mode during system suspend. The standard regulator suspend states binding can
-be used to configure the regulator operating mode.
-
-The regulators that support the standard "regulator-initial-mode" property,
-changing their mode during normal operation are: LDOs 1, 3, 20 and 21.
-
-The possible values for "regulator-initial-mode" and "regulator-mode" are:
- 1: Normal regulator voltage output mode.
- 3: Low Power which reduces the quiescent current down to only 1uA
-
-The valid modes list is defined in the dt-bindings/regulator/maxim,max77802.h
-header and can be included by device tree source files.
-
-The standard "regulator-mode" property can only be used for regulators that
-support changing their mode to Low Power Mode during suspend. These regulators
-are: BUCKs 2-4 and LDOs 1-35. Also, it only takes effect if the regulator has
-been enabled for the given suspend state using "regulator-on-in-suspend" and
-has not been disabled for that state using "regulator-off-in-suspend".
-
-Example:
-
- max77802@9 {
- compatible = "maxim,max77802";
- interrupt-parent = <&wakeup_eint>;
- interrupts = <26 0>;
- reg = <0x09>;
- #address-cells = <1>;
- #size-cells = <0>;
-
- inb1-supply = <&parent_reg>;
-
- regulators {
- ldo1_reg: LDO1 {
- regulator-name = "vdd_1v0";
- regulator-min-microvolt = <1000000>;
- regulator-max-microvolt = <1000000>;
- regulator-always-on;
- regulator-initial-mode = <MAX77802_OPMODE_LP>;
- };
-
- ldo11_reg: LDO11 {
- regulator-name = "vdd_ldo11";
- regulator-min-microvolt = <1900000>;
- regulator-max-microvolt = <1900000>;
- regulator-always-on;
- regulator-state-mem {
- regulator-on-in-suspend;
- regulator-mode = <MAX77802_OPMODE_LP>;
- };
- };
-
- buck1_reg: BUCK1 {
- regulator-name = "vdd_mif";
- regulator-min-microvolt = <950000>;
- regulator-max-microvolt = <1300000>;
- regulator-always-on;
- regulator-boot-on;
- };
- };
diff --git a/Documentation/devicetree/bindings/regulator/maxim,max14577.yaml b/Documentation/devicetree/bindings/regulator/maxim,max14577.yaml
new file mode 100644
index 000000000000..16f01886a601
--- /dev/null
+++ b/Documentation/devicetree/bindings/regulator/maxim,max14577.yaml
@@ -0,0 +1,78 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/regulator/maxim,max14577.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Maxim MAX14577/MAX77836 MicroUSB and Companion Power Management IC regulators
+
+maintainers:
+ - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+
+description: |
+ This is a part of device tree bindings for Maxim MAX14577/MAX77836 MicroUSB
+ Integrated Circuit (MUIC).
+
+ See also Documentation/devicetree/bindings/mfd/maxim,max14577.yaml for
+ additional information and example.
+
+properties:
+ compatible:
+ enum:
+ - maxim,max14577-regulator
+ - maxim,max77836-regulator
+
+ CHARGER:
+ type: object
+ $ref: regulator.yaml#
+ unevaluatedProperties: false
+ description: |
+ Current regulator.
+
+ properties:
+ regulator-min-microvolt: false
+ regulator-max-microvolt: false
+
+ SAFEOUT:
+ type: object
+ $ref: regulator.yaml#
+ unevaluatedProperties: false
+ description: |
+ Safeout LDO regulator (fixed voltage).
+
+ properties:
+ regulator-min-microamp: false
+ regulator-max-microamp: false
+ regulator-min-microvolt:
+ const: 4900000
+ regulator-max-microvolt:
+ const: 4900000
+
+patternProperties:
+ "^LDO[12]$":
+ type: object
+ $ref: regulator.yaml#
+ unevaluatedProperties: false
+ description: |
+ Current regulator.
+
+ properties:
+ regulator-min-microamp: false
+ regulator-max-microamp: false
+ regulator-min-microvolt:
+ minimum: 800000
+ regulator-max-microvolt:
+ maximum: 3950000
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: maxim,max14577-regulator
+ then:
+ properties:
+ LDO1: false
+ LDO2: false
+
+additionalProperties: false
diff --git a/Documentation/devicetree/bindings/regulator/maxim,max77802.yaml b/Documentation/devicetree/bindings/regulator/maxim,max77802.yaml
new file mode 100644
index 000000000000..f2b4dd15a0f3
--- /dev/null
+++ b/Documentation/devicetree/bindings/regulator/maxim,max77802.yaml
@@ -0,0 +1,85 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/regulator/maxim,max77802.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Maxim MAX77802 Power Management IC regulators
+
+maintainers:
+ - Javier Martinez Canillas <javier@dowhile0.org>
+ - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+
+description: |
+ This is a part of device tree bindings for Maxim MAX77802 Power Management
+ Integrated Circuit (PMIC).
+
+ The Maxim MAX77686 provides 10 high-efficiency Buck and 32 Low-DropOut (LDO)
+ regulators.
+
+ See also Documentation/devicetree/bindings/mfd/maxim,max77802.yaml for
+ additional information and example.
+
+ Certain regulators support "regulator-initial-mode" and "regulator-mode".
+ The valid modes list is defined in the dt-bindings/regulator/maxim,max77802.h
+ and their meaning is::
+ 1 - Normal regulator voltage output mode.
+ 3 - Low Power which reduces the quiescent current down to only 1uA
+
+ The standard "regulator-mode" property can only be used for regulators that
+ support changing their mode to Low Power Mode during suspend. These
+ regulators are:: bucks 2-4 and LDOs 1-35. Also, it only takes effect if the
+ regulator has been enabled for the given suspend state using
+ "regulator-on-in-suspend" and has not been disabled for that state using
+ "regulator-off-in-suspend".
+
+patternProperties:
+ # LDO1, LDO3, LDO20, LDO21
+ "^LDO([13]|2[01])$":
+ type: object
+ $ref: regulator.yaml#
+ unevaluatedProperties: false
+ description:
+ LDOs supporting the regulator-initial-mode property and changing their
+ mode during normal operation.
+
+ # LDO2, LDO4-15, LDO17-19, LDO23-30, LDO32-35
+ "^LDO([24-9]|1[0-5789]|2[3-9]|3[02345])$":
+ type: object
+ $ref: regulator.yaml#
+ unevaluatedProperties: false
+ description:
+ LDOs supporting the regulator-mode property (changing mode to Low Power
+ Mode during suspend).
+
+ properties:
+ regulator-initial-mode: false
+
+ # buck2-4
+ "^BUCK[2-4]$":
+ type: object
+ $ref: regulator.yaml#
+ unevaluatedProperties: false
+ description:
+ bucks supporting the regulator-mode property (changing mode to Low Power
+ Mode during suspend).
+
+ properties:
+ regulator-initial-mode: false
+
+ # buck1, buck5-10
+ "^BUCK([15-9]|10)$":
+ type: object
+ $ref: regulator.yaml#
+ unevaluatedProperties: false
+
+ properties:
+ regulator-initial-mode: false
+
+ patternProperties:
+ regulator-state-(standby|mem|disk):
+ type: object
+ properties:
+ regulator-mode: false
+
+additionalProperties: false
diff --git a/Documentation/devicetree/bindings/regulator/maxim,max77843.yaml b/Documentation/devicetree/bindings/regulator/maxim,max77843.yaml
new file mode 100644
index 000000000000..a963025e96c1
--- /dev/null
+++ b/Documentation/devicetree/bindings/regulator/maxim,max77843.yaml
@@ -0,0 +1,65 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/regulator/maxim,max77843.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Maxim MAX77843 MicroUSB and Companion Power Management IC regulators
+
+maintainers:
+ - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+
+description: |
+ This is a part of device tree bindings for Maxim MAX77843 MicroUSB Integrated
+ Circuit (MUIC).
+
+ See also Documentation/devicetree/bindings/mfd/maxim,max77843.yaml for
+ additional information and example.
+
+properties:
+ compatible:
+ const: maxim,max77843-regulator
+
+ CHARGER:
+ type: object
+ $ref: regulator.yaml#
+ additionalProperties: false
+ description: |
+ Current regulator.
+
+ properties:
+ regulator-name: true
+ regulator-always-on: true
+ regulator-boot-on: true
+ regulator-min-microamp:
+ minimum: 100000
+ regulator-max-microamp:
+ maximum: 3150000
+
+ required:
+ - regulator-name
+
+patternProperties:
+ "^SAFEOUT[12]$":
+ type: object
+ $ref: regulator.yaml#
+ additionalProperties: false
+ description: |
+ Safeout LDO regulator.
+
+ properties:
+ regulator-name: true
+ regulator-always-on: true
+ regulator-boot-on: true
+ regulator-min-microvolt:
+ minimum: 3300000
+ regulator-max-microvolt:
+ maximum: 4950000
+
+ required:
+ - regulator-name
+
+required:
+ - compatible
+
+additionalProperties: false
diff --git a/Documentation/devicetree/bindings/regulator/maxim,max8973.yaml b/Documentation/devicetree/bindings/regulator/maxim,max8973.yaml
index 35c53e27f78c..5898dcf10f06 100644
--- a/Documentation/devicetree/bindings/regulator/maxim,max8973.yaml
+++ b/Documentation/devicetree/bindings/regulator/maxim,max8973.yaml
@@ -113,7 +113,7 @@ examples:
};
- |
- #include <dt-bindings/gpio/tegra-gpio.h>
+ #include <dt-bindings/gpio/gpio.h>
#include <dt-bindings/interrupt-controller/irq.h>
i2c {
@@ -123,8 +123,7 @@ examples:
regulator@1b {
compatible = "maxim,max77621";
reg = <0x1b>;
- interrupt-parent = <&gpio>;
- interrupts = <TEGRA_GPIO(Y, 1) IRQ_TYPE_LEVEL_LOW>;
+ interrupts = <1 IRQ_TYPE_LEVEL_LOW>;
regulator-always-on;
regulator-boot-on;
diff --git a/Documentation/devicetree/bindings/regulator/pfuze100.yaml b/Documentation/devicetree/bindings/regulator/pfuze100.yaml
index f578e72778a7..a26bbd68b729 100644
--- a/Documentation/devicetree/bindings/regulator/pfuze100.yaml
+++ b/Documentation/devicetree/bindings/regulator/pfuze100.yaml
@@ -70,7 +70,11 @@ properties:
$ref: "regulator.yaml#"
type: object
- "^(vsnvs|vref|vrefddr|swbst|coin)$":
+ "^vldo[1-4]$":
+ $ref: "regulator.yaml#"
+ type: object
+
+ "^(vsnvs|vref|vrefddr|swbst|coin|v33|vccsd)$":
$ref: "regulator.yaml#"
type: object
diff --git a/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.yaml b/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.yaml
index 5c73d3f639c7..e28ee9e46788 100644
--- a/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.yaml
+++ b/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.yaml
@@ -48,6 +48,7 @@ description: |
For PMI8998, bob
For PMR735A, smps1 - smps3, ldo1 - ldo7
For PMX55, smps1 - smps7, ldo1 - ldo16
+ For PMX65, smps1 - smps8, ldo1 - ldo21
properties:
compatible:
@@ -70,6 +71,7 @@ properties:
- qcom,pmm8155au-rpmh-regulators
- qcom,pmr735a-rpmh-regulators
- qcom,pmx55-rpmh-regulators
+ - qcom,pmx65-rpmh-regulators
qcom,pmic-id:
description: |
diff --git a/Documentation/devicetree/bindings/regulator/richtek,rt5190a-regulator.yaml b/Documentation/devicetree/bindings/regulator/richtek,rt5190a-regulator.yaml
new file mode 100644
index 000000000000..28725c5467fc
--- /dev/null
+++ b/Documentation/devicetree/bindings/regulator/richtek,rt5190a-regulator.yaml
@@ -0,0 +1,141 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/regulator/richtek,rt5190a-regulator.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Richtek RT5190A PMIC Regulator
+
+maintainers:
+ - ChiYuan Huang <cy_huang@richtek.com>
+
+description: |
+ The RT5190A integrates 1 channel buck controller, 3 channels high efficiency
+ synchronous buck converters, 1 LDO, I2C control interface and peripherial
+ logical control.
+
+ It also supports mute AC OFF depop sound and quick setting storage while
+ input power is removed.
+
+properties:
+ compatible:
+ enum:
+ - richtek,rt5190a
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ vin2-supply:
+ description: phandle to buck2 input voltage.
+
+ vin3-supply:
+ description: phandle to buck3 input voltage.
+
+ vin4-supply:
+ description: phandle to buck4 input voltage.
+
+ vinldo-supply:
+ description: phandle to ldo input voltage
+
+ richtek,mute-enable:
+ description: |
+ The mute function uses 'mutein', 'muteout', and 'vdet' pins as the control
+ signal. When enabled, The normal behavior is to bypass the 'mutein' signal
+ 'muteout'. But if the power source removal is detected from 'vdet',
+ whatever the 'mutein' signal is, it will pull down the 'muteout' to force
+ speakers mute. this function is commonly used to prevent the speaker pop
+ noise during AC power turned off in the modern TV system design.
+ type: boolean
+
+ regulators:
+ type: object
+
+ patternProperties:
+ "^buck[1-4]$|^ldo$":
+ type: object
+ $ref: regulator.yaml#
+ description: |
+ regulator description for buck1 and buck4.
+
+ properties:
+ regulator-allowed-modes:
+ description: |
+ buck operating mode, only buck1/4 support mode operating.
+ 0: auto mode
+ 1: force pwm mode
+ items:
+ enum: [0, 1]
+
+ richtek,latchup-enable:
+ type: boolean
+ description: |
+ If specified, undervolt protection mode changes from the default
+ hiccup to latchup.
+
+ unevaluatedProperties: false
+
+ additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - regulators
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/regulator/richtek,rt5190a-regulator.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ pmic@64 {
+ compatible = "richtek,rt5190a";
+ reg = <0x64>;
+ interrupts-extended = <&gpio26 0 IRQ_TYPE_LEVEL_LOW>;
+ vin2-supply = <&rt5190_buck1>;
+ vin3-supply = <&rt5190_buck1>;
+ vin4-supply = <&rt5190_buck1>;
+
+ regulators {
+ rt5190_buck1: buck1 {
+ regulator-name = "rt5190a-buck1";
+ regulator-min-microvolt = <5090000>;
+ regulator-max-microvolt = <5090000>;
+ regulator-allowed-modes = <RT5190A_OPMODE_AUTO RT5190A_OPMODE_FPWM>;
+ regulator-boot-on;
+ };
+ buck2 {
+ regulator-name = "rt5190a-buck2";
+ regulator-min-microvolt = <600000>;
+ regulator-max-microvolt = <1400000>;
+ regulator-boot-on;
+ };
+ buck3 {
+ regulator-name = "rt5190a-buck3";
+ regulator-min-microvolt = <600000>;
+ regulator-max-microvolt = <1400000>;
+ regulator-boot-on;
+ };
+ buck4 {
+ regulator-name = "rt5190a-buck4";
+ regulator-min-microvolt = <850000>;
+ regulator-max-microvolt = <850000>;
+ regulator-allowed-modes = <RT5190A_OPMODE_AUTO RT5190A_OPMODE_FPWM>;
+ regulator-boot-on;
+ };
+ ldo {
+ regulator-name = "rt5190a-ldo";
+ regulator-min-microvolt = <1200000>;
+ regulator-max-microvolt = <1200000>;
+ regulator-boot-on;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/regulator/ti,tps62360.yaml b/Documentation/devicetree/bindings/regulator/ti,tps62360.yaml
new file mode 100644
index 000000000000..12aeddedde05
--- /dev/null
+++ b/Documentation/devicetree/bindings/regulator/ti,tps62360.yaml
@@ -0,0 +1,98 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/regulator/ti,tps62360.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Texas Instruments TPS6236x Voltage Regulators
+
+maintainers:
+ - Laxman Dewangan <ldewangan@nvidia.com>
+
+description: |
+ The TPS6236x are a family of step down dc-dc converter with
+ an input voltage range of 2.5V to 5.5V. The devices provide
+ up to 3A peak load current, and an output voltage range of
+ 0.77V to 1.4V (TPS62360/62) and 0.5V to 1.77V (TPS62361B/63).
+
+ Datasheet is available at:
+ https://www.ti.com/lit/gpn/tps62360
+
+allOf:
+ - $ref: "regulator.yaml#"
+
+properties:
+ compatible:
+ enum:
+ - ti,tps62360
+ - ti,tps62361
+ - ti,tps62362
+ - ti,tps62363
+
+ reg:
+ maxItems: 1
+
+ ti,vsel0-gpio:
+ description: |
+ GPIO for controlling VSEL0 line. If this property
+ is missing, then assume that there is no GPIO for
+ VSEL0 control.
+ maxItems: 1
+
+ ti,vsel1-gpio:
+ description: |
+ GPIO for controlling VSEL1 line. If this property
+ is missing, then assume that there is no GPIO for
+ VSEL1 control.
+ maxItems: 1
+
+ ti,enable-vout-discharge:
+ description: Enable output discharge.
+ type: boolean
+
+ ti,enable-pull-down:
+ description: Enable pull down.
+ type: boolean
+
+ ti,vsel0-state-high:
+ description: |
+ Initial state of VSEL0 input is high. If this property
+ is missing, then assume the state as low.
+ type: boolean
+
+ ti,vsel1-state-high:
+ description: |
+ Initial state of VSEL1 input is high. If this property
+ is missing, then assume the state as low.
+ type: boolean
+
+required:
+ - compatible
+ - reg
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ regulator@60 {
+ compatible = "ti,tps62361";
+ reg = <0x60>;
+ regulator-name = "tps62361-vout";
+ regulator-min-microvolt = <500000>;
+ regulator-max-microvolt = <1500000>;
+ regulator-boot-on;
+ ti,vsel0-gpio = <&gpio1 16 GPIO_ACTIVE_HIGH>;
+ ti,vsel1-gpio = <&gpio1 17 GPIO_ACTIVE_HIGH>;
+ ti,vsel0-state-high;
+ ti,vsel1-state-high;
+ ti,enable-pull-down;
+ ti,enable-vout-discharge;
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/regulator/ti,tps62864.yaml b/Documentation/devicetree/bindings/regulator/ti,tps62864.yaml
new file mode 100644
index 000000000000..0f29c75f42ea
--- /dev/null
+++ b/Documentation/devicetree/bindings/regulator/ti,tps62864.yaml
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/regulator/ti,tps62864.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TI TPS62864/TPS6286/TPS62868/TPS62869 voltage regulator
+
+maintainers:
+ - Vincent Whitchurch <vincent.whitchurch@axis.com>
+
+properties:
+ compatible:
+ enum:
+ - ti,tps62864
+ - ti,tps62866
+ - ti,tps62868
+ - ti,tps62869
+
+ reg:
+ maxItems: 1
+
+ regulators:
+ type: object
+
+ properties:
+ "SW":
+ type: object
+ $ref: regulator.yaml#
+ unevaluatedProperties: false
+
+ additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - regulators
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/regulator/ti,tps62864.h>
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ regulator@48 {
+ compatible = "ti,tps62864";
+ reg = <0x48>;
+
+ regulators {
+ SW {
+ regulator-name = "+0.85V";
+ regulator-min-microvolt = <800000>;
+ regulator-max-microvolt = <890000>;
+ regulator-initial-mode = <TPS62864_MODE_FPWM>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/regulator/tps62360-regulator.txt b/Documentation/devicetree/bindings/regulator/tps62360-regulator.txt
deleted file mode 100644
index 1b20c3dbcdb8..000000000000
--- a/Documentation/devicetree/bindings/regulator/tps62360-regulator.txt
+++ /dev/null
@@ -1,44 +0,0 @@
-TPS62360 Voltage regulators
-
-Required properties:
-- compatible: Must be one of the following.
- "ti,tps62360"
- "ti,tps62361",
- "ti,tps62362",
- "ti,tps62363",
-- reg: I2C slave address
-
-Optional properties:
-- ti,enable-vout-discharge: Enable output discharge. This is boolean value.
-- ti,enable-pull-down: Enable pull down. This is boolean value.
-- ti,vsel0-gpio: GPIO for controlling VSEL0 line.
- If this property is missing, then assume that there is no GPIO
- for vsel0 control.
-- ti,vsel1-gpio: Gpio for controlling VSEL1 line.
- If this property is missing, then assume that there is no GPIO
- for vsel1 control.
-- ti,vsel0-state-high: Initial state of vsel0 input is high.
- If this property is missing, then assume the state as low (0).
-- ti,vsel1-state-high: Initial state of vsel1 input is high.
- If this property is missing, then assume the state as low (0).
-
-Any property defined as part of the core regulator binding, defined in
-regulator.txt, can also be used.
-
-Example:
-
- abc: tps62360 {
- compatible = "ti,tps62361";
- reg = <0x60>;
- regulator-name = "tps62361-vout";
- regulator-min-microvolt = <500000>;
- regulator-max-microvolt = <1500000>;
- regulator-boot-on
- ti,vsel0-gpio = <&gpio1 16 0>;
- ti,vsel1-gpio = <&gpio1 17 0>;
- ti,vsel0-state-high;
- ti,vsel1-state-high;
- ti,enable-pull-down;
- ti,enable-force-pwm;
- ti,enable-vout-discharge;
- };
diff --git a/Documentation/devicetree/bindings/soc/samsung/exynos-usi.yaml b/Documentation/devicetree/bindings/soc/samsung/exynos-usi.yaml
index 273f2d95a043..e72b6a3fae99 100644
--- a/Documentation/devicetree/bindings/soc/samsung/exynos-usi.yaml
+++ b/Documentation/devicetree/bindings/soc/samsung/exynos-usi.yaml
@@ -22,7 +22,7 @@ description: |
[1] Documentation/devicetree/bindings/serial/samsung_uart.yaml
[2] Documentation/devicetree/bindings/i2c/i2c-exynos5.txt
- [3] Documentation/devicetree/bindings/spi/spi-samsung.txt
+ [3] Documentation/devicetree/bindings/spi/samsung,spi.yaml
properties:
$nodename:
diff --git a/Documentation/devicetree/bindings/spi/mediatek,spi-mt65xx.yaml b/Documentation/devicetree/bindings/spi/mediatek,spi-mt65xx.yaml
new file mode 100644
index 000000000000..818130b11bb9
--- /dev/null
+++ b/Documentation/devicetree/bindings/spi/mediatek,spi-mt65xx.yaml
@@ -0,0 +1,107 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/spi/mediatek,spi-mt65xx.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: SPI Bus controller for MediaTek ARM SoCs
+
+maintainers:
+ - Leilk Liu <leilk.liu@mediatek.com>
+
+allOf:
+ - $ref: "/schemas/spi/spi-controller.yaml#"
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - mediatek,mt7629-spi
+ - const: mediatek,mt7622-spi
+ - items:
+ - enum:
+ - mediatek,mt8516-spi
+ - const: mediatek,mt2712-spi
+ - items:
+ - enum:
+ - mediatek,mt6779-spi
+ - mediatek,mt8186-spi
+ - mediatek,mt8192-spi
+ - mediatek,mt8195-spi
+ - const: mediatek,mt6765-spi
+ - items:
+ - enum:
+ - mediatek,mt7986-spi-ipm
+ - const: mediatek,spi-ipm
+ - items:
+ - enum:
+ - mediatek,mt2701-spi
+ - mediatek,mt2712-spi
+ - mediatek,mt6589-spi
+ - mediatek,mt6765-spi
+ - mediatek,mt6893-spi
+ - mediatek,mt7622-spi
+ - mediatek,mt8135-spi
+ - mediatek,mt8173-spi
+ - mediatek,mt8183-spi
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: clock used for the parent clock
+ - description: clock used for the muxes clock
+ - description: clock used for the clock gate
+
+ clock-names:
+ items:
+ - const: parent-clk
+ - const: sel-clk
+ - const: spi-clk
+
+ mediatek,pad-select:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ minItems: 1
+ maxItems: 4
+ items:
+ enum: [0, 1, 2, 3]
+ description:
+ specify which pins group(ck/mi/mo/cs) spi controller used.
+ This is an array.
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - '#address-cells'
+ - '#size-cells'
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/mt8173-clk.h>
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ spi@1100a000 {
+ compatible = "mediatek,mt8173-spi";
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <0x1100a000 0x1000>;
+ interrupts = <GIC_SPI 110 IRQ_TYPE_LEVEL_LOW>;
+ clocks = <&topckgen CLK_TOP_SYSPLL3_D2>,
+ <&topckgen CLK_TOP_SPI_SEL>,
+ <&pericfg CLK_PERI_SPI0>;
+ clock-names = "parent-clk", "sel-clk", "spi-clk";
+ cs-gpios = <&pio 105 GPIO_ACTIVE_LOW>, <&pio 72 GPIO_ACTIVE_LOW>;
+ mediatek,pad-select = <1>, <0>;
+ };
diff --git a/Documentation/devicetree/bindings/spi/mediatek,spi-mtk-nor.yaml b/Documentation/devicetree/bindings/spi/mediatek,spi-mtk-nor.yaml
index 4e4694e3d539..be3cc7faed53 100644
--- a/Documentation/devicetree/bindings/spi/mediatek,spi-mtk-nor.yaml
+++ b/Documentation/devicetree/bindings/spi/mediatek,spi-mtk-nor.yaml
@@ -30,6 +30,7 @@ properties:
- mediatek,mt7622-nor
- mediatek,mt7623-nor
- mediatek,mt7629-nor
+ - mediatek,mt8186-nor
- mediatek,mt8192-nor
- mediatek,mt8195-nor
- enum:
@@ -49,6 +50,8 @@ properties:
- description: clock used for controller
- description: clock used for nor dma bus. this depends on hardware
design, so this is optional.
+ - description: clock used for controller axi slave bus.
+ this depends on hardware design, so it is optional.
clock-names:
minItems: 2
@@ -56,6 +59,7 @@ properties:
- const: spi
- const: sf
- const: axi
+ - const: axi_s
required:
- compatible
diff --git a/Documentation/devicetree/bindings/spi/mediatek,spi-slave-mt27xx.yaml b/Documentation/devicetree/bindings/spi/mediatek,spi-slave-mt27xx.yaml
new file mode 100644
index 000000000000..7977799a8ee1
--- /dev/null
+++ b/Documentation/devicetree/bindings/spi/mediatek,spi-slave-mt27xx.yaml
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/spi/mediatek,spi-slave-mt27xx.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: SPI Slave controller for MediaTek ARM SoCs
+
+maintainers:
+ - Leilk Liu <leilk.liu@mediatek.com>
+
+allOf:
+ - $ref: "/schemas/spi/spi-controller.yaml#"
+
+properties:
+ compatible:
+ enum:
+ - mediatek,mt2712-spi-slave
+ - mediatek,mt8195-spi-slave
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: spi
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/mt2712-clk.h>
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ spi@10013000 {
+ compatible = "mediatek,mt2712-spi-slave";
+ reg = <0x10013000 0x100>;
+ interrupts = <GIC_SPI 283 IRQ_TYPE_LEVEL_LOW>;
+ clocks = <&infracfg CLK_INFRA_AO_SPI1>;
+ clock-names = "spi";
+ assigned-clocks = <&topckgen CLK_TOP_SPISLV_SEL>;
+ assigned-clock-parents = <&topckgen CLK_TOP_UNIVPLL1_D2>;
+ };
diff --git a/Documentation/devicetree/bindings/spi/microchip,mpfs-spi.yaml b/Documentation/devicetree/bindings/spi/microchip,mpfs-spi.yaml
new file mode 100644
index 000000000000..ece261b8e963
--- /dev/null
+++ b/Documentation/devicetree/bindings/spi/microchip,mpfs-spi.yaml
@@ -0,0 +1,52 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/spi/microchip,mpfs-spi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Microchip MPFS {Q,}SPI Controller Device Tree Bindings
+
+maintainers:
+ - Conor Dooley <conor.dooley@microchip.com>
+
+allOf:
+ - $ref: spi-controller.yaml#
+
+properties:
+ compatible:
+ enum:
+ - microchip,mpfs-spi
+ - microchip,mpfs-qspi
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clock-names:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include "dt-bindings/clock/microchip,mpfs-clock.h"
+ spi@20108000 {
+ compatible = "microchip,mpfs-spi";
+ reg = <0x20108000 0x1000>;
+ clocks = <&clkcfg CLK_SPI0>;
+ interrupt-parent = <&plic>;
+ interrupts = <54>;
+ spi-max-frequency = <25000000>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/spi/nvidia,tegra210-quad.yaml b/Documentation/devicetree/bindings/spi/nvidia,tegra210-quad.yaml
index 35a8045b2c70..0296edd1de22 100644
--- a/Documentation/devicetree/bindings/spi/nvidia,tegra210-quad.yaml
+++ b/Documentation/devicetree/bindings/spi/nvidia,tegra210-quad.yaml
@@ -19,6 +19,7 @@ properties:
- nvidia,tegra210-qspi
- nvidia,tegra186-qspi
- nvidia,tegra194-qspi
+ - nvidia,tegra234-qspi
reg:
maxItems: 1
@@ -106,7 +107,7 @@ examples:
dma-names = "rx", "tx";
flash@0 {
- compatible = "spi-nor";
+ compatible = "jedec,spi-nor";
reg = <0>;
spi-max-frequency = <104000000>;
spi-tx-bus-width = <2>;
diff --git a/Documentation/devicetree/bindings/spi/renesas,rspi.yaml b/Documentation/devicetree/bindings/spi/renesas,rspi.yaml
index 76e6d9e52fc7..2c3c6bd6ec45 100644
--- a/Documentation/devicetree/bindings/spi/renesas,rspi.yaml
+++ b/Documentation/devicetree/bindings/spi/renesas,rspi.yaml
@@ -22,7 +22,8 @@ properties:
- renesas,rspi-r7s72100 # RZ/A1H
- renesas,rspi-r7s9210 # RZ/A2
- renesas,r9a07g044-rspi # RZ/G2{L,LC}
- - const: renesas,rspi-rz # RZ/A and RZ/G2{L,LC}
+ - renesas,r9a07g054-rspi # RZ/V2L
+ - const: renesas,rspi-rz
- items:
- enum:
@@ -124,6 +125,7 @@ allOf:
enum:
- renesas,qspi
- renesas,r9a07g044-rspi
+ - renesas,r9a07g054-rspi
then:
required:
- resets
diff --git a/Documentation/devicetree/bindings/spi/samsung,spi-peripheral-props.yaml b/Documentation/devicetree/bindings/spi/samsung,spi-peripheral-props.yaml
new file mode 100644
index 000000000000..f0db3fb3d688
--- /dev/null
+++ b/Documentation/devicetree/bindings/spi/samsung,spi-peripheral-props.yaml
@@ -0,0 +1,33 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/spi/samsung,spi-peripheral-props.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Peripheral-specific properties for Samsung S3C/S5P/Exynos SoC SPI controller
+
+maintainers:
+ - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+
+description:
+ See spi-peripheral-props.yaml for more info.
+
+properties:
+ controller-data:
+ type: object
+ additionalProperties: false
+
+ properties:
+ samsung,spi-feedback-delay:
+ description: |
+ The sampling phase shift to be applied on the miso line (to account
+ for any lag in the miso line). Valid values:
+ - 0: No phase shift.
+ - 1: 90 degree phase shift sampling.
+ - 2: 180 degree phase shift sampling.
+ - 3: 270 degree phase shift sampling.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [0, 1, 2, 3]
+ default: 0
+
+additionalProperties: true
diff --git a/Documentation/devicetree/bindings/spi/samsung,spi.yaml b/Documentation/devicetree/bindings/spi/samsung,spi.yaml
new file mode 100644
index 000000000000..bf9a76d931d2
--- /dev/null
+++ b/Documentation/devicetree/bindings/spi/samsung,spi.yaml
@@ -0,0 +1,188 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/spi/samsung,spi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung S3C/S5P/Exynos SoC SPI controller
+
+maintainers:
+ - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+
+description:
+ All the SPI controller nodes should be represented in the aliases node using
+ the following format 'spi{n}' where n is a unique number for the alias.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - samsung,s3c2443-spi # for S3C2443, S3C2416 and S3C2450
+ - samsung,s3c6410-spi
+ - samsung,s5pv210-spi # for S5PV210 and S5PC110
+ - samsung,exynos5433-spi
+ - tesla,fsd-spi
+ - const: samsung,exynos7-spi
+ deprecated: true
+
+ clocks:
+ minItems: 2
+ maxItems: 3
+
+ clock-names:
+ minItems: 2
+ maxItems: 3
+
+ cs-gpios: true
+
+ dmas:
+ minItems: 2
+ maxItems: 2
+
+ dma-names:
+ items:
+ - const: tx
+ - const: rx
+
+ interrupts:
+ maxItems: 1
+
+ no-cs-readback:
+ description:
+ The CS line is disconnected, therefore the device should not operate
+ based on CS signalling.
+ type: boolean
+
+ num-cs:
+ minimum: 1
+ maximum: 4
+ default: 1
+
+ samsung,spi-src-clk:
+ description:
+ If the spi controller includes a internal clock mux to select the clock
+ source for the spi bus clock, this property can be used to indicate the
+ clock to be used for driving the spi bus clock. If not specified, the
+ clock number 0 is used as default.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ default: 0
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - dmas
+ - dma-names
+ - interrupts
+ - reg
+
+allOf:
+ - $ref: spi-controller.yaml#
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-spi
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ clock-names:
+ items:
+ - const: spi
+ - enum:
+ - spi_busclk0
+ - spi_busclk1
+ - spi_busclk2
+ - spi_busclk3
+ - const: spi_ioclk
+ else:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: spi
+ - enum:
+ - spi_busclk0
+ - spi_busclk1
+ - spi_busclk2
+ - spi_busclk3
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos5433.h>
+ #include <dt-bindings/clock/samsung,s2mps11.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/gpio/gpio.h>
+
+ spi@14d30000 {
+ compatible = "samsung,exynos5433-spi";
+ reg = <0x14d30000 0x100>;
+ interrupts = <GIC_SPI 433 IRQ_TYPE_LEVEL_HIGH>;
+ dmas = <&pdma0 11>, <&pdma0 10>;
+ dma-names = "tx", "rx";
+ #address-cells = <1>;
+ #size-cells = <0>;
+ clocks = <&cmu_peric CLK_PCLK_SPI1>,
+ <&cmu_peric CLK_SCLK_SPI1>,
+ <&cmu_peric CLK_SCLK_IOCLK_SPI1>;
+ clock-names = "spi",
+ "spi_busclk0",
+ "spi_ioclk";
+ samsung,spi-src-clk = <0>;
+ pinctrl-names = "default";
+ pinctrl-0 = <&spi1_bus>;
+ num-cs = <1>;
+
+ cs-gpios = <&gpd6 3 GPIO_ACTIVE_HIGH>;
+
+ audio-codec@0 {
+ compatible = "wlf,wm5110";
+ reg = <0x0>;
+ spi-max-frequency = <20000000>;
+ interrupt-parent = <&gpa0>;
+ interrupts = <4 IRQ_TYPE_NONE>;
+ clocks = <&pmu_system_controller 0>,
+ <&s2mps13_osc S2MPS11_CLK_BT>;
+ clock-names = "mclk1", "mclk2";
+
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+
+ wlf,micd-detect-debounce = <300>;
+ wlf,micd-bias-start-time = <0x1>;
+ wlf,micd-rate = <0x7>;
+ wlf,micd-dbtime = <0x2>;
+ wlf,micd-force-micbias;
+ wlf,micd-configs = <0x0 1 0>;
+ wlf,hpdet-channel = <1>;
+ wlf,gpsw = <0x1>;
+ wlf,inmode = <2 0 2 0>;
+
+ wlf,reset = <&gpc0 7 GPIO_ACTIVE_HIGH>;
+ wlf,ldoena = <&gpf0 0 GPIO_ACTIVE_HIGH>;
+
+ /* core supplies */
+ AVDD-supply = <&ldo18_reg>;
+ DBVDD1-supply = <&ldo18_reg>;
+ CPVDD-supply = <&ldo18_reg>;
+ DBVDD2-supply = <&ldo18_reg>;
+ DBVDD3-supply = <&ldo18_reg>;
+ SPKVDDL-supply = <&ldo18_reg>;
+ SPKVDDR-supply = <&ldo18_reg>;
+
+ controller-data {
+ samsung,spi-feedback-delay = <0>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/spi/spi-controller.yaml b/Documentation/devicetree/bindings/spi/spi-controller.yaml
index 36b72518f565..0f4d40218400 100644
--- a/Documentation/devicetree/bindings/spi/spi-controller.yaml
+++ b/Documentation/devicetree/bindings/spi/spi-controller.yaml
@@ -139,4 +139,11 @@ examples:
spi-max-frequency = <100000>;
reg = <1>;
};
+
+ flash@2 {
+ compatible = "jedec,spi-nor";
+ spi-max-frequency = <50000000>;
+ reg = <2>, <3>;
+ stacked-memories = /bits/ 64 <0x10000000 0x10000000>;
+ };
};
diff --git a/Documentation/devicetree/bindings/spi/spi-mt65xx.txt b/Documentation/devicetree/bindings/spi/spi-mt65xx.txt
deleted file mode 100644
index 2a24969159cc..000000000000
--- a/Documentation/devicetree/bindings/spi/spi-mt65xx.txt
+++ /dev/null
@@ -1,68 +0,0 @@
-Binding for MTK SPI controller
-
-Required properties:
-- compatible: should be one of the following.
- - mediatek,mt2701-spi: for mt2701 platforms
- - mediatek,mt2712-spi: for mt2712 platforms
- - mediatek,mt6589-spi: for mt6589 platforms
- - mediatek,mt6765-spi: for mt6765 platforms
- - mediatek,mt7622-spi: for mt7622 platforms
- - "mediatek,mt7629-spi", "mediatek,mt7622-spi": for mt7629 platforms
- - mediatek,mt8135-spi: for mt8135 platforms
- - mediatek,mt8173-spi: for mt8173 platforms
- - mediatek,mt8183-spi: for mt8183 platforms
- - mediatek,mt6893-spi: for mt6893 platforms
- - "mediatek,mt8192-spi", "mediatek,mt6765-spi": for mt8192 platforms
- - "mediatek,mt8195-spi", "mediatek,mt6765-spi": for mt8195 platforms
- - "mediatek,mt8516-spi", "mediatek,mt2712-spi": for mt8516 platforms
- - "mediatek,mt6779-spi", "mediatek,mt6765-spi": for mt6779 platforms
-
-- #address-cells: should be 1.
-
-- #size-cells: should be 0.
-
-- reg: Address and length of the register set for the device
-
-- interrupts: Should contain spi interrupt
-
-- clocks: phandles to input clocks.
- The first should be one of the following. It's PLL.
- - <&clk26m>: specify parent clock 26MHZ.
- - <&topckgen CLK_TOP_SYSPLL3_D2>: specify parent clock 109MHZ.
- It's the default one.
- - <&topckgen CLK_TOP_SYSPLL4_D2>: specify parent clock 78MHZ.
- - <&topckgen CLK_TOP_UNIVPLL2_D4>: specify parent clock 104MHZ.
- - <&topckgen CLK_TOP_UNIVPLL1_D8>: specify parent clock 78MHZ.
- The second should be <&topckgen CLK_TOP_SPI_SEL>. It's clock mux.
- The third is <&pericfg CLK_PERI_SPI0>. It's clock gate.
-
-- clock-names: shall be "parent-clk" for the parent clock, "sel-clk" for the
- muxes clock, and "spi-clk" for the clock gate.
-
-Optional properties:
--cs-gpios: see spi-bus.txt.
-
-- mediatek,pad-select: specify which pins group(ck/mi/mo/cs) spi
- controller used. This is an array, the element value should be 0~3,
- only required for MT8173.
- 0: specify GPIO69,70,71,72 for spi pins.
- 1: specify GPIO102,103,104,105 for spi pins.
- 2: specify GPIO128,129,130,131 for spi pins.
- 3: specify GPIO5,6,7,8 for spi pins.
-
-Example:
-
-- SoC Specific Portion:
-spi: spi@1100a000 {
- compatible = "mediatek,mt8173-spi";
- #address-cells = <1>;
- #size-cells = <0>;
- reg = <0 0x1100a000 0 0x1000>;
- interrupts = <GIC_SPI 110 IRQ_TYPE_LEVEL_LOW>;
- clocks = <&topckgen CLK_TOP_SYSPLL3_D2>,
- <&topckgen CLK_TOP_SPI_SEL>,
- <&pericfg CLK_PERI_SPI0>;
- clock-names = "parent-clk", "sel-clk", "spi-clk";
- cs-gpios = <&pio 105 GPIO_ACTIVE_LOW>, <&pio 72 GPIO_ACTIVE_LOW>;
- mediatek,pad-select = <1>, <0>;
-};
diff --git a/Documentation/devicetree/bindings/spi/spi-nxp-fspi.yaml b/Documentation/devicetree/bindings/spi/spi-nxp-fspi.yaml
index 283815d59e85..1b552c298277 100644
--- a/Documentation/devicetree/bindings/spi/spi-nxp-fspi.yaml
+++ b/Documentation/devicetree/bindings/spi/spi-nxp-fspi.yaml
@@ -7,7 +7,8 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: NXP Flex Serial Peripheral Interface (FSPI)
maintainers:
- - Kuldeep Singh <kuldeep.singh@nxp.com>
+ - Han Xu <han.xu@nxp.com>
+ - Kuldeep Singh <singh.kuldeep87k@gmail.com>
allOf:
- $ref: "spi-controller.yaml#"
diff --git a/Documentation/devicetree/bindings/spi/spi-peripheral-props.yaml b/Documentation/devicetree/bindings/spi/spi-peripheral-props.yaml
index 3ec2d7b83775..5e32928c4fc3 100644
--- a/Documentation/devicetree/bindings/spi/spi-peripheral-props.yaml
+++ b/Documentation/devicetree/bindings/spi/spi-peripheral-props.yaml
@@ -83,8 +83,34 @@ properties:
description:
Delay, in microseconds, after a write transfer.
+ stacked-memories:
+ description: Several SPI memories can be wired in stacked mode.
+ This basically means that either a device features several chip
+ selects, or that different devices must be seen as a single
+ bigger chip. This basically doubles (or more) the total address
+ space with only a single additional wire, while still needing
+ to repeat the commands when crossing a chip boundary. The size of
+ each chip should be provided as members of the array.
+ $ref: /schemas/types.yaml#/definitions/uint64-array
+ minItems: 2
+ maxItems: 4
+
+ parallel-memories:
+ description: Several SPI memories can be wired in parallel mode.
+ The devices are physically on a different buses but will always
+ act synchronously as each data word is spread across the
+ different memories (eg. even bits are stored in one memory, odd
+ bits in the other). This basically doubles the address space and
+ the throughput while greatly complexifying the wiring because as
+ many busses as devices must be wired. The size of each chip should
+ be provided as members of the array.
+ $ref: /schemas/types.yaml#/definitions/uint64-array
+ minItems: 2
+ maxItems: 4
+
# The controller specific properties go here.
allOf:
- $ref: cdns,qspi-nor-peripheral-props.yaml#
+ - $ref: samsung,spi-peripheral-props.yaml#
additionalProperties: true
diff --git a/Documentation/devicetree/bindings/spi/spi-pl022.yaml b/Documentation/devicetree/bindings/spi/spi-pl022.yaml
index 6d633728fc2b..bda45ff3d294 100644
--- a/Documentation/devicetree/bindings/spi/spi-pl022.yaml
+++ b/Documentation/devicetree/bindings/spi/spi-pl022.yaml
@@ -38,9 +38,7 @@ properties:
clock-names:
items:
- - enum:
- - SSPCLK
- - sspclk
+ - const: sspclk
- const: apb_pclk
pl022,autosuspend-delay:
diff --git a/Documentation/devicetree/bindings/spi/spi-samsung.txt b/Documentation/devicetree/bindings/spi/spi-samsung.txt
deleted file mode 100644
index 49028a4f5df1..000000000000
--- a/Documentation/devicetree/bindings/spi/spi-samsung.txt
+++ /dev/null
@@ -1,122 +0,0 @@
-* Samsung SPI Controller
-
-The Samsung SPI controller is used to interface with various devices such as flash
-and display controllers using the SPI communication interface.
-
-Required SoC Specific Properties:
-
-- compatible: should be one of the following.
- - samsung,s3c2443-spi: for s3c2443, s3c2416 and s3c2450 platforms
- - samsung,s3c6410-spi: for s3c6410 platforms
- - samsung,s5pv210-spi: for s5pv210 and s5pc110 platforms
- - samsung,exynos5433-spi: for exynos5433 compatible controllers
- - samsung,exynos7-spi: for exynos7 platforms <DEPRECATED>
-
-- reg: physical base address of the controller and length of memory mapped
- region.
-
-- interrupts: The interrupt number to the cpu. The interrupt specifier format
- depends on the interrupt controller.
-
-- dmas : Two or more DMA channel specifiers following the convention outlined
- in bindings/dma/dma.txt
-
-- dma-names: Names for the dma channels. There must be at least one channel
- named "tx" for transmit and named "rx" for receive.
-
-- clocks: specifies the clock IDs provided to the SPI controller; they are
- required for interacting with the controller itself, for synchronizing the bus
- and as I/O clock (the latter is required by exynos5433 and exynos7).
-
-- clock-names: string names of the clocks in the 'clocks' property; for all the
- the devices the names must be "spi", "spi_busclkN" (where N is determined by
- "samsung,spi-src-clk"), while Exynos5433 should specify a third clock
- "spi_ioclk" for the I/O clock.
-
-Required Board Specific Properties:
-
-- #address-cells: should be 1.
-- #size-cells: should be 0.
-
-Optional Board Specific Properties:
-
-- samsung,spi-src-clk: If the spi controller includes a internal clock mux to
- select the clock source for the spi bus clock, this property can be used to
- indicate the clock to be used for driving the spi bus clock. If not specified,
- the clock number 0 is used as default.
-
-- num-cs: Specifies the number of chip select lines supported. If
- not specified, the default number of chip select lines is set to 1.
-
-- cs-gpios: should specify GPIOs used for chipselects (see spi-bus.txt)
-
-- no-cs-readback: the CS line is disconnected, therefore the device should not
- operate based on CS signalling.
-
-SPI Controller specific data in SPI slave nodes:
-
-- The spi slave nodes should provide the following information which is required
- by the spi controller.
-
- - samsung,spi-feedback-delay: The sampling phase shift to be applied on the
- miso line (to account for any lag in the miso line). The following are the
- valid values.
-
- - 0: No phase shift.
- - 1: 90 degree phase shift sampling.
- - 2: 180 degree phase shift sampling.
- - 3: 270 degree phase shift sampling.
-
-Aliases:
-
-- All the SPI controller nodes should be represented in the aliases node using
- the following format 'spi{n}' where n is a unique number for the alias.
-
-
-Example:
-
-- SoC Specific Portion:
-
- spi_0: spi@12d20000 {
- compatible = "samsung,exynos4210-spi";
- reg = <0x12d20000 0x100>;
- interrupts = <0 66 0>;
- dmas = <&pdma0 5
- &pdma0 4>;
- dma-names = "tx", "rx";
- #address-cells = <1>;
- #size-cells = <0>;
- };
-
-- Board Specific Portion:
-
- spi_0: spi@12d20000 {
- #address-cells = <1>;
- #size-cells = <0>;
- pinctrl-names = "default";
- pinctrl-0 = <&spi0_bus>;
- cs-gpios = <&gpa2 5 0>;
-
- w25q80bw@0 {
- #address-cells = <1>;
- #size-cells = <1>;
- compatible = "w25x80";
- reg = <0>;
- spi-max-frequency = <10000>;
-
- controller-data {
- samsung,spi-feedback-delay = <0>;
- };
-
- partition@0 {
- label = "U-Boot";
- reg = <0x0 0x40000>;
- read-only;
- };
-
- partition@40000 {
- label = "Kernel";
- reg = <0x40000 0xc0000>;
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/spi/spi-slave-mt27xx.txt b/Documentation/devicetree/bindings/spi/spi-slave-mt27xx.txt
deleted file mode 100644
index 9192724540fd..000000000000
--- a/Documentation/devicetree/bindings/spi/spi-slave-mt27xx.txt
+++ /dev/null
@@ -1,33 +0,0 @@
-Binding for MTK SPI Slave controller
-
-Required properties:
-- compatible: should be one of the following.
- - mediatek,mt2712-spi-slave: for mt2712 platforms
- - mediatek,mt8195-spi-slave: for mt8195 platforms
-- reg: Address and length of the register set for the device.
-- interrupts: Should contain spi interrupt.
-- clocks: phandles to input clocks.
- It's clock gate, and should be <&infracfg CLK_INFRA_AO_SPI1>.
-- clock-names: should be "spi" for the clock gate.
-
-Optional properties:
-- assigned-clocks: it's mux clock, should be <&topckgen CLK_TOP_SPISLV_SEL>.
-- assigned-clock-parents: parent of mux clock.
- It's PLL, and should be one of the following.
- - <&topckgen CLK_TOP_UNIVPLL1_D2>: specify parent clock 312MHZ.
- It's the default one.
- - <&topckgen CLK_TOP_UNIVPLL1_D4>: specify parent clock 156MHZ.
- - <&topckgen CLK_TOP_UNIVPLL2_D4>: specify parent clock 104MHZ.
- - <&topckgen CLK_TOP_UNIVPLL1_D8>: specify parent clock 78MHZ.
-
-Example:
-- SoC Specific Portion:
-spis1: spi@10013000 {
- compatible = "mediatek,mt2712-spi-slave";
- reg = <0 0x10013000 0 0x100>;
- interrupts = <GIC_SPI 283 IRQ_TYPE_LEVEL_LOW>;
- clocks = <&infracfg CLK_INFRA_AO_SPI1>;
- clock-names = "spi";
- assigned-clocks = <&topckgen CLK_TOP_SPISLV_SEL>;
- assigned-clock-parents = <&topckgen CLK_TOP_UNIVPLL1_D2>;
-};
diff --git a/Documentation/devicetree/bindings/spi/spi-sunplus-sp7021.yaml b/Documentation/devicetree/bindings/spi/spi-sunplus-sp7021.yaml
new file mode 100644
index 000000000000..3a58cf0f1ec8
--- /dev/null
+++ b/Documentation/devicetree/bindings/spi/spi-sunplus-sp7021.yaml
@@ -0,0 +1,78 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+# Copyright (C) Sunplus Co., Ltd. 2021
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/spi/spi-sunplus-sp7021.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Sunplus sp7021 SPI controller
+
+allOf:
+ - $ref: "spi-controller.yaml"
+
+maintainers:
+ - Li-hao Kuo <lhjeff911@gmail.com>
+
+properties:
+ compatible:
+ enum:
+ - sunplus,sp7021-spi
+
+ reg:
+ items:
+ - description: the SPI master registers
+ - description: the SPI slave registers
+
+ reg-names:
+ items:
+ - const: master
+ - const: slave
+
+ interrupt-names:
+ items:
+ - const: dma_w
+ - const: master_risc
+ - const: slave_risc
+
+ interrupts:
+ minItems: 3
+
+ clocks:
+ maxItems: 1
+
+ resets:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - interrupts
+ - interrupt-names
+ - clocks
+ - resets
+ - pinctrl-names
+ - pinctrl-0
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+ spi@9C002D80 {
+ compatible = "sunplus,sp7021-spi";
+ reg = <0x9C002D80 0x80>, <0x9C002E00 0x80>;
+ reg-names = "master", "slave";
+ interrupt-parent = <&intc>;
+ interrupt-names = "dma_w",
+ "master_risc",
+ "slave_risc";
+ interrupts = <144 IRQ_TYPE_LEVEL_HIGH>,
+ <146 IRQ_TYPE_LEVEL_HIGH>,
+ <145 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&clkc 0x32>;
+ resets = <&rstc 0x22>;
+ pinctrl-names = "default";
+ pinctrl-0 = <&pins_spi0>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/thermal/exynos-thermal.txt b/Documentation/devicetree/bindings/thermal/exynos-thermal.txt
deleted file mode 100644
index 33004ce7e5df..000000000000
--- a/Documentation/devicetree/bindings/thermal/exynos-thermal.txt
+++ /dev/null
@@ -1,106 +0,0 @@
-* Exynos Thermal Management Unit (TMU)
-
-** Required properties:
-
-- compatible : One of the following:
- "samsung,exynos3250-tmu"
- "samsung,exynos4412-tmu"
- "samsung,exynos4210-tmu"
- "samsung,exynos5250-tmu"
- "samsung,exynos5260-tmu"
- "samsung,exynos5420-tmu" for TMU channel 0, 1 on Exynos5420
- "samsung,exynos5420-tmu-ext-triminfo" for TMU channels 2, 3 and 4
- Exynos5420 (Must pass triminfo base and triminfo clock)
- "samsung,exynos5433-tmu"
- "samsung,exynos7-tmu"
-- reg : Address range of the thermal registers. For soc's which has multiple
- instances of TMU and some registers are shared across all TMU's like
- interrupt related then 2 set of register has to supplied. First set
- belongs to register set of TMU instance and second set belongs to
- registers shared with the TMU instance.
-
- NOTE: On Exynos5420, the TRIMINFO register is misplaced for TMU
- channels 2, 3 and 4
- Use "samsung,exynos5420-tmu-ext-triminfo" in cases, there is a misplaced
- register, also provide clock to access that base.
-
- TRIMINFO at 0x1006c000 contains data for TMU channel 3
- TRIMINFO at 0x100a0000 contains data for TMU channel 4
- TRIMINFO at 0x10068000 contains data for TMU channel 2
-
-- interrupts : Should contain interrupt for thermal system
-- clocks : The main clocks for TMU device
- -- 1. operational clock for TMU channel
- -- 2. optional clock to access the shared registers of TMU channel
- -- 3. optional special clock for functional operation
-- clock-names : Thermal system clock name
- -- "tmu_apbif" operational clock for current TMU channel
- -- "tmu_triminfo_apbif" clock to access the shared triminfo register
- for current TMU channel
- -- "tmu_sclk" clock for functional operation of the current TMU
- channel
-
-The Exynos TMU supports generating interrupts when reaching given
-temperature thresholds. Number of supported thermal trip points depends
-on the SoC (only first trip points defined in DT will be configured):
- - most of SoC: 4
- - samsung,exynos5433-tmu: 8
- - samsung,exynos7-tmu: 8
-
-** Optional properties:
-
-- vtmu-supply: This entry is optional and provides the regulator node supplying
- voltage to TMU. If needed this entry can be placed inside
- board/platform specific dts file.
-
-Example 1):
-
- tmu@100c0000 {
- compatible = "samsung,exynos4412-tmu";
- interrupt-parent = <&combiner>;
- reg = <0x100C0000 0x100>;
- interrupts = <2 4>;
- clocks = <&clock 383>;
- clock-names = "tmu_apbif";
- vtmu-supply = <&tmu_regulator_node>;
- #thermal-sensor-cells = <0>;
- };
-
-Example 2): (In case of Exynos5420 "with misplaced TRIMINFO register")
- tmu_cpu2: tmu@10068000 {
- compatible = "samsung,exynos5420-tmu-ext-triminfo";
- reg = <0x10068000 0x100>, <0x1006c000 0x4>;
- interrupts = <0 184 0>;
- clocks = <&clock 318>, <&clock 318>;
- clock-names = "tmu_apbif", "tmu_triminfo_apbif";
- #thermal-sensor-cells = <0>;
- };
-
- tmu_cpu3: tmu@1006c000 {
- compatible = "samsung,exynos5420-tmu-ext-triminfo";
- reg = <0x1006c000 0x100>, <0x100a0000 0x4>;
- interrupts = <0 185 0>;
- clocks = <&clock 318>, <&clock 319>;
- clock-names = "tmu_apbif", "tmu_triminfo_apbif";
- #thermal-sensor-cells = <0>;
- };
-
- tmu_gpu: tmu@100a0000 {
- compatible = "samsung,exynos5420-tmu-ext-triminfo";
- reg = <0x100a0000 0x100>, <0x10068000 0x4>;
- interrupts = <0 215 0>;
- clocks = <&clock 319>, <&clock 318>;
- clock-names = "tmu_apbif", "tmu_triminfo_apbif";
- #thermal-sensor-cells = <0>;
- };
-
-Note: For multi-instance tmu each instance should have an alias correctly
-numbered in "aliases" node.
-
-Example:
-
-aliases {
- tmuctrl0 = &tmuctrl_0;
- tmuctrl1 = &tmuctrl_1;
- tmuctrl2 = &tmuctrl_2;
-};
diff --git a/Documentation/devicetree/bindings/thermal/qcom-lmh.yaml b/Documentation/devicetree/bindings/thermal/qcom-lmh.yaml
index 289e9a845600..a9b7388ca9ac 100644
--- a/Documentation/devicetree/bindings/thermal/qcom-lmh.yaml
+++ b/Documentation/devicetree/bindings/thermal/qcom-lmh.yaml
@@ -19,6 +19,7 @@ properties:
compatible:
enum:
- qcom,sdm845-lmh
+ - qcom,sm8150-lmh
reg:
items:
diff --git a/Documentation/devicetree/bindings/thermal/qcom-tsens.yaml b/Documentation/devicetree/bindings/thermal/qcom-tsens.yaml
index d3b9e9b600a2..b6406bcc683f 100644
--- a/Documentation/devicetree/bindings/thermal/qcom-tsens.yaml
+++ b/Documentation/devicetree/bindings/thermal/qcom-tsens.yaml
@@ -43,6 +43,7 @@ properties:
- description: v2 of TSENS
items:
- enum:
+ - qcom,msm8953-tsens
- qcom,msm8996-tsens
- qcom,msm8998-tsens
- qcom,sc7180-tsens
diff --git a/Documentation/devicetree/bindings/thermal/samsung,exynos-thermal.yaml b/Documentation/devicetree/bindings/thermal/samsung,exynos-thermal.yaml
new file mode 100644
index 000000000000..17129f75d962
--- /dev/null
+++ b/Documentation/devicetree/bindings/thermal/samsung,exynos-thermal.yaml
@@ -0,0 +1,184 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/thermal/samsung,exynos-thermal.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos SoC Thermal Management Unit (TMU)
+
+maintainers:
+ - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+
+description: |
+ For multi-instance tmu each instance should have an alias correctly numbered
+ in "aliases" node.
+
+properties:
+ compatible:
+ enum:
+ - samsung,exynos3250-tmu
+ - samsung,exynos4412-tmu
+ - samsung,exynos4210-tmu
+ - samsung,exynos5250-tmu
+ - samsung,exynos5260-tmu
+ # For TMU channel 0, 1 on Exynos5420:
+ - samsung,exynos5420-tmu
+ # For TMU channels 2, 3 and 4 of Exynos5420:
+ - samsung,exynos5420-tmu-ext-triminfo
+ - samsung,exynos5433-tmu
+ - samsung,exynos7-tmu
+
+ clocks:
+ minItems: 1
+ maxItems: 3
+
+ clock-names:
+ minItems: 1
+ maxItems: 3
+
+ interrupts:
+ description: |
+ The Exynos TMU supports generating interrupts when reaching given
+ temperature thresholds. Number of supported thermal trip points depends
+ on the SoC (only first trip points defined in DT will be configured)::
+ - most of SoC: 4
+ - samsung,exynos5433-tmu: 8
+ - samsung,exynos7-tmu: 8
+ maxItems: 1
+
+ reg:
+ items:
+ - description: TMU instance registers.
+ - description: |
+ Shared TMU registers.
+
+ Note:: On Exynos5420, the TRIMINFO register is misplaced for TMU
+ channels 2, 3 and 4 Use "samsung,exynos5420-tmu-ext-triminfo" in
+ cases, there is a misplaced register, also provide clock to access
+ that base.
+ TRIMINFO at 0x1006c000 contains data for TMU channel 3
+ TRIMINFO at 0x100a0000 contains data for TMU channel 4
+ TRIMINFO at 0x10068000 contains data for TMU channel 2
+ minItems: 1
+
+ '#thermal-sensor-cells': true
+
+ vtmu-supply:
+ description: The regulator node supplying voltage to TMU.
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - interrupts
+ - reg
+
+allOf:
+ - $ref: /schemas/thermal/thermal-sensor.yaml
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5420-tmu-ext-triminfo
+ then:
+ properties:
+ clocks:
+ items:
+ - description:
+ Operational clock for TMU channel.
+ - description:
+ Optional clock to access the shared registers (e.g. TRIMINFO) of TMU
+ channel.
+ clock-names:
+ items:
+ - const: tmu_apbif
+ - const: tmu_triminfo_apbif
+ reg:
+ minItems: 2
+ maxItems: 2
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - samsung,exynos5433-tmu
+ - samsung,exynos7-tmu
+ then:
+ properties:
+ clocks:
+ items:
+ - description:
+ Operational clock for TMU channel.
+ - description:
+ Optional special clock for functional operation of TMU channel.
+ clock-names:
+ items:
+ - const: tmu_apbif
+ - const: tmu_sclk
+ reg:
+ minItems: 1
+ maxItems: 1
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - samsung,exynos3250-tmu
+ - samsung,exynos4412-tmu
+ - samsung,exynos4210-tmu
+ - samsung,exynos5250-tmu
+ - samsung,exynos5260-tmu
+ - samsung,exynos5420-tmu
+ then:
+ properties:
+ clocks:
+ minItems: 1
+ maxItems: 1
+ reg:
+ minItems: 1
+ maxItems: 1
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos4.h>
+
+ tmu@100c0000 {
+ compatible = "samsung,exynos4412-tmu";
+ reg = <0x100C0000 0x100>;
+ interrupt-parent = <&combiner>;
+ interrupts = <2 4>;
+ #thermal-sensor-cells = <0>;
+ clocks = <&clock CLK_TMU_APBIF>;
+ clock-names = "tmu_apbif";
+ vtmu-supply = <&ldo10_reg>;
+ };
+
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ tmu@10068000 {
+ compatible = "samsung,exynos5420-tmu-ext-triminfo";
+ reg = <0x10068000 0x100>, <0x1006c000 0x4>;
+ interrupts = <GIC_SPI 184 IRQ_TYPE_LEVEL_HIGH>;
+ #thermal-sensor-cells = <0>;
+ clocks = <&clock 318>, <&clock 318>; /* CLK_TMU */
+ clock-names = "tmu_apbif", "tmu_triminfo_apbif";
+ vtmu-supply = <&ldo7_reg>;
+ };
+
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ tmu@10060000 {
+ compatible = "samsung,exynos5433-tmu";
+ reg = <0x10060000 0x200>;
+ interrupts = <GIC_SPI 95 IRQ_TYPE_LEVEL_HIGH>;
+ #thermal-sensor-cells = <0>;
+ clocks = <&cmu_peris 3>, /* CLK_PCLK_TMU0_APBIF */
+ <&cmu_peris 35>; /* CLK_SCLK_TMU0 */
+ clock-names = "tmu_apbif", "tmu_sclk";
+ vtmu-supply = <&ldo3_reg>;
+ };
diff --git a/Documentation/devicetree/bindings/timer/nvidia,tegra-timer.yaml b/Documentation/devicetree/bindings/timer/nvidia,tegra-timer.yaml
new file mode 100644
index 000000000000..b78209cd0f28
--- /dev/null
+++ b/Documentation/devicetree/bindings/timer/nvidia,tegra-timer.yaml
@@ -0,0 +1,150 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/timer/nvidia,tegra-timer.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: NVIDIA Tegra timer
+
+maintainers:
+ - Stephen Warren <swarren@nvidia.com>
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: nvidia,tegra210-timer
+ then:
+ properties:
+ interrupts:
+ # Either a single combined interrupt or up to 14 individual interrupts
+ minItems: 1
+ maxItems: 14
+ description: >
+ A list of 14 interrupts; one per each timer channels 0 through 13
+
+ - if:
+ properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - nvidia,tegra114-timer
+ - nvidia,tegra124-timer
+ - nvidia,tegra132-timer
+ - const: nvidia,tegra30-timer
+ - items:
+ - const: nvidia,tegra30-timer
+ - const: nvidia,tegra20-timer
+ then:
+ properties:
+ interrupts:
+ # Either a single combined interrupt or up to 6 individual interrupts
+ minItems: 1
+ maxItems: 6
+ description: >
+ A list of 6 interrupts; one per each of timer channels 1 through 5,
+ and one for the shared interrupt for the remaining channels.
+
+ - if:
+ properties:
+ compatible:
+ const: nvidia,tegra20-timer
+ then:
+ properties:
+ interrupts:
+ # Either a single combined interrupt or up to 4 individual interrupts
+ minItems: 1
+ maxItems: 4
+ description: |
+ A list of 4 interrupts; one per timer channel.
+
+properties:
+ compatible:
+ oneOf:
+ - const: nvidia,tegra210-timer
+ description: >
+ The Tegra210 timer provides fourteen 29-bit timer counters and one 32-bit
+ timestamp counter. The TMRs run at either a fixed 1 MHz clock rate derived
+ from the oscillator clock (TMR0-TMR9) or directly at the oscillator clock
+ (TMR10-TMR13). Each TMR can be programmed to generate one-shot, periodic,
+ or watchdog interrupts.
+ - items:
+ - enum:
+ - nvidia,tegra114-timer
+ - nvidia,tegra124-timer
+ - nvidia,tegra132-timer
+ - const: nvidia,tegra30-timer
+ - items:
+ - const: nvidia,tegra30-timer
+ - const: nvidia,tegra20-timer
+ description: >
+ The Tegra30 timer provides ten 29-bit timer channels, a single 32-bit free
+ running counter, and 5 watchdog modules. The first two channels may also
+ trigger a legacy watchdog reset.
+ - const: nvidia,tegra20-timer
+ description: >
+ The Tegra20 timer provides four 29-bit timer channels and a single 32-bit free
+ running counter. The first two channels may also trigger a watchdog reset.
+
+ reg:
+ maxItems: 1
+
+ interrupts: true
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: timer
+
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+ timer@60005000 {
+ compatible = "nvidia,tegra30-timer", "nvidia,tegra20-timer";
+ reg = <0x60005000 0x400>;
+ interrupts = <0 0 IRQ_TYPE_LEVEL_HIGH>,
+ <0 1 IRQ_TYPE_LEVEL_HIGH>,
+ <0 41 IRQ_TYPE_LEVEL_HIGH>,
+ <0 42 IRQ_TYPE_LEVEL_HIGH>,
+ <0 121 IRQ_TYPE_LEVEL_HIGH>,
+ <0 122 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car 214>;
+ };
+ - |
+ #include <dt-bindings/clock/tegra210-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ timer@60005000 {
+ compatible = "nvidia,tegra210-timer";
+ reg = <0x60005000 0x400>;
+ interrupts = <GIC_SPI 156 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 0 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 1 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 41 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 42 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 121 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 152 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 153 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 154 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 155 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 176 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 177 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 178 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 179 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car TEGRA210_CLK_TIMER>;
+ clock-names = "timer";
+ };
diff --git a/Documentation/devicetree/bindings/timer/nvidia,tegra20-timer.txt b/Documentation/devicetree/bindings/timer/nvidia,tegra20-timer.txt
deleted file mode 100644
index 4a864bd10d3d..000000000000
--- a/Documentation/devicetree/bindings/timer/nvidia,tegra20-timer.txt
+++ /dev/null
@@ -1,24 +0,0 @@
-NVIDIA Tegra20 timer
-
-The Tegra20 timer provides four 29-bit timer channels and a single 32-bit free
-running counter. The first two channels may also trigger a watchdog reset.
-
-Required properties:
-
-- compatible : should be "nvidia,tegra20-timer".
-- reg : Specifies base physical address and size of the registers.
-- interrupts : A list of 4 interrupts; one per timer channel.
-- clocks : Must contain one entry, for the module clock.
- See ../clocks/clock-bindings.txt for details.
-
-Example:
-
-timer {
- compatible = "nvidia,tegra20-timer";
- reg = <0x60005000 0x60>;
- interrupts = <0 0 0x04
- 0 1 0x04
- 0 41 0x04
- 0 42 0x04>;
- clocks = <&tegra_car 132>;
-};
diff --git a/Documentation/devicetree/bindings/timer/nvidia,tegra210-timer.txt b/Documentation/devicetree/bindings/timer/nvidia,tegra210-timer.txt
deleted file mode 100644
index 032cda96fe0d..000000000000
--- a/Documentation/devicetree/bindings/timer/nvidia,tegra210-timer.txt
+++ /dev/null
@@ -1,36 +0,0 @@
-NVIDIA Tegra210 timer
-
-The Tegra210 timer provides fourteen 29-bit timer counters and one 32-bit
-timestamp counter. The TMRs run at either a fixed 1 MHz clock rate derived
-from the oscillator clock (TMR0-TMR9) or directly at the oscillator clock
-(TMR10-TMR13). Each TMR can be programmed to generate one-shot, periodic,
-or watchdog interrupts.
-
-Required properties:
-- compatible : "nvidia,tegra210-timer".
-- reg : Specifies base physical address and size of the registers.
-- interrupts : A list of 14 interrupts; one per each timer channels 0 through
- 13.
-- clocks : Must contain one entry, for the module clock.
- See ../clocks/clock-bindings.txt for details.
-
-timer@60005000 {
- compatible = "nvidia,tegra210-timer";
- reg = <0x0 0x60005000 0x0 0x400>;
- interrupts = <GIC_SPI 156 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 0 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 1 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 41 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 42 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 121 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 152 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 153 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 154 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 155 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 176 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 177 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 178 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 179 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&tegra_car TEGRA210_CLK_TIMER>;
- clock-names = "timer";
-};
diff --git a/Documentation/devicetree/bindings/timer/nvidia,tegra30-timer.txt b/Documentation/devicetree/bindings/timer/nvidia,tegra30-timer.txt
deleted file mode 100644
index 1761f53ee36f..000000000000
--- a/Documentation/devicetree/bindings/timer/nvidia,tegra30-timer.txt
+++ /dev/null
@@ -1,28 +0,0 @@
-NVIDIA Tegra30 timer
-
-The Tegra30 timer provides ten 29-bit timer channels, a single 32-bit free
-running counter, and 5 watchdog modules. The first two channels may also
-trigger a legacy watchdog reset.
-
-Required properties:
-
-- compatible : For Tegra30, must contain "nvidia,tegra30-timer". Otherwise,
- must contain '"nvidia,<chip>-timer", "nvidia,tegra30-timer"' where
- <chip> is tegra124 or tegra132.
-- reg : Specifies base physical address and size of the registers.
-- interrupts : A list of 6 interrupts; one per each of timer channels 1
- through 5, and one for the shared interrupt for the remaining channels.
-- clocks : Must contain one entry, for the module clock.
- See ../clocks/clock-bindings.txt for details.
-
-timer {
- compatible = "nvidia,tegra30-timer", "nvidia,tegra20-timer";
- reg = <0x60005000 0x400>;
- interrupts = <0 0 0x04
- 0 1 0x04
- 0 41 0x04
- 0 42 0x04
- 0 121 0x04
- 0 122 0x04>;
- clocks = <&tegra_car 214>;
-};
diff --git a/Documentation/devicetree/bindings/trivial-devices.yaml b/Documentation/devicetree/bindings/trivial-devices.yaml
index 091792ba993e..da929cb08463 100644
--- a/Documentation/devicetree/bindings/trivial-devices.yaml
+++ b/Documentation/devicetree/bindings/trivial-devices.yaml
@@ -137,6 +137,8 @@ properties:
- infineon,slb9645tt
# Infineon TLV493D-A1B6 I2C 3D Magnetic Sensor
- infineon,tlv493d-a1b6
+ # Infineon Multi-phase Digital VR Controller xdpe11280
+ - infineon,xdpe11280
# Infineon Multi-phase Digital VR Controller xdpe12254
- infineon,xdpe12254
# Infineon Multi-phase Digital VR Controller xdpe12284
@@ -337,6 +339,7 @@ properties:
# Thermometer with SPI interface
- ti,tmp121
- ti,tmp122
+ - ti,tmp125
# Digital Temperature Sensor
- ti,tmp275
# TI DC-DC converter on PMBus
@@ -354,6 +357,8 @@ properties:
- ti,tps544c25
# Winbond/Nuvoton H/W Monitor
- winbond,w83793
+ # Vicor Corporation Digital Supervisor
+ - vicor,pli1209bc
# i2c trusted platform module (TPM)
- winbond,wpct301
diff --git a/Documentation/devicetree/bindings/vendor-prefixes.yaml b/Documentation/devicetree/bindings/vendor-prefixes.yaml
index 294093d45a23..047a83a089ce 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.yaml
+++ b/Documentation/devicetree/bindings/vendor-prefixes.yaml
@@ -1298,6 +1298,8 @@ patternProperties:
description: Vertexcom Technologies, Inc.
"^via,.*":
description: VIA Technologies, Inc.
+ "^vicor,.*":
+ description: Vicor Corporation
"^videostrong,.*":
description: Videostrong Technology Co., Ltd.
"^virtio,.*":
diff --git a/Documentation/driver-api/gpio/board.rst b/Documentation/driver-api/gpio/board.rst
index 191fa867826a..4e3adf31c8d1 100644
--- a/Documentation/driver-api/gpio/board.rst
+++ b/Documentation/driver-api/gpio/board.rst
@@ -71,14 +71,14 @@ with the help of _DSD (Device Specific Data), introduced in ACPI 5.1::
Device (FOO) {
Name (_CRS, ResourceTemplate () {
- GpioIo (Exclusive, ..., IoRestrictionOutputOnly,
- "\\_SB.GPI0") {15} // red
- GpioIo (Exclusive, ..., IoRestrictionOutputOnly,
- "\\_SB.GPI0") {16} // green
- GpioIo (Exclusive, ..., IoRestrictionOutputOnly,
- "\\_SB.GPI0") {17} // blue
- GpioIo (Exclusive, ..., IoRestrictionOutputOnly,
- "\\_SB.GPI0") {1} // power
+ GpioIo (Exclusive, PullUp, 0, 0, IoRestrictionOutputOnly,
+ "\\_SB.GPI0", 0, ResourceConsumer) { 15 } // red
+ GpioIo (Exclusive, PullUp, 0, 0, IoRestrictionOutputOnly,
+ "\\_SB.GPI0", 0, ResourceConsumer) { 16 } // green
+ GpioIo (Exclusive, PullUp, 0, 0, IoRestrictionOutputOnly,
+ "\\_SB.GPI0", 0, ResourceConsumer) { 17 } // blue
+ GpioIo (Exclusive, PullNone, 0, 0, IoRestrictionOutputOnly,
+ "\\_SB.GPI0", 0, ResourceConsumer) { 1 } // power
})
Name (_DSD, Package () {
@@ -92,10 +92,7 @@ with the help of _DSD (Device Specific Data), introduced in ACPI 5.1::
^FOO, 2, 0, 1,
}
},
- Package () {
- "power-gpios",
- Package () {^FOO, 3, 0, 0},
- },
+ Package () { "power-gpios", Package () { ^FOO, 3, 0, 0 } },
}
})
}
diff --git a/Documentation/driver-api/mtd/index.rst b/Documentation/driver-api/mtd/index.rst
index 436ba5a851d7..6a4278f409d7 100644
--- a/Documentation/driver-api/mtd/index.rst
+++ b/Documentation/driver-api/mtd/index.rst
@@ -7,6 +7,6 @@ Memory Technology Device (MTD)
.. toctree::
:maxdepth: 1
- intel-spi
+ spi-intel
nand_ecc
spi-nor
diff --git a/Documentation/driver-api/mtd/intel-spi.rst b/Documentation/driver-api/mtd/spi-intel.rst
index 0465f6879262..df854f20ead1 100644
--- a/Documentation/driver-api/mtd/intel-spi.rst
+++ b/Documentation/driver-api/mtd/spi-intel.rst
@@ -1,5 +1,5 @@
==============================
-Upgrading BIOS using intel-spi
+Upgrading BIOS using spi-intel
==============================
Many Intel CPUs like Baytrail and Braswell include SPI serial flash host
@@ -11,12 +11,12 @@ avoid accidental (or on purpose) overwrite of the content.
Not all manufacturers protect the SPI serial flash, mainly because it
allows upgrading the BIOS image directly from an OS.
-The intel-spi driver makes it possible to read and write the SPI serial
+The spi-intel driver makes it possible to read and write the SPI serial
flash, if certain protection bits are not set and locked. If it finds
any of them set, the whole MTD device is made read-only to prevent
partial overwrites. By default the driver exposes SPI serial flash
contents as read-only but it can be changed from kernel command line,
-passing "intel-spi.writeable=1".
+passing "spi_intel.writeable=1".
Please keep in mind that overwriting the BIOS image on SPI serial flash
might render the machine unbootable and requires special equipment like
@@ -32,7 +32,7 @@ Linux.
serial flash. Distros like Debian and Fedora have this prepackaged with
name "mtd-utils".
- 3) Add "intel-spi.writeable=1" to the kernel command line and reboot
+ 3) Add "spi_intel.writeable=1" to the kernel command line and reboot
the board (you can also reload the driver passing "writeable=1" as
module parameter to modprobe).
diff --git a/Documentation/driver-api/serial/driver.rst b/Documentation/driver-api/serial/driver.rst
index 31bd4e16fb1f..06ec04ba086f 100644
--- a/Documentation/driver-api/serial/driver.rst
+++ b/Documentation/driver-api/serial/driver.rst
@@ -311,7 +311,7 @@ hardware.
This call must not sleep
set_ldisc(port,termios)
- Notifier for discipline change. See Documentation/driver-api/serial/tty.rst.
+ Notifier for discipline change. See Documentation/tty/tty_ldisc.rst.
Locking: caller holds tty_port->mutex
diff --git a/Documentation/driver-api/thermal/index.rst b/Documentation/driver-api/thermal/index.rst
index 4cb0b9b6bfb8..030306ffa408 100644
--- a/Documentation/driver-api/thermal/index.rst
+++ b/Documentation/driver-api/thermal/index.rst
@@ -17,3 +17,4 @@ Thermal
intel_powerclamp
nouveau_thermal
x86_pkg_temperature_thermal
+ intel_dptf
diff --git a/Documentation/driver-api/thermal/intel_dptf.rst b/Documentation/driver-api/thermal/intel_dptf.rst
new file mode 100644
index 000000000000..96668dca753a
--- /dev/null
+++ b/Documentation/driver-api/thermal/intel_dptf.rst
@@ -0,0 +1,272 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================================================
+Intel(R) Dynamic Platform and Thermal Framework Sysfs Interface
+===============================================================
+
+:Copyright: |copy| 2022 Intel Corporation
+
+:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
+
+Introduction
+------------
+
+Intel(R) Dynamic Platform and Thermal Framework (DPTF) is a platform
+level hardware/software solution for power and thermal management.
+
+As a container for multiple power/thermal technologies, DPTF provides
+a coordinated approach for different policies to effect the hardware
+state of a system.
+
+Since it is a platform level framework, this has several components.
+Some parts of the technology is implemented in the firmware and uses
+ACPI and PCI devices to expose various features for monitoring and
+control. Linux has a set of kernel drivers exposing hardware interface
+to user space. This allows user space thermal solutions like
+"Linux Thermal Daemon" to read platform specific thermal and power
+tables to deliver adequate performance while keeping the system under
+thermal limits.
+
+DPTF ACPI Drivers interface
+----------------------------
+
+:file:`/sys/bus/platform/devices/<N>/uuids`, where <N>
+=INT3400|INTC1040|INTC1041|INTC10A0
+
+``available_uuids`` (RO)
+ A set of UUIDs strings presenting available policies
+ which should be notified to the firmware when the
+ user space can support those policies.
+
+ UUID strings:
+
+ "42A441D6-AE6A-462b-A84B-4A8CE79027D3" : Passive 1
+
+ "3A95C389-E4B8-4629-A526-C52C88626BAE" : Active
+
+ "97C68AE7-15FA-499c-B8C9-5DA81D606E0A" : Critical
+
+ "63BE270F-1C11-48FD-A6F7-3AF253FF3E2D" : Adaptive performance
+
+ "5349962F-71E6-431D-9AE8-0A635B710AEE" : Emergency call
+
+ "9E04115A-AE87-4D1C-9500-0F3E340BFE75" : Passive 2
+
+ "F5A35014-C209-46A4-993A-EB56DE7530A1" : Power Boss
+
+ "6ED722A7-9240-48A5-B479-31EEF723D7CF" : Virtual Sensor
+
+ "16CAF1B7-DD38-40ED-B1C1-1B8A1913D531" : Cooling mode
+
+ "BE84BABF-C4D4-403D-B495-3128FD44dAC1" : HDC
+
+``current_uuid`` (RW)
+ User space can write strings from available UUIDs, one at a
+ time.
+
+:file:`/sys/bus/platform/devices/<N>/`, where <N>
+=INT3400|INTC1040|INTC1041|INTC10A0
+
+``imok`` (WO)
+ User space daemon write 1 to respond to firmware event
+ for sending keep alive notification. User space receives
+ THERMAL_EVENT_KEEP_ALIVE kobject uevent notification when
+ firmware calls for user space to respond with imok ACPI
+ method.
+
+``odvp*`` (RO)
+ Firmware thermal status variable values. Thermal tables
+ calls for different processing based on these variable
+ values.
+
+``data_vault`` (RO)
+ Binary thermal table. Refer to
+ https:/github.com/intel/thermal_daemon for decoding
+ thermal table.
+
+
+ACPI Thermal Relationship table interface
+------------------------------------------
+
+:file:`/dev/acpi_thermal_rel`
+
+ This device provides IOCTL interface to read standard ACPI
+ thermal relationship tables via ACPI methods _TRT and _ART.
+ These IOCTLs are defined in
+ drivers/thermal/intel/int340x_thermal/acpi_thermal_rel.h
+
+ IOCTLs:
+
+ ACPI_THERMAL_GET_TRT_LEN: Get length of TRT table
+
+ ACPI_THERMAL_GET_ART_LEN: Get length of ART table
+
+ ACPI_THERMAL_GET_TRT_COUNT: Number of records in TRT table
+
+ ACPI_THERMAL_GET_ART_COUNT: Number of records in ART table
+
+ ACPI_THERMAL_GET_TRT: Read binary TRT table, length to read is
+ provided via argument to ioctl().
+
+ ACPI_THERMAL_GET_ART: Read binary ART table, length to read is
+ provided via argument to ioctl().
+
+DPTF ACPI Sensor drivers
+-------------------------
+
+DPTF Sensor drivers are presented as standard thermal sysfs thermal_zone.
+
+
+DPTF ACPI Cooling drivers
+--------------------------
+
+DPTF cooling drivers are presented as standard thermal sysfs cooling_device.
+
+
+DPTF Processor thermal PCI Driver interface
+--------------------------------------------
+
+:file:`/sys/bus/pci/devices/0000\:00\:04.0/power_limits/`
+
+Refer to Documentation/power/powercap/powercap.rst for powercap
+ABI.
+
+``power_limit_0_max_uw`` (RO)
+ Maximum powercap sysfs constraint_0_power_limit_uw for Intel RAPL
+
+``power_limit_0_step_uw`` (RO)
+ Power limit increment/decrements for Intel RAPL constraint 0 power limit
+
+``power_limit_0_min_uw`` (RO)
+ Minimum powercap sysfs constraint_0_power_limit_uw for Intel RAPL
+
+``power_limit_0_tmin_us`` (RO)
+ Minimum powercap sysfs constraint_0_time_window_us for Intel RAPL
+
+``power_limit_0_tmax_us`` (RO)
+ Maximum powercap sysfs constraint_0_time_window_us for Intel RAPL
+
+``power_limit_1_max_uw`` (RO)
+ Maximum powercap sysfs constraint_1_power_limit_uw for Intel RAPL
+
+``power_limit_1_step_uw`` (RO)
+ Power limit increment/decrements for Intel RAPL constraint 1 power limit
+
+``power_limit_1_min_uw`` (RO)
+ Minimum powercap sysfs constraint_1_power_limit_uw for Intel RAPL
+
+``power_limit_1_tmin_us`` (RO)
+ Minimum powercap sysfs constraint_1_time_window_us for Intel RAPL
+
+``power_limit_1_tmax_us`` (RO)
+ Maximum powercap sysfs constraint_1_time_window_us for Intel RAPL
+
+:file:`/sys/bus/pci/devices/0000\:00\:04.0/`
+
+``tcc_offset_degree_celsius`` (RW)
+ TCC offset from the critical temperature where hardware will throttle
+ CPU.
+
+:file:`/sys/bus/pci/devices/0000\:00\:04.0/workload_request`
+
+``workload_available_types`` (RO)
+ Available workload types. User space can specify one of the workload type
+ it is currently executing via workload_type. For example: idle, bursty,
+ sustained etc.
+
+``workload_type`` (RW)
+ User space can specify any one of the available workload type using
+ this interface.
+
+DPTF Processor thermal RFIM interface
+--------------------------------------------
+
+RFIM interface allows adjustment of FIVR (Fully Integrated Voltage Regulator)
+and DDR (Double Data Rate)frequencies to avoid RF interference with WiFi and 5G.
+
+Switching voltage regulators (VR) generate radiated EMI or RFI at the
+fundamental frequency and its harmonics. Some harmonics may interfere
+with very sensitive wireless receivers such as Wi-Fi and cellular that
+are integrated into host systems like notebook PCs. One of mitigation
+methods is requesting SOC integrated VR (IVR) switching frequency to a
+small % and shift away the switching noise harmonic interference from
+radio channels. OEM or ODMs can use the driver to control SOC IVR
+operation within the range where it does not impact IVR performance.
+
+DRAM devices of DDR IO interface and their power plane can generate EMI
+at the data rates. Similar to IVR control mechanism, Intel offers a
+mechanism by which DDR data rates can be changed if several conditions
+are met: there is strong RFI interference because of DDR; CPU power
+management has no other restriction in changing DDR data rates;
+PC ODMs enable this feature (real time DDR RFI Mitigation referred to as
+DDR-RFIM) for Wi-Fi from BIOS.
+
+
+FIVR attributes
+
+:file:`/sys/bus/pci/devices/0000\:00\:04.0/fivr/`
+
+``vco_ref_code_lo`` (RW)
+ The VCO reference code is an 11-bit field and controls the FIVR
+ switching frequency. This is the 3-bit LSB field.
+
+``vco_ref_code_hi`` (RW)
+ The VCO reference code is an 11-bit field and controls the FIVR
+ switching frequency. This is the 8-bit MSB field.
+
+``spread_spectrum_pct`` (RW)
+ Set the FIVR spread spectrum clocking percentage
+
+``spread_spectrum_clk_enable`` (RW)
+ Enable/disable of the FIVR spread spectrum clocking feature
+
+``rfi_vco_ref_code`` (RW)
+ This field is a read only status register which reflects the
+ current FIVR switching frequency
+
+``fivr_fffc_rev`` (RW)
+ This field indicated the revision of the FIVR HW.
+
+
+DVFS attributes
+
+:file:`/sys/bus/pci/devices/0000\:00\:04.0/dvfs/`
+
+``rfi_restriction_run_busy`` (RW)
+ Request the restriction of specific DDR data rate and set this
+ value 1. Self reset to 0 after operation.
+
+``rfi_restriction_err_code`` (RW)
+ 0 :Request is accepted, 1:Feature disabled,
+ 2: the request restricts more points than it is allowed
+
+``rfi_restriction_data_rate_Delta`` (RW)
+ Restricted DDR data rate for RFI protection: Lower Limit
+
+``rfi_restriction_data_rate_Base`` (RW)
+ Restricted DDR data rate for RFI protection: Upper Limit
+
+``ddr_data_rate_point_0`` (RO)
+ DDR data rate selection 1st point
+
+``ddr_data_rate_point_1`` (RO)
+ DDR data rate selection 2nd point
+
+``ddr_data_rate_point_2`` (RO)
+ DDR data rate selection 3rd point
+
+``ddr_data_rate_point_3`` (RO)
+ DDR data rate selection 4th point
+
+``rfi_disable (RW)``
+ Disable DDR rate change feature
+
+DPTF Power supply and Battery Interface
+----------------------------------------
+
+Refer to Documentation/ABI/testing/sysfs-platform-dptf
+
+DPTF Fan Control
+----------------------------------------
+
+Refer to Documentation/admin-guide/acpi/fan_performance_states.rst
diff --git a/Documentation/filesystems/dax.rst b/Documentation/filesystems/dax.rst
index e3b30429d703..c04609d8ee24 100644
--- a/Documentation/filesystems/dax.rst
+++ b/Documentation/filesystems/dax.rst
@@ -23,11 +23,11 @@ on it as usual. The `DAX` code currently only supports files with a block
size equal to your kernel's `PAGE_SIZE`, so you may need to specify a block
size when creating the filesystem.
-Currently 4 filesystems support `DAX`: ext2, ext4, xfs and virtiofs.
+Currently 5 filesystems support `DAX`: ext2, ext4, xfs, virtiofs and erofs.
Enabling `DAX` on them is different.
-Enabling DAX on ext2
---------------------
+Enabling DAX on ext2 and erofs
+------------------------------
When mounting the filesystem, use the ``-o dax`` option on the command line or
add 'dax' to the options in ``/etc/fstab``. This works to enable `DAX` on all files
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index 7119aa213be7..bef6d3040ce4 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -40,7 +40,7 @@ Here is the main features of EROFS:
Inode metadata size 32 bytes 64 bytes
Max file size 4 GB 16 EB (also limited by max. vol size)
Max uids/gids 65536 4294967296
- File change time no yes (64 + 32-bit timestamp)
+ Per-inode timestamp no yes (64 + 32-bit timestamp)
Max hardlinks 65536 4294967296
Metadata reserved 4 bytes 14 bytes
===================== ============ =====================================
diff --git a/Documentation/filesystems/ext4/blocks.rst b/Documentation/filesystems/ext4/blocks.rst
index bd722ecd92d6..b0f80ea87c90 100644
--- a/Documentation/filesystems/ext4/blocks.rst
+++ b/Documentation/filesystems/ext4/blocks.rst
@@ -39,7 +39,7 @@ For 32-bit filesystems, limits are as follows:
- 4TiB
- 8TiB
- 16TiB
- - 256PiB
+ - 256TiB
* - Blocks Per Block Group
- 8,192
- 16,384
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index 4d5d50dca65c..6ccd5efb25b7 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -1047,8 +1047,8 @@ astute users may notice some differences in behavior:
may be used to overwrite the source files but isn't guaranteed to be
effective on all filesystems and storage devices.
-- Direct I/O is not supported on encrypted files. Attempts to use
- direct I/O on such files will fall back to buffered I/O.
+- Direct I/O is supported on encrypted files only under some
+ circumstances. For details, see `Direct I/O support`_.
- The fallocate operations FALLOC_FL_COLLAPSE_RANGE and
FALLOC_FL_INSERT_RANGE are not supported on encrypted files and will
@@ -1179,6 +1179,27 @@ Inline encryption doesn't affect the ciphertext or other aspects of
the on-disk format, so users may freely switch back and forth between
using "inlinecrypt" and not using "inlinecrypt".
+Direct I/O support
+==================
+
+For direct I/O on an encrypted file to work, the following conditions
+must be met (in addition to the conditions for direct I/O on an
+unencrypted file):
+
+* The file must be using inline encryption. Usually this means that
+ the filesystem must be mounted with ``-o inlinecrypt`` and inline
+ encryption hardware must be present. However, a software fallback
+ is also available. For details, see `Inline encryption support`_.
+
+* The I/O request must be fully aligned to the filesystem block size.
+ This means that the file position the I/O is targeting, the lengths
+ of all I/O segments, and the memory addresses of all I/O buffers
+ must be multiples of this value. Note that the filesystem block
+ size may be greater than the logical block size of the block device.
+
+If either of the above conditions is not met, then direct I/O on the
+encrypted file will fall back to buffered I/O.
+
Implementation details
======================
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 3f9b1497ebb8..aaca0b601819 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -438,13 +438,13 @@ prototypes::
locking rules:
====================== ============= ================= =========
-ops inode->i_lock blocked_lock_lock may block
+ops flc_lock blocked_lock_lock may block
====================== ============= ================= =========
-lm_notify: yes yes no
+lm_notify: no yes no
lm_grant: no no no
lm_break: yes no no
lm_change yes no no
-lm_breaker_owns_lease: no no no
+lm_breaker_owns_lease: yes no no
====================== ============= ================= =========
buffer_head
diff --git a/Documentation/firmware-guide/acpi/enumeration.rst b/Documentation/firmware-guide/acpi/enumeration.rst
index 74b830b2fd59..c414646a1bb4 100644
--- a/Documentation/firmware-guide/acpi/enumeration.rst
+++ b/Documentation/firmware-guide/acpi/enumeration.rst
@@ -19,16 +19,17 @@ possible we decided to do following:
platform devices.
- Devices behind real busses where there is a connector resource
- are represented as struct spi_device or struct i2c_device
- (standard UARTs are not busses so there is no struct uart_device).
+ are represented as struct spi_device or struct i2c_device. Note
+ that standard UARTs are not busses so there is no struct uart_device,
+ although some of them may be represented by sturct serdev_device.
As both ACPI and Device Tree represent a tree of devices (and their
resources) this implementation follows the Device Tree way as much as
possible.
-The ACPI implementation enumerates devices behind busses (platform, SPI and
-I2C), creates the physical devices and binds them to their ACPI handle in
-the ACPI namespace.
+The ACPI implementation enumerates devices behind busses (platform, SPI,
+I2C, and in some cases UART), creates the physical devices and binds them
+to their ACPI handle in the ACPI namespace.
This means that when ACPI_HANDLE(dev) returns non-NULL the device was
enumerated from ACPI namespace. This handle can be used to extract other
@@ -46,18 +47,16 @@ some minor changes.
Adding ACPI support for an existing driver should be pretty
straightforward. Here is the simplest example::
- #ifdef CONFIG_ACPI
static const struct acpi_device_id mydrv_acpi_match[] = {
/* ACPI IDs here */
{ }
};
MODULE_DEVICE_TABLE(acpi, mydrv_acpi_match);
- #endif
static struct platform_driver my_driver = {
...
.driver = {
- .acpi_match_table = ACPI_PTR(mydrv_acpi_match),
+ .acpi_match_table = mydrv_acpi_match,
},
};
@@ -155,7 +154,7 @@ Here is what the ACPI namespace for a SPI slave might look like::
Device (EEP0)
{
Name (_ADR, 1)
- Name (_CID, Package() {
+ Name (_CID, Package () {
"ATML0025",
"AT25",
})
@@ -172,59 +171,51 @@ The SPI device drivers only need to add ACPI IDs in a similar way than with
the platform device drivers. Below is an example where we add ACPI support
to at25 SPI eeprom driver (this is meant for the above ACPI snippet)::
- #ifdef CONFIG_ACPI
static const struct acpi_device_id at25_acpi_match[] = {
{ "AT25", 0 },
- { },
+ { }
};
MODULE_DEVICE_TABLE(acpi, at25_acpi_match);
- #endif
static struct spi_driver at25_driver = {
.driver = {
...
- .acpi_match_table = ACPI_PTR(at25_acpi_match),
+ .acpi_match_table = at25_acpi_match,
},
};
Note that this driver actually needs more information like page size of the
-eeprom etc. but at the time writing this there is no standard way of
-passing those. One idea is to return this in _DSM method like::
+eeprom, etc. This information can be passed via _DSD method like::
Device (EEP0)
{
...
- Method (_DSM, 4, NotSerialized)
+ Name (_DSD, Package ()
{
- Store (Package (6)
+ ToUUID("daffd814-6eba-4d8c-8a91-bc9bbf4aa301"),
+ Package ()
{
- "byte-len", 1024,
- "addr-mode", 2,
- "page-size, 32
- }, Local0)
-
- // Check UUIDs etc.
-
- Return (Local0)
- }
-
-Then the at25 SPI driver can get this configuration by calling _DSM on its
-ACPI handle like::
-
- struct acpi_buffer output = { ACPI_ALLOCATE_BUFFER, NULL };
- struct acpi_object_list input;
- acpi_status status;
+ Package () { "size", 1024 },
+ Package () { "pagesize", 32 },
+ Package () { "address-width", 16 },
+ }
+ })
+ }
- /* Fill in the input buffer */
+Then the at25 SPI driver can get this configuration by calling device property
+APIs during ->probe() phase like::
- status = acpi_evaluate_object(ACPI_HANDLE(&spi->dev), "_DSM",
- &input, &output);
- if (ACPI_FAILURE(status))
- /* Handle the error */
+ err = device_property_read_u32(dev, "size", &size);
+ if (err)
+ ...error handling...
- /* Extract the data here */
+ err = device_property_read_u32(dev, "pagesize", &page_size);
+ if (err)
+ ...error handling...
- kfree(output.pointer);
+ err = device_property_read_u32(dev, "address-width", &addr_width);
+ if (err)
+ ...error handling...
I2C serial bus support
======================
@@ -237,26 +228,24 @@ registered.
Below is an example of how to add ACPI support to the existing mpu3050
input driver::
- #ifdef CONFIG_ACPI
static const struct acpi_device_id mpu3050_acpi_match[] = {
{ "MPU3050", 0 },
- { },
+ { }
};
MODULE_DEVICE_TABLE(acpi, mpu3050_acpi_match);
- #endif
static struct i2c_driver mpu3050_i2c_driver = {
.driver = {
.name = "mpu3050",
- .owner = THIS_MODULE,
.pm = &mpu3050_pm,
.of_match_table = mpu3050_of_match,
- .acpi_match_table = ACPI_PTR(mpu3050_acpi_match),
+ .acpi_match_table = mpu3050_acpi_match,
},
.probe = mpu3050_probe,
.remove = mpu3050_remove,
.id_table = mpu3050_ids,
};
+ module_i2c_driver(mpu3050_i2c_driver);
Reference to PWM device
=======================
@@ -282,9 +271,9 @@ introduced, i.e.::
}
}
}
-
})
...
+ }
In the above example the PWM-based LED driver references to the PWM channel 0
of \_SB.PCI0.PWM device with initial period setting equal to 600 ms (note that
@@ -306,26 +295,13 @@ For example::
{
Name (SBUF, ResourceTemplate()
{
- ...
// Used to power on/off the device
- GpioIo (Exclusive, PullDefault, 0x0000, 0x0000,
- IoRestrictionOutputOnly, "\\_SB.PCI0.GPI0",
- 0x00, ResourceConsumer,,)
- {
- // Pin List
- 0x0055
- }
+ GpioIo (Exclusive, PullNone, 0, 0, IoRestrictionOutputOnly,
+ "\\_SB.PCI0.GPI0", 0, ResourceConsumer) { 85 }
// Interrupt for the device
- GpioInt (Edge, ActiveHigh, ExclusiveAndWake, PullNone,
- 0x0000, "\\_SB.PCI0.GPI0", 0x00, ResourceConsumer,,)
- {
- // Pin list
- 0x0058
- }
-
- ...
-
+ GpioInt (Edge, ActiveHigh, ExclusiveAndWake, PullNone, 0,
+ "\\_SB.PCI0.GPI0", 0, ResourceConsumer) { 88 }
}
Return (SBUF)
@@ -337,11 +313,12 @@ For example::
ToUUID("daffd814-6eba-4d8c-8a91-bc9bbf4aa301"),
Package ()
{
- Package () {"power-gpios", Package() {^DEV, 0, 0, 0 }},
- Package () {"irq-gpios", Package() {^DEV, 1, 0, 0 }},
+ Package () { "power-gpios", Package () { ^DEV, 0, 0, 0 } },
+ Package () { "irq-gpios", Package () { ^DEV, 1, 0, 0 } },
}
})
...
+ }
These GPIO numbers are controller relative and path "\\_SB.PCI0.GPI0"
specifies the path to the controller. In order to use these GPIOs in Linux
@@ -460,10 +437,10 @@ namespace link::
Device (TMP0)
{
Name (_HID, "PRP0001")
- Name (_DSD, Package() {
+ Name (_DSD, Package () {
ToUUID("daffd814-6eba-4d8c-8a91-bc9bbf4aa301"),
Package () {
- Package (2) { "compatible", "ti,tmp75" },
+ Package () { "compatible", "ti,tmp75" },
}
})
Method (_CRS, 0, Serialized)
diff --git a/Documentation/firmware-guide/acpi/gpio-properties.rst b/Documentation/firmware-guide/acpi/gpio-properties.rst
index df4b711053ee..eaec732cc77c 100644
--- a/Documentation/firmware-guide/acpi/gpio-properties.rst
+++ b/Documentation/firmware-guide/acpi/gpio-properties.rst
@@ -21,18 +21,18 @@ index, like the ASL example below shows::
Name (_CRS, ResourceTemplate ()
{
GpioIo (Exclusive, PullUp, 0, 0, IoRestrictionOutputOnly,
- "\\_SB.GPO0", 0, ResourceConsumer) {15}
+ "\\_SB.GPO0", 0, ResourceConsumer) { 15 }
GpioIo (Exclusive, PullUp, 0, 0, IoRestrictionOutputOnly,
- "\\_SB.GPO0", 0, ResourceConsumer) {27, 31}
+ "\\_SB.GPO0", 0, ResourceConsumer) { 27, 31 }
})
Name (_DSD, Package ()
{
ToUUID("daffd814-6eba-4d8c-8a91-bc9bbf4aa301"),
Package ()
- {
- Package () {"reset-gpios", Package() {^BTH, 1, 1, 0 }},
- Package () {"shutdown-gpios", Package() {^BTH, 0, 0, 0 }},
+ {
+ Package () { "reset-gpios", Package () { ^BTH, 1, 1, 0 } },
+ Package () { "shutdown-gpios", Package () { ^BTH, 0, 0, 0 } },
}
})
}
@@ -123,17 +123,17 @@ Example::
// _DSD Hierarchical Properties Extension UUID
ToUUID("dbb8e3e6-5886-4ba6-8795-1319f52a966b"),
Package () {
- Package () {"hog-gpio8", "G8PU"}
+ Package () { "hog-gpio8", "G8PU" }
}
})
Name (G8PU, Package () {
ToUUID("daffd814-6eba-4d8c-8a91-bc9bbf4aa301"),
Package () {
- Package () {"gpio-hog", 1},
- Package () {"gpios", Package () {8, 0}},
- Package () {"output-high", 1},
- Package () {"line-name", "gpio8-pullup"},
+ Package () { "gpio-hog", 1 },
+ Package () { "gpios", Package () { 8, 0 } },
+ Package () { "output-high", 1 },
+ Package () { "line-name", "gpio8-pullup" },
}
})
@@ -266,15 +266,17 @@ have a device like below::
Name (_CRS, ResourceTemplate () {
GpioIo (Exclusive, PullNone, 0, 0, IoRestrictionNone,
- "\\_SB.GPO0", 0, ResourceConsumer) {15}
+ "\\_SB.GPO0", 0, ResourceConsumer) { 15 }
GpioIo (Exclusive, PullNone, 0, 0, IoRestrictionNone,
- "\\_SB.GPO0", 0, ResourceConsumer) {27}
+ "\\_SB.GPO0", 0, ResourceConsumer) { 27 }
})
}
The driver might expect to get the right GPIO when it does::
desc = gpiod_get(dev, "reset", GPIOD_OUT_LOW);
+ if (IS_ERR(desc))
+ ...error handling...
but since there is no way to know the mapping between "reset" and
the GpioIo() in _CRS desc will hold ERR_PTR(-ENOENT).
diff --git a/Documentation/hwmon/aquacomputer_d5next.rst b/Documentation/hwmon/aquacomputer_d5next.rst
index 1f4bb4ba2e4b..3373e27b707d 100644
--- a/Documentation/hwmon/aquacomputer_d5next.rst
+++ b/Documentation/hwmon/aquacomputer_d5next.rst
@@ -6,22 +6,21 @@ Kernel driver aquacomputer-d5next
Supported devices:
* Aquacomputer D5 Next watercooling pump
+* Aquacomputer Farbwerk 360 RGB controller
Author: Aleksa Savic
Description
-----------
-This driver exposes hardware sensors of the Aquacomputer D5 Next watercooling
-pump, which communicates through a proprietary USB HID protocol.
+This driver exposes hardware sensors of listed Aquacomputer devices, which
+communicate through proprietary USB HID protocols.
-Available sensors are pump and fan speed, power, voltage and current, as
-well as coolant temperature. Also available through debugfs are the serial
-number, firmware version and power-on count.
-
-Attaching a fan is optional and allows it to be controlled using temperature
-curves directly from the pump. If it's not connected, the fan-related sensors
-will report zeroes.
+For the D5 Next pump, available sensors are pump and fan speed, power, voltage
+and current, as well as coolant temperature. Also available through debugfs are
+the serial number, firmware version and power-on count. Attaching a fan to it is
+optional and allows it to be controlled using temperature curves directly from the
+pump. If it's not connected, the fan-related sensors will report zeroes.
The pump can be configured either through software or via its physical
interface. Configuring the pump through this driver is not implemented, as it
@@ -29,33 +28,31 @@ seems to require sending it a complete configuration. That includes addressable
RGB LEDs, for which there is no standard sysfs interface. Thus, that task is
better suited for userspace tools.
+The Farbwerk 360 exposes four temperature sensors. Depending on the device,
+not all sysfs and debugfs entries will be available.
+
Usage notes
-----------
-The pump communicates via HID reports. The driver is loaded automatically by
+The devices communicate via HID reports. The driver is loaded automatically by
the kernel and supports hotswapping.
Sysfs entries
-------------
-============ =============================================
-temp1_input Coolant temperature (in millidegrees Celsius)
-fan1_input Pump speed (in RPM)
-fan2_input Fan speed (in RPM)
-power1_input Pump power (in micro Watts)
-power2_input Fan power (in micro Watts)
-in0_input Pump voltage (in milli Volts)
-in1_input Fan voltage (in milli Volts)
-in2_input +5V rail voltage (in milli Volts)
-curr1_input Pump current (in milli Amperes)
-curr2_input Fan current (in milli Amperes)
-============ =============================================
+================ =============================================
+temp[1-4]_input Temperature sensors (in millidegrees Celsius)
+fan[1-2]_input Pump/fan speed (in RPM)
+power[1-2]_input Pump/fan power (in micro Watts)
+in[0-2]_input Pump/fan voltage (in milli Volts)
+curr[1-2]_input Pump/fan current (in milli Amperes)
+================ =============================================
Debugfs entries
---------------
-================ ===============================================
-serial_number Serial number of the pump
+================ =================================================
+serial_number Serial number of the device
firmware_version Version of installed firmware
-power_cycles Count of how many times the pump was powered on
-================ ===============================================
+power_cycles Count of how many times the device was powered on
+================ =================================================
diff --git a/Documentation/hwmon/asus_ec_sensors.rst b/Documentation/hwmon/asus_ec_sensors.rst
new file mode 100644
index 000000000000..e7e8f1640f45
--- /dev/null
+++ b/Documentation/hwmon/asus_ec_sensors.rst
@@ -0,0 +1,54 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+Kernel driver asus_ec_sensors
+=================================
+
+Supported boards:
+ * PRIME X570-PRO,
+ * Pro WS X570-ACE,
+ * ROG CROSSHAIR VIII DARK HERO,
+ * ROG CROSSHAIR VIII HERO (WI-FI)
+ * ROG CROSSHAIR VIII FORMULA,
+ * ROG CROSSHAIR VIII HERO,
+ * ROG CROSSHAIR VIII IMPACT,
+ * ROG STRIX B550-E GAMING,
+ * ROG STRIX B550-I GAMING,
+ * ROG STRIX X570-E GAMING,
+ * ROG STRIX X570-F GAMING,
+ * ROG STRIX X570-I GAMING
+
+Authors:
+ - Eugene Shalygin <eugene.shalygin@gmail.com>
+
+Description:
+------------
+ASUS mainboards publish hardware monitoring information via Super I/O
+chip and the ACPI embedded controller (EC) registers. Some of the sensors
+are only available via the EC.
+
+The driver is aware of and reads the following sensors:
+
+1. Chipset (PCH) temperature
+2. CPU package temperature
+3. Motherboard temperature
+4. Readings from the T_Sensor header
+5. VRM temperature
+6. CPU_Opt fan RPM
+7. VRM heatsink fan RPM
+8. Chipset fan RPM
+9. Readings from the "Water flow meter" header (RPM)
+10. Readings from the "Water In" and "Water Out" temperature headers
+11. CPU current
+12. CPU core voltage
+
+Sensor values are read from EC registers, and to avoid race with the board
+firmware the driver acquires ACPI mutex, the one used by the WMI when its
+methods access the EC.
+
+Module Parameters
+-----------------
+ * mutex_path: string
+ The driver holds path to the ACPI mutex for each board (actually,
+ the path is mostly identical for them). If ASUS changes this path
+ in a future BIOS update, this parameter can be used to override
+ the stored in the driver value until it gets updated.
diff --git a/Documentation/hwmon/dell-smm-hwmon.rst b/Documentation/hwmon/dell-smm-hwmon.rst
index beec88491171..d3323a96665d 100644
--- a/Documentation/hwmon/dell-smm-hwmon.rst
+++ b/Documentation/hwmon/dell-smm-hwmon.rst
@@ -165,3 +165,183 @@ obtain the same information and to control the fan status. The ioctl
interface can be accessed from C programs or from shell using the
i8kctl utility. See the source file of ``i8kutils`` for more
information on how to use the ioctl interface.
+
+SMM Interface
+-------------
+
+.. warning:: The SMM interface was reverse-engineered by trial-and-error
+ since Dell did not provide any Documentation,
+ please keep that in mind.
+
+The driver uses the SMM interface to send commands to the system BIOS.
+This interface is normally used by Dell's 32-bit diagnostic program or
+on newer notebook models by the buildin BIOS diagnostics.
+The SMM is triggered by writing to the special ioports ``0xb2`` and ``0x84``,
+and may cause short hangs when the BIOS code is taking too long to
+execute.
+
+The SMM handler inside the system BIOS looks at the contents of the
+``eax``, ``ebx``, ``ecx``, ``edx``, ``esi`` and ``edi`` registers.
+Each register has a special purpose:
+
+=============== ==================================
+Register Purpose
+=============== ==================================
+eax Holds the command code before SMM,
+ holds the first result after SMM.
+ebx Holds the arguments.
+ecx Unknown, set to 0.
+edx Holds the second result after SMM.
+esi Unknown, set to 0.
+edi Unknown, set to 0.
+=============== ==================================
+
+The SMM handler can signal a failure by either:
+
+- setting the lower sixteen bits of ``eax`` to ``0xffff``
+- not modifying ``eax`` at all
+- setting the carry flag
+
+SMM command codes
+-----------------
+
+=============== ======================= ================================================
+Command Code Command Name Description
+=============== ======================= ================================================
+``0x0025`` Get Fn key status Returns the Fn key pressed after SMM:
+
+ - 9th bit in ``eax`` indicates Volume up
+ - 10th bit in ``eax`` indicates Volume down
+ - both bits indicate Volume mute
+
+``0xa069`` Get power status Returns current power status after SMM:
+
+ - 1st bit in ``eax`` indicates Battery connected
+ - 3th bit in ``eax`` indicates AC connected
+
+``0x00a3`` Get fan state Returns current fan state after SMM:
+
+ - 1st byte in ``eax`` holds the current
+ fan state (0 - 2 or 3)
+
+``0x01a3`` Set fan state Sets the fan speed:
+
+ - 1st byte in ``ebx`` holds the fan number
+ - 2nd byte in ``ebx`` holds the desired
+ fan state (0 - 2 or 3)
+
+``0x02a3`` Get fan speed Returns the current fan speed in RPM:
+
+ - 1st byte in ``ebx`` holds the fan number
+ - 1st word in ``eax`` holds the current
+ fan speed in RPM (after SMM)
+
+``0x03a3`` Get fan type Returns the fan type:
+
+ - 1st byte in ``ebx`` holds the fan number
+ - 1st byte in ``eax`` holds the
+ fan type (after SMM):
+
+ - 5th bit indicates docking fan
+ - 1 indicates Processor fan
+ - 2 indicates Motherboard fan
+ - 3 indicates Video fan
+ - 4 indicates Power supply fan
+ - 5 indicates Chipset fan
+ - 6 indicates other fan type
+
+``0x04a3`` Get nominal fan speed Returns the nominal RPM in each fan state:
+
+ - 1st byte in ``ebx`` holds the fan number
+ - 2nd byte in ``ebx`` holds the fan state
+ in question (0 - 2 or 3)
+ - 1st word in ``eax`` holds the nominal
+ fan speed in RPM (after SMM)
+
+``0x05a3`` Get fan speed tolerance Returns the speed tolerance for each fan state:
+
+ - 1st byte in ``ebx`` holds the fan number
+ - 2nd byte in ``ebx`` holds the fan state
+ in question (0 - 2 or 3)
+ - 1st byte in ``eax`` returns the speed
+ tolerance
+
+``0x10a3`` Get sensor temperature Returns the measured temperature:
+
+ - 1st byte in ``ebx`` holds the sensor number
+ - 1st byte in ``eax`` holds the measured
+ temperature (after SMM)
+
+``0x11a3`` Get sensor type Returns the sensor type:
+
+ - 1st byte in ``ebx`` holds the sensor number
+ - 1st byte in ``eax`` holds the
+ temperature type (after SMM):
+
+ - 1 indicates CPU sensor
+ - 2 indicates GPU sensor
+ - 3 indicates SODIMM sensor
+ - 4 indicates other sensor type
+ - 5 indicates Ambient sensor
+ - 6 indicates other sensor type
+
+``0xfea3`` Get SMM signature Returns Dell signature if interface
+ is supported (after SMM):
+
+ - ``eax`` holds 1145651527
+ (0x44494147 or "DIAG")
+ - ``edx`` holds 1145392204
+ (0x44454c4c or "DELL")
+
+``0xffa3`` Get SMM signature Same as ``0xfea3``, check both.
+=============== ======================= ================================================
+
+There are additional commands for enabling (``0x31a3`` or ``0x35a3``) and
+disabling (``0x30a3`` or ``0x34a3``) automatic fan speed control.
+The commands are however causing severe sideeffects on many machines, so
+they are not used by default.
+
+On several machines (Inspiron 3505, Precision 490, Vostro 1720, ...), the
+fans supports a 4th "magic" state, which signals the BIOS that automatic
+fan control should be enabled for a specific fan.
+However there are also some machines who do support a 4th regular fan state too,
+but in case of the "magic" state, the nominal RPM reported for this state is a
+placeholder value, which however is not always detectable.
+
+Firmware Bugs
+-------------
+
+The SMM calls can behave erratic on some machines:
+
+======================================================= =================
+Firmware Bug Affected Machines
+======================================================= =================
+Reading of fan states return spurious errors. Precision 490
+
+Reading of fan types causes erratic fan behaviour. Studio XPS 8000
+
+ Studio XPS 8100
+
+ Inspiron 580
+
+Fan-related SMM calls take too long (about 500ms). Inspiron 7720
+
+ Vostro 3360
+
+ XPS 13 9333
+
+ XPS 15 L502X
+======================================================= =================
+
+In case you experience similar issues on your Dell machine, please
+submit a bugreport on bugzilla to we can apply workarounds.
+
+Limitations
+-----------
+
+The SMM calls can take too long to execute on some machines, causing
+short hangs and/or audio glitches.
+Also the fan state needs to be restored after suspend, as well as
+the automatic mode settings.
+When reading a temperature sensor, values above 127 degrees indicate
+a BIOS read error or a deactivated sensor.
diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst
index df20022c741f..9d2787a12a69 100644
--- a/Documentation/hwmon/index.rst
+++ b/Documentation/hwmon/index.rst
@@ -43,6 +43,7 @@ Hardware Monitoring Kernel Drivers
asb100
asc7621
aspeed-pwm-tacho
+ asus_ec_sensors
asus_wmi_ec_sensors
asus_wmi_sensors
bcm54140
@@ -160,6 +161,7 @@ Hardware Monitoring Kernel Drivers
pc87427
pcf8591
pim4328
+ pli1209bc
pm6764tr
pmbus
powr1220
@@ -193,6 +195,7 @@ Hardware Monitoring Kernel Drivers
tmp108
tmp401
tmp421
+ tmp464
tmp513
tps23861
tps40422
diff --git a/Documentation/hwmon/lm70.rst b/Documentation/hwmon/lm70.rst
index 6ddc5b67ccb5..11303a7e16a8 100644
--- a/Documentation/hwmon/lm70.rst
+++ b/Documentation/hwmon/lm70.rst
@@ -15,6 +15,10 @@ Supported chips:
Information: https://www.ti.com/product/tmp122
+ * Texas Instruments TMP125
+
+ Information: https://www.ti.com/product/tmp125
+
* National Semiconductor LM71
Datasheet: https://www.ti.com/product/LM71
@@ -53,6 +57,9 @@ The LM74 and TMP121/TMP122/TMP123/TMP124 are very similar; main difference is
The TMP122/TMP124 also feature configurable temperature thresholds.
+The TMP125 is less accurate and provides 10-bit temperature data
+with 0.25 degrees Celsius resolution.
+
The LM71 is also very similar; main difference is 14-bit temperature
data (0.03125 degrees celsius resolution).
diff --git a/Documentation/hwmon/max6639.rst b/Documentation/hwmon/max6639.rst
index 3da54225f83c..c85d285a3489 100644
--- a/Documentation/hwmon/max6639.rst
+++ b/Documentation/hwmon/max6639.rst
@@ -9,7 +9,7 @@ Supported chips:
Addresses scanned: I2C 0x2c, 0x2e, 0x2f
- Datasheet: http://pdfserv.maxim-ic.com/en/ds/MAX6639.pdf
+ Datasheet: https://datasheets.maximintegrated.com/en/ds/MAX6639-MAX6639F.pdf
Authors:
- He Changqing <hechangqing@semptian.com>
diff --git a/Documentation/hwmon/pli1209bc.rst b/Documentation/hwmon/pli1209bc.rst
new file mode 100644
index 000000000000..ea5b3f68a515
--- /dev/null
+++ b/Documentation/hwmon/pli1209bc.rst
@@ -0,0 +1,75 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Kernel driver pli1209bc
+=======================
+
+Supported chips:
+
+ * Digital Supervisor PLI1209BC
+
+ Prefix: 'pli1209bc'
+
+ Addresses scanned: 0x50 - 0x5F
+
+ Datasheet: https://www.vicorpower.com/documents/datasheets/ds-PLI1209BCxyzz-VICOR.pdf
+
+Authors:
+ - Marcello Sylvester Bauer <sylv@sylv.io>
+
+Description
+-----------
+
+The Vicor PLI1209BC is an isolated digital power system supervisor that provides
+a communication interface between a host processor and one Bus Converter Module
+(BCM). The PLI communicates with a system controller via a PMBus compatible
+interface over an isolated UART interface. Through the PLI, the host processor
+can configure, set protection limits, and monitor the BCM.
+
+Sysfs entries
+-------------
+
+======================= ========================================================
+in1_label "vin2"
+in1_input Input voltage.
+in1_rated_min Minimum rated input voltage.
+in1_rated_max Maximum rated input voltage.
+in1_max Maximum input voltage.
+in1_max_alarm Input voltage high alarm.
+in1_crit Critical input voltage.
+in1_crit_alarm Input voltage critical alarm.
+
+in2_label "vout2"
+in2_input Output voltage.
+in2_rated_min Minimum rated output voltage.
+in2_rated_max Maximum rated output voltage.
+in2_alarm Output voltage alarm
+
+curr1_label "iin2"
+curr1_input Input current.
+curr1_max Maximum input current.
+curr1_max_alarm Maximum input current high alarm.
+curr1_crit Critical input current.
+curr1_crit_alarm Input current critical alarm.
+
+curr2_label "iout2"
+curr2_input Output current.
+curr2_crit Critical output current.
+curr2_crit_alarm Output current critical alarm.
+curr2_max Maximum output current.
+curr2_max_alarm Output current high alarm.
+
+power1_label "pin2"
+power1_input Input power.
+power1_alarm Input power alarm.
+
+power2_label "pout2"
+power2_input Output power.
+power2_rated_max Maximum rated output power.
+
+temp1_input Die temperature.
+temp1_alarm Die temperature alarm.
+temp1_max Maximum die temperature.
+temp1_max_alarm Die temperature high alarm.
+temp1_crit Critical die temperature.
+temp1_crit_alarm Die temperature critical alarm.
+======================= ========================================================
diff --git a/Documentation/hwmon/sch5627.rst b/Documentation/hwmon/sch5627.rst
index 187682e99114..ecb4fc84d045 100644
--- a/Documentation/hwmon/sch5627.rst
+++ b/Documentation/hwmon/sch5627.rst
@@ -20,6 +20,10 @@ Description
SMSC SCH5627 Super I/O chips include complete hardware monitoring
capabilities. They can monitor up to 5 voltages, 4 fans and 8 temperatures.
+In addition, the SCH5627 exports data describing which temperature sensors
+affect the speed of each fan. Setting pwmX_auto_channels_temp to 0 forces
+the corresponding fan to full speed until another value is written.
+
The SMSC SCH5627 hardware monitoring part also contains an integrated
watchdog. In order for this watchdog to function some motherboard specific
initialization most be done by the BIOS, so if the watchdog is not enabled
diff --git a/Documentation/hwmon/sysfs-interface.rst b/Documentation/hwmon/sysfs-interface.rst
index 85652a6aaa3e..209626fb2405 100644
--- a/Documentation/hwmon/sysfs-interface.rst
+++ b/Documentation/hwmon/sysfs-interface.rst
@@ -99,6 +99,10 @@ Global attributes
`name`
The chip name.
+`label`
+ A descriptive label that allows to uniquely identify a device
+ within the system.
+
`update_interval`
The interval at which the chip will update readings.
diff --git a/Documentation/hwmon/tmp464.rst b/Documentation/hwmon/tmp464.rst
new file mode 100644
index 000000000000..7596e7623d06
--- /dev/null
+++ b/Documentation/hwmon/tmp464.rst
@@ -0,0 +1,73 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Kernel driver tmp464
+====================
+
+Supported chips:
+
+ * Texas Instruments TMP464
+
+ Prefix: 'tmp464'
+
+ Addresses scanned: I2C 0x48, 0x49, 0x4a and 0x4b
+
+ Datasheet: http://focus.ti.com/docs/prod/folders/print/tmp464.html
+
+ * Texas Instruments TMP468
+
+ Prefix: 'tmp468'
+
+ Addresses scanned: I2C 0x48, 0x49, 0x4a and 0x4b
+
+ Datasheet: http://focus.ti.com/docs/prod/folders/print/tmp468.html
+
+Authors:
+
+ Agathe Porte <agathe.porte@nokia.com>
+ Guenter Roeck <linux@roeck-us.net>
+
+Description
+-----------
+
+This driver implements support for Texas Instruments TMP464 and TMP468
+temperature sensor chips. TMP464 provides one local and four remote
+sensors. TMP468 provides one local and eight remote sensors.
+Temperature is measured in degrees Celsius. The chips are wired over
+I2C/SMBus and specified over a temperature range of -40 to +125 degrees
+Celsius. Resolution for both the local and remote channels is 0.0625
+degree C.
+
+The chips support only temperature measurements. The driver exports
+temperature values, limits, and alarms via the following sysfs files:
+
+**temp[1-9]_input**
+
+**temp[1-9]_max**
+
+**temp[1-9]_max_hyst**
+
+**temp[1-9]_max_alarm**
+
+**temp[1-9]_crit**
+
+**temp[1-9]_crit_alarm**
+
+**temp[1-9]_crit_hyst**
+
+**temp[2-9]_offset**
+
+**temp[2-9]_fault**
+
+Each sensor can be individually disabled via Devicetree or from sysfs
+via:
+
+**temp[1-9]_enable**
+
+If labels were specified in Devicetree, additional sysfs files will
+be present:
+
+**temp[1-9]_label**
+
+The update interval is configurable with the following sysfs attribute.
+
+**update_interval**
diff --git a/Documentation/hwmon/xdpe12284.rst b/Documentation/hwmon/xdpe12284.rst
index 67d1f87808e5..a224dc74ad35 100644
--- a/Documentation/hwmon/xdpe12284.rst
+++ b/Documentation/hwmon/xdpe12284.rst
@@ -5,6 +5,10 @@ Kernel driver xdpe122
Supported chips:
+ * Infineon XDPE11280
+
+ Prefix: 'xdpe11280'
+
* Infineon XDPE12254
Prefix: 'xdpe12254'
@@ -20,10 +24,10 @@ Authors:
Description
-----------
-This driver implements support for Infineon Multi-phase XDPE122 family
-dual loop voltage regulators.
-The family includes XDPE12284 and XDPE12254 devices.
-The devices from this family complaint with:
+This driver implements support for Infineon Multi-phase XDPE112 and XDPE122
+family dual loop voltage regulators.
+These families include XDPE11280, XDPE12284 and XDPE12254 devices.
+The devices from this family compliant with:
- Intel VR13 and VR13HC rev 1.3, IMVP8 rev 1.2 and IMPVP9 rev 1.3 DC-DC
converter specification.
diff --git a/Documentation/locking/locktypes.rst b/Documentation/locking/locktypes.rst
index 4fd7b70fcde1..bfa75ea1b66a 100644
--- a/Documentation/locking/locktypes.rst
+++ b/Documentation/locking/locktypes.rst
@@ -247,7 +247,7 @@ based on rt_mutex which changes the semantics:
Non-PREEMPT_RT kernels disable preemption to get this effect.
PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
- preemption disabled. The lock disables softirq handlers and also
+ preemption enabled. The lock disables softirq handlers and also
prevents reentrancy due to task preemption.
PREEMPT_RT kernels preserve all other spinlock_t semantics:
diff --git a/Documentation/process/applying-patches.rst b/Documentation/process/applying-patches.rst
index c2121c1e55d7..c269f5e1a0a3 100644
--- a/Documentation/process/applying-patches.rst
+++ b/Documentation/process/applying-patches.rst
@@ -249,6 +249,10 @@ The 5.x.y (-stable) and 5.x patches live at
https://www.kernel.org/pub/linux/kernel/v5.x/
+The 5.x.y incremental patches live at
+
+ https://www.kernel.org/pub/linux/kernel/v5.x/incr/
+
The -rc patches are not stored on the webserver but are generated on
demand from git tags such as
@@ -308,12 +312,11 @@ versions.
If no 5.x.y kernel is available, then the highest numbered 5.x kernel is
the current stable kernel.
-.. note::
+The -stable team provides normal as well as incremental patches. Below is
+how to apply these patches.
- The -stable team usually do make incremental patches available as well
- as patches against the latest mainline release, but I only cover the
- non-incremental ones below. The incremental ones can be found at
- https://www.kernel.org/pub/linux/kernel/v5.x/incr/
+Normal patches
+~~~~~~~~~~~~~~
These patches are not incremental, meaning that for example the 5.7.3
patch does not apply on top of the 5.7.2 kernel source, but rather on top
@@ -331,6 +334,21 @@ Here's a small example::
$ cd ..
$ mv linux-5.7.2 linux-5.7.3 # rename the kernel source dir
+Incremental patches
+~~~~~~~~~~~~~~~~~~~
+
+Incremental patches are different: instead of being applied on top
+of base 5.x kernel, they are applied on top of previous stable kernel
+(5.x.y-1).
+
+Here's the example to apply these::
+
+ $ cd ~/linux-5.7.2 # change to the kernel source dir
+ $ patch -p1 < ../patch-5.7.2-3 # apply the new 5.7.3 patch
+ $ cd ..
+ $ mv linux-5.7.2 linux-5.7.3 # rename the kernel source dir
+
+
The -rc kernels
===============
diff --git a/Documentation/process/deprecated.rst b/Documentation/process/deprecated.rst
index 388cb19f5dbb..a6e36d9c3d14 100644
--- a/Documentation/process/deprecated.rst
+++ b/Documentation/process/deprecated.rst
@@ -71,6 +71,9 @@ Instead, the 2-factor form of the allocator should be used::
foo = kmalloc_array(count, size, GFP_KERNEL);
+Specifically, kmalloc() can be replaced with kmalloc_array(), and
+kzalloc() can be replaced with kcalloc().
+
If no 2-factor form is available, the saturate-on-overflow helpers should
be used::
@@ -91,9 +94,20 @@ Instead, use the helper::
array usage and switch to a `flexible array member
<#zero-length-and-one-element-arrays>`_ instead.
-See array_size(), array3_size(), and struct_size(),
-for more details as well as the related check_add_overflow() and
-check_mul_overflow() family of functions.
+For other calculations, please compose the use of the size_mul(),
+size_add(), and size_sub() helpers. For example, in the case of::
+
+ foo = krealloc(current_size + chunk_size * (count - 3), GFP_KERNEL);
+
+Instead, use the helpers::
+
+ foo = krealloc(size_add(current_size,
+ size_mul(chunk_size,
+ size_sub(count, 3))), GFP_KERNEL);
+
+For more details, also see array3_size() and flex_array_size(),
+as well as the related check_mul_overflow(), check_add_overflow(),
+check_sub_overflow(), and check_shl_overflow() family of functions.
simple_strtol(), simple_strtoll(), simple_strtoul(), simple_strtoull()
----------------------------------------------------------------------
diff --git a/Documentation/process/handling-regressions.rst b/Documentation/process/handling-regressions.rst
new file mode 100644
index 000000000000..abb741b1aeee
--- /dev/null
+++ b/Documentation/process/handling-regressions.rst
@@ -0,0 +1,746 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+.. See the bottom of this file for additional redistribution information.
+
+Handling regressions
+++++++++++++++++++++
+
+*We don't cause regressions* -- this document describes what this "first rule of
+Linux kernel development" means in practice for developers. It complements
+Documentation/admin-guide/reporting-regressions.rst, which covers the topic from a
+user's point of view; if you never read that text, go and at least skim over it
+before continuing here.
+
+The important bits (aka "The TL;DR")
+====================================
+
+#. Ensure subscribers of the `regression mailing list <https://lore.kernel.org/regressions/>`_
+ (regressions@lists.linux.dev) quickly become aware of any new regression
+ report:
+
+ * When receiving a mailed report that did not CC the list, bring it into the
+ loop by immediately sending at least a brief "Reply-all" with the list
+ CCed.
+
+ * Forward or bounce any reports submitted in bug trackers to the list.
+
+#. Make the Linux kernel regression tracking bot "regzbot" track the issue (this
+ is optional, but recommended):
+
+ * For mailed reports, check if the reporter included a line like ``#regzbot
+ introduced v5.13..v5.14-rc1``. If not, send a reply (with the regressions
+ list in CC) containing a paragraph like the following, which tells regzbot
+ when the issue started to happen::
+
+ #regzbot ^introduced 1f2e3d4c5b6a
+
+ * When forwarding reports from a bug tracker to the regressions list (see
+ above), include a paragraph like the following::
+
+ #regzbot introduced: v5.13..v5.14-rc1
+ #regzbot from: Some N. Ice Human <some.human@example.com>
+ #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
+
+#. When submitting fixes for regressions, add "Link:" tags to the patch
+ description pointing to all places where the issue was reported, as
+ mandated by Documentation/process/submitting-patches.rst and
+ :ref:`Documentation/process/5.Posting.rst <development_posting>`.
+
+#. Try to fix regressions quickly once the culprit has been identified; fixes
+ for most regressions should be merged within two weeks, but some need to be
+ resolved within two or three days.
+
+
+All the details on Linux kernel regressions relevant for developers
+===================================================================
+
+
+The important basics in more detail
+-----------------------------------
+
+
+What to do when receiving regression reports
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ensure the Linux kernel's regression tracker and others subscribers of the
+`regression mailing list <https://lore.kernel.org/regressions/>`_
+(regressions@lists.linux.dev) become aware of any newly reported regression:
+
+ * When you receive a report by mail that did not CC the list, immediately bring
+ it into the loop by sending at least a brief "Reply-all" with the list CCed;
+ try to ensure it gets CCed again in case you reply to a reply that omitted
+ the list.
+
+ * If a report submitted in a bug tracker hits your Inbox, forward or bounce it
+ to the list. Consider checking the list archives beforehand, if the reporter
+ already forwarded the report as instructed by
+ Documentation/admin-guide/reporting-issues.rst.
+
+When doing either, consider making the Linux kernel regression tracking bot
+"regzbot" immediately start tracking the issue:
+
+ * For mailed reports, check if the reporter included a "regzbot command" like
+ ``#regzbot introduced 1f2e3d4c5b6a``. If not, send a reply (with the
+ regressions list in CC) with a paragraph like the following:::
+
+ #regzbot ^introduced: v5.13..v5.14-rc1
+
+ This tells regzbot the version range in which the issue started to happen;
+ you can specify a range using commit-ids as well or state a single commit-id
+ in case the reporter bisected the culprit.
+
+ Note the caret (^) before the "introduced": it tells regzbot to treat the
+ parent mail (the one you reply to) as the initial report for the regression
+ you want to see tracked; that's important, as regzbot will later look out
+ for patches with "Link:" tags pointing to the report in the archives on
+ lore.kernel.org.
+
+ * When forwarding a regressions reported to a bug tracker, include a paragraph
+ with these regzbot commands::
+
+ #regzbot introduced: 1f2e3d4c5b6a
+ #regzbot from: Some N. Ice Human <some.human@example.com>
+ #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
+
+ Regzbot will then automatically associate patches with the report that
+ contain "Link:" tags pointing to your mail or the mentioned ticket.
+
+What's important when fixing regressions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You don't need to do anything special when submitting fixes for regression, just
+remember to do what Documentation/process/submitting-patches.rst,
+:ref:`Documentation/process/5.Posting.rst <development_posting>`, and
+Documentation/process/stable-kernel-rules.rst already explain in more detail:
+
+ * Point to all places where the issue was reported using "Link:" tags::
+
+ Link: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
+ Link: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890
+
+ * Add a "Fixes:" tag to specify the commit causing the regression.
+
+ * If the culprit was merged in an earlier development cycle, explicitly mark
+ the fix for backporting using the ``Cc: stable@vger.kernel.org`` tag.
+
+All this is expected from you and important when it comes to regression, as
+these tags are of great value for everyone (you included) that might be looking
+into the issue weeks, months, or years later. These tags are also crucial for
+tools and scripts used by other kernel developers or Linux distributions; one of
+these tools is regzbot, which heavily relies on the "Link:" tags to associate
+reports for regression with changes resolving them.
+
+Prioritize work on fixing regressions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You should fix any reported regression as quickly as possible, to provide
+affected users with a solution in a timely manner and prevent more users from
+running into the issue; nevertheless developers need to take enough time and
+care to ensure regression fixes do not cause additional damage.
+
+In the end though, developers should give their best to prevent users from
+running into situations where a regression leaves them only three options: "run
+a kernel with a regression that seriously impacts usage", "continue running an
+outdated and thus potentially insecure kernel version for more than two weeks
+after a regression's culprit was identified", and "downgrade to a still
+supported kernel series that lack required features".
+
+How to realize this depends a lot on the situation. Here are a few rules of
+thumb for you, in order or importance:
+
+ * Prioritize work on handling regression reports and fixing regression over all
+ other Linux kernel work, unless the latter concerns acute security issues or
+ bugs causing data loss or damage.
+
+ * Always consider reverting the culprit commits and reapplying them later
+ together with necessary fixes, as this might be the least dangerous and
+ quickest way to fix a regression.
+
+ * Developers should handle regressions in all supported kernel series, but are
+ free to delegate the work to the stable team, if the issue probably at no
+ point in time occurred with mainline.
+
+ * Try to resolve any regressions introduced in the current development before
+ its end. If you fear a fix might be too risky to apply only days before a new
+ mainline release, let Linus decide: submit the fix separately to him as soon
+ as possible with the explanation of the situation. He then can make a call
+ and postpone the release if necessary, for example if multiple such changes
+ show up in his inbox.
+
+ * Address regressions in stable, longterm, or proper mainline releases with
+ more urgency than regressions in mainline pre-releases. That changes after
+ the release of the fifth pre-release, aka "-rc5": mainline then becomes as
+ important, to ensure all the improvements and fixes are ideally tested
+ together for at least one week before Linus releases a new mainline version.
+
+ * Fix regressions within two or three days, if they are critical for some
+ reason -- for example, if the issue is likely to affect many users of the
+ kernel series in question on all or certain architectures. Note, this
+ includes mainline, as issues like compile errors otherwise might prevent many
+ testers or continuous integration systems from testing the series.
+
+ * Aim to fix regressions within one week after the culprit was identified, if
+ the issue was introduced in either:
+
+ * a recent stable/longterm release
+
+ * the development cycle of the latest proper mainline release
+
+ In the latter case (say Linux v5.14), try to address regressions even
+ quicker, if the stable series for the predecessor (v5.13) will be abandoned
+ soon or already was stamped "End-of-Life" (EOL) -- this usually happens about
+ three to four weeks after a new mainline release.
+
+ * Try to fix all other regressions within two weeks after the culprit was
+ found. Two or three additional weeks are acceptable for performance
+ regressions and other issues which are annoying, but don't prevent anyone
+ from running Linux (unless it's an issue in the current development cycle,
+ as those should ideally be addressed before the release). A few weeks in
+ total are acceptable if a regression can only be fixed with a risky change
+ and at the same time is affecting only a few users; as much time is
+ also okay if the regression is already present in the second newest longterm
+ kernel series.
+
+Note: The aforementioned time frames for resolving regressions are meant to
+include getting the fix tested, reviewed, and merged into mainline, ideally with
+the fix being in linux-next at least briefly. This leads to delays you need to
+account for.
+
+Subsystem maintainers are expected to assist in reaching those periods by doing
+timely reviews and quick handling of accepted patches. They thus might have to
+send git-pull requests earlier or more often than usual; depending on the fix,
+it might even be acceptable to skip testing in linux-next. Especially fixes for
+regressions in stable and longterm kernels need to be handled quickly, as fixes
+need to be merged in mainline before they can be backported to older series.
+
+
+More aspects regarding regressions developers should be aware of
+----------------------------------------------------------------
+
+
+How to deal with changes where a risk of regression is known
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Evaluate how big the risk of regressions is, for example by performing a code
+search in Linux distributions and Git forges. Also consider asking other
+developers or projects likely to be affected to evaluate or even test the
+proposed change; if problems surface, maybe some solution acceptable for all
+can be found.
+
+If the risk of regressions in the end seems to be relatively small, go ahead
+with the change, but let all involved parties know about the risk. Hence, make
+sure your patch description makes this aspect obvious. Once the change is
+merged, tell the Linux kernel's regression tracker and the regressions mailing
+list about the risk, so everyone has the change on the radar in case reports
+trickle in. Depending on the risk, you also might want to ask the subsystem
+maintainer to mention the issue in his mainline pull request.
+
+What else is there to known about regressions?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Check out Documentation/admin-guide/reporting-regressions.rst, it covers a lot
+of other aspects you want might want to be aware of:
+
+ * the purpose of the "no regressions rule"
+
+ * what issues actually qualify as regression
+
+ * who's in charge for finding the root cause of a regression
+
+ * how to handle tricky situations, e.g. when a regression is caused by a
+ security fix or when fixing a regression might cause another one
+
+Whom to ask for advice when it comes to regressions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
+CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
+issue might better be dealt with in private, feel free to omit the list.
+
+
+More about regression tracking and regzbot
+------------------------------------------
+
+
+Why the Linux kernel has a regression tracker, and why is regzbot used?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Rules like "no regressions" need someone to ensure they are followed, otherwise
+they are broken either accidentally or on purpose. History has shown this to be
+true for the Linux kernel as well. That's why Thorsten Leemhuis volunteered to
+keep an eye on things as the Linux kernel's regression tracker, who's
+occasionally helped by other people. Neither of them are paid to do this,
+that's why regression tracking is done on a best effort basis.
+
+Earlier attempts to manually track regressions have shown it's an exhausting and
+frustrating work, which is why they were abandoned after a while. To prevent
+this from happening again, Thorsten developed regzbot to facilitate the work,
+with the long term goal to automate regression tracking as much as possible for
+everyone involved.
+
+How does regression tracking work with regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bot watches for replies to reports of tracked regressions. Additionally,
+it's looking out for posted or committed patches referencing such reports
+with "Link:" tags; replies to such patch postings are tracked as well.
+Combined this data provides good insights into the current state of the fixing
+process.
+
+Regzbot tries to do its job with as little overhead as possible for both
+reporters and developers. In fact, only reporters are burdened with an extra
+duty: they need to tell regzbot about the regression report using the ``#regzbot
+introduced`` command outlined above; if they don't do that, someone else can
+take care of that using ``#regzbot ^introduced``.
+
+For developers there normally is no extra work involved, they just need to make
+sure to do something that was expected long before regzbot came to light: add
+"Link:" tags to the patch description pointing to all reports about the issue
+fixed.
+
+Do I have to use regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's in the interest of everyone if you do, as kernel maintainers like Linus
+Torvalds partly rely on regzbot's tracking in their work -- for example when
+deciding to release a new version or extend the development phase. For this they
+need to be aware of all unfixed regression; to do that, Linus is known to look
+into the weekly reports sent by regzbot.
+
+Do I have to tell regzbot about every regression I stumble upon?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ideally yes: we are all humans and easily forget problems when something more
+important unexpectedly comes up -- for example a bigger problem in the Linux
+kernel or something in real life that's keeping us away from keyboards for a
+while. Hence, it's best to tell regzbot about every regression, except when you
+immediately write a fix and commit it to a tree regularly merged to the affected
+kernel series.
+
+How to see which regressions regzbot tracks currently?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_
+for the latest info; alternatively, `search for the latest regression report
+<https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
+which regzbot normally sends out once a week on Sunday evening (UTC), which is a
+few hours before Linus usually publishes new (pre-)releases.
+
+What places is regzbot monitoring?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Regzbot is watching the most important Linux mailing lists as well as the git
+repositories of linux-next, mainline, and stable/longterm.
+
+What kind of issues are supposed to be tracked by regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bot is meant to track regressions, hence please don't involve regzbot for
+regular issues. But it's okay for the Linux kernel's regression tracker if you
+use regzbot to track severe issues, like reports about hangs, corrupted data,
+or internal errors (Panic, Oops, BUG(), warning, ...).
+
+Can I add regressions found by CI systems to regzbot's tracking?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Feel free to do so, if the particular regression likely has impact on practical
+use cases and thus might be noticed by users; hence, please don't involve
+regzbot for theoretical regressions unlikely to show themselves in real world
+usage.
+
+How to interact with regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By using a 'regzbot command' in a direct or indirect reply to the mail with the
+regression report. These commands need to be in their own paragraph (IOW: they
+need to be separated from the rest of the mail using blank lines).
+
+One such command is ``#regzbot introduced <version or commit>``, which makes
+regzbot consider your mail as a regressions report added to the tracking, as
+already described above; ``#regzbot ^introduced <version or commit>`` is another
+such command, which makes regzbot consider the parent mail as a report for a
+regression which it starts to track.
+
+Once one of those two commands has been utilized, other regzbot commands can be
+used in direct or indirect replies to the report. You can write them below one
+of the `introduced` commands or in replies to the mail that used one of them
+or itself is a reply to that mail:
+
+ * Set or update the title::
+
+ #regzbot title: foo
+
+ * Monitor a discussion or bugzilla.kernel.org ticket where additions aspects of
+ the issue or a fix are discussed -- for example the posting of a patch fixing
+ the regression::
+
+ #regzbot monitor: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
+
+ Monitoring only works for lore.kernel.org and bugzilla.kernel.org; regzbot
+ will consider all messages in that thread or ticket as related to the fixing
+ process.
+
+ * Point to a place with further details of interest, like a mailing list post
+ or a ticket in a bug tracker that are slightly related, but about a different
+ topic::
+
+ #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
+
+ * Mark a regression as fixed by a commit that is heading upstream or already
+ landed::
+
+ #regzbot fixed-by: 1f2e3d4c5d
+
+ * Mark a regression as a duplicate of another one already tracked by regzbot::
+
+ #regzbot dup-of: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
+
+ * Mark a regression as invalid::
+
+ #regzbot invalid: wasn't a regression, problem has always existed
+
+Is there more to tell about regzbot and its commands?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+More detailed and up-to-date information about the Linux
+kernel's regression tracking bot can be found on its
+`project page <https://gitlab.com/knurd42/regzbot>`_, which among others
+contains a `getting started guide <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_
+and `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
+which both cover more details than the above section.
+
+Quotes from Linus about regression
+----------------------------------
+
+Find below a few real life examples of how Linus Torvalds expects regressions to
+be handled:
+
+ * From `2017-10-26 (1/2)
+ <https://lore.kernel.org/lkml/CA+55aFwiiQYJ+YoLKCXjN_beDVfu38mg=Ggg5LFOcqHE8Qi7Zw@mail.gmail.com/>`_::
+
+ If you break existing user space setups THAT IS A REGRESSION.
+
+ It's not ok to say "but we'll fix the user space setup".
+
+ Really. NOT OK.
+
+ [...]
+
+ The first rule is:
+
+ - we don't cause regressions
+
+ and the corollary is that when regressions *do* occur, we admit to
+ them and fix them, instead of blaming user space.
+
+ The fact that you have apparently been denying the regression now for
+ three weeks means that I will revert, and I will stop pulling apparmor
+ requests until the people involved understand how kernel development
+ is done.
+
+ * From `2017-10-26 (2/2)
+ <https://lore.kernel.org/lkml/CA+55aFxW7NMAMvYhkvz1UPbUTUJewRt6Yb51QAx5RtrWOwjebg@mail.gmail.com/>`_::
+
+ People should basically always feel like they can update their kernel
+ and simply not have to worry about it.
+
+ I refuse to introduce "you can only update the kernel if you also
+ update that other program" kind of limitations. If the kernel used to
+ work for you, the rule is that it continues to work for you.
+
+ There have been exceptions, but they are few and far between, and they
+ generally have some major and fundamental reasons for having happened,
+ that were basically entirely unavoidable, and people _tried_hard_ to
+ avoid them. Maybe we can't practically support the hardware any more
+ after it is decades old and nobody uses it with modern kernels any
+ more. Maybe there's a serious security issue with how we did things,
+ and people actually depended on that fundamentally broken model. Maybe
+ there was some fundamental other breakage that just _had_ to have a
+ flag day for very core and fundamental reasons.
+
+ And notice that this is very much about *breaking* peoples environments.
+
+ Behavioral changes happen, and maybe we don't even support some
+ feature any more. There's a number of fields in /proc/<pid>/stat that
+ are printed out as zeroes, simply because they don't even *exist* in
+ the kernel any more, or because showing them was a mistake (typically
+ an information leak). But the numbers got replaced by zeroes, so that
+ the code that used to parse the fields still works. The user might not
+ see everything they used to see, and so behavior is clearly different,
+ but things still _work_, even if they might no longer show sensitive
+ (or no longer relevant) information.
+
+ But if something actually breaks, then the change must get fixed or
+ reverted. And it gets fixed in the *kernel*. Not by saying "well, fix
+ your user space then". It was a kernel change that exposed the
+ problem, it needs to be the kernel that corrects for it, because we
+ have a "upgrade in place" model. We don't have a "upgrade with new
+ user space".
+
+ And I seriously will refuse to take code from people who do not
+ understand and honor this very simple rule.
+
+ This rule is also not going to change.
+
+ And yes, I realize that the kernel is "special" in this respect. I'm
+ proud of it.
+
+ I have seen, and can point to, lots of projects that go "We need to
+ break that use case in order to make progress" or "you relied on
+ undocumented behavior, it sucks to be you" or "there's a better way to
+ do what you want to do, and you have to change to that new better
+ way", and I simply don't think that's acceptable outside of very early
+ alpha releases that have experimental users that know what they signed
+ up for. The kernel hasn't been in that situation for the last two
+ decades.
+
+ We do API breakage _inside_ the kernel all the time. We will fix
+ internal problems by saying "you now need to do XYZ", but then it's
+ about internal kernel API's, and the people who do that then also
+ obviously have to fix up all the in-kernel users of that API. Nobody
+ can say "I now broke the API you used, and now _you_ need to fix it
+ up". Whoever broke something gets to fix it too.
+
+ And we simply do not break user space.
+
+ * From `2020-05-21
+ <https://lore.kernel.org/all/CAHk-=wiVi7mSrsMP=fLXQrXK_UimybW=ziLOwSzFTtoXUacWVQ@mail.gmail.com/>`_::
+
+ The rules about regressions have never been about any kind of
+ documented behavior, or where the code lives.
+
+ The rules about regressions are always about "breaks user workflow".
+
+ Users are literally the _only_ thing that matters.
+
+ No amount of "you shouldn't have used this" or "that behavior was
+ undefined, it's your own fault your app broke" or "that used to work
+ simply because of a kernel bug" is at all relevant.
+
+ Now, reality is never entirely black-and-white. So we've had things
+ like "serious security issue" etc that just forces us to make changes
+ that may break user space. But even then the rule is that we don't
+ really have other options that would allow things to continue.
+
+ And obviously, if users take years to even notice that something
+ broke, or if we have sane ways to work around the breakage that
+ doesn't make for too much trouble for users (ie "ok, there are a
+ handful of users, and they can use a kernel command line to work
+ around it" kind of things) we've also been a bit less strict.
+
+ But no, "that was documented to be broken" (whether it's because the
+ code was in staging or because the man-page said something else) is
+ irrelevant. If staging code is so useful that people end up using it,
+ that means that it's basically regular kernel code with a flag saying
+ "please clean this up".
+
+ The other side of the coin is that people who talk about "API
+ stability" are entirely wrong. API's don't matter either. You can make
+ any changes to an API you like - as long as nobody notices.
+
+ Again, the regression rule is not about documentation, not about
+ API's, and not about the phase of the moon.
+
+ It's entirely about "we caused problems for user space that used to work".
+
+ * From `2017-11-05
+ <https://lore.kernel.org/all/CA+55aFzUvbGjD8nQ-+3oiMBx14c_6zOj2n7KLN3UsJ-qsd4Dcw@mail.gmail.com/>`_::
+
+ And our regression rule has never been "behavior doesn't change".
+ That would mean that we could never make any changes at all.
+
+ For example, we do things like add new error handling etc all the
+ time, which we then sometimes even add tests for in our kselftest
+ directory.
+
+ So clearly behavior changes all the time and we don't consider that a
+ regression per se.
+
+ The rule for a regression for the kernel is that some real user
+ workflow breaks. Not some test. Not a "look, I used to be able to do
+ X, now I can't".
+
+ * From `2018-08-03
+ <https://lore.kernel.org/all/CA+55aFwWZX=CXmWDTkDGb36kf12XmTehmQjbiMPCqCRG2hi9kw@mail.gmail.com/>`_::
+
+ YOU ARE MISSING THE #1 KERNEL RULE.
+
+ We do not regress, and we do not regress exactly because your are 100% wrong.
+
+ And the reason you state for your opinion is in fact exactly *WHY* you
+ are wrong.
+
+ Your "good reasons" are pure and utter garbage.
+
+ The whole point of "we do not regress" is so that people can upgrade
+ the kernel and never have to worry about it.
+
+ > Kernel had a bug which has been fixed
+
+ That is *ENTIRELY* immaterial.
+
+ Guys, whether something was buggy or not DOES NOT MATTER.
+
+ Why?
+
+ Bugs happen. That's a fact of life. Arguing that "we had to break
+ something because we were fixing a bug" is completely insane. We fix
+ tens of bugs every single day, thinking that "fixing a bug" means that
+ we can break something is simply NOT TRUE.
+
+ So bugs simply aren't even relevant to the discussion. They happen,
+ they get found, they get fixed, and it has nothing to do with "we
+ break users".
+
+ Because the only thing that matters IS THE USER.
+
+ How hard is that to understand?
+
+ Anybody who uses "but it was buggy" as an argument is entirely missing
+ the point. As far as the USER was concerned, it wasn't buggy - it
+ worked for him/her.
+
+ Maybe it worked *because* the user had taken the bug into account,
+ maybe it worked because the user didn't notice - again, it doesn't
+ matter. It worked for the user.
+
+ Breaking a user workflow for a "bug" is absolutely the WORST reason
+ for breakage you can imagine.
+
+ It's basically saying "I took something that worked, and I broke it,
+ but now it's better". Do you not see how f*cking insane that statement
+ is?
+
+ And without users, your program is not a program, it's a pointless
+ piece of code that you might as well throw away.
+
+ Seriously. This is *why* the #1 rule for kernel development is "we
+ don't break users". Because "I fixed a bug" is absolutely NOT AN
+ ARGUMENT if that bug fix broke a user setup. You actually introduced a
+ MUCH BIGGER bug by "fixing" something that the user clearly didn't
+ even care about.
+
+ And dammit, we upgrade the kernel ALL THE TIME without upgrading any
+ other programs at all. It is absolutely required, because flag-days
+ and dependencies are horribly bad.
+
+ And it is also required simply because I as a kernel developer do not
+ upgrade random other tools that I don't even care about as I develop
+ the kernel, and I want any of my users to feel safe doing the same
+ time.
+
+ So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel
+ without upgrading some other random binary, then we have a problem.
+
+ * From `2021-06-05
+ <https://lore.kernel.org/all/CAHk-=wiUVqHN76YUwhkjZzwTdjMMJf_zN4+u7vEJjmEGh3recw@mail.gmail.com/>`_::
+
+ THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.
+
+ Honestly, security people need to understand that "not working" is not
+ a success case of security. It's a failure case.
+
+ Yes, "not working" may be secure. But security in that case is *pointless*.
+
+ * From `2011-05-06 (1/3)
+ <https://lore.kernel.org/all/BANLkTim9YvResB+PwRp7QTK-a5VNg2PvmQ@mail.gmail.com/>`_::
+
+ Binary compatibility is more important.
+
+ And if binaries don't use the interface to parse the format (or just
+ parse it wrongly - see the fairly recent example of adding uuid's to
+ /proc/self/mountinfo), then it's a regression.
+
+ And regressions get reverted, unless there are security issues or
+ similar that makes us go "Oh Gods, we really have to break things".
+
+ I don't understand why this simple logic is so hard for some kernel
+ developers to understand. Reality matters. Your personal wishes matter
+ NOT AT ALL.
+
+ If you made an interface that can be used without parsing the
+ interface description, then we're stuck with the interface. Theory
+ simply doesn't matter.
+
+ You could help fix the tools, and try to avoid the compatibility
+ issues that way. There aren't that many of them.
+
+ From `2011-05-06 (2/3)
+ <https://lore.kernel.org/all/BANLkTi=KVXjKR82sqsz4gwjr+E0vtqCmvA@mail.gmail.com/>`_::
+
+ it's clearly NOT an internal tracepoint. By definition. It's being
+ used by powertop.
+
+ From `2011-05-06 (3/3)
+ <https://lore.kernel.org/all/BANLkTinazaXRdGovYL7rRVp+j6HbJ7pzhg@mail.gmail.com/>`_::
+
+ We have programs that use that ABI and thus it's a regression if they break.
+
+ * From `2012-07-06 <https://lore.kernel.org/all/CA+55aFwnLJ+0sjx92EGREGTWOx84wwKaraSzpTNJwPVV8edw8g@mail.gmail.com/>`_::
+
+ > Now this got me wondering if Debian _unstable_ actually qualifies as a
+ > standard distro userspace.
+
+ Oh, if the kernel breaks some standard user space, that counts. Tons
+ of people run Debian unstable
+
+ * From `2019-09-15
+ <https://lore.kernel.org/lkml/CAHk-=wiP4K8DRJWsCo=20hn_6054xBamGKF2kPgUzpB5aMaofA@mail.gmail.com/>`_::
+
+ One _particularly_ last-minute revert is the top-most commit (ignoring
+ the version change itself) done just before the release, and while
+ it's very annoying, it's perhaps also instructive.
+
+ What's instructive about it is that I reverted a commit that wasn't
+ actually buggy. In fact, it was doing exactly what it set out to do,
+ and did it very well. In fact it did it _so_ well that the much
+ improved IO patterns it caused then ended up revealing a user-visible
+ regression due to a real bug in a completely unrelated area.
+
+ The actual details of that regression are not the reason I point that
+ revert out as instructive, though. It's more that it's an instructive
+ example of what counts as a regression, and what the whole "no
+ regressions" kernel rule means. The reverted commit didn't change any
+ API's, and it didn't introduce any new bugs. But it ended up exposing
+ another problem, and as such caused a kernel upgrade to fail for a
+ user. So it got reverted.
+
+ The point here being that we revert based on user-reported _behavior_,
+ not based on some "it changes the ABI" or "it caused a bug" concept.
+ The problem was really pre-existing, and it just didn't happen to
+ trigger before. The better IO patterns introduced by the change just
+ happened to expose an old bug, and people had grown to depend on the
+ previously benign behavior of that old issue.
+
+ And never fear, we'll re-introduce the fix that improved on the IO
+ patterns once we've decided just how to handle the fact that we had a
+ bad interaction with an interface that people had then just happened
+ to rely on incidental behavior for before. It's just that we'll have
+ to hash through how to do that (there are no less than three different
+ patches by three different developers being discussed, and there might
+ be more coming...). In the meantime, I reverted the thing that exposed
+ the problem to users for this release, even if I hope it will be
+ re-introduced (perhaps even backported as a stable patch) once we have
+ consensus about the issue it exposed.
+
+ Take-away from the whole thing: it's not about whether you change the
+ kernel-userspace ABI, or fix a bug, or about whether the old code
+ "should never have worked in the first place". It's about whether
+ something breaks existing users' workflow.
+
+ Anyway, that was my little aside on the whole regression thing. Since
+ it's that "first rule of kernel programming", I felt it is perhaps
+ worth just bringing it up every once in a while
+
+..
+ end-of-content
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use "The Linux kernel developers" for author attribution and link
+ this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/process/handling-regressions.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/process/index.rst b/Documentation/process/index.rst
index 9f1b88492bb3..3587dae4d0ef 100644
--- a/Documentation/process/index.rst
+++ b/Documentation/process/index.rst
@@ -25,6 +25,7 @@ Below are the essential guides that every developer should read.
code-of-conduct-interpretation
development-process
submitting-patches
+ handling-regressions
programming-language
coding-style
maintainer-handbooks
@@ -48,6 +49,7 @@ Other guides to the community that are of interest to most developers are:
deprecated
embargoed-hardware-issues
maintainers
+ researcher-guidelines
These are some overall technical guides that have been put here for now for
lack of a better place.
diff --git a/Documentation/process/researcher-guidelines.rst b/Documentation/process/researcher-guidelines.rst
new file mode 100644
index 000000000000..afc944e0e898
--- /dev/null
+++ b/Documentation/process/researcher-guidelines.rst
@@ -0,0 +1,143 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _researcher_guidelines:
+
+Researcher Guidelines
++++++++++++++++++++++
+
+The Linux kernel community welcomes transparent research on the Linux
+kernel, the activities involved in producing it, and any other byproducts
+of its development. Linux benefits greatly from this kind of research, and
+most aspects of Linux are driven by research in one form or another.
+
+The community greatly appreciates if researchers can share preliminary
+findings before making their results public, especially if such research
+involves security. Getting involved early helps both improve the quality
+of research and ability for Linux to improve from it. In any case,
+sharing open access copies of the published research with the community
+is recommended.
+
+This document seeks to clarify what the Linux kernel community considers
+acceptable and non-acceptable practices when conducting such research. At
+the very least, such research and related activities should follow
+standard research ethics rules. For more background on research ethics
+generally, ethics in technology, and research of developer communities
+in particular, see:
+
+* `History of Research Ethics <https://www.unlv.edu/research/ORI-HSR/history-ethics>`_
+* `IEEE Ethics <https://www.ieee.org/about/ethics/index.html>`_
+* `Developer and Researcher Views on the Ethics of Experiments on Open-Source Projects <https://arxiv.org/pdf/2112.13217.pdf>`_
+
+The Linux kernel community expects that everyone interacting with the
+project is participating in good faith to make Linux better. Research on
+any publicly-available artifact (including, but not limited to source
+code) produced by the Linux kernel community is welcome, though research
+on developers must be distinctly opt-in.
+
+Passive research that is based entirely on publicly available sources,
+including posts to public mailing lists and commits to public
+repositories, is clearly permissible. Though, as with any research,
+standard ethics must still be followed.
+
+Active research on developer behavior, however, must be done with the
+explicit agreement of, and full disclosure to, the individual developers
+involved. Developers cannot be interacted with/experimented on without
+consent; this, too, is standard research ethics.
+
+To help clarify: sending patches to developers *is* interacting
+with them, but they have already consented to receiving *good faith
+contributions*. Sending intentionally flawed/vulnerable patches or
+contributing misleading information to discussions is not consented
+to. Such communication can be damaging to the developer (e.g. draining
+time, effort, and morale) and damaging to the project by eroding
+the entire developer community's trust in the contributor (and the
+contributor's organization as a whole), undermining efforts to provide
+constructive feedback to contributors, and putting end users at risk of
+software flaws.
+
+Participation in the development of Linux itself by researchers, as
+with anyone, is welcomed and encouraged. Research into Linux code is
+a common practice, especially when it comes to developing or running
+analysis tools that produce actionable results.
+
+When engaging with the developer community, sending a patch has
+traditionally been the best way to make an impact. Linux already has
+plenty of known bugs -- what's much more helpful is having vetted fixes.
+Before contributing, carefully read the appropriate documentation:
+
+* Documentation/process/development-process.rst
+* Documentation/process/submitting-patches.rst
+* Documentation/admin-guide/reporting-issues.rst
+* Documentation/admin-guide/security-bugs.rst
+
+Then send a patch (including a commit log with all the details listed
+below) and follow up on any feedback from other developers.
+
+When sending patches produced from research, the commit logs should
+contain at least the following details, so that developers have
+appropriate context for understanding the contribution. Answer:
+
+* What is the specific problem that has been found?
+* How could the problem be reached on a running system?
+* What effect would encountering the problem have on the system?
+* How was the problem found? Specifically include details about any
+ testing, static or dynamic analysis programs, and any other tools or
+ methods used to perform the work.
+* Which version of Linux was the problem found on? Using the most recent
+ release or a recent linux-next branch is strongly preferred (see
+ Documentation/process/howto.rst).
+* What was changed to fix the problem, and why it is believed to be correct?
+* How was the change build tested and run-time tested?
+* What prior commit does this change fix? This should go in a "Fixes:"
+ tag as the documentation describes.
+* Who else has reviewed this patch? This should go in appropriate
+ "Reviewed-by:" tags; see below.
+
+For example::
+
+ From: Author <author@email>
+ Subject: [PATCH] drivers/foo_bar: Add missing kfree()
+
+ The error path in foo_bar driver does not correctly free the allocated
+ struct foo_bar_info. This can happen if the attached foo_bar device
+ rejects the initialization packets sent during foo_bar_probe(). This
+ would result in a 64 byte slab memory leak once per device attach,
+ wasting memory resources over time.
+
+ This flaw was found using an experimental static analysis tool we are
+ developing, LeakMagic[1], which reported the following warning when
+ analyzing the v5.15 kernel release:
+
+ path/to/foo_bar.c:187: missing kfree() call?
+
+ Add the missing kfree() to the error path. No other references to
+ this memory exist outside the probe function, so this is the only
+ place it can be freed.
+
+ x86_64 and arm64 defconfig builds with CONFIG_FOO_BAR=y using GCC
+ 11.2 show no new warnings, and LeakMagic no longer warns about this
+ code path. As we don't have a FooBar device to test with, no runtime
+ testing was able to be performed.
+
+ [1] https://url/to/leakmagic/details
+
+ Reported-by: Researcher <researcher@email>
+ Fixes: aaaabbbbccccdddd ("Introduce support for FooBar")
+ Signed-off-by: Author <author@email>
+ Reviewed-by: Reviewer <reviewer@email>
+
+If you are a first time contributor it is recommended that the patch
+itself be vetted by others privately before being posted to public lists.
+(This is required if you have been explicitly told your patches need
+more careful internal review.) These people are expected to have their
+"Reviewed-by" tag included in the resulting patch. Finding another
+developer familiar with Linux contribution, especially within your own
+organization, and having them help with reviews before sending them to
+the public mailing lists tends to significantly improve the quality of the
+resulting patches, and there by reduces the burden on other developers.
+
+If no one can be found to internally review patches and you need
+help finding such a person, or if you have any other questions
+related to this document and the developer community's expectations,
+please reach out to the private Technical Advisory Board mailing list:
+<tech-board@lists.linux-foundation.org>.
diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index 31ea120ce531..fb496b2ebfd3 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -495,7 +495,8 @@ Using Reported-by:, Tested-by:, Reviewed-by:, Suggested-by: and Fixes:
The Reported-by tag gives credit to people who find bugs and report them and it
hopefully inspires them to help us again in the future. Please note that if
the bug was reported in private, then ask for permission first before using the
-Reported-by tag.
+Reported-by tag. The tag is intended for bugs; please do not use it to credit
+feature requests.
A Tested-by: tag indicates that the patch has been successfully tested (in
some environment) by the person named. This tag informs maintainers that
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 88900aabdbf7..693eaac769a7 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -14,6 +14,7 @@ Linux Scheduler
sched-domains
sched-capacity
sched-energy
+ schedutil
sched-nice-design
sched-rt-group
sched-stats
diff --git a/Documentation/scheduler/sched-domains.rst b/Documentation/scheduler/sched-domains.rst
index 84dcdcd2911c..e57ad28301bd 100644
--- a/Documentation/scheduler/sched-domains.rst
+++ b/Documentation/scheduler/sched-domains.rst
@@ -37,10 +37,10 @@ rebalancing event for the current runqueue has arrived. The actual load
balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
in softirq context (SCHED_SOFTIRQ).
-The latter function takes two arguments: the current CPU and whether it was idle
-at the time the scheduler_tick() happened and iterates over all sched domains
-our CPU is on, starting from its base domain and going up the ->parent chain.
-While doing that, it checks to see if the current domain has exhausted its
+The latter function takes two arguments: the runqueue of current CPU and whether
+the CPU was idle at the time the scheduler_tick() happened and iterates over all
+sched domains our CPU is on, starting from its base domain and going up the ->parent
+chain. While doing that, it checks to see if the current domain has exhausted its
rebalance interval. If so, it runs load_balance() on that domain. It then checks
the parent sched_domain (if it exists), and the parent of the parent and so
forth.
diff --git a/Documentation/scheduler/schedutil.txt b/Documentation/scheduler/schedutil.rst
index 78f6b91e2291..32c7d69fc86c 100644
--- a/Documentation/scheduler/schedutil.txt
+++ b/Documentation/scheduler/schedutil.rst
@@ -1,11 +1,15 @@
+=========
+Schedutil
+=========
+.. note::
-NOTE; all this assumes a linear relation between frequency and work capacity,
-we know this is flawed, but it is the best workable approximation.
+ All this assumes a linear relation between frequency and work capacity,
+ we know this is flawed, but it is the best workable approximation.
PELT (Per Entity Load Tracking)
--------------------------------
+===============================
With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
@@ -38,8 +42,8 @@ while 'runnable' will increase to reflect the amount of contention.
For more detail see: kernel/sched/pelt.c
-Frequency- / CPU Invariance
----------------------------
+Frequency / CPU Invariance
+==========================
Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
@@ -47,7 +51,7 @@ a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
For simple DVFS architectures (where software is in full control) we trivially
-compute the ratio as:
+compute the ratio as::
f_cur
r_dvfs := -----
@@ -55,7 +59,7 @@ compute the ratio as:
For more dynamic systems where the hardware is in control of DVFS we use
hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
-For Intel specifically, we use:
+For Intel specifically, we use::
APERF
f_cur := ----- * P0
@@ -87,7 +91,7 @@ For more detail see:
UTIL_EST / UTIL_EST_FASTUP
---------------------------
+==========================
Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
@@ -106,7 +110,7 @@ For more detail see: kernel/sched/fair.c:util_est_dequeue()
UCLAMP
-------
+======
It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps an max aggregate of these clamps for all running tasks.
@@ -115,7 +119,7 @@ For more detail see: include/uapi/linux/sched/types.h
Schedutil / DVFS
-----------------
+================
Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
@@ -123,7 +127,7 @@ DVFS state.
The basis is the CPU runqueue's 'running' metric, which per the above it is
the frequency invariant utilization estimate of the CPU. From this we compute
-a desired frequency like:
+a desired frequency like::
max( running, util_est ); if UTIL_EST
u_cfs := { running; otherwise
@@ -135,7 +139,7 @@ a desired frequency like:
f_des := min( f_max, 1.25 u * f_max )
-XXX IO-wait; when the update is due to a task wakeup from IO-completion we
+XXX IO-wait: when the update is due to a task wakeup from IO-completion we
boost 'u' above.
This frequency is then used to select a P-state/OPP or directly munged into a
@@ -153,7 +157,7 @@ For more information see: kernel/sched/cpufreq_schedutil.c
NOTES
------
+=====
- On low-load scenarios, where DVFS is most relevant, the 'running' numbers
will closely reflect utilization.
diff --git a/Documentation/security/SCTP.rst b/Documentation/security/SCTP.rst
index d5fd6ccc3dcb..b73eb764a001 100644
--- a/Documentation/security/SCTP.rst
+++ b/Documentation/security/SCTP.rst
@@ -15,10 +15,7 @@ For security module support, three SCTP specific hooks have been implemented::
security_sctp_assoc_request()
security_sctp_bind_connect()
security_sctp_sk_clone()
-
-Also the following security hook has been utilised::
-
- security_inet_conn_established()
+ security_sctp_assoc_established()
The usage of these hooks are described below with the SELinux implementation
described in the `SCTP SELinux Support`_ chapter.
@@ -122,11 +119,12 @@ calls **sctp_peeloff**\(3).
@newsk - pointer to new sock structure.
-security_inet_conn_established()
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Called when a COOKIE ACK is received::
+security_sctp_assoc_established()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Called when a COOKIE ACK is received, and the peer secid will be
+saved into ``@asoc->peer_secid`` for client::
- @sk - pointer to sock structure.
+ @asoc - pointer to sctp association structure.
@skb - pointer to skbuff of the COOKIE ACK packet.
@@ -134,7 +132,7 @@ Security Hooks used for Association Establishment
-------------------------------------------------
The following diagram shows the use of ``security_sctp_bind_connect()``,
-``security_sctp_assoc_request()``, ``security_inet_conn_established()`` when
+``security_sctp_assoc_request()``, ``security_sctp_assoc_established()`` when
establishing an association.
::
@@ -172,7 +170,7 @@ establishing an association.
<------------------------------------------- COOKIE ACK
| |
sctp_sf_do_5_1E_ca |
- Call security_inet_conn_established() |
+ Call security_sctp_assoc_established() |
to set the peer label. |
| |
| If SCTP_SOCKET_TCP or peeled off
@@ -198,7 +196,7 @@ hooks with the SELinux specifics expanded below::
security_sctp_assoc_request()
security_sctp_bind_connect()
security_sctp_sk_clone()
- security_inet_conn_established()
+ security_sctp_assoc_established()
security_sctp_assoc_request()
@@ -271,12 +269,12 @@ sockets sid and peer sid to that contained in the ``@asoc sid`` and
@newsk - pointer to new sock structure.
-security_inet_conn_established()
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+security_sctp_assoc_established()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Called when a COOKIE ACK is received where it sets the connection's peer sid
to that in ``@skb``::
- @sk - pointer to sock structure.
+ @asoc - pointer to sctp association structure.
@skb - pointer to skbuff of the COOKIE ACK packet.
diff --git a/Documentation/security/keys/trusted-encrypted.rst b/Documentation/security/keys/trusted-encrypted.rst
index 80d5a5af62a1..f614dad7de12 100644
--- a/Documentation/security/keys/trusted-encrypted.rst
+++ b/Documentation/security/keys/trusted-encrypted.rst
@@ -107,12 +107,13 @@ Encrypted Keys
--------------
Encrypted keys do not depend on a trust source, and are faster, as they use AES
-for encryption/decryption. New keys are created from kernel-generated random
-numbers, and are encrypted/decrypted using a specified ‘master’ key. The
-‘master’ key can either be a trusted-key or user-key type. The main disadvantage
-of encrypted keys is that if they are not rooted in a trusted key, they are only
-as secure as the user key encrypting them. The master user key should therefore
-be loaded in as secure a way as possible, preferably early in boot.
+for encryption/decryption. New keys are created either from kernel-generated
+random numbers or user-provided decrypted data, and are encrypted/decrypted
+using a specified ‘master’ key. The ‘master’ key can either be a trusted-key or
+user-key type. The main disadvantage of encrypted keys is that if they are not
+rooted in a trusted key, they are only as secure as the user key encrypting
+them. The master user key should therefore be loaded in as secure a way as
+possible, preferably early in boot.
Usage
@@ -199,6 +200,8 @@ Usage::
keyctl add encrypted name "new [format] key-type:master-key-name keylen"
ring
+ keyctl add encrypted name "new [format] key-type:master-key-name keylen
+ decrypted-data" ring
keyctl add encrypted name "load hex_blob" ring
keyctl update keyid "update key-type:master-key-name"
@@ -303,6 +306,16 @@ Load an encrypted key "evm" from saved blob::
82dbbc55be2a44616e4959430436dc4f2a7a9659aa60bb4652aeb2120f149ed197c564e0
24717c64 5972dcb82ab2dde83376d82b2e3c09ffc
+Instantiate an encrypted key "evm" using user-provided decrypted data::
+
+ $ keyctl add encrypted evm "new default user:kmk 32 `cat evm_decrypted_data.blob`" @u
+ 794890253
+
+ $ keyctl print 794890253
+ default user:kmk 32 2375725ad57798846a9bbd240de8906f006e66c03af53b1b382d
+ bbc55be2a44616e4959430436dc4f2a7a9659aa60bb4652aeb2120f149ed197c564e0247
+ 17c64 5972dcb82ab2dde83376d82b2e3c09ffc
+
Other uses for trusted and encrypted keys, such as for disk and file encryption
are anticipated. In particular the new format 'ecryptfs' has been defined
in order to use encrypted keys to mount an eCryptfs filesystem. More details
diff --git a/Documentation/sphinx/kerneldoc-preamble.sty b/Documentation/sphinx/kerneldoc-preamble.sty
new file mode 100644
index 000000000000..9d0204dc38be
--- /dev/null
+++ b/Documentation/sphinx/kerneldoc-preamble.sty
@@ -0,0 +1,226 @@
+% -*- coding: utf-8 -*-
+% SPDX-License-Identifier: GPL-2.0
+%
+% LaTeX preamble for "make latexdocs" or "make pdfdocs" including:
+% - TOC width settings
+% - Setting of tabulary (\tymin)
+% - Headheight setting for fancyhdr
+% - Fontfamily settings for CJK (Chinese, Japanese, and Korean) translations
+%
+% Note on the suffix of .sty:
+% This is not implemented as a LaTeX style file, but as a file containing
+% plain LaTeX code to be included into preamble.
+% ".sty" is chosen because ".tex" would cause the build scripts to confuse
+% this file with a LaTeX main file.
+%
+% Copyright (C) 2022 Akira Yokosawa
+
+% Custom width parameters for TOC
+% - Redefine low-level commands defined in report.cls.
+% - Indent of 2 chars is preserved for ease of comparison.
+% Summary of changes from default params:
+% Width of page number (\@pnumwidth): 1.55em -> 2.7em
+% Width of chapter number: 1.5em -> 1.8em
+% Indent of section number: 1.5em -> 1.8em
+% Width of section number: 2.6em -> 3.2em
+% Indent of sebsection number: 4.1em -> 5em
+% Width of subsection number: 3.5em -> 4.3em
+%
+% These params can have 4 digit page counts, 2 digit chapter counts,
+% section counts of 4 digits + 1 period (e.g., 18.10), and subsection counts
+% of 5 digits + 2 periods (e.g., 18.7.13).
+\makeatletter
+%% Redefine \@pnumwidth (page number width)
+\renewcommand*\@pnumwidth{2.7em}
+%% Redefine \l@chapter (chapter list entry)
+\renewcommand*\l@chapter[2]{%
+ \ifnum \c@tocdepth >\m@ne
+ \addpenalty{-\@highpenalty}%
+ \vskip 1.0em \@plus\p@
+ \setlength\@tempdima{1.8em}%
+ \begingroup
+ \parindent \z@ \rightskip \@pnumwidth
+ \parfillskip -\@pnumwidth
+ \leavevmode \bfseries
+ \advance\leftskip\@tempdima
+ \hskip -\leftskip
+ #1\nobreak\hfil
+ \nobreak\hb@xt@\@pnumwidth{\hss #2%
+ \kern-\p@\kern\p@}\par
+ \penalty\@highpenalty
+ \endgroup
+ \fi}
+%% Redefine \l@section and \l@subsection
+\renewcommand*\l@section{\@dottedtocline{1}{1.8em}{3.2em}}
+\renewcommand*\l@subsection{\@dottedtocline{2}{5em}{4.3em}}
+\makeatother
+%% Sphinx < 1.8 doesn't have \sphinxtableofcontentshook
+\providecommand{\sphinxtableofcontentshook}{}
+%% Undefine it for compatibility with Sphinx 1.7.9
+\renewcommand{\sphinxtableofcontentshook}{} % Empty the hook
+
+% Prevent column squeezing of tabulary. \tymin is set by Sphinx as:
+% \setlength{\tymin}{3\fontcharwd\font`0 }
+% , which is too short.
+\setlength{\tymin}{20em}
+
+% Adjust \headheight for fancyhdr
+\addtolength{\headheight}{1.6pt}
+\addtolength{\topmargin}{-1.6pt}
+
+% Translations have Asian (CJK) characters which are only displayed if
+% xeCJK is used
+\IfFontExistsTF{Noto Sans CJK SC}{
+ % Load xeCJK when CJK font is available
+ \usepackage{xeCJK}
+ % Noto CJK fonts don't provide slant shape. [AutoFakeSlant] permits
+ % its emulation.
+ % Select KR variant at the beginning of each document so that quotation
+ % and apostorph symbols of half-width is used in TOC of Latin documents.
+ \IfFontExistsTF{Noto Serif CJK KR}{
+ \setCJKmainfont{Noto Serif CJK KR}[AutoFakeSlant]
+ }{
+ \setCJKmainfont{Noto Sans CJK KR}[AutoFakeSlant]
+ }
+ \setCJKsansfont{Noto Sans CJK KR}[AutoFakeSlant]
+ \setCJKmonofont{Noto Sans Mono CJK KR}[AutoFakeSlant]
+ % Teach xeCJK of half-width symbols
+ \xeCJKDeclareCharClass{HalfLeft}{`“,`‘}
+ \xeCJKDeclareCharClass{HalfRight}{`”,`’}
+ % CJK Language-specific font choices
+ %% for Simplified Chinese
+ \IfFontExistsTF{Noto Serif CJK SC}{
+ \newCJKfontfamily[SCmain]\scmain{Noto Serif CJK SC}[AutoFakeSlant]
+ \newCJKfontfamily[SCserif]\scserif{Noto Serif CJK SC}[AutoFakeSlant]
+ }{
+ \newCJKfontfamily[SCmain]\scmain{Noto Sans CJK SC}[AutoFakeSlant]
+ \newCJKfontfamily[SCserif]\scserif{Noto Sans CJK SC}[AutoFakeSlant]
+ }
+ \newCJKfontfamily[SCsans]\scsans{Noto Sans CJK SC}[AutoFakeSlant]
+ \newCJKfontfamily[SCmono]\scmono{Noto Sans Mono CJK SC}[AutoFakeSlant]
+ %% for Traditional Chinese
+ \IfFontExistsTF{Noto Serif CJK TC}{
+ \newCJKfontfamily[TCmain]\tcmain{Noto Serif CJK TC}[AutoFakeSlant]
+ \newCJKfontfamily[TCserif]\tcserif{Noto Serif CJK TC}[AutoFakeSlant]
+ }{
+ \newCJKfontfamily[TCmain]\tcmain{Noto Sans CJK TC}[AutoFakeSlant]
+ \newCJKfontfamily[TCserif]\tcserif{Noto Sans CJK TC}[AutoFakeSlant]
+ }
+ \newCJKfontfamily[TCsans]\tcsans{Noto Sans CJK TC}[AutoFakeSlant]
+ \newCJKfontfamily[TCmono]\tcmono{Noto Sans Mono CJK TC}[AutoFakeSlant]
+ %% for Korean
+ \IfFontExistsTF{Noto Serif CJK KR}{
+ \newCJKfontfamily[KRmain]\krmain{Noto Serif CJK KR}[AutoFakeSlant]
+ \newCJKfontfamily[KRserif]\krserif{Noto Serif CJK KR}[AutoFakeSlant]
+ }{
+ \newCJKfontfamily[KRmain]\krmain{Noto Sans CJK KR}[AutoFakeSlant]
+ \newCJKfontfamily[KRserif]\krserif{Noto Sans CJK KR}[AutoFakeSlant]
+ }
+ \newCJKfontfamily[KRsans]\krsans{Noto Sans CJK KR}[AutoFakeSlant]
+ \newCJKfontfamily[KRmono]\krmono{Noto Sans Mono CJK KR}[AutoFakeSlant]
+ %% for Japanese
+ \IfFontExistsTF{Noto Serif CJK JP}{
+ \newCJKfontfamily[JPmain]\jpmain{Noto Serif CJK JP}[AutoFakeSlant]
+ \newCJKfontfamily[JPserif]\jpserif{Noto Serif CJK JP}[AutoFakeSlant]
+ }{
+ \newCJKfontfamily[JPmain]\jpmain{Noto Sans CJK JP}[AutoFakeSlant]
+ \newCJKfontfamily[JPserif]\jpserif{Noto Sans CJK JP}[AutoFakeSlant]
+ }
+ \newCJKfontfamily[JPsans]\jpsans{Noto Sans CJK JP}[AutoFakeSlant]
+ \newCJKfontfamily[JPmono]\jpmono{Noto Sans Mono CJK JP}[AutoFakeSlant]
+ % Dummy commands for Sphinx < 2.3 (no 'extrapackages' support)
+ \providecommand{\onehalfspacing}{}
+ \providecommand{\singlespacing}{}
+ % Define custom macros to on/off CJK
+ %% One and half spacing for CJK contents
+ \newcommand{\kerneldocCJKon}{\makexeCJKactive\onehalfspacing}
+ \newcommand{\kerneldocCJKoff}{\makexeCJKinactive\singlespacing}
+ % Define custom macros for switching CJK font setting
+ %% for Simplified Chinese
+ \newcommand{\kerneldocBeginSC}{%
+ \begingroup%
+ \scmain%
+ \xeCJKDeclareCharClass{FullLeft}{`“,`‘}% Full-width in SC
+ \xeCJKDeclareCharClass{FullRight}{`”,`’}% Full-width in SC
+ \renewcommand{\CJKrmdefault}{SCserif}%
+ \renewcommand{\CJKsfdefault}{SCsans}%
+ \renewcommand{\CJKttdefault}{SCmono}%
+ \xeCJKsetup{CJKspace = false}% gobble white spaces by ' '
+ % For CJK ascii-art alignment
+ \setmonofont{Noto Sans Mono CJK SC}[AutoFakeSlant]%
+ }
+ \newcommand{\kerneldocEndSC}{\endgroup}
+ %% for Traditional Chinese
+ \newcommand{\kerneldocBeginTC}{%
+ \begingroup%
+ \tcmain%
+ \xeCJKDeclareCharClass{FullLeft}{`“,`‘}% Full-width in TC
+ \xeCJKDeclareCharClass{FullRight}{`”,`’}% Full-width in TC
+ \renewcommand{\CJKrmdefault}{TCserif}%
+ \renewcommand{\CJKsfdefault}{TCsans}%
+ \renewcommand{\CJKttdefault}{TCmono}%
+ \xeCJKsetup{CJKspace = false}% gobble white spaces by ' '
+ % For CJK ascii-art alignment
+ \setmonofont{Noto Sans Mono CJK TC}[AutoFakeSlant]%
+ }
+ \newcommand{\kerneldocEndTC}{\endgroup}
+ %% for Korean
+ \newcommand{\kerneldocBeginKR}{%
+ \begingroup%
+ \krmain%
+ \renewcommand{\CJKrmdefault}{KRserif}%
+ \renewcommand{\CJKsfdefault}{KRsans}%
+ \renewcommand{\CJKttdefault}{KRmono}%
+ % \xeCJKsetup{CJKspace = true} % true by default
+ % For CJK ascii-art alignment (still misaligned for Hangul)
+ \setmonofont{Noto Sans Mono CJK KR}[AutoFakeSlant]%
+ }
+ \newcommand{\kerneldocEndKR}{\endgroup}
+ %% for Japanese
+ \newcommand{\kerneldocBeginJP}{%
+ \begingroup%
+ \jpmain%
+ \renewcommand{\CJKrmdefault}{JPserif}%
+ \renewcommand{\CJKsfdefault}{JPsans}%
+ \renewcommand{\CJKttdefault}{JPmono}%
+ \xeCJKsetup{CJKspace = false}% gobble white space by ' '
+ % For CJK ascii-art alignment
+ \setmonofont{Noto Sans Mono CJK JP}[AutoFakeSlant]%
+ }
+ \newcommand{\kerneldocEndJP}{\endgroup}
+
+ % Single spacing in literal blocks
+ \fvset{baselinestretch=1}
+ % To customize \sphinxtableofcontents
+ \usepackage{etoolbox}
+ % Inactivate CJK after tableofcontents
+ \apptocmd{\sphinxtableofcontents}{\kerneldocCJKoff}{}{}
+ \xeCJKsetup{CJKspace = true}% For inter-phrase space of Korean TOC
+}{ % No CJK font found
+ % Custom macros to on/off CJK and switch CJK fonts (Dummy)
+ \newcommand{\kerneldocCJKon}{}
+ \newcommand{\kerneldocCJKoff}{}
+ %% By defining \kerneldocBegin(SC|TC|KR|JP) as commands with an argument
+ %% and ignore the argument (#1) in their definitions, whole contents of
+ %% CJK chapters can be ignored.
+ \newcommand{\kerneldocBeginSC}[1]{%
+ %% Put a note on missing CJK fonts in place of zh_CN translation.
+ \begin{sphinxadmonition}{note}{Note on missing fonts:}
+ Translations of Simplified Chinese (zh\_CN), Traditional Chinese
+ (zh\_TW), Korean (ko\_KR), and Japanese (ja\_JP) were skipped
+ due to the lack of suitable font families.
+
+ If you want them, please install ``Noto Sans CJK'' font families
+ by following instructions from
+ \sphinxcode{./scripts/sphinx-pre-install}.
+ Having optional ``Noto Serif CJK'' font families will improve
+ the looks of those translations.
+ \end{sphinxadmonition}}
+ \newcommand{\kerneldocEndSC}{}
+ \newcommand{\kerneldocBeginTC}[1]{}
+ \newcommand{\kerneldocEndTC}{}
+ \newcommand{\kerneldocBeginKR}[1]{}
+ \newcommand{\kerneldocEndKR}{}
+ \newcommand{\kerneldocBeginJP}[1]{}
+ \newcommand{\kerneldocEndJP}{}
+}
diff --git a/Documentation/sphinx/kfigure.py b/Documentation/sphinx/kfigure.py
index 3c78828330be..24d2b2addcce 100644
--- a/Documentation/sphinx/kfigure.py
+++ b/Documentation/sphinx/kfigure.py
@@ -31,10 +31,13 @@ u"""
* ``dot(1)``: Graphviz (https://www.graphviz.org). If Graphviz is not
available, the DOT language is inserted as literal-block.
+ For conversion to PDF, ``rsvg-convert(1)`` of librsvg
+ (https://gitlab.gnome.org/GNOME/librsvg) is used when available.
* SVG to PDF: To generate PDF, you need at least one of this tools:
- ``convert(1)``: ImageMagick (https://www.imagemagick.org)
+ - ``inkscape(1)``: Inkscape (https://inkscape.org/)
List of customizations:
@@ -49,6 +52,7 @@ import os
from os import path
import subprocess
from hashlib import sha1
+import re
from docutils import nodes
from docutils.statemachine import ViewList
from docutils.parsers.rst import directives
@@ -109,10 +113,20 @@ def pass_handle(self, node): # pylint: disable=W0613
# Graphviz's dot(1) support
dot_cmd = None
+# dot(1) -Tpdf should be used
+dot_Tpdf = False
# ImageMagick' convert(1) support
convert_cmd = None
+# librsvg's rsvg-convert(1) support
+rsvg_convert_cmd = None
+
+# Inkscape's inkscape(1) support
+inkscape_cmd = None
+# Inkscape prior to 1.0 uses different command options
+inkscape_ver_one = False
+
def setup(app):
# check toolchain first
@@ -160,23 +174,62 @@ def setupTools(app):
This function is called once, when the builder is initiated.
"""
- global dot_cmd, convert_cmd # pylint: disable=W0603
+ global dot_cmd, dot_Tpdf, convert_cmd, rsvg_convert_cmd # pylint: disable=W0603
+ global inkscape_cmd, inkscape_ver_one # pylint: disable=W0603
kernellog.verbose(app, "kfigure: check installed tools ...")
dot_cmd = which('dot')
convert_cmd = which('convert')
+ rsvg_convert_cmd = which('rsvg-convert')
+ inkscape_cmd = which('inkscape')
if dot_cmd:
kernellog.verbose(app, "use dot(1) from: " + dot_cmd)
+
+ try:
+ dot_Thelp_list = subprocess.check_output([dot_cmd, '-Thelp'],
+ stderr=subprocess.STDOUT)
+ except subprocess.CalledProcessError as err:
+ dot_Thelp_list = err.output
+ pass
+
+ dot_Tpdf_ptn = b'pdf'
+ dot_Tpdf = re.search(dot_Tpdf_ptn, dot_Thelp_list)
else:
kernellog.warn(app, "dot(1) not found, for better output quality install "
"graphviz from https://www.graphviz.org")
- if convert_cmd:
- kernellog.verbose(app, "use convert(1) from: " + convert_cmd)
+ if inkscape_cmd:
+ kernellog.verbose(app, "use inkscape(1) from: " + inkscape_cmd)
+ inkscape_ver = subprocess.check_output([inkscape_cmd, '--version'],
+ stderr=subprocess.DEVNULL)
+ ver_one_ptn = b'Inkscape 1'
+ inkscape_ver_one = re.search(ver_one_ptn, inkscape_ver)
+ convert_cmd = None
+ rsvg_convert_cmd = None
+ dot_Tpdf = False
+
else:
- kernellog.warn(app,
- "convert(1) not found, for SVG to PDF conversion install "
- "ImageMagick (https://www.imagemagick.org)")
+ if convert_cmd:
+ kernellog.verbose(app, "use convert(1) from: " + convert_cmd)
+ else:
+ kernellog.warn(app,
+ "Neither inkscape(1) nor convert(1) found.\n"
+ "For SVG to PDF conversion, "
+ "install either Inkscape (https://inkscape.org/) (preferred) or\n"
+ "ImageMagick (https://www.imagemagick.org)")
+
+ if rsvg_convert_cmd:
+ kernellog.verbose(app, "use rsvg-convert(1) from: " + rsvg_convert_cmd)
+ kernellog.verbose(app, "use 'dot -Tsvg' and rsvg-convert(1) for DOT -> PDF conversion")
+ dot_Tpdf = False
+ else:
+ kernellog.verbose(app,
+ "rsvg-convert(1) not found.\n"
+ " SVG rendering of convert(1) is done by ImageMagick-native renderer.")
+ if dot_Tpdf:
+ kernellog.verbose(app, "use 'dot -Tpdf' for DOT -> PDF conversion")
+ else:
+ kernellog.verbose(app, "use 'dot -Tsvg' and convert(1) for DOT -> PDF conversion")
# integrate conversion tools
@@ -242,7 +295,7 @@ def convert_image(img_node, translator, src_fname=None):
elif in_ext == '.svg':
if translator.builder.format == 'latex':
- if convert_cmd is None:
+ if not inkscape_cmd and convert_cmd is None:
kernellog.verbose(app,
"no SVG to PDF conversion available / include SVG raw.")
img_node.replace_self(file2literal(src_fname))
@@ -266,7 +319,14 @@ def convert_image(img_node, translator, src_fname=None):
if in_ext == '.dot':
kernellog.verbose(app, 'convert DOT to: {out}/' + _name)
- ok = dot2format(app, src_fname, dst_fname)
+ if translator.builder.format == 'latex' and not dot_Tpdf:
+ svg_fname = path.join(translator.builder.outdir, fname + '.svg')
+ ok1 = dot2format(app, src_fname, svg_fname)
+ ok2 = svg2pdf_by_rsvg(app, svg_fname, dst_fname)
+ ok = ok1 and ok2
+
+ else:
+ ok = dot2format(app, src_fname, dst_fname)
elif in_ext == '.svg':
kernellog.verbose(app, 'convert SVG to: {out}/' + _name)
@@ -303,22 +363,70 @@ def dot2format(app, dot_fname, out_fname):
return bool(exit_code == 0)
def svg2pdf(app, svg_fname, pdf_fname):
- """Converts SVG to PDF with ``convert(1)`` command.
+ """Converts SVG to PDF with ``inkscape(1)`` or ``convert(1)`` command.
- Uses ``convert(1)`` from ImageMagick (https://www.imagemagick.org) for
- conversion. Returns ``True`` on success and ``False`` if an error occurred.
+ Uses ``inkscape(1)`` from Inkscape (https://inkscape.org/) or ``convert(1)``
+ from ImageMagick (https://www.imagemagick.org) for conversion.
+ Returns ``True`` on success and ``False`` if an error occurred.
* ``svg_fname`` pathname of the input SVG file with extension (``.svg``)
* ``pdf_name`` pathname of the output PDF file with extension (``.pdf``)
"""
cmd = [convert_cmd, svg_fname, pdf_fname]
- # use stdout and stderr from parent
- exit_code = subprocess.call(cmd)
+ cmd_name = 'convert(1)'
+
+ if inkscape_cmd:
+ cmd_name = 'inkscape(1)'
+ if inkscape_ver_one:
+ cmd = [inkscape_cmd, '-o', pdf_fname, svg_fname]
+ else:
+ cmd = [inkscape_cmd, '-z', '--export-pdf=%s' % pdf_fname, svg_fname]
+
+ try:
+ warning_msg = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
+ exit_code = 0
+ except subprocess.CalledProcessError as err:
+ warning_msg = err.output
+ exit_code = err.returncode
+ pass
+
if exit_code != 0:
kernellog.warn(app, "Error #%d when calling: %s" % (exit_code, " ".join(cmd)))
+ if warning_msg:
+ kernellog.warn(app, "Warning msg from %s: %s"
+ % (cmd_name, str(warning_msg, 'utf-8')))
+ elif warning_msg:
+ kernellog.verbose(app, "Warning msg from %s (likely harmless):\n%s"
+ % (cmd_name, str(warning_msg, 'utf-8')))
+
return bool(exit_code == 0)
+def svg2pdf_by_rsvg(app, svg_fname, pdf_fname):
+ """Convert SVG to PDF with ``rsvg-convert(1)`` command.
+
+ * ``svg_fname`` pathname of input SVG file, including extension ``.svg``
+ * ``pdf_fname`` pathname of output PDF file, including extension ``.pdf``
+
+ Input SVG file should be the one generated by ``dot2format()``.
+ SVG -> PDF conversion is done by ``rsvg-convert(1)``.
+
+ If ``rsvg-convert(1)`` is unavailable, fall back to ``svg2pdf()``.
+
+ """
+
+ if rsvg_convert_cmd is None:
+ ok = svg2pdf(app, svg_fname, pdf_fname)
+ else:
+ cmd = [rsvg_convert_cmd, '--format=pdf', '-o', pdf_fname, svg_fname]
+ # use stdout and stderr from parent
+ exit_code = subprocess.call(cmd)
+ if exit_code != 0:
+ kernellog.warn(app, "Error #%d when calling: %s" % (exit_code, " ".join(cmd)))
+ ok = bool(exit_code == 0)
+
+ return ok
+
# image handling
# ---------------------
diff --git a/Documentation/spi/pxa2xx.rst b/Documentation/spi/pxa2xx.rst
index 6347580826be..716f65d87d04 100644
--- a/Documentation/spi/pxa2xx.rst
+++ b/Documentation/spi/pxa2xx.rst
@@ -101,7 +101,6 @@ device. All fields are optional.
u8 rx_threshold;
u8 dma_burst_size;
u32 timeout;
- int gpio_cs;
};
The "pxa2xx_spi_chip.tx_threshold" and "pxa2xx_spi_chip.rx_threshold" fields are
@@ -146,7 +145,6 @@ field. Below is a sample configuration using the PXA255 NSSP.
.rx_threshold = 8, /* SSP hardward FIFO threshold */
.dma_burst_size = 8, /* Byte wide transfers used so 8 byte bursts */
.timeout = 235, /* See Intel documentation */
- .gpio_cs = 2, /* Use external chip select */
};
static struct pxa2xx_spi_chip cs8405a_chip_info = {
@@ -154,7 +152,6 @@ field. Below is a sample configuration using the PXA255 NSSP.
.rx_threshold = 8, /* SSP hardward FIFO threshold */
.dma_burst_size = 8, /* Byte wide transfers used so 8 byte bursts */
.timeout = 235, /* See Intel documentation */
- .gpio_cs = 3, /* Use external chip select */
};
static struct spi_board_info streetracer_spi_board_info[] __initdata = {
diff --git a/Documentation/trace/osnoise-tracer.rst b/Documentation/trace/osnoise-tracer.rst
index b648cb9bf1f0..963def9f97c6 100644
--- a/Documentation/trace/osnoise-tracer.rst
+++ b/Documentation/trace/osnoise-tracer.rst
@@ -51,7 +51,7 @@ For example::
[root@f32 ~]# cd /sys/kernel/tracing/
[root@f32 tracing]# echo osnoise > current_tracer
-It is possible to follow the trace by reading the trace trace file::
+It is possible to follow the trace by reading the trace file::
[root@f32 tracing]# cat trace
# tracer: osnoise
@@ -108,7 +108,7 @@ The tracer has a set of options inside the osnoise directory, they are:
option.
- tracing_threshold: the minimum delta between two time() reads to be
considered as noise, in us. When set to 0, the default value will
- will be used, which is currently 5 us.
+ be used, which is currently 5 us.
Additional Tracing
------------------
diff --git a/Documentation/translations/conf.py b/Documentation/translations/conf.py
deleted file mode 100644
index 92cdbba74229..000000000000
--- a/Documentation/translations/conf.py
+++ /dev/null
@@ -1,12 +0,0 @@
-# -*- coding: utf-8 -*-
-# SPDX-License-Identifier: GPL-2.0
-
-# -- Additinal options for LaTeX output ----------------------------------
-# font config for ascii-art alignment
-
-latex_elements['preamble'] += '''
- \\IfFontExistsTF{Noto Sans CJK SC}{
- % For CJK ascii-art alignment
- \\setmonofont{Noto Sans Mono CJK SC}[AutoFakeSlant]
- }{}
-'''
diff --git a/Documentation/translations/ja_JP/index.rst b/Documentation/translations/ja_JP/index.rst
index 88d4d98eed15..20738c931d02 100644
--- a/Documentation/translations/ja_JP/index.rst
+++ b/Documentation/translations/ja_JP/index.rst
@@ -3,7 +3,7 @@
\renewcommand\thesection*
\renewcommand\thesubsection*
\kerneldocCJKon
- \kerneldocBeginJP
+ \kerneldocBeginJP{
Japanese translations
=====================
@@ -15,4 +15,4 @@ Japanese translations
.. raw:: latex
- \kerneldocEndJP
+ }\kerneldocEndJP
diff --git a/Documentation/translations/ko_KR/index.rst b/Documentation/translations/ko_KR/index.rst
index f636b482fb4c..4add6b2fe1f2 100644
--- a/Documentation/translations/ko_KR/index.rst
+++ b/Documentation/translations/ko_KR/index.rst
@@ -3,7 +3,7 @@
\renewcommand\thesection*
\renewcommand\thesubsection*
\kerneldocCJKon
- \kerneldocBeginKR
+ \kerneldocBeginKR{
한국어 번역
===========
@@ -26,5 +26,4 @@
.. raw:: latex
- \normalsize
- \kerneldocEndKR
+ }\kerneldocEndKR
diff --git a/Documentation/translations/zh_CN/accounting/delay-accounting.rst b/Documentation/translations/zh_CN/accounting/delay-accounting.rst
index 67d5606e5401..f1849411018e 100644
--- a/Documentation/translations/zh_CN/accounting/delay-accounting.rst
+++ b/Documentation/translations/zh_CN/accounting/delay-accounting.rst
@@ -17,6 +17,8 @@ a) 等待一个CPU(任务为可运行)
b) 完成由该任务发起的块I/O同步请求
c) 页面交换
d) 内存回收
+e) 页缓存抖动
+f) 直接规整
并将这些统计信息通过taskstats接口提供给用户空间。
@@ -37,10 +39,10 @@ d) 内存回收
向用户态返回一个通用数据结构,对应每pid或每tgid的统计信息。延时计数功能填写
该数据结构的特定字段。见
- include/linux/taskstats.h
+ include/uapi/linux/taskstats.h
其描述了延时计数相关字段。系统通常以计数器形式返回 CPU、同步块 I/O、交换、内存
-回收等的累积延时。
+回收、页缓存抖动、直接规整等的累积延时。
取任务某计数器两个连续读数的差值,将得到任务在该时间间隔内等待对应资源的总延时。
@@ -72,40 +74,36 @@ kernel.task_delayacct进行开关。注意,只有在启用延时计数后启
getdelays命令的一般格式::
- getdelays [-t tgid] [-p pid] [-c cmd...]
+ getdelays [-dilv] [-t tgid] [-p pid]
获取pid为10的任务从系统启动后的延时信息::
- # ./getdelays -p 10
+ # ./getdelays -d -p 10
(输出信息和下例相似)
获取所有tgid为5的任务从系统启动后的总延时信息::
- # ./getdelays -t 5
-
-
- CPU count real total virtual total delay total
- 7876 92005750 100000000 24001500
- IO count delay total
- 0 0
- SWAP count delay total
- 0 0
- RECLAIM count delay total
- 0 0
-
-获取指定简单命令运行时的延时信息::
-
- # ./getdelays -c ls /
-
- bin data1 data3 data5 dev home media opt root srv sys usr
- boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var
-
-
- CPU count real total virtual total delay total
- 6 4000250 4000000 0
- IO count delay total
- 0 0
- SWAP count delay total
- 0 0
- RECLAIM count delay total
- 0 0
+ # ./getdelays -d -t 5
+ print delayacct stats ON
+ TGID 5
+
+
+ CPU count real total virtual total delay total delay average
+ 8 7000000 6872122 3382277 0.423ms
+ IO count delay total delay average
+ 0 0 0ms
+ SWAP count delay total delay average
+ 0 0 0ms
+ RECLAIM count delay total delay average
+ 0 0 0ms
+ THRASHING count delay total delay average
+ 0 0 0ms
+ COMPACT count delay total delay average
+ 0 0 0ms
+
+获取pid为1的IO计数,它只和-p一起使用::
+ # ./getdelays -i -p 1
+ printing IO accounting
+ linuxrc: read=65536, write=0, cancelled_write=0
+
+上面的命令与-v一起使用,可以获取更多调试信息。
diff --git a/Documentation/translations/zh_CN/admin-guide/index.rst b/Documentation/translations/zh_CN/admin-guide/index.rst
index 548e57f4b3f1..be535ffaf4b0 100644
--- a/Documentation/translations/zh_CN/admin-guide/index.rst
+++ b/Documentation/translations/zh_CN/admin-guide/index.rst
@@ -20,15 +20,15 @@ Linux 内核用户和管理员指南
Todolist:
- kernel-parameters
- devices
- sysctl/index
+* kernel-parameters
+* devices
+* sysctl/index
本节介绍CPU漏洞及其缓解措施。
Todolist:
- hw-vuln/index
+* hw-vuln/index
下面的一组文档,针对的是试图跟踪问题和bug的用户。
@@ -44,18 +44,18 @@ Todolist:
Todolist:
- reporting-bugs
- ramoops
- dynamic-debug-howto
- kdump/index
- perf/index
+* reporting-bugs
+* ramoops
+* dynamic-debug-howto
+* kdump/index
+* perf/index
这是应用程序开发人员感兴趣的章节的开始。可以在这里找到涵盖内核ABI各个
方面的文档。
Todolist:
- sysfs-rules
+* sysfs-rules
本手册的其余部分包括各种指南,介绍如何根据您的喜好配置内核的特定行为。
@@ -69,61 +69,61 @@ Todolist:
lockup-watchdogs
unicode
sysrq
+ mm/index
Todolist:
- acpi/index
- aoe/index
- auxdisplay/index
- bcache
- binderfs
- binfmt-misc
- blockdev/index
- bootconfig
- braille-console
- btmrvl
- cgroup-v1/index
- cgroup-v2
- cifs/index
- dell_rbu
- device-mapper/index
- edid
- efi-stub
- ext4
- nfs/index
- gpio/index
- highuid
- hw_random
- initrd
- iostats
- java
- jfs
- kernel-per-CPU-kthreads
- laptops/index
- lcd-panel-cgram
- ldm
- LSM/index
- md
- media/index
- mm/index
- module-signing
- mono
- namespaces/index
- numastat
- parport
- perf-security
- pm/index
- pnp
- rapidio
- ras
- rtc
- serial-console
- svga
- thunderbolt
- ufs
- vga-softcursor
- video-output
- xfs
+* acpi/index
+* aoe/index
+* auxdisplay/index
+* bcache
+* binderfs
+* binfmt-misc
+* blockdev/index
+* bootconfig
+* braille-console
+* btmrvl
+* cgroup-v1/index
+* cgroup-v2
+* cifs/index
+* dell_rbu
+* device-mapper/index
+* edid
+* efi-stub
+* ext4
+* nfs/index
+* gpio/index
+* highuid
+* hw_random
+* initrd
+* iostats
+* java
+* jfs
+* kernel-per-CPU-kthreads
+* laptops/index
+* lcd-panel-cgram
+* ldm
+* LSM/index
+* md
+* media/index
+* module-signing
+* mono
+* namespaces/index
+* numastat
+* parport
+* perf-security
+* pm/index
+* pnp
+* rapidio
+* ras
+* rtc
+* serial-console
+* svga
+* thunderbolt
+* ufs
+* vga-softcursor
+* video-output
+* xfs
.. only:: subproject and html
diff --git a/Documentation/translations/zh_CN/admin-guide/mm/damon/index.rst b/Documentation/translations/zh_CN/admin-guide/mm/damon/index.rst
new file mode 100644
index 000000000000..0c8276109fc0
--- /dev/null
+++ b/Documentation/translations/zh_CN/admin-guide/mm/damon/index.rst
@@ -0,0 +1,28 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../../disclaimer-zh_CN.rst
+
+:Original: Documentation/admin-guide/mm/damon/index.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+============
+监测数据访问
+============
+
+:doc:`DAMON </vm/damon/index>` 允许轻量级的数据访问监测。使用DAMON,
+用户可以分析他们系统的内存访问模式,并优化它们。
+
+.. toctree::
+ :maxdepth: 2
+
+ start
+ usage
+ reclaim
+
+
+
+
diff --git a/Documentation/translations/zh_CN/admin-guide/mm/damon/reclaim.rst b/Documentation/translations/zh_CN/admin-guide/mm/damon/reclaim.rst
new file mode 100644
index 000000000000..9e541578f38d
--- /dev/null
+++ b/Documentation/translations/zh_CN/admin-guide/mm/damon/reclaim.rst
@@ -0,0 +1,232 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../../disclaimer-zh_CN.rst
+
+:Original: Documentation/admin-guide/mm/damon/reclaim.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+===============
+基于DAMON的回收
+===============
+
+基于DAMON的回收(DAMON_RECLAIM)是一个静态的内核模块,旨在用于轻度内存压力下的主动和轻
+量级的回收。它的目的不是取代基于LRU列表的页面回收,而是有选择地用于不同程度的内存压力和要
+求。
+
+哪些地方需要主动回收?
+======================
+
+在一般的内存超量使用(over-committed systems,虚拟化相关术语)的系统上,主动回收冷页
+有助于节省内存和减少延迟高峰,这些延迟是由直接回收进程或kswapd的CPU消耗引起的,同时只产
+生最小的性能下降 [1]_ [2]_ 。
+
+基于空闲页报告 [3]_ 的内存过度承诺的虚拟化系统就是很好的例子。在这样的系统中,客户机
+向主机报告他们的空闲内存,而主机则将报告的内存重新分配给其他客户。因此,系统的内存得到了充
+分的利用。然而,客户可能不那么节省内存,主要是因为一些内核子系统和用户空间应用程序被设计为
+使用尽可能多的内存。然后,客户机可能只向主机报告少量的内存是空闲的,导致系统的内存利用率下降。
+在客户中运行主动回收可以缓解这个问题。
+
+它是如何工作的?
+================
+
+DAMON_RECLAIM找到在特定时间内没有被访问的内存区域并分页。为了避免它在分页操作中消耗过多
+的CPU,可以配置一个速度限制。在这个速度限制下,它首先分页出那些没有被访问过的内存区域。系
+统管理员还可以配置在什么情况下这个方案应该自动激活和停用三个内存压力水位。
+
+接口: 模块参数
+==============
+
+要使用这个功能,你首先要确保你的系统运行在一个以 ``CONFIG_DAMON_RECLAIM=y`` 构建的内
+核上。
+
+为了让系统管理员启用或禁用它,并为给定的系统进行调整,DAMON_RECLAIM利用了模块参数。也就
+是说,你可以把 ``damon_reclaim.<parameter>=<value>`` 放在内核启动命令行上,或者把
+适当的值写入 ``/sys/modules/damon_reclaim/parameters/<parameter>`` 文件。
+
+注意,除 ``启用`` 外的参数值只在DAMON_RECLAIM启动时应用。因此,如果你想在运行时应用新
+的参数值,而DAMON_RECLAIM已经被启用,你应该通过 ``启用`` 的参数文件禁用和重新启用它。
+在重新启用之前,应将新的参数值写入适当的参数值中。
+
+下面是每个参数的描述。
+
+enable
+------
+
+启用或禁用DAMON_RECLAIM。
+
+你可以通过把这个参数的值设置为 ``Y`` 来启用DAMON_RCLAIM,把它设置为 ``N`` 可以禁用
+DAMON_RECLAIM。注意,由于基于水位的激活条件,DAMON_RECLAIM不能进行真正的监测和回收。
+这一点请参考下面关于水位参数的描述。
+
+min_age
+-------
+
+识别冷内存区域的时间阈值,单位是微秒。
+
+如果一个内存区域在这个时间或更长的时间内没有被访问,DAMON_RECLAIM会将该区域识别为冷的,
+并回收它。
+
+默认为120秒。
+
+quota_ms
+--------
+
+回收的时间限制,以毫秒为单位。
+
+DAMON_RECLAIM 试图在一个时间窗口(quota_reset_interval_ms)内只使用到这个时间,以
+尝试回收冷页。这可以用来限制DAMON_RECLAIM的CPU消耗。如果该值为零,则该限制被禁用。
+
+默认为10ms。
+
+quota_sz
+--------
+
+回收的内存大小限制,单位为字节。
+
+DAMON_RECLAIM 收取在一个时间窗口(quota_reset_interval_ms)内试图回收的内存量,并
+使其不超过这个限制。这可以用来限制CPU和IO的消耗。如果该值为零,则限制被禁用。
+
+默认情况下是128 MiB。
+
+quota_reset_interval_ms
+-----------------------
+
+时间/大小配额收取重置间隔,单位为毫秒。
+
+时间(quota_ms)和大小(quota_sz)的配额的目标重置间隔。也就是说,DAMON_RECLAIM在
+尝试回收‘不’超过quota_ms毫秒或quota_sz字节的内存。
+
+默认为1秒。
+
+wmarks_interval
+---------------
+
+当DAMON_RECLAIM被启用但由于其水位规则而不活跃时,在检查水位之前的最小等待时间。
+
+wmarks_high
+-----------
+
+高水位的可用内存率(每千字节)。
+
+如果系统的可用内存(以每千字节为单位)高于这个数值,DAMON_RECLAIM就会变得不活跃,所以
+它什么也不做,只是定期检查水位。
+
+wmarks_mid
+----------
+
+中间水位的可用内存率(每千字节)。
+
+如果系统的空闲内存(以每千字节为单位)在这个和低水位线之间,DAMON_RECLAIM就会被激活,
+因此开始监测和回收。
+
+wmarks_low
+----------
+
+低水位的可用内存率(每千字节)。
+
+如果系统的空闲内存(以每千字节为单位)低于这个数值,DAMON_RECLAIM就会变得不活跃,所以
+它除了定期检查水位外什么都不做。在这种情况下,系统会退回到基于LRU列表的页面粒度回收逻辑。
+
+sample_interval
+---------------
+
+监测的采样间隔,单位是微秒。
+
+DAMON用于监测冷内存的采样间隔。更多细节请参考DAMON文档 (:doc:`usage`) 。
+
+aggr_interval
+-------------
+
+监测的聚集间隔,单位是微秒。
+
+DAMON对冷内存监测的聚集间隔。更多细节请参考DAMON文档 (:doc:`usage`)。
+
+min_nr_regions
+--------------
+
+监测区域的最小数量。
+
+DAMON用于冷内存监测的最小监测区域数。这可以用来设置监测质量的下限。但是,设
+置的太高可能会导致监测开销的增加。更多细节请参考DAMON文档 (:doc:`usage`) 。
+
+max_nr_regions
+--------------
+
+监测区域的最大数量。
+
+DAMON用于冷内存监测的最大监测区域数。这可以用来设置监测开销的上限值。但是,
+设置得太低可能会导致监测质量不好。更多细节请参考DAMON文档 (:doc:`usage`) 。
+
+monitor_region_start
+--------------------
+
+目标内存区域的物理地址起点。
+
+DAMON_RECLAIM将对其进行工作的内存区域的起始物理地址。也就是说,DAMON_RECLAIM
+将在这个区域中找到冷的内存区域并进行回收。默认情况下,该区域使用最大系统内存区。
+
+monitor_region_end
+------------------
+
+目标内存区域的结束物理地址。
+
+DAMON_RECLAIM将对其进行工作的内存区域的末端物理地址。也就是说,DAMON_RECLAIM将
+在这个区域内找到冷的内存区域并进行回收。默认情况下,该区域使用最大系统内存区。
+
+kdamond_pid
+-----------
+
+DAMON线程的PID。
+
+如果DAMON_RECLAIM被启用,这将成为工作线程的PID。否则,为-1。
+
+nr_reclaim_tried_regions
+------------------------
+
+试图通过DAMON_RECLAIM回收的内存区域的数量。
+
+bytes_reclaim_tried_regions
+---------------------------
+
+试图通过DAMON_RECLAIM回收的内存区域的总字节数。
+
+nr_reclaimed_regions
+--------------------
+
+通过DAMON_RECLAIM成功回收的内存区域的数量。
+
+bytes_reclaimed_regions
+-----------------------
+
+通过DAMON_RECLAIM成功回收的内存区域的总字节数。
+
+nr_quota_exceeds
+----------------
+
+超过时间/空间配额限制的次数。
+
+例子
+====
+
+下面的运行示例命令使DAMON_RECLAIM找到30秒或更长时间没有访问的内存区域并“回收”?
+为了避免DAMON_RECLAIM在分页操作中消耗过多的CPU时间,回收被限制在每秒1GiB以内。
+它还要求DAMON_RECLAIM在系统的可用内存率超过50%时不做任何事情,但如果它低于40%时
+就开始真正的工作。如果DAMON_RECLAIM没有取得进展,因此空闲内存率低于20%,它会要求
+DAMON_RECLAIM再次什么都不做,这样我们就可以退回到基于LRU列表的页面粒度回收了::
+
+ # cd /sys/modules/damon_reclaim/parameters
+ # echo 30000000 > min_age
+ # echo $((1 * 1024 * 1024 * 1024)) > quota_sz
+ # echo 1000 > quota_reset_interval_ms
+ # echo 500 > wmarks_high
+ # echo 400 > wmarks_mid
+ # echo 200 > wmarks_low
+ # echo Y > enabled
+
+.. [1] https://research.google/pubs/pub48551/
+.. [2] https://lwn.net/Articles/787611/
+.. [3] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html
diff --git a/Documentation/translations/zh_CN/admin-guide/mm/damon/start.rst b/Documentation/translations/zh_CN/admin-guide/mm/damon/start.rst
new file mode 100644
index 000000000000..67d1b49481dc
--- /dev/null
+++ b/Documentation/translations/zh_CN/admin-guide/mm/damon/start.rst
@@ -0,0 +1,132 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../../disclaimer-zh_CN.rst
+
+:Original: Documentation/admin-guide/mm/damon/start.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+========
+入门指南
+========
+
+本文通过演示DAMON的默认用户空间工具,简要地介绍了如何使用DAMON。请注意,为了简洁
+起见,本文档只描述了它的部分功能。更多细节请参考该工具的使用文档。
+`doc <https://github.com/awslabs/damo/blob/next/USAGE.md>`_ .
+
+
+前提条件
+========
+
+内核
+----
+
+首先,你要确保你当前系统中跑的内核构建时选定了这个功能选项 ``CONFIG_DAMON_*=y``.
+
+
+用户空间工具
+------------
+
+在演示中,我们将使用DAMON的默认用户空间工具,称为DAMON Operator(DAMO)。它可以在
+https://github.com/awslabs/damo找到。下面的例子假设DAMO在你的$PATH上。当然,但
+这并不是强制性的。
+
+因为DAMO使用的是DAMON的debugfs接口(详情请参考 :doc:`usage` 中的使用方法) 你应该
+确保debugfs被挂载。手动挂载它,如下所示::
+
+ # mount -t debugfs none /sys/kernel/debug/
+
+或者在你的 ``/etc/fstab`` 文件中添加以下一行,这样你的系统就可以在启动时自动挂载
+debugfs了::
+
+ debugfs /sys/kernel/debug debugfs defaults 0 0
+
+
+记录数据访问模式
+================
+
+下面的命令记录了一个程序的内存访问模式,并将监测结果保存到文件中。 ::
+
+ $ git clone https://github.com/sjp38/masim
+ $ cd masim; make; ./masim ./configs/zigzag.cfg &
+ $ sudo damo record -o damon.data $(pidof masim)
+
+命令的前两行下载了一个人工内存访问生成器程序并在后台运行。生成器将重复地逐一访问两个
+100 MiB大小的内存区域。你可以用你的真实工作负载来代替它。最后一行要求 ``damo`` 将
+访问模式记录在 ``damon.data`` 文件中。
+
+
+将记录的模式可视化
+==================
+
+你可以在heatmap中直观地看到这种模式,显示哪个内存区域(X轴)何时被访问(Y轴)以及访
+问的频率(数字)。::
+
+ $ sudo damo report heats --heatmap stdout
+ 22222222222222222222222222222222222222211111111111111111111111111111111111111100
+ 44444444444444444444444444444444444444434444444444444444444444444444444444443200
+ 44444444444444444444444444444444444444433444444444444444444444444444444444444200
+ 33333333333333333333333333333333333333344555555555555555555555555555555555555200
+ 33333333333333333333333333333333333344444444444444444444444444444444444444444200
+ 22222222222222222222222222222222222223355555555555555555555555555555555555555200
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ 33333333333333333333333333333333333333355555555555555555555555555555555555555200
+ 88888888888888888888888888888888888888600000000000000000000000000000000000000000
+ 88888888888888888888888888888888888888600000000000000000000000000000000000000000
+ 33333333333333333333333333333333333333444444444444444444444444444444444444443200
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ [...]
+ # access_frequency: 0 1 2 3 4 5 6 7 8 9
+ # x-axis: space (139728247021568-139728453431248: 196.848 MiB)
+ # y-axis: time (15256597248362-15326899978162: 1 m 10.303 s)
+ # resolution: 80x40 (2.461 MiB and 1.758 s for each character)
+
+你也可以直观地看到工作集的大小分布,按大小排序。::
+
+ $ sudo damo report wss --range 0 101 10
+ # <percentile> <wss>
+ # target_id 18446632103789443072
+ # avr: 107.708 MiB
+ 0 0 B | |
+ 10 95.328 MiB |**************************** |
+ 20 95.332 MiB |**************************** |
+ 30 95.340 MiB |**************************** |
+ 40 95.387 MiB |**************************** |
+ 50 95.387 MiB |**************************** |
+ 60 95.398 MiB |**************************** |
+ 70 95.398 MiB |**************************** |
+ 80 95.504 MiB |**************************** |
+ 90 190.703 MiB |********************************************************* |
+ 100 196.875 MiB |***********************************************************|
+
+在上述命令中使用 ``--sortby`` 选项,可以显示工作集的大小是如何按时间顺序变化的。::
+
+ $ sudo damo report wss --range 0 101 10 --sortby time
+ # <percentile> <wss>
+ # target_id 18446632103789443072
+ # avr: 107.708 MiB
+ 0 3.051 MiB | |
+ 10 190.703 MiB |***********************************************************|
+ 20 95.336 MiB |***************************** |
+ 30 95.328 MiB |***************************** |
+ 40 95.387 MiB |***************************** |
+ 50 95.332 MiB |***************************** |
+ 60 95.320 MiB |***************************** |
+ 70 95.398 MiB |***************************** |
+ 80 95.398 MiB |***************************** |
+ 90 95.340 MiB |***************************** |
+ 100 95.398 MiB |***************************** |
+
+
+数据访问模式感知的内存管理
+==========================
+
+以下三个命令使每一个大小>=4K的内存区域在你的工作负载中没有被访问>=60秒,就会被换掉。 ::
+
+ $ echo "#min-size max-size min-acc max-acc min-age max-age action" > test_scheme
+ $ echo "4K max 0 0 60s max pageout" >> test_scheme
+ $ damo schemes -c test_scheme <pid of your workload>
diff --git a/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst b/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst
new file mode 100644
index 000000000000..5d7533347216
--- /dev/null
+++ b/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst
@@ -0,0 +1,286 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../../disclaimer-zh_CN.rst
+
+:Original: Documentation/admin-guide/mm/damon/usage.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+========
+详细用法
+========
+
+DAMON 为不同的用户提供了下面三种接口。
+
+- *DAMON用户空间工具。*
+ `这 <https://github.com/awslabs/damo>`_ 为有这特权的人, 如系统管理员,希望有一个刚好
+ 可以工作的人性化界面。
+ 使用它,用户可以以人性化的方式使用DAMON的主要功能。不过,它可能不会为特殊情况进行高度调整。
+ 它同时支持虚拟和物理地址空间的监测。更多细节,请参考它的 `使用文档
+ <https://github.com/awslabs/damo/blob/next/USAGE.md>`_。
+- *debugfs接口。*
+ :ref:`这 <debugfs_interface>` 是为那些希望更高级的使用DAMON的特权用户空间程序员准备的。
+ 使用它,用户可以通过读取和写入特殊的debugfs文件来使用DAMON的主要功能。因此,你可以编写和使
+ 用你个性化的DAMON debugfs包装程序,代替你读/写debugfs文件。 `DAMON用户空间工具
+ <https://github.com/awslabs/damo>`_ 就是这种程序的一个例子 它同时支持虚拟和物理地址
+ 空间的监测。注意,这个界面只提供简单的监测结果 :ref:`统计 <damos_stats>`。对于详细的监测
+ 结果,DAMON提供了一个:ref:`跟踪点 <tracepoint>`。
+
+- *内核空间编程接口。*
+ :doc:`This </vm/damon/api>` 这是为内核空间程序员准备的。使用它,用户可以通过为你编写内
+ 核空间的DAMON应用程序,最灵活有效地利用DAMON的每一个功能。你甚至可以为各种地址空间扩展DAMON。
+ 详细情况请参考接口 :doc:`文件 </vm/damon/api>`。
+
+
+debugfs接口
+===========
+
+DAMON导出了八个文件, ``attrs``, ``target_ids``, ``init_regions``,
+``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` 和
+``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.
+
+
+属性
+----
+
+用户可以通过读取和写入 ``attrs`` 文件获得和设置 ``采样间隔`` 、 ``聚集间隔`` 、 ``区域更新间隔``
+以及监测目标区域的最小/最大数量。要详细了解监测属性,请参考 `:doc:/vm/damon/design` 。例如,
+下面的命令将这些值设置为5ms、100ms、1000ms、10和1000,然后再次检查::
+
+ # cd <debugfs>/damon
+ # echo 5000 100000 1000000 10 1000 > attrs
+ # cat attrs
+ 5000 100000 1000000 10 1000
+
+
+目标ID
+------
+
+一些类型的地址空间支持多个监测目标。例如,虚拟内存地址空间的监测可以有多个进程作为监测目标。用户
+可以通过写入目标的相关id值来设置目标,并通过读取 ``target_ids`` 文件来获得当前目标的id。在监
+测虚拟地址空间的情况下,这些值应该是监测目标进程的pid。例如,下面的命令将pid为42和4242的进程设
+为监测目标,并再次检查::
+
+ # cd <debugfs>/damon
+ # echo 42 4242 > target_ids
+ # cat target_ids
+ 42 4242
+
+用户还可以通过在文件中写入一个特殊的关键字 "paddr\n" 来监测系统的物理内存地址空间。因为物理地
+址空间监测不支持多个目标,读取文件会显示一个假值,即 ``42`` ,如下图所示::
+
+ # cd <debugfs>/damon
+ # echo paddr > target_ids
+ # cat target_ids
+ 42
+
+请注意,设置目标ID并不启动监测。
+
+
+初始监测目标区域
+----------------
+
+在虚拟地址空间监测的情况下,DAMON自动设置和更新监测的目标区域,这样就可以覆盖目标进程的整个
+内存映射。然而,用户可能希望将监测区域限制在特定的地址范围内,如堆、栈或特定的文件映射区域。
+或者,一些用户可以知道他们工作负载的初始访问模式,因此希望为“自适应区域调整”设置最佳初始区域。
+
+相比之下,DAMON在物理内存监测的情况下不会自动设置和更新监测目标区域。因此,用户应该自己设置
+监测目标区域。
+
+在这种情况下,用户可以通过在 ``init_regions`` 文件中写入适当的值,明确地设置他们想要的初
+始监测目标区域。输入的每一行应代表一个区域,形式如下::
+
+ <target idx> <start address> <end address>
+
+目标idx应该是 ``target_ids`` 文件中目标的索引,从 ``0`` 开始,区域应该按照地址顺序传递。
+例如,下面的命令将设置几个地址范围, ``1-100`` 和 ``100-200`` 作为pid 42的初始监测目标
+区域,这是 ``target_ids`` 中的第一个(索引 ``0`` ),另外几个地址范围, ``20-40`` 和
+``50-100`` 作为pid 4242的地址,这是 ``target_ids`` 中的第二个(索引 ``1`` )::
+
+ # cd <debugfs>/damon
+ # cat target_ids
+ 42 4242
+ # echo "0 1 100
+ 0 100 200
+ 1 20 40
+ 1 50 100" > init_regions
+
+请注意,这只是设置了初始的监测目标区域。在虚拟内存监测的情况下,DAMON会在一个 ``区域更新间隔``
+后自动更新区域的边界。因此,在这种情况下,如果用户不希望更新的话,应该把 ``区域的更新间隔`` 设
+置得足够大。
+
+
+方案
+----
+
+对于通常的基于DAMON的数据访问感知的内存管理优化,用户只是希望系统对特定访问模式的内存区域应用内
+存管理操作。DAMON从用户那里接收这种形式化的操作方案,并将这些方案应用到目标进程中。
+
+用户可以通过读取和写入 ``scheme`` debugfs文件来获得和设置这些方案。读取该文件还可以显示每个
+方案的统计数据。在文件中,每一个方案都应该在每一行中以下列形式表示出来::
+
+ <target access pattern> <action> <quota> <watermarks>
+
+你可以通过简单地在文件中写入一个空字符串来禁用方案。
+
+目标访问模式
+~~~~~~~~~~~~
+
+``<目标访问模式>`` 是由三个范围构成的,形式如下::
+
+ min-size max-size min-acc max-acc min-age max-age
+
+具体来说,区域大小的字节数( `min-size` 和 `max-size` ),访问频率的每聚合区间的监测访问次
+数( `min-acc` 和 `max-acc` ),区域年龄的聚合区间数( `min-age` 和 `max-age` )都被指定。
+请注意,这些范围是封闭区间。
+
+动作
+~~~~
+
+``<action>`` 是一个预定义的内存管理动作的整数,DAMON将应用于具有目标访问模式的区域。支持
+的数字和它们的含义如下::
+
+ - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``
+ - 1: Call ``madvise()`` for the region with ``MADV_COLD``
+ - 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``
+ - 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``
+ - 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
+ - 5: Do nothing but count the statistics
+
+配额
+~~~~
+
+每个 ``动作`` 的最佳 ``目标访问模式`` 取决于工作负载,所以不容易找到。更糟糕的是,将某个
+动作的方案设置得过于激进会导致严重的开销。为了避免这种开销,用户可以通过下面表格中的 ``<quota>``
+来限制方案的时间和大小配额::
+
+ <ms> <sz> <reset interval> <priority weights>
+
+这使得DAMON在 ``<reset interval>`` 毫秒内,尽量只用 ``<ms>`` 毫秒的时间对 ``目标访
+问模式`` 的内存区域应用动作,并在 ``<reset interval>`` 内只对最多<sz>字节的内存区域应
+用动作。将 ``<ms>`` 和 ``<sz>`` 都设置为零,可以禁用配额限制。
+
+当预计超过配额限制时,DAMON会根据 ``目标访问模式`` 的大小、访问频率和年龄,对发现的内存
+区域进行优先排序。为了实现个性化的优先级,用户可以在 ``<优先级权重>`` 中设置这三个属性的
+权重,具体形式如下::
+
+ <size weight> <access frequency weight> <age weight>
+
+水位
+~~~~
+
+有些方案需要根据系统特定指标的当前值来运行,如自由内存比率。对于这种情况,用户可以为该条
+件指定水位。::
+
+ <metric> <check interval> <high mark> <middle mark> <low mark>
+
+``<metric>`` 是一个预定义的整数,用于要检查的度量。支持的数字和它们的含义如下。
+
+ - 0: 忽视水位
+ - 1: 系统空闲内存率 (千分比)
+
+每隔 ``<检查间隔>`` 微秒检查一次公制的值。
+
+如果该值高于 ``<高标>`` 或低于 ``<低标>`` ,该方案被停用。如果该值低于 ``<中标>`` ,
+该方案将被激活。
+
+统计数据
+~~~~~~~~
+
+它还统计每个方案被尝试应用的区域的总数量和字节数,每个方案被成功应用的区域的两个数量,以
+及超过配额限制的总数量。这些统计数据可用于在线分析或调整方案。
+
+统计数据可以通过读取方案文件来显示。读取该文件将显示你在每一行中输入的每个 ``方案`` ,
+统计的五个数字将被加在每一行的末尾。
+
+例子
+~~~~
+
+下面的命令应用了一个方案:”如果一个大小为[4KiB, 8KiB]的内存区域在[10, 20]的聚合时间
+间隔内显示出每一个聚合时间间隔[0, 5]的访问量,请分页出该区域。对于分页,每秒最多只能使
+用10ms,而且每秒分页不能超过1GiB。在这一限制下,首先分页出具有较长年龄的内存区域。另外,
+每5秒钟检查一次系统的可用内存率,当可用内存率低于50%时开始监测和分页,但如果可用内存率
+大于60%,或低于30%,则停止监测“::
+
+ # cd <debugfs>/damon
+ # scheme="4096 8192 0 5 10 20 2" # target access pattern and action
+ # scheme+=" 10 $((1024*1024*1024)) 1000" # quotas
+ # scheme+=" 0 0 100" # prioritization weights
+ # scheme+=" 1 5000000 600 500 300" # watermarks
+ # echo "$scheme" > schemes
+
+
+开关
+----
+
+除非你明确地启动监测,否则如上所述的文件设置不会产生效果。你可以通过写入和读取 ``monitor_on``
+文件来启动、停止和检查监测的当前状态。写入 ``on`` 该文件可以启动对有属性的目标的监测。写入
+``off`` 该文件则停止这些目标。如果每个目标进程被终止,DAMON也会停止。下面的示例命令开启、关
+闭和检查DAMON的状态::
+
+ # cd <debugfs>/damon
+ # echo on > monitor_on
+ # echo off > monitor_on
+ # cat monitor_on
+ off
+
+请注意,当监测开启时,你不能写到上述的debugfs文件。如果你在DAMON运行时写到这些文件,将会返
+回一个错误代码,如 ``-EBUSY`` 。
+
+
+监测线程PID
+-----------
+
+DAMON通过一个叫做kdamond的内核线程来进行请求监测。你可以通过读取 ``kdamond_pid`` 文件获
+得该线程的 ``pid`` 。当监测被 ``关闭`` 时,读取该文件不会返回任何信息::
+
+ # cd <debugfs>/damon
+ # cat monitor_on
+ off
+ # cat kdamond_pid
+ none
+ # echo on > monitor_on
+ # cat kdamond_pid
+ 18594
+
+
+使用多个监测线程
+----------------
+
+每个监测上下文都会创建一个 ``kdamond`` 线程。你可以使用 ``mk_contexts`` 和 ``rm_contexts``
+文件为多个 ``kdamond`` 需要的用例创建和删除监测上下文。
+
+将新上下文的名称写入 ``mk_contexts`` 文件,在 ``DAMON debugfs`` 目录上创建一个该名称的目录。
+该目录将有该上下文的 ``DAMON debugfs`` 文件::
+
+ # cd <debugfs>/damon
+ # ls foo
+ # ls: cannot access 'foo': No such file or directory
+ # echo foo > mk_contexts
+ # ls foo
+ # attrs init_regions kdamond_pid schemes target_ids
+
+如果不再需要上下文,你可以通过把上下文的名字放到 ``rm_contexts`` 文件中来删除它和相应的目录::
+
+ # echo foo > rm_contexts
+ # ls foo
+ # ls: cannot access 'foo': No such file or directory
+
+注意, ``mk_contexts`` 、 ``rm_contexts`` 和 ``monitor_on`` 文件只在根目录下。
+
+
+监测结果的监测点
+================
+
+DAMON通过一个tracepoint ``damon:damon_aggregated`` 提供监测结果. 当监测开启时,你可
+以记录追踪点事件,并使用追踪点支持工具如perf显示结果。比如说::
+
+ # echo on > monitor_on
+ # perf record -e damon:damon_aggregated &
+ # sleep 5
+ # kill 9 $(pidof perf)
+ # echo off > monitor_on
+ # perf script
diff --git a/Documentation/translations/zh_CN/admin-guide/mm/index.rst b/Documentation/translations/zh_CN/admin-guide/mm/index.rst
new file mode 100644
index 000000000000..702271c5b683
--- /dev/null
+++ b/Documentation/translations/zh_CN/admin-guide/mm/index.rst
@@ -0,0 +1,49 @@
+.. include:: ../../disclaimer-zh_CN.rst
+
+:Original: Documentation/admin-guide/mm/index.rst
+
+:翻译:
+
+ 徐鑫 xu xin <xu.xin16@zte.com.cn>
+
+
+========
+内存管理
+========
+
+Linux内存管理子系统,顾名思义,是负责系统中的内存管理。它包括了虚拟内存与请求
+分页的实现,内核内部结构和用户空间程序的内存分配、将文件映射到进程地址空间以
+及许多其他很酷的事情。
+
+Linux内存管理是一个具有许多可配置设置的复杂系统, 且这些设置中的大多数都可以通
+过 ``/proc`` 文件系统获得,并且可以使用 ``sysctl`` 进行查询和调整。这些API接
+口被描述在Documentation/admin-guide/sysctl/vm.rst文件和 `man 5 proc`_ 中。
+
+.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
+
+Linux内存管理有它自己的术语,如果你还不熟悉它,请考虑阅读下面参考:
+:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.
+
+在此目录下,我们详细描述了如何与Linux内存管理中的各种机制交互。
+
+.. toctree::
+ :maxdepth: 1
+
+ damon/index
+ ksm
+
+Todolist:
+* concepts
+* cma_debugfs
+* hugetlbpage
+* idle_page_tracking
+* memory-hotplug
+* nommu-mmap
+* numa_memory_policy
+* numaperf
+* pagemap
+* soft-dirty
+* swap_numa
+* transhuge
+* userfaultfd
+* zswap
diff --git a/Documentation/translations/zh_CN/admin-guide/mm/ksm.rst b/Documentation/translations/zh_CN/admin-guide/mm/ksm.rst
new file mode 100644
index 000000000000..4829156ef1ae
--- /dev/null
+++ b/Documentation/translations/zh_CN/admin-guide/mm/ksm.rst
@@ -0,0 +1,148 @@
+.. include:: ../../disclaimer-zh_CN.rst
+
+:Original: Documentation/admin-guide/mm/ksm.rst
+
+:翻译:
+
+ 徐鑫 xu xin <xu.xin16@zte.com.cn>
+
+
+============
+内核同页合并
+============
+
+
+概述
+====
+
+KSM是一种能节省内存的数据去重功能,由CONFIG_KSM=y启用,并在2.6.32版本时被添
+加到Linux内核。详见 ``mm/ksm.c`` 的实现,以及http://lwn.net/Articles/306704
+和https://lwn.net/Articles/330589
+
+KSM最初目的是为了与KVM(即著名的内核共享内存)一起使用而开发的,通过共享虚拟机
+之间的公共数据,将更多虚拟机放入物理内存。但它对于任何会生成多个相同数据实例的
+应用程序都是很有用的。
+
+KSM的守护进程ksmd会定期扫描那些已注册的用户内存区域,查找内容相同的页面,这些
+页面可以被单个写保护页面替换(如果进程以后想要更新其内容,将自动复制)。使用:
+引用:`sysfs intraface <ksm_sysfs>` 接口来配置KSM守护程序在单个过程中所扫描的页
+数以及两个过程之间的间隔时间。
+
+KSM只合并匿名(私有)页面,从不合并页缓存(文件)页面。KSM的合并页面最初只能被
+锁定在内核内存中,但现在可以就像其他用户页面一样被换出(但当它们被交换回来时共
+享会被破坏: ksmd必须重新发现它们的身份并再次合并)。
+
+以madvise控制KSM
+================
+
+KSM仅在特定的地址空间区域时运行,即应用程序通过使用如下所示的madvise(2)系统调
+用来请求某块地址成为可能的合并候选者的地址空间::
+
+ int madvise(addr, length, MADV_MERGEABLE)
+
+应用程序当然也可以通过调用::
+
+ int madvise(addr, length, MADV_UNMERGEABLE)
+
+来取消该请求,并恢复为非共享页面:此时KSM将去除合并在该范围内的任何合并页。注意:
+这个去除合并的调用可能突然需要的内存量超过实际可用的内存量-那么可能会出现EAGAIN
+失败,但更可能会唤醒OOM killer。
+
+如果KSM未被配置到正在运行的内核中,则madvise MADV_MERGEABLE 和 MADV_UNMERGEABLE
+的调用只会以EINVAL 失败。如果正在运行的内核是用CONFIG_KSM=y方式构建的,那么这些
+调用通常会成功:即使KSM守护程序当前没有运行,MADV_MERGEABLE 仍然会在KSM守护程序
+启动时注册范围,即使该范围不能包含KSM实际可以合并的任何页面,即使MADV_UNMERGEABLE
+应用于从未标记为MADV_MERGEABLE的范围。
+
+如果一块内存区域必须被拆分为至少一个新的MADV_MERGEABLE区域或MADV_UNMERGEABLE区域,
+当该进程将超过 ``vm.max_map_count`` 的设定,则madvise可能返回ENOMEM。(请参阅文档
+Documentation/admin-guide/sysctl/vm.rst)。
+
+与其他madvise调用一样,它们在用户地址空间的映射区域上使用:如果指定的范围包含未
+映射的间隙(尽管在中间的映射区域工作),它们将报告ENOMEM,如果没有足够的内存用于
+内部结构,则可能会因EAGAIN而失败。
+
+KSM守护进程sysfs接口
+====================
+
+KSM守护进程可以由``/sys/kernel/mm/ksm/`` 中的sysfs文件控制,所有人都可以读取,但
+只能由root用户写入。各接口解释如下:
+
+
+pages_to_scan
+ ksmd进程进入睡眠前要扫描的页数。
+ 例如, ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``
+
+ 默认值:100(该值被选择用于演示目的)
+
+sleep_millisecs
+ ksmd在下次扫描前应休眠多少毫秒
+ 例如, ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs``
+
+ 默认值:20(该值被选择用于演示目的)
+
+merge_across_nodes
+ 指定是否可以合并来自不同NUMA节点的页面。当设置为0时,ksm仅合并在物理上位
+ 于同一NUMA节点的内存区域中的页面。这降低了访问共享页面的延迟。在有明显的
+ NUMA距离上,具有更多节点的系统可能受益于设置该值为0时的更低延迟。而对于
+ 需要对内存使用量最小化的较小系统来说,设置该值为1(默认设置)则可能会受
+ 益于更大共享页面。在决定使用哪种设置之前,您可能希望比较系统在每种设置下
+ 的性能。 ``merge_across_nodes`` 仅当系统中没有ksm共享页面时,才能被更改设
+ 置:首先将接口`run` 设置为2从而对页进行去合并,然后在修改
+ ``merge_across_nodes`` 后再将‘run’又设置为1,以根据新设置来重新合并。
+
+ 默认值:1(如早期的发布版本一样合并跨站点)
+
+run
+ * 设置为0可停止ksmd运行,但保留合并页面,
+ * 设置为1可运行ksmd,例如, ``echo 1 > /sys/kernel/mm/ksm/run`` ,
+ * 设置为2可停止ksmd运行,并且对所有目前已合并的页进行去合并,但保留可合并
+ 区域以供下次运行。
+
+ 默认值:0(必须设置为1才能激活KSM,除非禁用了CONFIG_SYSFS)
+
+use_zero_pages
+ 指定是否应当特殊处理空页(即那些仅含zero的已分配页)。当该值设置为1时,
+ 空页与内核零页合并,而不是像通常情况下那样空页自身彼此合并。这可以根据
+ 工作负载的不同,在具有着色零页的架构上可以提高性能。启用此设置时应小心,
+ 因为它可能会降低某些工作负载的KSM性能,比如,当待合并的候选页面的校验和
+ 与空页面的校验和恰好匹配的时候。此设置可随时更改,仅对那些更改后再合并
+ 的页面有效。
+
+ 默认值:0(如同早期版本的KSM正常表现)
+
+max_page_sharing
+ 单个KSM页面允许的最大共享站点数。这将强制执行重复数据消除限制,以避免涉
+ 及遍历共享KSM页面的虚拟映射的虚拟内存操作的高延迟。最小值为2,因为新创
+ 建的KSM页面将至少有两个共享者。该值越高,KSM合并内存的速度越快,去重
+ 因子也越高,但是对于任何给定的KSM页面,虚拟映射的最坏情况遍历的速度也会
+ 越慢。减慢了这种遍历速度就意味着在交换、压缩、NUMA平衡和页面迁移期间,
+ 某些虚拟内存操作将有更高的延迟,从而降低这些虚拟内存操作调用者的响应能力。
+ 其他任务如果不涉及执行虚拟映射遍历的VM操作,其任务调度延迟不受此参数的影
+ 响,因为这些遍历本身是调度友好的。
+
+stable_node_chains_prune_millisecs
+ 指定KSM检查特定页面的元数据的频率(即那些达到过时信息数据去重限制标准的
+ 页面)单位是毫秒。较小的毫秒值将以更低的延迟来释放KSM元数据,但它们将使
+ ksmd在扫描期间使用更多CPU。如果还没有一个KSM页面达到 ``max_page_sharing``
+ 标准,那就没有什么用。
+
+KSM与MADV_MERGEABLE的工作有效性体现于 ``/sys/kernel/mm/ksm/`` 路径下的接口:
+
+pages_shared
+ 表示多少共享页正在被使用
+pages_sharing
+ 表示还有多少站点正在共享这些共享页,即节省了多少
+pages_unshared
+ 表示有多少页是唯一的,但被反复检查以进行合并
+pages_volatile
+ 表示有多少页因变化太快而无法放在tree中
+full_scans
+ 表示所有可合并区域已扫描多少次
+stable_node_chains
+ 达到 ``max_page_sharing`` 限制的KSM页数
+stable_node_dups
+ 重复的KSM页数
+
+比值 ``pages_sharing/pages_shared`` 的最大值受限制于 ``max_page_sharing``
+的设定。要想增加该比值,则相应地要增加 ``max_page_sharing`` 的值。
diff --git a/Documentation/translations/zh_CN/core-api/index.rst b/Documentation/translations/zh_CN/core-api/index.rst
index d10191c45cf1..26d9913fc8b6 100644
--- a/Documentation/translations/zh_CN/core-api/index.rst
+++ b/Documentation/translations/zh_CN/core-api/index.rst
@@ -42,6 +42,7 @@
kref
assoc_array
xarray
+ rbtree
Todolist:
@@ -49,7 +50,6 @@ Todolist:
idr
circular-buffers
- rbtree
generic-radix-tree
packing
bus-virt-phys-mapping
diff --git a/Documentation/translations/zh_CN/core-api/rbtree.rst b/Documentation/translations/zh_CN/core-api/rbtree.rst
new file mode 100644
index 000000000000..a3e1555cb974
--- /dev/null
+++ b/Documentation/translations/zh_CN/core-api/rbtree.rst
@@ -0,0 +1,391 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/core-api/rbtree.rst
+
+:翻译:
+
+ 唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
+
+=========================
+Linux中的红黑树(rbtree)
+=========================
+
+
+:日期: 2007年1月18日
+:作者: Rob Landley <rob@landley.net>
+
+何为红黑树,它们有什么用?
+--------------------------
+
+红黑树是一种自平衡二叉搜索树,被用来存储可排序的键/值数据对。这与基数树(被用来高效
+存储稀疏数组,因此使用长整型下标来插入/访问/删除结点)和哈希表(没有保持排序因而无法
+容易地按序遍历,同时必须调节其大小和哈希函数,然而红黑树可以优雅地伸缩以便存储任意
+数量的键)不同。
+
+红黑树和AVL树类似,但在插入和删除时提供了更快的实时有界的最坏情况性能(分别最多两次
+旋转和三次旋转,来平衡树),查询时间轻微变慢(但时间复杂度仍然是O(log n))。
+
+引用Linux每周新闻(Linux Weekly News):
+
+ 内核中有多处红黑树的使用案例。最后期限调度器和完全公平排队(CFQ)I/O调度器利用
+ 红黑树跟踪请求;数据包CD/DVD驱动程序也是如此。高精度时钟代码使用一颗红黑树组织
+ 未完成的定时器请求。ext3文件系统用红黑树跟踪目录项。虚拟内存区域(VMAs)、epoll
+ 文件描述符、密码学密钥和在“分层令牌桶”调度器中的网络数据包都由红黑树跟踪。
+
+本文档涵盖了对Linux红黑树实现的使用方法。更多关于红黑树的性质和实现的信息,参见:
+
+ Linux每周新闻关于红黑树的文章
+ https://lwn.net/Articles/184495/
+
+ 维基百科红黑树词条
+ https://en.wikipedia.org/wiki/Red-black_tree
+
+红黑树的Linux实现
+-----------------
+
+Linux的红黑树实现在文件“lib/rbtree.c”中。要使用它,需要“#include <linux/rbtree.h>”。
+
+Linux的红黑树实现对速度进行了优化,因此比传统的实现少一个间接层(有更好的缓存局部性)。
+每个rb_node结构体的实例嵌入在它管理的数据结构中,因此不需要靠指针来分离rb_node和它
+管理的数据结构。用户应该编写他们自己的树搜索和插入函数,来调用已提供的红黑树函数,
+而不是使用一个比较回调函数指针。加锁代码也留给红黑树的用户编写。
+
+创建一颗红黑树
+--------------
+
+红黑树中的数据结点是包含rb_node结构体成员的结构体::
+
+ struct mytype {
+ struct rb_node node;
+ char *keystring;
+ };
+
+当处理一个指向内嵌rb_node结构体的指针时,包住rb_node的结构体可用标准的container_of()
+宏访问。此外,个体成员可直接用rb_entry(node, type, member)访问。
+
+每颗红黑树的根是一个rb_root数据结构,它由以下方式初始化为空:
+
+ struct rb_root mytree = RB_ROOT;
+
+在一颗红黑树中搜索值
+--------------------
+
+为你的树写一个搜索函数是相当简单的:从树根开始,比较每个值,然后根据需要继续前往左边或
+右边的分支。
+
+示例::
+
+ struct mytype *my_search(struct rb_root *root, char *string)
+ {
+ struct rb_node *node = root->rb_node;
+
+ while (node) {
+ struct mytype *data = container_of(node, struct mytype, node);
+ int result;
+
+ result = strcmp(string, data->keystring);
+
+ if (result < 0)
+ node = node->rb_left;
+ else if (result > 0)
+ node = node->rb_right;
+ else
+ return data;
+ }
+ return NULL;
+ }
+
+在一颗红黑树中插入数据
+----------------------
+
+在树中插入数据的步骤包括:首先搜索插入新结点的位置,然后插入结点并对树再平衡
+("recoloring")。
+
+插入的搜索和上文的搜索不同,它要找到嫁接新结点的位置。新结点也需要一个指向它的父节点
+的链接,以达到再平衡的目的。
+
+示例::
+
+ int my_insert(struct rb_root *root, struct mytype *data)
+ {
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ /* Figure out where to put new node */
+ while (*new) {
+ struct mytype *this = container_of(*new, struct mytype, node);
+ int result = strcmp(data->keystring, this->keystring);
+
+ parent = *new;
+ if (result < 0)
+ new = &((*new)->rb_left);
+ else if (result > 0)
+ new = &((*new)->rb_right);
+ else
+ return FALSE;
+ }
+
+ /* Add new node and rebalance tree. */
+ rb_link_node(&data->node, parent, new);
+ rb_insert_color(&data->node, root);
+
+ return TRUE;
+ }
+
+在一颗红黑树中删除或替换已经存在的数据
+--------------------------------------
+
+若要从树中删除一个已经存在的结点,调用::
+
+ void rb_erase(struct rb_node *victim, struct rb_root *tree);
+
+示例::
+
+ struct mytype *data = mysearch(&mytree, "walrus");
+
+ if (data) {
+ rb_erase(&data->node, &mytree);
+ myfree(data);
+ }
+
+若要用一个新结点替换树中一个已经存在的键值相同的结点,调用::
+
+ void rb_replace_node(struct rb_node *old, struct rb_node *new,
+ struct rb_root *tree);
+
+通过这种方式替换结点不会对树做重排序:如果新结点的键值和旧结点不同,红黑树可能被
+破坏。
+
+(按排序的顺序)遍历存储在红黑树中的元素
+----------------------------------------
+
+我们提供了四个函数,用于以排序的方式遍历一颗红黑树的内容。这些函数可以在任意红黑树
+上工作,并且不需要被修改或包装(除非加锁的目的)::
+
+ struct rb_node *rb_first(struct rb_root *tree);
+ struct rb_node *rb_last(struct rb_root *tree);
+ struct rb_node *rb_next(struct rb_node *node);
+ struct rb_node *rb_prev(struct rb_node *node);
+
+要开始迭代,需要使用一个指向树根的指针调用rb_first()或rb_last(),它将返回一个指向
+树中第一个或最后一个元素所包含的节点结构的指针。要继续的话,可以在当前结点上调用
+rb_next()或rb_prev()来获取下一个或上一个结点。当没有剩余的结点时,将返回NULL。
+
+迭代器函数返回一个指向被嵌入的rb_node结构体的指针,由此,包住rb_node的结构体可用
+标准的container_of()宏访问。此外,个体成员可直接用rb_entry(node, type, member)
+访问。
+
+示例::
+
+ struct rb_node *node;
+ for (node = rb_first(&mytree); node; node = rb_next(node))
+ printk("key=%s\n", rb_entry(node, struct mytype, node)->keystring);
+
+带缓存的红黑树
+--------------
+
+计算最左边(最小的)结点是二叉搜索树的一个相当常见的任务,例如用于遍历,或用户根据
+他们自己的逻辑依赖一个特定的顺序。为此,用户可以使用'struct rb_root_cached'来优化
+时间复杂度为O(logN)的rb_first()的调用,以简单地获取指针,避免了潜在的昂贵的树迭代。
+维护操作的额外运行时间开销可忽略,不过内存占用较大。
+
+和rb_root结构体类似,带缓存的红黑树由以下方式初始化为空::
+
+ struct rb_root_cached mytree = RB_ROOT_CACHED;
+
+带缓存的红黑树只是一个常规的rb_root,加上一个额外的指针来缓存最左边的节点。这使得
+rb_root_cached可以存在于rb_root存在的任何地方,并且只需增加几个接口来支持带缓存的
+树::
+
+ struct rb_node *rb_first_cached(struct rb_root_cached *tree);
+ void rb_insert_color_cached(struct rb_node *, struct rb_root_cached *, bool);
+ void rb_erase_cached(struct rb_node *node, struct rb_root_cached *);
+
+操作和删除也有对应的带缓存的树的调用::
+
+ void rb_insert_augmented_cached(struct rb_node *node, struct rb_root_cached *,
+ bool, struct rb_augment_callbacks *);
+ void rb_erase_augmented_cached(struct rb_node *, struct rb_root_cached *,
+ struct rb_augment_callbacks *);
+
+
+对增强型红黑树的支持
+--------------------
+
+增强型红黑树是一种在每个结点里存储了“一些”附加数据的红黑树,其中结点N的附加数据
+必须是以N为根的子树中所有结点的内容的函数。它是建立在红黑树基础设施之上的可选特性。
+想要使用这个特性的红黑树用户,插入和删除结点时必须调用增强型接口并提供增强型回调函数。
+
+实现增强型红黑树操作的C文件必须包含<linux/rbtree_augmented.h>而不是<linux/rbtree.h>。
+注意,linux/rbtree_augmented.h暴露了一些红黑树实现的细节而你不应依赖它们,请坚持
+使用文档记录的API,并且不要在头文件中包含<linux/rbtree_augmented.h>,以最小化你的
+用户意外地依赖这些实现细节的可能。
+
+插入时,用户必须更新通往被插入节点的路径上的增强信息,然后像往常一样调用rb_link_node(),
+然后是rb_augment_inserted()而不是平时的rb_insert_color()调用。如果
+rb_augment_inserted()再平衡了红黑树,它将回调至一个用户提供的函数来更新受影响的
+子树上的增强信息。
+
+删除一个结点时,用户必须调用rb_erase_augmented()而不是rb_erase()。
+rb_erase_augmented()回调至一个用户提供的函数来更新受影响的子树上的增强信息。
+
+在两种情况下,回调都是通过rb_augment_callbacks结构体提供的。必须定义3个回调:
+
+- 一个传播回调,它更新一个给定结点和它的祖先们的增强数据,直到一个给定的停止点
+ (如果是NULL,将更新一路更新到树根)。
+
+- 一个复制回调,它将一颗给定子树的增强数据复制到一个新指定的子树树根。
+
+- 一个树旋转回调,它将一颗给定的子树的增强值复制到新指定的子树树根上,并重新计算
+ 先前的子树树根的增强值。
+
+rb_erase_augmented()编译后的代码可能会内联传播、复制回调,这将导致函数体积更大,
+因此每个增强型红黑树的用户应该只有一个rb_erase_augmented()的调用点,以限制编译后
+的代码大小。
+
+
+使用示例
+^^^^^^^^
+
+区间树是增强型红黑树的一个例子。参考Cormen,Leiserson,Rivest和Stein写的
+《算法导论》。区间树的更多细节:
+
+经典的红黑树只有一个键,它不能直接用来存储像[lo:hi]这样的区间范围,也不能快速查找
+与新的lo:hi重叠的部分,或者查找是否有与新的lo:hi完全匹配的部分。
+
+然而,红黑树可以被增强,以一种结构化的方式来存储这种区间范围,从而使高效的查找和
+精确匹配成为可能。
+
+这个存储在每个节点中的“额外信息”是其所有后代结点中的最大hi(max_hi)值。这个信息
+可以保持在每个结点上,只需查看一下该结点和它的直系子结点们。这将被用于时间复杂度
+为O(log n)的最低匹配查找(所有可能的匹配中最低的起始地址),就像这样::
+
+ struct interval_tree_node *
+ interval_tree_first_match(struct rb_root *root,
+ unsigned long start, unsigned long last)
+ {
+ struct interval_tree_node *node;
+
+ if (!root->rb_node)
+ return NULL;
+ node = rb_entry(root->rb_node, struct interval_tree_node, rb);
+
+ while (true) {
+ if (node->rb.rb_left) {
+ struct interval_tree_node *left =
+ rb_entry(node->rb.rb_left,
+ struct interval_tree_node, rb);
+ if (left->__subtree_last >= start) {
+ /*
+ * Some nodes in left subtree satisfy Cond2.
+ * Iterate to find the leftmost such node N.
+ * If it also satisfies Cond1, that's the match
+ * we are looking for. Otherwise, there is no
+ * matching interval as nodes to the right of N
+ * can't satisfy Cond1 either.
+ */
+ node = left;
+ continue;
+ }
+ }
+ if (node->start <= last) { /* Cond1 */
+ if (node->last >= start) /* Cond2 */
+ return node; /* node is leftmost match */
+ if (node->rb.rb_right) {
+ node = rb_entry(node->rb.rb_right,
+ struct interval_tree_node, rb);
+ if (node->__subtree_last >= start)
+ continue;
+ }
+ }
+ return NULL; /* No match */
+ }
+ }
+
+插入/删除是通过以下增强型回调来定义的::
+
+ static inline unsigned long
+ compute_subtree_last(struct interval_tree_node *node)
+ {
+ unsigned long max = node->last, subtree_last;
+ if (node->rb.rb_left) {
+ subtree_last = rb_entry(node->rb.rb_left,
+ struct interval_tree_node, rb)->__subtree_last;
+ if (max < subtree_last)
+ max = subtree_last;
+ }
+ if (node->rb.rb_right) {
+ subtree_last = rb_entry(node->rb.rb_right,
+ struct interval_tree_node, rb)->__subtree_last;
+ if (max < subtree_last)
+ max = subtree_last;
+ }
+ return max;
+ }
+
+ static void augment_propagate(struct rb_node *rb, struct rb_node *stop)
+ {
+ while (rb != stop) {
+ struct interval_tree_node *node =
+ rb_entry(rb, struct interval_tree_node, rb);
+ unsigned long subtree_last = compute_subtree_last(node);
+ if (node->__subtree_last == subtree_last)
+ break;
+ node->__subtree_last = subtree_last;
+ rb = rb_parent(&node->rb);
+ }
+ }
+
+ static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new)
+ {
+ struct interval_tree_node *old =
+ rb_entry(rb_old, struct interval_tree_node, rb);
+ struct interval_tree_node *new =
+ rb_entry(rb_new, struct interval_tree_node, rb);
+
+ new->__subtree_last = old->__subtree_last;
+ }
+
+ static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new)
+ {
+ struct interval_tree_node *old =
+ rb_entry(rb_old, struct interval_tree_node, rb);
+ struct interval_tree_node *new =
+ rb_entry(rb_new, struct interval_tree_node, rb);
+
+ new->__subtree_last = old->__subtree_last;
+ old->__subtree_last = compute_subtree_last(old);
+ }
+
+ static const struct rb_augment_callbacks augment_callbacks = {
+ augment_propagate, augment_copy, augment_rotate
+ };
+
+ void interval_tree_insert(struct interval_tree_node *node,
+ struct rb_root *root)
+ {
+ struct rb_node **link = &root->rb_node, *rb_parent = NULL;
+ unsigned long start = node->start, last = node->last;
+ struct interval_tree_node *parent;
+
+ while (*link) {
+ rb_parent = *link;
+ parent = rb_entry(rb_parent, struct interval_tree_node, rb);
+ if (parent->__subtree_last < last)
+ parent->__subtree_last = last;
+ if (start < parent->start)
+ link = &parent->rb.rb_left;
+ else
+ link = &parent->rb.rb_right;
+ }
+
+ node->__subtree_last = last;
+ rb_link_node(&node->rb, rb_parent, link);
+ rb_insert_augmented(&node->rb, root, &augment_callbacks);
+ }
+
+ void interval_tree_remove(struct interval_tree_node *node,
+ struct rb_root *root)
+ {
+ rb_erase_augmented(&node->rb, root, &augment_callbacks);
+ }
diff --git a/Documentation/translations/zh_CN/devicetree/index.rst b/Documentation/translations/zh_CN/devicetree/index.rst
new file mode 100644
index 000000000000..23d0b6fccd58
--- /dev/null
+++ b/Documentation/translations/zh_CN/devicetree/index.rst
@@ -0,0 +1,50 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/Devicetree/index.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+
+=============================
+Open Firmware 和 Devicetree
+=============================
+
+该文档是整个设备树文档的总目录,标题中多是业内默认的术语,初见就翻译成中文,
+晦涩难懂,因此尽量保留,后面翻译其子文档时,可能会根据语境,灵活地翻译为中文。
+
+内核Devicetree的使用
+=======================
+.. toctree::
+ :maxdepth: 1
+
+ usage-model
+ of_unittest
+
+Todolist:
+
+* kernel-api
+
+Devicetree Overlays
+===================
+.. toctree::
+ :maxdepth: 1
+
+Todolist:
+
+* changesets
+* dynamic-resolution-notes
+* overlay-notes
+
+Devicetree Bindings
+===================
+.. toctree::
+ :maxdepth: 1
+
+Todolist:
+
+* bindings/index
diff --git a/Documentation/translations/zh_CN/devicetree/of_unittest.rst b/Documentation/translations/zh_CN/devicetree/of_unittest.rst
new file mode 100644
index 000000000000..abd94e771ef8
--- /dev/null
+++ b/Documentation/translations/zh_CN/devicetree/of_unittest.rst
@@ -0,0 +1,189 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/Devicetree/of_unittest.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+=================================
+Open Firmware Devicetree 单元测试
+=================================
+
+作者: Gaurav Minocha <gaurav.minocha.os@gmail.com>
+
+1. 概述
+=======
+
+本文档解释了执行 OF 单元测试所需的测试数据是如何动态地附加到实时树上的,与机器的架构无关。
+
+建议在继续读下去之前,先阅读以下文件。
+
+(1) Documentation/devicetree/usage-model.rst
+(2) http://www.devicetree.org/Device_Tree_Usage
+
+OF Selftest被设计用来测试提供给设备驱动开发者的接口(include/linux/of.h),以从未扁平
+化的设备树数据结构中获取设备信息等。这个接口被大多数设备驱动在各种使用情况下使用。
+
+
+2. 测试数据
+===========
+
+设备树源文件(drivers/of/unittest-data/testcases.dts)包含执行drivers/of/unittest.c
+中自动化单元测试所需的测试数据。目前,以下设备树源包含文件(.dtsi)被包含在testcases.dt中::
+
+ drivers/of/unittest-data/tests-interrupts.dtsi
+ drivers/of/unittest-data/tests-platform.dtsi
+ drivers/of/unittest-data/tests-phandle.dtsi
+ drivers/of/unittest-data/tests-match.dtsi
+
+当内核在启用OF_SELFTEST的情况下被构建时,那么下面的make规则::
+
+ $(obj)/%.dtb: $(src)/%.dts FORCE
+ $(call if_changed_dep, dtc)
+
+用于将DT源文件(testcases.dts)编译成二进制blob(testcases.dtb),也被称为扁平化的DT。
+
+之后,使用以下规则将上述二进制blob包装成一个汇编文件(testcases.dtb.S)::
+
+ $(obj)/%.dtb.S: $(obj)/%.dtb
+ $(call cmd, dt_S_dtb)
+
+汇编文件被编译成一个对象文件(testcases.dtb.o),并被链接到内核镜像中。
+
+
+2.1. 添加测试数据
+-----------------
+
+未扁平化的设备树结构体:
+
+未扁平化的设备树由连接的设备节点组成,其树状结构形式如下所述::
+
+ // following struct members are used to construct the tree
+ struct device_node {
+ ...
+ struct device_node *parent;
+ struct device_node *child;
+ struct device_node *sibling;
+ ...
+ };
+
+图1描述了一个机器的未扁平化设备树的通用结构,只考虑了子节点和同级指针。存在另一个指针,
+``*parent`` ,用于反向遍历该树。因此,在一个特定的层次上,子节点和所有的兄弟姐妹节点将
+有一个指向共同节点的父指针(例如,child1、sibling2、sibling3、sibling4的父指针指向
+根节点)::
+
+ root ('/')
+ |
+ child1 -> sibling2 -> sibling3 -> sibling4 -> null
+ | | | |
+ | | | null
+ | | |
+ | | child31 -> sibling32 -> null
+ | | | |
+ | | null null
+ | |
+ | child21 -> sibling22 -> sibling23 -> null
+ | | | |
+ | null null null
+ |
+ child11 -> sibling12 -> sibling13 -> sibling14 -> null
+ | | | |
+ | | | null
+ | | |
+ null null child131 -> null
+ |
+ null
+
+Figure 1: 未扁平化的设备树的通用结构
+
+
+在执行OF单元测试之前,需要将测试数据附加到机器的设备树上(如果存在)。因此,当调用
+selftest_data_add()时,首先会读取通过以下内核符号链接到内核镜像中的扁平化设备树
+数据::
+
+ __dtb_testcases_begin - address marking the start of test data blob
+ __dtb_testcases_end - address marking the end of test data blob
+
+其次,它调用of_fdt_unflatten_tree()来解除扁平化的blob。最后,如果机器的设备树
+(即实时树)是存在的,那么它将未扁平化的测试数据树附加到实时树上,否则它将自己作为
+实时设备树附加。
+
+attach_node_and_children()使用of_attach_node()将节点附加到实时树上,如下所
+述。为了解释这一点,图2中描述的测试数据树被附加到图1中描述的实时树上::
+
+ root ('/')
+ |
+ testcase-data
+ |
+ test-child0 -> test-sibling1 -> test-sibling2 -> test-sibling3 -> null
+ | | | |
+ test-child01 null null null
+
+
+Figure 2: 将测试数据树附在实时树上的例子。
+
+根据上面的方案,实时树已经存在,所以不需要附加根('/')节点。所有其他节点都是通过在
+每个节点上调用of_attach_node()来附加的。
+
+在函数of_attach_node()中,新的节点被附在实时树中给定的父节点的子节点上。但是,如
+果父节点已经有了一个孩子,那么新节点就会取代当前的孩子,并将其变成其兄弟姐妹。因此,
+当测试案例的数据节点被连接到上面的实时树(图1)时,最终的结构如图3所示::
+
+ root ('/')
+ |
+ testcase-data -> child1 -> sibling2 -> sibling3 -> sibling4 -> null
+ | | | | |
+ (...) | | | null
+ | | child31 -> sibling32 -> null
+ | | | |
+ | | null null
+ | |
+ | child21 -> sibling22 -> sibling23 -> null
+ | | | |
+ | null null null
+ |
+ child11 -> sibling12 -> sibling13 -> sibling14 -> null
+ | | | |
+ null null | null
+ |
+ child131 -> null
+ |
+ null
+ -----------------------------------------------------------------------
+
+ root ('/')
+ |
+ testcase-data -> child1 -> sibling2 -> sibling3 -> sibling4 -> null
+ | | | | |
+ | (...) (...) (...) null
+ |
+ test-sibling3 -> test-sibling2 -> test-sibling1 -> test-child0 -> null
+ | | | |
+ null null null test-child01
+
+
+Figure 3: 附加测试案例数据后的实时设备树结构。
+
+
+聪明的读者会注意到,与先前的结构相比,test-child0节点成为最后一个兄弟姐妹(图2)。
+在连接了第一个test-child0节点之后,又连接了test-sibling1节点,该节点推动子节点
+(即test-child0)成为兄弟姐妹,并使自己成为子节点,如上所述。
+
+如果发现一个重复的节点(即如果一个具有相同full_name属性的节点已经存在于实时树中),
+那么该节点不会被附加,而是通过调用函数update_node_properties()将其属性更新到活
+树的节点中。
+
+
+2.2. 删除测试数据
+-----------------
+
+一旦测试用例执行完,selftest_data_remove被调用,以移除最初连接的设备节点(首先是
+叶子节点被分离,然后向上移动父节点被移除,最后是整个树)。selftest_data_remove()
+调用detach_node_and_children(),使用of_detach_node()将节点从实时设备树上分离。
+
+为了分离一个节点,of_detach_node()要么将给定节点的父节点的子节点指针更新为其同级节
+点,要么根据情况将前一个同级节点附在给定节点的同级节点上。就这样吧。 :)
diff --git a/Documentation/translations/zh_CN/devicetree/usage-model.rst b/Documentation/translations/zh_CN/devicetree/usage-model.rst
new file mode 100644
index 000000000000..318a3c6a0114
--- /dev/null
+++ b/Documentation/translations/zh_CN/devicetree/usage-model.rst
@@ -0,0 +1,330 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/Devicetree/usage-model.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+
+===================
+Linux 和 Devicetree
+===================
+
+Linux对设备树数据的使用模型
+
+:作者: Grant Likely <grant.likely@secretlab.ca>
+
+这篇文章描述了Linux如何使用设备树。关于设备树数据格式的概述可以在
+devicetree.org\ [1]_ 的设备树使用页面上找到。
+
+.. [1] https://www.devicetree.org/specifications/
+
+"Open Firmware Device Tree",或简称为Devicetree(DT),是一种用于描述硬
+件的数据结构和语言。更确切地说,它是一种操作系统可读的硬件描述,这样操作系统就不
+需要对机器的细节进行硬编码。
+
+从结构上看,DT是一棵树,或者说是带有命名节点的无环图,节点可以有任意数量的命名
+属性来封装任意的数据。还存在一种机制,可以在自然的树状结构之外创建从一个节点到
+另一个节点的任意链接。
+
+从概念上讲,一套通用的使用惯例,称为 "bindings"(后文译为绑定),被定义为数据
+应该如何出现在树中,以描述典型的硬件特性,包括数据总线、中断线、GPIO连接和外围
+设备。
+
+尽可能使用现有的绑定来描述硬件,以最大限度地利用现有的支持代码,但由于属性和节
+点名称是简单的文本字符串,通过定义新的节点和属性来扩展现有的绑定或创建新的绑定
+很容易。然而,要警惕的是,在创建一个新的绑定之前,最好先对已经存在的东西做一些
+功课。目前有两种不同的、不兼容的i2c总线的绑定,这是因为在创建新的绑定时没有事先
+调查i2c设备在现有系统中是如何被枚举的。
+
+1. 历史
+-------
+DT最初是由Open Firmware创建的,作为将数据从Open Firmware传递给客户程序
+(如传递给操作系统)的通信方法的一部分。操作系统使用设备树在运行时探测硬件的拓
+扑结构,从而在没有硬编码信息的情况下支持大多数可用的硬件(假设所有设备的驱动程
+序都可用)。
+
+由于Open Firmware通常在PowerPC和SPARC平台上使用,长期以来,对这些架构的
+Linux支持一直使用设备树。
+
+2005年,当PowerPC Linux开始大规模清理并合并32位和64位支持时,决定在所有
+Powerpc平台上要求DT支持,无论它们是否使用Open Firmware。为了做到这一点,
+我们创建了一个叫做扁平化设备树(FDT)的DT表示法,它可以作为一个二进制的blob
+传递给内核,而不需要真正的Open Firmware实现。U-Boot、kexec和其他引导程序
+被修改,以支持传递设备树二进制(dtb)和在引导时修改dtb。DT也被添加到PowerPC
+引导包装器(arch/powerpc/boot/\*)中,这样dtb就可以被包裹在内核镜像中,以
+支持引导现有的非DT察觉的固件。
+
+一段时间后,FDT基础架构被普及到了所有的架构中。在写这篇文章的时候,6个主线架
+构(arm、microblaze、mips、powerpc、sparc和x86)和1个非主线架构(ios)
+有某种程度的DT支持。
+
+1. 数据模型
+-----------
+如果你还没有读过设备树用法\ [1]_页,那么现在就去读吧。没关系,我等着....
+
+2.1 高层次视角
+--------------
+最重要的是要明白,DT只是一个描述硬件的数据结构。它没有什么神奇之处,也不会神
+奇地让所有的硬件配置问题消失。它所做的是提供一种语言,将硬件配置与Linux内核
+(或任何其他操作系统)中的板卡和设备驱动支持解耦。使用它可以使板卡和设备支持
+变成数据驱动;根据传递到内核的数据做出设置决定,而不是根据每台机器的硬编码选
+择。
+
+理想情况下,数据驱动的平台设置应该导致更少的代码重复,并使其更容易用一个内核
+镜像支持各种硬件。
+
+Linux使用DT数据有三个主要目的:
+
+1) 平台识别。
+2) 运行时配置,以及
+3) 设备数量。
+
+2.2 平台识别
+------------
+首先,内核将使用DT中的数据来识别特定的机器。在一个理想的世界里,具体的平台对
+内核来说并不重要,因为所有的平台细节都会被设备树以一致和可靠的方式完美描述。
+但是,硬件并不完美,所以内核必须在早期启动时识别机器,以便有机会运行特定于机
+器的修复程序。
+
+在大多数情况下,机器的身份是不相关的,而内核将根据机器的核心CPU或SoC来选择
+设置代码。例如,在ARM上,arch/arm/kernel/setup.c中的setup_arch()将调
+用arch/arm/kernel/devtree.c中的setup_machine_fdt(),它通过
+machine_desc表搜索并选择与设备树数据最匹配的machine_desc。它通过查看根
+设备树节点中的'compatible'属性,并将其与struct machine_desc中的
+dt_compat列表(如果你好奇,该列表定义在arch/arm/include/asm/mach/arch.h
+中)进行比较,从而确定最佳匹配。
+
+“compatible” 属性包含一个排序的字符串列表,以机器的确切名称开始,后面是
+一个可选的与之兼容的板子列表,从最兼容到最不兼容排序。例如,TI BeagleBoard
+和它的后继者BeagleBoard xM板的根兼容属性可能看起来分别为::
+
+ compatible = "ti,omap3-beagleboard", "ti,omap3450", "ti,omap3";
+ compatible = "ti,omap3-beagleboard-xm", "ti,omap3450", "ti,omap3";
+
+其中 "ti,map3-beagleboard-xm "指定了确切的型号,它还声称它与OMAP 3450 SoC
+以及一般的OMP3系列SoC兼容。你会注意到,该列表从最具体的(确切的板子)到最
+不具体的(SoC系列)进行排序。
+
+聪明的读者可能会指出,Beagle xM也可以声称与原Beagle板兼容。然而,我们应
+该当心在板级上这样做,因为通常情况下,即使在同一产品系列中,每块板都有很高
+的变化,而且当一块板声称与另一块板兼容时,很难确定到底是什么意思。对于高层
+来说,最好是谨慎行事,不要声称一块板子与另一块板子兼容。值得注意的例外是,
+当一块板子是另一块板子的载体时,例如CPU模块连接到一个载体板上。
+
+关于兼容值还有一个注意事项。在兼容属性中使用的任何字符串都必须有文件说明它
+表示什么。在Documentation/devicetree/bindings中添加兼容字符串的文档。
+
+同样在ARM上,对于每个machine_desc,内核会查看是否有任何dt_compat列表条
+目出现在兼容属性中。如果有,那么该机器_desc就是驱动该机器的候选者。在搜索
+了整个machine_descs表之后,setup_machine_fdt()根据每个machine_desc
+在兼容属性中匹配的条目,返回 “最兼容” 的machine_desc。如果没有找到匹配
+的machine_desc,那么它将返回NULL。
+
+这个方案背后的原因是观察到,在大多数情况下,如果它们都使用相同的SoC或相同
+系列的SoC,一个机器_desc可以支持大量的电路板。然而,不可避免地会有一些例
+外情况,即特定的板子需要特殊的设置代码,这在一般情况下是没有用的。特殊情况
+可以通过在通用设置代码中明确检查有问题的板子来处理,但如果超过几个情况下,
+这样做很快就会变得很难看和/或无法维护。
+
+相反,兼容列表允许通用机器_desc通过在dt_compat列表中指定“不太兼容”的值
+来提供对广泛的通用板的支持。在上面的例子中,通用板支持可以声称与“ti,ompa3”
+或“ti,ompa3450”兼容。如果在最初的beagleboard上发现了一个bug,需要在
+早期启动时使用特殊的变通代码,那么可以添加一个新的machine_desc,实现变通,
+并且只在“ti,omap3-beagleboard”上匹配。
+
+PowerPC使用了一个稍微不同的方案,它从每个机器_desc中调用.probe()钩子,
+并使用第一个返回TRUE的钩子。然而,这种方法没有考虑到兼容列表的优先级,对于
+新的架构支持可能应该避免。
+
+2.3 运行时配置
+--------------
+在大多数情况下,DT是将数据从固件传递给内核的唯一方法,所以也被用来传递运行
+时和配置数据,如内核参数字符串和initrd镜像的位置。
+
+这些数据大部分都包含在/chosen节点中,当启动Linux时,它看起来就像这样::
+
+ chosen {
+ bootargs = "console=ttyS0,115200 loglevel=8";
+ initrd-start = <0xc8000000>;
+ initrd-end = <0xc8200000>;
+ };
+
+bootargs属性包含内核参数,initrd-\*属性定义initrd blob的地址和大小。注
+意initrd-end是initrd映像后的第一个地址,所以这与结构体资源的通常语义不一
+致。选择的节点也可以选择包含任意数量的额外属性,用于平台特定的配置数据。
+
+在早期启动过程中,架构设置代码通过不同的辅助回调函数多次调用
+of_scan_flat_dt()来解析设备树数据,然后进行分页设置。of_scan_flat_dt()
+代码扫描设备树,并使用辅助函数来提取早期启动期间所需的信息。通常情况下,
+early_init_dt_scan_chosen()辅助函数用于解析所选节点,包括内核参数,
+early_init_dt_scan_root()用于初始化DT地址空间模型,early_init_dt_scan_memory()
+用于确定可用RAM的大小和位置。
+
+在ARM上,函数setup_machine_fdt()负责在选择支持板子的正确machine_desc
+后,对设备树进行早期扫描。
+
+2.4 设备数量
+------------
+在电路板被识别后,在早期配置数据被解析后,内核初始化可以以正常方式进行。在
+这个过程中的某个时刻,unflatten_device_tree()被调用以将数据转换成更有
+效的运行时表示。这也是调用机器特定设置钩子的时候,比如ARM上的machine_desc
+.init_early()、.init_irq()和.init_machine()钩子。本节的其余部分使用
+了ARM实现的例子,但所有架构在使用DT时都会做几乎相同的事情。
+
+从名称上可以猜到,.init_early()用于在启动过程早期需要执行的任何机器特定设
+置,而.init_irq()则用于设置中断处理。使用DT并不会实质性地改变这两个函数的
+行为。如果提供了DT,那么.init_early()和.init_irq()都能调用任何一个DT查
+询函数(of_* in include/linux/of*.h),以获得关于平台的额外数据。
+
+DT上下文中最有趣的钩子是.init_machine(),它主要负责将平台的数据填充到
+Linux设备模型中。历史上,这在嵌入式平台上是通过在板卡support .c文件中定
+义一组静态时钟结构、platform_devices和其他数据,并在.init_machine()中
+大量注册来实现的。当使用DT时,就不用为每个平台的静态设备进行硬编码,可以通过
+解析DT获得设备列表,并动态分配设备结构体。
+
+最简单的情况是,.init_machine()只负责注册一个platform_devices。
+platform_device是Linux使用的一个概念,用于不能被硬件检测到的内存或I/O映
+射的设备,以及“复合”或 “虚拟”设备(后面会详细介绍)。虽然DT没有“平台设备”的
+术语,但平台设备大致对应于树根的设备节点和简单内存映射总线节点的子节点。
+
+现在是举例说明的好时机。下面是NVIDIA Tegra板的设备树的一部分::
+
+ /{
+ compatible = "nvidia,harmony", "nvidia,tegra20";
+ #address-cells = <1>;
+ #size-cells = <1>;
+ interrupt-parent = <&intc>;
+
+ chosen { };
+ aliases { };
+
+ memory {
+ device_type = "memory";
+ reg = <0x00000000 0x40000000>;
+ };
+
+ soc {
+ compatible = "nvidia,tegra20-soc", "simple-bus";
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges;
+
+ intc: interrupt-controller@50041000 {
+ compatible = "nvidia,tegra20-gic";
+ interrupt-controller;
+ #interrupt-cells = <1>;
+ reg = <0x50041000 0x1000>, < 0x50040100 0x0100 >;
+ };
+
+ serial@70006300 {
+ compatible = "nvidia,tegra20-uart";
+ reg = <0x70006300 0x100>;
+ interrupts = <122>;
+ };
+
+ i2s1: i2s@70002800 {
+ compatible = "nvidia,tegra20-i2s";
+ reg = <0x70002800 0x100>;
+ interrupts = <77>;
+ codec = <&wm8903>;
+ };
+
+ i2c@7000c000 {
+ compatible = "nvidia,tegra20-i2c";
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <0x7000c000 0x100>;
+ interrupts = <70>;
+
+ wm8903: codec@1a {
+ compatible = "wlf,wm8903";
+ reg = <0x1a>;
+ interrupts = <347>;
+ };
+ };
+ };
+
+ sound {
+ compatible = "nvidia,harmony-sound";
+ i2s-controller = <&i2s1>;
+ i2s-codec = <&wm8903>;
+ };
+ };
+
+在.init_machine()时,Tegra板支持代码将需要查看这个DT,并决定为哪些节点
+创建platform_devices。然而,看一下这个树,并不能立即看出每个节点代表什么
+类型的设备,甚至不能看出一个节点是否代表一个设备。/chosen、/aliases和
+/memory节点是信息节点,并不描述设备(尽管可以说内存可以被认为是一个设备)。
+/soc节点的子节点是内存映射的设备,但是codec@1a是一个i2c设备,而sound节
+点代表的不是一个设备,而是其他设备是如何连接在一起以创建音频子系统的。我知
+道每个设备是什么,因为我熟悉电路板的设计,但是内核怎么知道每个节点该怎么做?
+
+诀窍在于,内核从树的根部开始,寻找具有“兼容”属性的节点。首先,一般认为任何
+具有“兼容”属性的节点都代表某种设备;其次,可以认为树根的任何节点要么直接连
+接到处理器总线上,要么是无法用其他方式描述的杂项系统设备。对于这些节点中的
+每一个,Linux都会分配和注册一个platform_device,它又可能被绑定到一个
+platform_driver。
+
+为什么为这些节点使用platform_device是一个安全的假设?嗯,就Linux对设备
+的建模方式而言,几乎所有的总线类型都假定其设备是总线控制器的孩子。例如,每
+个i2c_client是i2c_master的一个子节点。每个spi_device都是SPI总线的一
+个子节点。类似的还有USB、PCI、MDIO等。同样的层次结构也出现在DT中,I2C设
+备节点只作为I2C总线节点的子节点出现。同理,SPI、MDIO、USB等等。唯一不需
+要特定类型的父设备的设备是platform_devices(和amba_devices,但后面会
+详细介绍),它们将愉快地运行在Linux/sys/devices树的底部。因此,如果一个
+DT节点位于树的根部,那么它真的可能最好注册为platform_device。
+
+Linux板支持代码调用of_platform_populate(NULL, NULL, NULL, NULL)来
+启动树根的设备发现。参数都是NULL,因为当从树的根部开始时,不需要提供一个起
+始节点(第一个NULL),一个父结构设备(最后一个NULL),而且我们没有使用匹配
+表(尚未)。对于只需要注册设备的板子,除了of_platform_populate()的调用,
+.init_machine()可以完全为空。
+
+在Tegra的例子中,这说明了/soc和/sound节点,但是SoC节点的子节点呢?它们
+不应该也被注册为平台设备吗?对于Linux DT支持,一般的行为是子设备在驱动
+.probe()时被父设备驱动注册。因此,一个i2c总线设备驱动程序将为每个子节点
+注册一个i2c_client,一个SPI总线驱动程序将注册其spi_device子节点,其他
+总线类型也是如此。根据该模型,可以编写一个与SoC节点绑定的驱动程序,并简单
+地为其每个子节点注册platform_device。板卡支持代码将分配和注册一个SoC设
+备,一个(理论上的)SoC设备驱动程序可以绑定到SoC设备,并在其.probe()钩
+中为/soc/interruptcontroller、/soc/serial、/soc/i2s和/soc/i2c注
+册platform_devices。很简单,对吗?
+
+实际上,事实证明,将一些platform_device的子设备注册为更多的platform_device
+是一种常见的模式,设备树支持代码反映了这一点,并使上述例子更简单。
+of_platform_populate()的第二个参数是一个of_device_id表,任何与该表
+中的条目相匹配的节点也将获得其子节点的注册。在Tegra的例子中,代码可以是
+这样的::
+
+ static void __init harmony_init_machine(void)
+ {
+ /* ... */
+ of_platform_populate(NULL, of_default_bus_match_table, NULL, NULL);
+ }
+
+“simple-bus”在Devicetree规范中被定义为一个属性,意味着一个简单的内存映射
+的总线,所以of_platform_populate()代码可以被写成只是假设简单总线兼容的节
+点将总是被遍历。然而,我们把它作为一个参数传入,以便电路板支持代码可以随时覆
+盖默认行为。
+
+[需要添加关于添加i2c/spi/etc子设备的讨论] 。
+
+附录A:AMBA设备
+---------------
+
+ARM Primecell是连接到ARM AMBA总线的某种设备,它包括对硬件检测和电源管理
+的一些支持。在Linux中,amba_device和amba_bus_type结构体被用来表示
+Primecell设备。然而,棘手的一点是,AMBA总线上的所有设备并非都是Primecell,
+而且对于Linux来说,典型的情况是amba_device和platform_device实例都是同
+一总线段的同义词。
+
+当使用DT时,这给of_platform_populate()带来了问题,因为它必须决定是否将
+每个节点注册为platform_device或amba_device。不幸的是,这使设备创建模型
+变得有点复杂,但解决方案原来并不是太具有侵略性。如果一个节点与“arm,amba-primecell”
+兼容,那么of_platform_populate()将把它注册为amba_device而不是
+platform_device。
diff --git a/Documentation/translations/zh_CN/index.rst b/Documentation/translations/zh_CN/index.rst
index 46e14ec9963d..88d8df957a78 100644
--- a/Documentation/translations/zh_CN/index.rst
+++ b/Documentation/translations/zh_CN/index.rst
@@ -5,7 +5,7 @@
\renewcommand\thesection*
\renewcommand\thesubsection*
\kerneldocCJKon
- \kerneldocBeginSC
+ \kerneldocBeginSC{
.. _linux_doc_zh:
@@ -56,10 +56,14 @@ TODOList:
下列文档描述了内核需要的平台固件相关信息。
+.. toctree::
+ :maxdepth: 2
+
+ devicetree/index
+
TODOList:
* firmware-guide/index
-* devicetree/index
应用程序开发人员文档
--------------------
@@ -104,14 +108,17 @@ TODOList:
:maxdepth: 2
core-api/index
+ accounting/index
cpu-freq/index
iio/index
+ infiniband/index
+ power/index
+ virt/index
sound/index
filesystems/index
- virt/index
- infiniband/index
- accounting/index
scheduler/index
+ vm/index
+ peci/index
TODOList:
@@ -129,7 +136,6 @@ TODOList:
* netlabel/index
* networking/index
* pcmcia/index
-* power/index
* target/index
* timers/index
* spi/index
@@ -140,7 +146,6 @@ TODOList:
* gpu/index
* security/index
* crypto/index
-* vm/index
* bpf/index
* usb/index
* PCI/index
@@ -198,4 +203,4 @@ TODOList:
.. raw:: latex
- \kerneldocEndSC
+ }\kerneldocEndSC
diff --git a/Documentation/translations/zh_CN/peci/index.rst b/Documentation/translations/zh_CN/peci/index.rst
new file mode 100644
index 000000000000..4f6694c828fa
--- /dev/null
+++ b/Documentation/translations/zh_CN/peci/index.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/peci/index.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+=================
+Linux PECI 子系统
+=================
+
+.. toctree::
+
+
+ peci
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/translations/zh_CN/peci/peci.rst b/Documentation/translations/zh_CN/peci/peci.rst
new file mode 100644
index 000000000000..a3b4f99b994c
--- /dev/null
+++ b/Documentation/translations/zh_CN/peci/peci.rst
@@ -0,0 +1,54 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/peci/peci.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+====
+概述
+====
+
+平台环境控制接口(PECI)是英特尔处理器和管理控制器(如底板管理控制器,BMC)
+之间的一个通信接口。PECI提供的服务允许管理控制器通过访问各种寄存器来配置、监
+控和调试平台。它定义了一个专门的命令协议,管理控制器作为PECI的发起者,处理器
+作为PECI的响应者。PECI可以用于基于单处理器和多处理器的系统中。
+
+注意:英特尔PECI规范没有作为专门的文件发布,而是作为英特尔CPU的外部设计规范
+(EDS)的一部分。外部设计规范通常是不公开的。
+
+PECI 线
+---------
+
+PECI线接口使用单线进行自锁和数据传输。它不需要任何额外的控制线--物理层是一个
+自锁的单线总线信号,每一个比特都从接近零伏的空闲状态开始驱动、上升边缘。驱动高
+电平信号的持续时间可以确定位值是逻辑 “0” 还是逻辑 “1”。PECI线还包括与每个信
+息建立的可变数据速率。
+
+对于PECI线,每个处理器包将在一个定义的范围内利用唯一的、固定的地址,该地址应
+该与处理器插座ID有固定的关系--如果其中一个处理器被移除,它不会影响其余处理器
+的地址。
+
+PECI子系统代码内嵌文档
+------------------------
+
+该API在以下内核代码中:
+
+include/linux/peci.h
+
+drivers/peci/internal.h
+
+drivers/peci/core.c
+
+drivers/peci/request.c
+
+PECI CPU 驱动 API
+-------------------
+
+该API在以下内核代码中:
+
+drivers/peci/cpu.c
diff --git a/Documentation/translations/zh_CN/power/energy-model.rst b/Documentation/translations/zh_CN/power/energy-model.rst
new file mode 100644
index 000000000000..c7da1b6aefee
--- /dev/null
+++ b/Documentation/translations/zh_CN/power/energy-model.rst
@@ -0,0 +1,190 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/power/energy-model.rst
+
+:翻译:
+
+ 唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
+
+============
+设备能量模型
+============
+
+1. 概述
+-------
+
+能量模型(EM)框架是一种驱动程序与内核子系统之间的接口。其中驱动程序了解不同
+性能层级的设备所消耗的功率,而内核子系统愿意使用该信息做出能量感知决策。
+
+设备所消耗的功率的信息来源在不同的平台上可能有很大的不同。这些功率成本在某些
+情况下可以使用设备树数据来估算。在其它情况下,固件会更清楚。或者,用户空间可能
+是最清楚的。以此类推。为了避免每一个客户端子系统对每一种可能的信息源自己重新
+实现支持,EM框架作为一个抽象层介入,它在内核中对功率成本表的格式进行标准化,
+因此能够避免多余的工作。
+
+功率值可以用毫瓦或“抽象刻度”表示。多个子系统可能使用EM,由系统集成商来检查
+功率值刻度类型的要求是否满足。可以在能量感知调度器的文档中找到一个例子
+Documentation/scheduler/sched-energy.rst。对于一些子系统,比如热能或
+powercap,用“抽象刻度”描述功率值可能会导致问题。这些子系统对过去使用的功率的
+估算值更感兴趣,因此可能需要真实的毫瓦。这些要求的一个例子可以在智能功率分配
+Documentation/driver-api/thermal/power_allocator.rst文档中找到。
+
+内核子系统可能(基于EM内部标志位)实现了对EM注册设备是否具有不一致刻度的自动
+检查。要记住的重要事情是,当功率值以“抽象刻度”表示时,从中推导以毫焦耳为单位
+的真实能量消耗是不可能的。
+
+下图描述了一个驱动的例子(这里是针对Arm的,但该方法适用于任何体系结构),它
+向EM框架提供了功率成本,感兴趣的客户端可从中读取数据::
+
+ +---------------+ +-----------------+ +---------------+
+ | Thermal (IPA) | | Scheduler (EAS) | | Other |
+ +---------------+ +-----------------+ +---------------+
+ | | em_cpu_energy() |
+ | | em_cpu_get() |
+ +---------+ | +---------+
+ | | |
+ v v v
+ +---------------------+
+ | Energy Model |
+ | Framework |
+ +---------------------+
+ ^ ^ ^
+ | | | em_dev_register_perf_domain()
+ +----------+ | +---------+
+ | | |
+ +---------------+ +---------------+ +--------------+
+ | cpufreq-dt | | arm_scmi | | Other |
+ +---------------+ +---------------+ +--------------+
+ ^ ^ ^
+ | | |
+ +--------------+ +---------------+ +--------------+
+ | Device Tree | | Firmware | | ? |
+ +--------------+ +---------------+ +--------------+
+
+对于CPU设备,EM框架管理着系统中每个“性能域”的功率成本表。一个性能域是一组
+性能一起伸缩的CPU。性能域通常与CPUFreq策略具有1对1映射。一个性能域中的
+所有CPU要求具有相同的微架构。不同性能域中的CPU可以有不同的微架构。
+
+
+2. 核心API
+----------
+
+2.1 配置选项
+^^^^^^^^^^^^
+
+必须使能CONFIG_ENERGY_MODEL才能使用EM框架。
+
+
+2.2 性能域的注册
+^^^^^^^^^^^^^^^^
+
+“高级”EM的注册
+~~~~~~~~~~~~~~~~
+
+“高级”EM因它允许驱动提供更精确的功率模型而得名。它并不受限于框架中的一些已
+实现的数学公式(就像“简单”EM那样)。它可以更好地反映每个性能状态的实际功率
+测量。因此,在EM静态功率(漏电流功率)是重要的情况下,应该首选这种注册方式。
+
+驱动程序应通过以下API将性能域注册到EM框架中::
+
+ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
+ struct em_data_callback *cb, cpumask_t *cpus, bool milliwatts);
+
+驱动程序必须提供一个回调函数,为每个性能状态返回<频率,功率>元组。驱动程序
+提供的回调函数可以自由地从任何相关位置(DT、固件......)以及以任何被认为是
+必要的方式获取数据。只有对于CPU设备,驱动程序必须使用cpumask指定性能域的CPU。
+对于CPU以外的其他设备,最后一个参数必须被设置为NULL。
+
+最后一个参数“milliwatts”(毫瓦)设置成正确的值是很重要的,使用EM的内核
+子系统可能会依赖这个标志来检查所有的EM设备是否使用相同的刻度。如果有不同的
+刻度,这些子系统可能决定:返回警告/错误,停止工作或崩溃(panic)。
+
+关于实现这个回调函数的驱动程序的例子,参见第3节。或者在第2.4节阅读这个API
+的更多文档。
+
+
+“简单”EM的注册
+~~~~~~~~~~~~~~~~
+
+“简单”EM是用框架的辅助函数cpufreq_register_em_with_opp()注册的。它实现了
+一个和以下数学公式紧密相关的功率模型::
+
+ Power = C * V^2 * f
+
+使用这种方法注册的EM可能无法正确反映真实设备的物理特性,例如当静态功率
+(漏电流功率)很重要时。
+
+
+2.3 访问性能域
+^^^^^^^^^^^^^^
+
+有两个API函数提供对能量模型的访问。em_cpu_get()以CPU id为参数,em_pd_get()
+以设备指针为参数。使用哪个接口取决于子系统,但对于CPU设备来说,这两个函数都返
+回相同的性能域。
+
+对CPU的能量模型感兴趣的子系统可以通过em_cpu_get() API检索它。在创建性能域时
+分配一次能量模型表,它保存在内存中不被修改。
+
+一个性能域所消耗的能量可以使用em_cpu_energy() API来估算。该估算假定CPU设备
+使用的CPUfreq监管器是schedutil。当前该计算不能提供给其它类型的设备。
+
+关于上述API的更多细节可以在 ``<linux/energy_model.h>`` 或第2.4节中找到。
+
+
+2.4 API的细节描述
+^^^^^^^^^^^^^^^^^
+参见 include/linux/energy_model.h 和 kernel/power/energy_model.c 的kernel doc。
+
+3. 驱动示例
+-----------
+
+CPUFreq框架支持专用的回调函数,用于为指定的CPU(们)注册EM:
+cpufreq_driver::register_em()。这个回调必须为每个特定的驱动程序正确实现,
+因为框架会在设置过程中适时地调用它。本节提供了一个简单的例子,展示CPUFreq驱动
+在能量模型框架中使用(假的)“foo”协议注册性能域。该驱动实现了一个est_power()
+函数提供给EM框架::
+
+ -> drivers/cpufreq/foo_cpufreq.c
+
+ 01 static int est_power(unsigned long *mW, unsigned long *KHz,
+ 02 struct device *dev)
+ 03 {
+ 04 long freq, power;
+ 05
+ 06 /* 使用“foo”协议设置频率上限 */
+ 07 freq = foo_get_freq_ceil(dev, *KHz);
+ 08 if (freq < 0);
+ 09 return freq;
+ 10
+ 11 /* 估算相关频率下设备的功率成本 */
+ 12 power = foo_estimate_power(dev, freq);
+ 13 if (power < 0);
+ 14 return power;
+ 15
+ 16 /* 将这些值返回给EM框架 */
+ 17 *mW = power;
+ 18 *KHz = freq;
+ 19
+ 20 return 0;
+ 21 }
+ 22
+ 23 static void foo_cpufreq_register_em(struct cpufreq_policy *policy)
+ 24 {
+ 25 struct em_data_callback em_cb = EM_DATA_CB(est_power);
+ 26 struct device *cpu_dev;
+ 27 int nr_opp;
+ 28
+ 29 cpu_dev = get_cpu_device(cpumask_first(policy->cpus));
+ 30
+ 31 /* 查找该策略支持的OPP数量 */
+ 32 nr_opp = foo_get_nr_opp(policy);
+ 33
+ 34 /* 并注册新的性能域 */
+ 35 em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus,
+ 36 true);
+ 37 }
+ 38
+ 39 static struct cpufreq_driver foo_cpufreq_driver = {
+ 40 .register_em = foo_cpufreq_register_em,
+ 41 };
diff --git a/Documentation/translations/zh_CN/power/index.rst b/Documentation/translations/zh_CN/power/index.rst
new file mode 100644
index 000000000000..bc54983ba515
--- /dev/null
+++ b/Documentation/translations/zh_CN/power/index.rst
@@ -0,0 +1,56 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/power/index.rst
+
+:翻译:
+
+ 唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
+
+========
+电源管理
+========
+
+.. toctree::
+ :maxdepth: 1
+
+ energy-model
+ opp
+
+TODOList:
+
+ * apm-acpi
+ * basic-pm-debugging
+ * charger-manager
+ * drivers-testing
+ * freezing-of-tasks
+ * pci
+ * pm_qos_interface
+ * power_supply_class
+ * runtime_pm
+ * s2ram
+ * suspend-and-cpuhotplug
+ * suspend-and-interrupts
+ * swsusp-and-swap-files
+ * swsusp-dmcrypt
+ * swsusp
+ * video
+ * tricks
+
+ * userland-swsusp
+
+ * powercap/powercap
+ * powercap/dtpm
+
+ * regulator/consumer
+ * regulator/design
+ * regulator/machine
+ * regulator/overview
+ * regulator/regulator
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/translations/zh_CN/power/opp.rst b/Documentation/translations/zh_CN/power/opp.rst
new file mode 100644
index 000000000000..8d6e3f6f6202
--- /dev/null
+++ b/Documentation/translations/zh_CN/power/opp.rst
@@ -0,0 +1,341 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/power/opp.rst
+
+:翻译:
+
+ 唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
+
+======================
+操作性能值(OPP)库
+======================
+
+(C) 2009-2010 Nishanth Menon <nm@ti.com>, 德州仪器公司
+
+.. 目录
+
+ 1. 简介
+ 2. OPP链表初始注册
+ 3. OPP搜索函数
+ 4. OPP可用性控制函数
+ 5. OPP数据检索函数
+ 6. 数据结构
+
+1. 简介
+=======
+
+1.1 何为操作性能值(OPP)?
+------------------------------
+
+当今复杂的单片系统(SoC)由多个子模块组成,这些子模块会联合工作。在一个执行不同用例
+的操作系统中,并不是SoC中的所有模块都需要一直以最高频率工作。为了促成这一点,SoC中
+的子模块被分组为不同域,允许一些域以较低的电压和频率运行,而其它域则以较高的“电压/
+频率对”运行。
+
+设备按域支持的由频率电压对组成的离散的元组的集合,被称为操作性能值(组),或OPPs。
+
+举例来说:
+
+让我们考虑一个支持下述频率、电压值的内存保护单元(MPU)设备:
+{300MHz,最低电压为1V}, {800MHz,最低电压为1.2V}, {1GHz,最低电压为1.3V}
+
+我们能将它们表示为3个OPP,如下述{Hz, uV}元组(译注:频率的单位是赫兹,电压的单位是
+微伏)。
+
+- {300000000, 1000000}
+- {800000000, 1200000}
+- {1000000000, 1300000}
+
+1.2 操作性能值库
+----------------
+
+OPP库提供了一组辅助函数来组织和查询OPP信息。该库位于drivers/opp/目录下,其头文件
+位于include/linux/pm_opp.h中。OPP库可以通过开启CONFIG_PM_OPP来启用。某些SoC,
+如德州仪器的OMAP框架允许在不需要cpufreq的情况下可选地在某一OPP下启动。
+
+OPP库的典型用法如下::
+
+ (用户) -> 注册一个默认的OPP集合 -> (库)
+ (SoC框架) -> 在必要的情况下,对某些OPP进行修改 -> OPP layer
+ -> 搜索/检索信息的查询 ->
+
+OPP层期望每个域由一个唯一的设备指针来表示。SoC框架在OPP层为每个设备注册了一组初始
+OPP。这个链表的长度被期望是一个最优化的小数字,通常每个设备大约5个。初始链表包含了
+一个OPP集合,这个集合被期望能在系统中安全使能。
+
+关于OPP可用性的说明
+^^^^^^^^^^^^^^^^^^^
+
+随着系统的运行,SoC框架可能会基于各种外部因素选择让某些OPP在每个设备上可用或不可用,
+示例:温度管理或其它异常场景中,SoC框架可能会选择禁用一个较高频率的OPP以安全地继续
+运行,直到该OPP被重新启用(如果可能)。
+
+OPP库在它的实现中达成了这个概念。以下操作函数只能对可用的OPP使用:
+dev_pm_opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage,
+dev_pm_opp_get_freq, dev_pm_opp_get_opp_count。
+
+dev_pm_opp_find_freq_exact是用来查找OPP指针的,该指针可被用在dev_pm_opp_enable/
+disable函数,使一个OPP在被需要时变为可用。
+
+警告:如果对一个设备调用dev_pm_opp_enable/disable函数,OPP库的用户应该使用
+dev_pm_opp_get_opp_count来刷新OPP的可用性计数。触发这些的具体机制,或者对有依赖的
+子系统(比如cpufreq)的通知机制,都是由使用OPP库的SoC特定框架酌情处理的。在这些操作
+中,同样需要注意刷新cpufreq表。
+
+2. OPP链表初始注册
+==================
+SoC的实现会迭代调用dev_pm_opp_add函数来增加每个设备的OPP。预期SoC框架将以最优的
+方式注册OPP条目 - 典型的数字范围小于5。通过注册OPP生成的OPP链表,在整个设备运行过程
+中由OPP库维护。SoC框架随后可以使用dev_pm_opp_enable / disable函数动态地
+控制OPP的可用性。
+
+dev_pm_opp_add
+ 为设备指针所指向的特定域添加一个新的OPP。OPP是用频率和电压定义的。一旦完成
+ 添加,OPP被认为是可用的,可以用dev_pm_opp_enable/disable函数来控制其可用性。
+ OPP库内部用dev_pm_opp结构体存储并管理这些信息。这个函数可以被SoC框架根据SoC
+ 的使用环境的需求来定义一个最优链表。
+
+ 警告:
+ 不要在中断上下文使用这个函数。
+
+ 示例::
+
+ soc_pm_init()
+ {
+ /* 做一些事情 */
+ r = dev_pm_opp_add(mpu_dev, 1000000, 900000);
+ if (!r) {
+ pr_err("%s: unable to register mpu opp(%d)\n", r);
+ goto no_cpufreq;
+ }
+ /* 做一些和cpufreq相关的事情 */
+ no_cpufreq:
+ /* 做剩余的事情 */
+ }
+
+3. OPP搜索函数
+==============
+cpufreq等高层框架对频率进行操作,为了将频率映射到相应的OPP,OPP库提供了便利的函数
+来搜索OPP库内部管理的OPP链表。这些搜索函数如果找到匹配的OPP,将返回指向该OPP的指针,
+否则返回错误。这些错误预计由标准的错误检查,如IS_ERR()来处理,并由调用者采取适当的
+行动。
+
+这些函数的调用者应在使用完OPP后调用dev_pm_opp_put()。否则,OPP的内存将永远不会
+被释放,并导致内存泄露。
+
+dev_pm_opp_find_freq_exact
+ 根据 *精确的* 频率和可用性来搜索OPP。这个函数对默认不可用的OPP特别有用。
+ 例子:在SoC框架检测到更高频率可用的情况下,它可以使用这个函数在调用
+ dev_pm_opp_enable之前找到OPP::
+
+ opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
+ dev_pm_opp_put(opp);
+ /* 不要操作指针.. 只是做有效性检查.. */
+ if (IS_ERR(opp)) {
+ pr_err("frequency not disabled!\n");
+ /* 触发合适的操作.. */
+ } else {
+ dev_pm_opp_enable(dev,1000000000);
+ }
+
+ 注意:
+ 这是唯一一个可以搜索不可用OPP的函数。
+
+dev_pm_opp_find_freq_floor
+ 搜索一个 *最多* 提供指定频率的可用OPP。这个函数在搜索较小的匹配或按频率
+ 递减的顺序操作OPP信息时很有用。
+ 例子:要找的一个设备的最高OPP::
+
+ freq = ULONG_MAX;
+ opp = dev_pm_opp_find_freq_floor(dev, &freq);
+ dev_pm_opp_put(opp);
+
+dev_pm_opp_find_freq_ceil
+ 搜索一个 *最少* 提供指定频率的可用OPP。这个函数在搜索较大的匹配或按频率
+ 递增的顺序操作OPP信息时很有用。
+ 例1:找到一个设备最小的OPP::
+
+ freq = 0;
+ opp = dev_pm_opp_find_freq_ceil(dev, &freq);
+ dev_pm_opp_put(opp);
+
+ 例: 一个SoC的cpufreq_driver->target的简易实现::
+
+ soc_cpufreq_target(..)
+ {
+ /* 做策略检查等操作 */
+ /* 找到和请求最接近的频率 */
+ opp = dev_pm_opp_find_freq_ceil(dev, &freq);
+ dev_pm_opp_put(opp);
+ if (!IS_ERR(opp))
+ soc_switch_to_freq_voltage(freq);
+ else
+ /* 当不能满足请求时,要做的事 */
+ /* 做其它事 */
+ }
+
+4. OPP可用性控制函数
+====================
+在OPP库中注册的默认OPP链表也许无法满足所有可能的场景。OPP库提供了一套函数来修改
+OPP链表中的某个OPP的可用性。这使得SoC框架能够精细地动态控制哪一组OPP是可用于操作
+的。设计这些函数的目的是在诸如考虑温度时 *暂时地* 删除某个OPP(例如,在温度下降
+之前不要使用某OPP)。
+
+警告:
+ 不要在中断上下文使用这些函数。
+
+dev_pm_opp_enable
+ 使一个OPP可用于操作。
+ 例子:假设1GHz的OPP只有在SoC温度低于某个阈值时才可用。SoC框架的实现可能
+ 会选择做以下事情::
+
+ if (cur_temp < temp_low_thresh) {
+ /* 若1GHz未使能,则使能 */
+ opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
+ dev_pm_opp_put(opp);
+ /* 仅仅是错误检查 */
+ if (!IS_ERR(opp))
+ ret = dev_pm_opp_enable(dev, 1000000000);
+ else
+ goto try_something_else;
+ }
+
+dev_pm_opp_disable
+ 使一个OPP不可用于操作。
+ 例子:假设1GHz的OPP只有在SoC温度高于某个阈值时才可用。SoC框架的实现可能
+ 会选择做以下事情::
+
+ if (cur_temp > temp_high_thresh) {
+ /* 若1GHz已使能,则关闭 */
+ opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true);
+ dev_pm_opp_put(opp);
+ /* 仅仅是错误检查 */
+ if (!IS_ERR(opp))
+ ret = dev_pm_opp_disable(dev, 1000000000);
+ else
+ goto try_something_else;
+ }
+
+5. OPP数据检索函数
+==================
+由于OPP库对OPP信息进行了抽象化处理,因此需要一组函数来从dev_pm_opp结构体中提取
+信息。一旦使用搜索函数检索到一个OPP指针,以下函数就可以被SoC框架用来检索OPP层
+内部描述的信息。
+
+dev_pm_opp_get_voltage
+ 检索OPP指针描述的电压。
+ 例子: 当cpufreq切换到到不同频率时,SoC框架需要用稳压器框架将OPP描述
+ 的电压设置到提供电压的电源管理芯片中::
+
+ soc_switch_to_freq_voltage(freq)
+ {
+ /* 做一些事情 */
+ opp = dev_pm_opp_find_freq_ceil(dev, &freq);
+ v = dev_pm_opp_get_voltage(opp);
+ dev_pm_opp_put(opp);
+ if (v)
+ regulator_set_voltage(.., v);
+ /* 做其它事 */
+ }
+
+dev_pm_opp_get_freq
+ 检索OPP指针描述的频率。
+ 例子:比方说,SoC框架使用了几个辅助函数,通过这些函数,我们可以将OPP
+ 指针传入,而不是传入额外的参数,用来处理一系列数据参数::
+
+ soc_cpufreq_target(..)
+ {
+ /* 做一些事情.. */
+ max_freq = ULONG_MAX;
+ max_opp = dev_pm_opp_find_freq_floor(dev,&max_freq);
+ requested_opp = dev_pm_opp_find_freq_ceil(dev,&freq);
+ if (!IS_ERR(max_opp) && !IS_ERR(requested_opp))
+ r = soc_test_validity(max_opp, requested_opp);
+ dev_pm_opp_put(max_opp);
+ dev_pm_opp_put(requested_opp);
+ /* 做其它事 */
+ }
+ soc_test_validity(..)
+ {
+ if(dev_pm_opp_get_voltage(max_opp) < dev_pm_opp_get_voltage(requested_opp))
+ return -EINVAL;
+ if(dev_pm_opp_get_freq(max_opp) < dev_pm_opp_get_freq(requested_opp))
+ return -EINVAL;
+ /* 做一些事情.. */
+ }
+
+dev_pm_opp_get_opp_count
+ 检索某个设备可用的OPP数量。
+ 例子:假设SoC中的一个协处理器需要知道某个表中的可用频率,主处理器可以
+ 按如下方式发出通知::
+
+ soc_notify_coproc_available_frequencies()
+ {
+ /* 做一些事情 */
+ num_available = dev_pm_opp_get_opp_count(dev);
+ speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL);
+ /* 按升序填充表 */
+ freq = 0;
+ while (!IS_ERR(opp = dev_pm_opp_find_freq_ceil(dev, &freq))) {
+ speeds[i] = freq;
+ freq++;
+ i++;
+ dev_pm_opp_put(opp);
+ }
+
+ soc_notify_coproc(AVAILABLE_FREQs, speeds, num_available);
+ /* 做其它事 */
+ }
+
+6. 数据结构
+===========
+通常,一个SoC包含多个可变电压域。每个域由一个设备指针描述。和OPP之间的关系可以
+按以下方式描述::
+
+ SoC
+ |- device 1
+ | |- opp 1 (availability, freq, voltage)
+ | |- opp 2 ..
+ ... ...
+ | `- opp n ..
+ |- device 2
+ ...
+ `- device m
+
+OPP库维护着一个内部链表,SoC框架使用上文描述的各个函数来填充和访问。然而,描述
+真实OPP和域的结构体是OPP库自身的内部组成,以允许合适的抽象在不同系统中得到复用。
+
+struct dev_pm_opp
+ OPP库的内部数据结构,用于表示一个OPP。除了频率、电压、可用性信息外,
+ 它还包含OPP库运行所需的内部统计信息。指向这个结构体的指针被提供给
+ 用户(比如SoC框架)使用,在与OPP层的交互中作为OPP的标识符。
+
+ 警告:
+ 结构体dev_pm_opp的指针不应该由用户解析或修改。一个实例的默认值由
+ dev_pm_opp_add填充,但OPP的可用性由dev_pm_opp_enable/disable函数
+ 修改。
+
+struct device
+ 这用于向OPP层标识一个域。设备的性质和它的实现是由OPP库的用户决定的,
+ 如SoC框架。
+
+总体来说,以一个简化的视角看,对数据结构的操作可以描述为下面各图::
+
+ 初始化 / 修改:
+ +-----+ /- dev_pm_opp_enable
+ dev_pm_opp_add --> | opp | <-------
+ | +-----+ \- dev_pm_opp_disable
+ \-------> domain_info(device)
+
+ 搜索函数:
+ /-- dev_pm_opp_find_freq_ceil ---\ +-----+
+ domain_info<---- dev_pm_opp_find_freq_exact -----> | opp |
+ \-- dev_pm_opp_find_freq_floor ---/ +-----+
+
+ 检索函数:
+ +-----+ /- dev_pm_opp_get_voltage
+ | opp | <---
+ +-----+ \- dev_pm_opp_get_freq
+
+ domain_info <- dev_pm_opp_get_opp_count
diff --git a/Documentation/translations/zh_CN/riscv/index.rst b/Documentation/translations/zh_CN/riscv/index.rst
index bbf5d7b3777a..614cde0c0997 100644
--- a/Documentation/translations/zh_CN/riscv/index.rst
+++ b/Documentation/translations/zh_CN/riscv/index.rst
@@ -18,6 +18,7 @@ RISC-V 体系结构
:maxdepth: 1
boot-image-header
+ vm-layout
pmu
patch-acceptance
diff --git a/Documentation/translations/zh_CN/riscv/vm-layout.rst b/Documentation/translations/zh_CN/riscv/vm-layout.rst
new file mode 100644
index 000000000000..585cb89317a3
--- /dev/null
+++ b/Documentation/translations/zh_CN/riscv/vm-layout.rst
@@ -0,0 +1,67 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/riscv/vm-layout.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+============================
+RISC-V Linux上的虚拟内存布局
+============================
+
+:作者: Alexandre Ghiti <alex@ghiti.fr>
+:日期: 12 February 2021
+
+这份文件描述了RISC-V Linux内核使用的虚拟内存布局。
+
+32位 RISC-V Linux 内核
+======================
+
+RISC-V Linux Kernel SV32
+------------------------
+
+TODO
+
+64位 RISC-V Linux 内核
+======================
+
+RISC-V特权架构文档指出,64位地址 "必须使第63-48位值都等于第47位,否则将发生缺页异常。":这将虚
+拟地址空间分成两半,中间有一个非常大的洞,下半部分是用户空间所在的地方,上半部分是RISC-V Linux
+内核所在的地方。
+
+RISC-V Linux Kernel SV39
+------------------------
+
+::
+
+ ========================================================================================================================
+ 开始地址 | 偏移 | 结束地址 | 大小 | 虚拟内存区域描述
+ ========================================================================================================================
+ | | | |
+ 0000000000000000 | 0 | 0000003fffffffff | 256 GB | 用户空间虚拟内存,每个内存管理器不同
+ __________________|____________|__________________|_________|___________________________________________________________
+ | | | |
+ 0000004000000000 | +256 GB | ffffffbfffffffff | ~16M TB | ... 巨大的、几乎64位宽的直到内核映射的-256GB地方
+ | | | | 开始偏移的非经典虚拟内存地址空洞。
+ | | | |
+ __________________|____________|__________________|_________|___________________________________________________________
+ |
+ | 内核空间的虚拟内存,在所有进程之间共享:
+ ____________________________________________________________|___________________________________________________________
+ | | | |
+ ffffffc6fee00000 | -228 GB | ffffffc6feffffff | 2 MB | fixmap
+ ffffffc6ff000000 | -228 GB | ffffffc6ffffffff | 16 MB | PCI io
+ ffffffc700000000 | -228 GB | ffffffc7ffffffff | 4 GB | vmemmap
+ ffffffc800000000 | -224 GB | ffffffd7ffffffff | 64 GB | vmalloc/ioremap space
+ ffffffd800000000 | -160 GB | fffffff6ffffffff | 124 GB | 直接映射所有物理内存
+ fffffff700000000 | -36 GB | fffffffeffffffff | 32 GB | kasan
+ __________________|____________|__________________|_________|____________________________________________________________
+ |
+ |
+ ____________________________________________________________|____________________________________________________________
+ | | | |
+ ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | modules, BPF
+ ffffffff80000000 | -2 GB | ffffffffffffffff | 2 GB | kernel
+ __________________|____________|__________________|_________|____________________________________________________________
diff --git a/Documentation/translations/zh_CN/scheduler/index.rst b/Documentation/translations/zh_CN/scheduler/index.rst
index f8f8f35d53c7..12bf3bd02ccf 100644
--- a/Documentation/translations/zh_CN/scheduler/index.rst
+++ b/Documentation/translations/zh_CN/scheduler/index.rst
@@ -5,6 +5,7 @@
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
+ 唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
:校译:
@@ -23,16 +24,14 @@ Linux调度器
sched-design-CFS
sched-domains
sched-capacity
-
+ sched-energy
+ sched-nice-design
+ sched-stats
TODOList:
- sched-bwc
sched-deadline
- sched-energy
- sched-nice-design
sched-rt-group
- sched-stats
text_files
diff --git a/Documentation/translations/zh_CN/scheduler/sched-energy.rst b/Documentation/translations/zh_CN/scheduler/sched-energy.rst
new file mode 100644
index 000000000000..fdbf6cfeea93
--- /dev/null
+++ b/Documentation/translations/zh_CN/scheduler/sched-energy.rst
@@ -0,0 +1,351 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/scheduler/sched-energy.rst
+
+:翻译:
+
+ 唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
+
+============
+能量感知调度
+============
+
+1. 简介
+-------
+
+能量感知调度(EAS)使调度器有能力预测其决策对CPU所消耗的能量的影响。EAS依靠
+一个能量模型(EM)来为每个任务选择一个节能的CPU,同时最小化对吞吐率的影响。
+本文档致力于介绍介绍EAS是如何工作的,它背后的主要设计决策是什么,以及使其运行
+所需的条件细节。
+
+在进一步阅读之前,请注意,在撰写本文时::
+
+ /!\ EAS不支持对称CPU拓扑的平台 /!\
+
+EAS只在异构CPU拓扑结构(如Arm大小核,big.LITTLE)上运行。因为在这种情况下,
+通过调度来节约能量的潜力是最大的。
+
+EAS实际使用的EM不是由调度器维护的,而是一个专门的框架。关于这个框架的细节和
+它提供的内容,请参考其文档(见Documentation/power/energy-model.rst)。
+
+
+2. 背景和术语
+-------------
+
+从一开始就说清楚定义:
+ - 能量 = [焦耳] (比如供电设备上的电池提供的资源)
+ - 功率 = 能量/时间 = [焦耳/秒] = [瓦特]
+
+ EAS的目标是最小化能量消耗,同时仍能将工作完成。也就是说,我们要最大化::
+
+ 性能 [指令数/秒]
+ ----------------
+ 功率 [瓦特]
+
+它等效于最小化::
+
+ 能量 [焦耳]
+ -----------
+ 指令数
+
+同时仍然获得“良好”的性能。当前调度器只考虑性能目标,因此该式子本质上是一个
+可选的优化目标,它同时考虑了两个目标:能量效率和性能。
+
+引入EM的想法是为了让调度器评估其决策的影响,而不是盲目地应用可能仅在部分
+平台有正面效果的节能技术。同时,EM必须尽可能的简单,以最小化调度器的时延
+影响。
+
+简而言之,EAS改变了CFS任务分配给CPU的方式。当调度器决定一个任务应该在哪里
+运行时(在唤醒期间),EM被用来在不损害系统吞吐率的情况下,从几个较好的候选
+CPU中挑选一个经预测能量消耗最优的CPU。EAS的预测依赖于对平台拓扑结构特定元素
+的了解,包括CPU的“算力”,以及它们各自的能量成本。
+
+
+3. 拓扑信息
+-----------
+
+EAS(以及调度器的剩余部分)使用“算力”的概念来区分不同计算吞吐率的CPU。一个
+CPU的“算力”代表了它在最高频率下运行时能完成的工作量,且这个值是相对系统中
+算力最大的CPU而言的。算力值被归一化为1024以内,并且可与由实体负载跟踪
+(PELT)机制算出的利用率信号做对比。由于有算力值和利用率值,EAS能够估计一个
+任务/CPU有多大/有多忙,并在评估性能与能量时将其考虑在内。CPU算力由特定体系
+结构实现的arch_scale_cpu_capacity()回调函数提供。
+
+EAS使用的其余平台信息是直接从能量模型(EM)框架中读取的。一个平台的EM是一张
+表,表中每项代表系统中一个“性能域”的功率成本。(若要了解更多关于性能域的细节,
+见Documentation/power/energy-model.rst)
+
+当调度域被建立或重新建立时,调度器管理对拓扑代码中EM对象的引用。对于每个根域
+(rd),调度器维护一个与当前rd->span相交的所有性能域的单向链表。链表中的每个
+节点都包含一个指向EM框架所提供的结构体em_perf_domain的指针。
+
+链表被附加在根域上,以应对独占的cpuset的配置。由于独占的cpuset的边界不一定与
+性能域的边界一致,不同根域的链表可能包含重复的元素。
+
+示例1
+ 让我们考虑一个有12个CPU的平台,分成3个性能域,(pd0,pd4和pd8),按以下
+ 方式组织::
+
+ CPUs: 0 1 2 3 4 5 6 7 8 9 10 11
+ PDs: |--pd0--|--pd4--|---pd8---|
+ RDs: |----rd1----|-----rd2-----|
+
+ 现在,考虑用户空间决定将系统分成两个独占的cpusets,因此创建了两个独立的根域,
+ 每个根域包含6个CPU。这两个根域在上图中被表示为rd1和rd2。由于pd4与rd1和rd2
+ 都有交集,它将同时出现于附加在这两个根域的“->pd”链表中:
+
+ * rd1->pd: pd0 -> pd4
+ * rd2->pd: pd4 -> pd8
+
+ 请注意,调度器将为pd4创建两个重复的链表节点(每个链表中各一个)。然而这
+ 两个节点持有指向同一个EM框架的共享数据结构的指针。
+
+由于对这些链表的访问可能与热插拔及其它事件并发发生,因此它们受RCU锁保护,就像
+被调度器操控的拓扑结构体中剩下字段一样。
+
+EAS同样维护了一个静态键(sched_energy_present),当至少有一个根域满足EAS
+启动的所有条件时,这个键就会被启动。在第6节中总结了这些条件。
+
+
+4. 能量感知任务放置
+-------------------
+
+EAS覆盖了CFS的任务唤醒平衡代码。在唤醒平衡时,它使用平台的EM和PELT信号来选择节能
+的目标CPU。当EAS被启用时,select_task_rq_fair()调用find_energy_efficient_cpu()
+来做任务放置决定。这个函数寻找在每个性能域中寻找具有最高剩余算力(CPU算力 - CPU
+利用率)的CPU,因为它能让我们保持最低的频率。然后,该函数检查将任务放在新CPU相较
+依然放在之前活动的prev_cpu是否可以节省能量。
+
+如果唤醒的任务被迁移,find_energy_efficient_cpu()使用compute_energy()来估算
+系统将消耗多少能量。compute_energy()检查各CPU当前的利用率情况,并尝试调整来
+“模拟”任务迁移。EM框架提供了API em_pd_energy()计算每个性能域在给定的利用率条件
+下的预期能量消耗。
+
+下面详细介绍一个优化能量消耗的任务放置决策的例子。
+
+示例2
+ 让我们考虑一个有两个独立性能域的(伪)平台,每个性能域含有2个CPU。CPU0和CPU1
+ 是小核,CPU2和CPU3是大核。
+
+ 调度器必须决定将任务P放在哪个CPU上,这个任务的util_avg = 200(平均利用率),
+ prev_cpu = 0(上一次运行在CPU0)。
+
+ 目前CPU的利用率情况如下图所示。CPU 0-3的util_avg分别为400、100、600和500。
+ 每个性能域有三个操作性能值(OPP)。与每个OPP相关的CPU算力和功率成本列在能量
+ 模型表中。P的util_avg在图中显示为"PP"::
+
+ CPU util.
+ 1024 - - - - - - - Energy Model
+ +-----------+-------------+
+ | Little | Big |
+ 768 ============= +-----+-----+------+------+
+ | Cap | Pwr | Cap | Pwr |
+ +-----+-----+------+------+
+ 512 =========== - ##- - - - - | 170 | 50 | 512 | 400 |
+ ## ## | 341 | 150 | 768 | 800 |
+ 341 -PP - - - - ## ## | 512 | 300 | 1024 | 1700 |
+ PP ## ## +-----+-----+------+------+
+ 170 -## - - - - ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+ Current OPP: ===== Other OPP: - - - util_avg (100 each): ##
+
+
+ find_energy_efficient_cpu()将首先在两个性能域中寻找具有最大剩余算力的CPU。
+ 在这个例子中是CPU1和CPU3。然后,它将估算,当P被放在它们中的任意一个时,系统的
+ 能耗,并检查这样做是否会比把P放在CPU0上节省一些能量。EAS假定OPPs遵循利用率
+ (这与CPUFreq监管器schedutil的行为一致。关于这个问题的更多细节,见第6节)。
+
+ **情况1. P被迁移到CPU1**::
+
+ 1024 - - - - - - -
+
+ Energy calculation:
+ 768 ============= * CPU0: 200 / 341 * 150 = 88
+ * CPU1: 300 / 341 * 150 = 131
+ * CPU2: 600 / 768 * 800 = 625
+ 512 - - - - - - - ##- - - - - * CPU3: 500 / 768 * 800 = 520
+ ## ## => total_energy = 1364
+ 341 =========== ## ##
+ PP ## ##
+ 170 -## - - PP- ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+
+ **情况2. P被迁移到CPU3**::
+
+ 1024 - - - - - - -
+
+ Energy calculation:
+ 768 ============= * CPU0: 200 / 341 * 150 = 88
+ * CPU1: 100 / 341 * 150 = 43
+ PP * CPU2: 600 / 768 * 800 = 625
+ 512 - - - - - - - ##- - -PP - * CPU3: 700 / 768 * 800 = 729
+ ## ## => total_energy = 1485
+ 341 =========== ## ##
+ ## ##
+ 170 -## - - - - ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+ **情况3. P依旧留在prev_cpu/CPU0**::
+
+ 1024 - - - - - - -
+
+ Energy calculation:
+ 768 ============= * CPU0: 400 / 512 * 300 = 234
+ * CPU1: 100 / 512 * 300 = 58
+ * CPU2: 600 / 768 * 800 = 625
+ 512 =========== - ##- - - - - * CPU3: 500 / 768 * 800 = 520
+ ## ## => total_energy = 1437
+ 341 -PP - - - - ## ##
+ PP ## ##
+ 170 -## - - - - ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+ 从这些计算结果来看,情况1的总能量最低。所以从节约能量的角度看,CPU1是最佳候选
+ 者。
+
+大核通常比小核更耗电,因此主要在任务不适合在小核运行时使用。然而,小核并不总是比
+大核节能。举例来说,对于某些系统,小核的高OPP可能比大核的低OPP能量消耗更高。因此,
+如果小核在某一特定时间点刚好有足够的利用率,在此刻被唤醒的小任务放在大核执行可能
+会更节能,尽管它在小核上运行也是合适的。
+
+即使在大核所有OPP都不如小核OPP节能的情况下,在某些特定条件下,令小任务运行在大核
+上依然可能节能。事实上,将一个任务放在一个小核上可能导致整个性能域的OPP提高,这将
+增加已经在该性能域运行的任务的能量成本。如果唤醒的任务被放在一个大核上,它的执行
+成本可能比放在小核上更高,但这不会影响小核上的其它任务,这些任务将继续以较低的OPP
+运行。因此,当考虑CPU消耗的总能量时,在大核上运行一个任务的额外成本可能小于为所有
+其它运行在小核的任务提高OPP的成本。
+
+上面的例子几乎不可能以一种通用的方式得到正确的结果;同时,对于所有平台,在不知道
+系统所有CPU每个不同OPP的运行成本时,也无法得到正确的结果。得益于基于EM的设计,
+EAS应该能够正确处理这些问题而不会引发太多麻烦。然而,为了确保对高利用率场景的
+吞吐率造成的影响最小化,EAS同样实现了另外一种叫“过度利用率”的机制。
+
+
+5. 过度利用率
+-------------
+
+从一般的角度来看,EAS能提供最大帮助的是那些涉及低、中CPU利用率的使用场景。每当CPU
+密集型的长任务运行,它们将需要所有的可用CPU算力,调度器将没有什么办法来节省能量同时
+又不损害吞吐率。为了避免EAS损害性能,一旦CPU被使用的算力超过80%,它将被标记为“过度
+利用”。只要根域中没有CPU是过度利用状态,负载均衡被禁用,而EAS将覆盖唤醒平衡代码。
+EAS很可能将负载放置在系统中能量效率最高的CPU而不是其它CPU上,只要不损害吞吐率。
+因此,负载均衡器被禁用以防止它打破EAS发现的节能任务放置。当系统未处于过度利用状态时,
+这样做是安全的,因为低于80%的临界点意味着:
+
+ a. 所有的CPU都有一些空闲时间,所以EAS使用的利用率信号很可能准确地代表各种任务
+ 的“大小”。
+ b. 所有任务,不管它们的nice值是多大,都应该被提供了足够多的CPU算力。
+ c. 既然有多余的算力,那么所有的任务都必须定期阻塞/休眠,在唤醒时进行平衡就足够
+ 了。
+
+只要一个CPU利用率超过80%的临界点,上述三个假设中至少有一个是不正确的。在这种情况下,
+整个根域的“过度利用”标志被设置,EAS被禁用,负载均衡器被重新启用。通过这样做,调度器
+又回到了在CPU密集的条件下基于负载的算法做负载均衡。这更好地尊重了任务的nice值。
+
+由于过度利用率的概念在很大程度上依赖于检测系统中是否有一些空闲时间,所以必须考虑
+(比CFS)更高优先级的调度类(以及中断)“窃取”的CPU算力。像这样,对过度使用率的检测
+不仅要考虑CFS任务使用的算力,还需要考虑其它调度类和中断。
+
+
+6. EAS的依赖和要求
+------------------
+
+能量感知调度依赖系统的CPU具有特定的硬件属性,以及内核中的其它特性被启用。本节列出
+了这些依赖,并对如何满足这些依赖提供了提示。
+
+
+6.1 - 非对称CPU拓扑
+^^^^^^^^^^^^^^^^^^^
+
+
+如简介所提,目前只有非对称CPU拓扑结构的平台支持EAS。通过在运行时查询
+SD_ASYM_CPUCAPACITY_FULL标志位是否在创建调度域时已设置来检查这一要求是否满足。
+
+参阅Documentation/scheduler/sched-capacity.rst以了解在sched_domain层次结构
+中设置此标志位所需满足的要求。
+
+请注意,EAS并非从根本上与SMP不兼容,但在SMP平台上还没有观察到明显的节能。这一
+限制可以在将来进行修改,如果被证明不是这样的话。
+
+
+6.2 - 当前的能量模型
+^^^^^^^^^^^^^^^^^^^^
+
+EAS使用一个平台的EM来估算调度决策对能量的影响。因此,你的平台必须向EM框架提供
+能量成本表,以启动EAS。要做到这一点,请参阅文档
+Documentation/power/energy-model.rst中的独立EM框架部分。
+
+另请注意,调度域需要在EM注册后重建,以便启动EAS。
+
+EAS使用EM对能量使用率进行预测决策,因此它在检查任务放置的可能选项时更加注重
+差异。对于EAS来说,EM的功率值是以毫瓦还是以“抽象刻度”为单位表示并不重要。
+
+
+
+6.3 - 能量模型复杂度
+^^^^^^^^^^^^^^^^^^^^
+
+任务唤醒路径是时延敏感的。当一个平台的EM太复杂(太多CPU,太多性能域,太多状态
+等),在唤醒路径中使用它的成本就会升高到不可接受。能量感知唤醒算法的复杂度为:
+
+ C = Nd * (Nc + Ns)
+
+其中:Nd是性能域的数量;Nc是CPU的数量;Ns是OPP的总数(例如:对于两个性能域,
+每个域有4个OPP,则Ns = 8)。
+
+当调度域建立时,复杂性检查是在根域上进行的。如果一个根域的复杂度C恰好高于完全
+主观设定的EM_MAX_COMPLEXITY阈值(在本文写作时,是2048),则EAS不会在此根域
+启动。
+
+如果你的平台的能量模型的复杂度太高,EAS无法在这个根域上使用,但你真的想用,
+那么你就只剩下两个可能的选择:
+
+ 1. 将你的系统拆分成分离的、较小的、使用独占cpuset的根域,并在每个根域局部
+ 启用EAS。这个方案的好处是开箱即用,但缺点是无法在根域之间实现负载均衡,
+ 这可能会导致总体系统负载不均衡。
+ 2. 提交补丁以降低EAS唤醒算法的复杂度,从而使其能够在合理的时间内处理更大
+ 的EM。
+
+
+6.4 - Schedutil监管器
+^^^^^^^^^^^^^^^^^^^^^
+
+EAS试图预测CPU在不久的将来会在哪个OPP下运行,以估算它们的能量消耗。为了做到
+这一点,它假定CPU的OPP跟随CPU利用率变化而变化。
+
+尽管在实践中很难对这一假设的准确性提供硬性保证(因为,举例来说硬件可能不会做
+它被要求做的事情),相对于其他CPUFreq监管器,schedutil至少_请求_使用利用率
+信号计算的频率。因此,与EAS一起使用的唯一合理的监管器是schedutil,因为它是
+唯一一个在频率请求和能量预测之间提供某种程度的一致性的监管器。
+
+不支持将EAS与schedutil之外的任何其它监管器一起使用。
+
+
+6.5 刻度不变性使用率信号
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+为了对不同的CPU和所有的性能状态做出准确的预测,EAS需要频率不变的和CPU不变的
+PELT信号。这些信号可以通过每个体系结构定义的arch_scale{cpu,freq}_capacity()
+回调函数获取。
+
+不支持在没有实现这两个回调函数的平台上使用EAS。
+
+
+6.6 多线程(SMT)
+^^^^^^^^^^^^^^^^^
+
+当前实现的EAS是不感知SMT的,因此无法利用多线程硬件节约能量。EAS认为线程是独立的
+CPU,这实际上对性能和能量消耗都是不利的。
+
+不支持在SMT上使用EAS。
diff --git a/Documentation/translations/zh_CN/scheduler/sched-nice-design.rst b/Documentation/translations/zh_CN/scheduler/sched-nice-design.rst
new file mode 100644
index 000000000000..9107f0c0b979
--- /dev/null
+++ b/Documentation/translations/zh_CN/scheduler/sched-nice-design.rst
@@ -0,0 +1,99 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/scheduler/sched-nice-design.rst
+
+:翻译:
+
+ 唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
+
+=====================
+调度器nice值设计
+=====================
+
+本文档解释了新的Linux调度器中修改和精简后的nice级别的实现思路。
+
+Linux的nice级别总是非常脆弱,人们持续不断地缠着我们,让nice +19的任务占用
+更少的CPU时间。
+
+不幸的是,在旧的调度器中,这不是那么容易实现的(否则我们早就做到了),因为对
+nice级别的支持在历史上是与时间片长度耦合的,而时间片单位是由HZ滴答驱动的,
+所以最小的时间片是1/HZ。
+
+在O(1)调度器中(2003年),我们改变了负的nice级别,使它们比2.4内核更强
+(人们对这一变化很满意),而且我们还故意校正了线性时间片准则,使得nice +19
+的级别 _正好_ 是1 jiffy。为了让大家更好地理解它,时间片的图会是这样的(质量
+不佳的ASCII艺术提醒!)::
+
+
+ A
+ \ | [timeslice length]
+ \ |
+ \ |
+ \ |
+ \ |
+ \|___100msecs
+ |^ . _
+ | ^ . _
+ | ^ . _
+ -*----------------------------------*-----> [nice level]
+ -20 | +19
+ |
+ |
+
+因此,如果有人真的想renice任务,相较线性规则,+19会给出更大的效果(改变
+ABI来扩展优先级的解决方案在早期就被放弃了)。
+
+这种方法在一定程度上奏效了一段时间,但后来HZ=1000时,它导致1 jiffy为
+1ms,这意味着0.1%的CPU使用率,我们认为这有点过度。过度 _不是_ 因为它表示
+的CPU使用率过小,而是因为它引发了过于频繁(每毫秒1次)的重新调度(因此会
+破坏缓存,等等。请记住,硬件更弱、cache更小是很久以前的事了,当时人们在
+nice +19级别运行数量颇多的应用程序)。
+
+因此,对于HZ=1000,我们将nice +19改为5毫秒,因为这感觉像是正确的最小
+粒度——这相当于5%的CPU利用率。但nice +19的根本的HZ敏感属性依然保持不变,
+我们没有收到过关于nice +19在CPU利用率方面太 _弱_ 的任何抱怨,我们只收到
+过它(依然)太 _强_ 的抱怨 :-)。
+
+总结一下:我们一直想让nice各级别一致性更强,但在HZ和jiffies的限制下,以及
+nice级别与时间片、调度粒度耦合是令人讨厌的设计,这一目标并不真正可行。
+
+第二个关于Linux nice级别支持的抱怨是(不那么频繁,但仍然定期发生),它
+在原点周围的不对称性(你可以在上面的图片中看到),或者更准确地说:事实上
+nice级别的行为取决于 _绝对的_ nice级别,而nice应用程序接口本身从根本上
+说是“相对”的:
+
+ int nice(int inc);
+
+ asmlinkage long sys_nice(int increment)
+
+(第一个是glibc的应用程序接口,第二个是syscall的应用程序接口)
+注意,“inc”是相对当前nice级别而言的,类似bash的“nice”命令等工具是这个
+相对性应用程序接口的镜像。
+
+在旧的调度器中,举例来说,如果你以nice +1启动一个任务,并以nice +2启动
+另一个任务,这两个任务的CPU分配将取决于父外壳程序的nice级别——如果它是
+nice -10,那么CPU的分配不同于+5或+10。
+
+第三个关于Linux nice级别支持的抱怨是,负数nice级别“不够有力”,以很多人
+不得不诉诸于实时调度优先级来运行音频(和其它多媒体)应用程序,比如
+SCHED_FIFO。但这也造成了其它问题:SCHED_FIFO未被证明是免于饥饿的,一个
+有问题的SCHED_FIFO应用程序也会锁住运行良好的系统。
+
+v2.6.23版内核的新调度器解决了这三种类型的抱怨:
+
+为了解决第一个抱怨(nice级别不够“有力”),调度器与“时间片”、HZ的概念
+解耦(调度粒度被处理成一个和nice级别独立的概念),因此有可能实现更好、
+更一致的nice +19支持:在新的调度器中,nice +19的任务得到一个HZ无关的
+1.5%CPU使用率,而不是旧版调度器中3%-5%-9%的可变范围。
+
+为了解决第二个抱怨(nice各级别不一致),新调度器令调用nice(1)对各任务的
+CPU利用率有相同的影响,无论其绝对nice级别如何。所以在新调度器上,运行一个
+nice +10和一个nice +11的任务会与运行一个nice -5和一个nice -4的任务的
+CPU利用率分割是相同的(一个会得到55%的CPU,另一个会得到45%)。这是为什么
+nice级别被改为“乘法”(或指数)——这样的话,不管你从哪个级别开始,“相对”
+结果将总是一样的。
+
+第三个抱怨(负数nice级别不够“有力”,并迫使音频应用程序在更危险的
+SCHED_FIFO调度策略下运行)几乎被新的调度器自动解决了:更强的负数级别
+具有重新校正nice级别动态范围的自动化副作用。
diff --git a/Documentation/translations/zh_CN/scheduler/sched-stats.rst b/Documentation/translations/zh_CN/scheduler/sched-stats.rst
new file mode 100644
index 000000000000..1c68c3d1c283
--- /dev/null
+++ b/Documentation/translations/zh_CN/scheduler/sched-stats.rst
@@ -0,0 +1,156 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/scheduler/sched-stats.rst
+
+:翻译:
+
+ 唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
+
+==============
+调度器统计数据
+==============
+
+第15版schedstats去掉了sched_yield的一些计数器:yld_exp_empty,yld_act_empty
+和yld_both_empty。在其它方面和第14版完全相同。
+
+第14版schedstats包括对sched_domains(译注:调度域)的支持,该特性进入内核
+主线2.6.20,不过这一版schedstats与2.6.13-2.6.19内核的版本12的统计数据是完全
+相同的(内核未发布第13版)。有些计数器按每个运行队列统计是更有意义的,其它则
+按每个调度域统计是更有意义的。注意,调度域(以及它们的附属信息)仅在开启
+CONFIG_SMP的机器上是相关的和可用的。
+
+在第14版schedstat中,每个被列出的CPU至少会有一级域统计数据,且很可能有一个
+以上的域。在这个实现中,域没有特别的名字,但是编号最高的域通常在机器上所有的
+CPU上仲裁平衡,而domain0是最紧密聚焦的域,有时仅在一对CPU之间进行平衡。此时,
+没有任何体系结构需要3层以上的域。域统计数据中的第一个字段是一个位图,表明哪些
+CPU受该域的影响。
+
+这些字段是计数器,而且只能递增。使用这些字段的程序将需要从基线观测开始,然后在
+后续每一个观测中计算出计数器的变化。一个能以这种方式处理其中很多字段的perl脚本
+可见
+
+ http://eaglet.pdxhosts.com/rick/linux/schedstat/
+
+请注意,任何这样的脚本都必须是特定于版本的,改变版本的主要原因是输出格式的变化。
+对于那些希望编写自己的脚本的人,可以参考这里描述的各个字段。
+
+CPU统计数据
+-----------
+cpu<N> 1 2 3 4 5 6 7 8 9
+
+第一个字段是sched_yield()的统计数据:
+
+ 1) sched_yield()被调用了#次
+
+接下来的三个是schedule()的统计数据:
+
+ 2) 这个字段是一个过时的数组过期计数,在O(1)调度器中使用。为了ABI兼容性,
+ 我们保留了它,但它总是被设置为0。
+ 3) schedule()被调用了#次
+ 4) 调用schedule()导致处理器变为空闲了#次
+
+接下来的两个是try_to_wake_up()的统计数据:
+
+ 5) try_to_wake_up()被调用了#次
+ 6) 调用try_to_wake_up()导致本地CPU被唤醒了#次
+
+接下来的三个统计数据描述了调度延迟:
+
+ 7) 本处理器运行任务的总时间,单位是jiffies
+ 8) 本处理器任务等待运行的时间,单位是jiffies
+ 9) 本CPU运行了#个时间片
+
+域统计数据
+----------
+
+对于每个被描述的CPU,和它相关的每一个调度域均会产生下面一行数据(注意,如果
+CONFIG_SMP没有被定义,那么*没有*调度域被使用,这些行不会出现在输出中)。
+
+domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
+
+第一个字段是一个位掩码,表明该域在操作哪些CPU。
+
+接下来的24个字段是load_balance()函数的各个统计数据,按空闲类型分组(空闲,
+繁忙,新空闲):
+
+
+ 1) 当CPU空闲时,load_balance()在这个调度域中被调用了#次
+ 2) 当CPU空闲时,load_balance()在这个调度域中被调用,但是发现负载无需
+ 均衡#次
+ 3) 当CPU空闲时,load_balance()在这个调度域中被调用,试图迁移1个或更多
+ 任务且失败了#次
+ 4) 当CPU空闲时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
+ #次
+ 5) 当CPU空闲时,pull_task()在这个调度域中被调用#次
+ 6) 当CPU空闲时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
+ 7) 当CPU空闲时,load_balance()在这个调度域中被调用,未能找到更繁忙的
+ 队列#次
+ 8) 当CPU空闲时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
+ #次
+ 9) 当CPU繁忙时,load_balance()在这个调度域中被调用了#次
+ 10) 当CPU繁忙时,load_balance()在这个调度域中被调用,但是发现负载无需
+ 均衡#次
+ 11) 当CPU繁忙时,load_balance()在这个调度域中被调用,试图迁移1个或更多
+ 任务且失败了#次
+ 12) 当CPU繁忙时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
+ #次
+ 13) 当CPU繁忙时,pull_task()在这个调度域中被调用#次
+ 14) 当CPU繁忙时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
+ 15) 当CPU繁忙时,load_balance()在这个调度域中被调用,未能找到更繁忙的
+ 队列#次
+ 16) 当CPU繁忙时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
+ #次
+ 17) 当CPU新空闲时,load_balance()在这个调度域中被调用了#次
+ 18) 当CPU新空闲时,load_balance()在这个调度域中被调用,但是发现负载无需
+ 均衡#次
+ 19) 当CPU新空闲时,load_balance()在这个调度域中被调用,试图迁移1个或更多
+ 任务且失败了#次
+ 20) 当CPU新空闲时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
+ #次
+ 21) 当CPU新空闲时,pull_task()在这个调度域中被调用#次
+ 22) 当CPU新空闲时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
+ 23) 当CPU新空闲时,load_balance()在这个调度域中被调用,未能找到更繁忙的
+ 队列#次
+ 24) 当CPU新空闲时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
+ #次
+
+接下来的3个字段是active_load_balance()函数的各个统计数据:
+
+ 25) active_load_balance()被调用了#次
+ 26) active_load_balance()被调用,试图迁移1个或更多任务且失败了#次
+ 27) active_load_balance()被调用,成功迁移了#次任务
+
+接下来的3个字段是sched_balance_exec()函数的各个统计数据:
+
+ 28) sbe_cnt不再被使用
+ 29) sbe_balanced不再被使用
+ 30) sbe_pushed不再被使用
+
+接下来的3个字段是sched_balance_fork()函数的各个统计数据:
+
+ 31) sbf_cnt不再被使用
+ 32) sbf_balanced不再被使用
+ 33) sbf_pushed不再被使用
+
+接下来的3个字段是try_to_wake_up()函数的各个统计数据:
+
+ 34) 在这个调度域中调用try_to_wake_up()唤醒任务时,任务在调度域中一个
+ 和上次运行不同的新CPU上运行了#次
+ 35) 在这个调度域中调用try_to_wake_up()唤醒任务时,任务被迁移到发生唤醒
+ 的CPU次数为#,因为该任务在原CPU是冷缓存状态
+ 36) 在这个调度域中调用try_to_wake_up()唤醒任务时,引发被动负载均衡#次
+
+/proc/<pid>/schedstat
+---------------------
+schedstats还添加了一个新的/proc/<pid>/schedstat文件,来提供一些进程级的
+相同信息。这个文件中,有三个字段与该进程相关:
+
+ 1) 在CPU上运行花费的时间
+ 2) 在运行队列上等待的时间
+ 3) 在CPU上运行了#个时间片
+
+可以很容易地编写一个程序,利用这些额外的字段来报告一个特定的进程或一组进程在
+调度器策略下的表现如何。这样的程序的一个简单版本可在下面的链接找到
+
+ http://eaglet.pdxhosts.com/rick/linux/schedstat/v12/latency.c
diff --git a/Documentation/translations/zh_CN/vm/active_mm.rst b/Documentation/translations/zh_CN/vm/active_mm.rst
new file mode 100644
index 000000000000..366609ea4f37
--- /dev/null
+++ b/Documentation/translations/zh_CN/vm/active_mm.rst
@@ -0,0 +1,85 @@
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/vm/active_mm.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+
+=========
+Active MM
+=========
+
+这是一封linux之父回复开发者的一封邮件,所以翻译时我尽量保持邮件格式的完整。
+
+::
+
+ List: linux-kernel
+ Subject: Re: active_mm
+ From: Linus Torvalds <torvalds () transmeta ! com>
+ Date: 1999-07-30 21:36:24
+
+ 因为我并不经常写解释,所以已经抄送到linux-kernel邮件列表,而当我做这些,
+ 且更多的人在阅读它们时,我觉得棒极了。
+
+ 1999年7月30日 星期五, David Mosberger 写道:
+ >
+ > 是否有一个简短的描述,说明task_struct中的
+ > "mm" 和 "active_mm"应该如何使用? (如果
+ > 这个问题在邮件列表中讨论过,我表示歉意--我刚
+ > 刚度假回来,有一段时间没能关注linux-kernel了)。
+
+ 基本上,新的设定是:
+
+ - 我们有“真实地址空间”和“匿名地址空间”。区别在于,匿名地址空间根本不关心用
+ 户级页表,所以当我们做上下文切换到匿名地址空间时,我们只是让以前的地址空间
+ 处于活动状态。
+
+ 一个“匿名地址空间”的明显用途是任何不需要任何用户映射的线程--所有的内核线
+ 程基本上都属于这一类,但即使是“真正的”线程也可以暂时说在一定时间内它们不
+ 会对用户空间感兴趣,调度器不妨试着避免在切换VM状态上浪费时间。目前只有老
+ 式的bdflush sync能做到这一点。
+
+ - “tsk->mm” 指向 “真实地址空间”。对于一个匿名进程来说,tsk->mm将是NULL,
+ 其逻辑原因是匿名进程实际上根本就 “没有” 真正的地址空间。
+
+ - 然而,我们显然需要跟踪我们为这样的匿名用户“偷用”了哪个地址空间。为此,我们
+ 有 “tsk->active_mm”,它显示了当前活动的地址空间是什么。
+
+ 规则是,对于一个有真实地址空间的进程(即tsk->mm是 non-NULL),active_mm
+ 显然必须与真实的mm相同。
+
+ 对于一个匿名进程,tsk->mm == NULL,而tsk->active_mm是匿名进程运行时
+ “借用”的mm。当匿名进程被调度走时,借用的地址空间被返回并清除。
+
+ 为了支持所有这些,“struct mm_struct”现在有两个计数器:一个是 “mm_users”
+ 计数器,即有多少 “真正的地址空间用户”,另一个是 “mm_count”计数器,即 “lazy”
+ 用户(即匿名用户)的数量,如果有任何真正的用户,则加1。
+
+ 通常情况下,至少有一个真正的用户,但也可能是真正的用户在另一个CPU上退出,而
+ 一个lazy的用户仍在活动,所以你实际上得到的情况是,你有一个地址空间 **只**
+ 被lazy的用户使用。这通常是一个短暂的生命周期状态,因为一旦这个线程被安排给一
+ 个真正的线程,这个 “僵尸” mm就会被释放,因为 “mm_count”变成了零。
+
+ 另外,一个新的规则是,**没有人** 再把 “init_mm” 作为一个真正的MM了。
+ “init_mm”应该被认为只是一个 “没有其他上下文时的lazy上下文”,事实上,它主
+ 要是在启动时使用,当时还没有真正的VM被创建。因此,用来检查的代码
+
+ if (current->mm == &init_mm)
+
+ 一般来说,应该用
+
+ if (!current->mm)
+
+ 取代上面的写法(这更有意义--测试基本上是 “我们是否有一个用户环境”,并且通常
+ 由缺页异常处理程序和类似的东西来完成)。
+
+ 总之,我刚才在ftp.kernel.org上放了一个pre-patch-2.3.13-1,因为它稍微改
+ 变了接口以适配alpha(谁会想到呢,但alpha体系结构上下文切换代码实际上最终是
+ 最丑陋的之一--不像其他架构的MM和寄存器状态是分开的,alpha的PALcode将两者
+ 连接起来,你需要同时切换两者)。
+
+ (文档来源 http://marc.info/?l=linux-kernel&m=93337278602211&w=2)
diff --git a/Documentation/translations/zh_CN/vm/balance.rst b/Documentation/translations/zh_CN/vm/balance.rst
new file mode 100644
index 000000000000..e98a47ef24a8
--- /dev/null
+++ b/Documentation/translations/zh_CN/vm/balance.rst
@@ -0,0 +1,81 @@
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/vm/balance.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+
+========
+内存平衡
+========
+
+2000年1月开始,作者:Kanoj Sarcar <kanoj@sgi.com>
+
+对于 !__GFP_HIGH 和 !__GFP_KSWAPD_RECLAIM 以及非 __GFP_IO 的分配,需要进行
+内存平衡。
+
+调用者避免回收的第一个原因是调用者由于持有自旋锁或处于中断环境中而无法睡眠。第二个
+原因可能是,调用者愿意在不产生页面回收开销的情况下分配失败。这可能发生在有0阶回退
+选项的机会主义高阶分配请求中。在这种情况下,调用者可能也希望避免唤醒kswapd。
+
+__GFP_IO分配请求是为了防止文件系统死锁。
+
+在没有非睡眠分配请求的情况下,做平衡似乎是有害的。页面回收可以被懒散地启动,也就是
+说,只有在需要的时候(也就是区域的空闲内存为0),而不是让它成为一个主动的过程。
+
+也就是说,内核应该尝试从直接映射池中满足对直接映射页的请求,而不是回退到dma池中,
+这样就可以保持dma池为dma请求(不管是不是原子的)所填充。类似的争论也适用于高内存
+和直接映射的页面。相反,如果有很多空闲的dma页,最好是通过从dma池中分配一个来满足
+常规的内存请求,而不是产生常规区域平衡的开销。
+
+在2.2中,只有当空闲页总数低于总内存的1/64时,才会启动内存平衡/页面回收。如果dma
+和常规内存的比例合适,即使dma区完全空了,也很可能不会进行平衡。2.2已经在不同内存
+大小的生产机器上运行,即使有这个问题存在,似乎也做得不错。在2.3中,由于HIGHMEM的
+存在,这个问题变得更加严重。
+
+在2.3中,区域平衡可以用两种方式之一来完成:根据区域的大小(可能是低级区域的大小),
+我们可以在初始化阶段决定在平衡任何区域时应该争取多少空闲页。好的方面是,在平衡的时
+候,我们不需要看低级区的大小,坏的方面是,我们可能会因为忽略低级区可能较低的使用率
+而做过于频繁的平衡。另外,只要对分配程序稍作修改,就有可能将memclass()宏简化为一
+个简单的等式。
+
+另一个可能的解决方案是,我们只在一个区 **和** 其所有低级区的空闲内存低于该区及其
+低级区总内存的1/64时进行平衡。这就解决了2.2的平衡问题,并尽可能地保持了与2.2行为
+的接近。另外,平衡算法在各种架构上的工作方式也是一样的,这些架构有不同数量和类型的
+内存区。如果我们想变得更花哨一点,我们可以在未来为不同区域的自由页面分配不同的权重。
+
+请注意,如果普通区的大小与dma区相比是巨大的,那么在决定是否平衡普通区的时候,考虑
+空闲的dma页就变得不那么重要了。那么第一个解决方案就变得更有吸引力。
+
+所附的补丁实现了第二个解决方案。它还 “修复”了两个问题:首先,在低内存条件下,kswapd
+被唤醒,就像2.2中的非睡眠分配。第二,HIGHMEM区也被平衡了,以便给replace_with_highmem()
+一个争取获得HIGHMEM页的机会,同时确保HIGHMEM分配不会落回普通区。这也确保了HIGHMEM
+页不会被泄露(例如,在一个HIGHMEM页在交换缓存中但没有被任何人使用的情况下)。
+
+kswapd还需要知道它应该平衡哪些区。kswapd主要是在无法进行平衡的情况下需要的,可能
+是因为所有的分配请求都来自中断上下文,而所有的进程上下文都在睡眠。对于2.3,
+kswapd并不真正需要平衡高内存区,因为中断上下文并不请求高内存页。kswapd看zone
+结构体中的zone_wake_kswapd字段来决定一个区是否需要平衡。
+
+如果从进程内存和shm中偷取页面可以减轻该页面节点中任何区的内存压力,而该区的内存压力
+已经低于其水位,则会进行偷取。
+
+watemark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd:
+这些是每个区的字段,用于确定一个区何时需要平衡。当页面数低于水位[WMARK_MIN]时,
+hysteric 的字段low_on_memory被设置。这个字段会一直被设置,直到空闲页数变成水位
+[WMARK_HIGH]。当low_on_memory被设置时,页面分配请求将尝试释放该区域的一些页面(如果
+请求中设置了GFP_WAIT)。与此相反的是,决定唤醒kswapd以释放一些区的页。这个决定不是基于
+hysteresis 的,而是当空闲页的数量低于watermark[WMARK_LOW]时就会进行;在这种情况下,
+zone_wake_kswapd也被设置。
+
+
+我所听到的(超棒的)想法:
+
+1. 动态经历应该影响平衡:可以跟踪一个区的失败请求的数量,并反馈到平衡方案中(jalvo@mbay.net)。
+
+2. 实现一个类似于replace_with_highmem()的replace_with_regular(),以保留dma页面。
+ (lkd@tantalophile.demon.co.uk)
diff --git a/Documentation/translations/zh_CN/vm/damon/api.rst b/Documentation/translations/zh_CN/vm/damon/api.rst
new file mode 100644
index 000000000000..21143eea4ebe
--- /dev/null
+++ b/Documentation/translations/zh_CN/vm/damon/api.rst
@@ -0,0 +1,32 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Original: Documentation/vm/damon/api.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+
+=======
+API参考
+=======
+
+内核空间的程序可以使用下面的API来使用DAMON的每个功能。你所需要做的就是引用 ``damon.h`` ,
+它位于源代码树的include/linux/。
+
+结构体
+======
+
+该API在以下内核代码中:
+
+include/linux/damon.h
+
+
+函数
+====
+
+该API在以下内核代码中:
+
+mm/damon/core.c
diff --git a/Documentation/translations/zh_CN/vm/damon/design.rst b/Documentation/translations/zh_CN/vm/damon/design.rst
new file mode 100644
index 000000000000..05f66c02740a
--- /dev/null
+++ b/Documentation/translations/zh_CN/vm/damon/design.rst
@@ -0,0 +1,139 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Original: Documentation/vm/damon/design.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+
+====
+设计
+====
+
+可配置的层
+==========
+
+DAMON提供了数据访问监控功能,同时使其准确性和开销可控。基本的访问监控需要依赖于目标地址空间
+并为之优化的基元。另一方面,作为DAMON的核心,准确性和开销的权衡机制是在纯逻辑空间中。DAMON
+将这两部分分离在不同的层中,并定义了它的接口,以允许各种低层次的基元实现与核心逻辑的配置。
+
+由于这种分离的设计和可配置的接口,用户可以通过配置核心逻辑和适当的低级基元实现来扩展DAMON的
+任何地址空间。如果没有提供合适的,用户可以自己实现基元。
+
+例如,物理内存、虚拟内存、交换空间、那些特定的进程、NUMA节点、文件和支持的内存设备将被支持。
+另外,如果某些架构或设备支持特殊的优化访问检查基元,这些基元将很容易被配置。
+
+
+特定地址空间基元的参考实现
+==========================
+
+基本访问监测的低级基元被定义为两部分。:
+
+1. 确定地址空间的监测目标地址范围
+2. 目标空间中特定地址范围的访问检查。
+
+DAMON目前为物理和虚拟地址空间提供了基元的实现。下面两个小节描述了这些工作的方式。
+
+
+基于VMA的目标地址范围构造
+-------------------------
+
+这仅仅是针对虚拟地址空间基元的实现。对于物理地址空间,只是要求用户手动设置监控目标地址范围。
+
+在进程的超级巨大的虚拟地址空间中,只有小部分被映射到物理内存并被访问。因此,跟踪未映射的地
+址区域只是一种浪费。然而,由于DAMON可以使用自适应区域调整机制来处理一定程度的噪声,所以严
+格来说,跟踪每一个映射并不是必须的,但在某些情况下甚至会产生很高的开销。也就是说,监测目标
+内部过于巨大的未映射区域应该被移除,以不占用自适应机制的时间。
+
+出于这个原因,这个实现将复杂的映射转换为三个不同的区域,覆盖地址空间的每个映射区域。这三个
+区域之间的两个空隙是给定地址空间中两个最大的未映射区域。这两个最大的未映射区域是堆和最上面
+的mmap()区域之间的间隙,以及在大多数情况下最下面的mmap()区域和堆之间的间隙。因为这些间隙
+在通常的地址空间中是异常巨大的,排除这些间隙就足以做出合理的权衡。下面详细说明了这一点::
+
+ <heap>
+ <BIG UNMAPPED REGION 1>
+ <uppermost mmap()-ed region>
+ (small mmap()-ed regions and munmap()-ed regions)
+ <lowermost mmap()-ed region>
+ <BIG UNMAPPED REGION 2>
+ <stack>
+
+
+基于PTE访问位的访问检查
+-----------------------
+
+物理和虚拟地址空间的实现都使用PTE Accessed-bit进行基本访问检查。唯一的区别在于从地址中
+找到相关的PTE访问位的方式。虚拟地址的实现是为该地址的目标任务查找页表,而物理地址的实现则
+是查找与该地址有映射关系的每一个页表。通过这种方式,实现者找到并清除下一个采样目标地址的位,
+并检查该位是否在一个采样周期后再次设置。这可能会干扰其他使用访问位的内核子系统,即空闲页跟
+踪和回收逻辑。为了避免这种干扰,DAMON使其与空闲页面跟踪相互排斥,并使用 ``PG_idle`` 和
+``PG_young`` 页面标志来解决与回收逻辑的冲突,就像空闲页面跟踪那样。
+
+
+独立于地址空间的核心机制
+========================
+
+下面四个部分分别描述了DAMON的核心机制和五个监测属性,即 ``采样间隔`` 、 ``聚集间隔`` 、
+``区域更新间隔`` 、 ``最小区域数`` 和 ``最大区域数`` 。
+
+
+访问频率监测
+------------
+
+DAMON的输出显示了在给定的时间内哪些页面的访问频率是多少。访问频率的分辨率是通过设置
+``采样间隔`` 和 ``聚集间隔`` 来控制的。详细地说,DAMON检查每个 ``采样间隔`` 对每
+个页面的访问,并将结果汇总。换句话说,计算每个页面的访问次数。在每个 ``聚合间隔`` 过
+去后,DAMON调用先前由用户注册的回调函数,以便用户可以阅读聚合的结果,然后再清除这些结
+果。这可以用以下简单的伪代码来描述::
+
+ while monitoring_on:
+ for page in monitoring_target:
+ if accessed(page):
+ nr_accesses[page] += 1
+ if time() % aggregation_interval == 0:
+ for callback in user_registered_callbacks:
+ callback(monitoring_target, nr_accesses)
+ for page in monitoring_target:
+ nr_accesses[page] = 0
+ sleep(sampling interval)
+
+这种机制的监测开销将随着目标工作负载规模的增长而任意增加。
+
+
+基于区域的抽样调查
+------------------
+
+为了避免开销的无限制增加,DAMON将假定具有相同访问频率的相邻页面归入一个区域。只要保持
+这个假设(一个区域内的页面具有相同的访问频率),该区域内就只需要检查一个页面。因此,对
+于每个 ``采样间隔`` ,DAMON在每个区域中随机挑选一个页面,等待一个 ``采样间隔`` ,检
+查该页面是否同时被访问,如果被访问则增加该区域的访问频率。因此,监测开销是可以通过设置
+区域的数量来控制的。DAMON允许用户设置最小和最大的区域数量来进行权衡。
+
+然而,如果假设没有得到保证,这个方案就不能保持输出的质量。
+
+
+适应性区域调整
+--------------
+
+即使最初的监测目标区域被很好地构建以满足假设(同一区域内的页面具有相似的访问频率),数
+据访问模式也会被动态地改变。这将导致监测质量下降。为了尽可能地保持假设,DAMON根据每个
+区域的访问频率自适应地进行合并和拆分。
+
+对于每个 ``聚集区间`` ,它比较相邻区域的访问频率,如果频率差异较小,就合并这些区域。
+然后,在它报告并清除每个区域的聚合接入频率后,如果区域总数不超过用户指定的最大区域数,
+它将每个区域拆分为两个或三个区域。
+
+通过这种方式,DAMON提供了其最佳的质量和最小的开销,同时保持了用户为其权衡设定的界限。
+
+
+动态目标空间更新处理
+--------------------
+
+监测目标地址范围可以动态改变。例如,虚拟内存可以动态地被映射和解映射。物理内存可以被
+热插拔。
+
+由于在某些情况下变化可能相当频繁,DAMON检查动态内存映射的变化,并仅在用户指定的时间
+间隔( ``区域更新间隔`` )内将其应用于抽象的目标区域。
diff --git a/Documentation/translations/zh_CN/vm/damon/faq.rst b/Documentation/translations/zh_CN/vm/damon/faq.rst
new file mode 100644
index 000000000000..07b4ac19407d
--- /dev/null
+++ b/Documentation/translations/zh_CN/vm/damon/faq.rst
@@ -0,0 +1,48 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Original: Documentation/vm/damon/faq.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+
+========
+常见问题
+========
+
+为什么是一个新的子系统,而不是扩展perf或其他用户空间工具?
+==========================================================
+
+首先,因为它需要尽可能的轻量级,以便可以在线使用,所以应该避免任何不必要的开销,如内核-用户
+空间的上下文切换成本。第二,DAMON的目标是被包括内核在内的其他程序所使用。因此,对特定工具
+(如perf)的依赖性是不可取的。这就是DAMON在内核空间实现的两个最大的原因。
+
+
+“闲置页面跟踪” 或 “perf mem” 可以替代DAMON吗?
+==============================================
+
+闲置页跟踪是物理地址空间访问检查的一个低层次的原始方法。“perf mem”也是类似的,尽管它可以
+使用采样来减少开销。另一方面,DAMON是一个更高层次的框架,用于监控各种地址空间。它专注于内
+存管理优化,并提供复杂的精度/开销处理机制。因此,“空闲页面跟踪” 和 “perf mem” 可以提供
+DAMON输出的一个子集,但不能替代DAMON。
+
+
+DAMON是否只支持虚拟内存?
+=========================
+
+不,DAMON的核心是独立于地址空间的。用户可以在DAMON核心上实现和配置特定地址空间的低级原始
+部分,包括监测目标区域的构造和实际的访问检查。通过这种方式,DAMON用户可以用任何访问检查技
+术来监测任何地址空间。
+
+尽管如此,DAMON默认为虚拟内存和物理内存提供了基于vma/rmap跟踪和PTE访问位检查的地址空间
+相关功能的实现,以供参考和方便使用。
+
+
+我可以简单地监测页面的粒度吗?
+==============================
+
+是的,你可以通过设置 ``min_nr_regions`` 属性高于工作集大小除以页面大小的值来实现。
+因为监视目标区域的大小被强制为 ``>=page size`` ,所以区域分割不会产生任何影响。
diff --git a/Documentation/translations/zh_CN/vm/damon/index.rst b/Documentation/translations/zh_CN/vm/damon/index.rst
new file mode 100644
index 000000000000..84d36d90c9b0
--- /dev/null
+++ b/Documentation/translations/zh_CN/vm/damon/index.rst
@@ -0,0 +1,33 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Original: Documentation/vm/damon/index.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+
+==========================
+DAMON:数据访问监视器
+==========================
+
+DAMON是Linux内核的一个数据访问监控框架子系统。DAMON的核心机制使其成为
+(该核心机制详见(Documentation/translations/zh_CN/vm/damon/design.rst))
+
+ - *准确度* (监测输出对DRAM级别的内存管理足够有用;但可能不适合CPU Cache级别),
+ - *轻量级* (监控开销低到可以在线应用),以及
+ - *可扩展* (无论目标工作负载的大小,开销的上限值都在恒定范围内)。
+
+因此,利用这个框架,内核的内存管理机制可以做出高级决策。会导致高数据访问监控开销的实
+验性内存管理优化工作可以再次进行。同时,在用户空间,有一些特殊工作负载的用户可以编写
+个性化的应用程序,以便更好地了解和优化他们的工作负载和系统。
+
+.. toctree::
+ :maxdepth: 2
+
+ faq
+ design
+ api
+
diff --git a/Documentation/translations/zh_CN/vm/free_page_reporting.rst b/Documentation/translations/zh_CN/vm/free_page_reporting.rst
new file mode 100644
index 000000000000..31d6c34b956b
--- /dev/null
+++ b/Documentation/translations/zh_CN/vm/free_page_reporting.rst
@@ -0,0 +1,38 @@
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/vm/_free_page_reporting.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+==========
+空闲页报告
+==========
+
+空闲页报告是一个API,设备可以通过它来注册接收系统当前未使用的页面列表。这在虚拟
+化的情况下是很有用的,客户机能够使用这些数据来通知管理器它不再使用内存中的某些页
+面。
+
+对于驱动,通常是气球驱动要使用这个功能,它将分配和初始化一个page_reporting_dev_info
+结构体。它要填充的结构体中的字段是用于处理散点列表的 "report" 函数指针。它还必
+须保证每次调用该函数时能处理至少相当于PAGE_REPORTING_CAPACITY的散点列表条目。
+假设没有其他页面报告设备已经注册, 对page_reporting_register的调用将向报告框
+架注册页面报告接口。
+
+一旦注册,页面报告API将开始向驱动报告成批的页面。API将在接口被注册后2秒开始报告
+页面,并在任何足够高的页面被释放之后2秒继续报告。
+
+报告的页面将被存储在传递给报告函数的散列表中,最后一个条目的结束位被设置在条目
+nent-1中。 当页面被报告函数处理时,分配器将无法访问它们。一旦报告函数完成,这些
+页将被返回到它们所获得的自由区域。
+
+在移除使用空闲页报告的驱动之前,有必要调用page_reporting_unregister,以移除
+目前被空闲页报告使用的page_reporting_dev_info结构体。这样做将阻止进一步的报
+告通过该接口发出。如果另一个驱动或同一驱动被注册,它就有可能恢复前一个驱动在报告
+空闲页方面的工作。
+
+
+Alexander Duyck, 2019年12月04日
diff --git a/Documentation/translations/zh_CN/vm/highmem.rst b/Documentation/translations/zh_CN/vm/highmem.rst
new file mode 100644
index 000000000000..018838e58c3e
--- /dev/null
+++ b/Documentation/translations/zh_CN/vm/highmem.rst
@@ -0,0 +1,128 @@
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/vm/highmem.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+==========
+高内存处理
+==========
+
+作者: Peter Zijlstra <a.p.zijlstra@chello.nl>
+
+.. contents:: :local:
+
+高内存是什么?
+==============
+
+当物理内存的大小接近或超过虚拟内存的最大大小时,就会使用高内存(highmem)。在这一点上,内
+核不可能在任何时候都保持所有可用的物理内存的映射。这意味着内核需要开始使用它想访问的物理内
+存的临时映射。
+
+没有被永久映射覆盖的那部分(物理)内存就是我们所说的 "高内存"。对于这个边界的确切位置,有
+各种架构上的限制。
+
+例如,在i386架构中,我们选择将内核映射到每个进程的虚拟空间,这样我们就不必为内核的进入/退
+出付出全部的TLB作废代价。这意味着可用的虚拟内存空间(i386上为4GiB)必须在用户和内核空间之
+间进行划分。
+
+使用这种方法的架构的传统分配方式是3:1,3GiB用于用户空间,顶部的1GiB用于内核空间。::
+
+ +--------+ 0xffffffff
+ | Kernel |
+ +--------+ 0xc0000000
+ | |
+ | User |
+ | |
+ +--------+ 0x00000000
+
+这意味着内核在任何时候最多可以映射1GiB的物理内存,但是由于我们需要虚拟地址空间来做其他事
+情--包括访问其余物理内存的临时映射--实际的直接映射通常会更少(通常在~896MiB左右)。
+
+其他有mm上下文标签的TLB的架构可以有独立的内核和用户映射。然而,一些硬件(如一些ARM)在使
+用mm上下文标签时,其虚拟空间有限。
+
+
+临时虚拟映射
+============
+
+内核包含几种创建临时映射的方法。:
+
+* vmap(). 这可以用来将多个物理页长期映射到一个连续的虚拟空间。它需要synchronization
+ 来解除映射。
+
+* kmap(). 这允许对单个页面进行短期映射。它需要synchronization,但在一定程度上被摊销。
+ 当以嵌套方式使用时,它也很容易出现死锁,因此不建议在新代码中使用它。
+
+* kmap_atomic(). 这允许对单个页面进行非常短的时间映射。由于映射被限制在发布它的CPU上,
+ 它表现得很好,但发布任务因此被要求留在该CPU上直到它完成,以免其他任务取代它的映射。
+
+ kmap_atomic() 也可以由中断上下文使用,因为它不睡眠,而且调用者可能在调用kunmap_atomic()
+ 之后才睡眠。
+
+ 可以假设k[un]map_atomic()不会失败。
+
+
+使用kmap_atomic
+===============
+
+何时何地使用 kmap_atomic() 是很直接的。当代码想要访问一个可能从高内存(见__GFP_HIGHMEM)
+分配的页面的内容时,例如在页缓存中的页面,就会使用它。该API有两个函数,它们的使用方式与
+下面类似::
+
+ /* 找到感兴趣的页面。 */
+ struct page *page = find_get_page(mapping, offset);
+
+ /* 获得对该页内容的访问权。 */
+ void *vaddr = kmap_atomic(page);
+
+ /* 对该页的内容做一些处理。 */
+ memset(vaddr, 0, PAGE_SIZE);
+
+ /* 解除该页面的映射。 */
+ kunmap_atomic(vaddr);
+
+注意,kunmap_atomic()调用的是kmap_atomic()调用的结果而不是参数。
+
+如果你需要映射两个页面,因为你想从一个页面复制到另一个页面,你需要保持kmap_atomic调用严
+格嵌套,如::
+
+ vaddr1 = kmap_atomic(page1);
+ vaddr2 = kmap_atomic(page2);
+
+ memcpy(vaddr1, vaddr2, PAGE_SIZE);
+
+ kunmap_atomic(vaddr2);
+ kunmap_atomic(vaddr1);
+
+
+临时映射的成本
+==============
+
+创建临时映射的代价可能相当高。体系架构必须操作内核的页表、数据TLB和/或MMU的寄存器。
+
+如果CONFIG_HIGHMEM没有被设置,那么内核会尝试用一点计算来创建映射,将页面结构地址转换成
+指向页面内容的指针,而不是去捣鼓映射。在这种情况下,解映射操作可能是一个空操作。
+
+如果CONFIG_MMU没有被设置,那么就不可能有临时映射和高内存。在这种情况下,也将使用计算方法。
+
+
+i386 PAE
+========
+
+在某些情况下,i386 架构将允许你在 32 位机器上安装多达 64GiB 的内存。但这有一些后果:
+
+* Linux需要为系统中的每个页面建立一个页帧结构,而且页帧需要驻在永久映射中,这意味着:
+
+* 你最多可以有896M/sizeof(struct page)页帧;由于页结构体是32字节的,所以最终会有
+ 112G的页;然而,内核需要在内存中存储更多的页帧......
+
+* PAE使你的页表变大--这使系统变慢,因为更多的数据需要在TLB填充等方面被访问。一个好处
+ 是,PAE有更多的PTE位,可以提供像NX和PAT这样的高级功能。
+
+一般的建议是,你不要在32位机器上使用超过8GiB的空间--尽管更多的空间可能对你和你的工作
+量有用,但你几乎是靠你自己--不要指望内核开发者真的会很关心事情的进展情况。
diff --git a/Documentation/translations/zh_CN/vm/index.rst b/Documentation/translations/zh_CN/vm/index.rst
new file mode 100644
index 000000000000..a1d2f0356cc1
--- /dev/null
+++ b/Documentation/translations/zh_CN/vm/index.rst
@@ -0,0 +1,53 @@
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/vm/index.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+=================
+Linux内存管理文档
+=================
+
+这是一个关于Linux内存管理(mm)子系统内部的文档集,其中有不同层次的细节,包括注释
+和邮件列表的回复,用于阐述数据结构和算法的基本情况。如果你正在寻找关于简单分配内存的建
+议,请参阅(Documentation/translations/zh_CN/core-api/memory-allocation.rst)。
+对于控制和调整指南,请参阅(Documentation/admin-guide/mm/index)。
+TODO:待引用文档集被翻译完毕后请及时修改此处)
+
+.. toctree::
+ :maxdepth: 1
+
+ active_mm
+ balance
+ damon/index
+ free_page_reporting
+ highmem
+ ksm
+
+TODOLIST:
+* arch_pgtable_helpers
+* free_page_reporting
+* frontswap
+* hmm
+* hwpoison
+* hugetlbfs_reserv
+* memory-model
+* mmu_notifier
+* numa
+* overcommit-accounting
+* page_migration
+* page_frags
+* page_owner
+* page_table_check
+* remap_file_pages
+* slub
+* split_page_table_lock
+* transhuge
+* unevictable-lru
+* vmalloced-kernel-stacks
+* z3fold
+* zsmalloc
diff --git a/Documentation/translations/zh_CN/vm/ksm.rst b/Documentation/translations/zh_CN/vm/ksm.rst
new file mode 100644
index 000000000000..83b0c73984da
--- /dev/null
+++ b/Documentation/translations/zh_CN/vm/ksm.rst
@@ -0,0 +1,70 @@
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/vm/ksm.rst
+
+:翻译:
+
+ 徐鑫 xu xin <xu.xin16@zte.com.cn>
+
+============
+内核同页合并
+============
+
+KSM 是一种节省内存的数据去重功能,由CONFIG_KSM=y启用,并在2.6.32版本时被添加
+到Linux内核。详见 ``mm/ksm.c`` 的实现,以及http://lwn.net/Articles/306704和
+https://lwn.net/Articles/330589
+
+KSM的用户空间的接口在Documentation/translations/zh_CN/admin-guide/mm/ksm.rst
+文档中有描述。
+
+设计
+====
+
+概述
+----
+
+概述内容请见mm/ksm.c文档中的“DOC: Overview”
+
+逆映射
+------
+KSM维护着稳定树中的KSM页的逆映射信息。
+
+当KSM页面的共享数小于 ``max_page_sharing`` 的虚拟内存区域(VMAs)时,则代表了
+KSM页的稳定树其中的节点指向了一个rmap_item结构体类型的列表。同时,这个KSM页
+的 ``page->mapping`` 指向了该稳定树节点。
+
+如果共享数超过了阈值,KSM将给稳定树添加第二个维度。稳定树就变成链接一个或多
+个稳定树"副本"的"链"。每个副本都保留KSM页的逆映射信息,其中 ``page->mapping``
+指向该"副本"。
+
+每个链以及链接到该链中的所有"副本"强制不变的是,它们代表了相同的写保护内存
+内容,尽管任中一个"副本"是由同一片内存区的不同的KSM复制页所指向的。
+
+这样一来,相比与无限的逆映射链表,稳定树的查找计算复杂性不受影响。但在稳定树
+本身中不能有重复的KSM页面内容仍然是强制要求。
+
+由 ``max_page_sharing`` 强制决定的数据去重限制是必要的,以此来避免虚拟内存
+rmap链表变得过大。rmap的遍历具有O(N)的复杂度,其中N是共享页面的rmap_项(即
+虚拟映射)的数量,而这个共享页面的节点数量又被 ``max_page_sharing`` 所限制。
+因此,这有效地将线性O(N)计算复杂度从rmap遍历中分散到不同的KSM页面上。ksmd进
+程在稳定节点"链"上的遍历也是O(N),但这个N是稳定树"副本"的数量,而不是rmap项
+的数量,因此它对ksmd性能没有显著影响。实际上,最佳稳定树"副本"的候选节点将
+保留在"副本"列表的开头。
+
+``max_page_sharing`` 的值设置得高了会促使更快的内存合并(因为将有更少的稳定
+树副本排队进入稳定节点chain->hlist)和更高的数据去重系数,但代价是在交换、压
+缩、NUMA平衡和页面迁移过程中可能导致KSM页的最大rmap遍历速度较慢。
+
+``stable_node_dups/stable_node_chains`` 的比值还受 ``max_page_sharing`` 调控
+的影响,高比值可能意味着稳定节点dup中存在碎片,这可以通过在ksmd中引入碎片算
+法来解决,该算法将rmap项从一个稳定节点dup重定位到另一个稳定节点dup,以便释放
+那些仅包含极少rmap项的稳定节点"dup",但这可能会增加ksmd进程的CPU使用率,并可
+能会减慢应用程序在KSM页面上的只读计算。
+
+KSM会定期扫描稳定节点"链"中链接的所有稳定树"副本",以便删减过时了的稳定节点。
+这种扫描的频率由 ``stable_node_chains_prune_millisecs`` 这个sysfs 接口定义。
+
+参考
+====
+内核代码请见mm/ksm.c。
+涉及的函数(mm_slot ksm_scan stable_node rmap_item)。
diff --git a/Documentation/translations/zh_TW/index.rst b/Documentation/translations/zh_TW/index.rst
index f56f78ba7860..e1ce9d8c06f8 100644
--- a/Documentation/translations/zh_TW/index.rst
+++ b/Documentation/translations/zh_TW/index.rst
@@ -5,7 +5,7 @@
\renewcommand\thesection*
\renewcommand\thesubsection*
\kerneldocCJKon
- \kerneldocBeginTC
+ \kerneldocBeginTC{
.. _linux_doc_zh_tw:
@@ -174,4 +174,4 @@ TODOList:
.. raw:: latex
- \kerneldocEndTC
+ }\kerneldocEndTC
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index e6fce2cbd99e..dfbc27d17ff7 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -256,7 +256,7 @@ Code Seq# Include File Comments
'l' 00-3F linux/tcfs_fs.h transparent cryptographic file system
<http://web.archive.org/web/%2A/http://mikonos.dia.unisa.it/tcfs>
'l' 40-7F linux/udf_fs_i.h in development:
- <http://sourceforge.net/projects/linux-udf/>
+ <https://github.com/pali/udftools>
'm' 00-09 linux/mmtimer.h conflict!
'm' all linux/mtio.h conflict!
'm' all linux/soundcard.h conflict!
diff --git a/Documentation/virt/uml/user_mode_linux_howto_v2.rst b/Documentation/virt/uml/user_mode_linux_howto_v2.rst
index 2cafd3c3c6cb..d5ad96c795f6 100644
--- a/Documentation/virt/uml/user_mode_linux_howto_v2.rst
+++ b/Documentation/virt/uml/user_mode_linux_howto_v2.rst
@@ -664,7 +664,11 @@ one is input, the second one output.
* The fd channel - use file descriptor numbers for input/output. Example:
``con1=fd:0,fd:1.``
-* The port channel - listen on TCP port number. Example: ``con1=port:4321``
+* The port channel - start a telnet server on TCP port number. Example:
+ ``con1=port:4321``. The host must have /usr/sbin/in.telnetd (usually part of
+ a telnetd package) and the port-helper from the UML utilities (see the
+ information for the xterm channel below). UML will not boot until a client
+ connects.
* The pty and pts channels - use system pty/pts.
diff --git a/Documentation/vm/page_owner.rst b/Documentation/vm/page_owner.rst
index 9837fc8147dd..905555e3e483 100644
--- a/Documentation/vm/page_owner.rst
+++ b/Documentation/vm/page_owner.rst
@@ -26,9 +26,9 @@ fragmentation statistics can be obtained through gfp flag information of
each page. It is already implemented and activated if page owner is
enabled. Other usages are more than welcome.
-page owner is disabled in default. So, if you'd like to use it, you need
-to add "page_owner=on" into your boot cmdline. If the kernel is built
-with page owner and page owner is disabled in runtime due to no enabling
+page owner is disabled by default. So, if you'd like to use it, you need
+to add "page_owner=on" to your boot cmdline. If the kernel is built
+with page owner and page owner is disabled in runtime due to not enabling
boot option, runtime overhead is marginal. If disabled in runtime, it
doesn't require memory to store owner information, so there is no runtime
memory overhead. And, page owner inserts just two unlikely branches into
@@ -85,7 +85,7 @@ Usage
cat /sys/kernel/debug/page_owner > page_owner_full.txt
./page_owner_sort page_owner_full.txt sorted_page_owner.txt
- The general output of ``page_owner_full.txt`` is as follows:
+ The general output of ``page_owner_full.txt`` is as follows::
Page allocated via order XXX, ...
PFN XXX ...
@@ -100,7 +100,7 @@ Usage
and pages of buf, and finally sorts them according to the times.
See the result about who allocated each page
- in the ``sorted_page_owner.txt``. General output:
+ in the ``sorted_page_owner.txt``. General output::
XXX times, XXX pages:
Page allocated via order XXX, ...
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index f498f1d36cd3..982c8af853b9 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -21,6 +21,7 @@ x86-specific Documentation
tlb
mtrr
pat
+ intel-hfi
intel-iommu
intel_txt
amd-memory-encryption
diff --git a/Documentation/x86/intel-hfi.rst b/Documentation/x86/intel-hfi.rst
new file mode 100644
index 000000000000..49dea58ea4fb
--- /dev/null
+++ b/Documentation/x86/intel-hfi.rst
@@ -0,0 +1,72 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================================
+Hardware-Feedback Interface for scheduling on Intel Hardware
+============================================================
+
+Overview
+--------
+
+Intel has described the Hardware Feedback Interface (HFI) in the Intel 64 and
+IA-32 Architectures Software Developer's Manual (Intel SDM) Volume 3 Section
+14.6 [1]_.
+
+The HFI gives the operating system a performance and energy efficiency
+capability data for each CPU in the system. Linux can use the information from
+the HFI to influence task placement decisions.
+
+The Hardware Feedback Interface
+-------------------------------
+
+The Hardware Feedback Interface provides to the operating system information
+about the performance and energy efficiency of each CPU in the system. Each
+capability is given as a unit-less quantity in the range [0-255]. Higher values
+indicate higher capability. Energy efficiency and performance are reported in
+separate capabilities. Even though on some systems these two metrics may be
+related, they are specified as independent capabilities in the Intel SDM.
+
+These capabilities may change at runtime as a result of changes in the
+operating conditions of the system or the action of external factors. The rate
+at which these capabilities are updated is specific to each processor model. On
+some models, capabilities are set at boot time and never change. On others,
+capabilities may change every tens of milliseconds. For instance, a remote
+mechanism may be used to lower Thermal Design Power. Such change can be
+reflected in the HFI. Likewise, if the system needs to be throttled due to
+excessive heat, the HFI may reflect reduced performance on specific CPUs.
+
+The kernel or a userspace policy daemon can use these capabilities to modify
+task placement decisions. For instance, if either the performance or energy
+capabilities of a given logical processor becomes zero, it is an indication that
+the hardware recommends to the operating system to not schedule any tasks on
+that processor for performance or energy efficiency reasons, respectively.
+
+Implementation details for Linux
+--------------------------------
+
+The infrastructure to handle thermal event interrupts has two parts. In the
+Local Vector Table of a CPU's local APIC, there exists a register for the
+Thermal Monitor Register. This register controls how interrupts are delivered
+to a CPU when the thermal monitor generates and interrupt. Further details
+can be found in the Intel SDM Vol. 3 Section 10.5 [1]_.
+
+The thermal monitor may generate interrupts per CPU or per package. The HFI
+generates package-level interrupts. This monitor is configured and initialized
+via a set of machine-specific registers. Specifically, the HFI interrupt and
+status are controlled via designated bits in the IA32_PACKAGE_THERM_INTERRUPT
+and IA32_PACKAGE_THERM_STATUS registers, respectively. There exists one HFI
+table per package. Further details can be found in the Intel SDM Vol. 3
+Section 14.9 [1]_.
+
+The hardware issues an HFI interrupt after updating the HFI table and is ready
+for the operating system to consume it. CPUs receive such interrupt via the
+thermal entry in the Local APIC's Local Vector Table.
+
+When servicing such interrupt, the HFI driver parses the updated table and
+relays the update to userspace using the thermal notification framework. Given
+that there may be many HFI updates every second, the updates relayed to
+userspace are throttled at a rate of CONFIG_HZ jiffies.
+
+References
+----------
+
+.. [1] https://www.intel.com/sdm
diff --git a/Documentation/x86/sva.rst b/Documentation/x86/sva.rst
index 076efd51ef1f..2e9b8b0f9a0f 100644
--- a/Documentation/x86/sva.rst
+++ b/Documentation/x86/sva.rst
@@ -104,18 +104,47 @@ The MSR must be configured on each logical CPU before any application
thread can interact with a device. Threads that belong to the same
process share the same page tables, thus the same MSR value.
-PASID is cleared when a process is created. The PASID allocation and MSR
-programming may occur long after a process and its threads have been created.
-One thread must call iommu_sva_bind_device() to allocate the PASID for the
-process. If a thread uses ENQCMD without the MSR first being populated, a #GP
-will be raised. The kernel will update the PASID MSR with the PASID for all
-threads in the process. A single process PASID can be used simultaneously
-with multiple devices since they all share the same address space.
-
-One thread can call iommu_sva_unbind_device() to free the allocated PASID.
-The kernel will clear the PASID MSR for all threads belonging to the process.
-
-New threads inherit the MSR value from the parent.
+PASID Life Cycle Management
+===========================
+
+PASID is initialized as INVALID_IOASID (-1) when a process is created.
+
+Only processes that access SVA-capable devices need to have a PASID
+allocated. This allocation happens when a process opens/binds an SVA-capable
+device but finds no PASID for this process. Subsequent binds of the same, or
+other devices will share the same PASID.
+
+Although the PASID is allocated to the process by opening a device,
+it is not active in any of the threads of that process. It's loaded to the
+IA32_PASID MSR lazily when a thread tries to submit a work descriptor
+to a device using the ENQCMD.
+
+That first access will trigger a #GP fault because the IA32_PASID MSR
+has not been initialized with the PASID value assigned to the process
+when the device was opened. The Linux #GP handler notes that a PASID has
+been allocated for the process, and so initializes the IA32_PASID MSR
+and returns so that the ENQCMD instruction is re-executed.
+
+On fork(2) or exec(2) the PASID is removed from the process as it no
+longer has the same address space that it had when the device was opened.
+
+On clone(2) the new task shares the same address space, so will be
+able to use the PASID allocated to the process. The IA32_PASID is not
+preemptively initialized as the PASID value might not be allocated yet or
+the kernel does not know whether this thread is going to access the device
+and the cleared IA32_PASID MSR reduces context switch overhead by xstate
+init optimization. Since #GP faults have to be handled on any threads that
+were created before the PASID was assigned to the mm of the process, newly
+created threads might as well be treated in a consistent way.
+
+Due to complexity of freeing the PASID and clearing all IA32_PASID MSRs in
+all threads in unbind, free the PASID lazily only on mm exit.
+
+If a process does a close(2) of the device file descriptor and munmap(2)
+of the device MMIO portal, then the driver will unbind the device. The
+PASID is still marked VALID in the PASID_MSR for any threads in the
+process that accessed the device. But this is harmless as without the
+MMIO portal they cannot submit new work to the device.
Relationships
=============