diff options
Diffstat (limited to 'Documentation')
60 files changed, 1615 insertions, 560 deletions
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block index 3879963f0f01..4ba771b56b3b 100644 --- a/Documentation/ABI/stable/sysfs-block +++ b/Documentation/ABI/stable/sysfs-block @@ -77,7 +77,7 @@ Description: What: /sys/block/<disk>/diskseq Date: February 2021 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: The /sys/block/<disk>/diskseq files reports the disk sequence number, which is a monotonically increasing @@ -547,6 +547,21 @@ Description: [RO] Maximum size in bytes of a single element in a DMA scatter/gather list. +What: /sys/block/<disk>/queue/max_write_streams +Date: November 2024 +Contact: linux-block@vger.kernel.org +Description: + [RO] Maximum number of write streams supported, 0 if not + supported. If supported, valid values are 1 through + max_write_streams, inclusive. + +What: /sys/block/<disk>/queue/write_stream_granularity +Date: November 2024 +Contact: linux-block@vger.kernel.org +Description: + [RO] Granularity of a write stream in bytes. The granularity + of a write stream is the size that should be discarded or + overwritten together to avoid write amplification in the device. What: /sys/block/<disk>/queue/max_segments Date: March 2010 diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 206079d3bd5b..6a1acabb29d8 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -511,6 +511,7 @@ Description: information about CPUs heterogeneity. What: /sys/devices/system/cpu/vulnerabilities /sys/devices/system/cpu/vulnerabilities/gather_data_sampling + /sys/devices/system/cpu/vulnerabilities/indirect_target_selection /sys/devices/system/cpu/vulnerabilities/itlb_multihit /sys/devices/system/cpu/vulnerabilities/l1tf /sys/devices/system/cpu/vulnerabilities/mds diff --git a/Documentation/ABI/testing/sysfs-driver-hid-appletb-kbd b/Documentation/ABI/testing/sysfs-driver-hid-appletb-kbd index 2a19584d091e..8c9718d83e9d 100644 --- a/Documentation/ABI/testing/sysfs-driver-hid-appletb-kbd +++ b/Documentation/ABI/testing/sysfs-driver-hid-appletb-kbd @@ -1,6 +1,6 @@ What: /sys/bus/hid/drivers/hid-appletb-kbd/<dev>/mode -Date: September, 2023 -KernelVersion: 6.5 +Date: March, 2025 +KernelVersion: 6.15 Contact: linux-input@vger.kernel.org Description: The set of keys displayed on the Touch Bar. diff --git a/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon b/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon index 9bce281314df..cb207c79680d 100644 --- a/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon +++ b/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon @@ -111,7 +111,7 @@ Description: RO. Package current voltage in millivolt. What: /sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp2_input Date: March 2025 -KernelVersion: 6.14 +KernelVersion: 6.15 Contact: intel-xe@lists.freedesktop.org Description: RO. Package temperature in millidegree Celsius. @@ -119,7 +119,7 @@ Description: RO. Package temperature in millidegree Celsius. What: /sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp3_input Date: March 2025 -KernelVersion: 6.14 +KernelVersion: 6.15 Contact: intel-xe@lists.freedesktop.org Description: RO. VRAM temperature in millidegree Celsius. diff --git a/Documentation/ABI/testing/sysfs-driver-ufs b/Documentation/ABI/testing/sysfs-driver-ufs index ae0191295d29..e36d2de16cbd 100644 --- a/Documentation/ABI/testing/sysfs-driver-ufs +++ b/Documentation/ABI/testing/sysfs-driver-ufs @@ -1604,3 +1604,35 @@ Description: prevent the UFS from frequently performing clock gating/ungating. The attribute is read/write. + +What: /sys/bus/platform/drivers/ufshcd/*/device_lvl_exception_count +What: /sys/bus/platform/devices/*.ufs/device_lvl_exception_count +Date: March 2025 +Contact: Bao D. Nguyen <quic_nguyenb@quicinc.com> +Description: + This attribute is applicable to ufs devices compliant to the + JEDEC specifications version 4.1 or later. The + device_lvl_exception_count is a counter indicating the number of + times the device level exceptions have occurred since the last + time this variable is reset. Writing a 0 value to this + attribute will reset the device_lvl_exception_count. If the + device_lvl_exception_count reads a positive value, the user + application should read the device_lvl_exception_id attribute to + know more information about the exception. + + The attribute is read/write. + +What: /sys/bus/platform/drivers/ufshcd/*/device_lvl_exception_id +What: /sys/bus/platform/devices/*.ufs/device_lvl_exception_id +Date: March 2025 +Contact: Bao D. Nguyen <quic_nguyenb@quicinc.com> +Description: + Reading the device_lvl_exception_id returns the + qDeviceLevelExceptionID attribute of the ufs device JEDEC + specification version 4.1. The definition of the + qDeviceLevelExceptionID is the ufs device vendor specific + implementation. Refer to the device manufacturer datasheet for + more information on the meaning of the qDeviceLevelExceptionID + attribute value. + + The attribute is read only. diff --git a/Documentation/ABI/testing/sysfs-fs-erofs b/Documentation/ABI/testing/sysfs-fs-erofs index b134146d735b..bf3b6299c15e 100644 --- a/Documentation/ABI/testing/sysfs-fs-erofs +++ b/Documentation/ABI/testing/sysfs-fs-erofs @@ -27,3 +27,11 @@ Description: Writing to this will drop compression-related caches, - 1 : invalidate cached compressed folios - 2 : drop in-memory pclusters - 3 : drop in-memory pclusters and cached compressed folios + +What: /sys/fs/erofs/accel +Date: May 2025 +Contact: "Bo Liu" <liubo03@inspur.com> +Description: Used to set or show hardware accelerators in effect + and multiple accelerators are separated by '\n'. + Supported accelerator(s): qat_deflate. + Disable all accelerators with an empty string (echo > accel). diff --git a/Documentation/ABI/testing/sysfs-kernel-reboot b/Documentation/ABI/testing/sysfs-kernel-reboot index e117aba46be0..52571fd5ddba 100644 --- a/Documentation/ABI/testing/sysfs-kernel-reboot +++ b/Documentation/ABI/testing/sysfs-kernel-reboot @@ -1,7 +1,7 @@ What: /sys/kernel/reboot Date: November 2020 KernelVersion: 5.11 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: Interface to set the kernel reboot behavior, similarly to what can be done via the reboot= cmdline option. (see Documentation/admin-guide/kernel-parameters.txt) @@ -9,25 +9,25 @@ Description: Interface to set the kernel reboot behavior, similarly to What: /sys/kernel/reboot/mode Date: November 2020 KernelVersion: 5.11 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: Reboot mode. Valid values are: cold warm hard soft gpio What: /sys/kernel/reboot/type Date: November 2020 KernelVersion: 5.11 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: Reboot type. Valid values are: bios acpi kbd triple efi pci What: /sys/kernel/reboot/cpu Date: November 2020 KernelVersion: 5.11 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: CPU number to use to reboot. What: /sys/kernel/reboot/force Date: November 2020 KernelVersion: 5.11 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: Don't wait for any other CPUs on reboot and avoid anything that could hang. diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst index 957ccf617797..3262397ebe8f 100644 --- a/Documentation/admin-guide/blockdev/index.rst +++ b/Documentation/admin-guide/blockdev/index.rst @@ -11,6 +11,7 @@ Block Devices nbd paride ramdisk + zoned_loop zram drbd/index diff --git a/Documentation/admin-guide/blockdev/zoned_loop.rst b/Documentation/admin-guide/blockdev/zoned_loop.rst new file mode 100644 index 000000000000..9c7aa3b482f3 --- /dev/null +++ b/Documentation/admin-guide/blockdev/zoned_loop.rst @@ -0,0 +1,169 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +Zoned Loop Block Device +======================= + +.. Contents: + + 1) Overview + 2) Creating a Zoned Device + 3) Deleting a Zoned Device + 4) Example + + +1) Overview +----------- + +The zoned loop block device driver (zloop) allows a user to create a zoned block +device using one regular file per zone as backing storage. This driver does not +directly control any hardware and uses read, write and truncate operations to +regular files of a file system to emulate a zoned block device. + +Using zloop, zoned block devices with a configurable capacity, zone size and +number of conventional zones can be created. The storage for each zone of the +device is implemented using a regular file with a maximum size equal to the zone +size. The size of a file backing a conventional zone is always equal to the zone +size. The size of a file backing a sequential zone indicates the amount of data +sequentially written to the file, that is, the size of the file directly +indicates the position of the write pointer of the zone. + +When resetting a sequential zone, its backing file size is truncated to zero. +Conversely, for a zone finish operation, the backing file is truncated to the +zone size. With this, the maximum capacity of a zloop zoned block device created +can be larger configured to be larger than the storage space available on the +backing file system. Of course, for such configuration, writing more data than +the storage space available on the backing file system will result in write +errors. + +The zoned loop block device driver implements a complete zone transition state +machine. That is, zones can be empty, implicitly opened, explicitly opened, +closed or full. The current implementation does not support any limits on the +maximum number of open and active zones. + +No user tools are necessary to create and delete zloop devices. + +2) Creating a Zoned Device +-------------------------- + +Once the zloop module is loaded (or if zloop is compiled in the kernel), the +character device file /dev/zloop-control can be used to add a zloop device. +This is done by writing an "add" command directly to the /dev/zloop-control +device:: + + $ modprobe zloop + $ ls -l /dev/zloop* + crw-------. 1 root root 10, 123 Jan 6 19:18 /dev/zloop-control + + $ mkdir -p <base directory/<device ID> + $ echo "add [options]" > /dev/zloop-control + +The options available for the add command can be listed by reading the +/dev/zloop-control device:: + + $ cat /dev/zloop-control + add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u,buffered_io + remove id=%d + +In more details, the options that can be used with the "add" command are as +follows. + +================ =========================================================== +id Device number (the X in /dev/zloopX). + Default: automatically assigned. +capacity_mb Device total capacity in MiB. This is always rounded up to + the nearest higher multiple of the zone size. + Default: 16384 MiB (16 GiB). +zone_size_mb Device zone size in MiB. Default: 256 MiB. +zone_capacity_mb Device zone capacity (must always be equal to or lower than + the zone size. Default: zone size. +conv_zones Total number of conventioanl zones starting from sector 0. + Default: 8. +base_dir Path to the base directoy where to create the directory + containing the zone files of the device. + Default=/var/local/zloop. + The device directory containing the zone files is always + named with the device ID. E.g. the default zone file + directory for /dev/zloop0 is /var/local/zloop/0. +nr_queues Number of I/O queues of the zoned block device. This value is + always capped by the number of online CPUs + Default: 1 +queue_depth Maximum I/O queue depth per I/O queue. + Default: 64 +buffered_io Do buffered IOs instead of direct IOs (default: false) +================ =========================================================== + +3) Deleting a Zoned Device +-------------------------- + +Deleting an unused zoned loop block device is done by issuing the "remove" +command to /dev/zloop-control, specifying the ID of the device to remove:: + + $ echo "remove id=X" > /dev/zloop-control + +The remove command does not have any option. + +A zoned device that was removed can be re-added again without any change to the +state of the device zones: the device zones are restored to their last state +before the device was removed. Adding again a zoned device after it was removed +must always be done using the same configuration as when the device was first +added. If a zone configuration change is detected, an error will be returned and +the zoned device will not be created. + +To fully delete a zoned device, after executing the remove operation, the device +base directory containing the backing files of the device zones must be deleted. + +4) Example +---------- + +The following sequence of commands creates a 2GB zoned device with zones of 64 +MB and a zone capacity of 63 MB:: + + $ modprobe zloop + $ mkdir -p /var/local/zloop/0 + $ echo "add capacity_mb=2048,zone_size_mb=64,zone_capacity=63MB" > /dev/zloop-control + +For the device created (/dev/zloop0), the zone backing files are all created +under the default base directory (/var/local/zloop):: + + $ ls -l /var/local/zloop/0 + total 0 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000000 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000001 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000002 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000003 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000004 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000005 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000006 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000007 + -rw-------. 1 root root 0 Jan 6 22:23 seq-000008 + -rw-------. 1 root root 0 Jan 6 22:23 seq-000009 + ... + +The zoned device created (/dev/zloop0) can then be used normally:: + + $ lsblk -z + NAME ZONED ZONE-SZ ZONE-NR ZONE-AMAX ZONE-OMAX ZONE-APP ZONE-WGRAN + zloop0 host-managed 64M 32 0 0 1M 4K + $ blkzone report /dev/zloop0 + start: 0x000000000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000020000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000040000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000060000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000080000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000a0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000c0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000e0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000100000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] + start: 0x000120000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] + ... + +Deleting this device is done using the command:: + + $ echo "remove id=0" > /dev/zloop-control + +The removed device can be re-added again using the same "add" command as when +the device was first created. To fully delete a zoned device, its backing files +should also be deleted after executing the remove command:: + + $ rm -r /var/local/zloop/0 diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 1a16ce68a4d7..9e7de8e70048 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -3019,7 +3019,7 @@ Filesystem Support for Writeback -------------------------------- A filesystem can support cgroup writeback by updating -address_space_operations->writepage[s]() to annotate bio's using the +address_space_operations->writepages() to annotate bio's using the following two functions. wbc_init_bio(@wbc, @bio) diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index 451874b8135d..ce296b8430fc 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -23,3 +23,4 @@ are configurable at compile, boot or run time. gather_data_sampling reg-file-data-sampling rsb + indirect-target-selection diff --git a/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst b/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst new file mode 100644 index 000000000000..d9ca64108d23 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst @@ -0,0 +1,168 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Indirect Target Selection (ITS) +=============================== + +ITS is a vulnerability in some Intel CPUs that support Enhanced IBRS and were +released before Alder Lake. ITS may allow an attacker to control the prediction +of indirect branches and RETs located in the lower half of a cacheline. + +ITS is assigned CVE-2024-28956 with a CVSS score of 4.7 (Medium). + +Scope of Impact +--------------- +- **eIBRS Guest/Host Isolation**: Indirect branches in KVM/kernel may still be + predicted with unintended target corresponding to a branch in the guest. + +- **Intra-Mode BTI**: In-kernel training such as through cBPF or other native + gadgets. + +- **Indirect Branch Prediction Barrier (IBPB)**: After an IBPB, indirect + branches may still be predicted with targets corresponding to direct branches + executed prior to the IBPB. This is fixed by the IPU 2025.1 microcode, which + should be available via distro updates. Alternatively microcode can be + obtained from Intel's github repository [#f1]_. + +Affected CPUs +------------- +Below is the list of ITS affected CPUs [#f2]_ [#f3]_: + + ======================== ============ ==================== =============== + Common name Family_Model eIBRS Intra-mode BTI + Guest/Host Isolation + ======================== ============ ==================== =============== + SKYLAKE_X (step >= 6) 06_55H Affected Affected + ICELAKE_X 06_6AH Not affected Affected + ICELAKE_D 06_6CH Not affected Affected + ICELAKE_L 06_7EH Not affected Affected + TIGERLAKE_L 06_8CH Not affected Affected + TIGERLAKE 06_8DH Not affected Affected + KABYLAKE_L (step >= 12) 06_8EH Affected Affected + KABYLAKE (step >= 13) 06_9EH Affected Affected + COMETLAKE 06_A5H Affected Affected + COMETLAKE_L 06_A6H Affected Affected + ROCKETLAKE 06_A7H Not affected Affected + ======================== ============ ==================== =============== + +- All affected CPUs enumerate Enhanced IBRS feature. +- IBPB isolation is affected on all ITS affected CPUs, and need a microcode + update for mitigation. +- None of the affected CPUs enumerate BHI_CTRL which was introduced in Golden + Cove (Alder Lake and Sapphire Rapids). This can help guests to determine the + host's affected status. +- Intel Atom CPUs are not affected by ITS. + +Mitigation +---------- +As only the indirect branches and RETs that have their last byte of instruction +in the lower half of the cacheline are vulnerable to ITS, the basic idea behind +the mitigation is to not allow indirect branches in the lower half. + +This is achieved by relying on existing retpoline support in the kernel, and in +compilers. ITS-vulnerable retpoline sites are runtime patched to point to newly +added ITS-safe thunks. These safe thunks consists of indirect branch in the +second half of the cacheline. Not all retpoline sites are patched to thunks, if +a retpoline site is evaluated to be ITS-safe, it is replaced with an inline +indirect branch. + +Dynamic thunks +~~~~~~~~~~~~~~ +From a dynamically allocated pool of safe-thunks, each vulnerable site is +replaced with a new thunk, such that they get a unique address. This could +improve the branch prediction accuracy. Also, it is a defense-in-depth measure +against aliasing. + +Note, for simplicity, indirect branches in eBPF programs are always replaced +with a jump to a static thunk in __x86_indirect_its_thunk_array. If required, +in future this can be changed to use dynamic thunks. + +All vulnerable RETs are replaced with a static thunk, they do not use dynamic +thunks. This is because RETs get their prediction from RSB mostly that does not +depend on source address. RETs that underflow RSB may benefit from dynamic +thunks. But, RETs significantly outnumber indirect branches, and any benefit +from a unique source address could be outweighed by the increased icache +footprint and iTLB pressure. + +Retpoline +~~~~~~~~~ +Retpoline sequence also mitigates ITS-unsafe indirect branches. For this +reason, when retpoline is enabled, ITS mitigation only relocates the RETs to +safe thunks. Unless user requested the RSB-stuffing mitigation. + +RSB Stuffing +~~~~~~~~~~~~ +RSB-stuffing via Call Depth Tracking is a mitigation for Retbleed RSB-underflow +attacks. And it also mitigates RETs that are vulnerable to ITS. + +Mitigation in guests +^^^^^^^^^^^^^^^^^^^^ +All guests deploy ITS mitigation by default, irrespective of eIBRS enumeration +and Family/Model of the guest. This is because eIBRS feature could be hidden +from a guest. One exception to this is when a guest enumerates BHI_DIS_S, which +indicates that the guest is running on an unaffected host. + +To prevent guests from unnecessarily deploying the mitigation on unaffected +platforms, Intel has defined ITS_NO bit(62) in MSR IA32_ARCH_CAPABILITIES. When +a guest sees this bit set, it should not enumerate the ITS bug. Note, this bit +is not set by any hardware, but is **intended for VMMs to synthesize** it for +guests as per the host's affected status. + +Mitigation options +^^^^^^^^^^^^^^^^^^ +The ITS mitigation can be controlled using the "indirect_target_selection" +kernel parameter. The available options are: + + ======== =================================================================== + on (default) Deploy the "Aligned branch/return thunks" mitigation. + If spectre_v2 mitigation enables retpoline, aligned-thunks are only + deployed for the affected RET instructions. Retpoline mitigates + indirect branches. + + off Disable ITS mitigation. + + vmexit Equivalent to "=on" if the CPU is affected by guest/host isolation + part of ITS. Otherwise, mitigation is not deployed. This option is + useful when host userspace is not in the threat model, and only + attacks from guest to host are considered. + + stuff Deploy RSB-fill mitigation when retpoline is also deployed. + Otherwise, deploy the default mitigation. When retpoline mitigation + is enabled, RSB-stuffing via Call-Depth-Tracking also mitigates + ITS. + + force Force the ITS bug and deploy the default mitigation. + ======== =================================================================== + +Sysfs reporting +--------------- + +The sysfs file showing ITS mitigation status is: + + /sys/devices/system/cpu/vulnerabilities/indirect_target_selection + +Note, microcode mitigation status is not reported in this file. + +The possible values in this file are: + +.. list-table:: + + * - Not affected + - The processor is not vulnerable. + * - Vulnerable + - System is vulnerable and no mitigation has been applied. + * - Vulnerable, KVM: Not affected + - System is vulnerable to intra-mode BTI, but not affected by eIBRS + guest/host isolation. + * - Mitigation: Aligned branch/return thunks + - The mitigation is enabled, affected indirect branches and RETs are + relocated to safe thunks. + * - Mitigation: Retpolines, Stuffing RSB + - The mitigation is enabled using retpoline and RSB stuffing. + +References +---------- +.. [#f1] Microcode repository - https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files + +.. [#f2] Affected Processors list - https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html + +.. [#f3] Affected Processors list (machine readable) - https://github.com/intel/Intel-affected-processor-list diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index d9fd26b95b34..30532eb7bb19 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2202,6 +2202,23 @@ different crypto accelerators. This option can be used to achieve best performance for particular HW. + indirect_target_selection= [X86,Intel] Mitigation control for Indirect + Target Selection(ITS) bug in Intel CPUs. Updated + microcode is also required for a fix in IBPB. + + on: Enable mitigation (default). + off: Disable mitigation. + force: Force the ITS bug and deploy default + mitigation. + vmexit: Only deploy mitigation if CPU is affected by + guest/host isolation part of ITS. + stuff: Deploy RSB-fill mitigation when retpoline is + also deployed. Otherwise, deploy the default + mitigation. + + For details see: + Documentation/admin-guide/hw-vuln/indirect-target-selection.rst + init= [KNL] Format: <full_path> Run specified binary instead of /sbin/init as init @@ -3693,6 +3710,7 @@ expose users to several CPU vulnerabilities. Equivalent to: if nokaslr then kpti=0 [ARM64] gather_data_sampling=off [X86] + indirect_target_selection=off [X86] kvm.nx_huge_pages=off [X86] l1tf=off [X86] mds=off [X86] @@ -6250,7 +6268,7 @@ port and the regular usb controller gets disabled. root= [KNL] Root filesystem - Usually this a a block device specifier of some kind, + Usually this is a block device specifier of some kind, see the early_lookup_bdev comment in block/early-lookup.c for details. Alternatively this can be "ram" for the legacy initial diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 8290177b4f75..d385985b305f 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -75,6 +75,7 @@ Currently, these files are in /proc/sys/vm: - unprivileged_userfaultfd - user_reserve_kbytes - vfs_cache_pressure +- vfs_cache_pressure_denom - watermark_boost_factor - watermark_scale_factor - zone_reclaim_mode @@ -1017,19 +1018,28 @@ vfs_cache_pressure This percentage value controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects. -At the default value of vfs_cache_pressure=100 the kernel will attempt to -reclaim dentries and inodes at a "fair" rate with respect to pagecache and -swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer -to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will -never reclaim dentries and inodes due to memory pressure and this can easily -lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 -causes the kernel to prefer to reclaim dentries and inodes. +At the default value of vfs_cache_pressure=vfs_cache_pressure_denom the kernel +will attempt to reclaim dentries and inodes at a "fair" rate with respect to +pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the +kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, +the kernel will never reclaim dentries and inodes due to memory pressure and +this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure +beyond vfs_cache_pressure_denom causes the kernel to prefer to reclaim dentries +and inodes. -Increasing vfs_cache_pressure significantly beyond 100 may have negative -performance impact. Reclaim code needs to take various locks to find freeable -directory and inode objects. With vfs_cache_pressure=1000, it will look for -ten times more freeable objects than there are. +Increasing vfs_cache_pressure significantly beyond vfs_cache_pressure_denom may +have negative performance impact. Reclaim code needs to take various locks to +find freeable directory and inode objects. When vfs_cache_pressure equals +(10 * vfs_cache_pressure_denom), it will look for ten times more freeable +objects than there are. +Note: This setting should always be used together with vfs_cache_pressure_denom. + +vfs_cache_pressure_denom +======================== + +Defaults to 100 (minimum allowed value). Requires corresponding +vfs_cache_pressure setting to take effect. watermark_boost_factor ====================== diff --git a/Documentation/arch/openrisc/openrisc_port.rst b/Documentation/arch/openrisc/openrisc_port.rst index 1565b9546e38..a8f307a3b499 100644 --- a/Documentation/arch/openrisc/openrisc_port.rst +++ b/Documentation/arch/openrisc/openrisc_port.rst @@ -7,10 +7,10 @@ target architecture, specifically, is the 32-bit OpenRISC 1000 family (or1k). For information about OpenRISC processors and ongoing development: - ======= ============================= + ======= ============================== website https://openrisc.io - email openrisc@lists.librecores.org - ======= ============================= + email linux-openrisc@vger.kernel.org + ======= ============================== --------------------------------------------------------------------- @@ -27,11 +27,11 @@ Toolchain binaries can be obtained from openrisc.io or our github releases page. Instructions for building the different toolchains can be found on openrisc.io or Stafford's toolchain build and release scripts. - ========== ================================================= - binaries https://github.com/openrisc/or1k-gcc/releases + ========== ========================================================== + binaries https://github.com/stffrdhrn/or1k-toolchain-build/releases toolchains https://openrisc.io/software building https://github.com/stffrdhrn/or1k-toolchain-build - ========== ================================================= + ========== ========================================================== 2) Building diff --git a/Documentation/arch/riscv/hwprobe.rst b/Documentation/arch/riscv/hwprobe.rst index 53607d962653..f60bf5991755 100644 --- a/Documentation/arch/riscv/hwprobe.rst +++ b/Documentation/arch/riscv/hwprobe.rst @@ -51,7 +51,7 @@ The following keys are defined: * :c:macro:`RISCV_HWPROBE_KEY_MARCHID`: Contains the value of ``marchid``, as defined by the RISC-V privileged architecture specification. -* :c:macro:`RISCV_HWPROBE_KEY_MIMPLID`: Contains the value of ``mimplid``, as +* :c:macro:`RISCV_HWPROBE_KEY_MIMPID`: Contains the value of ``mimpid``, as defined by the RISC-V privileged architecture specification. * :c:macro:`RISCV_HWPROBE_KEY_BASE_BEHAVIOR`: A bitmask containing the base diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst index de27e1620821..0acb4c9b8d90 100644 --- a/Documentation/bpf/bpf_devel_QA.rst +++ b/Documentation/bpf/bpf_devel_QA.rst @@ -382,6 +382,14 @@ In case of new BPF instructions, once the changes have been accepted into the Linux kernel, please implement support into LLVM's BPF back end. See LLVM_ section below for further information. +Q: What "BPF_INTERNAL" symbol namespace is for? +----------------------------------------------- +A: Symbols exported as BPF_INTERNAL can only be used by BPF infrastructure +like preload kernel modules with light skeleton. Most symbols outside +of BPF_INTERNAL are not expected to be used by code outside of BPF either. +Symbols may lack the designation because they predate the namespaces, +or due to an oversight. + Stable submission ================= diff --git a/Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml b/Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml index 6ad78429dc74..c92341888a28 100644 --- a/Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml +++ b/Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml @@ -7,7 +7,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Ceva AHCI SATA Controller maintainers: - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> description: | diff --git a/Documentation/devicetree/bindings/display/bridge/nwl-dsi.yaml b/Documentation/devicetree/bindings/display/bridge/nwl-dsi.yaml index 350fb8f400f0..5952e6448ed4 100644 --- a/Documentation/devicetree/bindings/display/bridge/nwl-dsi.yaml +++ b/Documentation/devicetree/bindings/display/bridge/nwl-dsi.yaml @@ -111,11 +111,27 @@ properties: unevaluatedProperties: false port@1: - $ref: /schemas/graph.yaml#/properties/port + $ref: /schemas/graph.yaml#/$defs/port-base + unevaluatedProperties: false description: DSI output port node to the panel or the next bridge in the chain + properties: + endpoint: + $ref: /schemas/media/video-interfaces.yaml# + unevaluatedProperties: false + + properties: + data-lanes: + description: array of physical DSI data lane indexes. + minItems: 1 + items: + - const: 1 + - const: 2 + - const: 3 + - const: 4 + required: - port@0 - port@1 diff --git a/Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml b/Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml index bb93baa88879..e13e9d6dd148 100644 --- a/Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml +++ b/Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml @@ -12,7 +12,6 @@ description: PS_MODE). Every pin can be configured as input/output. maintainers: - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> properties: diff --git a/Documentation/devicetree/bindings/input/mediatek,mt6779-keypad.yaml b/Documentation/devicetree/bindings/input/mediatek,mt6779-keypad.yaml index 517a4ac1bea3..e365413732e7 100644 --- a/Documentation/devicetree/bindings/input/mediatek,mt6779-keypad.yaml +++ b/Documentation/devicetree/bindings/input/mediatek,mt6779-keypad.yaml @@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Mediatek's Keypad Controller maintainers: - - Mattijs Korpershoek <mkorpershoek@baylibre.com> + - Mattijs Korpershoek <mkorpershoek@kernel.org> allOf: - $ref: /schemas/input/matrix-keymap.yaml# diff --git a/Documentation/devicetree/bindings/interrupt-controller/fsl,irqsteer.yaml b/Documentation/devicetree/bindings/interrupt-controller/fsl,irqsteer.yaml index 6076ddf56bb5..c49688be1058 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/fsl,irqsteer.yaml +++ b/Documentation/devicetree/bindings/interrupt-controller/fsl,irqsteer.yaml @@ -19,6 +19,7 @@ properties: - fsl,imx8mp-irqsteer - fsl,imx8qm-irqsteer - fsl,imx8qxp-irqsteer + - fsl,imx94-irqsteer - const: fsl,imx-irqsteer reg: diff --git a/Documentation/devicetree/bindings/net/can/microchip,mcp2510.yaml b/Documentation/devicetree/bindings/net/can/microchip,mcp2510.yaml index e0ec53bc10c6..1525a50ded47 100644 --- a/Documentation/devicetree/bindings/net/can/microchip,mcp2510.yaml +++ b/Documentation/devicetree/bindings/net/can/microchip,mcp2510.yaml @@ -1,7 +1,7 @@ # SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) %YAML 1.2 --- -$id: http://devicetree.org/schemas/can/microchip,mcp2510.yaml# +$id: http://devicetree.org/schemas/net/can/microchip,mcp2510.yaml# $schema: http://devicetree.org/meta-schemas/core.yaml# title: Microchip MCP251X stand-alone CAN controller diff --git a/Documentation/devicetree/bindings/net/ethernet-controller.yaml b/Documentation/devicetree/bindings/net/ethernet-controller.yaml index 45819b235800..a2d4c626f659 100644 --- a/Documentation/devicetree/bindings/net/ethernet-controller.yaml +++ b/Documentation/devicetree/bindings/net/ethernet-controller.yaml @@ -74,19 +74,17 @@ properties: - rev-rmii - moca - # RX and TX delays are added by the MAC when required + # RX and TX delays are provided by the PCB. See below - rgmii - # RGMII with internal RX and TX delays provided by the PHY, - # the MAC should not add the RX or TX delays in this case + # RX and TX delays are not provided by the PCB. This is the most + # frequent case. See below - rgmii-id - # RGMII with internal RX delay provided by the PHY, the MAC - # should not add an RX delay in this case + # TX delay is provided by the PCB. See below - rgmii-rxid - # RGMII with internal TX delay provided by the PHY, the MAC - # should not add an TX delay in this case + # RX delay is provided by the PCB. See below - rgmii-txid - rtbi - smii @@ -286,4 +284,89 @@ allOf: additionalProperties: true +# Informative +# =========== +# +# 'phy-modes' & 'phy-connection-type' properties 'rgmii', 'rgmii-id', +# 'rgmii-rxid', and 'rgmii-txid' are frequently used wrongly by +# developers. This informative section clarifies their usage. +# +# The RGMII specification requires a 2ns delay between the data and +# clock signals on the RGMII bus. How this delay is implemented is not +# specified. +# +# One option is to make the clock traces on the PCB longer than the +# data traces. A sufficiently difference in length can provide the 2ns +# delay. If both the RX and TX delays are implemented in this manner, +# 'rgmii' should be used, so indicating the PCB adds the delays. +# +# If the PCB does not add these delays via extra long traces, +# 'rgmii-id' should be used. Here, 'id' refers to 'internal delay', +# where either the MAC or PHY adds the delay. +# +# If only one of the two delays are implemented via extra long clock +# lines, either 'rgmii-rxid' or 'rgmii-txid' should be used, +# indicating the MAC or PHY should implement one of the delays +# internally, while the PCB implements the other delay. +# +# Device Tree describes hardware, and in this case, it describes the +# PCB between the MAC and the PHY, if the PCB implements delays or +# not. +# +# In practice, very few PCBs make use of extra long clock lines. Hence +# any RGMII phy mode other than 'rgmii-id' is probably wrong, and is +# unlikely to be accepted during review without details provided in +# the commit description and comments in the .dts file. +# +# When the PCB does not implement the delays, the MAC or PHY must. As +# such, this is software configuration, and so not described in Device +# Tree. +# +# The following describes how Linux implements the configuration of +# the MAC and PHY to add these delays when the PCB does not. As stated +# above, developers often get this wrong, and the aim of this section +# is reduce the frequency of these errors by Linux developers. Other +# users of the Device Tree may implement it differently, and still be +# consistent with both the normative and informative description +# above. +# +# By default in Linux, when using phylib/phylink, the MAC is expected +# to read the 'phy-mode' from Device Tree, not implement any delays, +# and pass the value to the PHY. The PHY will then implement delays as +# specified by the 'phy-mode'. The PHY should always be reconfigured +# to implement the needed delays, replacing any setting performed by +# strapping or the bootloader, etc. +# +# Experience to date is that all PHYs which implement RGMII also +# implement the ability to add or not add the needed delays. Hence +# this default is expected to work in all cases. Ignoring this default +# is likely to be questioned by Reviews, and require a strong argument +# to be accepted. +# +# There are a small number of cases where the MAC has hard coded +# delays which cannot be disabled. The 'phy-mode' only describes the +# PCB. The inability to disable the delays in the MAC does not change +# the meaning of 'phy-mode'. It does however mean that a 'phy-mode' of +# 'rgmii' is now invalid, it cannot be supported, since both the PCB +# and the MAC and PHY adding delays cannot result in a functional +# link. Thus the MAC should report a fatal error for any modes which +# cannot be supported. When the MAC implements the delay, it must +# ensure that the PHY does not also implement the same delay. So it +# must modify the phy-mode it passes to the PHY, removing the delay it +# has added. Failure to remove the delay will result in a +# non-functioning link. +# +# Sometimes there is a need to fine tune the delays. Often the MAC or +# PHY can perform this fine tuning. In the MAC node, the Device Tree +# properties 'rx-internal-delay-ps' and 'tx-internal-delay-ps' should +# be used to indicate fine tuning performed by the MAC. The values +# expected here are small. A value of 2000ps, i.e 2ns, and a phy-mode +# of 'rgmii' will not be accepted by Reviewers. +# +# If the PHY is to perform fine tuning, the properties +# 'rx-internal-delay-ps' and 'tx-internal-delay-ps' in the PHY node +# should be used. When the PHY is implementing delays, e.g. 'rgmii-id' +# these properties should have a value near to 2000ps. If the PCB is +# implementing delays, e.g. 'rgmii', a small value can be used to fine +# tune the delay added by the PCB. ... diff --git a/Documentation/devicetree/bindings/nvmem/layouts/fixed-cell.yaml b/Documentation/devicetree/bindings/nvmem/layouts/fixed-cell.yaml index 8b3826243ddd..38e3ad50ff4f 100644 --- a/Documentation/devicetree/bindings/nvmem/layouts/fixed-cell.yaml +++ b/Documentation/devicetree/bindings/nvmem/layouts/fixed-cell.yaml @@ -27,7 +27,7 @@ properties: $ref: /schemas/types.yaml#/definitions/uint32-array items: - minimum: 0 - maximum: 7 + maximum: 31 description: Offset in bit within the address range specified by reg. - minimum: 1 diff --git a/Documentation/devicetree/bindings/nvmem/qcom,qfprom.yaml b/Documentation/devicetree/bindings/nvmem/qcom,qfprom.yaml index 39c209249c9c..3f6dc6a3a9f1 100644 --- a/Documentation/devicetree/bindings/nvmem/qcom,qfprom.yaml +++ b/Documentation/devicetree/bindings/nvmem/qcom,qfprom.yaml @@ -19,6 +19,7 @@ properties: - enum: - qcom,apq8064-qfprom - qcom,apq8084-qfprom + - qcom,ipq5018-qfprom - qcom,ipq5332-qfprom - qcom,ipq5424-qfprom - qcom,ipq6018-qfprom @@ -28,6 +29,8 @@ properties: - qcom,msm8226-qfprom - qcom,msm8916-qfprom - qcom,msm8917-qfprom + - qcom,msm8937-qfprom + - qcom,msm8960-qfprom - qcom,msm8974-qfprom - qcom,msm8976-qfprom - qcom,msm8996-qfprom @@ -51,6 +54,7 @@ properties: - qcom,sm8450-qfprom - qcom,sm8550-qfprom - qcom,sm8650-qfprom + - qcom,x1e80100-qfprom - const: qcom,qfprom reg: diff --git a/Documentation/devicetree/bindings/nvmem/rockchip,otp.yaml b/Documentation/devicetree/bindings/nvmem/rockchip,otp.yaml index a44d44b32809..dc89020b0950 100644 --- a/Documentation/devicetree/bindings/nvmem/rockchip,otp.yaml +++ b/Documentation/devicetree/bindings/nvmem/rockchip,otp.yaml @@ -14,6 +14,7 @@ properties: enum: - rockchip,px30-otp - rockchip,rk3308-otp + - rockchip,rk3576-otp - rockchip,rk3588-otp reg: @@ -62,6 +63,8 @@ allOf: properties: clocks: maxItems: 3 + clock-names: + maxItems: 3 resets: maxItems: 1 reset-names: @@ -73,11 +76,33 @@ allOf: compatible: contains: enum: + - rockchip,rk3576-otp + then: + properties: + clocks: + maxItems: 3 + clock-names: + maxItems: 3 + resets: + minItems: 2 + maxItems: 2 + reset-names: + items: + - const: otp + - const: apb + + - if: + properties: + compatible: + contains: + enum: - rockchip,rk3588-otp then: properties: clocks: minItems: 4 + clock-names: + minItems: 4 resets: minItems: 3 reset-names: diff --git a/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.yaml b/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.yaml index a4dfa09344dd..f85ee5d20ccb 100644 --- a/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.yaml +++ b/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.yaml @@ -9,15 +9,6 @@ title: Renesas R-Car Timer Pulse Unit PWM Controller maintainers: - Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com> -select: - properties: - compatible: - contains: - const: renesas,tpu - required: - - compatible - - '#pwm-cells' - properties: compatible: items: diff --git a/Documentation/devicetree/bindings/reset/xlnx,zynqmp-reset.yaml b/Documentation/devicetree/bindings/reset/xlnx,zynqmp-reset.yaml index 1f1b42dde94d..1db85fc9966f 100644 --- a/Documentation/devicetree/bindings/reset/xlnx,zynqmp-reset.yaml +++ b/Documentation/devicetree/bindings/reset/xlnx,zynqmp-reset.yaml @@ -7,7 +7,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Zynq UltraScale+ MPSoC and Versal reset maintainers: - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> description: | diff --git a/Documentation/devicetree/bindings/soc/fsl/fsl,ls1028a-reset.yaml b/Documentation/devicetree/bindings/soc/fsl/fsl,ls1028a-reset.yaml index 31295be91013..234089b5954d 100644 --- a/Documentation/devicetree/bindings/soc/fsl/fsl,ls1028a-reset.yaml +++ b/Documentation/devicetree/bindings/soc/fsl/fsl,ls1028a-reset.yaml @@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Freescale Layerscape Reset Registers Module maintainers: - - Frank Li + - Frank Li <Frank.Li@nxp.com> description: Reset Module includes chip reset, service processor control and Reset Control diff --git a/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.yaml b/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.yaml index bccd00a1ddd0..53d00ca643b3 100644 --- a/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.yaml +++ b/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.yaml @@ -56,19 +56,18 @@ properties: enum: - snps,dw-apb-ssi - snps,dwc-ssi-1.01a - - description: Microsemi Ocelot/Jaguar2 SoC SPI Controller - items: - - enum: - - mscc,ocelot-spi - - mscc,jaguar2-spi - - const: snps,dw-apb-ssi - description: Microchip Sparx5 SoC SPI Controller const: microchip,sparx5-spi - description: Amazon Alpine SPI Controller const: amazon,alpine-dw-apb-ssi - - description: Renesas RZ/N1 SPI Controller + - description: Vendor controllers which use snps,dw-apb-ssi as fallback items: - - const: renesas,rzn1-spi + - enum: + - mscc,ocelot-spi + - mscc,jaguar2-spi + - renesas,rzn1-spi + - sophgo,sg2042-spi + - thead,th1520-spi - const: snps,dw-apb-ssi - description: Intel Keem Bay SPI Controller const: intel,keembay-ssi @@ -88,10 +87,6 @@ properties: - renesas,r9a06g032-spi # RZ/N1D - renesas,r9a06g033-spi # RZ/N1S - const: renesas,rzn1-spi # RZ/N1 - - description: T-HEAD TH1520 SoC SPI Controller - items: - - const: thead,th1520-spi - - const: snps,dw-apb-ssi reg: minItems: 1 diff --git a/Documentation/devicetree/bindings/timer/nxp,sysctr-timer.yaml b/Documentation/devicetree/bindings/timer/nxp,sysctr-timer.yaml index 891cca009528..6b80b060672e 100644 --- a/Documentation/devicetree/bindings/timer/nxp,sysctr-timer.yaml +++ b/Documentation/devicetree/bindings/timer/nxp,sysctr-timer.yaml @@ -18,9 +18,14 @@ description: | properties: compatible: - enum: - - nxp,imx95-sysctr-timer - - nxp,sysctr-timer + oneOf: + - enum: + - nxp,imx95-sysctr-timer + - nxp,sysctr-timer + - items: + - enum: + - nxp,imx94-sysctr-timer + - const: nxp,imx95-sysctr-timer reg: maxItems: 1 diff --git a/Documentation/devicetree/bindings/timer/renesas,tpu.yaml b/Documentation/devicetree/bindings/timer/renesas,tpu.yaml deleted file mode 100644 index 7a473b302775..000000000000 --- a/Documentation/devicetree/bindings/timer/renesas,tpu.yaml +++ /dev/null @@ -1,56 +0,0 @@ -# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) -%YAML 1.2 ---- -$id: http://devicetree.org/schemas/timer/renesas,tpu.yaml# -$schema: http://devicetree.org/meta-schemas/core.yaml# - -title: Renesas H8/300 Timer Pulse Unit - -maintainers: - - Yoshinori Sato <ysato@users.sourceforge.jp> - -description: - The TPU is a 16bit timer/counter with configurable clock inputs and - programmable compare match. - This implementation supports only cascade mode. - -select: - properties: - compatible: - contains: - const: renesas,tpu - '#pwm-cells': false - required: - - compatible - -properties: - compatible: - const: renesas,tpu - - reg: - items: - - description: First channel - - description: Second channel - - clocks: - maxItems: 1 - - clock-names: - const: fck - -required: - - compatible - - reg - - clocks - - clock-names - -additionalProperties: false - -examples: - - | - tpu: tpu@ffffe0 { - compatible = "renesas,tpu"; - reg = <0xffffe0 16>, <0xfffff0 12>; - clocks = <&pclk>; - clock-names = "fck"; - }; diff --git a/Documentation/devicetree/bindings/usb/dwc3-xilinx.yaml b/Documentation/devicetree/bindings/usb/dwc3-xilinx.yaml index b5843f4d17d8..379dacacb526 100644 --- a/Documentation/devicetree/bindings/usb/dwc3-xilinx.yaml +++ b/Documentation/devicetree/bindings/usb/dwc3-xilinx.yaml @@ -7,7 +7,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Xilinx SuperSpeed DWC3 USB SoC controller maintainers: - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> properties: diff --git a/Documentation/devicetree/bindings/usb/microchip,usb5744.yaml b/Documentation/devicetree/bindings/usb/microchip,usb5744.yaml index e2a72deae776..c68c04da3399 100644 --- a/Documentation/devicetree/bindings/usb/microchip,usb5744.yaml +++ b/Documentation/devicetree/bindings/usb/microchip,usb5744.yaml @@ -17,7 +17,6 @@ description: maintainers: - Michal Simek <michal.simek@amd.com> - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> properties: diff --git a/Documentation/devicetree/bindings/usb/xlnx,usb2.yaml b/Documentation/devicetree/bindings/usb/xlnx,usb2.yaml index a7f75fe36665..f295aa9d9ee7 100644 --- a/Documentation/devicetree/bindings/usb/xlnx,usb2.yaml +++ b/Documentation/devicetree/bindings/usb/xlnx,usb2.yaml @@ -7,7 +7,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Xilinx udc controller maintainers: - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> properties: diff --git a/Documentation/driver-api/early-userspace/buffer-format.rst b/Documentation/driver-api/early-userspace/buffer-format.rst index 7f74e301fdf3..726bfa2fe70d 100644 --- a/Documentation/driver-api/early-userspace/buffer-format.rst +++ b/Documentation/driver-api/early-userspace/buffer-format.rst @@ -4,20 +4,18 @@ initramfs buffer format Al Viro, H. Peter Anvin -Last revision: 2002-01-13 - -Starting with kernel 2.5.x, the old "initial ramdisk" protocol is -getting {replaced/complemented} with the new "initial ramfs" -(initramfs) protocol. The initramfs contents is passed using the same -memory buffer protocol used by the initrd protocol, but the contents +With kernel 2.5.x, the old "initial ramdisk" protocol was complemented +with an "initial ramfs" protocol. The initramfs content is passed +using the same memory buffer protocol used by initrd, but the content is different. The initramfs buffer contains an archive which is -expanded into a ramfs filesystem; this document details the format of -the initramfs buffer format. +expanded into a ramfs filesystem; this document details the initramfs +buffer format. The initramfs buffer format is based around the "newc" or "crc" CPIO formats, and can be created with the cpio(1) utility. The cpio -archive can be compressed using gzip(1). One valid version of an -initramfs buffer is thus a single .cpio.gz file. +archive can be compressed using gzip(1), or any other algorithm provided +via CONFIG_DECOMPRESS_*. One valid version of an initramfs buffer is +thus a single .cpio.gz file. The full format of the initramfs buffer is defined by the following grammar, where:: @@ -25,12 +23,20 @@ grammar, where:: * is used to indicate "0 or more occurrences of" (|) indicates alternatives + indicates concatenation - GZIP() indicates the gzip(1) of the operand + GZIP() indicates gzip compression of the operand + BZIP2() indicates bzip2 compression of the operand + LZMA() indicates lzma compression of the operand + XZ() indicates xz compression of the operand + LZO() indicates lzo compression of the operand + LZ4() indicates lz4 compression of the operand + ZSTD() indicates zstd compression of the operand ALGN(n) means padding with null bytes to an n-byte boundary - initramfs := ("\0" | cpio_archive | cpio_gzip_archive)* + initramfs := ("\0" | cpio_archive | cpio_compressed_archive)* - cpio_gzip_archive := GZIP(cpio_archive) + cpio_compressed_archive := (GZIP(cpio_archive) | BZIP2(cpio_archive) + | LZMA(cpio_archive) | XZ(cpio_archive) | LZO(cpio_archive) + | LZ4(cpio_archive) | ZSTD(cpio_archive)) cpio_archive := cpio_file* + (<nothing> | cpio_trailer) @@ -75,6 +81,8 @@ c_chksum 8 bytes Checksum of data field if c_magic is 070702; The c_mode field matches the contents of st_mode returned by stat(2) on Linux, and encodes the file type and file permissions. +c_mtime is ignored unless CONFIG_INITRAMFS_PRESERVE_MTIME=y is set. + The c_filesize should be zero for any file which is not a regular file or symlink. diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst index ba5de97d155f..871a38f557e8 100644 --- a/Documentation/filesystems/bcachefs/casefolding.rst +++ b/Documentation/filesystems/bcachefs/casefolding.rst @@ -88,3 +88,21 @@ This would fail if negative dentry's were cached. This is slightly suboptimal, but could be fixed in future with some vfs work. + +References +---------- + +(from Peter Anvin, on the list) + +It is worth noting that Microsoft has basically declared their +"recommended" case folding (upcase) table to be permanently frozen (for +new filesystem instances in the case where they use an on-disk +translation table created at format time.) As far as I know they have +never supported anything other than 1:1 conversion of BMP code points, +nor normalization. + +The exFAT specification enumerates the full recommended upcase table, +although in a somewhat annoying format (basically a hex dump of +compressed data): + +https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification diff --git a/Documentation/filesystems/bcachefs/future/idle_work.rst b/Documentation/filesystems/bcachefs/future/idle_work.rst new file mode 100644 index 000000000000..59a332509dcd --- /dev/null +++ b/Documentation/filesystems/bcachefs/future/idle_work.rst @@ -0,0 +1,78 @@ +Idle/background work classes design doc: + +Right now, our behaviour at idle isn't ideal, it was designed for servers that +would be under sustained load, to keep pending work at a "medium" level, to +let work build up so we can process it in more efficient batches, while also +giving headroom for bursts in load. + +But for desktops or mobile - scenarios where work is less sustained and power +usage is more important - we want to operate differently, with a "rush to +idle" so the system can go to sleep. We don't want to be dribbling out +background work while the system should be idle. + +The complicating factor is that there are a number of background tasks, which +form a heirarchy (or a digraph, depending on how you divide it up) - one +background task may generate work for another. + +Thus proper idle detection needs to model this heirarchy. + +- Foreground writes +- Page cache writeback +- Copygc, rebalance +- Journal reclaim + +When we implement idle detection and rush to idle, we need to be careful not +to disturb too much the existing behaviour that works reasonably well when the +system is under sustained load (or perhaps improve it in the case of +rebalance, which currently does not actively attempt to let work batch up). + +SUSTAINED LOAD REGIME +--------------------- + +When the system is under continuous load, we want these jobs to run +continuously - this is perhaps best modelled with a P/D controller, where +they'll be trying to keep a target value (i.e. fragmented disk space, +available journal space) roughly in the middle of some range. + +The goal under sustained load is to balance our ability to handle load spikes +without running out of x resource (free disk space, free space in the +journal), while also letting some work accumululate to be batched (or become +unnecessary). + +For example, we don't want to run copygc too aggressively, because then it +will be evacuating buckets that would have become empty (been overwritten or +deleted) anyways, and we don't want to wait until we're almost out of free +space because then the system will behave unpredicably - suddenly we're doing +a lot more work to service each write and the system becomes much slower. + +IDLE REGIME +----------- + +When the system becomes idle, we should start flushing our pending work +quicker so the system can go to sleep. + +Note that the definition of "idle" depends on where in the heirarchy a task +is - a task should start flushing work more quickly when the task above it has +stopped generating new work. + +e.g. rebalance should start flushing more quickly when page cache writeback is +idle, and journal reclaim should only start flushing more quickly when both +copygc and rebalance are idle. + +It's important to let work accumulate when more work is still incoming and we +still have room, because flushing is always more efficient if we let it batch +up. New writes may overwrite data before rebalance moves it, and tasks may be +generating more updates for the btree nodes that journal reclaim needs to flush. + +On idle, how much work we do at each interval should be proportional to the +length of time we have been idle for. If we're idle only for a short duration, +we shouldn't flush everything right away; the system might wake up and start +generating new work soon, and flushing immediately might end up doing a lot of +work that would have been unnecessary if we'd allowed things to batch more. + +To summarize, we will need: + + - A list of classes for background tasks that generate work, which will + include one "foreground" class. + - Tracking for each class - "Am I doing work, or have I gone to sleep?" + - And each class should check the class above it when deciding how much work to issue. diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst index 3864d0ae89c1..e5c4c2120b93 100644 --- a/Documentation/filesystems/bcachefs/index.rst +++ b/Documentation/filesystems/bcachefs/index.rst @@ -29,3 +29,10 @@ At this moment, only a few of these are described here. casefolding errorcodes + +Future design +------------- +.. toctree:: + :maxdepth: 1 + + future/idle_work diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst index c293f8e37468..7ddb235aee9d 100644 --- a/Documentation/filesystems/erofs.rst +++ b/Documentation/filesystems/erofs.rst @@ -128,6 +128,7 @@ device=%s Specify a path to an extra device to be used together. fsid=%s Specify a filesystem image ID for Fscache back-end. domain_id=%s Specify a domain ID in fscache mode so that different images with the same blobs under a given domain ID can share storage. +fsoffset=%llu Specify block-aligned filesystem offset for the primary device. =================== ========================================================= Sysfs Entries diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst index e80329908549..3d22e2db732d 100644 --- a/Documentation/filesystems/fscrypt.rst +++ b/Documentation/filesystems/fscrypt.rst @@ -1409,7 +1409,7 @@ read the ciphertext into the page cache and decrypt it in-place. The folio lock must be held until decryption has finished, to prevent the folio from becoming visible to userspace prematurely. -For the write path (->writepage()) of regular files, filesystems +For the write path (->writepages()) of regular files, filesystems cannot encrypt data in-place in the page cache, since the cached plaintext must be preserved. Instead, filesystems must encrypt into a temporary buffer or "bounce page", then write out the temporary diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst index e29651a42eec..f2df9b6df988 100644 --- a/Documentation/filesystems/iomap/design.rst +++ b/Documentation/filesystems/iomap/design.rst @@ -243,13 +243,25 @@ The fields are as follows: regular file data. This is only useful for FIEMAP. - * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can - be set by the filesystem for its own purposes. + * **IOMAP_F_BOUNDARY**: This indicates I/O and its completion must not be + merged with any other I/O or completion. Filesystems must use this when + submitting I/O to devices that cannot handle I/O crossing certain LBAs + (e.g. ZNS devices). This flag applies only to buffered I/O writeback; all + other functions ignore it. + + * **IOMAP_F_PRIVATE**: This flag is reserved for filesystem private use. * **IOMAP_F_ANON_WRITE**: Indicates that (write) I/O does not have a target block assigned to it yet and the file system will do that in the bio submission handler, splitting the I/O as needed. + * **IOMAP_F_ATOMIC_BIO**: This indicates write I/O must be submitted with the + ``REQ_ATOMIC`` flag set in the bio. Filesystems need to set this flag to + inform iomap that the write I/O operation requires torn-write protection + based on HW-offload mechanism. They must also ensure that mapping updates + upon the completion of the I/O must be performed in a single metadata + update. + These flags can be set by iomap itself during file operations. The filesystem should supply an ``->iomap_end`` function if it needs to observe these flags: diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 0ec0bb6eb0fb..2e567e341c3b 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -249,7 +249,6 @@ address_space_operations ======================== prototypes:: - int (*writepage)(struct page *page, struct writeback_control *wbc); int (*read_folio)(struct file *, struct folio *); int (*writepages)(struct address_space *, struct writeback_control *); bool (*dirty_folio)(struct address_space *, struct folio *folio); @@ -280,7 +279,6 @@ locking rules: ====================== ======================== ========= =============== ops folio locked i_rwsem invalidate_lock ====================== ======================== ========= =============== -writepage: yes, unlocks (see below) read_folio: yes, unlocks shared writepages: dirty_folio: maybe @@ -309,54 +307,6 @@ completion. ->readahead() unlocks the folios that I/O is attempted on like ->read_folio(). -->writepage() is used for two purposes: for "memory cleansing" and for -"sync". These are quite different operations and the behaviour may differ -depending upon the mode. - -If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then -it *must* start I/O against the page, even if that would involve -blocking on in-progress I/O. - -If writepage is called for memory cleansing (sync_mode == -WBC_SYNC_NONE) then its role is to get as much writeout underway as -possible. So writepage should try to avoid blocking against -currently-in-progress I/O. - -If the filesystem is not called for "sync" and it determines that it -would need to block against in-progress I/O to be able to start new I/O -against the page the filesystem should redirty the page with -redirty_page_for_writepage(), then unlock the page and return zero. -This may also be done to avoid internal deadlocks, but rarely. - -If the filesystem is called for sync then it must wait on any -in-progress I/O and then start new I/O. - -The filesystem should unlock the page synchronously, before returning to the -caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE -value. WRITEPAGE_ACTIVATE means that page cannot really be written out -currently, and VM should stop calling ->writepage() on this page for some -time. VM does this by moving page to the head of the active list, hence the -name. - -Unless the filesystem is going to redirty_page_for_writepage(), unlock the page -and return zero, writepage *must* run set_page_writeback() against the page, -followed by unlocking it. Once set_page_writeback() has been run against the -page, write I/O can be submitted and the write I/O completion handler must run -end_page_writeback() once the I/O is complete. If no I/O is submitted, the -filesystem must run end_page_writeback() against the page before returning from -writepage. - -That is: after 2.5.12, pages which are under writeout are *not* locked. Note, -if the filesystem needs the page to be locked during writeout, that is ok, too, -the page is allowed to be unlocked at any point in time between the calls to -set_page_writeback() and end_page_writeback(). - -Note, failure to run either redirty_page_for_writepage() or the combination of -set_page_writeback()/end_page_writeback() on a page submitted to writepage -will leave the page itself marked clean but it will be tagged as dirty in the -radix tree. This incoherency can lead to all sorts of hard-to-debug problems -in the filesystem like having dirty inodes at umount and losing written data. - ->writepages() is used for periodic writeback and for syscall-initiated sync operations. The address_space should start I/O against at least ``*nr_to_write`` pages. ``*nr_to_write`` must be decremented for each page @@ -364,8 +314,8 @@ which is written. The address_space implementation may write more (or less) pages than ``*nr_to_write`` asks for, but it should try to be reasonably close. If nr_to_write is NULL, all dirty pages must be written. -writepages should _only_ write pages which are present on -mapping->io_pages. +writepages should _only_ write pages which are present in +mapping->i_pages. ->dirty_folio() is called from various places in the kernel when the target folio is marked as needing writeback. The folio cannot be diff --git a/Documentation/filesystems/mount_api.rst b/Documentation/filesystems/mount_api.rst index d92c276f1575..e149b89118c8 100644 --- a/Documentation/filesystems/mount_api.rst +++ b/Documentation/filesystems/mount_api.rst @@ -671,7 +671,6 @@ The members are as follows: fsparam_bool() fs_param_is_bool fsparam_u32() fs_param_is_u32 fsparam_u32oct() fs_param_is_u32_octal - fsparam_u32hex() fs_param_is_u32_hex fsparam_s32() fs_param_is_s32 fsparam_u64() fs_param_is_u64 fsparam_enum() fs_param_is_enum @@ -755,21 +754,6 @@ process the parameters it is given. * :: - bool validate_constant_table(const struct constant_table *tbl, - size_t tbl_size, - int low, int high, int special); - - Validate a constant table. Checks that all the elements are appropriately - ordered, that there are no duplicates and that the values are between low - and high inclusive, though provision is made for one allowable special - value outside of that range. If no special value is required, special - should just be set to lie inside the low-to-high range. - - If all is good, true is returned. If the table is invalid, errors are - logged to the kernel log buffer and false is returned. - - * :: - bool fs_validate_description(const char *name, const struct fs_parameter_description *desc); diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst index 3886c14f89f4..939b4b624fad 100644 --- a/Documentation/filesystems/netfs_library.rst +++ b/Documentation/filesystems/netfs_library.rst @@ -1,33 +1,187 @@ .. SPDX-License-Identifier: GPL-2.0 -================================= -Network Filesystem Helper Library -================================= +=================================== +Network Filesystem Services Library +=================================== .. Contents: - Overview. + - Requests and streams. + - Subrequests. + - Result collection and retry. + - Local caching. + - Content encryption (fscrypt). - Per-inode context. - Inode context helper functions. - - Buffered read helpers. - - Read helper functions. - - Read helper structures. - - Read helper operations. - - Read helper procedure. - - Read helper cache API. + - Inode locking. + - Inode writeback. + - High-level VFS API. + - Unlocked read/write iter. + - Pre-locked read/write iter. + - Monolithic files API. + - Memory-mapped I/O API. + - High-level VM API. + - Deprecated PG_private2 API. + - I/O request API. + - Request structure. + - Stream structure. + - Subrequest structure. + - Filesystem methods. + - Terminating a subrequest. + - Local cache API. + - API function reference. Overview ======== -The network filesystem helper library is a set of functions designed to aid a -network filesystem in implementing VM/VFS operations. For the moment, that -just includes turning various VM buffered read operations into requests to read -from the server. The helper library, however, can also interpose other -services, such as local caching or local data encryption. +The network filesystem services library, netfslib, is a set of functions +designed to aid a network filesystem in implementing VM/VFS API operations. It +takes over the normal buffered read, readahead, write and writeback and also +handles unbuffered and direct I/O. -Note that the library module doesn't link against local caching directly, so -access must be provided by the netfs. +The library provides support for (re-)negotiation of I/O sizes and retrying +failed I/O as well as local caching and will, in the future, provide content +encryption. + +It insulates the filesystem from VM interface changes as much as possible and +handles VM features such as large multipage folios. The filesystem basically +just has to provide a way to perform read and write RPC calls. + +The way I/O is organised inside netfslib consists of a number of objects: + + * A *request*. A request is used to track the progress of the I/O overall and + to hold on to resources. The collection of results is done at the request + level. The I/O within a request is divided into a number of parallel + streams of subrequests. + + * A *stream*. A non-overlapping series of subrequests. The subrequests + within a stream do not have to be contiguous. + + * A *subrequest*. This is the basic unit of I/O. It represents a single RPC + call or a single cache I/O operation. The library passes these to the + filesystem and the cache to perform. + +Requests and Streams +-------------------- + +When actually performing I/O (as opposed to just copying into the pagecache), +netfslib will create one or more requests to track the progress of the I/O and +to hold resources. + +A read operation will have a single stream and the subrequests within that +stream may be of mixed origins, for instance mixing RPC subrequests and cache +subrequests. + +On the other hand, a write operation may have multiple streams, where each +stream targets a different destination. For instance, there may be one stream +writing to the local cache and one to the server. Currently, only two streams +are allowed, but this could be increased if parallel writes to multiple servers +is desired. + +The subrequests within a write stream do not need to match alignment or size +with the subrequests in another write stream and netfslib performs the tiling +of subrequests in each stream over the source buffer independently. Further, +each stream may contain holes that don't correspond to holes in the other +stream. + +In addition, the subrequests do not need to correspond to the boundaries of the +folios or vectors in the source/destination buffer. The library handles the +collection of results and the wrangling of folio flags and references. + +Subrequests +----------- + +Subrequests are at the heart of the interaction between netfslib and the +filesystem using it. Each subrequest is expected to correspond to a single +read or write RPC or cache operation. The library will stitch together the +results from a set of subrequests to provide a higher level operation. + +Netfslib has two interactions with the filesystem or the cache when setting up +a subrequest. First, there's an optional preparatory step that allows the +filesystem to negotiate the limits on the subrequest, both in terms of maximum +number of bytes and maximum number of vectors (e.g. for RDMA). This may +involve negotiating with the server (e.g. cifs needing to acquire credits). + +And, secondly, there's the issuing step in which the subrequest is handed off +to the filesystem to perform. + +Note that these two steps are done slightly differently between read and write: + + * For reads, the VM/VFS tells us how much is being requested up front, so the + library can preset maximum values that the cache and then the filesystem can + then reduce. The cache also gets consulted first on whether it wants to do + a read before the filesystem is consulted. + + * For writeback, it is unknown how much there will be to write until the + pagecache is walked, so no limit is set by the library. + +Once a subrequest is completed, the filesystem or cache informs the library of +the completion and then collection is invoked. Depending on whether the +request is synchronous or asynchronous, the collection of results will be done +in either the application thread or in a work queue. + +Result Collection and Retry +--------------------------- + +As subrequests complete, the results are collected and collated by the library +and folio unlocking is performed progressively (if appropriate). Once the +request is complete, async completion will be invoked (again, if appropriate). +It is possible for the filesystem to provide interim progress reports to the +library to cause folio unlocking to happen earlier if possible. + +If any subrequests fail, netfslib can retry them. It will wait until all +subrequests are completed, offer the filesystem the opportunity to fiddle with +the resources/state held by the request and poke at the subrequests before +re-preparing and re-issuing the subrequests. + +This allows the tiling of contiguous sets of failed subrequest within a stream +to be changed, adding more subrequests or ditching excess as necessary (for +instance, if the network sizes change or the server decides it wants smaller +chunks). + +Further, if one or more contiguous cache-read subrequests fail, the library +will pass them to the filesystem to perform instead, renegotiating and retiling +them as necessary to fit with the filesystem's parameters rather than those of +the cache. + +Local Caching +------------- + +One of the services netfslib provides, via ``fscache``, is the option to cache +on local disk a copy of the data obtained from/written to a network filesystem. +The library will manage the storing, retrieval and some invalidation of data +automatically on behalf of the filesystem if a cookie is attached to the +``netfs_inode``. + +Note that local caching used to use the PG_private_2 (aliased as PG_fscache) to +keep track of a page that was being written to the cache, but this is now +deprecated as PG_private_2 will be removed. + +Instead, folios that are read from the server for which there was no data in +the cache will be marked as dirty and will have ``folio->private`` set to a +special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left to writeback to write. +If the folio is modified before that happened, the special value will be +cleared and the write will become normally dirty. + +When writeback occurs, folios that are so marked will only be written to the +cache and not to the server. Writeback handles mixed cache-only writes and +server-and-cache writes by using two streams, sending one to the cache and one +to the server. The server stream will have gaps in it corresponding to those +folios. + +Content Encryption (fscrypt) +---------------------------- + +Though it does not do so yet, at some point netfslib will acquire the ability +to do client-side content encryption on behalf of the network filesystem (Ceph, +for example). fscrypt can be used for this if appropriate (it may not be - +cifs, for example). + +The data will be stored encrypted in the local cache using the same manner of +encryption as the data written to the server and the library will impose bounce +buffering and RMW cycles as necessary. Per-Inode Context @@ -40,10 +194,13 @@ structure is defined:: struct netfs_inode { struct inode inode; const struct netfs_request_ops *ops; - struct fscache_cookie *cache; + struct fscache_cookie * cache; + loff_t remote_i_size; + unsigned long flags; + ... }; -A network filesystem that wants to use netfs lib must place one of these in its +A network filesystem that wants to use netfslib must place one of these in its inode wrapper struct instead of the VFS ``struct inode``. This can be done in a way similar to the following:: @@ -56,7 +213,8 @@ This allows netfslib to find its state by using ``container_of()`` from the inode pointer, thereby allowing the netfslib helper functions to be pointed to directly by the VFS/VM operation tables. -The structure contains the following fields: +The structure contains the following fields that are of interest to the +filesystem: * ``inode`` @@ -71,6 +229,37 @@ The structure contains the following fields: Local caching cookie, or NULL if no caching is enabled. This field does not exist if fscache is disabled. + * ``remote_i_size`` + + The size of the file on the server. This differs from inode->i_size if + local modifications have been made but not yet written back. + + * ``flags`` + + A set of flags, some of which the filesystem might be interested in: + + * ``NETFS_ICTX_MODIFIED_ATTR`` + + Set if netfslib modifies mtime/ctime. The filesystem is free to ignore + this or clear it. + + * ``NETFS_ICTX_UNBUFFERED`` + + Do unbuffered I/O upon the file. Like direct I/O but without the + alignment limitations. RMW will be performed if necessary. The pagecache + will not be used unless mmap() is also used. + + * ``NETFS_ICTX_WRITETHROUGH`` + + Do writethrough caching upon the file. I/O will be set up and dispatched + as buffered writes are made to the page cache. mmap() does the normal + writeback thing. + + * ``NETFS_ICTX_SINGLE_NO_UPLOAD`` + + Set if the file has a monolithic content that must be read entirely in a + single go and must not be written back to the server, though it can be + cached (e.g. AFS directories). Inode Context Helper Functions ------------------------------ @@ -84,117 +273,250 @@ set the operations table pointer:: then a function to cast from the VFS inode structure to the netfs context:: - struct netfs_inode *netfs_node(struct inode *inode); + struct netfs_inode *netfs_inode(struct inode *inode); and finally, a function to get the cache cookie pointer from the context attached to an inode (or NULL if fscache is disabled):: struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx); +Inode Locking +------------- + +A number of functions are provided to manage the locking of i_rwsem for I/O and +to effectively extend it to provide more separate classes of exclusion:: + + int netfs_start_io_read(struct inode *inode); + void netfs_end_io_read(struct inode *inode); + int netfs_start_io_write(struct inode *inode); + void netfs_end_io_write(struct inode *inode); + int netfs_start_io_direct(struct inode *inode); + void netfs_end_io_direct(struct inode *inode); + +The exclusion breaks down into four separate classes: + + 1) Buffered reads and writes. + + Buffered reads can run concurrently each other and with buffered writes, + but buffered writes cannot run concurrently with each other. + + 2) Direct reads and writes. + + Direct (and unbuffered) reads and writes can run concurrently since they do + not share local buffering (i.e. the pagecache) and, in a network + filesystem, are expected to have exclusion managed on the server (though + this may not be the case for, say, Ceph). + + 3) Other major inode modifying operations (e.g. truncate, fallocate). + + These should just access i_rwsem directly. + + 4) mmap(). + + mmap'd accesses might operate concurrently with any of the other classes. + They might form the buffer for an intra-file loopback DIO read/write. They + might be permitted on unbuffered files. + +Inode Writeback +--------------- + +Netfslib will pin resources on an inode for future writeback (such as pinning +use of an fscache cookie) when an inode is dirtied. However, this pinning +needs careful management. To manage the pinning, the following sequence +occurs: + + 1) An inode state flag ``I_PINNING_NETFS_WB`` is set by netfslib when the + pinning begins (when a folio is dirtied, for example) if the cache is + active to stop the cache structures from being discarded and the cache + space from being culled. This also prevents re-getting of cache resources + if the flag is already set. + + 2) This flag then cleared inside the inode lock during inode writeback in the + VM - and the fact that it was set is transferred to ``->unpinned_netfs_wb`` + in ``struct writeback_control``. + + 3) If ``->unpinned_netfs_wb`` is now set, the write_inode procedure is forced. + + 4) The filesystem's ``->write_inode()`` function is invoked to do the cleanup. + + 5) The filesystem invokes netfs to do its cleanup. + +To do the cleanup, netfslib provides a function to do the resource unpinning:: + + int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc); + +If the filesystem doesn't need to do anything else, this may be set as a its +``.write_inode`` method. + +Further, if an inode is deleted, the filesystem's write_inode method may not +get called, so:: + + void netfs_clear_inode_writeback(struct inode *inode, const void *aux); -Buffered Read Helpers -===================== +must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called. -The library provides a set of read helpers that handle the ->read_folio(), -->readahead() and much of the ->write_begin() VM operations and translate them -into a common call framework. -The following services are provided: +High-Level VFS API +================== - * Handle folios that span multiple pages. +Netfslib provides a number of sets of API calls for the filesystem to delegate +VFS operations to. Netfslib, in turn, will call out to the filesystem and the +cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene +at various times. - * Insulate the netfs from VM interface changes. +Unlocked Read/Write Iter +------------------------ - * Allow the netfs to arbitrarily split reads up into pieces, even ones that - don't match folio sizes or folio alignments and that may cross folios. +The first API set is for the delegation of operations to netfslib when the +filesystem is called through the standard VFS read/write_iter methods:: - * Allow the netfs to expand a readahead request in both directions to meet its - needs. + ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter); + ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from); + ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter); + ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter); + ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from); - * Allow the netfs to partially fulfil a read, which will then be resubmitted. +They can be assigned directly to ``.read_iter`` and ``.write_iter``. They +perform the inode locking themselves and the first two will switch between +buffered I/O and DIO as appropriate. - * Handle local caching, allowing cached data and server-read data to be - interleaved for a single request. +Pre-Locked Read/Write Iter +-------------------------- - * Handle clearing of bufferage that isn't on the server. +The second API set is for the delegation of operations to netfslib when the +filesystem is called through the standard VFS methods, but needs to do some +other stuff before or after calling netfslib whilst still inside locked section +(e.g. Ceph negotiating caps). The unbuffered read function is:: - * Handle retrying of reads that failed, switching reads from the cache to the - server as necessary. + ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter); - * In the future, this is a place that other services can be performed, such as - local encryption of data to be stored remotely or in the cache. +This must not be assigned directly to ``.read_iter`` and the filesystem is +responsible for performing the inode locking before calling it. In the case of +buffered read, the filesystem should use ``filemap_read()``. -From the network filesystem, the helpers require a table of operations. This -includes a mandatory method to issue a read operation along with a number of -optional methods. +There are three functions for writes:: + ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from, + struct netfs_group *netfs_group); + ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter, + struct netfs_group *netfs_group); + ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter, + struct netfs_group *netfs_group); -Read Helper Functions +These must not be assigned directly to ``.write_iter`` and the filesystem is +responsible for performing the inode locking before calling them. + +The first two functions are for buffered writes; the first just adds some +standard write checks and jumps to the second, but if the filesystem wants to +do the checks itself, it can use the second directly. The third function is +for unbuffered or DIO writes. + +On all three write functions, there is a writeback group pointer (which should +be NULL if the filesystem doesn't use this). Writeback groups are set on +folios when they're modified. If a folio to-be-modified is already marked with +a different group, it is flushed first. The writeback API allows writing back +of a specific group. + +Memory-Mapped I/O API --------------------- -Three read helpers are provided:: +An API for support of mmap()'d I/O is provided:: + + vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group); + +This allows the filesystem to delegate ``.page_mkwrite`` to netfslib. The +filesystem should not take the inode lock before calling it, but, as with the +locked write functions above, this does take a writeback group pointer. If the +page to be made writable is in a different group, it will be flushed first. + +Monolithic Files API +-------------------- + +There is also a special API set for files for which the content must be read in +a single RPC (and not written back) and is maintained as a monolithic blob +(e.g. an AFS directory), though it can be stored and updated in the local cache:: + + ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter); + void netfs_single_mark_inode_dirty(struct inode *inode); + int netfs_writeback_single(struct address_space *mapping, + struct writeback_control *wbc, + struct iov_iter *iter); + +The first function reads from a file into the given buffer, reading from the +cache in preference if the data is cached there; the second function allows the +inode to be marked dirty, causing a later writeback; and the third function can +be called from the writeback code to write the data to the cache, if there is +one. - void netfs_readahead(struct readahead_control *ractl); - int netfs_read_folio(struct file *file, - struct folio *folio); - int netfs_write_begin(struct netfs_inode *ctx, - struct file *file, - struct address_space *mapping, - loff_t pos, - unsigned int len, - struct folio **_folio, - void **_fsdata); +The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to be +used. The writeback function requires the buffer to be of ITER_FOLIOQ type. -Each corresponds to a VM address space operation. These operations use the -state in the per-inode context. +High-Level VM API +================== -For ->readahead() and ->read_folio(), the network filesystem just point directly -at the corresponding read helper; whereas for ->write_begin(), it may be a -little more complicated as the network filesystem might want to flush -conflicting writes or track dirty data and needs to put the acquired folio if -an error occurs after calling the helper. +Netfslib also provides a number of sets of API calls for the filesystem to +delegate VM operations to. Again, netfslib, in turn, will call out to the +filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places +for it to intervene at various times:: -The helpers manage the read request, calling back into the network filesystem -through the supplied table of operations. Waits will be performed as -necessary before returning for helpers that are meant to be synchronous. + void netfs_readahead(struct readahead_control *); + int netfs_read_folio(struct file *, struct folio *); + int netfs_writepages(struct address_space *mapping, + struct writeback_control *wbc); + bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio); + void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length); + bool netfs_release_folio(struct folio *folio, gfp_t gfp); -If an error occurs, the ->free_request() will be called to clean up the -netfs_io_request struct allocated. If some parts of the request are in -progress when an error occurs, the request will get partially completed if -sufficient data is read. +These are ``address_space_operations`` methods and can be set directly in the +operations table. -Additionally, there is:: +Deprecated PG_private_2 API +--------------------------- - * void netfs_subreq_terminated(struct netfs_io_subrequest *subreq, - ssize_t transferred_or_error, - bool was_async); +There is also a deprecated function for filesystems that still use the +``->write_begin`` method:: -which should be called to complete a read subrequest. This is given the number -of bytes transferred or a negative error code, plus a flag indicating whether -the operation was asynchronous (ie. whether the follow-on processing can be -done in the current context, given this may involve sleeping). + int netfs_write_begin(struct netfs_inode *inode, struct file *file, + struct address_space *mapping, loff_t pos, unsigned int len, + struct folio **_folio, void **_fsdata); +It uses the deprecated PG_private_2 flag and so should not be used. -Read Helper Structures ----------------------- -The read helpers make use of a couple of structures to maintain the state of -the read. The first is a structure that manages a read request as a whole:: +I/O Request API +=============== + +The I/O request API comprises a number of structures and a number of functions +that the filesystem may need to use. + +Request Structure +----------------- + +The request structure manages the request as a whole, holding some resources +and state on behalf of the filesystem and tracking the collection of results:: struct netfs_io_request { + enum netfs_io_origin origin; struct inode *inode; struct address_space *mapping; - struct netfs_cache_resources cache_resources; + struct netfs_group *group; + struct netfs_io_stream io_streams[]; void *netfs_priv; - loff_t start; - size_t len; - loff_t i_size; - const struct netfs_request_ops *netfs_ops; + void *netfs_priv2; + unsigned long long start; + unsigned long long len; + unsigned long long i_size; unsigned int debug_id; + unsigned long flags; ... }; -The above fields are the ones the netfs can use. They are: +Many of the fields are for internal use, but the fields shown here are of +interest to the filesystem: + + * ``origin`` + + The origin of the request (readahead, read_folio, DIO read, writeback, ...). * ``inode`` * ``mapping`` @@ -202,11 +524,19 @@ The above fields are the ones the netfs can use. They are: The inode and the address space of the file being read from. The mapping may or may not point to inode->i_data. - * ``cache_resources`` + * ``group`` + + The writeback group this request is dealing with or NULL. This holds a ref + on the group. + + * ``io_streams`` - Resources for the local cache to use, if present. + The parallel streams of subrequests available to the request. Currently two + are available, but this may be made extensible in future. ``NR_IO_STREAMS`` + indicates the size of the array. * ``netfs_priv`` + * ``netfs_priv2`` The network filesystem's private data. The value for this can be passed in to the helper functions or set during the request. @@ -221,37 +551,121 @@ The above fields are the ones the netfs can use. They are: The size of the file at the start of the request. - * ``netfs_ops`` - - A pointer to the operation table. The value for this is passed into the - helper functions. - * ``debug_id`` A number allocated to this operation that can be displayed in trace lines for reference. + * ``flags`` + + Flags for managing and controlling the operation of the request. Some of + these may be of interest to the filesystem: + + * ``NETFS_RREQ_RETRYING`` + + Netfslib sets this when generating retries. + + * ``NETFS_RREQ_PAUSE`` + + The filesystem can set this to request to pause the library's subrequest + issuing loop - but care needs to be taken as netfslib may also set it. + + * ``NETFS_RREQ_NONBLOCK`` + * ``NETFS_RREQ_BLOCKED`` + + Netfslib sets the first to indicate that non-blocking mode was set by the + caller and the filesystem can set the second to indicate that it would + have had to block. + + * ``NETFS_RREQ_USE_PGPRIV2`` + + The filesystem can set this if it wants to use PG_private_2 to track + whether a folio is being written to the cache. This is deprecated as + PG_private_2 is going to go away. + +If the filesystem wants more private data than is afforded by this structure, +then it should wrap it and provide its own allocator. + +Stream Structure +---------------- + +A request is comprised of one or more parallel streams and each stream may be +aimed at a different target. + +For read requests, only stream 0 is used. This can contain a mixture of +subrequests aimed at different sources. For write requests, stream 0 is used +for the server and stream 1 is used for the cache. For buffered writeback, +stream 0 is not enabled unless a normal dirty folio is encountered, at which +point ->begin_writeback() will be invoked and the filesystem can mark the +stream available. + +The stream struct looks like:: + + struct netfs_io_stream { + unsigned char stream_nr; + bool avail; + size_t sreq_max_len; + unsigned int sreq_max_segs; + unsigned int submit_extendable_to; + ... + }; + +A number of members are available for access/use by the filesystem: + + * ``stream_nr`` + + The number of the stream within the request. + + * ``avail`` + + True if the stream is available for use. The filesystem should set this on + stream zero if in ->begin_writeback(). + + * ``sreq_max_len`` + * ``sreq_max_segs`` + + These are set by the filesystem or the cache in ->prepare_read() or + ->prepare_write() for each subrequest to indicate the maximum number of + bytes and, optionally, the maximum number of segments (if not 0) that that + subrequest can support. + + * ``submit_extendable_to`` -The second structure is used to manage individual slices of the overall read -request:: + The size that a subrequest can be rounded up to beyond the EOF, given the + available buffer. This allows the cache to work out if it can do a DIO read + or write that straddles the EOF marker. + +Subrequest Structure +-------------------- + +Individual units of I/O are managed by the subrequest structure. These +represent slices of the overall request and run independently:: struct netfs_io_subrequest { struct netfs_io_request *rreq; - loff_t start; + struct iov_iter io_iter; + unsigned long long start; size_t len; size_t transferred; unsigned long flags; + short error; unsigned short debug_index; + unsigned char stream_nr; ... }; -Each subrequest is expected to access a single source, though the helpers will +Each subrequest is expected to access a single source, though the library will handle falling back from one source type to another. The members are: * ``rreq`` A pointer to the read request. + * ``io_iter`` + + An I/O iterator representing a slice of the buffer to be read into or + written from. + * ``start`` * ``len`` @@ -260,241 +674,300 @@ handle falling back from one source type to another. The members are: * ``transferred`` - The amount of data transferred so far of the length of this slice. The - network filesystem or cache should start the operation this far into the - slice. If a short read occurs, the helpers will call again, having updated - this to reflect the amount read so far. + The amount of data transferred so far for this subrequest. This should be + added to with the length of the transfer made by this issuance of the + subrequest. If this is less than ``len`` then the subrequest may be + reissued to continue. * ``flags`` - Flags pertaining to the read. There are two of interest to the filesystem - or cache: + Flags for managing the subrequest. There are a number of interest to the + filesystem or cache: + + * ``NETFS_SREQ_MADE_PROGRESS`` + + Set by the filesystem to indicates that at least one byte of data was read + or written. + + * ``NETFS_SREQ_HIT_EOF`` + + The filesystem should set this if a read hit the EOF on the file (in which + case ``transferred`` should stop at the EOF). Netfslib may expand the + subrequest out to the size of the folio containing the EOF on the off + chance that a third party change happened or a DIO read may have asked for + more than is available. The library will clear any excess pagecache. * ``NETFS_SREQ_CLEAR_TAIL`` - This can be set to indicate that the remainder of the slice, from - transferred to len, should be cleared. + The filesystem can set this to indicate that the remainder of the slice, + from transferred to len, should be cleared. Do not set if HIT_EOF is set. + + * ``NETFS_SREQ_NEED_RETRY`` + + The filesystem can set this to tell netfslib to retry the subrequest. + + * ``NETFS_SREQ_BOUNDARY`` + + This can be set by the filesystem on a subrequest to indicate that it ends + at a boundary with the filesystem structure (e.g. at the end of a Ceph + object). It tells netfslib not to retile subrequests across it. * ``NETFS_SREQ_SEEK_DATA_READ`` - This is a hint to the cache that it might want to try skipping ahead to - the next data (ie. using SEEK_DATA). + This is a hint from netfslib to the cache that it might want to try + skipping ahead to the next data (ie. using SEEK_DATA). + + * ``error`` + + This is for the filesystem to store result of the subrequest. It should be + set to 0 if successful and a negative error code otherwise. * ``debug_index`` + * ``stream_nr`` A number allocated to this slice that can be displayed in trace lines for - reference. + reference and the number of the request stream that it belongs to. +If necessary, the filesystem can get and put extra refs on the subrequest it is +given:: -Read Helper Operations ----------------------- + void netfs_get_subrequest(struct netfs_io_subrequest *subreq, + enum netfs_sreq_ref_trace what); + void netfs_put_subrequest(struct netfs_io_subrequest *subreq, + enum netfs_sreq_ref_trace what); -The network filesystem must provide the read helpers with a table of operations -through which it can issue requests and negotiate:: +using netfs trace codes to indicate the reason. Care must be taken, however, +as once control of the subrequest is returned to netfslib, the same subrequest +can be reissued/retried. + +Filesystem Methods +------------------ + +The filesystem sets a table of operations in ``netfs_inode`` for netfslib to +use:: struct netfs_request_ops { - void (*init_request)(struct netfs_io_request *rreq, struct file *file); + mempool_t *request_pool; + mempool_t *subrequest_pool; + int (*init_request)(struct netfs_io_request *rreq, struct file *file); void (*free_request)(struct netfs_io_request *rreq); + void (*free_subrequest)(struct netfs_io_subrequest *rreq); void (*expand_readahead)(struct netfs_io_request *rreq); - bool (*clamp_length)(struct netfs_io_subrequest *subreq); + int (*prepare_read)(struct netfs_io_subrequest *subreq); void (*issue_read)(struct netfs_io_subrequest *subreq); - bool (*is_still_valid)(struct netfs_io_request *rreq); - int (*check_write_begin)(struct file *file, loff_t pos, unsigned len, - struct folio **foliop, void **_fsdata); void (*done)(struct netfs_io_request *rreq); + void (*update_i_size)(struct inode *inode, loff_t i_size); + void (*post_modify)(struct inode *inode); + void (*begin_writeback)(struct netfs_io_request *wreq); + void (*prepare_write)(struct netfs_io_subrequest *subreq); + void (*issue_write)(struct netfs_io_subrequest *subreq); + void (*retry_request)(struct netfs_io_request *wreq, + struct netfs_io_stream *stream); + void (*invalidate_cache)(struct netfs_io_request *wreq); }; -The operations are as follows: - - * ``init_request()`` +The table starts with a pair of optional pointers to memory pools from which +requests and subrequests can be allocated. If these are not given, netfslib +has default pools that it will use instead. If the filesystem wraps the netfs +structs in its own larger structs, then it will need to use its own pools. +Netfslib will allocate directly from the pools. - [Optional] This is called to initialise the request structure. It is given - the file for reference. +The methods defined in the table are: + * ``init_request()`` * ``free_request()`` + * ``free_subrequest()`` - [Optional] This is called as the request is being deallocated so that the - filesystem can clean up any state it has attached there. + [Optional] A filesystem may implement these to initialise or clean up any + resources that it attaches to the request or subrequest. * ``expand_readahead()`` [Optional] This is called to allow the filesystem to expand the size of a - readahead read request. The filesystem gets to expand the request in both - directions, though it's not permitted to reduce it as the numbers may - represent an allocation already made. If local caching is enabled, it gets - to expand the request first. + readahead request. The filesystem gets to expand the request in both + directions, though it must retain the initial region as that may represent + an allocation already made. If local caching is enabled, it gets to expand + the request first. Expansion is communicated by changing ->start and ->len in the request structure. Note that if any change is made, ->len must be increased by at least as much as ->start is reduced. - * ``clamp_length()`` - - [Optional] This is called to allow the filesystem to reduce the size of a - subrequest. The filesystem can use this, for example, to chop up a request - that has to be split across multiple servers or to put multiple reads in - flight. - - This should return 0 on success and an error code on error. - - * ``issue_read()`` + * ``prepare_read()`` - [Required] The helpers use this to dispatch a subrequest to the server for - reading. In the subrequest, ->start, ->len and ->transferred indicate what - data should be read from the server. + [Optional] This is called to allow the filesystem to limit the size of a + subrequest. It may also limit the number of individual regions in iterator, + such as required by RDMA. This information should be set on stream zero in:: - There is no return value; the netfs_subreq_terminated() function should be - called to indicate whether or not the operation succeeded and how much data - it transferred. The filesystem also should not deal with setting folios - uptodate, unlocking them or dropping their refs - the helpers need to deal - with this as they have to coordinate with copying to the local cache. + rreq->io_streams[0].sreq_max_len + rreq->io_streams[0].sreq_max_segs - Note that the helpers have the folios locked, but not pinned. It is - possible to use the ITER_XARRAY iov iterator to refer to the range of the - inode that is being operated upon without the need to allocate large bvec - tables. + The filesystem can use this, for example, to chop up a request that has to + be split across multiple servers or to put multiple reads in flight. - * ``is_still_valid()`` + Zero should be returned on success and an error code otherwise. - [Optional] This is called to find out if the data just read from the local - cache is still valid. It should return true if it is still valid and false - if not. If it's not still valid, it will be reread from the server. + * ``issue_read()`` - * ``check_write_begin()`` + [Required] Netfslib calls this to dispatch a subrequest to the server for + reading. In the subrequest, ->start, ->len and ->transferred indicate what + data should be read from the server and ->io_iter indicates the buffer to be + used. - [Optional] This is called from the netfs_write_begin() helper once it has - allocated/grabbed the folio to be modified to allow the filesystem to flush - conflicting state before allowing it to be modified. + There is no return value; the ``netfs_read_subreq_terminated()`` function + should be called to indicate that the subrequest completed either way. + ->error, ->transferred and ->flags should be updated before completing. The + termination can be done asynchronously. - It may unlock and discard the folio it was given and set the caller's folio - pointer to NULL. It should return 0 if everything is now fine (``*foliop`` - left set) or the op should be retried (``*foliop`` cleared) and any other - error code to abort the operation. + Note: the filesystem must not deal with setting folios uptodate, unlocking + them or dropping their refs - the library deals with this as it may have to + stitch together the results of multiple subrequests that variously overlap + the set of folios. - * ``done`` + * ``done()`` - [Optional] This is called after the folios in the request have all been + [Optional] This is called after the folios in a read request have all been unlocked (and marked uptodate if applicable). + * ``update_i_size()`` + + [Optional] This is invoked by netfslib at various points during the write + paths to ask the filesystem to update its idea of the file size. If not + given, netfslib will set i_size and i_blocks and update the local cache + cookie. + + * ``post_modify()`` + + [Optional] This is called after netfslib writes to the pagecache or when it + allows an mmap'd page to be marked as writable. + + * ``begin_writeback()`` + + [Optional] Netfslib calls this when processing a writeback request if it + finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE, + indicating it must be written to the server. This allows the filesystem to + only set up writeback resources when it knows it's going to have to perform + a write. + + * ``prepare_write()`` + [Optional] This is called to allow the filesystem to limit the size of a + subrequest. It may also limit the number of individual regions in iterator, + such as required by RDMA. This information should be set on stream to which + the subrequest belongs:: -Read Helper Procedure ---------------------- - -The read helpers work by the following general procedure: - - * Set up the request. - - * For readahead, allow the local cache and then the network filesystem to - propose expansions to the read request. This is then proposed to the VM. - If the VM cannot fully perform the expansion, a partially expanded read will - be performed, though this may not get written to the cache in its entirety. - - * Loop around slicing chunks off of the request to form subrequests: - - * If a local cache is present, it gets to do the slicing, otherwise the - helpers just try to generate maximal slices. - - * The network filesystem gets to clamp the size of each slice if it is to be - the source. This allows rsize and chunking to be implemented. + rreq->io_streams[subreq->stream_nr].sreq_max_len + rreq->io_streams[subreq->stream_nr].sreq_max_segs - * The helpers issue a read from the cache or a read from the server or just - clears the slice as appropriate. + The filesystem can use this, for example, to chop up a request that has to + be split across multiple servers or to put multiple writes in flight. - * The next slice begins at the end of the last one. + This is not permitted to return an error. Instead, in the event of failure, + ``netfs_prepare_write_failed()`` must be called. - * As slices finish being read, they terminate. + * ``issue_write()`` - * When all the subrequests have terminated, the subrequests are assessed and - any that are short or have failed are reissued: + [Required] This is used to dispatch a subrequest to the server for writing. + In the subrequest, ->start, ->len and ->transferred indicate what data + should be written to the server and ->io_iter indicates the buffer to be + used. - * Failed cache requests are issued against the server instead. + There is no return value; the ``netfs_write_subreq_terminated()`` function + should be called to indicate that the subrequest completed either way. + ->error, ->transferred and ->flags should be updated before completing. The + termination can be done asynchronously. - * Failed server requests just fail. + Note: the filesystem must not deal with removing the dirty or writeback + marks on folios involved in the operation and should not take refs or pins + on them, but should leave retention to netfslib. - * Short reads against either source will be reissued against that source - provided they have transferred some more data: + * ``retry_request()`` - * The cache may need to skip holes that it can't do DIO from. + [Optional] Netfslib calls this at the beginning of a retry cycle. This + allows the filesystem to examine the state of the request, the subrequests + in the indicated stream and of its own data and make adjustments or + renegotiate resources. + + * ``invalidate_cache()`` - * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the - end of the slice instead of reissuing. + [Optional] This is called by netfslib to invalidate data stored in the local + cache in the event that writing to the local cache fails, providing updated + coherency data that netfs can't provide. - * Once the data is read, the folios that have been fully read/cleared: +Terminating a subrequest +------------------------ - * Will be marked uptodate. +When a subrequest completes, there are a number of functions that the cache or +subrequest can call to inform netfslib of the status change. One function is +provided to terminate a write subrequest at the preparation stage and acts +synchronously: - * If a cache is present, will be marked with PG_fscache. + * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);`` - * Unlocked + Indicate that the ->prepare_write() call failed. The ``error`` field should + have been updated. - * Any folios that need writing to the cache will then have DIO writes issued. +Note that ->prepare_read() can return an error as a read can simply be aborted. +Dealing with writeback failure is trickier. - * Synchronous operations will wait for reading to be complete. +The other functions are used for subrequests that got as far as being issued: - * Writes to the cache will proceed asynchronously and the folios will have the - PG_fscache mark removed when that completes. + * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);`` - * The request structures will be cleaned up when everything has completed. + Tell netfslib that a read subrequest has terminated. The ``error``, + ``flags`` and ``transferred`` fields should have been updated. + * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);`` -Read Helper Cache API ---------------------- + Tell netfslib that a write subrequest has terminated. Either the amount of + data processed or the negative error code can be passed in. This is + can be used as a kiocb completion function. -When implementing a local cache to be used by the read helpers, two things are -required: some way for the network filesystem to initialise the caching for a -read request and a table of operations for the helpers to call. + * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);`` -To begin a cache operation on an fscache object, the following function is -called:: + This is provided to optionally update netfslib on the incremental progress + of a read, allowing some folios to be unlocked early and does not actually + terminate the subrequest. The ``transferred`` field should have been + updated. - int fscache_begin_read_operation(struct netfs_io_request *rreq, - struct fscache_cookie *cookie); +Local Cache API +--------------- -passing in the request pointer and the cookie corresponding to the file. This -fills in the cache resources mentioned below. +Netfslib provides a separate API for a local cache to implement, though it +provides some somewhat similar routines to the filesystem request API. -The netfs_io_request object contains a place for the cache to hang its +Firstly, the netfs_io_request object contains a place for the cache to hang its state:: struct netfs_cache_resources { const struct netfs_cache_ops *ops; void *cache_priv; void *cache_priv2; + unsigned int debug_id; + unsigned int inval_counter; }; -This contains an operations table pointer and two private pointers. The -operation table looks like the following:: +This contains an operations table pointer and two private pointers plus the +debug ID of the fscache cookie for tracing purposes and an invalidation counter +that is cranked by calls to ``fscache_invalidate()`` allowing cache subrequests +to be invalidated after completion. + +The cache operation table looks like the following:: struct netfs_cache_ops { void (*end_operation)(struct netfs_cache_resources *cres); - void (*expand_readahead)(struct netfs_cache_resources *cres, loff_t *_start, size_t *_len, loff_t i_size); - enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq, - loff_t i_size); - + loff_t i_size); int (*read)(struct netfs_cache_resources *cres, loff_t start_pos, struct iov_iter *iter, bool seek_data, netfs_io_terminated_t term_func, void *term_func_priv); - - int (*prepare_write)(struct netfs_cache_resources *cres, - loff_t *_start, size_t *_len, loff_t i_size, - bool no_space_allocated_yet); - - int (*write)(struct netfs_cache_resources *cres, - loff_t start_pos, - struct iov_iter *iter, - netfs_io_terminated_t term_func, - void *term_func_priv); - - int (*query_occupancy)(struct netfs_cache_resources *cres, - loff_t start, size_t len, size_t granularity, - loff_t *_data_start, size_t *_data_len); + void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq); + void (*issue_write)(struct netfs_io_subrequest *subreq); }; With a termination handler function pointer:: @@ -511,10 +984,16 @@ The methods defined in the table are: * ``expand_readahead()`` - [Optional] Called at the beginning of a netfs_readahead() operation to allow - the cache to expand a request in either direction. This allows the cache to + [Optional] Called at the beginning of a readahead operation to allow the + cache to expand a request in either direction. This allows the cache to size the request appropriately for the cache granularity. + * ``prepare_read()`` + + [Required] Called to configure the next slice of a request. ->start and + ->len in the subrequest indicate where and how big the next slice can be; + the cache gets to reduce the length to match its granularity requirements. + The function is passed pointers to the start and length in its parameters, plus the size of the file for reference, and adjusts the start and length appropriately. It should return one of: @@ -528,12 +1007,6 @@ The methods defined in the table are: downloaded from the server or read from the cache - or whether slicing should be given up at the current point. - * ``prepare_read()`` - - [Required] Called to configure the next slice of a request. ->start and - ->len in the subrequest indicate where and how big the next slice can be; - the cache gets to reduce the length to match its granularity requirements. - * ``read()`` [Required] Called to read from the cache. The start file offset is given @@ -547,44 +1020,33 @@ The methods defined in the table are: indicating whether the termination is definitely happening in the caller's context. - * ``prepare_write()`` + * ``prepare_write_subreq()`` - [Required] Called to prepare a write to the cache to take place. This - involves checking to see whether the cache has sufficient space to honour - the write. ``*_start`` and ``*_len`` indicate the region to be written; the - region can be shrunk or it can be expanded to a page boundary either way as - necessary to align for direct I/O. i_size holds the size of the object and - is provided for reference. no_space_allocated_yet is set to true if the - caller is certain that no data has been written to that region - for example - if it tried to do a read from there already. + [Required] This is called to allow the cache to limit the size of a + subrequest. It may also limit the number of individual regions in iterator, + such as required by DIO/DMA. This information should be set on stream to + which the subrequest belongs:: - * ``write()`` + rreq->io_streams[subreq->stream_nr].sreq_max_len + rreq->io_streams[subreq->stream_nr].sreq_max_segs - [Required] Called to write to the cache. The start file offset is given - along with an iterator to write from, which gives the length also. - - Also provided is a pointer to a termination handler function and private - data to pass to that function. The termination function should be called - with the number of bytes transferred or an error code, plus a flag - indicating whether the termination is definitely happening in the caller's - context. + The filesystem can use this, for example, to chop up a request that has to + be split across multiple servers or to put multiple writes in flight. - * ``query_occupancy()`` + This is not permitted to return an error. In the event of failure, + ``netfs_prepare_write_failed()`` must be called. - [Required] Called to find out where the next piece of data is within a - particular region of the cache. The start and length of the region to be - queried are passed in, along with the granularity to which the answer needs - to be aligned. The function passes back the start and length of the data, - if any, available within that region. Note that there may be a hole at the - front. + * ``issue_write()`` - It returns 0 if some data was found, -ENODATA if there was no usable data - within the region or -ENOBUFS if there is no caching on this file. + [Required] This is used to dispatch a subrequest to the cache for writing. + In the subrequest, ->start, ->len and ->transferred indicate what data + should be written to the cache and ->io_iter indicates the buffer to be + used. -Note that these methods are passed a pointer to the cache resource structure, -not the read request structure as they could be used in other situations where -there isn't a read request structure as well, such as writing dirty data to the -cache. + There is no return value; the ``netfs_write_subreq_terminated()`` function + should be called to indicate that the subrequest completed either way. + ->error, ->transferred and ->flags should be updated before completing. The + termination can be done asynchronously. API Function Reference diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index 767b2927c762..3111ef5592f3 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -1203,3 +1203,43 @@ should use d_drop();d_splice_alias() and return the result of the latter. If a positive dentry cannot be returned for some reason, in-kernel clients such as cachefiles, nfsd, smb/server may not perform ideally but will fail-safe. + +--- + +** mandatory** + +lookup_one(), lookup_one_unlocked(), lookup_one_positive_unlocked() now +take a qstr instead of a name and len. These, not the "one_len" +versions, should be used whenever accessing a filesystem from outside +that filesysmtem, through a mount point - which will have a mnt_idmap. + +--- + +** mandatory** + +Functions try_lookup_one_len(), lookup_one_len(), +lookup_one_len_unlocked() and lookup_positive_unlocked() have been +renamed to try_lookup_noperm(), lookup_noperm(), +lookup_noperm_unlocked(), lookup_noperm_positive_unlocked(). They now +take a qstr instead of separate name and length. QSTR() can be used +when strlen() is needed for the length. + +For try_lookup_noperm() a reference to the qstr is passed in case the +hash might subsequently be needed. + +These function no longer do any permission checking - they previously +checked that the caller has 'X' permission on the parent. They must +ONLY be used internally by a filesystem on itself when it knows that +permissions are irrelevant or in a context where permission checks have +already been performed such as after vfs_path_parent_lookup() + +--- + +** mandatory** + +d_hash_and_lookup() is no longer exported or available outside the VFS. +Use try_lookup_noperm() instead. This adds name validation and takes +arguments in the opposite order but is otherwise identical. + +Using try_lookup_noperm() will require linux/namei.h to be included. + diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index ae79c30b6c0c..bf051c7da6b8 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -716,9 +716,8 @@ page lookup by address, and keeping track of pages tagged as Dirty or Writeback. The first can be used independently to the others. The VM can try to -either write dirty pages in order to clean them, or release clean pages -in order to reuse them. To do this it can call the ->writepage method -on dirty pages, and ->release_folio on clean folios with the private +release clean pages in order to reuse them. To do this it can call +->release_folio on clean folios with the private flag set. Clean pages without PagePrivate and with no external references will be released without notice being given to the address_space. @@ -731,8 +730,8 @@ maintains information about the PG_Dirty and PG_Writeback status of each page, so that pages with either of these flags can be found quickly. The Dirty tag is primarily used by mpage_writepages - the default -->writepages method. It uses the tag to find dirty pages to call -->writepage on. If mpage_writepages is not used (i.e. the address +->writepages method. It uses the tag to find dirty pages to +write back. If mpage_writepages is not used (i.e. the address provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is almost unused. write_inode_now and sync_inode do use it (through __sync_single_inode) to check if ->writepages has been successful in @@ -756,23 +755,23 @@ pages, however the address_space has finer control of write sizes. The read process essentially only requires 'read_folio'. The write process is more complicated and uses write_begin/write_end or -dirty_folio to write data into the address_space, and writepage and +dirty_folio to write data into the address_space, and writepages to writeback data to storage. Adding and removing pages to/from an address_space is protected by the inode's i_mutex. When data is written to a page, the PG_Dirty flag should be set. It -typically remains set until writepage asks for it to be written. This +typically remains set until writepages asks for it to be written. This should clear PG_Dirty and set PG_Writeback. It can be actually written at any point after PG_Dirty is clear. Once it is known to be safe, PG_Writeback is cleared. Writeback makes use of a writeback_control structure to direct the -operations. This gives the writepage and writepages operations some +operations. This gives the writepages operation some information about the nature of and reason for the writeback request, and the constraints under which it is being done. It is also used to -return information back to the caller about the result of a writepage or +return information back to the caller about the result of a writepages request. @@ -819,7 +818,6 @@ cache in your filesystem. The following members are defined: .. code-block:: c struct address_space_operations { - int (*writepage)(struct page *page, struct writeback_control *wbc); int (*read_folio)(struct file *, struct folio *); int (*writepages)(struct address_space *, struct writeback_control *); bool (*dirty_folio)(struct address_space *, struct folio *); @@ -848,25 +846,6 @@ cache in your filesystem. The following members are defined: int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter); }; -``writepage`` - called by the VM to write a dirty page to backing store. This - may happen for data integrity reasons (i.e. 'sync'), or to free - up memory (flush). The difference can be seen in - wbc->sync_mode. The PG_Dirty flag has been cleared and - PageLocked is true. writepage should start writeout, should set - PG_Writeback, and should make sure the page is unlocked, either - synchronously or asynchronously when the write operation - completes. - - If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to - try too hard if there are problems, and may choose to write out - other pages from the mapping if that is easier (e.g. due to - internal dependencies). If it chooses not to start writeout, it - should return AOP_WRITEPAGE_ACTIVATE so that the VM will not - keep calling ->writepage on that page. - - See the file "Locking" for more details. - ``read_folio`` Called by the page cache to read a folio from the backing store. The 'file' argument supplies authentication information to network @@ -909,7 +888,7 @@ cache in your filesystem. The following members are defined: given and that many pages should be written if possible. If no ->writepages is given, then mpage_writepages is used instead. This will choose pages from the address space that are tagged as - DIRTY and will pass them to ->writepage. + DIRTY and will write them back. ``dirty_folio`` called by the VM to mark a folio as dirty. This is particularly diff --git a/Documentation/kbuild/reproducible-builds.rst b/Documentation/kbuild/reproducible-builds.rst index a7762486c93f..f2dcc39044e6 100644 --- a/Documentation/kbuild/reproducible-builds.rst +++ b/Documentation/kbuild/reproducible-builds.rst @@ -46,6 +46,21 @@ The kernel embeds the building user and host names in `KBUILD_BUILD_USER and KBUILD_BUILD_HOST`_ variables. If you are building from a git commit, you could use its committer address. +Absolute filenames +------------------ + +When the kernel is built out-of-tree, debug information may include +absolute filenames for the source files. This must be overridden by +including the ``-fdebug-prefix-map`` option in the `KCFLAGS`_ variable. + +Depending on the compiler used, the ``__FILE__`` macro may also expand +to an absolute filename in an out-of-tree build. Kbuild automatically +uses the ``-fmacro-prefix-map`` option to prevent this, if it is +supported. + +The Reproducible Builds web site has more information about these +`prefix-map options`_. + Generated files in source packages ---------------------------------- @@ -116,5 +131,7 @@ See ``scripts/setlocalversion`` for details. .. _KBUILD_BUILD_TIMESTAMP: kbuild.html#kbuild-build-timestamp .. _KBUILD_BUILD_USER and KBUILD_BUILD_HOST: kbuild.html#kbuild-build-user-kbuild-build-host +.. _KCFLAGS: kbuild.html#kcflags +.. _prefix-map options: https://reproducible-builds.org/docs/build-path/ .. _Reproducible Builds project: https://reproducible-builds.org/ .. _SOURCE_DATE_EPOCH: https://reproducible-builds.org/docs/source-date-epoch/ diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml index 655d8d10fe24..c650cd3dcb80 100644 --- a/Documentation/netlink/specs/ethtool.yaml +++ b/Documentation/netlink/specs/ethtool.yaml @@ -89,8 +89,10 @@ definitions: doc: Group of short_detected states - name: phy-upstream-type - enum-name: + enum-name: phy-upstream + header: linux/ethtool.h type: enum + name-prefix: phy-upstream entries: [ mac, phy ] - name: tcp-data-split diff --git a/Documentation/netlink/specs/ovs_vport.yaml b/Documentation/netlink/specs/ovs_vport.yaml index 86ba9ac2a521..b538bb99ee9b 100644 --- a/Documentation/netlink/specs/ovs_vport.yaml +++ b/Documentation/netlink/specs/ovs_vport.yaml @@ -123,12 +123,12 @@ attribute-sets: operations: name-prefix: ovs-vport-cmd- + fixed-header: ovs-header list: - name: new doc: Create a new OVS vport attribute-set: vport - fixed-header: ovs-header do: request: attributes: @@ -141,7 +141,6 @@ operations: name: del doc: Delete existing OVS vport from a data path attribute-set: vport - fixed-header: ovs-header do: request: attributes: @@ -152,7 +151,6 @@ operations: name: get doc: Get / dump OVS vport configuration and state attribute-set: vport - fixed-header: ovs-header do: &vport-get-op request: attributes: diff --git a/Documentation/netlink/specs/rt_link.yaml b/Documentation/netlink/specs/rt_link.yaml index 31238455f8e9..6b9d5ee87d93 100644 --- a/Documentation/netlink/specs/rt_link.yaml +++ b/Documentation/netlink/specs/rt_link.yaml @@ -1113,11 +1113,10 @@ attribute-sets: - name: prop-list type: nest - nested-attributes: link-attrs + nested-attributes: prop-list-link-attrs - name: alt-ifname type: string - multi-attr: true - name: perm-address type: binary @@ -1164,6 +1163,13 @@ attribute-sets: name: netns-immutable type: u8 - + name: prop-list-link-attrs + subset-of: link-attrs + attributes: + - + name: alt-ifname + multi-attr: true + - name: af-spec-attrs attributes: - @@ -1585,7 +1591,7 @@ attribute-sets: name: nf-call-iptables type: u8 - - name: nf-call-ip6-tables + name: nf-call-ip6tables type: u8 - name: nf-call-arptables @@ -2077,7 +2083,7 @@ attribute-sets: name: id type: u16 - - name: flag + name: flags type: binary struct: ifla-vlan-flags - @@ -2165,7 +2171,7 @@ attribute-sets: type: binary struct: ifla-cacheinfo - - name: icmp6-stats + name: icmp6stats type: binary struct: ifla-icmp6-stats - @@ -2179,9 +2185,10 @@ attribute-sets: type: u32 - name: mctp-attrs + name-prefix: ifla-mctp- attributes: - - name: mctp-net + name: net type: u32 - name: phys-binding @@ -2453,7 +2460,6 @@ operations: - min-mtu - max-mtu - prop-list - - alt-ifname - perm-address - proto-down-reason - parent-dev-name diff --git a/Documentation/netlink/specs/rt_neigh.yaml b/Documentation/netlink/specs/rt_neigh.yaml index e670b6dc07be..a843caa72259 100644 --- a/Documentation/netlink/specs/rt_neigh.yaml +++ b/Documentation/netlink/specs/rt_neigh.yaml @@ -13,25 +13,25 @@ definitions: type: struct members: - - name: family + name: ndm-family type: u8 - - name: pad + name: ndm-pad type: pad len: 3 - - name: ifindex + name: ndm-ifindex type: s32 - - name: state + name: ndm-state type: u16 enum: nud-state - - name: flags + name: ndm-flags type: u8 enum: ntf-flags - - name: type + name: ndm-type type: u8 enum: rtm-type - @@ -189,7 +189,7 @@ attribute-sets: type: binary display-hint: ipv4 - - name: lladr + name: lladdr type: binary display-hint: mac - diff --git a/Documentation/netlink/specs/tc.yaml b/Documentation/netlink/specs/tc.yaml index aacccea5dfe4..953aa837958b 100644 --- a/Documentation/netlink/specs/tc.yaml +++ b/Documentation/netlink/specs/tc.yaml @@ -2017,7 +2017,8 @@ attribute-sets: attributes: - name: act - type: nest + type: indexed-array + sub-type: nest nested-attributes: tc-act-attrs - name: police @@ -2250,7 +2251,8 @@ attribute-sets: attributes: - name: act - type: nest + type: indexed-array + sub-type: nest nested-attributes: tc-act-attrs - name: police @@ -2745,7 +2747,7 @@ attribute-sets: type: u16 byte-order: big-endian - - name: key-l2-tpv3-sid + name: key-l2tpv3-sid type: u32 byte-order: big-endian - @@ -3504,7 +3506,7 @@ attribute-sets: name: rate64 type: u64 - - name: prate4 + name: prate64 type: u64 - name: burst diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst index b8fef8101176..7aabead90648 100644 --- a/Documentation/networking/timestamping.rst +++ b/Documentation/networking/timestamping.rst @@ -811,11 +811,9 @@ Documentation/devicetree/bindings/ptp/timestamper.txt for more details. 3.2.4 Other caveats for MAC drivers ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Stacked PHCs, especially DSA (but not only) - since that doesn't require any -modification to MAC drivers, so it is more difficult to ensure correctness of -all possible code paths - is that they uncover bugs which were impossible to -trigger before the existence of stacked PTP clocks. One example has to do with -this line of code, already presented earlier:: +The use of stacked PHCs may uncover MAC driver bugs which were impossible to +trigger without them. One example has to do with this line of code, already +presented earlier:: skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; diff --git a/Documentation/power/runtime_pm.rst b/Documentation/power/runtime_pm.rst index 12f429359a82..63344bea8393 100644 --- a/Documentation/power/runtime_pm.rst +++ b/Documentation/power/runtime_pm.rst @@ -154,7 +154,7 @@ suspending the device are satisfied) and to queue up a suspend request for the device in that case. If there is no idle callback, or if the callback returns 0, then the PM core will attempt to carry out a runtime suspend of the device, also respecting devices configured for autosuspend. In essence this means a -call to __pm_runtime_autosuspend() (do note that drivers needs to update the +call to pm_runtime_autosuspend() (do note that drivers needs to update the device last busy mark, pm_runtime_mark_last_busy(), to control the delay under this circumstance). To prevent this (for example, if the callback routine has started a delayed suspend), the routine must return a non-zero value. Negative diff --git a/Documentation/translations/zh_CN/arch/openrisc/openrisc_port.rst b/Documentation/translations/zh_CN/arch/openrisc/openrisc_port.rst index cadc580fa23b..d728e4db0b85 100644 --- a/Documentation/translations/zh_CN/arch/openrisc/openrisc_port.rst +++ b/Documentation/translations/zh_CN/arch/openrisc/openrisc_port.rst @@ -17,10 +17,10 @@ OpenRISC 1000系列(或1k)。 关于OpenRISC处理器和正在进行中的开发的信息: - ======= ============================= + ======= ============================== 网站 https://openrisc.io - 邮箱 openrisc@lists.librecores.org - ======= ============================= + 邮箱 linux-openrisc@vger.kernel.org + ======= ============================== --------------------------------------------------------------------- @@ -36,11 +36,11 @@ OpenRISC工具链和Linux的构建指南 工具链的构建指南可以在openrisc.io或Stafford的工具链构建和发布脚本 中找到。 - ====== ================================================= - 二进制 https://github.com/openrisc/or1k-gcc/releases + ====== ========================================================== + 二进制 https://github.com/stffrdhrn/or1k-toolchain-build/releases 工具链 https://openrisc.io/software 构建 https://github.com/stffrdhrn/or1k-toolchain-build - ====== ================================================= + ====== ========================================================== 2) 构建 diff --git a/Documentation/translations/zh_TW/arch/openrisc/openrisc_port.rst b/Documentation/translations/zh_TW/arch/openrisc/openrisc_port.rst index 422fe9f7a3f2..a1e4517dc601 100644 --- a/Documentation/translations/zh_TW/arch/openrisc/openrisc_port.rst +++ b/Documentation/translations/zh_TW/arch/openrisc/openrisc_port.rst @@ -17,10 +17,10 @@ OpenRISC 1000系列(或1k)。 關於OpenRISC處理器和正在進行中的開發的信息: - ======= ============================= + ======= ============================== 網站 https://openrisc.io - 郵箱 openrisc@lists.librecores.org - ======= ============================= + 郵箱 linux-openrisc@vger.kernel.org + ======= ============================== --------------------------------------------------------------------- @@ -36,11 +36,11 @@ OpenRISC工具鏈和Linux的構建指南 工具鏈的構建指南可以在openrisc.io或Stafford的工具鏈構建和發佈腳本 中找到。 - ====== ================================================= - 二進制 https://github.com/openrisc/or1k-gcc/releases + ====== ========================================================== + 二進制 https://github.com/stffrdhrn/or1k-toolchain-build/releases 工具鏈 https://openrisc.io/software 構建 https://github.com/stffrdhrn/or1k-toolchain-build - ====== ================================================= + ====== ========================================================== 2) 構建 diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst index 1dabfc29be0d..7195a7f91107 100644 --- a/Documentation/userspace-api/mseal.rst +++ b/Documentation/userspace-api/mseal.rst @@ -27,7 +27,7 @@ SYSCALL ======= mseal syscall signature ----------------------- - ``int mseal(void \* addr, size_t len, unsigned long flags)`` + ``int mseal(void *addr, size_t len, unsigned long flags)`` **addr**/**len**: virtual memory address range. The address range set by **addr**/**len** must meet: diff --git a/Documentation/wmi/devices/msi-wmi-platform.rst b/Documentation/wmi/devices/msi-wmi-platform.rst index 31a136942892..73197b31926a 100644 --- a/Documentation/wmi/devices/msi-wmi-platform.rst +++ b/Documentation/wmi/devices/msi-wmi-platform.rst @@ -138,6 +138,10 @@ input data, the meaning of which depends on the subfeature being accessed. The output buffer contains a single byte which signals success or failure (``0x00`` on failure) and 31 bytes of output data, the meaning if which depends on the subfeature being accessed. +.. note:: + The ACPI control method responsible for handling the WMI method calls is not thread-safe. + This is a firmware bug that needs to be handled inside the driver itself. + WMI method Get_EC() ------------------- |