diff options
Diffstat (limited to 'Documentation/x86')
33 files changed, 2096 insertions, 284 deletions
diff --git a/Documentation/x86/amd-memory-encryption.rst b/Documentation/x86/amd-memory-encryption.rst index c48d452d0718..a1940ebe7be5 100644 --- a/Documentation/x86/amd-memory-encryption.rst +++ b/Documentation/x86/amd-memory-encryption.rst @@ -53,7 +53,7 @@ CPUID function 0x8000001f reports information related to SME:: system physical addresses, not guest physical addresses) -If support for SME is present, MSR 0xc00100010 (MSR_K8_SYSCFG) can be used to +If support for SME is present, MSR 0xc00100010 (MSR_AMD64_SYSCFG) can be used to determine if SME is enabled and/or to enable memory encryption:: 0xc0010010: @@ -79,7 +79,7 @@ The state of SME in the Linux kernel can be documented as follows: The CPU supports SME (determined through CPUID instruction). - Enabled: - Supported and bit 23 of MSR_K8_SYSCFG is set. + Supported and bit 23 of MSR_AMD64_SYSCFG is set. - Active: Supported, Enabled and the Linux kernel is actively applying @@ -89,7 +89,7 @@ The state of SME in the Linux kernel can be documented as follows: SME can also be enabled and activated in the BIOS. If SME is enabled and activated in the BIOS, then all memory accesses will be encrypted and it will not be necessary to activate the Linux memory encryption support. If the BIOS -merely enables SME (sets bit 23 of the MSR_K8_SYSCFG), then Linux can activate +merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG), then Linux can activate memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or by supplying mem_encrypt=on on the kernel command line. However, if BIOS does not enable SME, then Linux will not be able to activate memory encryption, even diff --git a/Documentation/x86/amd_hsmp.rst b/Documentation/x86/amd_hsmp.rst new file mode 100644 index 000000000000..440e4b645a1c --- /dev/null +++ b/Documentation/x86/amd_hsmp.rst @@ -0,0 +1,86 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================ +AMD HSMP interface +============================================ + +Newer Fam19h EPYC server line of processors from AMD support system +management functionality via HSMP (Host System Management Port). + +The Host System Management Port (HSMP) is an interface to provide +OS-level software with access to system management functions via a +set of mailbox registers. + +More details on the interface can be found in chapter +"7 Host System Management Port (HSMP)" of the family/model PPR +Eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip + +HSMP interface is supported on EPYC server CPU models only. + + +HSMP device +============================================ + +amd_hsmp driver under the drivers/platforms/x86/ creates miscdevice +/dev/hsmp to let user space programs run hsmp mailbox commands. + +$ ls -al /dev/hsmp +crw-r--r-- 1 root root 10, 123 Jan 21 21:41 /dev/hsmp + +Characteristics of the dev node: + * Write mode is used for running set/configure commands + * Read mode is used for running get/status monitor commands + +Access restrictions: + * Only root user is allowed to open the file in write mode. + * The file can be opened in read mode by all the users. + +In-kernel integration: + * Other subsystems in the kernel can use the exported transport + function hsmp_send_message(). + * Locking across callers is taken care by the driver. + + +An example +========== + +To access hsmp device from a C program. +First, you need to include the headers:: + + #include <linux/amd_hsmp.h> + +Which defines the supported messages/message IDs. + +Next thing, open the device file, as follows:: + + int file; + + file = open("/dev/hsmp", O_RDWR); + if (file < 0) { + /* ERROR HANDLING; you can check errno to see what went wrong */ + exit(1); + } + +The following IOCTL is defined: + +``ioctl(file, HSMP_IOCTL_CMD, struct hsmp_message *msg)`` + The argument is a pointer to a:: + + struct hsmp_message { + __u32 msg_id; /* Message ID */ + __u16 num_args; /* Number of input argument words in message */ + __u16 response_sz; /* Number of expected output/response words */ + __u32 args[HSMP_MAX_MSG_LEN]; /* argument/response buffer */ + __u16 sock_ind; /* socket number */ + }; + +The ioctl would return a non-zero on failure; you can read errno to see +what happened. The transaction returns 0 on success. + +More details on the interface and message definitions can be found in chapter +"7 Host System Management Port (HSMP)" of the respective family/model PPR +eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip + +User space C-APIs are made available by linking against the esmi library, +which is provided by the E-SMS project https://developer.amd.com/e-sms/. +See: https://github.com/amd/esmi_ib_library diff --git a/Documentation/x86/boot.rst b/Documentation/x86/boot.rst index c9c201596c3e..894a19897005 100644 --- a/Documentation/x86/boot.rst +++ b/Documentation/x86/boot.rst @@ -490,15 +490,11 @@ Protocol: 2.00+ kernel) to not write early messages that require accessing the display hardware directly. - Bit 6 (write): KEEP_SEGMENTS + Bit 6 (obsolete): KEEP_SEGMENTS Protocol: 2.07+ - - If 0, reload the segment registers in the 32bit entry point. - - If 1, do not reload the segment registers in the 32bit entry point. - - Assume that %cs %ds %ss %es are all set to flat segments with - a base of 0 (or the equivalent for their environment). + - This flag is obsolete. Bit 7 (write): CAN_USE_HEAP @@ -786,9 +782,9 @@ Protocol: 2.08+ uncompressed data should be determined using the standard magic numbers. The currently supported compression formats are gzip (magic numbers 1F 8B or 1F 9E), bzip2 (magic number 42 5A), LZMA - (magic number 5D 00), XZ (magic number FD 37), and LZ4 (magic number - 02 21). The uncompressed payload is currently always ELF (magic - number 7F 45 4C 46). + (magic number 5D 00), XZ (magic number FD 37), LZ4 (magic number + 02 21) and ZSTD (magic number 28 B5). The uncompressed payload is + currently always ELF (magic number 7F 45 4C 46). ============ ============== Field name: payload_length @@ -855,7 +851,7 @@ Protocol: 2.09+ struct setup_data { __u64 next = 0 or <addr_of_next_setup_data_struct>; __u32 type = SETUP_INDIRECT; - __u32 len = sizeof(setup_data); + __u32 len = sizeof(setup_indirect); __u8 data[sizeof(setup_indirect)] = struct setup_indirect { __u32 type = SETUP_INDIRECT | SETUP_E820_EXT; __u32 reserved = 0; @@ -1346,8 +1342,8 @@ follow:: In addition to read/modify/write the setup header of the struct boot_params as that of 16-bit boot protocol, the boot loader should -also fill the additional fields of the struct boot_params as that -described in zero-page.txt. +also fill the additional fields of the struct boot_params as +described in chapter Documentation/x86/zero-page.rst. After setting up the struct boot_params, the boot loader can load the 32/64-bit kernel in the same way as that of 16-bit boot protocol. @@ -1383,7 +1379,7 @@ can be calculated as follows:: In addition to read/modify/write the setup header of the struct boot_params as that of 16-bit boot protocol, the boot loader should also fill the additional fields of the struct boot_params as described -in zero-page.txt. +in chapter Documentation/x86/zero-page.rst. After setting up the struct boot_params, the boot loader can load 64-bit kernel in the same way as that of 16-bit boot protocol, but @@ -1403,8 +1399,8 @@ must have read/write permission; CS must be __BOOT_CS and DS, ES, SS must be __BOOT_DS; interrupt must be disabled; %rsi must hold the base address of the struct boot_params. -EFI Handover Protocol -===================== +EFI Handover Protocol (deprecated) +================================== This protocol allows boot loaders to defer initialisation to the EFI boot stub. The boot loader is required to load the kernel/initrd(s) @@ -1412,6 +1408,12 @@ from the boot media and jump to the EFI handover protocol entry point which is hdr->handover_offset bytes from the beginning of startup_{32,64}. +The boot loader MUST respect the kernel's PE/COFF metadata when it comes +to section alignment, the memory footprint of the executable image beyond +the size of the file itself, and any other aspect of the PE/COFF header +that may affect correct operation of the image as a PE/COFF binary in the +execution context provided by the EFI firmware. + The function prototype for the handover entry point looks like this:: efi_main(void *handle, efi_system_table_t *table, struct boot_params *bp) @@ -1423,9 +1425,18 @@ UEFI specification. 'bp' is the boot loader-allocated boot params. The boot loader *must* fill out the following fields in bp:: - - hdr.code32_start - hdr.cmd_line_ptr - hdr.ramdisk_image (if applicable) - hdr.ramdisk_size (if applicable) All other fields should be zero. + +NOTE: The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF + entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd + loading protocol (refer to [0] for an example of the bootloader side of + this), which removes the need for any knowledge on the part of the EFI + bootloader regarding the internal representation of boot_params or any + requirements/limitations regarding the placement of the command line + and ramdisk in memory, or the placement of the kernel image itself. + +[0] https://github.com/u-boot/u-boot/commit/ec80b4735a593961fe701cc3a5d717d4739b0fd0 diff --git a/Documentation/x86/booting-dt.rst b/Documentation/x86/booting-dt.rst new file mode 100644 index 000000000000..965a374071ab --- /dev/null +++ b/Documentation/x86/booting-dt.rst @@ -0,0 +1,21 @@ +.. SPDX-License-Identifier: GPL-2.0 + +DeviceTree Booting +------------------ + + There is one single 32bit entry point to the kernel at code32_start, + the decompressor (the real mode entry point goes to the same 32bit + entry point once it switched into protected mode). That entry point + supports one calling convention which is documented in + Documentation/x86/boot.rst + The physical pointer to the device-tree block is passed via setup_data + which requires at least boot protocol 2.09. + The type filed is defined as + + #define SETUP_DTB 2 + + This device-tree is used as an extension to the "boot page". As such it + does not parse / consider data which is already covered by the boot + page. This includes memory size, reserved ranges, command line arguments + or initrd address. It simply holds information which can not be retrieved + otherwise like interrupt routing or a list of devices behind an I2C bus. diff --git a/Documentation/x86/buslock.rst b/Documentation/x86/buslock.rst new file mode 100644 index 000000000000..7c051e714943 --- /dev/null +++ b/Documentation/x86/buslock.rst @@ -0,0 +1,126 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: <isonum.txt> + +=============================== +Bus lock detection and handling +=============================== + +:Copyright: |copy| 2021 Intel Corporation +:Authors: - Fenghua Yu <fenghua.yu@intel.com> + - Tony Luck <tony.luck@intel.com> + +Problem +======= + +A split lock is any atomic operation whose operand crosses two cache lines. +Since the operand spans two cache lines and the operation must be atomic, +the system locks the bus while the CPU accesses the two cache lines. + +A bus lock is acquired through either split locked access to writeback (WB) +memory or any locked access to non-WB memory. This is typically thousands of +cycles slower than an atomic operation within a cache line. It also disrupts +performance on other cores and brings the whole system to its knees. + +Detection +========= + +Intel processors may support either or both of the following hardware +mechanisms to detect split locks and bus locks. + +#AC exception for split lock detection +-------------------------------------- + +Beginning with the Tremont Atom CPU split lock operations may raise an +Alignment Check (#AC) exception when a split lock operation is attemped. + +#DB exception for bus lock detection +------------------------------------ + +Some CPUs have the ability to notify the kernel by an #DB trap after a user +instruction acquires a bus lock and is executed. This allows the kernel to +terminate the application or to enforce throttling. + +Software handling +================= + +The kernel #AC and #DB handlers handle bus lock based on the kernel +parameter "split_lock_detect". Here is a summary of different options: + ++------------------+----------------------------+-----------------------+ +|split_lock_detect=|#AC for split lock |#DB for bus lock | ++------------------+----------------------------+-----------------------+ +|off |Do nothing |Do nothing | ++------------------+----------------------------+-----------------------+ +|warn |Kernel OOPs |Warn once per task and | +|(default) |Warn once per task and |and continues to run. | +| |disable future checking | | +| |When both features are | | +| |supported, warn in #AC | | ++------------------+----------------------------+-----------------------+ +|fatal |Kernel OOPs |Send SIGBUS to user. | +| |Send SIGBUS to user | | +| |When both features are | | +| |supported, fatal in #AC | | ++------------------+----------------------------+-----------------------+ +|ratelimit:N |Do nothing |Limit bus lock rate to | +|(0 < N <= 1000) | |N bus locks per second | +| | |system wide and warn on| +| | |bus locks. | ++------------------+----------------------------+-----------------------+ + +Usages +====== + +Detecting and handling bus lock may find usages in various areas: + +It is critical for real time system designers who build consolidated real +time systems. These systems run hard real time code on some cores and run +"untrusted" user processes on other cores. The hard real time cannot afford +to have any bus lock from the untrusted processes to hurt real time +performance. To date the designers have been unable to deploy these +solutions as they have no way to prevent the "untrusted" user code from +generating split lock and bus lock to block the hard real time code to +access memory during bus locking. + +It's also useful for general computing to prevent guests or user +applications from slowing down the overall system by executing instructions +with bus lock. + + +Guidance +======== +off +--- + +Disable checking for split lock and bus lock. This option can be useful if +there are legacy applications that trigger these events at a low rate so +that mitigation is not needed. + +warn +---- + +A warning is emitted when a bus lock is detected which allows to identify +the offending application. This is the default behavior. + +fatal +----- + +In this case, the bus lock is not tolerated and the process is killed. + +ratelimit +--------- + +A system wide bus lock rate limit N is specified where 0 < N <= 1000. This +allows a bus lock rate up to N bus locks per second. When the bus lock rate +is exceeded then any task which is caught via the buslock #DB exception is +throttled by enforced sleeps until the rate goes under the limit again. + +This is an effective mitigation in cases where a minimal impact can be +tolerated, but an eventual Denial of Service attack has to be prevented. It +allows to identify the offending processes and analyze whether they are +malicious or just badly written. + +Selecting a rate limit of 1000 allows the bus to be locked for up to about +seven million cycles each second (assuming 7000 cycles for each bus +lock). On a 2 GHz processor that would be about 0.35% system slowdown. diff --git a/Documentation/x86/cpuinfo.rst b/Documentation/x86/cpuinfo.rst new file mode 100644 index 000000000000..08246e8ac835 --- /dev/null +++ b/Documentation/x86/cpuinfo.rst @@ -0,0 +1,154 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +x86 Feature Flags +================= + +Introduction +============ + +On x86, flags appearing in /proc/cpuinfo have an X86_FEATURE definition +in arch/x86/include/asm/cpufeatures.h. If the kernel cares about a feature +or KVM want to expose the feature to a KVM guest, it can and should have +an X86_FEATURE_* defined. These flags represent hardware features as +well as software features. + +If users want to know if a feature is available on a given system, they +try to find the flag in /proc/cpuinfo. If a given flag is present, it +means that the kernel supports it and is currently making it available. +If such flag represents a hardware feature, it also means that the +hardware supports it. + +If the expected flag does not appear in /proc/cpuinfo, things are murkier. +Users need to find out the reason why the flag is missing and find the way +how to enable it, which is not always easy. There are several factors that +can explain missing flags: the expected feature failed to enable, the feature +is missing in hardware, platform firmware did not enable it, the feature is +disabled at build or run time, an old kernel is in use, or the kernel does +not support the feature and thus has not enabled it. In general, /proc/cpuinfo +shows features which the kernel supports. For a full list of CPUID flags +which the CPU supports, use tools/arch/x86/kcpuid. + +How are feature flags created? +============================== + +a: Feature flags can be derived from the contents of CPUID leaves. +------------------------------------------------------------------ +These feature definitions are organized mirroring the layout of CPUID +leaves and grouped in words with offsets as mapped in enum cpuid_leafs +in cpufeatures.h (see arch/x86/include/asm/cpufeatures.h for details). +If a feature is defined with a X86_FEATURE_<name> definition in +cpufeatures.h, and if it is detected at run time, the flags will be +displayed accordingly in /proc/cpuinfo. For example, the flag "avx2" +comes from X86_FEATURE_AVX2 in cpufeatures.h. + +b: Flags can be from scattered CPUID-based features. +---------------------------------------------------- +Hardware features enumerated in sparsely populated CPUID leaves get +software-defined values. Still, CPUID needs to be queried to determine +if a given feature is present. This is done in init_scattered_cpuid_features(). +For instance, X86_FEATURE_CQM_LLC is defined as 11*32 + 0 and its presence is +checked at runtime in the respective CPUID leaf [EAX=f, ECX=0] bit EDX[1]. + +The intent of scattering CPUID leaves is to not bloat struct +cpuinfo_x86.x86_capability[] unnecessarily. For instance, the CPUID leaf +[EAX=7, ECX=0] has 30 features and is dense, but the CPUID leaf [EAX=7, EAX=1] +has only one feature and would waste 31 bits of space in the x86_capability[] +array. Since there is a struct cpuinfo_x86 for each possible CPU, the wasted +memory is not trivial. + +c: Flags can be created synthetically under certain conditions for hardware features. +------------------------------------------------------------------------------------- +Examples of conditions include whether certain features are present in +MSR_IA32_CORE_CAPS or specific CPU models are identified. If the needed +conditions are met, the features are enabled by the set_cpu_cap or +setup_force_cpu_cap macros. For example, if bit 5 is set in MSR_IA32_CORE_CAPS, +the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and +"split_lock_detect" will be displayed. The flag "ring3mwait" will be +displayed only when running on INTEL_FAM6_XEON_PHI_[KNL|KNM] processors. + +d: Flags can represent purely software features. +------------------------------------------------ +These flags do not represent hardware features. Instead, they represent a +software feature implemented in the kernel. For example, Kernel Page Table +Isolation is purely software feature and its feature flag X86_FEATURE_PTI is +also defined in cpufeatures.h. + +Naming of Flags +=============== + +The script arch/x86/kernel/cpu/mkcapflags.sh processes the +#define X86_FEATURE_<name> from cpufeatures.h and generates the +x86_cap/bug_flags[] arrays in kernel/cpu/capflags.c. The names in the +resulting x86_cap/bug_flags[] are used to populate /proc/cpuinfo. The naming +of flags in the x86_cap/bug_flags[] are as follows: + +a: The name of the flag is from the string in X86_FEATURE_<name> by default. +---------------------------------------------------------------------------- +By default, the flag <name> in /proc/cpuinfo is extracted from the respective +X86_FEATURE_<name> in cpufeatures.h. For example, the flag "avx2" is from +X86_FEATURE_AVX2. + +b: The naming can be overridden. +-------------------------------- +If the comment on the line for the #define X86_FEATURE_* starts with a +double-quote character (""), the string inside the double-quote characters +will be the name of the flags. For example, the flag "sse4_1" comes from +the comment "sse4_1" following the X86_FEATURE_XMM4_1 definition. + +There are situations in which overriding the displayed name of the flag is +needed. For instance, /proc/cpuinfo is a userspace interface and must remain +constant. If, for some reason, the naming of X86_FEATURE_<name> changes, one +shall override the new naming with the name already used in /proc/cpuinfo. + +c: The naming override can be "", which means it will not appear in /proc/cpuinfo. +---------------------------------------------------------------------------------- +The feature shall be omitted from /proc/cpuinfo if it does not make sense for +the feature to be exposed to userspace. For example, X86_FEATURE_ALWAYS is +defined in cpufeatures.h but that flag is an internal kernel feature used +in the alternative runtime patching functionality. So, its name is overridden +with "". Its flag will not appear in /proc/cpuinfo. + +Flags are missing when one or more of these happen +================================================== + +a: The hardware does not enumerate support for it. +-------------------------------------------------- +For example, when a new kernel is running on old hardware or the feature is +not enabled by boot firmware. Even if the hardware is new, there might be a +problem enabling the feature at run time, the flag will not be displayed. + +b: The kernel does not know about the flag. +------------------------------------------- +For example, when an old kernel is running on new hardware. + +c: The kernel disabled support for it at compile-time. +------------------------------------------------------ +For example, if 5-level-paging is not enabled when building (i.e., +CONFIG_X86_5LEVEL is not selected) the flag "la57" will not show up [#f1]_. +Even though the feature will still be detected via CPUID, the kernel disables +it by clearing via setup_clear_cpu_cap(X86_FEATURE_LA57). + +d: The feature is disabled at boot-time. +---------------------------------------- +A feature can be disabled either using a command-line parameter or because +it failed to be enabled. The command-line parameter clearcpuid= can be used +to disable features using the feature number as defined in +/arch/x86/include/asm/cpufeatures.h. For instance, User Mode Instruction +Protection can be disabled using clearcpuid=514. The number 514 is calculated +from #define X86_FEATURE_UMIP (16*32 + 2). + +In addition, there exists a variety of custom command-line parameters that +disable specific features. The list of parameters includes, but is not limited +to, nofsgsbase, nosgx, noxsave, etc. 5-level paging can also be disabled using +"no5lvl". + +e: The feature was known to be non-functional. +---------------------------------------------- +The feature was known to be non-functional because a dependency was +missing at runtime. For example, AVX flags will not show up if XSAVE feature +is disabled since they depend on XSAVE feature. Another example would be broken +CPUs and them missing microcode patches. Due to that, the kernel decides not to +enable a feature. + +.. [#f1] 5-level paging uses linear address of 57 bits. diff --git a/Documentation/x86/earlyprintk.rst b/Documentation/x86/earlyprintk.rst index 11307378acf0..51ef11e8f725 100644 --- a/Documentation/x86/earlyprintk.rst +++ b/Documentation/x86/earlyprintk.rst @@ -8,7 +8,7 @@ Mini-HOWTO for using the earlyprintk=dbgp boot option with a USB2 Debug port key and a debug cable, on x86 systems. You need two computers, the 'USB debug key' special gadget and -and two USB cables, connected like this:: +two USB cables, connected like this:: [host/target] <-------> [USB debug key] <-------> [client/console] diff --git a/Documentation/x86/elf_auxvec.rst b/Documentation/x86/elf_auxvec.rst new file mode 100644 index 000000000000..18e4744717f9 --- /dev/null +++ b/Documentation/x86/elf_auxvec.rst @@ -0,0 +1,53 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +x86-specific ELF Auxiliary Vectors +================================== + +This document describes the semantics of the x86 auxiliary vectors. + +Introduction +============ + +ELF Auxiliary vectors enable the kernel to efficiently provide +configuration-specific parameters to userspace. In this example, a program +allocates an alternate stack based on the kernel-provided size:: + + #include <sys/auxv.h> + #include <elf.h> + #include <signal.h> + #include <stdlib.h> + #include <assert.h> + #include <err.h> + + #ifndef AT_MINSIGSTKSZ + #define AT_MINSIGSTKSZ 51 + #endif + + .... + stack_t ss; + + ss.ss_sp = malloc(ss.ss_size); + assert(ss.ss_sp); + + ss.ss_size = getauxval(AT_MINSIGSTKSZ) + SIGSTKSZ; + ss.ss_flags = 0; + + if (sigaltstack(&ss, NULL)) + err(1, "sigaltstack"); + + +The exposed auxiliary vectors +============================= + +AT_SYSINFO is used for locating the vsyscall entry point. It is not +exported on 64-bit mode. + +AT_SYSINFO_EHDR is the start address of the page containing the vDSO. + +AT_MINSIGSTKSZ denotes the minimum stack size required by the kernel to +deliver a signal to user-space. AT_MINSIGSTKSZ comprehends the space +consumed by the kernel to accommodate the user context for the current +hardware configuration. It does not comprehend subsequent user-space stack +consumption, which must be added by the user. (e.g. Above, user-space adds +SIGSTKSZ to AT_MINSIGSTKSZ.) diff --git a/Documentation/x86/entry_64.rst b/Documentation/x86/entry_64.rst index a48b3f6ebbe8..0afdce3c06f4 100644 --- a/Documentation/x86/entry_64.rst +++ b/Documentation/x86/entry_64.rst @@ -8,7 +8,7 @@ This file documents some of the kernel entries in arch/x86/entry/entry_64.S. A lot of this explanation is adapted from an email from Ingo Molnar: -http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu> +https://lore.kernel.org/r/20110529191055.GC9835%40elte.hu The x86 architecture has quite a few different ways to jump into kernel code. Most of these entry points are registered in @@ -33,8 +33,8 @@ Some of these entries are: - interrupt: An array of entries. Every IDT vector that doesn't explicitly point somewhere else gets set to the corresponding value in interrupts. These point to a whole array of - magically-generated functions that make their way to do_IRQ with - the interrupt number as a parameter. + magically-generated functions that make their way to common_interrupt() + with the interrupt number as a parameter. - APIC interrupts: Various special-purpose interrupts for things like TLB shootdown. diff --git a/Documentation/x86/exception-tables.rst b/Documentation/x86/exception-tables.rst index ed6d4b0cf62c..efde1fef4fbd 100644 --- a/Documentation/x86/exception-tables.rst +++ b/Documentation/x86/exception-tables.rst @@ -32,14 +32,14 @@ Whenever the kernel tries to access an address that is currently not accessible, the CPU generates a page fault exception and calls the page fault handler:: - void do_page_fault(struct pt_regs *regs, unsigned long error_code) + void exc_page_fault(struct pt_regs *regs, unsigned long error_code) in arch/x86/mm/fault.c. The parameters on the stack are set up by the low level assembly glue in arch/x86/entry/entry_32.S. The parameter regs is a pointer to the saved registers on the stack, error_code contains a reason code for the exception. -do_page_fault first obtains the unaccessible address from the CPU +exc_page_fault() first obtains the inaccessible address from the CPU control register CR2. If the address is within the virtual address space of the process, the fault probably occurred, because the page was not swapped in, write protected or something similar. However, @@ -57,10 +57,10 @@ Where does fixup point to? Since we jump to the contents of fixup, fixup obviously points to executable code. This code is hidden inside the user access macros. -I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h +I have picked the get_user() macro defined in arch/x86/include/asm/uaccess.h as an example. The definition is somewhat hard to follow, so let's peek at the code generated by the preprocessor and the compiler. I selected -the get_user call in drivers/char/sysrq.c for a detailed examination. +the get_user() call in drivers/char/sysrq.c for a detailed examination. The original code in sysrq.c line 587:: @@ -257,6 +257,9 @@ the fault, in our case the actual value is c0199ff5: the original assembly code: > 3: movl $-14,%eax and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax +If the fixup was able to handle the exception, control flow may be returned +to the instruction after the one that triggered the fault, ie. local label 2b. + The assembly code:: > .section __ex_table,"a" @@ -278,12 +281,15 @@ vma occurs? > c017e7a5 <do_con_write+e1> movb (%ebx),%dl #. MMU generates exception -#. CPU calls do_page_fault -#. do page fault calls search_exception_table (regs->eip == c017e7a5); -#. search_exception_table looks up the address c017e7a5 in the +#. CPU calls exc_page_fault() +#. exc_page_fault() calls do_user_addr_fault() +#. do_user_addr_fault() calls kernelmode_fixup_or_oops() +#. kernelmode_fixup_or_oops() calls fixup_exception() (regs->eip == c017e7a5); +#. fixup_exception() calls search_exception_tables() +#. search_exception_tables() looks up the address c017e7a5 in the exception table (i.e. the contents of the ELF section __ex_table) and returns the address of the associated fault handle code c0199ff5. -#. do_page_fault modifies its own return address to point to the fault +#. fixup_exception() modifies its own return address to point to the fault handle code and returns. #. execution continues in the fault handling code. #. a) EAX becomes -EFAULT (== -14) @@ -295,9 +301,9 @@ The steps 8a to 8c in a certain way emulate the faulting instruction. That's it, mostly. If you look at our example, you might ask why we set EAX to -EFAULT in the exception handler code. Well, the -get_user macro actually returns a value: 0, if the user access was +get_user() macro actually returns a value: 0, if the user access was successful, -EFAULT on failure. Our original code did not test this -return value, however the inline assembly code in get_user tries to +return value, however the inline assembly code in get_user() tries to return -EFAULT. GCC selected EAX to return this value. NOTE: @@ -337,10 +343,15 @@ pointer which points to one of: entry->insn. It is used to distinguish page faults from machine check. -3) ``int ex_handler_ext(const struct exception_table_entry *fixup)`` - This case is used for uaccess_err ... we need to set a flag - in the task structure. Before the handler functions existed this - case was handled by adding a large offset to the fixup to tag - it as special. - More functions can easily be added. + +CONFIG_BUILDTIME_TABLE_SORT allows the __ex_table section to be sorted post +link of the kernel image, via a host utility scripts/sorttable. It will set the +symbol main_extable_sort_needed to 0, avoiding sorting the __ex_table section +at boot time. With the exception table sorted, at runtime when an exception +occurs we can quickly lookup the __ex_table entry via binary search. + +This is not just a boot time optimization, some architectures require this +table to be sorted in order to handle exceptions relatively early in the boot +process. For example, i386 makes use of this form of exception handling before +paging support is even enabled! diff --git a/Documentation/x86/features.rst b/Documentation/x86/features.rst new file mode 100644 index 000000000000..b663f15053ce --- /dev/null +++ b/Documentation/x86/features.rst @@ -0,0 +1,3 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. kernel-feat:: $srctree/Documentation/features x86 diff --git a/Documentation/x86/ifs.rst b/Documentation/x86/ifs.rst new file mode 100644 index 000000000000..97abb696a680 --- /dev/null +++ b/Documentation/x86/ifs.rst @@ -0,0 +1,2 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. kernel-doc:: drivers/platform/x86/intel/ifs/ifs.h diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 265d9e9a093b..c73d133fd37c 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -9,6 +9,8 @@ x86-specific Documentation :numbered: boot + booting-dt + cpuinfo topology exception-tables kernel-stacks @@ -19,14 +21,24 @@ x86-specific Documentation tlb mtrr pat - intel-iommu + intel-hfi + iommu intel_txt amd-memory-encryption + amd_hsmp + tdx pti mds microcode - resctrl_ui + resctrl tsx_async_abort + buslock usb-legacy-support i386/index x86_64/index + ifs + sva + sgx + features + elf_auxvec + xstate diff --git a/Documentation/x86/intel-hfi.rst b/Documentation/x86/intel-hfi.rst new file mode 100644 index 000000000000..49dea58ea4fb --- /dev/null +++ b/Documentation/x86/intel-hfi.rst @@ -0,0 +1,72 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================ +Hardware-Feedback Interface for scheduling on Intel Hardware +============================================================ + +Overview +-------- + +Intel has described the Hardware Feedback Interface (HFI) in the Intel 64 and +IA-32 Architectures Software Developer's Manual (Intel SDM) Volume 3 Section +14.6 [1]_. + +The HFI gives the operating system a performance and energy efficiency +capability data for each CPU in the system. Linux can use the information from +the HFI to influence task placement decisions. + +The Hardware Feedback Interface +------------------------------- + +The Hardware Feedback Interface provides to the operating system information +about the performance and energy efficiency of each CPU in the system. Each +capability is given as a unit-less quantity in the range [0-255]. Higher values +indicate higher capability. Energy efficiency and performance are reported in +separate capabilities. Even though on some systems these two metrics may be +related, they are specified as independent capabilities in the Intel SDM. + +These capabilities may change at runtime as a result of changes in the +operating conditions of the system or the action of external factors. The rate +at which these capabilities are updated is specific to each processor model. On +some models, capabilities are set at boot time and never change. On others, +capabilities may change every tens of milliseconds. For instance, a remote +mechanism may be used to lower Thermal Design Power. Such change can be +reflected in the HFI. Likewise, if the system needs to be throttled due to +excessive heat, the HFI may reflect reduced performance on specific CPUs. + +The kernel or a userspace policy daemon can use these capabilities to modify +task placement decisions. For instance, if either the performance or energy +capabilities of a given logical processor becomes zero, it is an indication that +the hardware recommends to the operating system to not schedule any tasks on +that processor for performance or energy efficiency reasons, respectively. + +Implementation details for Linux +-------------------------------- + +The infrastructure to handle thermal event interrupts has two parts. In the +Local Vector Table of a CPU's local APIC, there exists a register for the +Thermal Monitor Register. This register controls how interrupts are delivered +to a CPU when the thermal monitor generates and interrupt. Further details +can be found in the Intel SDM Vol. 3 Section 10.5 [1]_. + +The thermal monitor may generate interrupts per CPU or per package. The HFI +generates package-level interrupts. This monitor is configured and initialized +via a set of machine-specific registers. Specifically, the HFI interrupt and +status are controlled via designated bits in the IA32_PACKAGE_THERM_INTERRUPT +and IA32_PACKAGE_THERM_STATUS registers, respectively. There exists one HFI +table per package. Further details can be found in the Intel SDM Vol. 3 +Section 14.9 [1]_. + +The hardware issues an HFI interrupt after updating the HFI table and is ready +for the operating system to consume it. CPUs receive such interrupt via the +thermal entry in the Local APIC's Local Vector Table. + +When servicing such interrupt, the HFI driver parses the updated table and +relays the update to userspace using the thermal notification framework. Given +that there may be many HFI updates every second, the updates relayed to +userspace are throttled at a rate of CONFIG_HZ jiffies. + +References +---------- + +.. [1] https://www.intel.com/sdm diff --git a/Documentation/x86/intel-iommu.rst b/Documentation/x86/intel-iommu.rst deleted file mode 100644 index 9dae6b47e398..000000000000 --- a/Documentation/x86/intel-iommu.rst +++ /dev/null @@ -1,114 +0,0 @@ -=================== -Linux IOMMU Support -=================== - -The architecture spec can be obtained from the below location. - -http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf - -This guide gives a quick cheat sheet for some basic understanding. - -Some Keywords - -- DMAR - DMA remapping -- DRHD - DMA Remapping Hardware Unit Definition -- RMRR - Reserved memory Region Reporting Structure -- ZLR - Zero length reads from PCI devices -- IOVA - IO Virtual address. - -Basic stuff ------------ - -ACPI enumerates and lists the different DMA engines in the platform, and -device scope relationships between PCI devices and which DMA engine controls -them. - -What is RMRR? -------------- - -There are some devices the BIOS controls, for e.g USB devices to perform -PS2 emulation. The regions of memory used for these devices are marked -reserved in the e820 map. When we turn on DMA translation, DMA to those -regions will fail. Hence BIOS uses RMRR to specify these regions along with -devices that need to access these regions. OS is expected to setup -unity mappings for these regions for these devices to access these regions. - -How is IOVA generated? ----------------------- - -Well behaved drivers call pci_map_*() calls before sending command to device -that needs to perform DMA. Once DMA is completed and mapping is no longer -required, device performs a pci_unmap_*() calls to unmap the region. - -The Intel IOMMU driver allocates a virtual address per domain. Each PCIE -device has its own domain (hence protection). Devices under p2p bridges -share the virtual address with all devices under the p2p bridge due to -transaction id aliasing for p2p bridges. - -IOVA generation is pretty generic. We used the same technique as vmalloc() -but these are not global address spaces, but separate for each domain. -Different DMA engines may support different number of domains. - -We also allocate guard pages with each mapping, so we can attempt to catch -any overflow that might happen. - - -Graphics Problems? ------------------- -If you encounter issues with graphics devices, you can try adding -option intel_iommu=igfx_off to turn off the integrated graphics engine. -If this fixes anything, please ensure you file a bug reporting the problem. - -Some exceptions to IOVA ------------------------ -Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff). -The same is true for peer to peer transactions. Hence we reserve the -address from PCI MMIO ranges so they are not allocated for IOVA addresses. - - -Fault reporting ---------------- -When errors are reported, the DMA engine signals via an interrupt. The fault -reason and device that caused it with fault reason is printed on console. - -See below for sample. - - -Boot Message Sample -------------------- - -Something like this gets printed indicating presence of DMAR tables -in ACPI. - -ACPI: DMAR (v001 A M I OEMDMAR 0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0 - -When DMAR is being processed and initialized by ACPI, prints DMAR locations -and any RMRR's processed:: - - ACPI DMAR:Host address width 36 - ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000 - ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000 - ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000 - ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff - ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff - -When DMAR is enabled for use, you will notice.. - -PCI-DMA: Using DMAR IOMMU - -Fault reporting ---------------- - -:: - - DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 - DMAR:[fault reason 05] PTE Write access is not set - DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 - DMAR:[fault reason 05] PTE Write access is not set - -TBD ----- - -- For compatibility testing, could use unity map domain for all devices, just - provide a 1-1 for all useful memory under a single domain for all devices. -- API for paravirt ops for abstracting functionality for VMM folks. diff --git a/Documentation/x86/iommu.rst b/Documentation/x86/iommu.rst new file mode 100644 index 000000000000..42c7a6faa39a --- /dev/null +++ b/Documentation/x86/iommu.rst @@ -0,0 +1,151 @@ +================= +x86 IOMMU Support +================= + +The architecture specs can be obtained from the below locations. + +- Intel: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf +- AMD: https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf + +This guide gives a quick cheat sheet for some basic understanding. + +Basic stuff +----------- + +ACPI enumerates and lists the different IOMMUs on the platform, and +device scope relationships between devices and which IOMMU controls +them. + +Some ACPI Keywords: + +- DMAR - Intel DMA Remapping table +- DRHD - Intel DMA Remapping Hardware Unit Definition +- RMRR - Intel Reserved Memory Region Reporting Structure +- IVRS - AMD I/O Virtualization Reporting Structure +- IVDB - AMD I/O Virtualization Definition Block +- IVHD - AMD I/O Virtualization Hardware Definition + +What is Intel RMRR? +^^^^^^^^^^^^^^^^^^^ + +There are some devices the BIOS controls, for e.g USB devices to perform +PS2 emulation. The regions of memory used for these devices are marked +reserved in the e820 map. When we turn on DMA translation, DMA to those +regions will fail. Hence BIOS uses RMRR to specify these regions along with +devices that need to access these regions. OS is expected to setup +unity mappings for these regions for these devices to access these regions. + +What is AMD IVRS? +^^^^^^^^^^^^^^^^^ + +The architecture defines an ACPI-compatible data structure called an I/O +Virtualization Reporting Structure (IVRS) that is used to convey information +related to I/O virtualization to system software. The IVRS describes the +configuration and capabilities of the IOMMUs contained in the platform as +well as information about the devices that each IOMMU virtualizes. + +The IVRS provides information about the following: + +- IOMMUs present in the platform including their capabilities and proper configuration +- System I/O topology relevant to each IOMMU +- Peripheral devices that cannot be otherwise enumerated +- Memory regions used by SMI/SMM, platform firmware, and platform hardware. These are generally exclusion ranges to be configured by system software. + +How is an I/O Virtual Address (IOVA) generated? +----------------------------------------------- + +Well behaved drivers call dma_map_*() calls before sending command to device +that needs to perform DMA. Once DMA is completed and mapping is no longer +required, driver performs dma_unmap_*() calls to unmap the region. + +Intel Specific Notes +-------------------- + +Graphics Problems? +^^^^^^^^^^^^^^^^^^ + +If you encounter issues with graphics devices, you can try adding +option intel_iommu=igfx_off to turn off the integrated graphics engine. +If this fixes anything, please ensure you file a bug reporting the problem. + +Some exceptions to IOVA +^^^^^^^^^^^^^^^^^^^^^^^ + +Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff). +The same is true for peer to peer transactions. Hence we reserve the +address from PCI MMIO ranges so they are not allocated for IOVA addresses. + +AMD Specific Notes +------------------ + +Graphics Problems? +^^^^^^^^^^^^^^^^^^ + +If you encounter issues with integrated graphics devices, you can try adding +option iommu=pt to the kernel command line use a 1:1 mapping for the IOMMU. If +this fixes anything, please ensure you file a bug reporting the problem. + +Fault reporting +--------------- +When errors are reported, the IOMMU signals via an interrupt. The fault +reason and device that caused it is printed on the console. + + +Kernel Log Samples +------------------ + +Intel Boot Messages +^^^^^^^^^^^^^^^^^^^ + +Something like this gets printed indicating presence of DMAR tables +in ACPI: + +:: + + ACPI: DMAR (v001 A M I OEMDMAR 0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0 + +When DMAR is being processed and initialized by ACPI, prints DMAR locations +and any RMRR's processed: + +:: + + ACPI DMAR:Host address width 36 + ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000 + ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000 + ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000 + ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff + ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff + +When DMAR is enabled for use, you will notice: + +:: + + PCI-DMA: Using DMAR IOMMU + +Intel Fault reporting +^^^^^^^^^^^^^^^^^^^^^ + +:: + + DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 + DMAR:[fault reason 05] PTE Write access is not set + DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 + DMAR:[fault reason 05] PTE Write access is not set + +AMD Boot Messages +^^^^^^^^^^^^^^^^^ + +Something like this gets printed indicating presence of the IOMMU: + +:: + + iommu: Default domain type: Translated + iommu: DMA domain TLB invalidation policy: lazy mode + +AMD Fault reporting +^^^^^^^^^^^^^^^^^^^ + +:: + + AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0007 address=0xffffc02000 flags=0x0000] + AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x0007 address=0xffffc02000 flags=0x0000] diff --git a/Documentation/x86/microcode.rst b/Documentation/x86/microcode.rst index a320d37982ed..b627c6f36bcf 100644 --- a/Documentation/x86/microcode.rst +++ b/Documentation/x86/microcode.rst @@ -6,6 +6,7 @@ The Linux Microcode Loader :Authors: - Fenghua Yu <fenghua.yu@intel.com> - Borislav Petkov <bp@suse.de> + - Ashok Raj <ashok.raj@intel.com> The kernel has a x86 microcode loading facility which is supposed to provide microcode loading methods in the OS. Potential use cases are @@ -92,15 +93,8 @@ vendor's site. Late loading ============ -There are two legacy user space interfaces to load microcode, either through -/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file -in sysfs. - -The /dev/cpu/microcode method is deprecated because it needs a special -userspace tool for that. - -The easier method is simply installing the microcode packages your distro -supplies and running:: +You simply install the microcode packages your distro supplies and +run:: # echo 1 > /sys/devices/system/cpu/microcode/reload @@ -110,6 +104,110 @@ The loading mechanism looks for microcode blobs in /lib/firmware/{intel-ucode,amd-ucode}. The default distro installation packages already put them there. +Since kernel 5.19, late loading is not enabled by default. + +The /dev/cpu/microcode method has been removed in 5.19. + +Why is late loading dangerous? +============================== + +Synchronizing all CPUs +---------------------- + +The microcode engine which receives the microcode update is shared +between the two logical threads in a SMT system. Therefore, when +the update is executed on one SMT thread of the core, the sibling +"automatically" gets the update. + +Since the microcode can "simulate" MSRs too, while the microcode update +is in progress, those simulated MSRs transiently cease to exist. This +can result in unpredictable results if the SMT sibling thread happens to +be in the middle of an access to such an MSR. The usual observation is +that such MSR accesses cause #GPs to be raised to signal that former are +not present. + +The disappearing MSRs are just one common issue which is being observed. +Any other instruction that's being patched and gets concurrently +executed by the other SMT sibling, can also result in similar, +unpredictable behavior. + +To eliminate this case, a stop_machine()-based CPU synchronization was +introduced as a way to guarantee that all logical CPUs will not execute +any code but just wait in a spin loop, polling an atomic variable. + +While this took care of device or external interrupts, IPIs including +LVT ones, such as CMCI etc, it cannot address other special interrupts +that can't be shut off. Those are Machine Check (#MC), System Management +(#SMI) and Non-Maskable interrupts (#NMI). + +Machine Checks +-------------- + +Machine Checks (#MC) are non-maskable. There are two kinds of MCEs. +Fatal un-recoverable MCEs and recoverable MCEs. While un-recoverable +errors are fatal, recoverable errors can also happen in kernel context +are also treated as fatal by the kernel. + +On certain Intel machines, MCEs are also broadcast to all threads in a +system. If one thread is in the middle of executing WRMSR, a MCE will be +taken at the end of the flow. Either way, they will wait for the thread +performing the wrmsr(0x79) to rendezvous in the MCE handler and shutdown +eventually if any of the threads in the system fail to check in to the +MCE rendezvous. + +To be paranoid and get predictable behavior, the OS can choose to set +MCG_STATUS.MCIP. Since MCEs can be at most one in a system, if an +MCE was signaled, the above condition will promote to a system reset +automatically. OS can turn off MCIP at the end of the update for that +core. + +System Management Interrupt +--------------------------- + +SMIs are also broadcast to all CPUs in the platform. Microcode update +requests exclusive access to the core before writing to MSR 0x79. So if +it does happen such that, one thread is in WRMSR flow, and the 2nd got +an SMI, that thread will be stopped in the first instruction in the SMI +handler. + +Since the secondary thread is stopped in the first instruction in SMI, +there is very little chance that it would be in the middle of executing +an instruction being patched. Plus OS has no way to stop SMIs from +happening. + +Non-Maskable Interrupts +----------------------- + +When thread0 of a core is doing the microcode update, if thread1 is +pulled into NMI, that can cause unpredictable behavior due to the +reasons above. + +OS can choose a variety of methods to avoid running into this situation. + + +Is the microcode suitable for late loading? +------------------------------------------- + +Late loading is done when the system is fully operational and running +real workloads. Late loading behavior depends on what the base patch on +the CPU is before upgrading to the new patch. + +This is true for Intel CPUs. + +Consider, for example, a CPU has patch level 1 and the update is to +patch level 3. + +Between patch1 and patch3, patch2 might have deprecated a software-visible +feature. + +This is unacceptable if software is even potentially using that feature. +For instance, say MSR_X is no longer available after an update, +accessing that MSR will cause a #GP fault. + +Basically there is no way to declare a new microcode update suitable +for late-loading. This is another one of the problems that caused late +loading to be not enabled by default. + Builtin microcode ================= diff --git a/Documentation/x86/mtrr.rst b/Documentation/x86/mtrr.rst index c5b695d75349..9f0b1851771a 100644 --- a/Documentation/x86/mtrr.rst +++ b/Documentation/x86/mtrr.rst @@ -28,7 +28,7 @@ are aligned with platform MTRR setup. If MTRRs are only set up by the platform firmware code though and the OS does not make any specific MTRR mapping requests mtrr_type_lookup() should always return MTRR_TYPE_INVALID. -For details refer to :doc:`pat`. +For details refer to Documentation/x86/pat.rst. .. tip:: On Intel P6 family processors (Pentium Pro, Pentium II and later) diff --git a/Documentation/x86/orc-unwinder.rst b/Documentation/x86/orc-unwinder.rst index d811576c1f3e..cdb257015bd9 100644 --- a/Documentation/x86/orc-unwinder.rst +++ b/Documentation/x86/orc-unwinder.rst @@ -140,7 +140,7 @@ Unwinder implementation details Objtool generates the ORC data by integrating with the compile-time stack metadata validation feature, which is described in detail in -tools/objtool/Documentation/stack-validation.txt. After analyzing all +tools/objtool/Documentation/objtool.txt. After analyzing all the code paths of a .o file, it creates an array of orc_entry structs, and a parallel array of instruction addresses associated with those structs, and writes them to the .orc_unwind and .orc_unwind_ip sections @@ -177,6 +177,6 @@ brutal, unyielding efficiency. ORC stands for Oops Rewind Capability. -.. [1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de -.. [2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz +.. [1] https://lore.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de +.. [2] https://lore.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz .. [3] http://dustin.wikidot.com/half-orcs-and-orcs diff --git a/Documentation/x86/resctrl_ui.rst b/Documentation/x86/resctrl.rst index 5368cedfb530..71a531061e4e 100644 --- a/Documentation/x86/resctrl_ui.rst +++ b/Documentation/x86/resctrl.rst @@ -138,6 +138,18 @@ with respect to allocation: non-linear. This field is purely informational only. +"thread_throttle_mode": + Indicator on Intel systems of how tasks running on threads + of a physical core are throttled in cases where they + request different memory bandwidth percentages: + + "max": + the smallest percentage is applied + to all threads + "per-thread": + bandwidth percentages are directly applied to + the threads running on the core + If RDT monitoring is available there will be an "L3_MON" directory with the following files: @@ -364,8 +376,10 @@ to the next control step available on the hardware. The bandwidth throttling is a core specific mechanism on some of Intel SKUs. Using a high bandwidth and a low bandwidth setting on two threads -sharing a core will result in both threads being throttled to use the -low bandwidth. The fact that Memory bandwidth allocation(MBA) is a core +sharing a core may result in both threads being throttled to use the +low bandwidth (see "thread_throttle_mode"). + +The fact that Memory bandwidth allocation(MBA) may be a core specific mechanism where as memory bandwidth monitoring(MBM) is done at the package level may lead to confusion when users try to apply control via the MBA and then monitor the bandwidth to see if the controls are @@ -1195,3 +1209,96 @@ View the llc occupancy snapshot:: # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy 11234000 + +Intel RDT Errata +================ + +Intel MBM Counters May Report System Memory Bandwidth Incorrectly +----------------------------------------------------------------- + +Errata SKX99 for Skylake server and BDF102 for Broadwell server. + +Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics +according to the assigned Resource Monitor ID (RMID) for that logical +core. The IA32_QM_CTR register (MSR 0xC8E), used to report these +metrics, may report incorrect system bandwidth for certain RMID values. + +Implication: Due to the errata, system memory bandwidth may not match +what is reported. + +Workaround: MBM total and local readings are corrected according to the +following correction factor table: + ++---------------+---------------+---------------+-----------------+ +|core count |rmid count |rmid threshold |correction factor| ++---------------+---------------+---------------+-----------------+ +|1 |8 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|2 |16 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|3 |24 |15 |0.969650 | ++---------------+---------------+---------------+-----------------+ +|4 |32 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|6 |48 |31 |0.969650 | ++---------------+---------------+---------------+-----------------+ +|7 |56 |47 |1.142857 | ++---------------+---------------+---------------+-----------------+ +|8 |64 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|9 |72 |63 |1.185115 | ++---------------+---------------+---------------+-----------------+ +|10 |80 |63 |1.066553 | ++---------------+---------------+---------------+-----------------+ +|11 |88 |79 |1.454545 | ++---------------+---------------+---------------+-----------------+ +|12 |96 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|13 |104 |95 |1.230769 | ++---------------+---------------+---------------+-----------------+ +|14 |112 |95 |1.142857 | ++---------------+---------------+---------------+-----------------+ +|15 |120 |95 |1.066667 | ++---------------+---------------+---------------+-----------------+ +|16 |128 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|17 |136 |127 |1.254863 | ++---------------+---------------+---------------+-----------------+ +|18 |144 |127 |1.185255 | ++---------------+---------------+---------------+-----------------+ +|19 |152 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|20 |160 |127 |1.066667 | ++---------------+---------------+---------------+-----------------+ +|21 |168 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|22 |176 |159 |1.454334 | ++---------------+---------------+---------------+-----------------+ +|23 |184 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|24 |192 |127 |0.969744 | ++---------------+---------------+---------------+-----------------+ +|25 |200 |191 |1.280246 | ++---------------+---------------+---------------+-----------------+ +|26 |208 |191 |1.230921 | ++---------------+---------------+---------------+-----------------+ +|27 |216 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|28 |224 |191 |1.143118 | ++---------------+---------------+---------------+-----------------+ + +If rmid > rmid threshold, MBM total and local values should be multiplied +by the correction factor. + +See: + +1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update: +http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html + +2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update: +http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf + +3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual: +https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html + +for further information. diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst new file mode 100644 index 000000000000..2bcbffacbed5 --- /dev/null +++ b/Documentation/x86/sgx.rst @@ -0,0 +1,302 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +Software Guard eXtensions (SGX) +=============================== + +Overview +======== + +Software Guard eXtensions (SGX) hardware enables for user space applications +to set aside private memory regions of code and data: + +* Privileged (ring-0) ENCLS functions orchestrate the construction of the + regions. +* Unprivileged (ring-3) ENCLU functions allow an application to enter and + execute inside the regions. + +These memory regions are called enclaves. An enclave can be only entered at a +fixed set of entry points. Each entry point can hold a single hardware thread +at a time. While the enclave is loaded from a regular binary file by using +ENCLS functions, only the threads inside the enclave can access its memory. The +region is denied from outside access by the CPU, and encrypted before it leaves +from LLC. + +The support can be determined by + + ``grep sgx /proc/cpuinfo`` + +SGX must both be supported in the processor and enabled by the BIOS. If SGX +appears to be unsupported on a system which has hardware support, ensure +support is enabled in the BIOS. If a BIOS presents a choice between "Enabled" +and "Software Enabled" modes for SGX, choose "Enabled". + +Enclave Page Cache +================== + +SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated +with an enclave. It is contained in a BIOS-reserved region of physical memory. +Unlike pages used for regular memory, pages can only be accessed from outside of +the enclave during enclave construction with special, limited SGX instructions. + +Only a CPU executing inside an enclave can directly access enclave memory. +However, a CPU executing inside an enclave may access normal memory outside the +enclave. + +The kernel manages enclave memory similar to how it treats device memory. + +Enclave Page Types +------------------ + +**SGX Enclave Control Structure (SECS)** + Enclave's address range, attributes and other global data are defined + by this structure. + +**Regular (REG)** + Regular EPC pages contain the code and data of an enclave. + +**Thread Control Structure (TCS)** + Thread Control Structure pages define the entry points to an enclave and + track the execution state of an enclave thread. + +**Version Array (VA)** + Version Array pages contain 512 slots, each of which can contain a version + number for a page evicted from the EPC. + +Enclave Page Cache Map +---------------------- + +The processor tracks EPC pages in a hardware metadata structure called the +*Enclave Page Cache Map (EPCM)*. The EPCM contains an entry for each EPC page +which describes the owning enclave, access rights and page type among the other +things. + +EPCM permissions are separate from the normal page tables. This prevents the +kernel from, for instance, allowing writes to data which an enclave wishes to +remain read-only. EPCM permissions may only impose additional restrictions on +top of normal x86 page permissions. + +For all intents and purposes, the SGX architecture allows the processor to +invalidate all EPCM entries at will. This requires that software be prepared to +handle an EPCM fault at any time. In practice, this can happen on events like +power transitions when the ephemeral key that encrypts enclave memory is lost. + +Application interface +===================== + +Enclave build functions +----------------------- + +In addition to the traditional compiler and linker build process, SGX has a +separate enclave “build” process. Enclaves must be built before they can be +executed (entered). The first step in building an enclave is opening the +**/dev/sgx_enclave** device. Since enclave memory is protected from direct +access, special privileged instructions are then used to copy data into enclave +pages and establish enclave page permissions. + +.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c + :functions: sgx_ioc_enclave_create + sgx_ioc_enclave_add_pages + sgx_ioc_enclave_init + sgx_ioc_enclave_provision + +Enclave runtime management +-------------------------- + +Systems supporting SGX2 additionally support changes to initialized +enclaves: modifying enclave page permissions and type, and dynamically +adding and removing of enclave pages. When an enclave accesses an address +within its address range that does not have a backing page then a new +regular page will be dynamically added to the enclave. The enclave is +still required to run EACCEPT on the new page before it can be used. + +.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c + :functions: sgx_ioc_enclave_restrict_permissions + sgx_ioc_enclave_modify_types + sgx_ioc_enclave_remove_pages + +Enclave vDSO +------------ + +Entering an enclave can only be done through SGX-specific EENTER and ERESUME +functions, and is a non-trivial process. Because of the complexity of +transitioning to and from an enclave, enclaves typically utilize a library to +handle the actual transitions. This is roughly analogous to how glibc +implementations are used by most applications to wrap system calls. + +Another crucial characteristic of enclaves is that they can generate exceptions +as part of their normal operation that need to be handled in the enclave or are +unique to SGX. + +Instead of the traditional signal mechanism to handle these exceptions, SGX +can leverage special exception fixup provided by the vDSO. The kernel-provided +vDSO function wraps low-level transitions to/from the enclave like EENTER and +ERESUME. The vDSO function intercepts exceptions that would otherwise generate +a signal and return the fault information directly to its caller. This avoids +the need to juggle signal handlers. + +.. kernel-doc:: arch/x86/include/uapi/asm/sgx.h + :functions: vdso_sgx_enter_enclave_t + +ksgxd +===== + +SGX support includes a kernel thread called *ksgxd*. + +EPC sanitization +---------------- + +ksgxd is started when SGX initializes. Enclave memory is typically ready +for use when the processor powers on or resets. However, if SGX has been in +use since the reset, enclave pages may be in an inconsistent state. This might +occur after a crash and kexec() cycle, for instance. At boot, ksgxd +reinitializes all enclave pages so that they can be allocated and re-used. + +The sanitization is done by going through EPC address space and applying the +EREMOVE function to each physical page. Some enclave pages like SECS pages have +hardware dependencies on other pages which prevents EREMOVE from functioning. +Executing two EREMOVE passes removes the dependencies. + +Page reclaimer +-------------- + +Similar to the core kswapd, ksgxd, is responsible for managing the +overcommitment of enclave memory. If the system runs out of enclave memory, +*ksgxd* “swaps” enclave memory to normal memory. + +Launch Control +============== + +SGX provides a launch control mechanism. After all enclave pages have been +copied, kernel executes EINIT function, which initializes the enclave. Only after +this the CPU can execute inside the enclave. + +EINIT function takes an RSA-3072 signature of the enclave measurement. The function +checks that the measurement is correct and signature is signed with the key +hashed to the four **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs representing the +SHA256 of a public key. + +Those MSRs can be configured by the BIOS to be either readable or writable. +Linux supports only writable configuration in order to give full control to the +kernel on launch control policy. Before calling EINIT function, the driver sets +the MSRs to match the enclave's signing key. + +Encryption engines +================== + +In order to conceal the enclave data while it is out of the CPU package, the +memory controller has an encryption engine to transparently encrypt and decrypt +enclave memory. + +In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to +encrypt pages leaving the CPU caches. MEE uses a n-ary Merkle tree with root in +SRAM to maintain integrity of the encrypted data. This provides integrity and +anti-replay protection but does not scale to large memory sizes because the time +required to update the Merkle tree grows logarithmically in relation to the +memory size. + +CPUs starting from Icelake use Total Memory Encryption (TME) in the place of +MEE. TME-based SGX implementations do not have an integrity Merkle tree, which +means integrity and replay-attacks are not mitigated. B, it includes +additional changes to prevent cipher text from being returned and SW memory +aliases from being created. + +DMA to enclave memory is blocked by range registers on both MEE and TME systems +(SDM section 41.10). + +Usage Models +============ + +Shared Library +-------------- + +Sensitive data and the code that acts on it is partitioned from the application +into a separate library. The library is then linked as a DSO which can be loaded +into an enclave. The application can then make individual function calls into +the enclave through special SGX instructions. A run-time within the enclave is +configured to marshal function parameters into and out of the enclave and to +call the correct library function. + +Application Container +--------------------- + +An application may be loaded into a container enclave which is specially +configured with a library OS and run-time which permits the application to run. +The enclave run-time and library OS work together to execute the application +when a thread enters the enclave. + +Impact of Potential Kernel SGX Bugs +=================================== + +EPC leaks +--------- + +When EPC page leaks happen, a WARNING like this is shown in dmesg: + +"EREMOVE returned ... and an EPC page was leaked. SGX may become unusable..." + +This is effectively a kernel use-after-free of an EPC page, and due +to the way SGX works, the bug is detected at freeing. Rather than +adding the page back to the pool of available EPC pages, the kernel +intentionally leaks the page to avoid additional errors in the future. + +When this happens, the kernel will likely soon leak more EPC pages, and +SGX will likely become unusable because the memory available to SGX is +limited. However, while this may be fatal to SGX, the rest of the kernel +is unlikely to be impacted and should continue to work. + +As a result, when this happpens, user should stop running any new +SGX workloads, (or just any new workloads), and migrate all valuable +workloads. Although a machine reboot can recover all EPC memory, the bug +should be reported to Linux developers. + + +Virtual EPC +=========== + +The implementation has also a virtual EPC driver to support SGX enclaves +in guests. Unlike the SGX driver, an EPC page allocated by the virtual +EPC driver doesn't have a specific enclave associated with it. This is +because KVM doesn't track how a guest uses EPC pages. + +As a result, the SGX core page reclaimer doesn't support reclaiming EPC +pages allocated to KVM guests through the virtual EPC driver. If the +user wants to deploy SGX applications both on the host and in guests +on the same machine, the user should reserve enough EPC (by taking out +total virtual EPC size of all SGX VMs from the physical EPC size) for +host SGX applications so they can run with acceptable performance. + +Architectural behavior is to restore all EPC pages to an uninitialized +state also after a guest reboot. Because this state can be reached only +through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc`` +provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction +on all pages in the virtual EPC. + +``EREMOVE`` can fail for three reasons. Userspace must pay attention +to expected failures and handle them as follows: + +1. Page removal will always fail when any thread is running in the + enclave to which the page belongs. In this case the ioctl will + return ``EBUSY`` independent of whether it has successfully removed + some pages; userspace can avoid these failures by preventing execution + of any vcpu which maps the virtual EPC. + +2. Page removal will cause a general protection fault if two calls to + ``EREMOVE`` happen concurrently for pages that refer to the same + "SECS" metadata pages. This can happen if there are concurrent + invocations to ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc`` + file descriptor in the guest is closed at the same time as + ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``. + This can be avoided in userspace by serializing calls to the ioctl() + and to close(), but in general it should not be a problem. + +3. Finally, page removal will fail for SECS metadata pages which still + have child pages. Child pages can be removed by executing + ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors + mapped into the guest. This means that the ioctl() must be called + twice: an initial set of calls to remove child pages and a subsequent + set of calls to remove SECS pages. The second set of calls is only + required for those mappings that returned a nonzero value from the + first call. It indicates a bug in the kernel or the userspace client + if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has + a return code other than 0. diff --git a/Documentation/x86/sva.rst b/Documentation/x86/sva.rst new file mode 100644 index 000000000000..2e9b8b0f9a0f --- /dev/null +++ b/Documentation/x86/sva.rst @@ -0,0 +1,286 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================================== +Shared Virtual Addressing (SVA) with ENQCMD +=========================================== + +Background +========== + +Shared Virtual Addressing (SVA) allows the processor and device to use the +same virtual addresses avoiding the need for software to translate virtual +addresses to physical addresses. SVA is what PCIe calls Shared Virtual +Memory (SVM). + +In addition to the convenience of using application virtual addresses +by the device, it also doesn't require pinning pages for DMA. +PCIe Address Translation Services (ATS) along with Page Request Interface +(PRI) allow devices to function much the same way as the CPU handling +application page-faults. For more information please refer to the PCIe +specification Chapter 10: ATS Specification. + +Use of SVA requires IOMMU support in the platform. IOMMU is also +required to support the PCIe features ATS and PRI. ATS allows devices +to cache translations for virtual addresses. The IOMMU driver uses the +mmu_notifier() support to keep the device TLB cache and the CPU cache in +sync. When an ATS lookup fails for a virtual address, the device should +use the PRI in order to request the virtual address to be paged into the +CPU page tables. The device must use ATS again in order the fetch the +translation before use. + +Shared Hardware Workqueues +========================== + +Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits +the use of Shared Work Queues (SWQ) by both applications and Virtual +Machines (VM's). This allows better hardware utilization vs. hard +partitioning resources that could result in under utilization. In order to +allow the hardware to distinguish the context for which work is being +executed in the hardware by SWQ interface, SIOV uses Process Address Space +ID (PASID), which is a 20-bit number defined by the PCIe SIG. + +PASID value is encoded in all transactions from the device. This allows the +IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe +Resource Identifier (RID) which is the Bus/Device/Function. + + +ENQCMD +====== + +ENQCMD is a new instruction on Intel platforms that atomically submits a +work descriptor to a device. The descriptor includes the operation to be +performed, virtual addresses of all parameters, virtual address of a completion +record, and the PASID (process address space ID) of the current process. + +ENQCMD works with non-posted semantics and carries a status back if the +command was accepted by hardware. This allows the submitter to know if the +submission needs to be retried or other device specific mechanisms to +implement fairness or ensure forward progress should be provided. + +ENQCMD is the glue that ensures applications can directly submit commands +to the hardware and also permits hardware to be aware of application context +to perform I/O operations via use of PASID. + +Process Address Space Tagging +============================= + +A new thread-scoped MSR (IA32_PASID) provides the connection between +user processes and the rest of the hardware. When an application first +accesses an SVA-capable device, this MSR is initialized with a newly +allocated PASID. The driver for the device calls an IOMMU-specific API +that sets up the routing for DMA and page-requests. + +For example, the Intel Data Streaming Accelerator (DSA) uses +iommu_sva_bind_device(), which will do the following: + +- Allocate the PASID, and program the process page-table (%cr3 register) in the + PASID context entries. +- Register for mmu_notifier() to track any page-table invalidations to keep + the device TLB in sync. For example, when a page-table entry is invalidated, + the IOMMU propagates the invalidation to the device TLB. This will force any + future access by the device to this virtual address to participate in + ATS. If the IOMMU responds with proper response that a page is not + present, the device would request the page to be paged in via the PCIe PRI + protocol before performing I/O. + +This MSR is managed with the XSAVE feature set as "supervisor state" to +ensure the MSR is updated during context switch. + +PASID Management +================ + +The kernel must allocate a PASID on behalf of each process which will use +ENQCMD and program it into the new MSR to communicate the process identity to +platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests +from this process. When a user submits a work descriptor to a device using the +ENQCMD instruction, the PASID field in the descriptor is auto-filled with the +value from MSR_IA32_PASID. Requests for DMA from the device are also tagged +with the same PASID. The platform IOMMU uses the PASID in the transaction to +perform address translation. The IOMMU APIs setup the corresponding PASID +entry in IOMMU with the process address used by the CPU (e.g. %cr3 register in +x86). + +The MSR must be configured on each logical CPU before any application +thread can interact with a device. Threads that belong to the same +process share the same page tables, thus the same MSR value. + +PASID Life Cycle Management +=========================== + +PASID is initialized as INVALID_IOASID (-1) when a process is created. + +Only processes that access SVA-capable devices need to have a PASID +allocated. This allocation happens when a process opens/binds an SVA-capable +device but finds no PASID for this process. Subsequent binds of the same, or +other devices will share the same PASID. + +Although the PASID is allocated to the process by opening a device, +it is not active in any of the threads of that process. It's loaded to the +IA32_PASID MSR lazily when a thread tries to submit a work descriptor +to a device using the ENQCMD. + +That first access will trigger a #GP fault because the IA32_PASID MSR +has not been initialized with the PASID value assigned to the process +when the device was opened. The Linux #GP handler notes that a PASID has +been allocated for the process, and so initializes the IA32_PASID MSR +and returns so that the ENQCMD instruction is re-executed. + +On fork(2) or exec(2) the PASID is removed from the process as it no +longer has the same address space that it had when the device was opened. + +On clone(2) the new task shares the same address space, so will be +able to use the PASID allocated to the process. The IA32_PASID is not +preemptively initialized as the PASID value might not be allocated yet or +the kernel does not know whether this thread is going to access the device +and the cleared IA32_PASID MSR reduces context switch overhead by xstate +init optimization. Since #GP faults have to be handled on any threads that +were created before the PASID was assigned to the mm of the process, newly +created threads might as well be treated in a consistent way. + +Due to complexity of freeing the PASID and clearing all IA32_PASID MSRs in +all threads in unbind, free the PASID lazily only on mm exit. + +If a process does a close(2) of the device file descriptor and munmap(2) +of the device MMIO portal, then the driver will unbind the device. The +PASID is still marked VALID in the PASID_MSR for any threads in the +process that accessed the device. But this is harmless as without the +MMIO portal they cannot submit new work to the device. + +Relationships +============= + + * Each process has many threads, but only one PASID. + * Devices have a limited number (~10's to 1000's) of hardware workqueues. + The device driver manages allocating hardware workqueues. + * A single mmap() maps a single hardware workqueue as a "portal" and + each portal maps down to a single workqueue. + * For each device with which a process interacts, there must be + one or more mmap()'d portals. + * Many threads within a process can share a single portal to access + a single device. + * Multiple processes can separately mmap() the same portal, in + which case they still share one device hardware workqueue. + * The single process-wide PASID is used by all threads to interact + with all devices. There is not, for instance, a PASID for each + thread or each thread<->device pair. + +FAQ +=== + +* What is SVA/SVM? + +Shared Virtual Addressing (SVA) permits I/O hardware and the processor to +work in the same address space, i.e., to share it. Some call it Shared +Virtual Memory (SVM), but Linux community wanted to avoid confusing it with +POSIX Shared Memory and Secure Virtual Machines which were terms already in +circulation. + +* What is a PASID? + +A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet +(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS. +PASID is included in all transactions between the platform and the device. + +* How are shared workqueues different? + +Traditionally, in order for userspace applications to interact with hardware, +there is a separate hardware instance required per process. For example, +consider doorbells as a mechanism of informing hardware about work to process. +Each doorbell is required to be spaced 4k (or page-size) apart for process +isolation. This requires hardware to provision that space and reserve it in +MMIO. This doesn't scale as the number of threads becomes quite large. The +hardware also manages the queue depth for Shared Work Queues (SWQ), and +consumers don't need to track queue depth. If there is no space to accept +a command, the device will return an error indicating retry. + +A user should check Deferrable Memory Write (DMWr) capability on the device +and only submits ENQCMD when the device supports it. In the new DMWr PCIe +terminology, devices need to support DMWr completer capability. In addition, +it requires all switch ports to support DMWr routing and must be enabled by +the PCIe subsystem, much like how PCIe atomic operations are managed for +instance. + +SWQ allows hardware to provision just a single address in the device. When +used with ENQCMD to submit work, the device can distinguish the process +submitting the work since it will include the PASID assigned to that +process. This helps the device scale to a large number of processes. + +* Is this the same as a user space device driver? + +Communicating with the device via the shared workqueue is much simpler +than a full blown user space driver. The kernel driver does all the +initialization of the hardware. User space only needs to worry about +submitting work and processing completions. + +* Is this the same as SR-IOV? + +Single Root I/O Virtualization (SR-IOV) focuses on providing independent +hardware interfaces for virtualizing hardware. Hence, it's required to be +almost fully functional interface to software supporting the traditional +BARs, space for interrupts via MSI-X, its own register layout. +Virtual Functions (VFs) are assisted by the Physical Function (PF) +driver. + +Scalable I/O Virtualization builds on the PASID concept to create device +instances for virtualization. SIOV requires host software to assist in +creating virtual devices; each virtual device is represented by a PASID +along with the bus/device/function of the device. This allows device +hardware to optimize device resource creation and can grow dynamically on +demand. SR-IOV creation and management is very static in nature. Consult +references below for more details. + +* Why not just create a virtual function for each app? + +Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require +duplicated hardware for PCI config space and interrupts such as MSI-X. +Resources such as interrupts have to be hard partitioned between VFs at +creation time, and cannot scale dynamically on demand. The VFs are not +completely independent from the Physical Function (PF). Most VFs require +some communication and assistance from the PF driver. SIOV, in contrast, +creates a software-defined device where all the configuration and control +aspects are mediated via the slow path. The work submission and completion +happen without any mediation. + +* Does this support virtualization? + +ENQCMD can be used from within a guest VM. In these cases, the VMM helps +with setting up a translation table to translate from Guest PASID to Host +PASID. Please consult the ENQCMD instruction set reference for more +details. + +* Does memory need to be pinned? + +When devices support SVA along with platform hardware such as IOMMU +supporting such devices, there is no need to pin memory for DMA purposes. +Devices that support SVA also support other PCIe features that remove the +pinning requirement for memory. + +Device TLB support - Device requests the IOMMU to lookup an address before +use via Address Translation Service (ATS) requests. If the mapping exists +but there is no page allocated by the OS, IOMMU hardware returns that no +mapping exists. + +Device requests the virtual address to be mapped via Page Request +Interface (PRI). Once the OS has successfully completed the mapping, it +returns the response back to the device. The device requests again for +a translation and continues. + +IOMMU works with the OS in managing consistency of page-tables with the +device. When removing pages, it interacts with the device to remove any +device TLB entry that might have been cached before removing the mappings from +the OS. + +References +========== + +VT-D: +https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d + +SIOV: +https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux + +ENQCMD in ISE: +https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf + +DSA spec: +https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst new file mode 100644 index 000000000000..b8fa4329e1a5 --- /dev/null +++ b/Documentation/x86/tdx.rst @@ -0,0 +1,218 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +Intel Trust Domain Extensions (TDX) +===================================== + +Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from +the host and physical attacks by isolating the guest register state and by +encrypting the guest memory. In TDX, a special module running in a special +mode sits between the host and the guest and manages the guest/host +separation. + +Since the host cannot directly access guest registers or memory, much +normal functionality of a hypervisor must be moved into the guest. This is +implemented using a Virtualization Exception (#VE) that is handled by the +guest kernel. A #VE is handled entirely inside the guest kernel, but some +require the hypervisor to be consulted. + +TDX includes new hypercall-like mechanisms for communicating from the +guest to the hypervisor or the TDX module. + +New TDX Exceptions +================== + +TDX guests behave differently from bare-metal and traditional VMX guests. +In TDX guests, otherwise normal instructions or memory accesses can cause +#VE or #GP exceptions. + +Instructions marked with an '*' conditionally cause exceptions. The +details for these instructions are discussed below. + +Instruction-based #VE +--------------------- + +- Port I/O (INS, OUTS, IN, OUT) +- HLT +- MONITOR, MWAIT +- WBINVD, INVD +- VMCALL +- RDMSR*,WRMSR* +- CPUID* + +Instruction-based #GP +--------------------- + +- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH, + VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON +- ENCLS, ENCLU +- GETSEC +- RSM +- ENQCMD +- RDMSR*,WRMSR* + +RDMSR/WRMSR Behavior +-------------------- + +MSR access behavior falls into three categories: + +- #GP generated +- #VE generated +- "Just works" + +In general, the #GP MSRs should not be used in guests. Their use likely +indicates a bug in the guest. The guest may try to handle the #GP with a +hypercall but it is unlikely to succeed. + +The #VE MSRs are typically able to be handled by the hypervisor. Guests +can make a hypercall to the hypervisor to handle the #VE. + +The "just works" MSRs do not need any special guest handling. They might +be implemented by directly passing through the MSR to the hardware or by +trapping and handling in the TDX module. Other than possibly being slow, +these MSRs appear to function just as they would on bare metal. + +CPUID Behavior +-------------- + +For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID +return values (in guest EAX/EBX/ECX/EDX) are configurable by the +hypervisor. For such cases, the Intel TDX module architecture defines two +virtualization types: + +- Bit fields for which the hypervisor controls the value seen by the guest + TD. + +- Bit fields for which the hypervisor configures the value such that the + guest TD either sees their native value or a value of 0. For these bit + fields, the hypervisor can mask off the native values, but it can not + turn *on* values. + +A #VE is generated for CPUID leaves and sub-leaves that the TDX module does +not know how to handle. The guest kernel may ask the hypervisor for the +value with a hypercall. + +#VE on Memory Accesses +====================== + +There are essentially two classes of TDX memory: private and shared. +Private memory receives full TDX protections. Its content is protected +against access from the hypervisor. Shared memory is expected to be +shared between guest and hypervisor and does not receive full TDX +protections. + +A TD guest is in control of whether its memory accesses are treated as +private or shared. It selects the behavior with a bit in its page table +entries. This helps ensure that a guest does not place sensitive +information in shared memory, exposing it to the untrusted hypervisor. + +#VE on Shared Memory +-------------------- + +Access to shared mappings can cause a #VE. The hypervisor ultimately +controls whether a shared memory access causes a #VE, so the guest must be +careful to only reference shared pages it can safely handle a #VE. For +instance, the guest should be careful not to access shared memory in the +#VE handler before it reads the #VE info structure (TDG.VP.VEINFO.GET). + +Shared mapping content is entirely controlled by the hypervisor. The guest +should only use shared mappings for communicating with the hypervisor. +Shared mappings must never be used for sensitive memory content like kernel +stacks. A good rule of thumb is that hypervisor-shared memory should be +treated the same as memory mapped to userspace. Both the hypervisor and +userspace are completely untrusted. + +MMIO for virtual devices is implemented as shared memory. The guest must +be careful not to access device MMIO regions unless it is also prepared to +handle a #VE. + +#VE on Private Pages +-------------------- + +An access to private mappings can also cause a #VE. Since all kernel +memory is also private memory, the kernel might theoretically need to +handle a #VE on arbitrary kernel memory accesses. This is not feasible, so +TDX guests ensure that all guest memory has been "accepted" before memory +is used by the kernel. + +A modest amount of memory (typically 512M) is pre-accepted by the firmware +before the kernel runs to ensure that the kernel can start up without +being subjected to a #VE. + +The hypervisor is permitted to unilaterally move accepted pages to a +"blocked" state. However, if it does this, page access will not generate a +#VE. It will, instead, cause a "TD Exit" where the hypervisor is required +to handle the exception. + +Linux #VE handler +================= + +Just like page faults or #GP's, #VE exceptions can be either handled or be +fatal. Typically, an unhandled userspace #VE results in a SIGSEGV. +An unhandled kernel #VE results in an oops. + +Handling nested exceptions on x86 is typically nasty business. A #VE +could be interrupted by an NMI which triggers another #VE and hilarity +ensues. The TDX #VE architecture anticipated this scenario and includes a +feature to make it slightly less nasty. + +During #VE handling, the TDX module ensures that all interrupts (including +NMIs) are blocked. The block remains in place until the guest makes a +TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when interrupts +or a new #VE can be delivered. + +However, the guest kernel must still be careful to avoid potential +#VE-triggering actions (discussed above) while this block is in place. +While the block is in place, any #VE is elevated to a double fault (#DF) +which is not recoverable. + +MMIO handling +============= + +In non-TDX VMs, MMIO is usually implemented by giving a guest access to a +mapping which will cause a VMEXIT on access, and then the hypervisor +emulates the access. That is not possible in TDX guests because VMEXIT +will expose the register state to the host. TDX guests don't trust the host +and can't have their state exposed to the host. + +In TDX, MMIO regions typically trigger a #VE exception in the guest. The +guest #VE handler then emulates the MMIO instruction inside the guest and +converts it into a controlled TDCALL to the host, rather than exposing +guest state to the host. + +MMIO addresses on x86 are just special physical addresses. They can +theoretically be accessed with any instruction that accesses memory. +However, the kernel instruction decoding method is limited. It is only +designed to decode instructions like those generated by io.h macros. + +MMIO access via other means (like structure overlays) may result in an +oops. + +Shared Memory Conversions +========================= + +All TDX guest memory starts out as private at boot. This memory can not +be accessed by the hypervisor. However, some kernel users like device +drivers might have a need to share data with the hypervisor. To do this, +memory must be converted between shared and private. This can be +accomplished using some existing memory encryption helpers: + + * set_memory_decrypted() converts a range of pages to shared. + * set_memory_encrypted() converts memory back to private. + +Device drivers are the primary user of shared memory, but there's no need +to touch every driver. DMA buffers and ioremap() do the conversions +automatically. + +TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is +converted to shared on boot. + +For coherent DMA allocation, the DMA buffer gets converted on the +allocation. Check force_dma_unencrypted() for details. + +References +========== + +TDX reference material is collected here: + +https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html diff --git a/Documentation/x86/topology.rst b/Documentation/x86/topology.rst index e29739904e37..7f58010ea86a 100644 --- a/Documentation/x86/topology.rst +++ b/Documentation/x86/topology.rst @@ -41,6 +41,8 @@ Package Packages contain a number of cores plus shared resources, e.g. DRAM controller, shared caches etc. +Modern systems may also use the term 'Die' for package. + AMD nomenclature for package is 'Node'. Package-related topology information in the kernel: @@ -53,11 +55,18 @@ Package-related topology information in the kernel: The number of dies in a package. This information is retrieved via CPUID. + - cpuinfo_x86.cpu_die_id: + + The physical ID of the die. This information is retrieved via CPUID. + - cpuinfo_x86.phys_proc_id: The physical ID of the package. This information is retrieved via CPUID and deduced from the APIC IDs of the cores in the package. + Modern systems use this value for the socket. There may be multiple + packages within a socket. This value may differ from cpu_die_id. + - cpuinfo_x86.logical_proc_id: The logical ID of the package. As we do not trust BIOSes to enumerate the diff --git a/Documentation/x86/x86_64/5level-paging.rst b/Documentation/x86/x86_64/5level-paging.rst index 44856417e6a5..b792bbdc0b01 100644 --- a/Documentation/x86/x86_64/5level-paging.rst +++ b/Documentation/x86/x86_64/5level-paging.rst @@ -6,9 +6,9 @@ Overview ======== -Original x86-64 was limited by 4-level paing to 256 TiB of virtual address +Original x86-64 was limited by 4-level paging to 256 TiB of virtual address space and 64 TiB of physical address space. We are already bumping into -this limit: some vendors offers servers with 64 TiB of memory today. +this limit: some vendors offer servers with 64 TiB of memory today. To overcome the limitation upcoming hardware will introduce support for 5-level paging. It is a straight-forward extension of the current page diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst index 2b98efb5ba7f..cbd14124a667 100644 --- a/Documentation/x86/x86_64/boot-options.rst +++ b/Documentation/x86/x86_64/boot-options.rst @@ -47,14 +47,7 @@ Please see Documentation/x86/x86_64/machinecheck.rst for sysfs runtime tunables. in a reboot. On Intel systems it is enabled by default. mce=nobootlog Disable boot machine check logging. - mce=tolerancelevel[,monarchtimeout] (number,number) - tolerance levels: - 0: always panic on uncorrected errors, log corrected errors - 1: panic or SIGBUS on uncorrected errors, log corrected errors - 2: SIGBUS or log uncorrected errors, log corrected errors - 3: never panic or SIGBUS, log all errors (for testing only) - Default is 1 - Can be also set using sysfs which is preferable. + mce=monarchtimeout (number) monarchtimeout: Sets the time in us to wait for other CPUs on machine checks. 0 to disable. @@ -126,7 +119,7 @@ Idle loop Rebooting ========= - reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old] + reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] | p[ci] [, [w]arm | [c]old] bios Use the CPU reboot vector for warm reset warm @@ -145,6 +138,8 @@ Rebooting Use efi reset_system runtime service. If EFI is not configured or the EFI reset does not work, the reboot path attempts the reset using the keyboard controller. + pci + Use a write to the PCI config space register 0xcf9 to trigger reboot. Using warm reset will be much faster especially on big memory systems because the BIOS will not go through the memory check. @@ -155,14 +150,12 @@ Rebooting Don't stop other CPUs on reboot. This can make reboot more reliable in some cases. -Non Executable Mappings -======================= - - noexec=on|off - on - Enable(default) - off - Disable + reboot=default + There are some built-in platform specific "quirks" - you may see: + "reboot: <name> series board detected. Selecting <type> for reboots." + In the case where you think the quirk is in error (e.g. you have + newer BIOS, or newer board) using this option will ignore the built-in + quirk table, and use the generic default reboot actions. NUMA ==== @@ -173,6 +166,10 @@ NUMA numa=noacpi Don't parse the SRAT table for NUMA setup + numa=nohmat + Don't parse the HMAT table for NUMA setup, or soft-reserved memory + partitioning. + numa=fake=<size>[MG] If given as a memory unit, fills all system RAM with nodes of size interleaved over physical nodes. @@ -243,16 +240,11 @@ Multiple x86-64 PCI-DMA mapping implementations exist, for example: Kernel boot message: "PCI-DMA: Using software bounce buffering for IO (SWIOTLB)" - 4. <arch/x86_64/pci-calgary.c> : IBM Calgary hardware IOMMU. Used in IBM - pSeries and xSeries servers. This hardware IOMMU supports DMA address - mapping with memory protection, etc. - Kernel boot message: "PCI-DMA: Using Calgary IOMMU" - :: iommu=[<size>][,noagp][,off][,force][,noforce] [,memaper[=<order>]][,merge][,fullflush][,nomerge] - [,noaperture][,calgary] + [,noaperture] General iommu options: @@ -291,39 +283,17 @@ iommu options only relevant to the AMD GART hardware IOMMU: Don't initialize the AGP driver and use full aperture. panic Always panic when IOMMU overflows. - calgary - Use the Calgary IOMMU if it is available iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU implementation: - swiotlb=<pages>[,force] - <pages> - Prereserve that many 128K pages for the software IO bounce buffering. + swiotlb=<slots>[,force,noforce] + <slots> + Prereserve that many 2K slots for the software IO bounce buffering. force Force all IO through the software TLB. - -Settings for the IBM Calgary hardware IOMMU currently found in IBM -pSeries and xSeries machines - - calgary=[64k,128k,256k,512k,1M,2M,4M,8M] - Set the size of each PCI slot's translation table when using the - Calgary IOMMU. This is the size of the translation table itself - in main memory. The smallest table, 64k, covers an IO space of - 32MB; the largest, 8MB table, can cover an IO space of 4GB. - Normally the kernel will make the right choice by itself. - calgary=[translate_empty_slots] - Enable translation even on slots that have no devices attached to - them, in case a device will be hotplugged in the future. - calgary=[disable=<PCI bus number>] - Disable translation on a given PHB. For - example, the built-in graphics adapter resides on the first bridge - (PCI bus number 0); if translation (isolation) is enabled on this - bridge, X servers that access the hardware directly from user - space might stop working. Use this option if you have devices that - are accessed from userspace directly on some PCI host bridge. - panic - Always panic when IOMMU overflows + noforce + Do not initialize the software TLB. Miscellaneous @@ -333,3 +303,17 @@ Miscellaneous Do not use GB pages for kernel direct mappings. gbpages Use GB pages for kernel direct mappings. + + +AMD SEV (Secure Encrypted Virtualization) +========================================= +Options relating to AMD SEV, specified via the following format: + +:: + + sev=option1[,option2] + +The available options are: + + debug + Enable debug messages. diff --git a/Documentation/x86/x86_64/fsgs.rst b/Documentation/x86/x86_64/fsgs.rst new file mode 100644 index 000000000000..50960e09e1f6 --- /dev/null +++ b/Documentation/x86/x86_64/fsgs.rst @@ -0,0 +1,199 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Using FS and GS segments in user space applications +=================================================== + +The x86 architecture supports segmentation. Instructions which access +memory can use segment register based addressing mode. The following +notation is used to address a byte within a segment: + + Segment-register:Byte-address + +The segment base address is added to the Byte-address to compute the +resulting virtual address which is accessed. This allows to access multiple +instances of data with the identical Byte-address, i.e. the same code. The +selection of a particular instance is purely based on the base-address in +the segment register. + +In 32-bit mode the CPU provides 6 segments, which also support segment +limits. The limits can be used to enforce address space protections. + +In 64-bit mode the CS/SS/DS/ES segments are ignored and the base address is +always 0 to provide a full 64bit address space. The FS and GS segments are +still functional in 64-bit mode. + +Common FS and GS usage +------------------------------ + +The FS segment is commonly used to address Thread Local Storage (TLS). FS +is usually managed by runtime code or a threading library. Variables +declared with the '__thread' storage class specifier are instantiated per +thread and the compiler emits the FS: address prefix for accesses to these +variables. Each thread has its own FS base address so common code can be +used without complex address offset calculations to access the per thread +instances. Applications should not use FS for other purposes when they use +runtimes or threading libraries which manage the per thread FS. + +The GS segment has no common use and can be used freely by +applications. GCC and Clang support GS based addressing via address space +identifiers. + +Reading and writing the FS/GS base address +------------------------------------------ + +There exist two mechanisms to read and write the FS/GS base address: + + - the arch_prctl() system call + + - the FSGSBASE instruction family + +Accessing FS/GS base with arch_prctl() +-------------------------------------- + + The arch_prctl(2) based mechanism is available on all 64-bit CPUs and all + kernel versions. + + Reading the base: + + arch_prctl(ARCH_GET_FS, &fsbase); + arch_prctl(ARCH_GET_GS, &gsbase); + + Writing the base: + + arch_prctl(ARCH_SET_FS, fsbase); + arch_prctl(ARCH_SET_GS, gsbase); + + The ARCH_SET_GS prctl may be disabled depending on kernel configuration + and security settings. + +Accessing FS/GS base with the FSGSBASE instructions +--------------------------------------------------- + + With the Ivy Bridge CPU generation Intel introduced a new set of + instructions to access the FS and GS base registers directly from user + space. These instructions are also supported on AMD Family 17H CPUs. The + following instructions are available: + + =============== =========================== + RDFSBASE %reg Read the FS base register + RDGSBASE %reg Read the GS base register + WRFSBASE %reg Write the FS base register + WRGSBASE %reg Write the GS base register + =============== =========================== + + The instructions avoid the overhead of the arch_prctl() syscall and allow + more flexible usage of the FS/GS addressing modes in user space + applications. This does not prevent conflicts between threading libraries + and runtimes which utilize FS and applications which want to use it for + their own purpose. + +FSGSBASE instructions enablement +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + The instructions are enumerated in CPUID leaf 7, bit 0 of EBX. If + available /proc/cpuinfo shows 'fsgsbase' in the flag entry of the CPUs. + + The availability of the instructions does not enable them + automatically. The kernel has to enable them explicitly in CR4. The + reason for this is that older kernels make assumptions about the values in + the GS register and enforce them when GS base is set via + arch_prctl(). Allowing user space to write arbitrary values to GS base + would violate these assumptions and cause malfunction. + + On kernels which do not enable FSGSBASE the execution of the FSGSBASE + instructions will fault with a #UD exception. + + The kernel provides reliable information about the enabled state in the + ELF AUX vector. If the HWCAP2_FSGSBASE bit is set in the AUX vector, the + kernel has FSGSBASE instructions enabled and applications can use them. + The following code example shows how this detection works:: + + #include <sys/auxv.h> + #include <elf.h> + + /* Will be eventually in asm/hwcap.h */ + #ifndef HWCAP2_FSGSBASE + #define HWCAP2_FSGSBASE (1 << 1) + #endif + + .... + + unsigned val = getauxval(AT_HWCAP2); + + if (val & HWCAP2_FSGSBASE) + printf("FSGSBASE enabled\n"); + +FSGSBASE instructions compiler support +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +GCC version 4.6.4 and newer provide instrinsics for the FSGSBASE +instructions. Clang 5 supports them as well. + + =================== =========================== + _readfsbase_u64() Read the FS base register + _readfsbase_u64() Read the GS base register + _writefsbase_u64() Write the FS base register + _writegsbase_u64() Write the GS base register + =================== =========================== + +To utilize these instrinsics <immintrin.h> must be included in the source +code and the compiler option -mfsgsbase has to be added. + +Compiler support for FS/GS based addressing +------------------------------------------- + +GCC version 6 and newer provide support for FS/GS based addressing via +Named Address Spaces. GCC implements the following address space +identifiers for x86: + + ========= ==================================== + __seg_fs Variable is addressed relative to FS + __seg_gs Variable is addressed relative to GS + ========= ==================================== + +The preprocessor symbols __SEG_FS and __SEG_GS are defined when these +address spaces are supported. Code which implements fallback modes should +check whether these symbols are defined. Usage example:: + + #ifdef __SEG_GS + + long data0 = 0; + long data1 = 1; + + long __seg_gs *ptr; + + /* Check whether FSGSBASE is enabled by the kernel (HWCAP2_FSGSBASE) */ + .... + + /* Set GS base to point to data0 */ + _writegsbase_u64(&data0); + + /* Access offset 0 of GS */ + ptr = 0; + printf("data0 = %ld\n", *ptr); + + /* Set GS base to point to data1 */ + _writegsbase_u64(&data1); + /* ptr still addresses offset 0! */ + printf("data1 = %ld\n", *ptr); + + +Clang does not provide the GCC address space identifiers, but it provides +address spaces via an attribute based mechanism in Clang 2.6 and newer +versions: + + ==================================== ===================================== + __attribute__((address_space(256)) Variable is addressed relative to GS + __attribute__((address_space(257)) Variable is addressed relative to FS + ==================================== ===================================== + +FS/GS based addressing with inline assembly +------------------------------------------- + +In case the compiler does not support address spaces, inline assembly can +be used for FS/GS based addressing mode:: + + mov %fs:offset, %reg + mov %gs:offset, %reg + + mov %reg, %fs:offset + mov %reg, %gs:offset diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst index d6eaaa5a35fc..a56070fc8e77 100644 --- a/Documentation/x86/x86_64/index.rst +++ b/Documentation/x86/x86_64/index.rst @@ -14,3 +14,4 @@ x86_64 Support fake-numa-for-cpusets cpu-hotplug-spec machinecheck + fsgs diff --git a/Documentation/x86/x86_64/machinecheck.rst b/Documentation/x86/x86_64/machinecheck.rst index e189168406fa..cea12ee97200 100644 --- a/Documentation/x86/x86_64/machinecheck.rst +++ b/Documentation/x86/x86_64/machinecheck.rst @@ -21,65 +21,13 @@ from /dev/mcelog. Normally mcelog should be run regularly from a cronjob. Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN (N = CPU number). -The directory contains some configurable entries: - -bankNctl - (N bank number) - - 64bit Hex bitmask enabling/disabling specific subevents for bank N - When a bit in the bitmask is zero then the respective - subevent will not be reported. - By default all events are enabled. - Note that BIOS maintain another mask to disable specific events - per bank. This is not visible here - -The following entries appear for each CPU, but they are truly shared -between all CPUs. - -check_interval - How often to poll for corrected machine check errors, in seconds - (Note output is hexadecimal). Default 5 minutes. When the poller - finds MCEs it triggers an exponential speedup (poll more often) on - the polling interval. When the poller stops finding MCEs, it - triggers an exponential backoff (poll less often) on the polling - interval. The check_interval variable is both the initial and - maximum polling interval. 0 means no polling for corrected machine - check errors (but some corrected errors might be still reported - in other ways) - -tolerant - Tolerance level. When a machine check exception occurs for a non - corrected machine check the kernel can take different actions. - Since machine check exceptions can happen any time it is sometimes - risky for the kernel to kill a process because it defies - normal kernel locking rules. The tolerance level configures - how hard the kernel tries to recover even at some risk of - deadlock. Higher tolerant values trade potentially better uptime - with the risk of a crash or even corruption (for tolerant >= 3). - - 0: always panic on uncorrected errors, log corrected errors - 1: panic or SIGBUS on uncorrected errors, log corrected errors - 2: SIGBUS or log uncorrected errors, log corrected errors - 3: never panic or SIGBUS, log all errors (for testing only) - - Default: 1 - - Note this only makes a difference if the CPU allows recovery - from a machine check exception. Current x86 CPUs generally do not. - -trigger - Program to run when a machine check event is detected. - This is an alternative to running mcelog regularly from cron - and allows to detect events faster. -monarch_timeout - How long to wait for the other CPUs to machine check too on a - exception. 0 to disable waiting for other CPUs. - Unit: us +The directory contains some configurable entries. See +Documentation/ABI/testing/sysfs-mce for more details. TBD document entries for AMD threshold interrupt configuration For more details about the x86 machine check architecture see the Intel and AMD architecture manuals from their developer websites. -For more details about the architecture see +For more details about the architecture see http://one.firstfloor.org/~andi/mce.pdf diff --git a/Documentation/x86/x86_64/mm.rst b/Documentation/x86/x86_64/mm.rst index e5053404a1ae..9798676bb0bf 100644 --- a/Documentation/x86/x86_64/mm.rst +++ b/Documentation/x86/x86_64/mm.rst @@ -19,7 +19,7 @@ Complete virtual memory map with 4-level page tables Note that as we get closer to the top of the address space, the notation changes from TB to GB and then MB/KB. - - "16M TB" might look weird at first sight, but it's an easier to visualize size + - "16M TB" might look weird at first sight, but it's an easier way to visualize size notation than "16 EB", which few will recognize at first sight as 16 exabytes. It also shows it nicely how incredibly large 64-bit address space is. @@ -140,10 +140,6 @@ The direct mapping covers all memory in the system up to the highest memory address (this means in some cases it can also include PCI memory holes). -vmalloc space is lazily synchronized into the different PML4/PML5 pages of -the processes using the page fault handler, with init_top_pgt as -reference. - We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual memory window (this size is arbitrary, it can be raised later if needed). The mappings are not part of any other kernel PGD and are only available diff --git a/Documentation/x86/x86_64/uefi.rst b/Documentation/x86/x86_64/uefi.rst index 88c3ba32546f..fbc30c9a071d 100644 --- a/Documentation/x86/x86_64/uefi.rst +++ b/Documentation/x86/x86_64/uefi.rst @@ -29,14 +29,14 @@ Mechanics be selected:: CONFIG_EFI=y - CONFIG_EFI_VARS=y or m # optional + CONFIG_EFIVAR_FS=y or m # optional - Create a VFAT partition on the disk - Copy the following to the VFAT partition: elilo bootloader with x86_64 support, elilo configuration file, kernel image built in first step and corresponding - initrd. Instructions on building elilo and its dependencies + initrd. Instructions on building elilo and its dependencies can be found in the elilo sourceforge project. - Boot to EFI shell and invoke elilo choosing the kernel image built diff --git a/Documentation/x86/xstate.rst b/Documentation/x86/xstate.rst new file mode 100644 index 000000000000..5cec7fb558d6 --- /dev/null +++ b/Documentation/x86/xstate.rst @@ -0,0 +1,74 @@ +Using XSTATE features in user space applications +================================================ + +The x86 architecture supports floating-point extensions which are +enumerated via CPUID. Applications consult CPUID and use XGETBV to +evaluate which features have been enabled by the kernel XCR0. + +Up to AVX-512 and PKRU states, these features are automatically enabled by +the kernel if available. Features like AMX TILE_DATA (XSTATE component 18) +are enabled by XCR0 as well, but the first use of related instruction is +trapped by the kernel because by default the required large XSTATE buffers +are not allocated automatically. + +Using dynamically enabled XSTATE features in user space applications +-------------------------------------------------------------------- + +The kernel provides an arch_prctl(2) based mechanism for applications to +request the usage of such features. The arch_prctl(2) options related to +this are: + +-ARCH_GET_XCOMP_SUPP + + arch_prctl(ARCH_GET_XCOMP_SUPP, &features); + + ARCH_GET_XCOMP_SUPP stores the supported features in userspace storage of + type uint64_t. The second argument is a pointer to that storage. + +-ARCH_GET_XCOMP_PERM + + arch_prctl(ARCH_GET_XCOMP_PERM, &features); + + ARCH_GET_XCOMP_PERM stores the features for which the userspace process + has permission in userspace storage of type uint64_t. The second argument + is a pointer to that storage. + +-ARCH_REQ_XCOMP_PERM + + arch_prctl(ARCH_REQ_XCOMP_PERM, feature_nr); + + ARCH_REQ_XCOMP_PERM allows to request permission for a dynamically enabled + feature or a feature set. A feature set can be mapped to a facility, e.g. + AMX, and can require one or more XSTATE components to be enabled. + + The feature argument is the number of the highest XSTATE component which + is required for a facility to work. + +When requesting permission for a feature, the kernel checks the +availability. The kernel ensures that sigaltstacks in the process's tasks +are large enough to accommodate the resulting large signal frame. It +enforces this both during ARCH_REQ_XCOMP_SUPP and during any subsequent +sigaltstack(2) calls. If an installed sigaltstack is smaller than the +resulting sigframe size, ARCH_REQ_XCOMP_SUPP results in -ENOSUPP. Also, +sigaltstack(2) results in -ENOMEM if the requested altstack is too small +for the permitted features. + +Permission, when granted, is valid per process. Permissions are inherited +on fork(2) and cleared on exec(3). + +The first use of an instruction related to a dynamically enabled feature is +trapped by the kernel. The trap handler checks whether the process has +permission to use the feature. If the process has no permission then the +kernel sends SIGILL to the application. If the process has permission then +the handler allocates a larger xstate buffer for the task so the large +state can be context switched. In the unlikely cases that the allocation +fails, the kernel sends SIGSEGV. + +Dynamic features in signal frames +--------------------------------- + +Dynamcally enabled features are not written to the signal frame upon signal +entry if the feature is in its initial configuration. This differs from +non-dynamic features which are always written regardless of their +configuration. Signal handlers can examine the XSAVE buffer's XSTATE_BV +field to determine if a features was written. diff --git a/Documentation/x86/zero-page.rst b/Documentation/x86/zero-page.rst index f088f5881666..45aa9cceb4f1 100644 --- a/Documentation/x86/zero-page.rst +++ b/Documentation/x86/zero-page.rst @@ -19,6 +19,7 @@ Offset/Size Proto Name Meaning 058/008 ALL tboot_addr Physical address of tboot shared page 060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information (struct ist_info) +070/008 ALL acpi_rsdp_addr Physical address of ACPI RSDP table 080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!! 090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!! 0A0/010 ALL sys_desc_table System description table (struct sys_desc_table), @@ -27,6 +28,7 @@ Offset/Size Proto Name Meaning 0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits 0C4/004 ALL ext_ramdisk_size ramdisk_size high 32bits 0C8/004 ALL ext_cmd_line_ptr cmd_line_ptr high 32bits +13C/004 ALL cc_blob_address Physical address of Confidential Computing blob 140/080 ALL edid_info Video mode setup (struct edid_info) 1C0/020 ALL efi_info EFI 32 information (struct efi_info) 1E0/004 ALL alt_mem_k Alternative mem check, in KB |