Age | Commit message | Author | Files | Lines |
|
Add documentation for TDX host kernel support. There is already one
file, Documentation/x86/tdx.rst, containing documentation for TDX guest
internals. Reuse it for TDX host kernel support as well.
Introduce a new menu level "TDX Guest Support", move the existing
material under it, and add a new menu for TDX host kernel support.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-19-dave.hansen%40intel.com
|
|
The first few generations of TDX hardware have an erratum. Triggering
it in Linux requires some kind of kernel bug involving relatively exotic
memory writes to TDX private memory and will manifest via
spurious-looking machine checks when reading the affected memory.
Make an effort to detect these TDX-induced machine checks and spit out
a new blurb to dmesg so folks do not think their hardware is failing.
== Background ==
Virtually all kernel memory accesses happen in full
cachelines. In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.
This problem is triggered by "partial" writes where a write transaction
of less than a cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
== Problem ==
A partial write to a TDX private memory cacheline will silently "poison"
the line. Subsequent reads will consume the poison and generate a
machine check. According to the TDX hardware spec, neither of these
things should have happened.
To add insult to injury, the Linux machine check code will present these as a
literal "Hardware error" when they were, in fact, a software-triggered
issue.
== Solution ==
In the end, this issue is hard to trigger. Rather than do something
rash (and incomplete) like unmap TDX private memory from the direct map,
improve the machine check handler.
Currently, the #MC handler doesn't distinguish whether the memory is
TDX private memory or not; it just dumps, for instance, the message below:
[...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
[...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
...
[...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[...] Kernel panic - not syncing: Fatal local machine check
Which says "Hardware Error" and "Data load in unrecoverable area of
kernel".
Ideally, it would be better for the log to say "software bug around TDX private
memory" instead of "Hardware Error".  But in reality, real hardware
memory errors can happen, and sadly such a software-triggered #MC cannot be
distinguished from a real hardware error.  Also, the error message is
parsed by the userspace tool 'mcelog', so changing the output may
break userspace.
So keep the "Hardware Error". The "Data load in unrecoverable area of
kernel" is also helpful, so keep it too.
Instead of modifying the above error log, improve it by printing an
additional TDX-related message to make the log look like:
...
[...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.
Adding this additional message requires determining whether the
memory page is TDX private memory.  There is no existing infrastructure
to do that.  Add an interface to query the TDX module to fill this gap.
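A minimal sketch of how the extra message could be derived (the query helper
and the exact hook into the #MC path are assumptions, not the actual patch):

static const char *mce_tdx_msg(struct mce *m)
{
        if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))      /* erratum present? */
                return NULL;
        if (!mce_usable_address(m))                     /* need a valid address */
                return NULL;
        if (!tdx_is_private_mem(m->addr))               /* hypothetical query helper */
                return NULL;

        return "TDX private memory error. Possible kernel bug.";
}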
== Impact ==
This issue requires some kind of kernel bug to trigger.
TDX private memory should never be mapped UC/WC. A partial write
originating from these mappings would require *two* bugs, first mapping
the wrong page, then writing the wrong memory. It would also be
detectable using traditional memory corruption techniques like
DEBUG_PAGEALLOC.
MOVNTI (and friends) could cause this issue with something like a simple
buffer overrun or use-after-free on the direct map. It should also be
detectable with normal debug techniques.
The one place where this might get nasty would be if the CPU read data
then wrote back the same data. That would trigger this problem but
would not, for instance, set off mechanisms like slab redzoning because
it doesn't actually corrupt data.
With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
TDX private memory would first need to be incorrectly mapped into the
I/O space and then a later DMA to that mapping would actually cause the
poisoning event.
[ dhansen: changelog tweaks ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com
|
|
TDX memory has integrity and confidentiality protections. Violations of
this integrity protection are supposed to only affect TDX operations and
are never supposed to affect the host kernel itself. In other words,
the host kernel should never, itself, see machine checks induced by the
TDX integrity hardware.
Alas, the first few generations of TDX hardware have an erratum. A
partial write to a TDX private memory cacheline will silently "poison"
the line. Subsequent reads will consume the poison and generate a
machine check. According to the TDX hardware spec, neither of these
things should have happened.
Virtually all kernel memory accesses happen in full
cachelines. In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.
This problem is triggered by "partial" writes where a write transaction
of less than a cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
With this erratum, there are additional things that need to be done.  To
prepare for those changes, add a CPU bug bit to indicate this erratum.
Note this bug reflects the hardware, thus it is detected regardless of
whether the kernel is built with TDX support or not.
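A minimal sketch of such a bug bit being forced during early CPU setup (the
bug-bit name and the affected-model list are assumptions):

static void check_tdx_erratum(struct cpuinfo_x86 *c)
{
        /*
         * These generations have the partial-write erratum: a non-temporal
         * or UC/WC partial write to TDX private memory poisons the line,
         * and a later read consumes the poison as a machine check.
         */
        switch (c->x86_model) {
        case INTEL_FAM6_SAPPHIRERAPIDS_X:
        case INTEL_FAM6_EMERALDRAPIDS_X:
                setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
        }
}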
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-17-dave.hansen%40intel.com
|
|
TDX is incompatible with hibernation and some ACPI sleep states.
Users must disable hibernation to use TDX. Users must also disable
TDX if they want to use ACPI S3 sleep.
This feels a bit wonky and asymmetric, but it avoids adding any new
command-line parameters for now. It can be improved if users hate it
too much.
Long version:
TDX cannot survive S3 and deeper states.  The hardware resets and
disables TDX completely when the platform goes to S3 or deeper.  Both TDX
guests and the TDX module get destroyed permanently.
The kernel uses S3 to support suspend-to-ram, and S4 or deeper states to
support hibernation. The kernel also maintains TDX states to track
whether it has been initialized and its metadata resource, etc. After
resuming from S3 or hibernation, these TDX states won't be correct
anymore.
Theoretically, the kernel can do more complicated things like resetting
TDX internal states and TDX module metadata before going to S3 or
deeper, and re-initializing the TDX module after resuming, etc., but there is
no way to save/restore TDX guests for now.
Until TDX supports full save and restore of TDX guests, there is little
value in handling the TDX module for suspend and hibernation alone.  To make
things simple, just choose to make TDX mutually exclusive with S3 and
hibernation.
Note the TDX module is initialized at runtime. To avoid having to deal
with the fuss of determining TDX state at runtime, just choose TDX vs S3
and hibernation at kernel early boot. It's a bad user experience if the
choice of TDX and S3/hibernation is done at runtime anyway, i.e., the
user can experience being able to do S3/hibernation but later becoming
unable to due to TDX being enabled.
Disable TDX in early kernel boot when hibernation support is available.
Currently there's no mechanism exposed by the hibernation code to allow
other kernel code to disable hibernation once and for all.  Users that want
TDX must disable hibernation, e.g. by using hibernate=no on the command
line.
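A minimal sketch of the early-boot check, assuming a tdx_init()-style
initcall (hibernation_available() is the existing kernel helper; the
surrounding function is illustrative):

static int __init tdx_init(void)
{
        if (hibernation_available()) {
                pr_err("tdx: initialization failed: hibernation support is enabled\n");
                return -ENODEV;
        }

        /* ... continue with TDX KeyID detection, etc. ... */
        return 0;
}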
Disable ACPI S3 when TDX is enabled by the BIOS.  For now the user needs
to disable TDX in the BIOS to use ACPI S3.  A new kernel command line
parameter can be added in the future if there's a need to let the user
disable TDX via the kernel command line.
Alternatively, the kernel could disable TDX when ACPI S3 is supported
and request the user to disable S3 to use TDX. But there's no existing
kernel command line to do that, and BIOS doesn't always have an option
to disable S3.
[ dhansen: subject / changelog tweaks ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-16-dave.hansen%40intel.com
|
|
After the global KeyID has been configured on all packages, initialize
all TDMRs so that all TDX-usable memory regions passed to the TDX module
become usable.
This is the last step of initializing the TDX module.
Initializing TDMRs can be time consuming on large memory systems as it
involves initializing all metadata entries for all pages that can be
used by TDX guests. Initializing different TDMRs can be parallelized.
For now to keep it simple, just initialize all TDMRs one by one. It can
be enhanced in the future.
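A sketch of the one-by-one loop (SEAMCALL wrapper and struct layouts are
assumptions):

static int init_tdmrs(struct tdmr_info_list *tdmr_list)
{
        int i;

        for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
                struct tdmr_info *tdmr = tdmr_entry(tdmr_list, i);
                u64 next;
                int ret;

                /*
                 * Each TDH.SYS.TDMR.INIT call initializes a chunk of PAMT
                 * entries and reports how far it got in 'next'; loop until
                 * the whole TDMR is covered.
                 */
                do {
                        ret = seamcall_tdmr_init(tdmr->base, &next);
                        if (ret)
                                return ret;
                } while (next < tdmr->base + tdmr->size);
        }

        return 0;
}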
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-15-dave.hansen%40intel.com
|
|
After the list of TDMRs and the global KeyID are configured to the TDX
module, the kernel needs to configure the key of the global KeyID on all
packages using TDH.SYS.KEY.CONFIG.
This SEAMCALL cannot run in parallel on different CPUs.  Loop over all online
CPUs and use smp_call_on_cpu() to call this SEAMCALL on the first CPU of
each package.
To keep things simple, this implementation takes no affirmative steps to
online cpus to make sure there's at least one cpu for each package. The
callers (aka. KVM) can ensure success by ensuring sufficient CPUs are
online for this to succeed.
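A sketch of the per-package loop (the SEAMCALL helper name is an assumption;
smp_call_on_cpu() and the topology helpers are existing kernel APIs):

static int config_global_keyid(void)
{
        cpumask_var_t packages;
        int cpu, ret = 0;

        if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
                return -ENOMEM;

        for_each_online_cpu(cpu) {
                /* Skip CPUs whose package has already been handled. */
                if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
                                             packages))
                        continue;

                /*
                 * The SEAMCALL cannot run concurrently on different CPUs,
                 * so serialize it via smp_call_on_cpu().
                 */
                ret = smp_call_on_cpu(cpu, do_global_key_config, NULL, true);
                if (ret)
                        break;
        }

        free_cpumask_var(packages);
        return ret;
}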
Intel hardware doesn't guarantee cache coherency across different
KeyIDs. The PAMTs are transitioning from being used by the kernel
mapping (KeyId 0) to the TDX module's "global KeyID" mapping.
This means that the kernel must flush any dirty KeyID-0 PAMT cachelines
before the TDX module uses the global KeyID to access the PAMTs.
Otherwise, if those dirty cachelines were written back, they would
corrupt the TDX module's metadata. Aside: This corruption would be
detected by the memory integrity hardware on the next read of the memory
with the global KeyID. The result would likely be fatal to the system
but would not impact TDX security.
Following the TDX module specification, flush cache before configuring
the global KeyID on all packages. Given the PAMT size can be large
(~1/256th of system RAM), just use WBINVD on all CPUs to flush.
If TDH.SYS.KEY.CONFIG fails, the TDX module may already have "converted"
some memory for TDX module use. Convert the memory back so that it can
be safely used by the kernel again. Note that this is slower than it
should be because of the "partial write machine check" erratum which
affects TDX-capable hardware.
Also refactor and introduce a new helper: tdmr_do_pamt_func(). This
takes a TDMR and runs a function on its PAMT. It looks a _bit_ odd to
pass a function pointer around like this, but its use is pretty narrow
and it does eliminate what would otherwise be some copying and pasting.
[ dhansen: * munge changelog as usual
* remove weird (*pamd_func)() syntax ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-14-dave.hansen%40intel.com
|
|
The TDX module uses a private KeyID as the "global KeyID" for mapping
things like the PAMT and other TDX metadata. This KeyID has already
been reserved when detecting TDX during the kernel early boot.
Now that the "TD Memory Regions" (TDMRs) are fully built, pass them to
the TDX module together with the global KeyID.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-13-dave.hansen%40intel.com
|
|
As the last step of constructing TDMRs, populate reserved areas for all
TDMRs.  Cover all memory holes and PAMTs with a TDMR reserved area.
[ dhansen: trim down changelog ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-12-dave.hansen%40intel.com
|
|
The TDX module uses additional metadata to record things like which
guest "owns" a given page of memory. This metadata, referred as
Physical Address Metadata Table (PAMT), essentially serves as the
'struct page' for the TDX module. PAMTs are not reserved by hardware
up front. They must be allocated by the kernel and then given to the
TDX module during module initialization.
TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must
be a physically contiguous area from a Convertible Memory Region (CMR).
However, the PAMTs which track pages in one TDMR do not need to reside
within that TDMR but can be anywhere in CMRs. If one PAMT overlaps with
any TDMR, the overlapping part must be reported as a reserved area in
that particular TDMR.
Use alloc_contig_pages() since PAMT must be a physically contiguous area
and it may be potentially large (~1/256th of the size of the given TDMR).
The downside is that alloc_contig_pages() may fail at runtime.  One (bad)
mitigation is to launch a TDX guest early during system boot to get
those PAMTs allocated early, but the only real fix is to add a
boot option to allocate or reserve PAMTs during kernel boot.
This is imperfect but will be improved on later.
TDX only supports a limited number of reserved areas per TDMR to cover
both PAMTs and memory holes within the given TDMR. If many PAMTs are
allocated within a single TDMR, the reserved areas may not be sufficient
to cover all of them.
Adopt the following policies when allocating PAMTs for a given TDMR (see
the sketch after this list):
- Allocate the three PAMTs of the TDMR in one contiguous chunk to minimize
the total number of reserved areas consumed for PAMTs.
- Try to first allocate the PAMT from the local node of the TDMR for better
NUMA locality.
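A sketch of the allocation implied by these policies (helper name is an
assumption; alloc_contig_pages() prefers 'nid' but may fall back to other
nodes in the online-node mask):

static struct page *alloc_tdmr_pamt(unsigned long pamt_npages, int nid)
{
        /* One contiguous chunk covers all three PAMTs of the TDMR. */
        return alloc_contig_pages(pamt_npages, GFP_KERNEL, nid,
                                  &node_online_map);
}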
Also dump out how many pages are allocated for PAMTs when the TDX module
is initialized successfully. This helps answer the eternal "where did
all my memory go?" questions.
[ dhansen: merge in error handling cleanup ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-11-dave.hansen%40intel.com
|
|
Start working through the multiple steps needed to construct a list of
"TD Memory Regions" (TDMRs) to cover all TDX-usable memory regions.
The kernel configures TDX-usable memory regions by passing a list of
"TD Memory Regions" (TDMRs) to the TDX module.  Each TDMR contains
the information of the base/size of a memory region, the base/size of the
associated Physical Address Metadata Table (PAMT) and a list of reserved
areas in the region.
Do the first step: fill out a number of TDMRs to cover all TDX memory
regions.  To keep it simple, always try to use one TDMR for each memory
region.  For this first step, only set up the base/size of each TDMR.
Each TDMR must be 1G aligned and its size must be in 1G granularity.
This implies that one TDMR could cover multiple memory regions.  If a
memory region spans a 1GB boundary and the former part is already
covered by the previous TDMR, just use a new TDMR for the remaining
part.
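A sketch of the alignment arithmetic (names assumed): a region [start, end)
is covered by a TDMR spanning the enclosing 1G-aligned range; if the aligned
start is already covered by the previous TDMR, only the remainder gets a new
TDMR.

#define TDMR_ALIGNMENT  SZ_1G

static u64 tdmr_base(u64 region_start)
{
        return ALIGN_DOWN(region_start, TDMR_ALIGNMENT);
}

static u64 tdmr_end(u64 region_end)
{
        return ALIGN(region_end, TDMR_ALIGNMENT);
}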
TDX only supports a limited number of TDMRs.  Disable TDX if all TDMRs
are consumed but there are more memory regions to cover.
There are fancier things that could be done like trying to merge
adjacent TDMRs. This would allow more pathological memory layouts to be
supported. But, current systems are not even close to exhausting the
existing TDMR resources in practice. For now, keep it simple.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-10-dave.hansen%40intel.com
|
|
After the kernel selects all TDX-usable memory regions, the kernel needs
to pass those regions to the TDX module via data structure "TD Memory
Region" (TDMR).
Add a placeholder to construct a list of TDMRs (in multiple steps) to
cover all TDX-usable memory regions.
=== Long Version ===
TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums. Not all memory
satisfies these requirements.
As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR). During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees. The list of these
ranges is available to the kernel by querying the TDX module.
The TDX architecture needs additional metadata to record things like
which TD guest "owns" a given page of memory. This metadata essentially
serves as the 'struct page' for the TDX module. The space for this
metadata is not reserved by the hardware up front and must be allocated
by the kernel and given to the TDX module.
Since this metadata consumes space, the VMM can choose whether or not to
allocate it for a given area of convertible memory. If it chooses not
to, the memory cannot receive TDX protections and can not be used by TDX
guests as private memory.
For every memory region that the VMM wants to use as TDX memory, it sets
up a "TD Memory Region" (TDMR). Each TDMR represents a physically
contiguous convertible range and must also have its own physically
contiguous metadata table, referred to as a Physical Address Metadata
Table (PAMT), to track status for each page in the TDMR range.
Unlike a CMR, each TDMR requires 1G granularity and alignment. To
support physical RAM areas that don't meet those strict requirements,
each TDMR permits a number of internal "reserved areas" which can be
placed over memory holes. If PAMT metadata is placed within a TDMR it
must be covered by one of these reserved areas.
Let's summarize the concepts:
CMR - Firmware-enumerated physical ranges that support TDX. CMRs are
4K aligned.
TDMR - Physical address range which is chosen by the kernel to support
TDX. 1G granularity and alignment required. Each TDMR has
reserved areas where TDX memory holes and overlapping PAMTs can
be represented.
PAMT - Physically contiguous TDX metadata. One table for each page size
per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G
PAMT.
As one step of initializing the TDX module, the kernel configures
TDX-usable memory regions by passing a list of TDMRs to the TDX module.
Constructing the list of TDMRs consists of the following steps:
1) Fill out TDMRs to cover all memory regions that the TDX module will
use for TD memory.
2) Allocate and set up PAMT for each TDMR.
3) Designate reserved areas for each TDMR.
Add a placeholder to construct TDMRs to do the above steps.  To keep
things simple, just allocate enough space to hold the maximum number of
TDMRs up front.
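A sketch of how the placeholder and its three steps could be laid out
(function and struct names are assumptions; later patches fill in each step):

static int construct_tdmrs(struct list_head *tdx_memlist,
                           struct tdmr_info_list *tdmr_list,
                           struct tdx_tdmr_sysinfo *sysinfo)
{
        int ret;

        /* 1) Fill out one TDMR base/size per memory region */
        ret = fill_out_tdmrs(tdx_memlist, tdmr_list);
        if (ret)
                return ret;

        /* 2) Allocate and set up the PAMT for each TDMR */
        ret = tdmrs_set_up_pamt_all(tdmr_list, tdx_memlist,
                                    sysinfo->pamt_entry_size);
        if (ret)
                return ret;

        /* 3) Designate reserved areas for each TDMR */
        return tdmrs_populate_rsvd_areas_all(tdmr_list, tdx_memlist,
                                             sysinfo->max_reserved_per_tdmr);
}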
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-9-dave.hansen%40intel.com
|
|
The TDX module global metadata provides system-wide information about
the module.
TL;DR:
Use the TDH.SYS.RD SEAMCALL to tell if the module is good or not.
Long Version:
1) Only initialize TDX module with version 1.5 and later
TDX module 1.0 has some compatibility issues with later versions of the
module, as documented in the "Intel TDX module ABI incompatibilities
between TDX1.0 and TDX1.5" spec.  Don't bother with module versions that
do not have a stable ABI.
2) Get the essential global metadata for module initialization
TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
the TDX module. The kernel does this by constructing a list of "TD
Memory Regions" (TDMRs) to cover all these memory regions and passing
them to the TDX module.
Each TDMR is a TDX architectural data structure containing the memory
region that the TDMR covers, plus the information to track (within this
TDMR):
a) the "Physical Address Metadata Table" (PAMT) to track each TDX
memory page's status (such as which TDX guest "owns" a given page,
and
b) the "reserved areas" to tell memory holes that cannot be used as
TDX memory.
The kernel needs to get the following metadata from the TDX module to build the
list of TDMRs:
a) the maximum number of supported TDMRs
b) the maximum number of supported reserved areas per TDMR and,
c) the PAMT entry size for each TDX-supported page size.
== Implementation ==
The TDX module has two modes of fetching the metadata: one field at
a time, or all in one blob.  Use the one-field-at-a-time mode for now.  It is
slower, but there just are not enough fields now to justify the
complexity of extra unpacking.
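A sketch of the field-at-a-time mapping (struct modeled on the
'struct tdx_tdmr_sysinfo' mentioned below; the field-ID names are
assumptions).  Each listed field is fetched with one TDH.SYS.RD SEAMCALL and
stored at the given offset:

struct tdx_tdmr_sysinfo {
        u16 max_tdmrs;
        u16 max_reserved_per_tdmr;
        u16 pamt_entry_size[TDX_PS_NR];
};

struct field_mapping {
        u64 field_id;
        int offset;
};

#define TD_SYSINFO_MAP(_field_id, _member)                              \
        { .field_id = MD_FIELD_ID_##_field_id,                          \
          .offset   = offsetof(struct tdx_tdmr_sysinfo, _member) }

static const struct field_mapping fields[] = {
        TD_SYSINFO_MAP(MAX_TDMRS,               max_tdmrs),
        TD_SYSINFO_MAP(MAX_RESERVED_PER_TDMR,   max_reserved_per_tdmr),
        TD_SYSINFO_MAP(PAMT_4K_ENTRY_SIZE,      pamt_entry_size[TDX_PS_4K]),
        /* ... 2M and 1G PAMT entry sizes ... */
};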
The err_free_tdxmem=>out_put_tdxmem goto looks wonky by itself. But
it is the first of a bunch of error handling that will get stuck at
its site.
[ dhansen: clean up changelog and add a struct to map between
the TDX module fields and 'struct tdx_tdmr_sysinfo' ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-8-dave.hansen%40intel.com
|
|
Start working through the multiple steps needed to initialize the TDX module.
TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums. Not all memory
satisfies these requirements.
As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR). During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees. The list of these
ranges is available to the kernel by querying the TDX module.
CMRs tell the kernel which memory is TDX compatible. The kernel needs
to build a list of memory regions (out of CMRs) as "TDX-usable" memory
and pass them to the TDX module. Once this is done, those "TDX-usable"
memory regions are fixed for the module's lifetime.
To keep things simple, assume that all TDX-protected memory will come
from the page allocator. Make sure all pages in the page allocator
*are* TDX-usable memory.
As TDX-usable memory is a fixed configuration, take a snapshot of the
memory configuration from memblocks at the time of module initialization
(memblocks are modified on memory hotplug). This snapshot is used to
enable TDX support for *this* memory configuration only. Use a memory
hotplug notifier to ensure that no other RAM can be added outside of
this configuration.
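A sketch of such a notifier (helper names assumed): only memory covered by
the boot-time snapshot may be onlined.

static int tdx_memory_notifier(struct notifier_block *nb,
                               unsigned long action, void *v)
{
        struct memory_notify *mn = v;

        if (action != MEM_GOING_ONLINE)
                return NOTIFY_OK;

        /* An empty list means TDX isn't enabled; allow anything. */
        if (list_empty(&tdx_memlist))
                return NOTIFY_OK;

        /* Reject memory that is not covered by the snapshot. */
        return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
               NOTIFY_OK : NOTIFY_BAD;
}

static struct notifier_block tdx_memory_nb = {
        .notifier_call = tdx_memory_notifier,
};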
For this approach to work, all memblock memory regions at the time of module
initialization must be TDX convertible memory; otherwise module
initialization will fail in a later SEAMCALL when passing those regions
to the module.  This approach works when all boot-time "system RAM" is
TDX convertible memory and no non-TDX-convertible memory is hot-added
to the core-mm before module initialization.
For instance, on the first generation of TDX machines, neither CXL memory
nor NVDIMM is TDX convertible memory.  Using the kmem driver to hot-add
any CXL memory or NVDIMM to the core-mm before module initialization
will result in a failure to initialize the module.  The SEAMCALL error
code will be available in dmesg to help the user understand the
failure.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-7-dave.hansen%40intel.com
|
|
There are essentially two steps to get the TDX module ready:
1) Get each CPU ready to run TDX
2) Set up the shared TDX module data structures
Introduce and export (to KVM) the infrastructure to do both of these
pieces at runtime.
== Per-CPU TDX Initialization ==
Track the initialization status of each CPU with a per-cpu variable.
This avoids failures in the case of KVM module reloads and handles cases
where CPUs come online later.
Generally, the per-cpu SEAMCALLs happen first. But there's actually one
global call that has to happen before _any_ others (TDH_SYS_INIT). It's
analogous to the boot CPU having to do a bit of extra work just because
it happens to be the first one. Track if _any_ CPU has done this call
and then only actually do it during the first per-cpu init.
== Shared TDX Initialization ==
Create the global state function (tdx_enable()) as a simple placeholder.
The TODO list will be pared down as functionality is added.
Use a state machine protected by a mutex to make sure the work in
tdx_enable() will only be done once. This avoids failures if the KVM
module is reloaded.
A CPU must be made ready to run TDX before it can participate in
initializing the shared parts of the module. Any caller of tdx_enable()
needs to ensure that it can never run on a CPU which is not ready to
run TDX. It needs to be wary of CPU hotplug, preemption and the
VMX enabling state of any CPU on which it might run.
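A sketch of the mutex-protected state machine around tdx_enable() (enum and
variable names are assumptions):

enum tdx_module_status_t {
        TDX_MODULE_UNINITIALIZED,
        TDX_MODULE_INITIALIZED,
        TDX_MODULE_ERROR,
};

static enum tdx_module_status_t tdx_module_status;
static DEFINE_MUTEX(tdx_module_lock);

int tdx_enable(void)
{
        int ret;

        mutex_lock(&tdx_module_lock);

        switch (tdx_module_status) {
        case TDX_MODULE_UNINITIALIZED:
                ret = __tdx_enable();   /* one-time shared initialization */
                break;
        case TDX_MODULE_INITIALIZED:
                ret = 0;                /* already done, e.g. KVM reload */
                break;
        default:
                ret = -EINVAL;          /* a previous attempt failed */
                break;
        }

        mutex_unlock(&tdx_module_lock);
        return ret;
}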
== Why runtime instead of boot time? ==
The TDX module can be initialized only once in its lifetime. Instead
of always initializing it at boot time, this implementation chooses an
"on demand" approach: don't initialize TDX until there is a real need (e.g.
when requested by KVM).  This approach has the following advantages:
1) It avoids consuming the memory that must be allocated by the kernel and
given to the TDX module as metadata (~1/256th of the TDX-usable memory),
and also saves the CPU cycles of initializing the TDX module (and the
metadata) when TDX is not used at all.
2) The TDX module design allows it to be updated while the system is
running. The update procedure shares quite a few steps with this "on
demand" initialization mechanism. The hope is that much of "on demand"
mechanism can be shared with a future "update" mechanism. A boot-time
TDX module implementation would not be able to share much code with the
update mechanism.
3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM
code mucks with VMX enabling. If the TDX module were to be initialized
separately from KVM (like at boot), the boot code would need to be
taught how to muck with VMX enabling and KVM would need to be taught how
to cope with that. Making KVM itself responsible for TDX initialization
lets the rest of the kernel stay blissfully unaware of VMX.
[ dhansen: completely reorder/rewrite changelog ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-6-dave.hansen%40intel.com
|
|
The SEAMCALLs involved during the TDX module initialization are not
expected to fail. In fact, they are not expected to return any non-zero
code (except the "running out of entropy error", which can be handled
internally already).
Add yet another set of SEAMCALL wrappers, which treat all non-zero
return codes as errors, to support printing the SEAMCALL error upon failure
during module initialization.  Note the TDX module initialization doesn't
use the _saved_ret() variant, thus no wrapper is added for it.
SEAMCALL assembly can also return kernel-defined error codes for three
special cases: 1) TDX isn't enabled by the BIOS; 2) TDX module isn't
loaded; 3) CPU isn't in VMX operation. Whether they can legally happen
depends on the caller, so leave it to the caller to print an error message
when desired.
Also convert the SEAMCALL error codes to the kernel error codes in the
new wrappers so that each SEAMCALL caller doesn't have to repeat the
conversion.
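A sketch of such a wrapper (wrapper and error-code names are assumptions;
sc_retry() is the entropy-retry helper from the commit below): the three
kernel-defined special cases are converted silently, anything else is
printed.

static int seamcall_prerr(sc_func_t func, u64 fn, struct tdx_module_args *args)
{
        u64 sret = sc_retry(func, fn, args);

        if (sret == TDX_SUCCESS)
                return 0;
        if (sret == TDX_SEAMCALL_VMFAILINVALID)
                return -ENODEV;         /* TDX module not loaded */
        if (sret == TDX_SEAMCALL_GP)
                return -EOPNOTSUPP;     /* TDX not enabled by the BIOS */
        if (sret == TDX_SEAMCALL_UD)
                return -EACCES;         /* CPU not in VMX operation */

        pr_err("SEAMCALL (0x%016llx) failed: 0x%016llx\n", fn, sret);
        return -EIO;
}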
[ dhansen: Align the register dump with show_regs(). Zero-pad the
contents, split on two lines and use consistent spacing. ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-5-dave.hansen%40intel.com
|
|
Some SEAMCALLs use the RDRAND hardware and can fail for the same reasons
as RDRAND. Use the kernel RDRAND retry logic for them.
There are three __seamcall*() variants. Do the SEAMCALL retry in common
code and add a wrapper for each of them.
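A sketch of the common retry helper and the per-variant wrappers (names
assumed; RDRAND_RETRY_LOOPS is the kernel's existing RDRAND retry count):

typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);

static inline u64 sc_retry(sc_func_t func, u64 fn,
                           struct tdx_module_args *args)
{
        int retry = RDRAND_RETRY_LOOPS;
        u64 ret;

        do {
                ret = func(fn, args);
        } while (ret == TDX_RND_NO_ENTROPY && --retry);

        return ret;
}

#define seamcall(_fn, _args)            sc_retry(__seamcall, (_fn), (_args))
#define seamcall_ret(_fn, _args)        sc_retry(__seamcall_ret, (_fn), (_args))
#define seamcall_saved_ret(_fn, _args)  sc_retry(__seamcall_saved_ret, (_fn), (_args))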
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-4-dave.hansen%40intel.com
|
|
TDX capable platforms are locked to X2APIC mode and cannot fall back to
the legacy xAPIC mode when TDX is enabled by the BIOS. TDX host support
requires x2APIC. Make INTEL_TDX_HOST depend on X86_X2APIC.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/
Link: https://lore.kernel.org/all/20231208170740.53979-3-dave.hansen%40intel.com
|
|
TDX supports 4K, 2M and 1G page sizes. The corresponding values are
defined by the TDX module spec and used as TDX module ABI. Currently,
they are used in try_accept_one() when the TDX guest tries to accept a
page.  However, try_accept_one() currently uses hard-coded magic values.
Define TDX supported page sizes as macros and get rid of the hard-coded
values in try_accept_one(). TDX host support will need to use them too.
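A minimal sketch of what such macros could look like (the numeric values
follow the TDX GPA page-size encoding; treat them as an assumption rather
than a quote of the patch):

#define TDX_PS_4K       0
#define TDX_PS_2M       1
#define TDX_PS_1G       2
#define TDX_PS_NR       (TDX_PS_1G + 1)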
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/all/20231208170740.53979-2-dave.hansen%40intel.com
|
|
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.
Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME. The memory encryption hardware underpinning MKTME is also
used for Intel TDX. TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs. The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.
During machine boot, TDX microcode verifies that the BIOS programmed TDX
private KeyIDs consistently and correctly across all CPU
packages. The MSRs are locked in this state after verification. This
is why MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration:
it indicates not just that the hardware supports TDX, but that all the
boot-time security checks passed.
The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests. The TDX module will be initialized by
the KVM subsystem when KVM wants to use TDX.
Detect platform TDX support by detecting TDX private KeyIDs.
The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
to protect its metadata. Each TDX guest also needs a TDX KeyID for its
own protection. Just use the first TDX KeyID as the global KeyID and
leave the rest for TDX guests. If no TDX KeyID is left for TDX guests,
disable TDX as initializing the TDX module alone is useless.
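A minimal sketch of the detection (helper name is an assumption;
MSR_IA32_MKTME_KEYID_PARTITIONING and rdmsr_safe() are existing kernel
symbols):

static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
                                            u32 *nr_tdx_keyids)
{
        u32 nr_mktme_keyids;
        int ret;

        /*
         * The MSR read only succeeds when the BIOS enabled TDX and the
         * boot-time consistency checks passed.
         */
        ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING,
                         &nr_mktme_keyids, nr_tdx_keyids);
        if (ret || !*nr_tdx_keyids)
                return -EINVAL;

        /* TDX KeyIDs start right after the last MKTME KeyID. */
        *tdx_keyid_start = nr_mktme_keyids + 1;

        return 0;
}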
[ dhansen: add X86_FEATURE, replace helper function ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-1-dave.hansen%40intel.com
|
|
|
|
The block layer doesn't support logical block sizes smaller than 512
bytes.  The nvme spec doesn't support sizes that small either, but the driver
isn't checking to make sure the device responded with usable data.
Failing to catch this will result in a kernel bug, either from a
division by zero when stacking, or a zero length bio.
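A hypothetical helper expressing the check described above (not the driver's
actual fix; SECTOR_SHIFT == 9 and PAGE_SHIFT are existing kernel constants):

static inline bool nvme_lba_shift_usable(unsigned int lba_shift)
{
        /*
         * The block layer cannot handle logical blocks smaller than
         * 512 bytes or larger than the page size.
         */
        return lba_shift >= SECTOR_SHIFT && lba_shift <= PAGE_SHIFT;
}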
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
A request queue quiesce may interrupt the flush sequence: the original
request may have been marked as COMPLETE, but can't get finished because
of the queue quiesce.
This is fine from the driver's viewpoint, because the flush sequence is a
block layer concept and isn't related to the driver.
However, a driver (such as dm-rq) can call blk_mq_queue_inflight() to count
and drain inflight requests; the wait and drain then never finish because
the completed-but-not-finished flush request is counted as inflight.
Fix this issue by not counting a completed flush data request as inflight
when the queue is quiesced.
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: David Jeffery <djeffery@redhat.com>
Cc: John Pittman <jpittman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231201085605.577730-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The variable phys is defined as (struct resource *) which aligns with
the printk format specifier %pr. Taking the address of it results in a
value of type (struct resource **) which is incompatible with the format
specifier %pr. Therefore, remove the address of operator (&).
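A tiny illustration of the fix (illustrative function, not the driver code):

static void report_region(struct resource *phys)
{
        pr_debug("region: %pr\n", &phys);  /* wrong: struct resource ** */
        pr_debug("region: %pr\n", phys);   /* right: struct resource *  */
}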
Fixes: a5bf3cfce8cb ("iommu: Implement of_iommu_get_resv_regions()")
Signed-off-by: Daniel Mentz <danielmentz@google.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20231108062226.928985-1-danielmentz@google.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Since rethook::handler is an RCU-managed pointer (so that readers can
notice whether the rethook has been stopped/unregistered), it should be
an __rcu pointer and be accessed through the appropriate functions.  This
ensures the appropriate memory barriers are used when accessing it.  OTOH,
rethook::data is never changed, so we don't need to check it in
get_kretprobe().
NOTE: To avoid sparse warning, rethook::handler is defined by a raw
function pointer type with __rcu instead of rethook_handler_t.
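An abbreviated sketch of the resulting declaration and accesses (only the
handler member is shown; surrounding code is assumed):

struct rethook {
        void (__rcu *handler)(struct rethook_node *, void *,
                              unsigned long, struct pt_regs *);
        /* ... other members unchanged ... */
};

/* Writer side, e.g. when stopping/unregistering the rethook: */
rcu_assign_pointer(rh->handler, NULL);

/* Reader side: */
handler = rcu_dereference_check(rh->handler, rcu_read_lock_any_held());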
Link: https://lore.kernel.org/all/170126066201.398836.837498688669005979.stgit@devnote2/
Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311241808.rv9ceuAh-lkp@intel.com/
Tested-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
It seems that the pointer-to-kretprobe "rp" within the kretprobe_holder is
RCU-managed, based on the (non-rethook) implementation of get_kretprobe().
The thought behind this patch is to make use of the RCU API where possible
when accessing this pointer so that the needed barriers are always in place
and to self-document the code.
The __rcu annotation to "rp" allows for sparse RCU checking. Plain writes
done to the "rp" pointer are changed to make use of the RCU macro for
assignment. For the single read, the implementation of get_kretprobe()
is simplified by making use of an RCU macro which accomplishes the same,
but note that the log warning text will be more generic.
I did find that there is a difference in assembly generated between the
usage of the RCU macros vs without. For example, on arm64, when using
rcu_assign_pointer(), the corresponding store instruction is a
store-release (STLR) which has an implicit barrier. When normal assignment
is done, a regular store (STR) is found. In the macro case, this seems to
be a result of rcu_assign_pointer() using smp_store_release() when the
value to write is not NULL.
Link: https://lore.kernel.org/all/20231122132058.3359-1-inwardvessel@gmail.com/
Fixes: d741bf41d7c7 ("kprobes: Remove kretprobe hash")
Cc: stable@vger.kernel.org
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
objpool overrun stress with test_objpool on OrangePi5+ SBC triggered the
following kernel warnings:
WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100
This message is from objpool.c:168:
WARN_ON_ONCE(tail - head > pool->nr_objs);
The overrun test case is to validate the case that pre-allocated objects
are insufficient: 8 objects are pre-allocated for each node and consumer
thread per node tries to grab 16 objects in a row. The testing system is
OrangePI 5+, with RK3588, a big.LITTLE SOC with 4x A76 and 4x A55. When
disabling either all 4 big or all 4 little cores, the overrun tests run well,
but with big and little cores mixed together, the overrun test would
always end up in an overrun loop.  It's likely the memory timing differences
between big and little cores cause this trouble.  Here are the debugging data
of objpool_try_get_slot after try_cmpxchg_release:
objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278
The local copies of 'head' and 'last' were 278 and 276, and reloading of
'slot->head' and 'slot->last' got 279 and 278. After try_cmpxchg_release
'slot->head' became 'head + 1', which is correct. But what's wrong here
is the stale value of 'last', and that stale value of 'last' finally led to
the overrun of 'head'.
Memory updates of 'last' and 'head' are performed in push() and pop()
independently, which could be the culprit leading to this out-of-order
visibility of 'last' and 'head'.  So for objpool_try_get_slot(), it's
not enough to only check the condition 'head != slot->last'; the implicit
condition 'last - head <= nr_objs' must also be explicitly asserted to
guarantee 'last' is always behind 'head' before the object is retrieved.
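A conceptual sketch of the added check (not the verbatim fix; field names
follow lib/objpool.c):

        while (head != READ_ONCE(slot->last)) {
                /*
                 * If the locally loaded 'last' has run more than nr_objs
                 * ahead of 'head', the two loads are inconsistent: reload
                 * 'head' and retry instead of retrieving an object.
                 */
                if (READ_ONCE(slot->last) - head > pool->nr_objs) {
                        head = READ_ONCE(slot->head);
                        continue;
                }
                /* ... retrieve the object at 'head', then advance 'head' ... */
        }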
This patch checks and, if needed, reloads 'head' and 'last' to ensure
'last' is behind 'head' at the time of object retrieval.  Performance
testing shows the average impact is about 0.1% for X86_64 and 1.12% for
ARM64.  Here are the results:
OS: Debian 10 X86_64, Linux 6.6rc
HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
1T 2T 4T 8T 16T
native: 49543304 99277826 199017659 399070324 795185848
objpool: 29909085 59865637 119692073 239750369 478005250
objpool+: 29879313 59230743 119609856 239067773 478509029
32T 48T 64T 96T 128T
native: 1596927073 2390099988 2929397330 3183875848 3257546602
objpool: 957553042 1435814086 1680872925 2043126796 2165424198
objpool+: 956476281 1434491297 1666055740 2041556569 2157415622
OS: Debian 11 AARCH64, Linux 6.6rc
HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
1T 2T 4T 8T 16T
native: 30890508 60399915 123111980 242257008 494002946
objpool: 14742531 28883047 57739948 115886644 232455421
objpool+: 14107220 29032998 57286084 113730493 232232850
24T 32T 48T 64T 96T
native: 746406039 1000174750 1493236240 1998318364 2942911180
objpool: 349164852 467284332 702296756 934459713 1387898285
objpool+: 348388180 462750976 696606096 927865887 1368402195
Link: https://lore.kernel.org/all/20231114115148.298821-1-wuqiang.matt@bytedance.com/
Fixes: b4edb8d2d464 ("lib: objpool added: ring-array based lockless MPMC")
Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3.
These helper functions are needed for KFD to export and import DMABufs
the right way without duplicating the tracking of DMABufs associated with
GEM objects while ensuring that move notifier callbacks are working as
intended.
CC: Christian König <christian.koenig@amd.com>
CC: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Commit 42c5a3b04bf6 refactored the KPTI init code in a way that results
in the use of non-global kernel mappings even on systems that have no
need for it, and even when KPTI has been disabled explicitly via the
command line.
Ensure that this only happens when we have decided (based on the
detected system-wide CPU features) that KPTI should be enabled.
Fixes: 42c5a3b04bf6 ("arm64: Split kpti_install_ng_mappings()")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20231127120049.2258650-6-ardb@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Drop the vfio_file_iommu_group() stub and instead unconditionally declare
the function to fudge around a KVM wart where KVM tries to do symbol_get()
on vfio_file_iommu_group() (and other VFIO symbols) even if CONFIG_VFIO=n.
Ensuring the symbol is always declared fixes a PPC build error when
modules are also disabled, in which case symbol_get() simply points at the
address of the symbol (with some attributes shenanigans). Because KVM
does symbol_get() instead of directly depending on VFIO, the lack of a
fully defined symbol is not problematic (ugly, but "fine").
arch/powerpc/kvm/../../../virt/kvm/vfio.c:89:7:
error: attribute declaration must precede definition [-Werror,-Wignored-attributes]
fn = symbol_get(vfio_file_iommu_group);
^
include/linux/module.h:805:60: note: expanded from macro 'symbol_get'
#define symbol_get(x) ({ extern typeof(x) x __attribute__((weak,visibility("hidden"))); &(x); })
^
include/linux/vfio.h:294:35: note: previous definition is here
static inline struct iommu_group *vfio_file_iommu_group(struct file *file)
^
arch/powerpc/kvm/../../../virt/kvm/vfio.c:89:7:
error: attribute declaration must precede definition [-Werror,-Wignored-attributes]
fn = symbol_get(vfio_file_iommu_group);
^
include/linux/module.h:805:65: note: expanded from macro 'symbol_get'
#define symbol_get(x) ({ extern typeof(x) x __attribute__((weak,visibility("hidden"))); &(x); })
^
include/linux/vfio.h:294:35: note: previous definition is here
static inline struct iommu_group *vfio_file_iommu_group(struct file *file)
^
2 errors generated.
Although KVM is firmly in the wrong (there is zero reason for KVM to build
virt/kvm/vfio.c when VFIO is disabled), fudge around the error in VFIO as
the stub is unnecessary and doesn't serve its intended purpose (KVM is the
only external user of vfio_file_iommu_group()), and there is an in-flight
series to clean up the entire KVM<->VFIO interaction, i.e. fixing this in
KVM would result in more churn in the long run, and the stub needs to go
away regardless.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308251949.5IiaV0sz-lkp@intel.com
Closes: https://lore.kernel.org/oe-kbuild-all/202309030741.82aLACDG-lkp@intel.com
Closes: https://lore.kernel.org/oe-kbuild-all/202309110914.QLH0LU6L-lkp@intel.com
Link: https://lore.kernel.org/all/0-v1-08396538817d+13c5-vfio_kvm_kconfig_jgg@nvidia.com
Link: https://lore.kernel.org/all/20230916003118.2540661-1-seanjc@google.com
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Fixes: c1cce6d079b8 ("vfio: Compile vfio_group infrastructure optionally")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20231130001000.543240-1-seanjc@google.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
When compiling with gcc version 14.0.0 20231126 (experimental)
and CONFIG_FORTIFY_SOURCE=y, I've noticed the following:
In file included from ./include/linux/string.h:295,
from ./include/linux/bitmap.h:12,
from ./include/linux/cpumask.h:12,
from ./arch/x86/include/asm/paravirt.h:17,
from ./arch/x86/include/asm/cpuid.h:62,
from ./arch/x86/include/asm/processor.h:19,
from ./arch/x86/include/asm/cpufeature.h:5,
from ./arch/x86/include/asm/thread_info.h:53,
from ./include/linux/thread_info.h:60,
from ./arch/x86/include/asm/preempt.h:9,
from ./include/linux/preempt.h:79,
from ./include/linux/spinlock.h:56,
from ./include/linux/wait.h:9,
from ./include/linux/wait_bit.h:8,
from ./include/linux/fs.h:6,
from fs/smb/client/smb2pdu.c:18:
In function 'fortify_memcpy_chk',
inlined from '__SMB2_close' at fs/smb/client/smb2pdu.c:3480:4:
./include/linux/fortify-string.h:588:25: warning: call to '__read_overflow2_field'
declared with attribute warning: detected read beyond size of field (2nd parameter);
maybe use struct_group()? [-Wattribute-warning]
588 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
and:
In file included from ./include/linux/string.h:295,
from ./include/linux/bitmap.h:12,
from ./include/linux/cpumask.h:12,
from ./arch/x86/include/asm/paravirt.h:17,
from ./arch/x86/include/asm/cpuid.h:62,
from ./arch/x86/include/asm/processor.h:19,
from ./arch/x86/include/asm/cpufeature.h:5,
from ./arch/x86/include/asm/thread_info.h:53,
from ./include/linux/thread_info.h:60,
from ./arch/x86/include/asm/preempt.h:9,
from ./include/linux/preempt.h:79,
from ./include/linux/spinlock.h:56,
from ./include/linux/wait.h:9,
from ./include/linux/wait_bit.h:8,
from ./include/linux/fs.h:6,
from fs/smb/client/cifssmb.c:17:
In function 'fortify_memcpy_chk',
inlined from 'CIFS_open' at fs/smb/client/cifssmb.c:1248:3:
./include/linux/fortify-string.h:588:25: warning: call to '__read_overflow2_field'
declared with attribute warning: detected read beyond size of field (2nd parameter);
maybe use struct_group()? [-Wattribute-warning]
588 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In both cases, the fortification logic interprets the calls to 'memcpy()' as
attempts to copy an amount of data which exceeds the size of the specified
field (i.e. more than 8 bytes from a __le64 value) and thus issues an overread
warning. Both of these warnings may be silenced by using the convenient
'struct_group()' quirk.
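A sketch of the struct_group() quirk with made-up field names (not the
actual SMB2 wire structures): grouping the members that are legitimately
copied together gives fortify a single field whose size matches the memcpy().

struct smb_file_times {
        struct_group(times,             /* copied as one 16-byte block */
                __le64 CreationTime;
                __le64 LastAccessTime;
        );
        __le64 ChangeTime;
};

static void copy_times(void *dst, const struct smb_file_times *src)
{
        /* Stays within the bounds of the grouped field: no overread warning. */
        memcpy(dst, &src->times, sizeof(src->times));
}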
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
It's been reported that the runtime PM on KONTRON SinglePC (PCI SSID
1734:1232) caused a stall of playback after a bunch of invocations.
(FWIW, this looks like a timing issue, and the stall happens rather
on the controller side.)
As a workaround, disable the default power-save on this platform.
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20231130151321.9813-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
On the RZ/G3S SMARC Carrier II board, which has RGMII connections between
Ethernet MACs and PHYs, it has been discovered that doing unbind/bind for
the ravb driver in a loop leads to wrong speed and duplex for the Ethernet
links and broken connectivity (the connectivity cannot be restored even by
bringing the interface down/up).  Before doing unbind/bind, the Ethernet
interfaces were configured through systemd.  The shell commands used to
do unbind/bind were:
$ cd /sys/bus/platform/drivers/ravb/
$ while :; do echo 11c30000.ethernet > unbind ; \
echo 11c30000.ethernet > bind; done
It has been discovered that there is a race between IOCTLs initiated by
systemd in response to a successful bind and the
"ravb_write(ndev, CCC_OPC_RESET, CCC)" call in ravb_remove(), as
follows:
1/ as a result of bind success, user space opens/configures the
interfaces through an IOCTL; the following stack trace has been
identified on RZ/G3S:
Call trace:
dump_backtrace+0x9c/0x100
show_stack+0x20/0x38
dump_stack_lvl+0x48/0x60
dump_stack+0x18/0x28
ravb_open+0x70/0xa58
__dev_open+0xf4/0x1e8
__dev_change_flags+0x198/0x218
dev_change_flags+0x2c/0x80
devinet_ioctl+0x640/0x708
inet_ioctl+0x1e4/0x200
sock_do_ioctl+0x50/0x108
sock_ioctl+0x240/0x358
__arm64_sys_ioctl+0xb0/0x100
invoke_syscall+0x50/0x128
el0_svc_common.constprop.0+0xc8/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x34/0xb8
el0t_64_sync_handler+0xc0/0xc8
el0t_64_sync+0x190/0x198
2/ this call may execute concurrently with ravb_remove() as the
unbind/bind operation was executed in a loop
3/ if the operation mode is changed to RESET (through
ravb_write(ndev, CCC_OPC_RESET, CCC) call in ravb_remove())
while the above ravb_open() is in progress, it may lead to the MAC
(or PHY, or MAC-PHY connection; the exact point hasn't been identified
at the moment) being broken, thus the Ethernet connectivity fails to
restore.
The simple fix for this is to move ravb_write(ndev, CCC_OPC_RESET, CCC))
after unregister_netdev() to avoid resetting the controller while the
netdev interface is still registered.
To avoid future issues in ravb_remove(), the patch follows the proper order
of operations in ravb_remove(): the reverse order compared with ravb_probe().
This avoids the described races, as the IOCTLs as well as unregister_netdev()
(called now at the beginning of ravb_remove()) call rtnl_lock() before
continuing, and the IOCTLs check (through devinet_ioctl()) whether the device
is still registered just after taking the lock:
int devinet_ioctl(struct net *net, unsigned int cmd, struct ifreq *ifr)
{
// ...
rtnl_lock();
ret = -ENODEV;
dev = __dev_get_by_name(net, ifr->ifr_name);
if (!dev)
goto done;
// ...
done:
rtnl_unlock();
out:
return ret;
}
Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In case ravb_phy_start() returns with error the settings applied in
ravb_dmac_init() are not reverted (e.g. config mode). For this call
ravb_stop_dma() on failure path of ravb_open().
Fixes: a0d2f20650e8 ("Renesas Ethernet AVB PTP clock driver")
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
ravb_phy_start() may fail. If that happens, the TX queues will remain
started. Thus, move the netif_tx_start_all_queues() after PHY is
successfully initialized.
Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Hardware manual of RZ/G3S (and RZ/G2L) specifies the following on the
description of CXR35 register (chapter "PHY interface select register
(CXR35)"): "After release reset, make write-access to this register before
making write-access to other registers (except MDIOMOD). Even if not need
to change the value of this register, make write-access to this register
at least one time. Because RGMII/MII MODE is recognized by accessing this
register".
The setup procedure for EMAC module (chapter "Setup procedure" of RZ/G3S,
RZ/G2L manuals) specifies the E-MAC.CXR35 register is the first EMAC
register that is to be configured.
Note [A] from chapter "PHY interface select register (CXR35)" specifies
the following:
[A] The case which CXR35 SEL_XMII is used for the selection of RGMII/MII
in APB Clock 100 MHz.
(1) To use the RGMII interface, set H'03E8_0000 to this register.
(2) To use the MII interface, set H'03E8_0002 to this register.
Take these indications into account.
Fixes: 1089877ada8d ("ravb: Add RZ/G2L MII interface support")
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
pm_runtime_get_sync() may return an error. In case it returns with an error
dev->power.usage_count needs to be decremented. pm_runtime_resume_and_get()
takes care of this. Thus use it.
Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
reset_control_deassert() could return an error. Some devices cannot work
if reset signal de-assert operation fails. To avoid this check the return
code of reset_control_deassert() in ravb_probe() and take proper action.
Along with it, the free_netdev() call from the error path was moved after
reset_control_assert() on its own label (out_free_netdev) to free
netdev in case reset_control_deassert() fails.
Fixes: 0d13a1a464a0 ("ravb: Add reset support")
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Since pci_free_irq_vectors() sets pdev->msix_enabled to 0 in its call to
pci_msix_shutdown(), wx->msix_entries is never freed.  Reorder the lines
to fix the memory leak.
Cc: stable@vger.kernel.org
Fixes: 3f703186113f ("net: libwx: Add irq flow functions")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://lore.kernel.org/r/20231128095928.1083292-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There is an error when an interface has the following conditions:
- PF is in an aggregate (bond)
- PF has VFs created on it
- bond is in a state where it is failed-over to the secondary interface
- A VF reset is issued on one or more of those VFs
The issue is triggered by the originating PF trying to rebuild or
reconfigure the VF resources. Since the bond has failed over to the
secondary interface, the queue contexts are in a modified state.
To fix this issue, have the originating interface reclaim its resources
prior to the tear-down and rebuild or reconfigure. Then after the process
is complete, move the resources back to the currently active interface.
There are multiple paths that can be used depending on what triggered the
event, so create a helper function to move the queues and use paired calls
to the helper (back to origin, process, then move back to active interface)
under the same lag_mutex lock.
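Illustrative only, with hypothetical helper and field names, to show the
paired calls under the single lag_mutex hold described above:
static void ice_reset_vf_with_lag(struct ice_pf *pf, struct ice_vf *vf)
{
        mutex_lock(&pf->lag_mutex);

        /* reclaim the VF queue contexts on the originating PF first */
        ice_lag_move_vf_queues(pf, vf, pf->lag->pf_port);       /* hypothetical helper */

        ice_vf_rebuild(vf);                                     /* hypothetical rebuild step */

        /* hand the queues back to the currently active (failed-over) port */
        ice_lag_move_vf_queues(pf, vf, pf->lag->active_port);   /* hypothetical helper */

        mutex_unlock(&pf->lag_mutex);
}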
Fixes: 1e0f9881ef79 ("ice: Flesh out implementation of support for SRIOV on bonded interface")
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/20231127212340.1137657-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix the cifs filesystem implementations of FALLOC_FL_INSERT_RANGE, in
smb3_insert_range(), to set i_size after extending the file on the server
and before we do the copy to open the gap (as we don't clean up the EOF
marker if the copy fails).
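A condensed sketch of the corrected ordering; the cifs helpers are
abbreviated to assumed names here:
static int smb3_insert_range_sketch(struct inode *inode, loff_t off, loff_t len)
{
        loff_t old_eof = i_size_read(inode);
        int rc;

        /* 1. extend the file on the server by 'len' bytes */
        rc = server_extend_file(inode, old_eof + len);          /* assumed helper */
        if (rc)
                return rc;

        /* 2. raise i_size now: if the copy below fails, the EOF marker is
         *    never cleaned up, so it must already be correct at this point */
        truncate_setsize(inode, old_eof + len);

        /* 3. copy [off, old_eof) upwards by 'len' bytes to open the gap */
        return server_copy_range(inode, off, off + len, old_eof - off); /* assumed helper */
}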
Fixes: 7fe6fe95b936 ("cifs: add FALLOC_FL_INSERT_RANGE support")
Cc: stable@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara <pc@manguebit.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Fix the cifs filesystem implementations of FALLOC_FL_ZERO_RANGE, in
smb3_zero_range(), to set i_size after extending the file on the server.
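The same idea, sketched for smb3_zero_range() with the helper name assumed;
only the EOF-extension branch is shown:
static int smb3_zero_range_sketch(struct inode *inode, loff_t off, loff_t len)
{
        int rc = 0;

        if (off + len > i_size_read(inode)) {
                rc = server_extend_file(inode, off + len);      /* assumed helper */
                if (rc)
                        return rc;
                /* set i_size right after extending the file on the server */
                i_size_write(inode, off + len);
        }
        return rc;
}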
Fixes: 72c419d9b073 ("cifs: fix smb3_zero_range so it can expand the file-size when required")
Cc: stable@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara <pc@manguebit.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If device_register() fails, the device's refcount is not 0, so the name
allocated in dev_set_name() is leaked. Fix this by calling put_device(), so
that the name is freed in the kobject_cleanup() callback.
unreferenced object 0xffff9d99035c7a90 (size 8):
comm "systemd-udevd", pid 168, jiffies 4294672386 (age 152.089s)
hex dump (first 8 bytes):
66 77 30 2e 30 00 ff ff fw0.0...
backtrace:
[<00000000e1d62bac>] __kmem_cache_alloc_node+0x1e9/0x360
[<00000000bbeaff31>] __kmalloc_node_track_caller+0x44/0x1a0
[<00000000491f2fb4>] kvasprintf+0x67/0xd0
[<000000005b960ddc>] kobject_set_name_vargs+0x1e/0x90
[<00000000427ac591>] dev_set_name+0x4e/0x70
[<000000003b4e447d>] create_units+0xc5/0x110
Since fw_unit_release() will be called on the error path, move
fw_device_get() before the device_register() call to keep it balanced with
the fw_device_put() in fw_unit_release().
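A minimal sketch of the shape of the fix; the helper below is illustrative,
not the driver's actual function:
static int register_one_unit(struct fw_device *device, struct fw_unit *unit)
{
        int err;

        /* taken before registering, balanced by fw_device_put() in fw_unit_release() */
        fw_device_get(device);

        err = device_register(&unit->device);
        if (err < 0) {
                /* the refcount is not zero after a failed device_register(), so
                 * drop it: kobject_cleanup() then runs fw_unit_release(), which
                 * frees the dev_set_name() string and puts the fw_device back */
                put_device(&unit->device);
                return err;
        }
        return 0;
}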
Cc: stable@vger.kernel.org
Fixes: 1fa5ae857bb1 ("driver core: get rid of struct device's bus_id string array")
Fixes: a1f64819fe9f ("firewire: struct device - replace bus_id with dev_name(), dev_set_name()")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
|
|
This adds a test where both ends of an AF_UNIX socket pair are put into a
BPF map. This ensures that when we tear down the AF_UNIX pair we don't hit
any ordering or reference-counting issues on the sockmap side.
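The rough shape of such a test in userspace, assuming a pre-created
BPF_MAP_TYPE_SOCKMAP whose fd is map_fd; error handling is trimmed:
#include <sys/socket.h>
#include <unistd.h>
#include <bpf/bpf.h>

static void add_unix_pair_to_sockmap(int map_fd)
{
        int sfd[2], zero = 0, one = 1;

        socketpair(AF_UNIX, SOCK_STREAM, 0, sfd);

        /* insert both ends of the pair, then tear everything down */
        bpf_map_update_elem(map_fd, &zero, &sfd[0], BPF_ANY);
        bpf_map_update_elem(map_fd, &one, &sfd[1], BPF_ANY);

        close(sfd[0]);
        close(sfd[1]);
}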
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20231129012557.95371-3-john.fastabend@gmail.com
|
|
AF_UNIX stream sockets come in pairs, so sending on one end looks up the
peer socket as part of the send operation. It is possible, however, to put
just one end of the pair in a BPF map. This currently increments the refcnt
on the sock in the sockmap to ensure it is not freed by the stack before
sockmap cleans up its state and stops any skbs being sent/received to that
socket.
But we missed a case. If the peer socket is closed, it will be freed by the
stack. However, the paired socket can still be referenced from the BPF
sockmap side because we hold a reference there. Then, if we send traffic
through BPF sockmap to that socket, its send logic will dereference the
freed peer, creating a use-after-free and the following splat:
[59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
[59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
[...]
[59.905468] Call Trace:
[59.905787] <TASK>
[59.906066] dump_stack_lvl+0x130/0x1d0
[59.908877] print_report+0x16f/0x740
[59.910629] kasan_report+0x118/0x160
[59.912576] sk_wake_async+0x31/0x1b0
[59.913554] sock_def_readable+0x156/0x2a0
[59.914060] unix_stream_sendmsg+0x3f9/0x12a0
[59.916398] sock_sendmsg+0x20e/0x250
[59.916854] skb_send_sock+0x236/0xac0
[59.920527] sk_psock_backlog+0x287/0xaa0
To fix, let BPF sockmap hold a refcnt on both the socket in the sockmap and
its paired socket. It wasn't obvious how to contain the fix to the bpf_unix
logic. The primary problem with keeping this logic in bpf_unix was: in the
sock close() we could handle the deref by having a close handler. But when
the psock is destroyed through a map delete operation, we wouldn't get any
signal through the proto struct other than it being replaced. If we do the
deref from the proto replace, it's too early, because the sk_pair must only
be dereferenced after the backlog worker has been stopped.
Given all this, it seems best to just cache it at the end of the psock and
eat 8B for the af_unix and vsock users. Note that dgram sockets are OK
because they already handle the locking.
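A sketch of that approach; the struct and hook names below are illustrative
and only the hold/put pairing matters:
struct psock_sketch {
        struct sock *sk;
        struct sock *sk_pair;   /* cached AF_UNIX/vsock peer, the extra 8B */
};

static void psock_cache_pair(struct psock_sketch *psock, struct sock *sk)
{
        struct sock *sk_pair = unix_peer(sk);   /* peer of an AF_UNIX stream socket */

        sock_hold(sk_pair);     /* keep the peer alive while sk sits in the sockmap */
        psock->sk_pair = sk_pair;
}

static void psock_destroy(struct psock_sketch *psock)
{
        /* dropped only here, after the backlog worker can no longer run */
        if (psock->sk_pair)
                sock_put(psock->sk_pair);
}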
Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20231129012557.95371-2-john.fastabend@gmail.com
|
|
The legacy region at 0x7F000 maps to valid registers on GC 9.4.3 SOCs. Use
the 0x1A000 offset instead as the MMIO register remap region.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why]
A number of DML parameters related to HostVM were either missing or
being set incorrectly, which may cause inaccuracies in calculating
margins and determining BW limitations.
[How]
Correct these values where needed and populate the missing values.
Cc: stable@vger.kernel.org
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Acked-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Signed-off-by: Taimur Hassan <syed.hassan@amd.com>
Signed-off-by: Roman Li <Roman.Li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why]
Memory leaks of gang_ctx_bo and wptr_bo.
[How]
Free gang_ctx_bo and wptr_bo in pqm_uninit.
v2: add a common function pqm_clean_queue_resource to
free a queue's resources.
v3: reset pdd->pqd.num_gws when destroying the GWS queue.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Check the smu v13_0_0 SKU type to select the EEPROM I2C address.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.1.x
|
|
[Why]
DTBCLK is enabled on idle and it will burn power.
[How]
There are a few issues here:
- Always enabling DTBCLK on clock manager init
- Setting refclk when DTBCLK is supposed to be disabled
- Not applying the correctly calculated refclk, but instead the base
value, which might be zero
On dtbclk_en change we'll message PMFW to enable or disable the clock
accordingly.
The DTBDTO will then be based on refclk, but it will be set to the default
fixed value if nothing was calculated in DML despite the clock being
considered enabled.
Reviewed-by: Charlene Liu <charlene.liu@amd.com>
Acked-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
For 'AMDGPU_FAMILY_SI' cards, si_common_early_init() initializes
'didt_rreg' and 'didt_wreg' to NULL. However,
amdgpu_debugfs_regs_didt_read() and amdgpu_debugfs_regs_didt_write() use
'RREG32_DIDT' and 'WREG32_DIDT' without checking these pointers, and no
other 'amdgpu_ip_block_version' that sets them is added for
'AMDGPU_FAMILY_SI'.
So, add a NULL pointer check before calling them.
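A sketch of the added guard; the surrounding debugfs read loop is elided:
static ssize_t didt_read_sketch(struct amdgpu_device *adev, char __user *buf,
                                size_t size, loff_t *pos)
{
        /* SI parts leave didt_rreg/didt_wreg NULL in si_common_early_init()
         * and no other IP block fills them in, so bail out before
         * RREG32_DIDT() or WREG32_DIDT() would go through a NULL callback. */
        if (!adev->didt_rreg)
                return -EOPNOTSUPP;

        /* ... existing register read loop using RREG32_DIDT() ... */
        return size;
}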
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Lu Yao <yaolu@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|