|
Until now, every DAT table allocated to map a guest was put in a
linked list. The page->lru field of struct page was used to keep track
of which pages were being used, and when the gmap was torn down, the
list was walked and all pages were freed.
This patch gets rid of the usage of page->lru. Page tables are now
freed by recursively walking the DAT table tree.
Since s390_unlist_old_asce() is now useless, remove it.
Acked-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-12-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-12-imbrenda@linux.ibm.com>
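For illustration only, a minimal sketch of the recursive-free idea described
above; dat_entry_valid(), dat_entry_is_leaf(), dat_entry_to_table(),
ENTRIES_PER_TABLE and TABLE_SIZE are hypothetical placeholders, not the real
s390 region/segment/page-table formats:
	/* Sketch: recursively walk a DAT table tree and free each table,
	 * instead of keeping every allocated table on a page->lru list. */
	static void dat_free_tree(unsigned long *table, int level)
	{
		unsigned int i;

		if (level > 0) {	/* leaf page tables have no children */
			for (i = 0; i < ENTRIES_PER_TABLE; i++) {
				if (!dat_entry_valid(table[i]) ||
				    dat_entry_is_leaf(table[i]))
					continue;
				dat_free_tree(dat_entry_to_table(table[i]),
					      level - 1);
			}
		}
		free_pages((unsigned long)table, get_order(TABLE_SIZE));
	}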
|
|
The host_to_guest radix tree will now map userspace addresses to guest
addresses, instead of userspace addresses to segment tables.
When segment tables and page tables are needed, they are found using an
additional gmap_table_walk().
This gets rid of all usage of page->index for non-shadow gmaps.
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-11-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-11-imbrenda@linux.ibm.com>
|
|
Move some gmap shadowing functions from mm/gmap.c to kvm/kvm-s390.c and
the newly created kvm/gmap-vsie.c.
This is a step toward removing gmap from mm.
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-10-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-10-imbrenda@linux.ibm.com>
|
|
Add gpa_to_hva(), which uses memslots, and use it to replace all uses
of gmap_translate().
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-9-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-9-imbrenda@linux.ibm.com>
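A minimal sketch of what a memslot-based gpa_to_hva() translation can look
like, built on the generic KVM helpers gfn_to_hva() and gpa_to_gfn(); the
real function's signature in the series may differ:
	/* Sketch: translate a guest physical address to a host virtual
	 * address via memslots rather than via gmap_translate(). */
	static unsigned long gpa_to_hva(struct kvm *kvm, gpa_t gpa)
	{
		unsigned long hva;

		hva = gfn_to_hva(kvm, gpa_to_gfn(gpa));
		if (kvm_is_error_hva(hva))
			return hva;
		/* gfn_to_hva() returns a page-aligned address; add back the
		 * offset within the page. */
		return hva + offset_in_page(gpa);
	}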
|
|
All gmap page faults are already handled in kvm by the function
kvm_s390_handle_dat_fault(); only a few users of gmap_fault() remained,
all within kvm.
Convert those calls to use kvm_s390_handle_dat_fault() instead.
Remove gmap_fault() entirely since it has no more users.
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-8-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-8-imbrenda@linux.ibm.com>
|
|
Refactor the existing page fault handling code to use __kvm_faultin_pfn().
This is possible now that memslots are always present.
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-7-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-7-imbrenda@linux.ibm.com>
|
|
Move gmap related functions from kernel/uv into kvm.
Create a new file to collect gmap-related functions.
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
[fixed unpack_one(), thanks mhartmay@linux.ibm.com]
Link: https://lore.kernel.org/r/20250123144627.312456-6-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-6-imbrenda@linux.ibm.com>
|
|
With the latest patch, attempting to create a memslot from userspace
will result in an EEXIST error for UCONTROL VMs, instead of EINVAL,
since the new memslot will collide with the internal memslot. There is
no simple way to bring back the previous behaviour.
This is not a problem, but the test needs to be fixed accordingly.
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-5-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-5-imbrenda@linux.ibm.com>
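Illustrative sketch of the kind of selftest adjustment described above; the
vm_fd, mem and MEM_SIZE names are made up and the real test is structured
differently:
	/* Sketch: a memslot that collides with the internal identity-map
	 * memslot now fails with EEXIST instead of EINVAL. */
	struct kvm_userspace_memory_region region = {
		.slot = 0,
		.guest_phys_addr = 0,
		.memory_size = MEM_SIZE,		/* hypothetical constant */
		.userspace_addr = (uintptr_t)mem,	/* hypothetical buffer */
	};
	int rc = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

	TEST_ASSERT(rc == -1 && errno == EEXIST,
		    "slot collides with the internal memslot: rc %d errno %d",
		    rc, errno);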
|
|
Create a fake memslot for ucontrol VMs. The fake memslot identity-maps
userspace.
Now memslots will always be present, and ucontrol is not a special case
anymore.
Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-4-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-4-imbrenda@linux.ibm.com>
|
|
Wrap the call to KVM_BUG; this reduces code duplication and improves
readability.
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250123144627.312456-3-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-3-imbrenda@linux.ibm.com>
|
|
Exempt KVM-internal memslots from the KVM_MEM_MAX_NR_PAGES restriction, as
the limit on the number of pages exists purely to play nice with dirty
bitmap operations, which use 32-bit values to index the bitmaps, and dirty
logging isn't supported for KVM-internal memslots.
Link: https://lore.kernel.org/all/20240802205003.353672-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250123144627.312456-2-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-2-imbrenda@linux.ibm.com>
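Conceptually the change boils down to applying the size cap only to
user-visible memslots, roughly as in this hedged sketch; the
kvm_internal_memslot() helper and the exact placement of the check are
assumptions, not the literal upstream diff:
	/* Sketch: KVM_MEM_MAX_NR_PAGES only exists so that dirty-bitmap
	 * offsets fit in 32 bits; dirty logging is never enabled on
	 * KVM-internal memslots, so those may exceed the limit. */
	if (mem->memory_size >> PAGE_SHIFT > KVM_MEM_MAX_NR_PAGES &&
	    !kvm_internal_memslot(mem->slot))	/* hypothetical helper */
		return -EINVAL;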
|
|
Now that we no longer use page->index and the page refcount explicitly,
let's avoid messing with "struct page" completely.
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Message-ID: <20250107154344.1003072-5-david@redhat.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Let's stop messing with the page refcount, and use a flag that is set /
cleared atomically to remember whether a vsie page is currently in use.
Note that we could use a page flag, or a lower bit of the scb_gpa. Let's
keep it simple for now, we have sufficient space.
While at it, stop passing "struct kvm *" to put_vsie_page(), it's
unused.
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Message-ID: <20250107154344.1003072-4-david@redhat.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
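A minimal sketch of the atomically set/cleared "in use" marker described
above, using standard bitops; the flag bit and the vsie_page->flags field
are assumptions:
	#define VSIE_PAGE_IN_USE	0	/* hypothetical bit number */

	/* Sketch: mark a vsie_page busy without touching the page refcount. */
	static bool get_vsie_page(struct vsie_page *vsie_page)
	{
		/* false if the page was already claimed by someone else */
		return !test_and_set_bit(VSIE_PAGE_IN_USE, &vsie_page->flags);
	}

	static void put_vsie_page(struct vsie_page *vsie_page)
	{
		clear_bit(VSIE_PAGE_IN_USE, &vsie_page->flags);
	}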
|
|
Let's stop using page->index, and instead use a field inside "struct
vsie_page" to hold that value. We have plenty of space left in there.
This is one part of stopping using "struct page" when working with vsie
pages. We place the "page_to_virt(page)" strategically, so the next
cleanups require less churn.
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Message-ID: <20250107154344.1003072-3-david@redhat.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
We try to reuse the same vsie page when re-executing the vsie with a
given SCB address. The result is that we use the same shadow SCB --
residing in the vsie page -- and can avoid flushing the TLB when
re-running the vsie on a CPU.
So, when we allocate a fresh vsie page, or when we reuse a vsie page for
a different SCB address -- reusing the shadow SCB in different context --
we set ihcpu=0xffff to trigger the flush.
However, after we looked up the SCB address in the radix tree, but before
we grabbed the vsie page by raising the refcount to 2, someone could reuse
the vsie page for a different SCB address, adjusting page->index and the
radix tree. In that case, we would be reusing the vsie page with a
wrong page->index.
Another corner case is that we might set the SCB address for a vsie
page, but fail the insertion into the radix tree. Whoever would reuse
that page would remove the corresponding radix tree entry -- which might
now be a valid entry pointing at another page, resulting in the wrong
vsie page getting removed from the radix tree.
Let's handle such races better, by validating that the SCB address of a
vsie page didn't change after we grabbed it (no reuse for a different
SCB; the alternative would be performing another tree lookup), and by
setting the SCB address to invalid until the insertion in the tree
succeeded (SCB addresses are aligned to 512, so ULONG_MAX is invalid).
These scenarios are rare, the effects are a bit unclear, and these issues were
only found by code inspection. Let's CC stable to be safe.
Fixes: a3508fbe9dc6 ("KVM: s390: vsie: initial support for nested virtualization")
Cc: stable@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Message-ID: <20250107154344.1003072-2-david@redhat.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
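The two safeguards described above roughly correspond to the following
sketch; field names and helpers are inferred from the description, not taken
from the literal patch:
	/* Sketch 1: after claiming the vsie page, verify it still belongs to
	 * the SCB address we looked up; otherwise someone reused it. */
	vsie_page = radix_tree_lookup(&kvm->arch.vsie.addr_to_page, addr >> 9);
	if (vsie_page && get_vsie_page(vsie_page)) {
		if (vsie_page->scb_gpa != addr) {	/* raced with reuse */
			put_vsie_page(vsie_page);
			vsie_page = NULL;
		}
	}

	/* Sketch 2: keep scb_gpa invalid until the radix tree insertion has
	 * succeeded; SCB addresses are 512-byte aligned, so ULONG_MAX can
	 * never be a valid value. */
	vsie_page->scb_gpa = ULONG_MAX;
	if (radix_tree_insert(&kvm->arch.vsie.addr_to_page, addr >> 9, vsie_page))
		goto out_put;
	vsie_page->scb_gpa = addr;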
|
|
Now that we've standardized on the byte-by-byte implementation of CRC32
as the only generic implementation (see previous commit for the
rationale), remove the code for the other implementations.
Tested with crc_kunit.
Link: https://lore.kernel.org/r/20250123212904.118683-3-ebiggers@kernel.org
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
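For reference, the retained generic implementation is the classic
byte-by-byte (slice-by-1) table lookup. A self-contained, standalone sketch
of the technique follows (table generation included; the kernel's lib/crc32.c
uses a pre-generated table instead):
	#include <stdint.h>
	#include <stddef.h>

	/* Slice-by-1 CRC32 (reflected polynomial 0xedb88320), the only
	 * generic CRC32 variant that remains after this cleanup. */
	static uint32_t crc32_table[256];

	static void crc32_init(void)
	{
		for (uint32_t i = 0; i < 256; i++) {
			uint32_t c = i;
			for (int k = 0; k < 8; k++)
				c = (c >> 1) ^ ((c & 1) ? 0xedb88320u : 0);
			crc32_table[i] = c;
		}
	}

	static uint32_t crc32_le(uint32_t crc, const uint8_t *p, size_t len)
	{
		while (len--)
			crc = (crc >> 8) ^ crc32_table[(crc ^ *p++) & 0xff];
		return crc;
	}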
|
|
Make the following simplifications to the kconfig options for choosing
CRC implementations for CRC32 and CRC_T10DIF:
1. Make the option to disable the arch-optimized code be visible only
when CONFIG_EXPERT=y.
2. Make a single option control the inclusion of the arch-optimized code
for all enabled CRC variants.
3. Make CRC32_SARWATE (a.k.a. slice-by-1 or byte-by-byte) be the only
generic CRC32 implementation.
The result is there is now just one option, CRC_OPTIMIZATIONS, which is
default y and can be disabled only when CONFIG_EXPERT=y.
Rationale:
1. Enabling the arch-optimized code is nearly always the right choice.
However, people trying to build the tiniest kernel possible would
find some use in disabling it. Anything we add to CRC32 is de facto
unconditional, given that CRC32 gets selected by something in nearly
all kernels. And unfortunately enabling the arch CRC code does not
eliminate the need to build the generic CRC code into the kernel too,
due to CPU feature dependencies. The size of the arch CRC code will
also increase slightly over time as more CRC variants get added and
more implementations targeting different instruction set extensions
get added. Thus, it seems worthwhile to still provide an option to
disable it, but it should be considered an expert-level tweak.
2. Considering the use case described in (1), there doesn't seem to be
sufficient value in making the arch-optimized CRC code be
independently configurable for different CRC variants. Note also
that multiple variants were already grouped together, e.g.
CONFIG_CRC32 actually enables three different variants of CRC32.
3. The bit-by-bit implementation is uselessly slow, whereas slice-by-n
for n=4 and n=8 use tables that are inconveniently large: 4096 bytes
and 8192 bytes respectively, compared to 1024 bytes for n=1. Higher
n gives higher instruction-level parallelism, so higher n easily wins
on traditional microbenchmarks on most CPUs. However, the larger
tables, which are accessed randomly, can be harmful in real-world
situations where the dcache may be cold or useful data may need to be
evicted from the dcache. Meanwhile, today most architectures have
much faster CRC32 implementations using dedicated CRC32 instructions
or carryless multiplication instructions anyway, which make the
generic code obsolete in most cases especially on long messages.
Another reason for going with n=1 is that this is already what is
used by all the other CRC variants in the kernel. CRC32 was unique
in having support for larger tables. But as per the above this can
be considered an outdated optimization.
The standardization on slice-by-1 a.k.a. CRC32_SARWATE makes much of
the code in lib/crc32.c unused. A later patch will clean that up.
Link: https://lore.kernel.org/r/20250123212904.118683-2-ebiggers@kernel.org
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Move the change_cookie and subvol up to avoid two 4 byte holes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When retpolines and IBT are both disabled, the compiler is free to use
jump tables to optimize switch instructions. However, these are emitted
by Clang as absolute references into .rodata:
jmp *-0x7dfffe90(,%r9,8)
R_X86_64_32S .rodata+0x170
Given that this code will execute before that address in .rodata has even
been mapped, it is guaranteed to crash a SEV-SNP guest in a way that is
difficult to diagnose.
So disable jump tables when building this code. It would be better if we
could attach this annotation to the __head macro but this appears to be
impossible.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250127114334.1045857-6-ardb+git@google.com
|
|
Use '%u' instead of '%d' for unsigned int.
Link: https://lore.kernel.org/all/20241105011048.201629-1-luoyifan@cmss.chinamobile.com/
Fixes: 973780011106 ("tools/bootconfig: Suppress non-error messages")
Signed-off-by: Luo Yifan <luoyifan@cmss.chinamobile.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/infiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.
Created this by running an spatch followed by a sed command:
Spatch:
virtual patch
@
depends on !(file in "net")
disable optional_qualifier
@
identifier table_name != {
watchdog_hardlockup_sysctl,
iwcm_ctl_table,
ucma_ctl_table,
memory_allocation_profiling_sysctls,
loadpin_sysctl_table
};
@@
+ const
struct ctl_table table_name [] = { ... };
sed:
sed --in-place \
-e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
kernel/utsname_sysctl.c
Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
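The effect of the spatch above on a typical table, as a hedged before/after
sketch (the example table name is made up):
	/* before: the table lives in .data and its proc_handler pointers
	 * could in principle be overwritten at runtime */
	static struct ctl_table example_sysctl_table[] = {
		{
			.procname	= "example_value",
			.data		= &example_value,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec,
		},
	};

	/* after: const moves the array into .rodata */
	static const struct ctl_table example_sysctl_table[] = {
		{
			.procname	= "example_value",
			.data		= &example_value,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec,
		},
	};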
|
|
The enable_uring module parameter allows administrators to enable/disable
io-uring support for FUSE at runtime. However, disabling io-uring while
connections already have it enabled can lead to an inconsistent state.
Fix this by keeping io-uring enabled on connections that were already using
it, even if the module parameter is later disabled. This ensures active
FUSE mounts continue to function correctly.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
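Sketch of the idea, under the assumption that the connection latches the
module parameter once at io-uring setup time; the fc->io_uring field name and
the exact check sites are assumptions:
	/* Sketch: latch enable_uring into the connection when io-uring is set
	 * up, and test the per-connection flag afterwards, so flipping the
	 * module parameter later cannot leave an active mount half-disabled. */
	static bool enable_uring;	/* module parameter */

	static int fuse_uring_setup(struct fuse_conn *fc)
	{
		if (!enable_uring)
			return -EOPNOTSUPP;
		fc->io_uring = 1;	/* hypothetical per-connection flag */
		return 0;
	}

	static bool fuse_uring_enabled(struct fuse_conn *fc)
	{
		/* decide based on the latched flag, not on the module
		 * parameter, which may have changed in the meantime */
		return fc->io_uring;
	}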
|
|
All required parts are handled now, fuse-io-uring can
be enabled.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Avoid races and block request allocation until io-uring
queues are ready.
This is especially important for background requests, as bg request
completion might cause a lock order inversion of the typical queue->lock
and then fc->bg_lock:
fuse_request_end
    spin_lock(&fc->bg_lock);
    flush_bg_queue
        fuse_send_one
            fuse_uring_queue_fuse_req
                spin_lock(&queue->lock);
Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When the fuse-server terminates while the fuse-client or kernel
still has queued URING_CMDs, these commands retain references
to the struct file used by the fuse connection. This prevents
fuse_dev_release() from being invoked, resulting in a hung mount
point.
This patch addresses the issue by making queued URING_CMDs
cancelable, allowing fuse_dev_release() to proceed as expected
and preventing the mount point from hanging.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This prepares queueing and sending background requests through
io-uring.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This prepares queueing and sending foreground requests through
io-uring.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
These functions are also needed by fuse-over-io-uring.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
On teardown, struct file_operations::uring_cmd requests need to be
completed by calling io_uring_cmd_done().
Not completing all ring entries would result in busy io-uring tasks that
print warning messages at intervals, and in an unreleased struct file.
Additionally, the fuse connection, and with it the ring, can only get
released when all io-uring commands are completed.
Completion is done for ring entries that are:
a) in waiting state for new fuse requests - io_uring_cmd_done
is needed
b) already in userspace - io_uring_cmd_done through teardown
is not needed; the request can just get released. If the fuse server
is still active and commits such a ring entry, fuse_uring_cmd()
already checks whether the connection is active and then completes the
io-uring command itself with -ENOTCONN. I.e. special handling is not
needed.
This scheme is basically represented by the ring entry states
FRRS_WAIT and FRRS_USERSPACE.
Entries in state:
- FRRS_INIT: No action needed, they do not contribute to
ring->queue_refs yet.
- All other states: Are currently being processed by other tasks;
async teardown is needed and it has to wait for the two
states above. This could also be solved without an async
teardown task, but that would require additional if conditions
in hot code paths. Also, in my personal opinion, the code
looks cleaner with async teardown.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
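A hedged sketch of the FRRS_WAIT case: entries parked waiting for a new fuse
request are completed through io_uring_cmd_done() during teardown. The
fuse_ring_ent fields, the FRRS_TEARDOWN state and the io_uring_cmd_done()
arguments shown are assumptions:
	/* Sketch: a waiting ring entry still holds an uncompleted URING_CMD;
	 * complete it so that io-uring drops its file reference. */
	static void fuse_uring_stop_entry(struct fuse_ring_ent *ent,
					  unsigned int issue_flags)
	{
		if (ent->state != FRRS_WAIT)	/* FRRS_USERSPACE needs nothing */
			return;
		ent->state = FRRS_TEARDOWN;	/* hypothetical state */
		io_uring_cmd_done(ent->cmd, -ENOTCONN, 0, issue_flags);
	}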
|
|
NVIDIA is productizing the new Grace Blackwell superchip
SKU bearing device ID 0x2941.
Add the SKU devid to nvgrace_gpu_vfio_pci_table.
CC: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20250124183102.3976-5-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
In contrast to Grace Hopper systems, the HBM training has been moved
out of the UEFI on the Grace Blackwell systems. This reduces the system
bootup time significantly.
The onus of checking whether the HBM training has completed thus falls
on the module.
The HBM training status can be determined from a BAR0 register.
Similarly, another BAR0 register exposes the status of the CPU-GPU
chip-to-chip (C2C) cache coherent interconnect.
Based on testing, 30s is determined to be sufficient to ensure
initialization completion on all the Grace-based systems. Thus poll
these registers for up to 30s. If the HBM training is not complete
or if the C2C link is not ready, fail the probe.
While this wait is not required on Grace Hopper systems, it is
beneficial to make the check to ensure the device is in an
expected state. Hence the check is kept generalized to both generations.
Ensure that the BAR0 is enabled before accessing the registers.
CC: Alex Williamson <alex.williamson@redhat.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20250124183102.3976-4-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
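Sketch of the 30s polling described above; the register offsets, the
STATUS_READY encoding and the function name are placeholders, not the actual
nvgrace-gpu BAR0 layout:
	#define POLL_TIMEOUT_MS		30000
	#define POLL_INTERVAL_MS	100
	#define REG_HBM_TRAINING	0x0	/* placeholder offset */
	#define REG_C2C_LINK		0x8	/* placeholder offset */

	/* Sketch: poll BAR0 status registers for up to 30 seconds and fail
	 * the probe if HBM training or the C2C link is not ready in time. */
	static int nvgrace_gpu_wait_device_ready(void __iomem *bar0)
	{
		unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);

		do {
			if (readl(bar0 + REG_HBM_TRAINING) == STATUS_READY &&
			    readl(bar0 + REG_C2C_LINK) == STATUS_READY)
				return 0;
			msleep(POLL_INTERVAL_MS);
		} while (!time_after(jiffies, timeout));

		return -ETIME;
	}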
|
|
There is a HW defect on Grace Hopper (GH) related to the
Multi-Instance GPU (MIG) feature [1] that necessitated the presence
of a 1G region carved out from the device memory and mapped as
uncached. The 1G region is shown as a fake BAR (comprising region 2 and 3)
to workaround the issue.
The Grace Blackwell systems (GB) differ from GH systems in the following
aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the
GPUdirect RDMA feature [2].
This patch accommodates those GB changes by showing the 64b physical
device BAR1 (region 2 and 3) to the VM instead of the fake one. This
takes care of both the differences.
Moreover, the entire device memory is exposed on GB as cacheable to
the VM as there is no carveout required.
Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]
Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20250124183102.3976-3-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation of the Grace Hopper (GH) superchip; it gives the CPU and
GPU cache-coherent access to each other's memory over an internal
proprietary chip-to-chip cache coherent interconnect.
There is a HW defect on GH systems related to the Multi-Instance
GPU (MIG) feature [1] that necessitated the presence of a 1G region
with uncached mapping carved out from the device memory. The 1G
region is shown as a fake BAR (comprising region 2 and 3) to
workaround the issue. This is fixed on the GB systems.
The presence of the fix for the HW defect is communicated by the
device firmware through the DVSEC PCI config register with ID 3.
The module reads this to take a different codepath on GB vs GH.
Scan through the DVSEC registers to identify the correct one and use
it to determine the presence of the fix. Save the value in the device's
nvgrace_gpu_pci_core_device structure.
Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
CC: Jason Gunthorpe <jgg@nvidia.com>
CC: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20250124183102.3976-2-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
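Sketch of scanning for the DVSEC with ID 3 via the standard PCI helper
pci_find_dvsec_capability(); the status-dword offset and the
has_mig_hw_bug_fix field are assumptions for illustration:
	/* Sketch: look up the vendor DVSEC with ID 3 and read the dword that
	 * advertises whether the MIG-related HW defect is fixed. */
	static void nvgrace_gpu_read_dvsec(struct pci_dev *pdev,
					   struct nvgrace_gpu_pci_core_device *nvdev)
	{
		u16 dvsec;
		u32 val = 0;

		dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_NVIDIA, 3);
		if (dvsec) {
			/* the offset of the status dword within the DVSEC is
			 * a placeholder */
			pci_read_config_dword(pdev, dvsec + 0xc, &val);
			nvdev->has_mig_hw_bug_fix = !!val;	/* hypothetical field */
		}
	}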
|
|
This adds support for fuse request completion through ring SQEs
(FUSE_URING_CMD_COMMIT_AND_FETCH handling). After committing
the ring entry it becomes available for new fuse requests.
Handling of requests through the ring (SQE/CQE handling)
is complete now.
Fuse request data are copied through the mmapped ring buffer;
there is no support for any zero copy yet.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Add support for proper cleanup and re-initialization of virtio-blk devices
during transport reset error recovery flow.
This enhancement includes:
- Pre-reset handler (reset_prepare) to perform device-specific cleanup
- Post-reset handler (reset_done) to re-initialize the device
These changes allow the device to recover from various reset scenarios,
ensuring proper functionality after a reset event occurs.
Without this implementation, the device cannot properly recover from
resets, potentially leading to undefined behavior or device malfunction.
This feature has been tested using PCI transport with Function Level
Reset (FLR) as an example reset mechanism. The reset can be triggered
manually via sysfs (echo 1 > /sys/bus/pci/devices/$PCI_ADDR/reset).
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Message-Id: <1732690652-3065-3-git-send-email-israelr@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Implement support for Function Level Reset (FLR) in virtio_pci devices.
This change adds reset_prepare and reset_done callbacks, allowing
drivers to properly handle FLR operations.
Without this patch, performing and recovering from an FLR is not possible
for virtio_pci devices. This implementation ensures proper FLR handling
and recovery for both physical and virtual functions.
The device reset can be triggered in case of error or manually via
sysfs:
echo 1 > /sys/bus/pci/devices/$PCI_ADDR/reset
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Message-Id: <1732690652-3065-2-git-send-email-israelr@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
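The PCI core already defines reset_prepare/reset_done hooks in struct
pci_error_handlers; a hedged sketch of wiring them up (the virtio_pci
callback and driver field contents are assumptions):
	/* Sketch: let the PCI core call back into virtio_pci around an FLR so
	 * the virtio device can be torn down and re-initialized. */
	static void virtio_pci_reset_prepare(struct pci_dev *pci_dev)
	{
		/* freeze the virtio device before the function is reset */
	}

	static void virtio_pci_reset_done(struct pci_dev *pci_dev)
	{
		/* re-initialize the virtio device after the reset completed */
	}

	static const struct pci_error_handlers virtio_pci_err_handler = {
		.reset_prepare	= virtio_pci_reset_prepare,
		.reset_done	= virtio_pci_reset_done,
	};

	static struct pci_driver virtio_pci_driver = {
		/* ... existing fields ... */
		.err_handler	= &virtio_pci_err_handler,
	};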
|
|
The specification says the device MUST set num_buffers to 1 if
VIRTIO_NET_F_MRG_RXBUF has not been negotiated.
Fixes: 41e3e42108bc ("vhost/net: enable virtio 1.0")
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-Id: <20240915-v1-v1-1-f10d2cb5e759@daynix.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Added support to read the vendor-specific PCI capability to identify the
type of device being emulated.
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Shijith Thotton <sthotton@marvell.com>
Message-Id: <20250103153226.1933479-4-sthotton@marvell.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Added macro definition for VIRTIO_PCI_CAP_VENDOR_CFG to identify the PCI
vendor data type in the virtio_pci_cap structure. Defined a new struct
virtio_pci_vndr_data for the vendor data capability header as per the
specification.
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Shijith Thotton <sthotton@marvell.com>
Message-Id: <20250103153226.1933479-3-sthotton@marvell.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
The first interrupt of the device is used to notify the host about
device configuration changes, such as link status updates. The ISR
configuration area is updated to indicate a config change event when
triggered.
Signed-off-by: Satha Rao <skoteshwar@marvell.com>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Shijith Thotton <sthotton@marvell.com>
Message-Id: <20250103153226.1933479-2-sthotton@marvell.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Updated the driver to utilize all the MSI-X interrupt vectors supported
by each OCTEON endpoint VF, instead of relying on a single vector.
Enabling more interrupts allows packets from multiple rings to be
distributed across multiple cores, improving parallelism and
performance.
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Shijith Thotton <sthotton@marvell.com>
Message-Id: <20250103153226.1933479-1-sthotton@marvell.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
The PCI functions
pcim_iomap_regions()
pcim_iounmap_regions()
pcim_iomap_table()
have been deprecated by the PCI subsystem.
Replace these functions with their successors pcim_iomap_region() and
pcim_iounmap_region().
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Message-Id: <20241219094428.21511-2-phasta@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
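Hedged before/after sketch of the conversion: pcim_iomap_region() maps a
single BAR and reports errors via ERR_PTR, while the deprecated pair mapped a
mask of BARs and required pcim_iomap_table() to fetch the pointers. The
vp_dev->base field and the region name are made up for illustration:
	/* before (deprecated helpers) */
	rc = pcim_iomap_regions(pci_dev, BIT(0), "virtio-pci-modern");
	if (rc)
		return rc;
	vp_dev->base = pcim_iomap_table(pci_dev)[0];

	/* after */
	vp_dev->base = pcim_iomap_region(pci_dev, 0, "virtio-pci-modern");
	if (IS_ERR(vp_dev->base))
		return PTR_ERR(vp_dev->base);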
|
|
Let's add support for including virtio-mem device RAM in the crash dump,
setting NEED_PROC_VMCORE_DEVICE_RAM, and implementing
elfcorehdr_fill_device_ram_ptload_elf64().
To avoid code duplication, factor out the code to fill a PT_LOAD entry.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20241204125444.1734652-13-david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Let's implement the get_device_ram() vmcore callback, so
architectures that select NEED_PROC_VMCORE_DEVICE_RAM, like s390
soon, can include that memory in a crash dump.
Merge ranges, and process ranges that might contain a mixture of plugged
and unplugged, to reduce the total number of ranges.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20241204125444.1734652-12-david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Let's remember the usable region size, which will be helpful in kdump
mode next.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20241204125444.1734652-11-david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
After the callbacks are registered we may immediately get a callback. So
mark the device ready before registering the callbacks.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20241204125444.1734652-10-david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
s390 allocates+prepares the elfcore hdr in the dump (2nd) kernel, not in
the crashed kernel.
RAM provided by memory devices such as virtio-mem can only be detected
using the device driver; when vmcore_init() is called, these device
drivers are usually not loaded yet, or the devices did not get probed
yet. Consequently, on s390 these RAM ranges will not be included in
the crash dump, which makes the dump partially corrupt and is
unfortunate.
Instead of deferring the vmcore_init() call to an (unclear?) later point,
let's reuse the vmcore_cb infrastructure to obtain device RAM ranges as
the device drivers probe the device and get access to this information.
Then, we'll add these ranges to the vmcore, adding more PT_LOAD
entries and updating the offsets+vmcore size.
Use a separate Kconfig option to be set by an architecture to include this
code only if the arch really needs it. Further, we'll make the config
depend on the relevant drivers (i.e., virtio_mem) once they implement
support (next). The alternative of having a PROVIDE_PROC_VMCORE_DEVICE_RAM
config option was dropped for now for simplicity.
The current target use case is s390, which only creates an elf64
elfcore, so focusing on elf64 is sufficient.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20241204125444.1734652-9-david@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Let's factor it out into include/linux/crash_dump.h, from where we can
use it also outside of vmcore.c later.
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20241204125444.1734652-8-david@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Let's factor it out into include/linux/crash_dump.h, from where we can
use it also outside of vmcore.c later.
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20241204125444.1734652-7-david@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
These vmcore defines are not related to /proc/kcore, move them out.
We'll move "struct vmcoredd_node" to vmcore.c, because it is only used
internally. While "struct vmcore" is only used internally for now,
we're planning on using it from inline functions in crash_dump.h next,
so move it to crash_dump.h.
While at it, rename "struct vmcore" to "struct vmcore_range", which is a
more suitable name and will make the usage of it outside of vmcore.c
clearer.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20241204125444.1734652-6-david@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
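For orientation, the renamed structure is essentially a physical range plus
its offset within the ELF core; a sketch of what the moved definition looks
like, with fields inferred from the description rather than copied verbatim:
	/* Sketch: one physical memory range backing part of /proc/vmcore. */
	struct vmcore_range {
		struct list_head list;
		unsigned long long paddr;	/* start physical address */
		unsigned long long size;	/* length of the range */
		loff_t offset;			/* offset within /proc/vmcore */
	};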
|