qemu/subprojects, branch master

libvhost-user: mask F_INFLIGHT_SHMFD if memfd is not supported

2024-07-02T13:27:56Z

libvhost-user will panic when receiving VHOST_USER_GET_INFLIGHT_FD message if MFD_ALLOW_SEALING is not defined, since it's not able to create a memfd. VHOST_USER_GET_INFLIGHT_FD is used only if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD is negotiated. So, let's mask that feature if the backend is not able to properly handle these messages. Reviewed-by: Philippe Mathieu-Daudé Tested-by: Philippe Mathieu-Daudé Acked-by: Stefan Hajnoczi Reviewed-by: David Hildenbrand Signed-off-by: Stefano Garzarella Message-Id: <20240618100043.144657-5-sgarzare@redhat.com> Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin

libvhost-user: fail vu_message_write() if sendmsg() is failing

2024-07-02T13:27:56Z

In vu_message_write() we use sendmsg() to send the message header, then a write() to send the payload. If sendmsg() fails we should avoid sending the payload, since we were unable to send the header. Discovered before fixing the issue with the previous patch, where sendmsg() failed on macOS due to wrong parameters, but the frontend still sent the payload which the backend incorrectly interpreted as a wrong header. Reviewed-by: Daniel P. Berrangé Reviewed-by: Philippe Mathieu-Daudé Tested-by: Philippe Mathieu-Daudé Acked-by: Stefan Hajnoczi Reviewed-by: David Hildenbrand Signed-off-by: Stefano Garzarella Message-Id: <20240618100043.144657-4-sgarzare@redhat.com> Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin

libvhost-user: set msg.msg_control to NULL when it is empty

2024-07-02T13:27:56Z

On some OS (e.g. macOS) sendmsg() returns -1 (errno EINVAL) if the `struct msghdr` has the field `msg_controllen` set to 0, but `msg_control` is not NULL. Reviewed-by: Eric Blake Reviewed-by: David Hildenbrand Reviewed-by: Philippe Mathieu-Daudé Tested-by: Philippe Mathieu-Daudé Acked-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella Message-Id: <20240618100043.144657-3-sgarzare@redhat.com> Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin

Fix vhost user assertion when sending more than one fd

2024-07-01T18:56:23Z

If the client sends more than one region this assert triggers. The reason is that two fd's are 8 bytes and VHOST_MEMORY_BASELINE_NREGIONS is exactly 8. The assert is wrong because it should not test for the size of the fd array, but for the numbers of regions. Signed-off-by: Christian Pötzsch Message-Id: <20240426083313.3081272-1-christian.poetzsch@kernkonzept.com> Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin

libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP

2024-03-12T21:56:55Z

We already use MADV_NORESERVE to deal with sparse memory regions. Let's also set madvise(MADV_DONTDUMP), otherwise a crash of the process can result in us allocating all memory in the mmap'ed region for dumping purposes. This change implies that the mmap'ed rings won't be included in a coredump. If ever required for debugging purposes, we could mark only the mapped rings MADV_DODUMP. Ignore errors during madvise() for now. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand Message-Id: <20240214151701.29906-15-david@redhat.com> Tested-by: Mario Casquero Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin

libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions

2024-03-12T21:56:55Z

Currently, we try to remap all rings whenever we add a single new memory region. That doesn't quite make sense, because we already map rings when setting the ring address, and panic if that goes wrong. Likely, that handling was simply copied from set_mem_table code, where we actually have to remap all rings. Remapping all rings might require us to walk quite a lot of memory regions to perform the address translations. Ideally, we'd simply remove that remapping. However, let's be a bit careful. There might be some weird corner cases where we might temporarily remove a single memory region (e.g., resize it), that would have worked for now. Further, a ring might be located on hotplugged memory, and as the VM reboots, we might unplug that memory, to hotplug memory before resetting the ring addresses. So let's unmap affected rings as we remove a memory region, and try dynamically mapping the ring again when required. Acked-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand Message-Id: <20240214151701.29906-14-david@redhat.com> Tested-by: Mario Casquero Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin

libvhost-user: Factor out vq usability check

2024-03-12T21:56:55Z

Let's factor it out to prepare for further changes. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand Message-Id: <20240214151701.29906-13-david@redhat.com> Tested-by: Mario Casquero Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin

libvhost-user: Use most of mmap_offset as fd_offset

2024-03-12T21:56:55Z

In the past, QEMU would create memory regions that could partially cover hugetlb pages, making mmap() fail if we would use the mmap_offset as an fd_offset. For that reason, we never used the mmap_offset as an offset into the fd and instead always mapped the fd from the very start. However, that can easily result in us mmap'ing a lot of unnecessary parts of an fd, possibly repeatedly. QEMU nowadays does not create memory regions that partially cover huge pages -- it never really worked with postcopy. QEMU handles merging of regions that partially cover huge pages (due to holes in boot memory) since 2018 in c1ece84e7c93 ("vhost: Huge page align and merge"). Let's be a bit careful and not unconditionally convert the mmap_offset into an fd_offset. Instead, let's simply detect the hugetlb size and pass as much as we can as fd_offset, making sure that we call mmap() with a properly aligned offset. With QEMU and a virtio-mem device that is fully plugged (50GiB using 50 memslots) the qemu-storage daemon process consumes in the VA space 1281GiB before this change and 58GiB after this change. ================ Vhost user message ================ Request: VHOST_USER_ADD_MEM_REG (37) Flags: 0x9 Size: 40 Fds: 59 Adding region 4 guest_phys_addr: 0x0000000200000000 memory_size: 0x0000000040000000 userspace_addr: 0x00007fb73bffe000 old mmap_offset: 0x0000000080000000 fd_offset: 0x0000000080000000 new mmap_offset: 0x0000000000000000 mmap_addr: 0x00007f02f1bdc000 Successfully added new region ================ Vhost user message ================ Request: VHOST_USER_ADD_MEM_REG (37) Flags: 0x9 Size: 40 Fds: 59 Adding region 5 guest_phys_addr: 0x0000000240000000 memory_size: 0x0000000040000000 userspace_addr: 0x00007fb77bffe000 old mmap_offset: 0x00000000c0000000 fd_offset: 0x00000000c0000000 new mmap_offset: 0x0000000000000000 mmap_addr: 0x00007f0284000000 Successfully added new region Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand Message-Id: <20240214151701.29906-12-david@redhat.com> Tested-by: Mario Casquero Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin

libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()

2024-03-12T21:56:55Z

Let's speed up GPA to memory region / virtual address lookup. Store the memory regions ordered by guest physical addresses, and use binary search for address translation, as well as when adding/removing memory regions. Most importantly, this will speed up GPA->VA address translation when we have many memslots. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand Message-Id: <20240214151701.29906-11-david@redhat.com> Tested-by: Mario Casquero Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin

libvhost-user: Factor out search for memory region by GPA and simplify

2024-03-12T21:56:55Z

Memory regions cannot overlap, and if we ever hit that case something would be really flawed. For example, when vhost code in QEMU decides to increase the size of memory regions to cover full huge pages, it makes sure to never create overlaps, and if there would be overlaps, it would bail out. QEMU commits 48d7c9757749 ("vhost: Merge sections added to temporary list"), c1ece84e7c93 ("vhost: Huge page align and merge") and e7b94a84b6cb ("vhost: Allow adjoining regions") added and clarified that handling and how overlaps are impossible. Consequently, each GPA can belong to at most one memory region, and everything else doesn't make sense. Let's factor out our search to prepare for further changes. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand Message-Id: <20240214151701.29906-10-david@redhat.com> Tested-by: Mario Casquero Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin