|
It's possible for an interrupt returning to an irqs-disabled context to
lose a pending soft-masked irq because it branches to part of the exit
code for irqs-enabled contexts, which is meant to clear only the
PACA_IRQS_HARD_DIS flag from PACAIRQHAPPENED by zeroing the byte. This
just looks like a simple thinko from a recent commit (if there was no
hard mask pending, there would be no reason to clear it anyway).
This also adds a comment to the code that actually does need to clear the
flag.
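As a rough illustration (the actual return path is in assembly; the
local_paca->irq_happened field and PACA_IRQ_HARD_DIS flag are assumed here,
and the C form is only a sketch):

  /* irqs-disabled exit: clear only the hard-disable flag, keeping any
   * pending soft-masked interrupt bits */
  local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;

  /* irqs-enabled exit: zeroing the whole byte is fine there, but taking
   * this path from a disabled context throws pending bits away */
  local_paca->irq_happened = 0;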
Fixes: e485f6c751e0a ("powerpc/64/interrupt: Fix return to masked context after hard-mask irq becomes pending")
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221013064418.1311104-1-npiggin@gmail.com
|
|
powerpc 32-bit system call (and function) calling convention for 64-bit
arguments requires the next available odd-pair (two sequential registers
with the first being odd-numbered) from the standard register argument
allocation.
The first argument register is r3, so a 64-bit argument that appears at
an even position in the argument list must skip a register (unless
preceding 64-bit arguments have already shifted the allocation). This
requires non-standard compat definitions to deal with the holes in the
argument register allocation.
With pt_regs syscall wrappers which use a standard mapper to map pt_regs
GPRs to function arguments, 32-bit kernels hit the same basic problem:
the standard definitions don't cope with the unused argument registers.
Fix this by having 32-bit kernels share those syscall definitions with
compat.
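Roughly, for a hypothetical 64-bit length in the second argument slot (the
wrapper name is made up and ksys_truncate() is used purely for illustration):

  long ppc32_sys_truncate64(const char __user *path, /* r3              */
                            u32 unused,              /* r4: the hole    */
                            u32 len_hi, u32 len_lo)  /* r5/r6 odd pair  */
  {
          return ksys_truncate(path, ((loff_t)len_hi << 32) | len_lo);
  }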
Thanks to Jason for spending a lot of time finding and bisecting this
and developing a trivial reproducer. The perfect bug report.
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Fixes: 7e92e01b72452 ("powerpc: Provide syscall wrapper")
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221012035335.866440-1-npiggin@gmail.com
|
|
The merge of the kbuild tree dropped the renaming of the FSL_BOOKE
kconfig option.
Fixes: 8afc66e8d43b ("Merge tag 'kbuild-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild")
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This ensures that no module record or entry is added to the
unloaded_tainted_modules list if the module does not carry a taint.
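A minimal sketch of the intended guard (assuming the try_add_tainted_module()
helper and the mod->taints bitmask; the exact placement is illustrative):

  static int try_add_tainted_module(struct module *mod)
  {
          /* nothing to track for modules without any taint */
          if (!mod->taints)
                  return 0;

          /* ... otherwise record/update the unloaded_tainted_modules entry ... */
          return 0;
  }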
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Fixes: 99bd9956551b ("module: Introduce module unload taint tracking")
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The shash descriptor was not being initialized in one code path in
smb3_calc_signature() and smb2_calc_signature().
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The struct sdesc is just a wrapper around shash_desc, with the exact same
memory layout. Replace the hashing TFMs with shash_desc as it's what's
passed to the crypto API anyway.
Also remove the crypto_shash pointers as they can be accessed via
shash_desc->tfm (and are actually only used in the setkey calls).
Adapt cifs_{alloc,free}_hash functions to this change.
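A rough sketch of the reworked helper (signature assumed; the allocation
details are illustrative, not the exact patch):

  int cifs_alloc_hash(const char *name, struct shash_desc **sdesc)
  {
          struct crypto_shash *tfm;
          struct shash_desc *desc;

          tfm = crypto_alloc_shash(name, 0, 0);
          if (IS_ERR(tfm))
                  return PTR_ERR(tfm);

          desc = kzalloc(sizeof(*desc) + crypto_shash_descsize(tfm), GFP_KERNEL);
          if (!desc) {
                  crypto_free_shash(tfm);
                  return -ENOMEM;
          }

          desc->tfm = tfm;        /* the TFM stays reachable via shash_desc */
          *sdesc = desc;
          return 0;
  }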
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Detach the TFM name from a specific algorithm (AES-CCM) as
AES-GCM is also supported, making the name misleading.
s/ccmaesencrypt/enc/
s/ccmaesdecrypt/dec/
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Replace kfree with kfree_sensitive, or prepend memzero_explicit() in
other cases, when freeing sensitive material that could still be left
in memory.
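For instance (illustrative call sites, not the exact hunks):

  /* before: key material may linger in the freed buffer */
  kfree(ses->auth_key.response);

  /* after: wipe it before the allocator reuses the memory */
  kfree_sensitive(ses->auth_key.response);

  /* or, where the buffer is not freed at this point, wipe it explicitly */
  memzero_explicit(signature, sizeof(signature));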
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/r/202209201529.ec633796-oliver.sang@intel.com
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The background is that we use DPUs in cloud computing; the arch is x86 with
80 cores. We will have a lot of virtio devices, like 512 or more.
When we probe about 200 virtio_blk devices, probing fails and
the following stack is printed:
[25338.485128] virtio-pci 0000:b3:00.0: virtio_pci: leaving for legacy driver
[25338.496174] genirq: Flags mismatch irq 0. 00000080 (virtio418) vs. 00015a00 (timer)
[25338.503822] CPU: 20 PID: 5431 Comm: kworker/20:0 Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.30.1.el8.x86_64
[25338.516403] Hardware name: Inspur NF5280M5/YZMB-00882-10E, BIOS 4.1.21 08/25/2021
[25338.523881] Workqueue: events work_for_cpu_fn
[25338.528235] Call Trace:
[25338.530687] dump_stack+0x5c/0x80
[25338.534000] __setup_irq.cold.53+0x7c/0xd3
[25338.538098] request_threaded_irq+0xf5/0x160
[25338.542371] vp_find_vqs+0xc7/0x190
[25338.545866] init_vq+0x17c/0x2e0 [virtio_blk]
[25338.550223] ? ncpus_cmp_func+0x10/0x10
[25338.554061] virtblk_probe+0xe6/0x8a0 [virtio_blk]
[25338.558846] virtio_dev_probe+0x158/0x1f0
[25338.562861] really_probe+0x255/0x4a0
[25338.566524] ? __driver_attach_async_helper+0x90/0x90
[25338.571567] driver_probe_device+0x49/0xc0
[25338.575660] bus_for_each_drv+0x79/0xc0
[25338.579499] __device_attach+0xdc/0x160
[25338.583337] bus_probe_device+0x9d/0xb0
[25338.587167] device_add+0x418/0x780
[25338.590654] register_virtio_device+0x9e/0xe0
[25338.595011] virtio_pci_probe+0xb3/0x140
[25338.598941] local_pci_probe+0x41/0x90
[25338.602689] work_for_cpu_fn+0x16/0x20
[25338.606443] process_one_work+0x1a7/0x360
[25338.610456] ? create_worker+0x1a0/0x1a0
[25338.614381] worker_thread+0x1cf/0x390
[25338.618132] ? create_worker+0x1a0/0x1a0
[25338.622051] kthread+0x116/0x130
[25338.625283] ? kthread_flush_work_fn+0x10/0x10
[25338.629731] ret_from_fork+0x1f/0x40
[25338.633395] virtio_blk: probe of virtio418 failed with error -16
The log line:
"genirq: Flags mismatch irq 0. 00000080 (virtio418) vs. 00015a00 (timer)"
was printed because irq 0 is used exclusively by the timer, and when
vp_find_vqs() calls vp_find_vqs_msix() and it fails twice (for
whatever reason), it then calls vp_find_vqs_intx() as a fallback.
Because vp_dev->pci_dev->irq is zero, we request irq 0 with the
IRQF_SHARED flag and get a backtrace like the above.
According to PCI spec about "Interrupt Pin" Register (Offset 3Dh):
"The Interrupt Pin register is a read-only register that identifies the
legacy interrupt Message(s) the Function uses. Valid values are 01h, 02h,
03h, and 04h that map to legacy interrupt Messages for INTA,
INTB, INTC, and INTD respectively. A value of 00h indicates that the
Function uses no legacy interrupt Message(s)."
So if vp_dev->pci_dev->pin is zero, we should not request the legacy
interrupt.
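A minimal sketch of the check (the placement inside vp_find_vqs() and the
error code are illustrative):

  /* no legacy interrupt pin: INTx cannot work, so don't fall back to
   * requesting irq 0 with IRQF_SHARED */
  if (!vp_dev->pci_dev->pin)
          return -ENODEV;

  return vp_find_vqs_intx(vdev, nvqs, vqs, callbacks, names, ctx);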
Signed-off-by: Angus Chen <angus.chen@jaguarmicro.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20220930000915.548-1-angus.chen@jaguarmicro.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
The spec says:
mtu only exists if VIRTIO_NET_F_MTU is set
The mac address field always exists (though it is only valid if
VIRTIO_NET_F_MAC is set).
So vdpa_dev_net_config_fill() should read MTU and MAC
conditionally on the feature bits.
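A sketch of the conditional fill (attribute names from the vDPA netlink
UAPI; the surrounding helper context is assumed):

  u16 val_u16;

  if (features & BIT_ULL(VIRTIO_NET_F_MTU)) {
          val_u16 = __virtio16_to_cpu(true, config->mtu);
          if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
                  return -EMSGSIZE;
  }

  if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) &&
      nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
              sizeof(config->mac), config->mac))
          return -EMSGSIZE;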
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220929014555.112323-7-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
This commit fixes sparse warnings ("cast to restricted __le16")
in the function vdpa_dev_net_mq_config_fill().
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220929014555.112323-6-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
vdpa_dev_net_mq_config_fill() should check the device features
for MQ rather than the driver features.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220929014555.112323-5-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
virtio 1.2 spec says:
max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ or
VIRTIO_NET_F_RSS is set.
So when reporting MQ to userspace, it should check both
VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS.
The unused parameter struct vdpa_device *vdev is removed.
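A sketch of the check described above (the surrounding helper context is
assumed):

  u16 val_u16;

  if (!(features & (BIT_ULL(VIRTIO_NET_F_MQ) |
                    BIT_ULL(VIRTIO_NET_F_RSS))))
          return 0;

  val_u16 = __virtio16_to_cpu(true, config->max_virtqueue_pairs);
  return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);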
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220929014555.112323-4-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
This commit reports driver features to user space
only after FEATURES_OK is set, i.e. after feature negotiation is done.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20220929014555.112323-3-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
This commit adds a new vDPA netlink attribute
VDPA_ATTR_VDPA_DEV_SUPPORTED_FEATURES. Userspace can query
features of vDPA devices through this new attr.
This commit invokes vdpa_config_ops.get_config()
rather than vdpa_get_config_unlocked() to read
the device config space, so there are no races with
vdpa_set_features_unlocked().
Userspace tool iproute2 example:
$ vdpa dev config show vdpa0
vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false max_vq_pairs 4 mtu 1500
negotiated_features MRG_RXBUF CTRL_VQ MQ VERSION_1 ACCESS_PLATFORM
dev_features MTU MAC MRG_RXBUF CTRL_VQ MQ ANY_LAYOUT VERSION_1 ACCESS_PLATFORM
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20220929014555.112323-2-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
The hugetlb vma lock was originally designed to synchronize pmd sharing.
As such, it was only necessary to allocate the lock for vmas that were
capable of pmd sharing. Later in the development cycle, it was discovered
that it could also be used to simplify fault/truncation races as described
in [1]. However, a subsequent change to allocate the lock for all vmas
that use the page cache was never made. A fault/truncation race could
leave pages in a file past i_size until the file is removed.
Remove the previous restriction and allocate lock for all VM_MAYSHARE
vmas. Warn in the unlikely event of allocation failure.
[1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
Link: https://lkml.kernel.org/r/20221005011707.514612-4-mike.kravetz@oracle.com
Fixes: "hugetlb: clean up code checking for fault/truncation races"
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
hugetlb file truncation/hole punch code may need to back out and take
locks in order in the routine hugetlb_unmap_file_folio(). This code could
race with vma freeing as pointed out in [1] and result in accessing a
stale vma pointer. To address this, take the vma_lock when clearing the
vma_lock->vma pointer.
[1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
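The idea, as a minimal sketch (assuming the rw_sema and vma fields of the
hugetlb vma_lock structure used by this series):

  /* clear the back-pointer only while holding the lock, so a racing
   * hugetlb_unmap_file_folio() cannot see a stale vma */
  down_write(&vma_lock->rw_sema);
  vma_lock->vma = NULL;
  up_write(&vma_lock->rw_sema);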
[mike.kravetz@oracle.com: address build issues]
Link: https://lkml.kernel.org/r/Yz5L1uxQYR1VqFtJ@monkey
Link: https://lkml.kernel.org/r/20221005011707.514612-3-mike.kravetz@oracle.com
Fixes: "hugetlb: use new vma_lock for pmd sharing synchronization"
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "hugetlb: fixes for new vma lock series".
In review of the series "hugetlb: Use new vma lock for huge pmd sharing
synchronization", Miaohe Lin pointed out two key issues:
1) There is a race in the routine hugetlb_unmap_file_folio when locks
are dropped and reacquired in the correct order [1].
2) With the switch to using vma lock for fault/truncate synchronization,
we need to make sure lock exists for all VM_MAYSHARE vmas, not just
vmas capable of pmd sharing.
These two issues are addressed here. In addition, having a vma lock
present in all VM_MAYSHARE vmas uncovered some issues around vma
splitting. Those are also addressed.
[1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
This patch (of 3):
The hugetlb vma lock hangs off the vm_private_data field and is specific
to the vma. When vm_area_dup() is called as part of vma splitting, the
vma lock pointer is copied to the new vma. This will result in issues
such as double freeing of the structure. Update the hugetlb open vm_ops
to allocate a new vma lock for the new vma.
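Roughly (a sketch; hugetlb_vma_lock_alloc() is the allocation helper added
earlier in the series, and the rest of the open handler is elided):

  static void hugetlb_vm_op_open(struct vm_area_struct *vma)
  {
          /* ... existing reservation map handling ... */

          /*
           * vm_area_dup() copied the parent's lock pointer; drop it and
           * give the new vma a lock of its own.
           */
          if (vma->vm_flags & VM_MAYSHARE) {
                  vma->vm_private_data = NULL;
                  hugetlb_vma_lock_alloc(vma);
          }
  }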
The routine __unmap_hugepage_range_final unconditionally unsets VM_MAYSHARE
to prevent subsequent pmd sharing. hugetlb_vma_lock_free attempted to
anticipate this by checking both VM_MAYSHARE and VM_SHARED. However, if
only VM_MAYSHARE was set we would miss the free. With the introduction of
the vma lock, a vma cannot participate in pmd sharing if vm_private_data
is NULL. Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final,
free the vma lock to prevent sharing. Also, update the sharing code to
make sure vma lock is indeed a condition for pmd sharing.
hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas.
Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com
Fixes: "hugetlb: add vma based lock for pmd sharing"
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Link: https://lkml.kernel.org/r/YzSWfFI+MOeb1ils@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
wakeup_flusher_threads() was added under the assumption that if a system
runs out of clean cold pages, it might want to write back dirty pages more
aggressively so that they can become clean and be dropped.
However, doing so can breach the rate limit a system wants to impose on
writeback, resulting in early SSD wearout.
Link: https://lkml.kernel.org/r/YzSiWq9UEER5LKup@google.com
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
inode_lock[ATOMIC_FILE] was used to protect sbi->atomic_files. Change the
atomic_files variable's type from unsigned int to atomic_t; then
inode_lock[ATOMIC_FILE] can be obsoleted.
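For example (illustrative call sites only; the exact f2fs locking may differ):

  /* before: counter updates took sbi->inode_lock[ATOMIC_FILE] */
  spin_lock(&sbi->inode_lock[ATOMIC_FILE]);
  sbi->atomic_files++;
  spin_unlock(&sbi->inode_lock[ATOMIC_FILE]);

  /* after: the counter is updated and read atomically */
  atomic_inc(&sbi->atomic_files);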
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In order to check the count of opened swapfile inodes.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Add support for the VIRTIO_BLK_F_SECURE_ERASE VirtIO feature.
A device that offers this feature can receive VIRTIO_BLK_T_SECURE_ERASE
commands.
A device which supports this feature has the following fields in the
virtio config:
- max_secure_erase_sectors
- max_secure_erase_seg
- secure_erase_sector_alignment
max_secure_erase_sectors and secure_erase_sector_alignment are expressed
in 512-byte units.
Every secure erase command has the following fields:
- sectors: The starting offset in 512-byte units.
- num_sectors: The number of sectors.
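As a sketch of the layouts described above (field widths are assumed for
illustration; the per-segment payload reuses the existing
discard/write-zeroes format):

  /* config space additions, guarded by VIRTIO_BLK_F_SECURE_ERASE */
  __virtio32 max_secure_erase_sectors;      /* 512-byte units */
  __virtio32 max_secure_erase_seg;          /* max segments per request */
  __virtio32 secure_erase_sector_alignment; /* 512-byte units */

  /* per-segment payload of a VIRTIO_BLK_T_SECURE_ERASE request */
  struct virtio_blk_discard_write_zeroes {
          __le64 sector;          /* starting offset, 512-byte units */
          __le32 num_sectors;     /* number of sectors to erase */
          __le32 flags;
  };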
Signed-off-by: Alvaro Karsz <alvaro.karsz@solid-run.com>
Message-Id: <20220921082729.2516779-1-alvaro.karsz@solid-run.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
|
|
This patch allows the device features to be provisioned via
netlink. This is done by:
1) validating the provisioned features to be a subset of the parent
features.
2) clearing the features that are not wanted by the userspace
For example:
# vdpa mgmtdev show
pci/0000:02:00.0:
supported_classes net
max_supported_vqs 3
dev_features CSUM GUEST_CSUM CTRL_GUEST_OFFLOADS MAC GUEST_TSO4
GUEST_TSO6 GUEST_ECN GUEST_UFO HOST_TSO4 HOST_TSO6 HOST_ECN HOST_UFO
MRG_RXBUF STATUS CTRL_VQ CTRL_RX CTRL_VLAN CTRL_RX_EXTRA
GUEST_ANNOUNCE CTRL_MAC_ADDR RING_INDIRECT_DESC RING_EVENT_IDX
VERSION_1 ACCESS_PLATFORM
1) provision vDPA device with all features that are supported by the virtio-pci
# vdpa dev add name dev1 mgmtdev pci/0000:02:00.0
# vdpa dev config show
dev1: mac 52:54:00:12:34:56 link up link_announce false mtu 65535
negotiated_features CSUM GUEST_CSUM CTRL_GUEST_OFFLOADS MAC
GUEST_TSO4 GUEST_TSO6 GUEST_ECN GUEST_UFO HOST_TSO4 HOST_TSO6
HOST_ECN HOST_UFO MRG_RXBUF STATUS CTRL_VQ CTRL_RX CTRL_VLAN
GUEST_ANNOUNCE CTRL_MAC_ADDR RING_INDIRECT_DESC RING_EVENT_IDX
VERSION_1 ACCESS_PLATFORM
2) provision vDPA device with a subset of the features
# vdpa dev add name dev1 mgmtdev pci/0000:02:00.0 device_features 0x300020000
# dev1: mac 52:54:00:12:34:56 link up link_announce false mtu 65535
negotiated_features CTRL_VQ VERSION_1 ACCESS_PLATFORM
Reviewed-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220927074810.28627-4-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
This patch implements features provisioning for vdpa_sim_net.
1) validating the provisioned features to be a subset of the parent
features.
2) clearing the features that are not wanted by the userspace
For example:
vdpasim_net:
supported_classes net
max_supported_vqs 3
dev_features MTU MAC CTRL_VQ CTRL_MAC_ADDR ANY_LAYOUT VERSION_1 ACCESS_PLATFORM
1) provision vDPA device with all features that are supported by the
net simulator
dev1: mac 00:00:00:00:00:00 link up link_announce false mtu 1500
negotiated_features MTU MAC CTRL_VQ CTRL_MAC_ADDR VERSION_1 ACCESS_PLATFORM
2) provision vDPA device with a subset of the features
dev1: mac 00:00:00:00:00:00 link up link_announce false mtu 1500
negotiated_features CTRL_VQ VERSION_1 ACCESS_PLATFORM
Reviewed-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220927074810.28627-3-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
|
|
This patch allows the device features to be provisioned through
netlink. A new attribute is introduced to allow the userspace to pass
64-bit device features when adding a device.
This provides several advantages:
- Allow to provision a subset of the features to ease the cross vendor
live migration.
- Better debug-ability for vDPA framework and parent.
Reviewed-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20220927074810.28627-2-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Currently add_recvbuf_big() allocates MAX_SKB_FRAGS segments for big
packets even when GUEST_* offloads are not present on the device.
However, if guest GSO is not supported, it would be sufficient to
allocate segments to cover just up to the MTU size and no further.
Allocating the maximum amount of segments results in a large waste of
buffer space in the queue, which limits the number of packets that can
be buffered and can result in reduced performance.
Therefore, if guest GSO is not supported, use the MTU to calculate the
optimal amount of segments required.
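A sketch of the sizing logic (field and helper names are illustrative, not
the exact patch; virtnet_check_guest_gso() is the helper factored out in the
next entry below):

  static void virtnet_set_big_packets(struct virtnet_info *vi, const int mtu)
  {
          bool guest_gso = virtnet_check_guest_gso(vi);

          /* without guest GSO, a big buffer only ever has to hold one
           * MTU-sized packet rather than MAX_SKB_FRAGS pages */
          vi->big_packets = guest_gso || mtu > ETH_DATA_LEN;
          vi->big_packets_num_skbfrags = guest_gso ?
                  MAX_SKB_FRAGS : DIV_ROUND_UP(mtu, PAGE_SIZE);
  }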
Below is the iperf TCP test results over a Mellanox NIC, using vDPA for
1 VQ, queue size 1024, before and after the change, with the iperf
server running over the virtio-net interface.
MTU(Bytes)/Bandwidth (Gbit/s)
Before After
1500 22.5 22.4
9000 12.8 25.9
And the result with a queue size of 256:
MTU(Bytes)/Bandwidth (Gbit/s)
Before After
9000 2.15 11.9
With this patch, no degradation is observed with the multiple tests below
and feature bit combinations. Results are summarized below for a queue depth of
1024. Interface MTU is 1500 if MTU feature is disabled. MTU is set to 9000
in other tests.
Features/ Bandwidth (Gbit/s)
Before After
mtu off 20.1 20.2
mtu/indirect on 17.4 17.3
mtu/indirect/packed on 17.2 17.2
Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Gavi Teitz <gavi@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Message-Id: <20220914144911.56422-3-gavinl@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
|
|
The probe routine is already several hundred lines long.
Use a helper function for the guest GSO support check.
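Something along these lines (a sketch; the exact set of GUEST_* features
checked is assumed):

  static bool virtnet_check_guest_gso(const struct virtnet_info *vi)
  {
          return virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
                 virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
                 virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN)  ||
                 virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO);
  }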
Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Gavi Teitz <gavi@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Message-Id: <20220914144911.56422-2-gavinl@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
There's actually no way to set queue size on legacy virtio pci.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20220815220447.155860-1-mst@redhat.com>
|
|
Add some spaces to vring_alloc_queue() to make it look prettier.
Signed-off-by: Deming Wang <wangdeming@inspur.com>
Message-Id: <20220926183306.4535-1-wangdeming@inspur.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
The operators in vring_alloc_queue_split() should use a unified style.
Add spaces around the '|' to make it look prettier.
Signed-off-by: Deming Wang <wangdeming@inspur.com>
Message-Id: <20220926022202.1516-1-wangdeming@inspur.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Add missing __init/__exit annotations to module init/exit funcs.
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Message-Id: <20220917083803.21521-1-xiujianfeng@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
parse_opts() will set invalid opts.rfd/wfd in case of failure, which we
already check for, but it is not clear to readers that parse_opts()
errors are handled in p9_fd_create(): clarify this by explicitly checking
the return value.
Link: https://lkml.kernel.org/r/20220921210921.1654735-1-floridsleeves@gmail.com
Signed-off-by: Li Zhong <floridsleeves@gmail.com>
[Dominique: reworded commit message to clarify this is NOOP]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
xen transport was missing annotations
Link: https://lkml.kernel.org/r/20220909103546.73015-1-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Shamelessly copying the explanation from Tetsuo Handa's suggested
patch[1] (slightly reworded):
syzbot is reporting inconsistent lock state in p9_req_put()[2],
for p9_tag_remove() from p9_req_put() from IRQ context is using
spin_lock_irqsave() on "struct p9_client"->lock but trans_fd
(not from IRQ context) is using spin_lock().
Since the locks actually protect different things in client.c and in
trans_fd.c, just replace trans_fd.c's lock by a new one specific to the
transport (client.c's protect the idr for fid/tag allocations,
while trans_fd.c's protects its own req list and request status field
that acts as the transport's state machine)
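One possible shape of the split (the lock lives in trans_fd's per-connection
state; the field name here is an assumption):

  struct p9_conn {
          /* ... */
          struct list_head req_list;
          struct list_head unsent_req_list;
          /* protects the two lists above and the per-request status that
           * acts as the transport's state machine */
          spinlock_t req_lock;
          /* ... */
  };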
Link: https://lore.kernel.org/r/20220904112928.1308799-1-asmadeus@codewreck.org
Link: https://lkml.kernel.org/r/2470e028-9b05-2013-7198-1fdad071d999@I-love.SAKURA.ne.jp [1]
Link: https://syzkaller.appspot.com/bug?extid=2f20b523930c32c160cc [2]
Reported-by: syzbot <syzbot+2f20b523930c32c160cc@syzkaller.appspotmail.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
The hard-coded marker is out of date now, fix it using the nice define.
Fixes: 17773afdcd15 ("powerpc/64: use 32-bit immediate for STACK_FRAME_REGS_MARKER")
Reported-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221006143345.129077-1-npiggin@gmail.com
|
|
I have a Sipeed Lichee RV dock board which only has 512MB DDR, so
memory optimizations such as swap on zram are helpful. As is seen
in commit d0637c505f8a ("arm64: enable THP_SWAP for arm64") and
commit bd4c82c22c367e ("mm, THP, swap: delay splitting THP after
swapped out"), THP_SWAP can improve the swap throughput significantly.
Enable THP_SWAP for RV64; testing with the micro-benchmark introduced by
commit d0637c505f8a ("arm64: enable THP_SWAP for arm64")
shows the numbers below on the Lichee RV dock board:
swp out bandwidth w/o patch: 66908 bytes/ms (mean of 10 tests)
swp out bandwidth w/ patch: 322638 bytes/ms (mean of 10 tests)
Improved by 382%!
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20220829145742.3139-1-jszhang@kernel.org/
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
This got out of order during a merge conflict, fix it by putting the
entries in the correct order.
Fixes: 7ab52f75a9cf ("RISC-V: Add Sstc extension support")
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220920204518.10988-1-palmer@rivosinc.com/
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86.
This is causing instability on Linus' desktop, and I'm seeing
oopses with VK CTS runs.
netconsole got me the following oops:
[ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
[ 1234.778782] #PF: supervisor read access in kernel mode
[ 1234.778787] #PF: error_code(0x0000) - not-present page
[ 1234.778791] PGD 0 P4D 0
[ 1234.778798] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
[ 1234.778809] Hardware name: System manufacturer System Product
Name/PRIME X370-PRO, BIOS 5603 07/28/2020
[ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
[ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
00 f0
[ 1234.778834] RSP: 0000:ffffabe680380de0 EFLAGS: 00010087
[ 1234.778839] RAX: ffffffffc04e9230 RBX: 0000000000000000 RCX: 0000000000000018
[ 1234.778897] RDX: 00000ba278e8977a RSI: ffff953fb288b460 RDI: 0000000000000000
[ 1234.778901] RBP: ffff953fb288b598 R08: 00000000000000e0 R09: ffff953fbd98b808
[ 1234.778905] R10: 0000000000000000 R11: ffffabe680380ff8 R12: ffffabe680380e00
[ 1234.778908] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff953fbd9ec458
[ 1234.778912] FS: 00007f35e7008580(0000) GS:ffff95428ebc0000(0000)
knlGS:0000000000000000
[ 1234.778916] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1234.778919] CR2: 0000000000000088 CR3: 000000010147c000 CR4: 00000000003506e0
[ 1234.778924] Call Trace:
[ 1234.778981] <IRQ>
[ 1234.778989] dma_fence_signal_timestamp_locked+0x6a/0xe0
[ 1234.778999] dma_fence_signal+0x2c/0x50
[ 1234.779005] amdgpu_fence_process+0xc8/0x140 [amdgpu]
[ 1234.779234] sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
[ 1234.779395] amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
[ 1234.779609] amdgpu_ih_process+0x80/0x100 [amdgpu]
[ 1234.779783] amdgpu_irq_handler+0x1f/0x60 [amdgpu]
[ 1234.779940] __handle_irq_event_percpu+0x46/0x190
[ 1234.779946] handle_irq_event+0x34/0x70
[ 1234.779949] handle_edge_irq+0x9f/0x240
[ 1234.779954] __common_interrupt+0x66/0x100
[ 1234.779960] common_interrupt+0xa0/0xc0
[ 1234.779965] </IRQ>
[ 1234.779968] <TASK>
[ 1234.779971] asm_common_interrupt+0x22/0x40
[ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
[ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
83 ea
[ 1234.779985] RSP: 0000:ffffabe680bcfd78 EFLAGS: 00000202
Revert it for now and figure it out later.
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
syzbot is reporting hung task at p9_fd_close() [1], for p9_mux_poll_stop()
from p9_conn_destroy() from p9_fd_close() is failing to interrupt already
started kernel_read() from p9_fd_read() from p9_read_work() and/or
kernel_write() from p9_fd_write() from p9_write_work() requests.
Since p9_socket_open() sets O_NONBLOCK flag, p9_mux_poll_stop() does not
need to interrupt kernel_read()/kernel_write(). However, since p9_fd_open()
does not set O_NONBLOCK flag, but pipe blocks unless signal is pending,
p9_mux_poll_stop() needs to interrupt kernel_read()/kernel_write() when
the file descriptor refers to a pipe. In other words, a pipe file descriptor
needs to be handled as if it were a socket file descriptor.
We somehow need to interrupt kernel_read()/kernel_write() on pipes.
A minimal change, which this patch is doing, is to set O_NONBLOCK flag
from p9_fd_open(), for O_NONBLOCK flag does not affect reading/writing
of regular files. But this approach changes O_NONBLOCK flag on userspace-
supplied file descriptors (which might break userspace programs), and
O_NONBLOCK flag could be changed by userspace. It would be possible to set
the O_NONBLOCK flag every time p9_fd_read()/p9_fd_write() is invoked, but a
small race window for clearing the O_NONBLOCK flag would still remain.
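The minimal change amounts to something like this in p9_fd_open() (a sketch;
ts->rd/ts->wr are the transport's read/write struct file pointers):

  /* make reads/writes on a pipe return -EAGAIN instead of blocking
   * forever; regular files are unaffected by O_NONBLOCK */
  ts->rd->f_flags |= O_NONBLOCK;
  ts->wr->f_flags |= O_NONBLOCK;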
If we don't want to manipulate O_NONBLOCK flag, we might be able to
surround kernel_read()/kernel_write() with set_thread_flag(TIF_SIGPENDING)
and recalc_sigpending(). Since p9_read_work()/p9_write_work() works are
processed by kernel threads which process global system_wq workqueue,
signals could not be delivered from remote threads when p9_mux_poll_stop()
from p9_conn_destroy() from p9_fd_close() is called. Therefore, calling
set_thread_flag(TIF_SIGPENDING)/recalc_sigpending() every time would be
needed if we count on signals for making kernel_read()/kernel_write()
non-blocking.
Link: https://lkml.kernel.org/r/345de429-a88b-7097-d177-adecf9fed342@I-love.SAKURA.ne.jp
Link: https://syzkaller.appspot.com/bug?extid=8b41a1365f1106fd0f33 [1]
Reported-by: syzbot <syzbot+8b41a1365f1106fd0f33@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+8b41a1365f1106fd0f33@syzkaller.appspotmail.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
[Dominique: add comment at Christian's suggestion]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
The commit that added the new get_random_{u8,u16}() functions neglected
to update the code that clears the batches when bringing up a new CPU.
It also forgot a few comments and helper defines, so add those in too.
Fixes: 585cd5fe9f73 ("random: add 8-bit and 16-bit batches")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The function hooks (ftrace) subsystem is completely different from the
general tracing subsystem. It manages how to attach callbacks to most functions in
the kernel. It is also used by live kernel patching. It really is not part
of tracing, although tracing uses it.
Create a separate entry for FUNCTION HOOKS (FTRACE) to be separate from
tracing itself in the MAINTAINERS file.
Perhaps it should be moved out of the kernel/trace directory, but that's
for another time.
Link: https://lkml.kernel.org/r/20221006144439.459272364@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The tracing git repo will no longer be housed in my personal git repo,
but instead live in trace/linux-trace.git.
Update the MAINTAINERS file appropriately.
Link: https://lkml.kernel.org/r/20221006144439.282193367@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When using syscall wrappers the __SYSCALL_DEFINEx() and related macros
add a "__powerpc_" prefix to all syscall entry points.
So for example sys_mmap becomes __powerpc_sys_mmap.
This risks breaking workflows and tools that expect the old naming
scheme. At a minimum, setting a breakpoint on e.g. sys_mmap with gdb no
longer works.
There seems to be no compelling reason to add the "__powerpc_" prefix,
other than that it follows what some other arches do (x86, arm64, s390).
But unlike other arches powerpc doesn't always enable syscall wrappers,
so the syscall entry points can change name depending on CONFIG options.
For those reasons drop the "__powerpc_" prefix, reverting to the
existing naming.
Doing so reveals two prototypes in signal.h that have the incorrect type
when syscall wrappers are enabled. There are already prototypes for both
functions in syscalls.h, so drop the ones from signal.h.
Fixes: 7e92e01b7245 ("powerpc: Provide syscall wrapper")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221006135940.1223988-1-mpe@ellerman.id.au
|
|
This removes the second use of the sched_core_mask temporary mask.
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
|
|
Following the recent introduction of for_each_andnot(), add some tests to
ensure for_each_cpu_and(not) yields the same result as iterating over the
result of cpumask_and(not)().
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
|
|
for_each_cpu_and() is very convenient as it saves having to allocate a
temporary cpumask to store the result of cpumask_and(). The same issue
applies to cpumask_andnot() which doesn't actually need temporary storage
for iteration purposes.
Following what has been done for for_each_cpu_and(), introduce
for_each_cpu_andnot().
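Usage then looks like this (sketch; do_something() is a placeholder):

  unsigned int cpu;

  /* visit every CPU set in 'mask' but not in 'exclude', with no
   * temporary cpumask needed for the intermediate result */
  for_each_cpu_andnot(cpu, mask, exclude)
          do_something(cpu);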
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
|
|
In preparation of introducing for_each_cpu_andnot(), add a variant of
find_next_bit() that negates the bits in @addr2 when ANDing them with the
bits in @addr1.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
|
|
Since commit ae626eb97376 ("ARM/dma-mapping: use dma-direct
unconditionally") only the dma_coherent flag in struct device is used,
so remove the now write-only flag in struct dev_archdata.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
|
|
Commit ae626eb97376 ("ARM/dma-mapping: use dma-direct unconditionally")
caused a regression on the mvebu platform, wherein devices that are
dma-coherent are marked as dma-noncoherent, because although
mvebu_hwcc_notifier() after that commit still marks then as coherent,
the arm_coherent_dma_ops() function, which is called later, overwrites
this setting, since it is being called from drivers/of/device.c with
coherency parameter determined by of_dma_is_coherent(), and the
device-trees do not declare the 'dma-coherent' property.
Fix this by defaulting to not clearing the dma_coherent flag in
arm_coherent_dma_ops().
Fixes: ae626eb97376 ("ARM/dma-mapping: use dma-direct unconditionally")
Reported-by: Marek Behún <kabel@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Marek Behún <kabel@kernel.org>
|