| Age | Commit message (Collapse) | Author | Files | Lines |
|
Separate the functions for generating MMIO page table entries from the
function that inserts them into the paging structure. This refactoring
will facilitate changes to the MMU sychronization model to use atomic
compare / exchanges (which are not guaranteed to succeed) instead of a
monolithic MMU lock.
No functional change expected.
Tested by running kvm-unit-tests on an Intel Haswell machine. This
commit introduced no new failures.
Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
There are several functions which pass an access permission mask for
SPTEs as an unsigned. This works, but checkpatch complains about it.
Switch the occurrences of unsigned to unsigned int to satisfy checkpatch.
No functional change expected.
Tested by running kvm-unit-tests on an Intel Haswell machine. This
commit introduced no new failures.
Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The blurb pertaining to the return value of nested_vmx_load_cr3() no
longer matches reality, remove it entirely as the behavior it is
attempting to document is quite obvious when reading the actual code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Fold kvm_mips_comparecount_func() into kvm_mips_comparecount_wakeup() to
eliminate the nondescript function name as well as its unnecessary cast
of a vcpu to "unsigned long" and back to a vcpu. Presumably func() was
used as a callback at some point during pre-upstream development, as
wakeup() is the only user of func() and has been the only user since
both with introduced by commit 669e846e6c4e ("KVM/MIPS32: MIPS arch
specific APIs for KVM").
Cc: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Hoist kvm_mips_comparecount_wakeup() above its only user,
kvm_arch_vcpu_create() to fix a compilation error due to referencing an
undefined function.
Fixes: d11dfed5d700 ("KVM: MIPS: Move all vcpu init code into kvm_arch_vcpu_create()")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
kvm_setup_pv_tlb_flush will waste memory and print a misguiding message
when KVM paravirtualization is not available.
Intel SDM says that the when cpuid is used with EAX higher than the
maximum supported value for basic of extended function, the data for the
highest supported basic function will be returned.
So, in some systems, kvm_arch_para_features will return bogus data,
causing kvm_setup_pv_tlb_flush to detect support for pv tlb flush.
Testing for kvm_para_available will work as it checks for the hypervisor
signature.
Besides, when the "nopv" command line parameter is used, it should not
continue as well, as kvm_guest_init will no be called in that case.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
We are testing Virtual Machine with KSM on v5.4-rc2 kernel,
and found the zero_page refcount overflow.
The cause of refcount overflow is increased in try_async_pf
(get_user_page) without being decreased in mmu_set_spte()
while handling ept violation.
In kvm_release_pfn_clean(), only unreserved page will call
put_page. However, zero page is reserved.
So, as well as creating and destroy vm, the refcount of
zero page will continue to increase until it overflows.
step1:
echo 10000 > /sys/kernel/pages_to_scan/pages_to_scan
echo 1 > /sys/kernel/pages_to_scan/run
echo 1 > /sys/kernel/pages_to_scan/use_zero_pages
step2:
just create several normal qemu kvm vms.
And destroy it after 10s.
Repeat this action all the time.
After a long period of time, all domains hang because
of the refcount of zero page overflow.
Qemu print error log as follow:
…
error: kvm run failed Bad address
EAX=00006cdc EBX=00000008 ECX=80202001 EDX=078bfbfd
ESI=ffffffff EDI=00000000 EBP=00000008 ESP=00006cc4
EIP=000efd75 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000f7070 00000037
IDT= 000f70ae 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=00 01 00 00 00 e9 e8 00 00 00 c7 05 4c 55 0f 00 01 00 00 00 <8b> 35 00 00 01 00 8b 3d 04 00 01 00 b8 d8 d3 00 00 c1 e0 08 0c ea a3 00 00 01 00 c7 05 04
…
Meanwhile, a kernel warning is departed.
[40914.836375] WARNING: CPU: 3 PID: 82067 at ./include/linux/mm.h:987 try_get_page+0x1f/0x30
[40914.836412] CPU: 3 PID: 82067 Comm: CPU 0/KVM Kdump: loaded Tainted: G OE 5.2.0-rc2 #5
[40914.836415] RIP: 0010:try_get_page+0x1f/0x30
[40914.836417] Code: 40 00 c3 0f 1f 84 00 00 00 00 00 48 8b 47 08 a8 01 75 11 8b 47 34 85 c0 7e 10 f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb e9 <0f> 0b 31 c0 c3 66 90 66 2e 0f 1f 84 00 0
0 00 00 00 48 8b 47 08 a8
[40914.836418] RSP: 0018:ffffb4144e523988 EFLAGS: 00010286
[40914.836419] RAX: 0000000080000000 RBX: 0000000000000326 RCX: 0000000000000000
[40914.836420] RDX: 0000000000000000 RSI: 00004ffdeba10000 RDI: ffffdf07093f6440
[40914.836421] RBP: ffffdf07093f6440 R08: 800000424fd91225 R09: 0000000000000000
[40914.836421] R10: ffff9eb41bfeebb8 R11: 0000000000000000 R12: ffffdf06bbd1e8a8
[40914.836422] R13: 0000000000000080 R14: 800000424fd91225 R15: ffffdf07093f6440
[40914.836423] FS: 00007fb60ffff700(0000) GS:ffff9eb4802c0000(0000) knlGS:0000000000000000
[40914.836425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[40914.836426] CR2: 0000000000000000 CR3: 0000002f220e6002 CR4: 00000000003626e0
[40914.836427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[40914.836427] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[40914.836428] Call Trace:
[40914.836433] follow_page_pte+0x302/0x47b
[40914.836437] __get_user_pages+0xf1/0x7d0
[40914.836441] ? irq_work_queue+0x9/0x70
[40914.836443] get_user_pages_unlocked+0x13f/0x1e0
[40914.836469] __gfn_to_pfn_memslot+0x10e/0x400 [kvm]
[40914.836486] try_async_pf+0x87/0x240 [kvm]
[40914.836503] tdp_page_fault+0x139/0x270 [kvm]
[40914.836523] kvm_mmu_page_fault+0x76/0x5e0 [kvm]
[40914.836588] vcpu_enter_guest+0xb45/0x1570 [kvm]
[40914.836632] kvm_arch_vcpu_ioctl_run+0x35d/0x580 [kvm]
[40914.836645] kvm_vcpu_ioctl+0x26e/0x5d0 [kvm]
[40914.836650] do_vfs_ioctl+0xa9/0x620
[40914.836653] ksys_ioctl+0x60/0x90
[40914.836654] __x64_sys_ioctl+0x16/0x20
[40914.836658] do_syscall_64+0x5b/0x180
[40914.836664] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[40914.836666] RIP: 0033:0x7fb61cb6bfc7
Signed-off-by: LinFeng <linfeng23@huawei.com>
Signed-off-by: Zhuang Yanying <ann.zhuangyanying@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Take a u64 instead of an unsigned long in kvm_dr7_valid() to fix a build
warning on i386 due to right-shifting a 32-bit value by 32 when checking
for bits being set in dr7[63:32].
Alternatively, the warning could be resolved by rewriting the check to
use an i386-friendly method, but taking a u64 fixes another oddity on
32-bit KVM. Beause KVM implements natural width VMCS fields as u64s to
avoid layout issues between 32-bit and 64-bit, a devious guest can stuff
vmcs12->guest_dr7 with a 64-bit value even when both the guest and host
are 32-bit kernels. KVM eventually drops vmcs12->guest_dr7[63:32] when
propagating vmcs12->guest_dr7 to vmcs02, but ideally KVM would not rely
on that behavior for correctness.
Cc: Jim Mattson <jmattson@google.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Fixes: ecb697d10f70 ("KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Commit 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw
clock") changed kvmclock to use tkr_raw instead of tkr_mono. However,
the default kvmclock_offset for the VM was still based on the monotonic
clock and, if the raw clock drifted enough from the monotonic clock,
this could cause a negative system_time to be written to the guest's
struct pvclock. RHEL5 does not like it and (if it boots fast enough to
observe a negative time value) it hangs.
There is another thing to be careful about: getboottime64 returns the
host boot time with tkr_mono frequency, and subtracting the tkr_raw-based
kvmclock value will cause the wallclock to be off if tkr_raw drifts
from tkr_mono. To avoid this, compute the wallclock delta from the
current time instead of being clever and using getboottime64.
Fixes: 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
Cc: stable@vger.kernel.org
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
We will need a copy of tk->offs_boot in the next patch. Store it and
cleanup the struct: instead of storing tk->tkr_xxx.base with the tk->offs_boot
included, store the raw value in struct pvclock_clock and sum it in
do_monotonic_raw and do_realtime. tk->tkr_xxx.xtime_nsec also moves
to struct pvclock_clock.
While at it, fix a (usually harmless) typo in do_monotonic_raw, which
was using gtod->clock.shift instead of gtod->raw_clock.shift.
Fixes: 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
Cc: stable@vger.kernel.org
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The function nested_vmx_run() declaration is below its implementation. So
this is meaningless and should be removed.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
SVM is now able to disable AVIC dynamically whenever the in-kernel PIT sets
up an ack notifier, so we can enable it even if in-kernel IOAPIC/PIC/PIT
are in use.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
In-kernel IOAPIC does not receive EOI with AMD SVM AVIC
since the processor accelerate write to APIC EOI register and
does not trap if the interrupt is edge-triggered.
Workaround this by lazy check for pending APIC EOI at the time when
setting new IOPIC irq, and update IOAPIC EOI if no pending APIC EOI.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Refactor code for handling IOAPIC EOI for subsequent patch.
There is no functional change.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
AMD SVM AVIC accelerates EOI write and does not trap. This causes
in-kernel PIT re-injection mode to fail since it relies on irq-ack
notifier mechanism. So, APICv is activated only when in-kernel PIT
is in discard mode e.g. w/ qemu option:
-global kvm-pit.lost_tick_policy=discard
Also, introduce APICV_INHIBIT_REASON_PIT_REINJ bit to be used for this
reason.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
AMD AVIC does not support ExtINT. Therefore, AVIC must be temporary
deactivated and fall back to using legacy interrupt injection via vINTR
and interrupt window.
Also, introduce APICV_INHIBIT_REASON_IRQWIN to be used for this reason.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[Rename svm_request_update_avic to svm_toggle_avic_for_extint. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Since AVIC does not currently work w/ nested virtualization,
deactivate AVIC for the guest if setting CPUID Fn80000001_ECX[SVM]
(i.e. indicate support for SVM, which is needed for nested virtualization).
Also, introduce a new APICV_INHIBIT_REASON_NESTED bit to be used for
this reason.
Suggested-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Since disabling APICv has to be done for all vcpus on AMD-based
system, adopt the newly introduced kvm_request_apicv_update()
interface, and introduce a new APICV_INHIBIT_REASON_HYPERV.
Also, remove the kvm_vcpu_deactivate_apicv() since no longer used.
Cc: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add necessary logics to support (de)activate AVIC at runtime.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
AMD SVM AVIC needs to update APIC backing page mapping before changing
APICv mode. Introduce struct kvm_x86_ops.pre_update_apicv_exec_ctrl
function hook to be called prior KVM APICv update request to each vcpu.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Inibit reason bits are used to determine if APICv deactivation is
applicable for a particular hardware virtualization architecture.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Re-factor avic_init_access_page() to avic_update_access_page() since
activate/deactivate AVIC requires setting/unsetting the memory region used
for virtual APIC backing page (APIC_ACCESS_PAGE_PRIVATE_MEMSLOT).
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Introduce interface for (de)activate posted interrupts, and
implement SVM hooks to toggle AMD IOMMU guest virtual APIC mode.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add trace points when sending request to (de)activate APICv.
Suggested-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Certain runtime conditions require APICv to be temporary deactivated
during runtime. The current implementation only support run-time
deactivation of APICv when Hyper-V SynIC is enabled, which is not
temporary.
In addition, for AMD, when APICv is (de)activated at runtime,
all vcpus in the VM have to operate in the same mode. Thus the
requesting vcpu must notify the others.
So, introduce the following:
* A new KVM_REQ_APICV_UPDATE request bit
* Interfaces to request all vcpus to update APICv status
* A new interface to update APICV-related parameters for each vcpu
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
It is unused now.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
There are several reasons in which a VM needs to deactivate APICv
e.g. disable APICv via parameter during module loading, or when
enable Hyper-V SynIC support. Additional inhibit reasons will be
introduced later on when dynamic APICv is supported,
Introduce KVM APICv inhibit reason bits along with a new variable,
apicv_inhibit_reasons, to help keep track of APICv state for each VM,
Initially, the APICV_INHIBIT_REASON_DISABLE bit is used to indicate
the case where APICv is disabled during KVM module load.
(e.g. insmod kvm_amd avic=0 or insmod kvm_intel enable_apicv=0).
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[Do not use get_enable_apicv; consider irqchip_split in svm.c. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Re-factor code into a helper function for setting lapic parameters when
activate/deactivate APICv, and export the function for subsequent usage.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that we are mapping kvm_steal_time from the guest directly we
don't need keep a copy of it in kvm_vcpu_arch.st. The same is true
for the stime field.
This is part of CVE-2019-3016.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
There is a potential race in record_steal_time() between setting
host-local vcpu->arch.st.steal.preempted to zero (i.e. clearing
KVM_VCPU_PREEMPTED) and propagating this value to the guest with
kvm_write_guest_cached(). Between those two events the guest may
still see KVM_VCPU_PREEMPTED in its copy of kvm_steal_time, set
KVM_VCPU_FLUSH_TLB and assume that hypervisor will do the right
thing. Which it won't.
Instad of copying, we should map kvm_steal_time and that will
guarantee atomicity of accesses to @preempted.
This is part of CVE-2019-3016.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
__kvm_map_gfn()'s call to gfn_to_pfn_memslot() is
* relatively expensive
* in certain cases (such as when done from atomic context) cannot be called
Stashing gfn-to-pfn mapping should help with both cases.
This is part of CVE-2019-3016.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
kvm_vcpu_(un)map operates on gfns from any current address space.
In certain cases we want to make sure we are not mapping SMRAM
and for that we can use kvm_(un)map_gfn() that we are introducing
in this patch.
This is part of CVE-2019-3016.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
kvm_steal_time_set_preempted() may accidentally clear KVM_VCPU_FLUSH_TLB
bit if it is called more than once while VCPU is preempted.
This is part of CVE-2019-3016.
(This bug was also independently discovered by Jim Mattson
<jmattson@google.com>)
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
VF numbers were assigned to node_guid and port_guid, but cleared
right before such query calls were issued. It caused to return
node/port GUIDs of VF index 0 for all VFs.
Fixes: 30aad41721e0 ("net/core: Add support for getting VF GUIDs")
Reported-by: Adrian Chiris <adrianc@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
'struct timex' is one of the last users of 'struct timeval' and is
only referenced in one place in the kernel any more, to convert the
user space timex into the kernel-internal version on sparc64, with a
different tv_usec member type.
As a preparation for hiding the time_t definition and everything
using that in the kernel, change the implementation once more
to only convert the timeval member, and then enclose the
struct definition in an #ifdef.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Julian Calaby <julian.calaby@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate and replace 5level-fixup.h with pgtable-nop4d.h.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IDE core always sets ->dn correctly so changing it is never
required.
Setting it to a different value than assigned by IDE core is very likely
to result in data corruption (due to wrong transfer timings being set on
the controller etc.)
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Tested-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If CONFIG_MPTCP=y, CONFIG_MPTCP_IPV6=n, and CONFIG_IPV6=m:
ERROR: "mptcp_handle_ipv6_mapped" [net/ipv6/ipv6.ko] undefined!
This does not happen if CONFIG_MPTCP_IPV6=y, as CONFIG_MPTCP_IPV6
selects CONFIG_IPV6, and thus forces CONFIG_IPV6 builtin.
As exporting a symbol for an empty function would be a bit wasteful, fix
this by providing a dummy version of mptcp_handle_ipv6_mapped() for the
CONFIG_MPTCP_IPV6=n case.
Rename mptcp_handle_ipv6_mapped() to mptcpv6_handle_mapped(), to make it
clear this is a pure-IPV6 function, just like mptcpv6_init().
Fixes: cec37a6e41aae7bf ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Convert the equivalent but rather odd uses of kmemdup with
__GFP_ZERO to the more common kstrdup and avoid unnecessary
zeroing of copied over memory.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 6cd021a58c18a ("udp: segment looped gso packets correctly")
fixes an issue with rare udp gso multicast packets looped onto the
receive path.
The stable backport makes the narrowest change to target only these
packets, when needed. As opposed to, say, expanding __udp_gso_segment,
which is harder to reason to be free from unintended side-effects.
But the resulting code is hardly self-describing.
Document its purpose and rationale.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As the MPTCP HMAC test is integrated into the MPTCP code, it can be
built only when MPTCP is enabled. Hence when MPTCP is disabled, asking
the user if the test code should be enabled is futile.
Wrap the whole block of MPTCP-specific config options inside a check for
MPTCP. While at it, drop the "default n" for MPTCP_HMAC_TEST, as that
is the default anyway.
Fixes: 65492c5a6ab5df50 ("mptcp: move from sha1 (v0) to sha256 (v1)")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If CONFIG_MPTCP=y, CONFIG_MPTCP_IPV6=n, and CONFIG_IPV6=m:
net/mptcp/protocol.o: In function `__mptcp_tcp_fallback':
protocol.c:(.text+0x786): undefined reference to `inet6_stream_ops'
Fix this by checking for CONFIG_MPTCP_IPV6 instead of CONFIG_IPV6, like
is done in all other places in the mptcp code.
Fixes: 8ab183deb26a3b79 ("mptcp: cope with later TCP fallback")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This adds IORING_OP_EPOLL_CTL, which can perform the same work as the
epoll_ctl(2) system call.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Also make it available outside of epoll, along with the helper that
decides if we need to copy the passed in epoll_event.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We're not consistent in how the file table is grabbed and assigned if we
have a command linked that requires the use of it.
Add ->file_table to the io_op_defs[] array, and use that to determine
when to grab the table instead of having the handlers set it if they
need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES
flag. We always initialize work->files, so io-wq can just check for
that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit 74759e1693311a8d1441de836c4080c192374238.
mptcp@lists.01.org accepts messages from non-subscribers. There was an
invisible and unexpected server-wide rule limiting the number of
recipients for subscribers and non-subscribers alike, and that has now
been turned off for this list.
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We can't deal with syncookie mode yet, the syncookie rx path will create
tcp reqsk, i.e. we get OOB access because we treat tcp reqsk as mptcp reqsk one:
TCP: SYN flooding on port 20002. Sending cookies.
BUG: KASAN: slab-out-of-bounds in subflow_syn_recv_sock+0x451/0x4d0 net/mptcp/subflow.c:191
Read of size 1 at addr ffff8881167bc148 by task syz-executor099/2120
subflow_syn_recv_sock+0x451/0x4d0 net/mptcp/subflow.c:191
tcp_get_cookie_sock+0xcf/0x520 net/ipv4/syncookies.c:209
cookie_v6_check+0x15a5/0x1e90 net/ipv6/syncookies.c:252
tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1123 [inline]
[..]
Bug can be reproduced via "sysctl net.ipv4.tcp_syncookies=2".
Note that MPTCP should work with syncookies (4th ack would carry needed
state), but it appears better to sort that out in -next so do tcp
fallback for now.
I removed the MPTCP ifdef for tcp_rsk "is_mptcp" member because
if (IS_ENABLED()) is easier to read than "#ifdef IS_ENABLED()/#endif" pair.
Cc: Eric Dumazet <edumazet@google.com>
Fixes: cec37a6e41aae7bf ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Reported-by: Christoph Paasch <cpaasch@apple.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
syzbot triggered following lockdep splat:
ffffffff82d2cd40 (rtnl_mutex){+.+.}, at: ip_mc_drop_socket+0x52/0x180
but task is already holding lock:
ffff8881187a2310 (sk_lock-AF_INET){+.+.}, at: mptcp_close+0x18/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (sk_lock-AF_INET){+.+.}:
lock_acquire+0xee/0x230
lock_sock_nested+0x89/0xc0
do_ip_setsockopt.isra.0+0x335/0x22f0
ip_setsockopt+0x35/0x60
tcp_setsockopt+0x5d/0x90
__sys_setsockopt+0xf3/0x190
__x64_sys_setsockopt+0x61/0x70
do_syscall_64+0x72/0x300
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (rtnl_mutex){+.+.}:
check_prevs_add+0x2b7/0x1210
__lock_acquire+0x10b6/0x1400
lock_acquire+0xee/0x230
__mutex_lock+0x120/0xc70
ip_mc_drop_socket+0x52/0x180
inet_release+0x36/0xe0
__sock_release+0xfd/0x130
__mptcp_close+0xa8/0x1f0
inet_release+0x7f/0xe0
__sock_release+0x69/0x130
sock_close+0x18/0x20
__fput+0x179/0x400
task_work_run+0xd5/0x110
do_exit+0x685/0x1510
do_group_exit+0x7e/0x170
__x64_sys_exit_group+0x28/0x30
do_syscall_64+0x72/0x300
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The trigger is:
socket(AF_INET, SOCK_STREAM, 0x106 /* IPPROTO_MPTCP */) = 4
setsockopt(4, SOL_IP, MCAST_JOIN_GROUP, {gr_interface=7, gr_group={sa_family=AF_INET, sin_port=htons(20003), sin_addr=inet_addr("224.0.0.2")}}, 136) = 0
exit(0)
Which results in a call to rtnl_lock while we are holding
the parent mptcp socket lock via
mptcp_close -> lock_sock(msk) -> inet_release -> ip_mc_drop_socket -> rtnl_lock().
>From lockdep point of view we thus have both
'rtnl_lock; lock_sock' and 'lock_sock; rtnl_lock'.
Fix this by stealing the msk conn_list and doing the subflow close
without holding the msk lock.
Fixes: cec37a6e41aae7bf ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Its not possible to call the kernel_(s|g)etsockopt functions here,
the address points to user memory:
General protection fault in user access. Non-canonical address?
WARNING: CPU: 1 PID: 5352 at arch/x86/mm/extable.c:77 ex_handler_uaccess+0xba/0xe0 arch/x86/mm/extable.c:77
Kernel panic - not syncing: panic_on_warn set ...
[..]
Call Trace:
fixup_exception+0x9d/0xcd arch/x86/mm/extable.c:178
general_protection+0x2d/0x40 arch/x86/entry/entry_64.S:1202
do_ip_getsockopt+0x1f6/0x1860 net/ipv4/ip_sockglue.c:1323
ip_getsockopt+0x87/0x1c0 net/ipv4/ip_sockglue.c:1561
tcp_getsockopt net/ipv4/tcp.c:3691 [inline]
tcp_getsockopt+0x8c/0xd0 net/ipv4/tcp.c:3685
kernel_getsockopt+0x121/0x1f0 net/socket.c:3736
mptcp_getsockopt+0x69/0x90 net/mptcp/protocol.c:830
__sys_getsockopt+0x13a/0x220 net/socket.c:2175
We can call tcp_get/setsockopt functions instead. Doing so fixes
crashing, but still leaves rtnl related lockdep splat:
WARNING: possible circular locking dependency detected
5.5.0-rc6 #2 Not tainted
------------------------------------------------------
syz-executor.0/16334 is trying to acquire lock:
ffffffff84f7a080 (rtnl_mutex){+.+.}, at: do_ip_setsockopt.isra.0+0x277/0x3820 net/ipv4/ip_sockglue.c:644
but task is already holding lock:
ffff888116503b90 (sk_lock-AF_INET){+.+.}, at: lock_sock include/net/sock.h:1516 [inline]
ffff888116503b90 (sk_lock-AF_INET){+.+.}, at: mptcp_setsockopt+0x28/0x90 net/mptcp/protocol.c:1284
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (sk_lock-AF_INET){+.+.}:
lock_sock_nested+0xca/0x120 net/core/sock.c:2944
lock_sock include/net/sock.h:1516 [inline]
do_ip_setsockopt.isra.0+0x281/0x3820 net/ipv4/ip_sockglue.c:645
ip_setsockopt+0x44/0xf0 net/ipv4/ip_sockglue.c:1248
udp_setsockopt+0x5d/0xa0 net/ipv4/udp.c:2639
__sys_setsockopt+0x152/0x240 net/socket.c:2130
__do_sys_setsockopt net/socket.c:2146 [inline]
__se_sys_setsockopt net/socket.c:2143 [inline]
__x64_sys_setsockopt+0xba/0x150 net/socket.c:2143
do_syscall_64+0xbd/0x5b0 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (rtnl_mutex){+.+.}:
check_prev_add kernel/locking/lockdep.c:2475 [inline]
check_prevs_add kernel/locking/lockdep.c:2580 [inline]
validate_chain kernel/locking/lockdep.c:2970 [inline]
__lock_acquire+0x1fb2/0x4680 kernel/locking/lockdep.c:3954
lock_acquire+0x127/0x330 kernel/locking/lockdep.c:4484
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x158/0x1340 kernel/locking/mutex.c:1103
do_ip_setsockopt.isra.0+0x277/0x3820 net/ipv4/ip_sockglue.c:644
ip_setsockopt+0x44/0xf0 net/ipv4/ip_sockglue.c:1248
tcp_setsockopt net/ipv4/tcp.c:3159 [inline]
tcp_setsockopt+0x8c/0xd0 net/ipv4/tcp.c:3153
kernel_setsockopt+0x121/0x1f0 net/socket.c:3767
mptcp_setsockopt+0x69/0x90 net/mptcp/protocol.c:1288
__sys_setsockopt+0x152/0x240 net/socket.c:2130
__do_sys_setsockopt net/socket.c:2146 [inline]
__se_sys_setsockopt net/socket.c:2143 [inline]
__x64_sys_setsockopt+0xba/0x150 net/socket.c:2143
do_syscall_64+0xbd/0x5b0 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET);
lock(rtnl_mutex);
lock(sk_lock-AF_INET);
lock(rtnl_mutex);
The lockdep complaint is because we hold mptcp socket lock when calling
the sk_prot get/setsockopt handler, and those might need to acquire the
rtnl mutex. Normally, order is:
rtnl_lock(sk) -> lock_sock
Whereas for mptcp the order is
lock_sock(mptcp_sk) rtnl_lock -> lock_sock(subflow_sk)
We can avoid this by releasing the mptcp socket lock early, but, as Paolo
points out, we need to get/put the subflow socket refcount before doing so
to avoid race with concurrent close().
Fixes: 717e79c867ca5 ("mptcp: Add setsockopt()/getsockopt() socket operations")
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|