Age | Commit message (Collapse) | Author | Files | Lines |
|
Userspace that does not know about KVM_GET_MSR_FEATURE_INDEX_LIST
will generally use the default value for MSR_IA32_ARCH_CAPABILITIES.
When this happens and the host has tsx=on, it is possible to end up with
virtual machines that have HLE and RTM disabled, but TSX_CTRL available.
If the fleet is then switched to tsx=off, kvm_get_arch_capabilities()
will clear the ARCH_CAP_TSX_CTRL_MSR bit and it will not be possible to
use the tsx=off hosts as migration destinations, even though the guests
do not have TSX enabled.
To allow this migration, allow guests to write to their TSX_CTRL MSR,
while keeping the host MSR unchanged for the entire life of the guests.
This ensures that TSX remains disabled and also saves MSR reads and
writes, and it's okay to do because with tsx=off we know that guests will
not have the HLE and RTM features in their CPUID. (If userspace sets
bogus CPUID data, we do not expect HLE and RTM to work in guests anyway).
Cc: stable@vger.kernel.org
Fixes: cbbaa2727aa3 ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Grab kvm->lock before pinning memory when registering an encrypted
region; sev_pin_memory() relies on kvm->lock being held to ensure
correctness when checking and updating the number of pinned pages.
Add a lockdep assertion to help prevent future regressions.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Fixes: 1e80fdc09d12 ("KVM: SVM: Pin guest memory when SEV is active")
Signed-off-by: Peter Gonda <pgonda@google.com>
V2
- Fix up patch description
- Correct file paths svm.c -> sev.c
- Add unlock of kvm->lock on sev_pin_memory error
V1
- https://lore.kernel.org/kvm/20210126185431.1824530-1-pgonda@google.com/
Message-Id: <20210127161524.2832400-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Nested VMX was enabled by default in commit 1e58e5e59148 ("KVM:
VMX: enable nested virtualization by default"), which was merged
in Linux 4.20. This patch is to fix the documentation accordingly.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <20210128154747.4242-1-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Recent commit 255cbecfe0 modified struct kvm_vcpu_arch to make
'cpuid_entries' a pointer to an array of kvm_cpuid_entry2 entries
rather than embedding the array in the struct. KVM_SET_CPUID and
KVM_SET_CPUID2 were updated accordingly, but KVM_GET_CPUID2 was missed.
As a result, KVM_GET_CPUID2 currently returns random fields from struct
kvm_vcpu_arch to userspace rather than the expected CPUID values. Fix
this by treating 'cpuid_entries' as a pointer when copying its
contents to userspace buffer.
Fixes: 255cbecfe0c9 ("KVM: x86: allocate vcpu->arch.cpuid_entries dynamically")
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Michael Roth <michael.roth@amd.com.com>
Message-Id: <20210128024451.1816770-1-michael.roth@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS,
which may need to be loaded outside guest mode. Therefore we cannot
WARN in that case.
However, that part of nested_get_vmcs12_pages is _not_ needed at
vmentry time. Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling,
so that both vmentry and migration (and in the latter case, independent
of is_guest_mode) do the parts that are needed.
Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3ba: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
Cc: <stable@vger.kernel.org> # 5.10.x
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Revert the dirty/available tracking of GPRs now that KVM copies the GPRs
to the GHCB on any post-VMGEXIT VMRUN, even if a GPR is not dirty. Per
commit de3cd117ed2f ("KVM: x86: Omit caching logic for always-available
GPRs"), tracking for GPRs noticeably impacts KVM's code footprint.
This reverts commit 1c04d8c986567c27c56c05205dceadc92efb14ff.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210122235049.3107620-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Drop the per-GPR dirty checks when synchronizing GPRs to the GHCB, the
GRPs' dirty bits are set from time zero and never cleared, i.e. will
always be seen as dirty. The obvious alternative would be to clear
the dirty bits when appropriate, but removing the dirty checks is
desirable as it allows reverting GPR dirty+available tracking, which
adds overhead to all flavors of x86 VMs.
Note, unconditionally writing the GPRs in the GHCB is tacitly allowed
by the GHCB spec, which allows the hypervisor (or guest) to provide
unnecessary info; it's the guest's responsibility to consume only what
it needs (the hypervisor is untrusted after all).
The guest and hypervisor can supply additional state if desired but
must not rely on that additional state being provided.
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210122235049.3107620-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Even when we are outside the nested guest, some vmcs02 fields
may not be in sync vs vmcs12. This is intentional, even across
nested VM-exit, because the sync can be delayed until the nested
hypervisor performs a VMCLEAR or a VMREAD/VMWRITE that affects those
rarely accessed fields.
However, during KVM_GET_NESTED_STATE, the vmcs12 has to be up to date to
be able to restore it. To fix that, call copy_vmcs02_to_vmcs12_rare()
before the vmcs12 contents are copied to userspace.
Fixes: 7952d769c29ca ("KVM: nVMX: Sync rarely accessed guest fields only when needed")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210114205449.8715-2-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
On VMX, if we exit and then re-enter immediately without leaving
the vmx_vcpu_run() function, the kvm_entry event is not logged.
That means we will see one (or more) kvm_exit, without its (their)
corresponding kvm_entry, as shown here:
CPU-1979 [002] 89.871187: kvm_entry: vcpu 1
CPU-1979 [002] 89.871218: kvm_exit: reason MSR_WRITE
CPU-1979 [002] 89.871259: kvm_exit: reason MSR_WRITE
It also seems possible for a kvm_entry event to be logged, but then
we leave vmx_vcpu_run() right away (if vmx->emulation_required is
true). In this case, we will have a spurious kvm_entry event in the
trace.
Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run()
(where trace_kvm_exit() already is).
A trace obtained with this patch applied looks like this:
CPU-14295 [000] 8388.395387: kvm_entry: vcpu 0
CPU-14295 [000] 8388.395392: kvm_exit: reason MSR_WRITE
CPU-14295 [000] 8388.395393: kvm_entry: vcpu 0
CPU-14295 [000] 8388.395503: kvm_exit: reason EXTERNAL_INTERRUPT
Of course, not calling trace_kvm_entry() in common x86 code any
longer means that we need to adjust the SVM side of things too.
Signed-off-by: Lorenzo Brescia <lorenzo.brescia@edu.unito.it>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Message-Id: <160873470698.11652.13483635328769030605.stgit@Wayrath>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Update various words, including the wrong parameter name and the vague
description of the usage of "slot" field.
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Message-Id: <20201208043439.895-1-yuzenghui@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The injection process of smi has two steps:
Qemu KVM
Step1:
cpu->interrupt_request &= \
~CPU_INTERRUPT_SMI;
kvm_vcpu_ioctl(cpu, KVM_SMI)
call kvm_vcpu_ioctl_smi() and
kvm_make_request(KVM_REQ_SMI, vcpu);
Step2:
kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
call process_smi() if
kvm_check_request(KVM_REQ_SMI, vcpu) is
true, mark vcpu->arch.smi_pending = true;
The vcpu->arch.smi_pending will be set true in step2, unfortunately if
vcpu paused between step1 and step2, the kvm_run->immediate_exit will be
set and vcpu has to exit to Qemu immediately during step2 before mark
vcpu->arch.smi_pending true.
During VM migration, Qemu will get the smi pending status from KVM using
KVM_GET_VCPU_EVENTS ioctl at the downtime, then the smi pending status
will be lost.
Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com>
Signed-off-by: Shengen Zhuang <zhuangshengen@huawei.com>
Message-Id: <20210118084720.1585-1-jianjay.zhou@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The HW_REF_CPU_CYCLES event on the fixed counter 2 is pseudo-encoded as
0x0300 in the intel_perfmon_event_map[]. Correct its usage.
Fixes: 62079d8a4312 ("KVM: PMU: add proper support for fixed counter 2")
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20201230081916.63417-1-like.xu@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Since we know vPMU will not work properly when (1) the guest bit_width(s)
of the [gp|fixed] counters are greater than the host ones, or (2) guest
requested architectural events exceeds the range supported by the host, so
we can setup a smaller left shift value and refresh the guest cpuid entry,
thus fixing the following UBSAN shift-out-of-bounds warning:
shift exponent 197 is too large for 64-bit type 'long long unsigned int'
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
intel_pmu_refresh.cold+0x75/0x99 arch/x86/kvm/vmx/pmu_intel.c:348
kvm_vcpu_after_set_cpuid+0x65a/0xf80 arch/x86/kvm/cpuid.c:177
kvm_vcpu_ioctl_set_cpuid2+0x160/0x440 arch/x86/kvm/cpuid.c:308
kvm_arch_vcpu_ioctl+0x11b6/0x2d70 arch/x86/kvm/x86.c:4709
kvm_vcpu_ioctl+0x7b9/0xdb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3386
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported-by: syzbot+ae488dc136a4cc6ba32b@syzkaller.appspotmail.com
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210118025800.34620-1-like.xu@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add compile-time asserts in rsvd_bits() to guard against KVM passing in
garbage hardcoded values, and cap the upper bound at '63' for dynamic
values to prevent generating a mask that would overflow a u64.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210113204515.3473079-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The documentation classifies KVM_ENABLE_CAP with KVM_CAP_ENABLE_CAP_VM
as a vcpu ioctl, which is incorrect. Fix it by specifying it as a VM
ioctl.
Fixes: e5d83c74a580 ("kvm: make KVM_CAP_ENABLE_CAP_VM architecture agnostic")
Signed-off-by: Quentin Perret <qperret@google.com>
Message-Id: <20210108165349.747359-1-qperret@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
arm_smccc_1_1_hvc() only adds write contraints for x0-3 in the inline
assembly for the HVC instruction so make sure those are the only
registers that change when __do_hyp_init is called.
Tested-by: David Brazdil <dbrazdil@google.com>
Signed-off-by: Andrew Scull <ascull@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210125145415.122439-3-ascull@google.com
|
|
The use of a tagged address could be pretty confusing for the
whole memslot infrastructure as well as the MMU notifiers.
Forbid it altogether, as it never quite worked the first place.
Cc: stable@vger.kernel.org
Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
When running on v8.0 HW, make sure we don't try to advertise
events in the 0x4000-0x403f range.
Cc: stable@vger.kernel.org
Fixes: 88865beca9062 ("KVM: arm64: Mask out filtered events in PCMEID{0,1}_EL1")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210121105636.1478491-1-maz@kernel.org
|
|
KASAN in HW_TAGS mode will store MTE tags in the top byte of the
pointer. When computing the offset for TPIDR_EL2 we don't want anything
in the top byte, so remove the tag to ensure the computation is correct
no matter what the tag.
Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
Signed-off-by: Steven Price <steven.price@arm.com>
[maz: added comment]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210108161254.53674-1-steven.price@arm.com
|
|
The reg_to_encoding() macro is a wrapper over sys_reg() and conveniently
takes a sys_reg_desc or a sys_reg_params argument and returns the 32 bit
register encoding. Use it instead of calling sys_reg() directly.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210106144218.110665-1-alexandru.elisei@arm.com
|
|
The KVM/arm64 PSCI relay assumes that SYSTEM_OFF and SYSTEM_RESET should
not return, as dictated by the PSCI spec. However, there is firmware out
there which breaks this assumption, leading to a hyp panic. Make KVM
more robust to broken firmware by allowing these to return.
Signed-off-by: David Brazdil <dbrazdil@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20201229160059.64135-1-dbrazdil@google.com
|
|
Now that all PMU registers are gated behind a .visibility callback,
remove the other checks against an absent PMU.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
It appears that while we are now able to properly hide PMU
registers from the guest when a PMU isn't available (either
because none has been configured, the host doesn't have
the PMU support compiled in, or that the HW doesn't have
one at all), we are still exposing more than we should to
userspace.
Introduce a visibility callback gating all the PMU registers,
which covers both usrespace and guest.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
|
|
Section 3.11 was incorrectly called 3.9, fix it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
Change my email contact ahead of a likely painful eleven-month migration
to a certain cobalt enteprisey groupware cloud product that will totally
break my workflow. Some day I may get used to having to email being
sequestered behind both claret and cerulean oath2+sms 2fa layers, but
for now I'll stick with keying in one password to receive an email vs.
the required four.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When the creator of SQPOLL io_uring dies (i.e. sqo_task), we don't want
its internals like ->files and ->mm to be poked by the SQPOLL task, it
have never been nice and recently got racy. That can happen when the
owner undergoes destruction and SQPOLL tasks tries to submit new
requests in parallel, and so calls io_sq_thread_acquire*().
That patch halts SQPOLL submissions when sqo_task dies by introducing
sqo_dead flag. Once set, the SQPOLL task must not do any submission,
which is synchronised by uring_lock as well as the new flag.
The tricky part is to make sure that disabling always happens, that
means either the ring is discovered by creator's do_exit() -> cancel,
or if the final close() happens before it's done by the creator. The
last is guaranteed by the fact that for SQPOLL the creator task and only
it holds exactly one file note, so either it pins up to do_exit() or
removed by the creator on the final put in flush. (see comments in
uring_flush() around file->f_count == 2).
One more place that can trigger io_sq_thread_acquire_*() is
__io_req_task_submit(). Shoot off requests on sqo_dead there, even
though actually we don't need to. That's because cancellation of
sqo_task should wait for the request before going any further.
note 1: io_disable_sqo_submit() does io_ring_set_wakeup_flag() so the
caller would enter the ring to get an error, but it still doesn't
guarantee that the flag won't be cleared.
note 2: if final __userspace__ close happens not from the creator
task, the file note will pin the ring until the task dies.
Fixed: b1b6b5a30dce8 ("kernel/io_uring: cancel io_uring before task works")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
files_cancel() should cancel all relevant requests and drop file notes,
so we should never have file notes after that, including on-exit fput
and flush. Add a WARN_ONCE to be sure.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A simple preparation change inlining io_uring_attempt_task_drop() into
io_uring_flush().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We expect io_rw_reissue() to take place only during submission with
uring_lock held. Add a lockdep annotation to check that invariant.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET is set in incompat feature
set, it means the cache device is created with obsoleted layout with
obso_bucket_site_hi. Now bcache does not support this feature bit, a new
BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE incompat feature bit is added
for a better layout to support large bucket size.
For the legacy compatibility purpose, if a cache device created with
obsoleted BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET feature bit, all bcache
devices attached to this cache set should be set to read-only. Then the
dirty data can be written back to backing device before re-create the
cache device with BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE feature bit
by the latest bcache-tools.
This patch checks BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET feature bit
when running a cache set and attach a bcache device to the cache set. If
this bit is set,
- When run a cache set, print an error kernel message to indicate all
following attached bcache device will be read-only.
- When attach a bcache device, print an error kernel message to indicate
the attached bcache device will be read-only, and ask users to update
to latest bcache-tools.
Such change is only for cache device whose bucket size >= 32MB, this is
for the zoned SSD and almost nobody uses such large bucket size at this
moment. If you don't explicit set a large bucket size for a zoned SSD,
such change is totally transparent to your bcache device.
Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When large bucket feature was added, BCH_FEATURE_INCOMPAT_LARGE_BUCKET
was introduced into the incompat feature set. It used bucket_size_hi
(which was added at the tail of struct cache_sb_disk) to extend current
16bit bucket size to 32bit with existing bucket_size in struct
cache_sb_disk.
This is not a good idea, there are two obvious problems,
- Bucket size is always value power of 2, if store log2(bucket size) in
existing bucket_size of struct cache_sb_disk, it is unnecessary to add
bucket_size_hi.
- Macro csum_set() assumes d[SB_JOURNAL_BUCKETS] is the last member in
struct cache_sb_disk, bucket_size_hi was added after d[] which makes
csum_set calculate an unexpected super block checksum.
To fix the above problems, this patch introduces a new incompat feature
bit BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE, when this bit is set, it
means bucket_size in struct cache_sb_disk stores the order of power-of-2
bucket size value. When user specifies a bucket size larger than 32768
sectors, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE will be set to
incompat feature set, and bucket_size stores log2(bucket size) more
than store the real bucket size value.
The obsoleted BCH_FEATURE_INCOMPAT_LARGE_BUCKET won't be used anymore,
it is renamed to BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET and still only
recognized by kernel driver for legacy compatible purpose. The previous
bucket_size_hi is renmaed to obso_bucket_size_hi in struct cache_sb_disk
and not used in bcache-tools anymore.
For cache device created with BCH_FEATURE_INCOMPAT_LARGE_BUCKET feature,
bcache-tools and kernel driver still recognize the feature string and
display it as "obso_large_bucket".
With this change, the unnecessary extra space extend of bcache on-disk
super block can be avoided, and csum_set() may generate expected check
sum as well.
Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch adds the check for features which is incompatible for
current supported feature sets.
Now if the bcache device created by bcache-tools has features that
current kernel doesn't support, read_super() will fail with error
messoage. E.g. if an unsupported incompatible feature detected,
bcache register will fail with dmesg "bcache: register_bcache() error :
Unsupported incompatible feature found".
Fixes: d721a43ff69c ("bcache: increase super block version for cache device and backing device")
Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch fixes the following typos,
from BCH_FEATURE_COMPAT_SUUP to BCH_FEATURE_COMPAT_SUPP
from BCH_FEATURE_INCOMPAT_SUUP to BCH_FEATURE_INCOMPAT_SUPP
from BCH_FEATURE_INCOMPAT_SUUP to BCH_FEATURE_RO_COMPAT_SUPP
Fixes: d721a43ff69c ("bcache: increase super block version for cache device and backing device")
Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is no need to reassign pdev_set_uuid in the second loop iteration,
so move it to the place before second loop.
Signed-off-by: Yi Li <yili@winhong.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
HSDK has hardware floating point and the common use case is with
glibc+hf so enable that as default.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
|
|
The kernel test robot reported a -5.8% performance regression on the
"poll2" test of will-it-scale, and bisected it to commit d55564cfc222
("x86: Make __put_user() generate an out-of-line call").
I didn't expect an out-of-line __put_user() to matter, because no normal
core code should use that non-checking legacy version of user access any
more. But I had overlooked the very odd poll() usage, which does a
__put_user() to update the 'revents' values of the poll array.
Now, Al Viro correctly points out that instead of updating just the
'revents' field, it would be much simpler to just copy the _whole_
pollfd entry, and then we could just use "copy_to_user()" on the whole
array of entries, the same way we use "copy_from_user()" a few lines
earlier to get the original values.
But that is not what we've traditionally done, and I worry that threaded
applications might be concurrently modifying the other fields of the
pollfd array. So while Al's suggestion is simpler - and perhaps worth
trying in the future - this instead keeps the "just update revents"
model.
To fix the performance regression, use the modern "unsafe_put_user()"
instead of __put_user(), with the proper "user_write_access_begin()"
guarding in place. This improves code generation enormously.
Link: https://lore.kernel.org/lkml/20210107134723.GA28532@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Laight <David.Laight@aculab.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit 757055ae8dedf5333af17b3b5b4b70ba9bc9da4e.
The commit caused that ttynull was used as the default console
on several systems[1][2][3]. As a result, the console was
blank even when a better alternative existed.
It happened when there was no console configured
on the command line and ttynull_init() was the first initcall
calling register_console().
Or it happened when /dev/ did not exist when console_on_rootfs()
was called. It was not able to open /dev/console even though
a console driver was registered. It tried to add ttynull console
but it obviously did not help. But ttynull became the preferred
console and was used by /dev/console when it was available later.
The commit tried to fix a historical problem that have been there
for ages. The primary motivation was the commit 3cffa06aeef7ece30f6
("printk/console: Allow to disable console output by using console=""
or console=null"). It provided a clean solution for a workaround
that was widely used and worked only by chance.
This revert causes that the console="" or console=null command line
options will again work only by chance. These options will cause that
a particular console will be preferred and the default (tty) ones
will not get enabled. There will be no console registered at
all. As a result there won't be stdin, stdout, and stderr for
the init process. But it worked exactly this way even before.
The proper solution has to fulfill many conditions:
+ Register ttynull only when explicitly required or as
the ultimate fallback.
+ ttynull should get associated with /dev/console but it must
not become preferred console when used as a fallback.
Especially, it must still be possible to replace it
by a better console later.
Such a change requires clean up of the register_console() code.
Otherwise, it would be even harder to follow. Especially, the use
of has_preferred_console and CON_CONSDEV flag is tricky. The clean
up is risky. The ordering of consoles is not well defined. And
any changes tend to break existing user settings.
Do the revert at the least risky solution for now.
[1] https://lore.kernel.org/linux-kselftest/20201221144302.GR4077@smile.fi.intel.com/
[2] https://lore.kernel.org/lkml/d2a3b3c0-e548-7dd1-730f-59bc5c04e191@synopsys.com/
[3] https://patchwork.ozlabs.org/project/linux-um/patch/20210105120128.10854-1-thomas@m3y3r.de/
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reported-by: Vineet Gupta <vgupta@synopsys.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
hwmon, specifically hwmon_num_channel_attrs, expects the config
array in the hwmon_channel_info structure to be terminated by
a zero entry. amd_energy does not honor this convention. As
result, a KASAN warning is possible. Fix this by adding an
additional entry and setting it to zero.
Fixes: 8abee9566b7e ("hwmon: Add amd_energy driver to report energy counters")
Signed-off-by: David Arcari <darcari@redhat.com>
Cc: Naveen Krishna Chatradhi <nchatrad@amd.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: David Arcari <darcari@redhat.com>
Acked-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Link: https://lore.kernel.org/r/20210107144707.6927-1-darcari@redhat.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
dtc points out that the interrupts for some devices are not parsable:
picoxcell-pc3x2.dtsi:45.19-49.5: Warning (interrupts_property): /paxi/gem@30000: Missing interrupt-parent
picoxcell-pc3x2.dtsi:51.21-55.5: Warning (interrupts_property): /paxi/dmac@40000: Missing interrupt-parent
picoxcell-pc3x2.dtsi:57.21-61.5: Warning (interrupts_property): /paxi/dmac@50000: Missing interrupt-parent
picoxcell-pc3x2.dtsi:233.21-237.5: Warning (interrupts_property): /rwid-axi/axi2pico@c0000000: Missing interrupt-parent
There are two VIC instances, so it's not clear which one needs to be
used. I found the BSP sources that reference VIC0, so use that:
https://github.com/r1mikey/meta-picoxcell/blob/master/recipes-kernel/linux/linux-picochip-3.0/0001-picoxcell-support-for-Picochip-picoXcell-SoC.patch
Acked-by: Jamie Iles <jamie@jamieiles.com>
Link: https://lore.kernel.org/r/20201230152010.3914962-1-arnd@kernel.org'
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Showing the hctx flags for when BLK_MQ_F_TAG_HCTX_SHARED is set gives
something like:
root@debian:/home/john# more /sys/kernel/debug/block/sda/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED|3
Add the decoding for that flag.
Fixes: 32bc15afed04b ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We had kernel panic, it is caused by unload module and last
close confirmation.
call trace:
[1196029.743127] free_sess+0x15/0x50 [rtrs_client]
[1196029.743128] rtrs_clt_close+0x4c/0x70 [rtrs_client]
[1196029.743129] ? rnbd_clt_unmap_device+0x1b0/0x1b0 [rnbd_client]
[1196029.743130] close_rtrs+0x25/0x50 [rnbd_client]
[1196029.743131] rnbd_client_exit+0x93/0xb99 [rnbd_client]
[1196029.743132] __x64_sys_delete_module+0x190/0x260
And in the crashdump confirmation kworker is also running.
PID: 6943 TASK: ffff9e2ac8098000 CPU: 4 COMMAND: "kworker/4:2"
#0 [ffffb206cf337c30] __schedule at ffffffff9f93f891
#1 [ffffb206cf337cc8] schedule at ffffffff9f93fe98
#2 [ffffb206cf337cd0] schedule_timeout at ffffffff9f943938
#3 [ffffb206cf337d50] wait_for_completion at ffffffff9f9410a7
#4 [ffffb206cf337da0] __flush_work at ffffffff9f08ce0e
#5 [ffffb206cf337e20] rtrs_clt_close_conns at ffffffffc0d5f668 [rtrs_client]
#6 [ffffb206cf337e48] rtrs_clt_close at ffffffffc0d5f801 [rtrs_client]
#7 [ffffb206cf337e68] close_rtrs at ffffffffc0d26255 [rnbd_client]
#8 [ffffb206cf337e78] free_sess at ffffffffc0d262ad [rnbd_client]
#9 [ffffb206cf337e88] rnbd_clt_put_dev at ffffffffc0d266a7 [rnbd_client]
The problem is both code path try to close same session, which lead to
panic.
To fix it, just skip the sess if the refcount already drop to 0.
Fixes: f7a7a5c228d4 ("block/rnbd: client: main functionality")
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Adding name to the Contributors List
Signed-off-by: Swapnil Ingle <ingleswapnil@gmail.com>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Acked-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since dynamically allocate sglist is used for rnbd_iu, we can't free sg
table after send_usr_msg since the callback function (cqe.done) could
still access the sglist.
Otherwise KASAN reports UAF issue:
[ 4856.600257] BUG: KASAN: use-after-free in dma_direct_unmap_sg+0x53/0x290
[ 4856.600772] Read of size 4 at addr ffff888206af3a98 by task swapper/1/0
[ 4856.601729] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G W 5.10.0-pserver #5.10.0-1+feature+linux+next+20201214.1025+0910d71
[ 4856.601748] Hardware name: Supermicro Super Server/X11DDW-L, BIOS 3.3 02/21/2020
[ 4856.601766] Call Trace:
[ 4856.601785] <IRQ>
[ 4856.601822] dump_stack+0x99/0xcb
[ 4856.601856] ? dma_direct_unmap_sg+0x53/0x290
[ 4856.601888] print_address_description.constprop.7+0x1e/0x230
[ 4856.601913] ? freeze_kernel_threads+0x73/0x73
[ 4856.601965] ? mark_held_locks+0x29/0xa0
[ 4856.602019] ? dma_direct_unmap_sg+0x53/0x290
[ 4856.602039] ? dma_direct_unmap_sg+0x53/0x290
[ 4856.602079] kasan_report.cold.9+0x37/0x7c
[ 4856.602188] ? mlx5_ib_post_recv+0x430/0x520 [mlx5_ib]
[ 4856.602209] ? dma_direct_unmap_sg+0x53/0x290
[ 4856.602256] dma_direct_unmap_sg+0x53/0x290
[ 4856.602366] complete_rdma_req+0x188/0x4b0 [rtrs_client]
[ 4856.602451] ? rtrs_clt_close+0x80/0x80 [rtrs_client]
[ 4856.602535] ? mlx5_ib_poll_cq+0x48b/0x16e0 [mlx5_ib]
[ 4856.602589] ? radix_tree_insert+0x3a0/0x3a0
[ 4856.602610] ? do_raw_spin_lock+0x119/0x1d0
[ 4856.602647] ? rwlock_bug.part.1+0x60/0x60
[ 4856.602740] rtrs_clt_rdma_done+0x3f7/0x670 [rtrs_client]
[ 4856.602804] ? rtrs_clt_rdma_cm_handler+0xda0/0xda0 [rtrs_client]
[ 4856.602857] ? check_flags.part.31+0x6c/0x1f0
[ 4856.602927] ? rcu_read_lock_sched_held+0xaf/0xe0
[ 4856.602963] ? rcu_read_lock_bh_held+0xc0/0xc0
[ 4856.603137] __ib_process_cq+0x10a/0x350 [ib_core]
[ 4856.603309] ib_poll_handler+0x41/0x1c0 [ib_core]
[ 4856.603358] irq_poll_softirq+0xe6/0x280
[ 4856.603392] ? lockdep_hardirqs_on_prepare+0x111/0x210
[ 4856.603446] __do_softirq+0x10d/0x646
[ 4856.603540] asm_call_irq_on_stack+0x12/0x20
[ 4856.603563] </IRQ>
[ 4856.605096] Allocated by task 8914:
[ 4856.605510] kasan_save_stack+0x19/0x40
[ 4856.605532] __kasan_kmalloc.constprop.7+0xc1/0xd0
[ 4856.605552] __kmalloc+0x155/0x320
[ 4856.605574] __sg_alloc_table+0x155/0x1c0
[ 4856.605594] sg_alloc_table+0x1f/0x50
[ 4856.605620] send_msg_sess_info+0x119/0x2e0 [rnbd_client]
[ 4856.605646] remap_devs+0x71/0x210 [rnbd_client]
[ 4856.605676] init_sess+0xad8/0xe10 [rtrs_client]
[ 4856.605706] rtrs_clt_reconnect_work+0xd6/0x170 [rtrs_client]
[ 4856.605728] process_one_work+0x521/0xa90
[ 4856.605748] worker_thread+0x65/0x5b0
[ 4856.605769] kthread+0x1f2/0x210
[ 4856.605789] ret_from_fork+0x22/0x30
[ 4856.606159] Freed by task 8914:
[ 4856.606559] kasan_save_stack+0x19/0x40
[ 4856.606580] kasan_set_track+0x1c/0x30
[ 4856.606601] kasan_set_free_info+0x1b/0x30
[ 4856.606622] __kasan_slab_free+0x108/0x150
[ 4856.606642] slab_free_freelist_hook+0x64/0x190
[ 4856.606661] kfree+0xe2/0x650
[ 4856.606681] __sg_free_table+0xa4/0x100
[ 4856.606707] send_msg_sess_info+0x1d6/0x2e0 [rnbd_client]
[ 4856.606733] remap_devs+0x71/0x210 [rnbd_client]
[ 4856.606763] init_sess+0xad8/0xe10 [rtrs_client]
[ 4856.606792] rtrs_clt_reconnect_work+0xd6/0x170 [rtrs_client]
[ 4856.606813] process_one_work+0x521/0xa90
[ 4856.606833] worker_thread+0x65/0x5b0
[ 4856.606853] kthread+0x1f2/0x210
[ 4856.606872] ret_from_fork+0x22/0x30
The solution is to free iu's sgtable after the iu is not used anymore.
And also move sg_alloc_table into rnbd_get_iu accordingly.
Fixes: 5a1328d0c3a7 ("block/rnbd-clt: Dynamically allocate sglist for rnbd_iu")
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
KASAN detect following BUG:
[ 778.215311] ==================================================================
[ 778.216696] BUG: KASAN: use-after-free in rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[ 778.219037] Read of size 8 at addr ffff88b1d6516c28 by task tee/8842
[ 778.220500] CPU: 37 PID: 8842 Comm: tee Kdump: loaded Not tainted 5.10.0-pserver #5.10.0-1+feature+linux+next+20201214.1025+0910d71
[ 778.220529] Hardware name: Supermicro Super Server/X11DDW-L, BIOS 3.3 02/21/2020
[ 778.220555] Call Trace:
[ 778.220609] dump_stack+0x99/0xcb
[ 778.220667] ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[ 778.220715] print_address_description.constprop.7+0x1e/0x230
[ 778.220750] ? freeze_kernel_threads+0x73/0x73
[ 778.220896] ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[ 778.220932] ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[ 778.220994] kasan_report.cold.9+0x37/0x7c
[ 778.221066] ? kobject_put+0x80/0x270
[ 778.221102] ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[ 778.221184] rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[ 778.221240] rnbd_srv_dev_session_force_close_store+0x6a/0xc0 [rnbd_server]
[ 778.221304] ? sysfs_file_ops+0x90/0x90
[ 778.221353] kernfs_fop_write+0x141/0x240
[ 778.221451] vfs_write+0x142/0x4d0
[ 778.221553] ksys_write+0xc0/0x160
[ 778.221602] ? __ia32_sys_read+0x50/0x50
[ 778.221684] ? lockdep_hardirqs_on_prepare+0x13d/0x210
[ 778.221718] ? syscall_enter_from_user_mode+0x1c/0x50
[ 778.221821] do_syscall_64+0x33/0x40
[ 778.221862] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 778.221896] RIP: 0033:0x7f4affdd9504
[ 778.221928] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53
[ 778.221956] RSP: 002b:00007fffebb36b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 778.222011] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f4affdd9504
[ 778.222038] RDX: 0000000000000002 RSI: 00007fffebb36c50 RDI: 0000000000000003
[ 778.222066] RBP: 00007fffebb36c50 R08: 0000556a151aa600 R09: 00007f4affeb1540
[ 778.222094] R10: fffffffffffffc19 R11: 0000000000000246 R12: 0000556a151aa520
[ 778.222121] R13: 0000000000000002 R14: 00007f4affea6760 R15: 0000000000000002
[ 778.222764] Allocated by task 3212:
[ 778.223285] kasan_save_stack+0x19/0x40
[ 778.223316] __kasan_kmalloc.constprop.7+0xc1/0xd0
[ 778.223347] kmem_cache_alloc_trace+0x186/0x350
[ 778.223382] rnbd_srv_rdma_ev+0xf16/0x1690 [rnbd_server]
[ 778.223422] process_io_req+0x4d1/0x670 [rtrs_server]
[ 778.223573] __ib_process_cq+0x10a/0x350 [ib_core]
[ 778.223709] ib_cq_poll_work+0x31/0xb0 [ib_core]
[ 778.223743] process_one_work+0x521/0xa90
[ 778.223773] worker_thread+0x65/0x5b0
[ 778.223802] kthread+0x1f2/0x210
[ 778.223833] ret_from_fork+0x22/0x30
[ 778.224296] Freed by task 8842:
[ 778.224800] kasan_save_stack+0x19/0x40
[ 778.224829] kasan_set_track+0x1c/0x30
[ 778.224860] kasan_set_free_info+0x1b/0x30
[ 778.224889] __kasan_slab_free+0x108/0x150
[ 778.224919] slab_free_freelist_hook+0x64/0x190
[ 778.224947] kfree+0xe2/0x650
[ 778.224982] rnbd_destroy_sess_dev+0x2fa/0x3b0 [rnbd_server]
[ 778.225011] kobject_put+0xda/0x270
[ 778.225046] rnbd_srv_sess_dev_force_close+0x30/0x60 [rnbd_server]
[ 778.225081] rnbd_srv_dev_session_force_close_store+0x6a/0xc0 [rnbd_server]
[ 778.225111] kernfs_fop_write+0x141/0x240
[ 778.225140] vfs_write+0x142/0x4d0
[ 778.225169] ksys_write+0xc0/0x160
[ 778.225198] do_syscall_64+0x33/0x40
[ 778.225227] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 778.226506] The buggy address belongs to the object at ffff88b1d6516c00
which belongs to the cache kmalloc-512 of size 512
[ 778.227464] The buggy address is located 40 bytes inside of
512-byte region [ffff88b1d6516c00, ffff88b1d6516e00)
The problem is in the sess_dev release function we call
rnbd_destroy_sess_dev, and could free the sess_dev already, but we still
set the keep_id in rnbd_srv_sess_dev_force_close, which lead to use
after free.
To fix it, move the keep_id before the sysfs removal, and cache the
rnbd_srv_session for lock accessing,
Fixes: 786998050cbc ("block/rnbd-srv: close a mapped device from server side.")
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
lkp reboot following build error:
drivers/block/rnbd/rnbd-clt.c: In function 'rnbd_softirq_done_fn':
>> drivers/block/rnbd/rnbd-clt.c:387:2: error: implicit declaration of function 'sg_free_table_chained' [-Werror=implicit-function-declaration]
387 | sg_free_table_chained(&iu->sgt, RNBD_INLINE_SG_CNT);
| ^~~~~~~~~~~~~~~~~~~~~
The reason is CONFIG_SG_POOL is not enabled in the config, to
avoid such failure, select SG_POOL in Kconfig for RNBD_CLIENT.
Fixes: 5a1328d0c3a7 ("block/rnbd-clt: Dynamically allocate sglist for rnbd_iu")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Shakeel Butt reported in [1] that a user can request a task to be moved
to a resource group even if the task is already in the group. It just
wastes time to do the move operation which could be costly to send IPI
to a different CPU.
Add a sanity check to ensure that the move operation only happens when
the task is not already in the resource group.
[1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/
Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
Reported-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/962ede65d8e95be793cb61102cca37f7bb018e66.1608243147.git.reinette.chatre@intel.com
|
|
Currently, when moving a task to a resource group the PQR_ASSOC MSR is
updated with the new closid and rmid in an added task callback. If the
task is running, the work is run as soon as possible. If the task is not
running, the work is executed later in the kernel exit path when the
kernel returns to the task again.
Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
is running is the right thing to do. Queueing work for a task that is
not running is unnecessary (the PQR_ASSOC MSR is already updated when
the task is scheduled in) and causing system resource waste with the way
in which it is implemented: Work to update the PQR_ASSOC register is
queued every time the user writes a task id to the "tasks" file, even if
the task already belongs to the resource group.
This could result in multiple pending work items associated with a
single task even if they are all identical and even though only a single
update with most recent values is needed. Specifically, even if a task
is moved between different resource groups while it is sleeping then it
is only the last move that is relevant but yet a work item is queued
during each move.
This unnecessary queueing of work items could result in significant
system resource waste, especially on tasks sleeping for a long time.
For example, as demonstrated by Shakeel Butt in [1] writing the same
task id to the "tasks" file can quickly consume significant memory. The
same problem (wasted system resources) occurs when moving a task between
different resource groups.
As pointed out by Valentin Schneider in [2] there is an additional issue
with the way in which the queueing of work is done in that the task_struct
update is currently done after the work is queued, resulting in a race with
the register update possibly done before the data needed by the update is
available.
To solve these issues, update the PQR_ASSOC MSR in a synchronous way
right after the new closid and rmid are ready during the task movement,
only if the task is running. If a moved task is not running nothing
is done since the PQR_ASSOC MSR will be updated next time the task is
scheduled. This is the same way used to update the register when tasks
are moved as part of resource group removal.
[1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20201123022433.17905-1-valentin.schneider@arm.com
[ bp: Massage commit message and drop the two update_task_closid_rmid()
variants. ]
Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
Reported-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/17aa2fb38fc12ce7bb710106b3e7c7b45acb9e94.1608243147.git.reinette.chatre@intel.com
|
|
bdev_evict_inode and bdev_free_inode are also called for the root inode
of bdevfs, for which bdev_alloc is never called. Move the zeroing o
f struct block_device and the initialization of the bd_bdi field into
bdev_alloc_inode to make sure they are initialized for the root inode
as well.
Fixes: e6cb53827ed6 ("block: initialize struct block_device in bdev_alloc")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|