linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2022-05-03	srcu: Automatically determine size-transition strategy at boot	Paul E. McKenney	2	-3/+30
	This commit adds a srcutree.convert_to_big option of zero that causes SRCU to decide at boot whether to wait for contention (small systems) or immediately expand to large (large systems). A new srcutree.big_cpu_lim (defaulting to 128) defines how many CPUs constitute a large system. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Add contention-triggered addition of srcu_node tree	Paul E. McKenney	3	-24/+94
	This commit instruments the acquisitions of the srcu_struct structure's ->lock, enabling the initiation of a transition from SRCU_SIZE_SMALL to SRCU_SIZE_BIG when sufficient contention is experienced. The instrumentation counts the number of trylock failures within the confines of a single jiffy. If that number exceeds the value specified by the srcutree.small_contention_lim kernel boot parameter (which defaults to 100), and if the value specified by the srcutree.convert_to_big kernel boot parameter has the 0x10 bit set (defaults to 0), then a transition will be automatically initiated. By default, there will never be any transitions, so that none of the srcu_struct structures ever gains an srcu_node array. The useful values for srcutree.convert_to_big are: 0x00: Never convert. 0x01: Always convert at init_srcu_struct() time. 0x02: Convert when rcutorture prints its first round of statistics. 0x03: Decide conversion approach at boot given system size. 0x10: Convert if contention is encountered. 0x12: Convert if contention is encountered or when rcutorture prints its first round of statistics, whichever comes first. The value 0x11 acts the same as 0x01 because the conversion happens before there is any chance of contention. [ paulmck: Apply "static" feedback from kernel test robot. ] Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Create concurrency-safe helper for initiating size transition	Paul E. McKenney	1	-2/+21
	Once there are contention-initiated size transitions, it will be possible for rcutorture to initiate a transition at the same time as a contention-initiated transition. This commit therefore creates a concurrency-safe helper function named srcu_transition_to_big() to safely initiate size transitions. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Explain srcu_funnel_gp_start() call to list_add() is safe	Paul E. McKenney	1	-0/+6
	This commit adds a comment explaining why an unprotected call to list_add() from srcu_funnel_gp_start() can be safe. TL;DR: It is only called during very early boot when we don't have no steeking concurrency! Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Prevent cleanup_srcu_struct() from freeing non-dynamic ->sda	Paul E. McKenney	2	-3/+7
	When an srcu_struct structure is created (but not in a kernel module) by DEFINE_SRCU() and friends, the per-CPU srcu_data structure is statically allocated. In all other cases, that structure is obtained from alloc_percpu(), in which case cleanup_srcu_struct() must invoke free_percpu() on the resulting ->sda pointer in the srcu_struct pointer. Which it does. Except that it also invokes free_percpu() on the ->sda pointer referencing the statically allocated per-CPU srcu_data structures. Which free_percpu() is surprisingly OK with. This commit nevertheless stops cleanup_srcu_struct() from freeing statically allocated per-CPU srcu_data structures. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Avoid NULL dereference in srcu_torture_stats_print()	Paul E. McKenney	1	-28/+34
	You really shouldn't invoke srcu_torture_stats_print() after invoking cleanup_srcu_struct(), but there is really no reason to get a compiler-obfuscated per-CPU-variable NULL pointer dereference as the diagnostic. This commit therefore checks for NULL ->sda and makes a more polite console-message complaint in that case. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Use export for srcu_struct defined by DEFINE_STATIC_SRCU()	Alexander Aring	1	-0/+1
	If an srcu_struct structure defined by tree SRCU's DEFINE_STATIC_SRCU() is used by a module, sparse will give the following diagnostic: sparse: symbol '__srcu_struct_nodes_srcu' was not declared. Should it be static? The problem is that a within-module DEFINE_STATIC_SRCU() must define a non-static srcu_struct because it is exported by referencing it in a special '__section("___srcu_struct_ptrs")'. This reference is needed so that module load and unloading can invoke init_srcu_struct() and cleanup_srcu_struct(), respectively. Unfortunately, sparse is unaware of '__section("___srcu_struct_ptrs")', resulting in the above false-positive diagnostic. To avoid this false positive, this commit therefore creates a prototype of the srcu_struct with an "extern" keyword. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Add boot-time control over srcu_node array allocation	Paul E. McKenney	2	-16/+45
	This commit adds an srcu_tree.convert_to_big kernel parameter that either refuses to convert at all (0), converts immediately at init_srcu_struct() time (1), or lets rcutorture convert it (2). An addition contention-based dynamic conversion choice will be added, along with documentation. [ paulmck: Apply callback-scanning feedback from Neeraj Upadhyay. ] Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Ensure snp nodes tree is fully initialized before traversal	Neeraj Upadhyay	1	-3/+19
	For configurations where snp node tree is not initialized at init time (added in subsequent commits), srcu_funnel_gp_start() and srcu_funnel_exp_start() can potential traverse and observe the snp nodes' transient (uninitialized) states. This can potentially happen, when init_srcu_struct_nodes() initialization of sdp->mynode races with srcu_funnel_gp_start() and srcu_funnel_exp_start() Consider the case below where srcu_funnel_gp_start() observes sdp->mynode to be not NULL and uses an uninitialized sdp->grpmask P1 P2 init_srcu_struct_nodes() void srcu_funnel_gp_start(...) { for_each_possible_cpu(cpu) { ... sdp->mynode = &snp_first[...]; for (snp = sdp->mynode;...) struct srcu_node *snp_leaf = smp_load_acquire(&sdp->mynode) ... if (snp_leaf) { for (snp = snp_leaf; ...) ... if (snp == snp_leaf) snp->srcu_data_have_cbs[idx] \|= sdp->grpmask; sdp->grpmask = 1 << (cpu - sdp->mynode->grplo); } } Similarly, init_srcu_struct_nodes() and srcu_funnel_exp_start() can race, where srcu_funnel_exp_start() could observe state of snp lock before spin_lock_init(). P1 P2 init_srcu_struct_nodes() void srcu_funnel_exp_start(...) { srcu_for_each_node_breadth_first(ssp, snp) { for (; ...) { spin_lock_...(snp, ) spin_lock_init(&ACCESS_PRIVATE(snp, lock)); ... } for_each_possible_cpu(cpu) { ... sdp->mynode = &snp_first[...]; To avoid these issues, ensure that snp node tree initialization is complete i.e. after SRCU_SIZE_WAIT_BARRIER srcu_size_state is reached, before traversing the tree. Given that srcu_funnel_gp_start() and srcu_funnel_exp_start() are called within SRCU read side critical sections, this check is safe, in the sense that all callbacks are enqueued on CPU0 srcu_cblist until SRCU_SIZE_WAIT_CALL is entered, and these read side critical sections (containing srcu_funnel_gp_start() and srcu_funnel_exp_start()) need to complete, before SRCU_SIZE_WAIT_CALL is reached. Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Use invalid initial value for srcu_node GP sequence numbers	Paul E. McKenney	1	-9/+27
	Currently, tree SRCU relies on the srcu_node structures being initialized at the same time that the srcu_struct itself is initialized, and thus use the initial grace-period sequence number as the initial value for the srcu_node structure's ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp fields. Although this has a high probability of also working when the srcu_node array is allocated and initialized at some random later time, it would be better to avoid leaving such things to chance. This commit therefore initializes these fields with 0x2, which is a recognizable invalid value. It then adds the required checks for this invalid value in order to avoid confusion on long-running kernels (especially those on 32-bit systems) that allocate and initialize srcu_node arrays late in life. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Compute snp_seq earlier in srcu_funnel_gp_start()	Paul E. McKenney	1	-5/+3
	Currently, srcu_funnel_gp_start() tests snp->srcu_have_cbs[idx] and then separately assigns it to the snp_seq local variable. This commit does the assignment earlier to simplify the code a bit. While in the area, this commit also takes advantage of the 100-character line limit to put the call to srcu_schedule_cbs_sdp() on a single line. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Make rcutorture dump the SRCU size state	Paul E. McKenney	1	-2/+20
	This commit adds the numeric and string version of ->srcu_size_state to the Tree-SRCU-specific portion of the rcutorture output. [ paulmck: Apply feedback from kernel test robot and Dan Carpenter. ] [ quic_neeraju: Apply feedback from Jiapeng Chong. ] Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Add size-state transitioning code	Paul E. McKenney	1	-0/+13
	This is just dead code at the moment, and will be used once the state-transition code is activated. Because srcu_barrier() must be aware of transition before call_srcu(), the state machine waits for an SRCU grace period before callbacks are queued to the non-CPU-0 queues. This requres that portions of srcu_barrier() be enclosed in an SRCU read-side critical section. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Dynamically allocate srcu_node array	Paul E. McKenney	2	-16/+51
	This commit shrinks the srcu_struct structure by converting its ->node field from a fixed-size compile-time array to a pointer to a dynamically allocated array. In kernels built with large values of NR_CPUS that boot on systems with smaller numbers of CPUs, this can save significant memory. [ paulmck: Apply kernel test robot feedback. ] Reported-by: A cast of thousands Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Make Tree SRCU able to operate without snp_node array	Paul E. McKenney	2	-93/+124
	This commit makes Tree SRCU able to operate without an snp_node array, that is, when the srcu_data structures' ->mynode pointers are NULL. This can result in high contention on the srcu_struct structure's ->lock, but only when there are lots of call_srcu(), synchronize_srcu(), and synchronize_srcu_expedited() calls. Note that when there is no snp_node array, all SRCU callbacks use CPU 0's callback queue. This is optimal in the common case of low update-side load because it removes the need to search each CPU for the single callback that made the grace period happen. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Make srcu_funnel_gp_start() cache ->mynode in snp_leaf	Paul E. McKenney	1	-6/+7
	Currently, the srcu_funnel_gp_start() walks its local variable snp up the tree and reloads sdp->mynode whenever it is necessary to check whether it is still at the leaf srcu_node level. This works, but is a bit more obtuse than absolutely necessary. In addition, upcoming commits will dynamically size srcu_struct structures, in which case sdp->mynode will no longer necessarily be a constant, and this commit helps prepare for that dynamic sizing. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Fix s/is/if/ typo in srcu_node comment	Paul E. McKenney	1	-5/+3
	This commit fixed a typo in the srcu_node structure's ->srcu_have_cbs comment. While in the area, redo a couple of comments to take advantage of 100-character line lengths. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11	srcu: Tighten cleanup_srcu_struct() GP checks	Paul E. McKenney	1	-2/+4
	Currently, cleanup_srcu_struct() checks for a grace period in progress, but it does not check for a grace period that has not yet started but which might start at any time. Such a situation could result in a use-after-free bug, so this commit adds a check for a grace period that is needed but not yet started to cleanup_srcu_struct(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-03	Linux 5.18-rc1	Linus Torvalds	1	-2/+2

2022-04-02	Revert "clk: Drop the rate range on clk_put()"	Stephen Boyd	2	-136/+14
	This reverts commit 7dabfa2bc4803eed83d6f22bd6f045495f40636b. There are multiple reports that this breaks boot on various systems. The common theme is that orphan clks are having rates set on them when that isn't expected. Let's revert it out for now so that -rc1 boots. Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Reported-by: Tony Lindgren <tony@atomide.com> Reported-by: Alexander Stein <alexander.stein@ew.tq-group.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Link: https://lore.kernel.org/r/366a0232-bb4a-c357-6aa8-636e398e05eb@samsung.com Cc: Maxime Ripard <maxime@cerno.tech> Signed-off-by: Stephen Boyd <sboyd@kernel.org> Link: https://lore.kernel.org/r/20220403022818.39572-1-sboyd@kernel.org
2022-04-03	modpost: restore the warning message for missing symbol versions	Masahiro Yamada	1	-1/+1
	This log message was accidentally chopped off. I was wondering why this happened, but checking the ML log, Mark precisely followed my suggestion [1]. I just used "..." because I was too lazy to type the sentence fully. Sorry for the confusion. [1]: https://lore.kernel.org/all/CAK7LNAR6bXXk9-ZzZYpTqzFqdYbQsZHmiWspu27rtsFxvfRuVA@mail.gmail.com/ Fixes: 4a6795933a89 ("kbuild: modpost: Explicitly warn about unprototyped symbols") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Mark Brown <broonie@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
2022-04-02	Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()"	Jens Axboe	1	-12/+12
	This reverts commit 6d35d04a9e18990040e87d2bbf72689252669d54. Both Gabriel and Borislav report that this commit casues a regression with nbd: sysfs: cannot create duplicate filename '/dev/block/43:0' Revert it before 5.18-rc1 and we'll investigage this separately in due time. Link: https://lore.kernel.org/all/YkiJTnFOt9bTv6A2@zn.tnic/ Reported-by: Gabriel L. Somlo <somlo@cmu.edu> Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-02	watch_queue: Free the page array when watch_queue is dismantled	Eric Dumazet	1	-0/+1
	Commit 7ea1a0124b6d ("watch_queue: Free the alloc bitmap when the watch_queue is torn down") took care of the bitmap, but not the page array. BUG: memory leak unreferenced object 0xffff88810d9bc140 (size 32): comm "syz-executor335", pid 3603, jiffies 4294946994 (age 12.840s) hex dump (first 32 bytes): 40 a7 40 04 00 ea ff ff 00 00 00 00 00 00 00 00 @.@............. 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: kmalloc_array include/linux/slab.h:621 [inline] kcalloc include/linux/slab.h:652 [inline] watch_queue_set_size+0x12f/0x2e0 kernel/watch_queue.c:251 pipe_ioctl+0x82/0x140 fs/pipe.c:632 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] Reported-by: syzbot+25ea042ae28f3888727a@syzkaller.appspotmail.com Fixes: c73be61cede5 ("pipe: Add general notification queue support") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20220322004654.618274-1-eric.dumazet@gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-02	tracing: mark user_events as BROKEN	Steven Rostedt (Google)	3	-0/+6
	After being merged, user_events become more visible to a wider audience that have concerns with the current API. It is too late to fix this for this release, but instead of a full revert, just mark it as BROKEN (which prevents it from being selected in make config). Then we can work finding a better API. If that fails, then it will need to be completely reverted. To not have the code silently bitrot, still allow building it with COMPILE_TEST. And to prevent the uapi header from being installed, then later changed, and then have an old distro user space see the old version, move the header file out of the uapi directory. Surround the include with CONFIG_COMPILE_TEST to the current location, but when the BROKEN tag is taken off, it will use the uapi directory, and fail to compile. This is a good way to remind us to move the header back. Link: https://lore.kernel.org/all/20220330155835.5e1f6669@gandalf.local.home Link: https://lkml.kernel.org/r/20220330201755.29319-1-mathieu.desnoyers@efficios.com Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-02	tracing: Move user_events.h temporarily out of include/uapi	Steven Rostedt (Google)	2	-0/+5
	While user_events API is under development and has been marked for broken to not let the API become fixed, move the header file out of the uapi directory. This is to prevent it from being installed, then later changed, and then have an old distro user space update with a new kernel, where applications see the user_events being available, but the old header is in place, and then they get compiled incorrectly. Also, surround the include with CONFIG_COMPILE_TEST to the current location, but when the BROKEN tag is taken off, it will use the uapi directory, and fail to compile. This is a good way to remind us to move the header back. Link: https://lore.kernel.org/all/20220330155835.5e1f6669@gandalf.local.home Link: https://lkml.kernel.org/r/20220330201755.29319-1-mathieu.desnoyers@efficios.com Link: https://lkml.kernel.org/r/20220401143903.188384f3@gandalf.local.home Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-02	ftrace: Make ftrace_graph_is_dead() a static branch	Christophe Leroy	2	-15/+18
	ftrace_graph_is_dead() is used on hot paths, it just reads a variable in memory and is not worth suffering function call constraints. For instance, at entry of prepare_ftrace_return(), inlining it avoids saving prepare_ftrace_return() parameters to stack and restoring them after calling ftrace_graph_is_dead(). While at it using a static branch is even more performant and is rather well adapted considering that the returned value will almost never change. Inline ftrace_graph_is_dead() and replace 'kill_ftrace_graph' bool by a static branch. The performance improvement is noticeable. Link: https://lkml.kernel.org/r/e0411a6a0ed3eafff0ad2bc9cd4b0e202b4617df.1648623570.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-02	tracing: Set user_events to BROKEN	Steven Rostedt (Google)	1	-0/+1
	After being merged, user_events become more visible to a wider audience that have concerns with the current API. It is too late to fix this for this release, but instead of a full revert, just mark it as BROKEN (which prevents it from being selected in make config). Then we can work finding a better API. If that fails, then it will need to be completely reverted. Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/ Link: https://lkml.kernel.org/r/20220330155835.5e1f6669@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-02	tracing/user_events: Remove eBPF interfaces	Beau Belgrave	3	-136/+4
	Remove eBPF interfaces within user_events to ensure they are fully reviewed. Link: https://lore.kernel.org/all/20220329165718.GA10381@kbox/ Link: https://lkml.kernel.org/r/20220329173051.10087-1-beaub@linux.microsoft.com Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-02	tracing/user_events: Hold event_mutex during dyn_event_add	Beau Belgrave	1	-2/+6
	Make sure the event_mutex is properly held during dyn_event_add call. This is required when adding dynamic events. Link: https://lkml.kernel.org/r/20220328223225.1992-1-beaub@linux.microsoft.com Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-02	proc: bootconfig: Add null pointer check	Lv Ruyi	1	-0/+2
	kzalloc is a memory allocation function which can return NULL when some internal memory errors happen. It is safer to add null pointer check. Link: https://lkml.kernel.org/r/20220329104004.2376879-1-lv.ruyi@zte.com.cn Cc: stable@vger.kernel.org Fixes: c1a3c36017d4 ("proc: bootconfig: Add /proc/bootconfig to show boot config list") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-02	tracing: Rename the staging files for trace_events	Steven Rostedt (Google)	9	-14/+14
	When looking for implementation of different phases of the creation of the TRACE_EVENT() macro, it is pretty useless when all helper macro redefinitions are in files labeled "stageX_defines.h". Rename them to state which phase the files are for. For instance, when looking for the defines that are used to create the event fields, seeing "stage4_event_fields.h" gives the developer a good idea that the defines are in that file. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-02	KVM: x86: fix sending PV IPI	Li RongQing	1	-1/+1
	If apic_id is less than min, and (max - apic_id) is greater than KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but the new apic_id does not fit the bitmask. In this case __send_ipi_mask should send the IPI. This is mostly theoretical, but it can happen if the apic_ids on three iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0. Fixes: aaffcfd1e82 ("KVM: X86: Implement PV IPIs in linux guest") Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-Id: <1646814944-51801-1-git-send-email-lirongqing@baidu.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86/mmu: do compare-and-exchange of gPTE via the user address	Paolo Bonzini	1	-40/+34
	FNAME(cmpxchg_gpte) is an inefficient mess. It is at least decent if it can go through get_user_pages_fast(), but if it cannot then it tries to use memremap(); that is not just terribly slow, it is also wrong because it assumes that the VM_PFNMAP VMA is contiguous. The right way to do it would be to do the same thing as hva_to_pfn_remapped() does since commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up", 2016-07-05), using follow_pte() and fixup_user_fault() to determine the correct address to use for memremap(). To do this, one could for example extract hva_to_pfn() for use outside virt/kvm/kvm_main.c. But really there is no reason to do that either, because there is already a perfectly valid address to do the cmpxchg() on, only it is a userspace address. That means doing user_access_begin()/user_access_end() and writing the code in assembly to handle exceptions correctly. Worse, the guest PTE can be 8-byte even on i686 so there is the extra complication of using cmpxchg8b to account for. But at least it is an efficient mess. (Thanks to Linus for suggesting improvement on the inline assembly). Reported-by: Qiuhao Li <qiuhao@sysec.org> Reported-by: Gaoning Pan <pgn@zju.edu.cn> Reported-by: Yongkang Jia <kangel@zju.edu.cn> Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Fixes: bd53cb35a3e9 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86: Remove redundant vm_entry_controls_clearbit() call	Zhenzhong Duan	1	-1/+0
	When emulating exit from long mode, EFER_LMA is cleared with vmx_set_efer(). This will already unset the VM_ENTRY_IA32E_MODE control bit as requested by SDM, so there is no need to unset VM_ENTRY_IA32E_MODE again in exit_lmode() explicitly. In case EFER isn't supported by hardware, long mode isn't supported, so exit_lmode() cannot be reached. Note that, thanks to the shadow controls mechanism, this change doesn't eliminate vmread or vmwrite. Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Message-Id: <20220311102643.807507-3-zhenzhong.duan@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86: cleanup enter_rmode()	Zhenzhong Duan	1	-9/+5
	vmx_set_efer() sets uret->data but, in fact if the value of uret->data will be used vmx_setup_uret_msrs() will have rewritten it with the value returned by update_transition_efer(). uret->data is consumed if and only if uret->load_into_hardware is true, and vmx_setup_uret_msrs() takes care of (a) updating uret->data before setting uret->load_into_hardware to true (b) setting uret->load_into_hardware to false if uret->data isn't updated. Opportunistically use "vmx" directly instead of redoing to_vmx(). Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Message-Id: <20220311102643.807507-2-zhenzhong.duan@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86: SVM: fix tsc scaling when the host doesn't support it	Maxim Levitsky	3	-9/+6
	It was decided that when TSC scaling is not supported, the virtual MSR_AMD64_TSC_RATIO should still have the default '1.0' value. However in this case kvm_max_tsc_scaling_ratio is not set, which breaks various assumptions. Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of host support. For consistency, do the same for VMX. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	kvm: x86: SVM: remove unused defines	Maxim Levitsky	1	-8/+0
	Remove some unused #defines from svm.c Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-7-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86: SVM: move tsc ratio definitions to svm.h	Maxim Levitsky	2	-10/+11
	Another piece of SVM spec which should be in the header file Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86: SVM: fix avic spec based definitions again	Maxim Levitsky	2	-14/+5
	Due to wrong rebase, commit 4a204f7895878 ("KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255") moved avic spec #defines back to avic.c. Move them back, and while at it extend AVIC_DOORBELL_PHYSICAL_ID_MASK to 12 bits as well (it will be used in nested avic) Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: MIPS: remove reference to trap&emulate virtualization	Paolo Bonzini	1	-6/+0
	Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220313140522.1307751-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86: document limitations of MSR filtering	Paolo Bonzini	1	-0/+5
	MSR filtering requires an exit to userspace that is hard to implement and would be very slow in the case of nested VMX vmexit and vmentry MSR accesses. Document the limitation. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr	Hou Wenlong	3	-16/+40
	If MSR access is rejected by MSR filtering, kvm_set_msr()/kvm_get_msr() would return KVM_MSR_RET_FILTERED, and the return value is only handled well for rdmsr/wrmsr. However, some instruction emulation and state transition also use kvm_set_msr()/kvm_get_msr() to do msr access but may trigger some unexpected results if MSR access is rejected, E.g. RDPID emulation would inject a #UD but RDPID wouldn't cause a exit when RDPID is supported in hardware and ENABLE_RDTSCP is set. And it would also cause failure when load MSR at nested entry/exit. Since msr filtering is based on MSR bitmap, it is better to only do MSR filtering for rdmsr/wrmsr. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <2b2774154f7532c96a6f04d71c82a8bec7d9e80b.1646655860.git.houwenlong.hwl@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86/emulator: Emulate RDPID only if it is enabled in guest	Hou Wenlong	3	-1/+10
	When RDTSCP is supported but RDPID is not supported in host, RDPID emulation is available. However, __kvm_get_msr() would only fail when RDTSCP/RDPID both are disabled in guest, so the emulator wouldn't inject a #UD when RDPID is disabled but RDTSCP is enabled in guest. Fixes: fb6d4d340e05 ("KVM: x86: emulate RDPID") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.1646226788.git.houwenlong.hwl@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86/pmu: Fix and isolate TSX-specific performance event logic	Like Xu	2	-13/+15
	HSW_IN_TX* bits are used in generic code which are not supported on AMD. Worse, these bits overlap with AMD EventSelect[11:8] and hence using HSW_IN_TX* bits unconditionally in generic code is resulting in unintentional pmu behavior on AMD. For example, if EventSelect[11:8] is 0x2, pmc_reprogram_counter() wrongly assumes that HSW_IN_TX_CHECKPOINTED is set and thus forces sampling period to be 0. Also per the SDM, both bits 32 and 33 "may only be set if the processor supports HLE or RTM" and for "IN_TXCP (bit 33): this bit may only be set for IA32_PERFEVTSEL2." Opportunistically eliminate code redundancy, because if the HSW_IN_TX* bit is set in pmc->eventsel, it is already set in attr.config. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Reported-by: Jim Mattson <jmattson@google.com> Fixes: 103af0a98788 ("perf, kvm: Support the in_tx/in_tx_cp modifiers in KVM arch perfmon emulation v5") Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20220309084257.88931-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set	Maxim Levitsky	1	-1/+1
	It makes more sense to print new SPTE value than the old value. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220302102457.588450-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs	Jim Mattson	1	-5/+3
	AMD EPYC CPUs never raise a #GP for a WRMSR to a PerfEvtSeln MSR. Some reserved bits are cleared, and some are not. Specifically, on Zen3/Milan, bits 19 and 42 are not cleared. When emulating such a WRMSR, KVM should not synthesize a #GP, regardless of which bits are set. However, undocumented bits should not be passed through to the hardware MSR. So, rather than checking for reserved bits and synthesizing a #GP, just clear the reserved bits. This may seem pedantic, but since KVM currently does not support the "Host/Guest Only" bits (41:40), it is necessary to clear these bits rather than synthesizing #GP, because some popular guests (e.g Linux) will set the "Host Only" bit even on CPUs that don't support EFER.SVME, and they don't expect a #GP. For example, root@Ubuntu1804:~# perf stat -e r26 -a sleep 1 Performance counter stats for 'system wide': 0 r26 1.001070977 seconds time elapsed Feb 23 03:59:58 Ubuntu1804 kernel: [ 405.379957] unchecked MSR access error: WRMSR to 0xc0010200 (tried to write 0x0000020000130026) at rIP: 0xffffffff9b276a28 (native_write_msr+0x8/0x30) Feb 23 03:59:58 Ubuntu1804 kernel: [ 405.379958] Call Trace: Feb 23 03:59:58 Ubuntu1804 kernel: [ 405.379963] amd_pmu_disable_event+0x27/0x90 Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Reported-by: Lotus Fenn <lotusf@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Like Xu <likexu@tencent.com> Reviewed-by: David Dunn <daviddunn@google.com> Message-Id: <20220226234131.2167175-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86: Trace all APICv inhibit changes and capture overall status	Sean Christopherson	2	-18/+29
	Trace all APICv inhibit changes instead of just those that result in APICv being (un)inhibited, and log the current state. Debugging why APICv isn't working is frustrating as it's hard to see why APICv is still inhibited, and logging only the first inhibition means unnecessary onion peeling. Opportunistically drop the export of the tracepoint, it is not and should not be used by vendor code due to the need to serialize toggling via apicv_update_lock. Note, using the common flow means kvm_apicv_init() switched from atomic to non-atomic bitwise operations. The VM is unreachable at init, so non-atomic is perfectly ok. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220311043517.17027-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86: Add wrappers for setting/clearing APICv inhibits	Sean Christopherson	6	-36/+49
	Add set/clear wrappers for toggling APICv inhibits to make the call sites more readable, and opportunistically rename the inner helpers to align with the new wrappers and to make them more readable as well. Invert the flag from "activate" to "set"; activate is painfully ambiguous as it's not obvious if the inhibit is being activated, or if APICv is being activated, in which case the inhibit is being deactivated. For the functions that take @set, swap the order of the inhibit reason and @set so that the call sites are visually similar to those that bounce through the wrapper. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220311043517.17027-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: x86: Make APICv inhibit reasons an enum and cleanup naming	Sean Christopherson	6	-31/+35
	Use an enum for the APICv inhibit reasons, there is no meaning behind their values and they most definitely are not "unsigned longs". Rename the various params to "reason" for consistency and clarity (inhibit may be confused as a command, i.e. inhibit APICv, instead of the reason that is getting toggled/checked). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220311043517.17027-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02	KVM: X86: Handle implicit supervisor access with SMAP	Lai Jiangshan	4	-17/+21
	There are two kinds of implicit supervisor access implicit supervisor access when CPL = 3 implicit supervisor access when CPL < 3 Current permission_fault() handles only the first kind for SMAP. But if the access is implicit when SMAP is on, data may not be read nor write from any user-mode address regardless the current CPL. So the second kind should be also supported. The first kind can be detect via CPL and access mode: if it is supervisor access and CPL = 3, it must be implicit supervisor access. But it is not possible to detect the second kind without extra information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS into @access. This extra information also works for the first kind, so the logic is changed to use this information for both cases. The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48 which is in the most significant 16 bits of u64 and less likely to be forced to change due to future hardware uses it. This patch removes the call to ->get_cpl() for access mode is determined by @access. Not only does it reduce a function call, but also remove confusions when the permission is checked for nested TDP. The nested TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing on it. The original code works just because it is always user walk for NPT and SMAP fault is not set for EPT in update_permission_bitmask. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>