Age | Commit message (Collapse) | Author | Files | Lines |
|
When a Secondary Bus Reset is issued at a hotplug port, it causes a Data
Link Layer State Changed event as a side effect. On hotplug ports using
in-band presence detect, it additionally causes a Presence Detect Changed
event.
These spurious events should not result in teardown and re-enumeration of
the device in the slot. Hence commit 2e35afaefe64 ("PCI: pciehp: Add
reset_slot() method") masked the Presence Detect Changed Enable bit in the
Slot Control register during a Secondary Bus Reset. Commit 06a8d89af551
("PCI: pciehp: Disable link notification across slot reset") additionally
masked the Data Link Layer State Changed Enable bit.
However masking those bits only disables interrupt generation (PCIe r6.2
sec 6.7.3.1). The events are still visible in the Slot Status register
and picked up by the IRQ handler if it runs during a Secondary Bus Reset.
This can happen if the interrupt is shared or if an unmasked hotplug event
occurs, e.g. Attention Button Pressed or Power Fault Detected.
The likelihood of this happening used to be small, so it wasn't much of a
problem in practice. That has changed with the recent introduction of
bandwidth control in v6.13-rc1 with commit 665745f27487 ("PCI/bwctrl:
Re-add BW notification portdrv as PCIe BW controller"):
Bandwidth control shares the interrupt with PCIe hotplug. A Secondary Bus
Reset causes a Link Bandwidth Notification, so the hotplug IRQ handler
runs, picks up the masked events and tears down the device in the slot.
As a result, Joel reports VFIO passthrough failure of a GPU, which Ilpo
root-caused to the incorrect handling of masked hotplug events.
Clearly, a more reliable way is needed to ignore spurious hotplug events.
For Downstream Port Containment, a new ignore mechanism was introduced by
commit a97396c6eb13 ("PCI: pciehp: Ignore Link Down/Up caused by DPC").
It has been working reliably for the past four years.
Adapt it for Secondary Bus Resets.
Introduce two helpers to annotate code sections which cause spurious link
changes: pci_hp_ignore_link_change() and pci_hp_unignore_link_change()
Use those helpers in lieu of masking interrupts in the Slot Control
register.
Introduce a helper to check whether such a code section is executing
concurrently and if so, await it: pci_hp_spurious_link_change()
Invoke the helper in the hotplug IRQ thread pciehp_ist(). Re-use the
IRQ thread's existing code which ignores DPC-induced link changes unless
the link is unexpectedly down after reset recovery or the device was
replaced during the bus reset.
That code block in pciehp_ist() was previously only executed if a Data
Link Layer State Changed event has occurred. Additionally execute it for
Presence Detect Changed events. That's necessary for compatibility with
PCIe r1.0 hotplug ports because Data Link Layer State Changed didn't exist
before PCIe r1.1. DPC was added with PCIe r3.1 and thus DPC-capable
hotplug ports always support Data Link Layer State Changed events.
But the same cannot be assumed for Secondary Bus Reset, which already
existed in PCIe r1.0.
Secondary Bus Reset is only one of many causes of spurious link changes.
Others include runtime suspend to D3cold, firmware updates or FPGA
reconfiguration. The new pci_hp_{,un}ignore_link_change() helpers may be
used by all kinds of drivers to annotate such code sections, hence their
declarations are publicly visible in <linux/pci.h>. A case in point is
the Mellanox Ethernet driver which disables a firmware reset feature if
the Ethernet card is attached to a hotplug port, see commit 3d7a3f2612d7
("net/mlx5: Nack sync reset request when HotPlug is enabled"). Going
forward, PCIe hotplug will be able to cope gracefully with all such use
cases once the code sections are properly annotated.
The new helpers internally use two bits in struct pci_dev's priv_flags as
well as a wait_queue. This mirrors what was done for DPC by commit
a97396c6eb13 ("PCI: pciehp: Ignore Link Down/Up caused by DPC"). That may
be insufficient if spurious link changes are caused by multiple sources
simultaneously. An example might be a Secondary Bus Reset issued by AER
during FPGA reconfiguration. If this turns out to happen in real life,
support for it can easily be added by replacing the PCI_LINK_CHANGING flag
with an atomic_t counter incremented by pci_hp_ignore_link_change() and
decremented by pci_hp_unignore_link_change(). Instead of awaiting a zero
PCI_LINK_CHANGING flag, the pci_hp_spurious_link_change() helper would
then simply await a zero counter.
Fixes: 665745f27487 ("PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller")
Reported-by: Joel Mathew Thomas <proxy0@tutamail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219765
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Joel Mathew Thomas <proxy0@tutamail.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/d04deaf49d634a2edf42bf3c06ed81b4ca54d17b.1744298239.git.lukas@wunner.de
|
|
Commit a97396c6eb13 ("PCI: pciehp: Ignore Link Down/Up caused by DPC")
amended PCIe hotplug to not bring down the slot upon Data Link Layer State
Changed events caused by Downstream Port Containment.
However Keith reports off-list that if the slot uses in-band presence
detect (i.e. Presence Detect State is derived from Data Link Layer Link
Active), DPC also causes a spurious Presence Detect Changed event.
This needs to be ignored as well.
Unfortunately there's no register indicating that in-band presence detect
is used. PCIe r5.0 sec 7.5.3.10 introduced the In-Band PD Disable bit in
the Slot Control Register. The PCIe hotplug driver sets this bit on
ports supporting it. But older ports may still use in-band presence
detect.
If in-band presence detect can be disabled, Presence Detect Changed events
occurring during DPC must not be ignored because they signal device
replacement. On all other ports, device replacement cannot be detected
reliably because the Presence Detect Changed event could be a side effect
of DPC. On those (older) ports, perform a best-effort device replacement
check by comparing the Vendor ID, Device ID and other data in Config Space
with the values cached in struct pci_dev. Use the existing helper
pciehp_device_replaced() to accomplish this. It is currently #ifdef'ed to
CONFIG_PM_SLEEP in pciehp_core.c, so move it to pciehp_hpc.c where most
other functions accessing config space reside.
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/fa264ff71952915c4e35a53c89eb0cde8455a5c5.1744298239.git.lukas@wunner.de
|
|
|
|
The "real" linux/types.h UAPI header gracefully degrades to a NOOP when
included from assembly code.
Mirror this behaviour in the tools/ variant.
Test for __ASSEMBLER__ over __ASSEMBLY__ as the former is provided by the
toolchain automatically.
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/lkml/af553c62-ca2f-4956-932c-dd6e3a126f58@sirena.org.uk/
Fixes: c9fbaa879508 ("selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://patch.msgid.link/20250321-uapi-consistency-v1-1-439070118dc0@linutronix.de
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Support up to 8192 processors
Add cpuidle governor debug telemetry, disabled by default
Update default output to exclude cpuidle invocation counts
Bug fixes
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Create "pct_idle" counter group, the sofware notion of residency
so it can now be singled out, independent of other counter groups.
Create "cpuidle" group, the cpuidle invocation counts.
Disable "cpuidle", by default.
Create "swidle" = "cpuidle" + "pct_idle".
Undocument "sysfs", the old name for "swidle", but keep it working
for backwards compatibilty.
Create "hwidle", all the HW idle counters
Modify "idle", enabled by default
"idle" = "hwidle" + "pct_idle" (and now excludes "cpuidle")
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
... and don't error out so hard on missing module descriptions.
Before commit 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()")
we used to warn about missing module descriptions, but only when
building with extra warnigns (ie 'W=1').
After that commit the warning became an unconditional hard error.
And it turns out not all modules have been converted despite the claims
to the contrary. As reported by Damian Tometzki, the slub KUnit test
didn't have a module description, and apparently nobody ever really
noticed.
The reason nobody noticed seems to be that the slub KUnit tests get
disabled by SLUB_TINY, which also ends up disabling a lot of other code,
both in tests and in slub itself. And so anybody doing full build tests
didn't actually see this failre.
So let's disable SLUB_TINY for build-only tests, since it clearly ends
up limiting build coverage. Also turn the missing module descriptions
error back into a warning, but let's keep it around for non-'W=1'
builds.
Reported-by: Damian Tometzki <damian@riscv-rocks.de>
Link: https://lore.kernel.org/all/01070196099fd059-e8463438-7b1b-4ec8-816d-173874be9966-000000@eu-central-1.amazonses.com/
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
Fixes: 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Probe cpuidle "sysfs" residency and counts separately,
since soon we will make one disabled on, and the
other disabled off.
Clarify that some BIC (build-in-counters) are actually "groups".
since we're about to re-name some of those groups.
no functional change.
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Do fflush() to discard the buffered data, before each read of the
graphics sysfs knobs.
Fixes: ba99a4fc8c24 ("tools/power turbostat: Remove unnecessary fflush() call")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Document that on Intel Granite Rapids Systems,
Uncore domains 0-2 are CPU domains, and
uncore domains 3-4 are IO domains.
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
The CoreThr column displays total thermal throttling events
since boot time.
Change it to report events during the measurement interval.
This is more useful for showing a user the current conditions.
Total events since boot time are still available to the user via
/sys/devices/system/cpu/cpu*/thermal_throttle/*
Document CoreThr on turbostat.8
Fixes: eae97e053fe30 ("turbostat: Support thermal throttle count print")
Reported-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Chen Yu <yu.c.chen@intel.com>
|
|
On systems with >= 1024 cpus (in my case 1152), turbostat fails with the error output:
"turbostat: /sys/fs/cgroup/cpuset.cpus.effective: cpu str malformat 0-1151"
A similar error appears with the use of turbostat --cpu when the inputted cpu
range contains a cpu number >= 1024:
# turbostat -c 1100-1151
"--cpu 1100-1151" malformed
...
Both errors are caused by parse_cpu_str() reaching its limit of CPU_SUBSET_MAXCPUS.
It's a good idea to limit the maximum cpu number being parsed, but 1024 is too low.
For a small increase in compute and allocated memory, increasing CPU_SUBSET_MAXCPUS
brings support for parsing cpu numbers >= 1024.
Increase CPU_SUBSET_MAXCPUS to 8192, a common setting for CONFIG_NR_CPUS on x86_64.
Signed-off-by: Justin Ernst <justin.ernst@hpe.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
The rpm-pkg make target currently suffers from a few issues related to
debuginfo:
1. debuginfo for things built into the kernel (vmlinux) is not available
in any RPM produced by make rpm-pkg. This makes using tools like
systemtap against a make rpm-pkg kernel impossible.
2. debug source for the kernel is not available. This means that
commands like 'disas /s' in gdb, which display source intermixed with
assembly, can only print file names/line numbers which then must be
painstakingly resolved to actual source in a separate editor.
3. debuginfo for modules is available, but it remains bundled with the
.ko files that contain module code, in the main kernel RPM. This is a
waste of space for users who do not need to debug the kernel (i.e.
most users).
Address all of these issues by additionally building a debuginfo RPM
when the kernel configuration allows for it, in line with standard
patterns followed by RPM distributors. With these changes:
1. systemtap now works (when these changes are backported to 6.11, since
systemtap lags a bit behind in compatibility), as verified by the
following simple test script:
# stap -e 'probe kernel.function("do_sys_open").call { printf("%s\n", $$parms); }'
dfd=0xffffffffffffff9c filename=0x7fe18800b160 flags=0x88800 mode=0x0
...
2. disas /s works correctly in gdb, with source and disassembly
interspersed:
# gdb vmlinux --batch -ex 'disas /s blk_op_str'
Dump of assembler code for function blk_op_str:
block/blk-core.c:
125 {
0xffffffff814c8740 <+0>: endbr64
127
128 if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op])
0xffffffff814c8744 <+4>: mov $0xffffffff824a7378,%rax
0xffffffff814c874b <+11>: cmp $0x23,%edi
0xffffffff814c874e <+14>: ja 0xffffffff814c8768 <blk_op_str+40>
0xffffffff814c8750 <+16>: mov %edi,%edi
126 const char *op_str = "UNKNOWN";
0xffffffff814c8752 <+18>: mov $0xffffffff824a7378,%rdx
127
128 if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op])
0xffffffff814c8759 <+25>: mov -0x7dfa0160(,%rdi,8),%rax
126 const char *op_str = "UNKNOWN";
0xffffffff814c8761 <+33>: test %rax,%rax
0xffffffff814c8764 <+36>: cmove %rdx,%rax
129 op_str = blk_op_name[op];
130
131 return op_str;
132 }
0xffffffff814c8768 <+40>: jmp 0xffffffff81d01360 <__x86_return_thunk>
End of assembler dump.
3. The size of the main kernel package goes down substantially,
especially if many modules are built (quite typical). Here is a
comparison of installed size of the kernel package (configured with
allmodconfig, dwarf4 debuginfo, and module compression turned off)
before and after this patch:
# rpm -qi kernel-6.13* | grep -E '^(Version|Size)'
Version : 6.13.0postpatch+
Size : 1382874089
Version : 6.13.0prepatch+
Size : 17870795887
This is a ~92% size reduction.
Note that a debuginfo package can only be produced if the following
configs are set:
- CONFIG_DEBUG_INFO=y
- CONFIG_MODULE_COMPRESS=n
- CONFIG_DEBUG_INFO_SPLIT=n
The first of these is obvious - we can't produce debuginfo if the build
does not generate it. The second two requirements can in principle be
removed, but doing so is difficult with the current approach, which uses
a generic rpmbuild script find-debuginfo.sh that processes all packaged
executables. If we want to remove those requirements the best path
forward is likely to add some debuginfo extraction/installation logic to
the modules_install target (controllable by flags). That way, it's
easier to operate on modules before they're compressed, and the logic
can be reused by all packaging targets.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
The scripts/kconfig/merge_config.sh script requires an existing
$INITFILE (or the $1 argument) as a base file for merging Kconfig
fragments. However, an empty $INITFILE can serve as an initial starting
point, later referenced by the KCONFIG_ALLCONFIG Makefile variable
if -m is not used. This variable can point to any configuration file
containing preset config symbols (the merged output) as stated in
Documentation/kbuild/kconfig.rst. When -m is used $INITFILE will
contain just the merge output requiring the user to run make (i.e.
KCONFIG_ALLCONFIG=<$INITFILE> make <allnoconfig/alldefconfig> or make
olddefconfig).
Instead of failing when `$INITFILE` is missing, create an empty file and
use it as the starting point for merges.
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
Commit 654102df2ac2 ("kbuild: add generic support for built-in boot
DTBs") introduced generic support for built-in DTBs.
Select GENERIC_BUILTIN_DTB when built-in DTB support is enabled.
To keep consistency across architectures, this commit also renames
CONFIG_NIOS2_DTB_SOURCE_BOOL to CONFIG_BUILTIN_DTB, and
CONFIG_NIOS2_DTB_SOURCE to CONFIG_BUILTIN_DTB_NAME.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
This option was removed from Kconfig in 8c710f75256b ("net/sched:
Retire tcindex classifier") but from the defconfigs.
Fixes: 8c710f75256b ("net/sched: Retire tcindex classifier")
Signed-off-by: Johan Korsnes <johan.korsnes@gmail.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
|
|
J2-based devices expect to find a device tree blob at the end of the
.bss section. As of a77725a9a3c5 ("scripts/dtc: Update to upstream
version v1.6.1-19-g0a3a9d3449c8"), libfdt enforces 8-byte alignment
for the DTB, causing J2 devices to fail early in sh_fdt_init().
As the J2 loader firmware calculates the DTB location based on the kernel
image .bss section size rather than the __bss_stop symbol offset, the
required alignment can't be enforced with BSS_SECTION(0, PAGE_SIZE, 8).
To fix this, inline a modified version of the above macro which grows
.bss by the required size. While this change affects all existing SH
boards, it should be benign on platforms which don't need this alignment.
Signed-off-by: Artur Rojek <contact@artur-rojek.eu>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Tested-by: Rob Landley <rob@landley.net>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
|
|
The function hrtimer_init() doesn't exist anymore. It was replaced by
hrtimer_setup().
Thus, rename the hrtimer_init trace event to hrtimer_setup to keep it
consistent.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/cba84c3d853c5258aa3a262363a6eac08e2c7afc.1738746927.git.namcao@linutronix.de
|
|
All the hrtimer_init*() functions have been renamed to hrtimer_setup*().
Rename debug_init_on_stack() to debug_setup_on_stack() as well, to keep the
names consistent.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/073cf6162779a2f5b12624677d4c49ee7eccc1ed.1738746927.git.namcao@linutronix.de
|
|
All the hrtimer_init*() functions have been renamed to hrtimer_setup*().
Rename debug_init() to debug_setup() as well, to keep the names consistent.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/4b730c1f79648b16a1c5413f928fdc2e138dfc43.1738746927.git.namcao@linutronix.de
|
|
All the hrtimer_init*() functions have been renamed to hrtimer_setup*().
Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper() as well, to
keep the names consistent.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/807694aedad9353421c4a7347629a30c5c31026f.1738746927.git.namcao@linutronix.de
|
|
The struct hrtimer::function field can only be changed using
hrtimer_setup*() or hrtimer_update_function(), and both already null-check
'function'. Therefore, null-checking 'function' in hrtimer_start_range_ns()
is not necessary.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/4661c571ee87980c340ccc318fc1a473c0c8f6bc.1738746927.git.namcao@linutronix.de
|
|
Make the struct hrtimer::function field private, to prevent users from
changing this field in an unsafe way. hrtimer_update_function() should be
used if the callback function needs to be changed.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/7d0e6e0c5c59a64a9bea940051aac05d750bc0c2.1738746927.git.namcao@linutronix.de
|
|
__hrtimer_init() is only called by __hrtimer_setup(). Simplify by merging
__hrtimer_init() into __hrtimer_setup().
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/8a0a847a35f711f66b2d05b57255aa44e7e61279.1738746927.git.namcao@linutronix.de
|
|
__hrtimer_init_sleeper() calls __hrtimer_init() and also sets up the
callback function. But there is already __hrtimer_setup() which does both
actions.
Switch to use __hrtimer_setup() to simplify the code.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/d9a45a51b6a8aa0045310d63f73753bf6b33f385.1738746927.git.namcao@linutronix.de
|
|
hrtimer_init() is now unused. Delete it.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/003722f60c7a2a4f8d4ed24fb741aa313b7e5136.1738746927.git.namcao@linutronix.de
|
|
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Coccinelle scripted cleanup.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This reverts commit 757b000f7b936edf79311ab0971fe465bbda75ea.
Miroslav reported that the changes for handling the inconsistencies in the
coarse time getters result in a regression on the adjtimex() side.
There are two issues:
1) The forwarding of the base time moves the update out of the original
period and establishes a new one.
2) The clearing of the accumulated NTP error is changing the behaviour as
well.
Userspace expects that multiplier/frequency updates are in effect, when the
syscall returns, so delaying the update to the next tick is not solving the
problem either.
Revert the change, so that the established expectations of user space
implementations (ntpd, chronyd) are restored. The re-introduced
inconsistency of the coarse time getters will be addressed in a subsequent
fix.
Fixes: 757b000f7b93 ("timekeeping: Fix possible inconsistencies in _COARSE clockids")
Reported-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/Z-qsg6iDGlcIJulJ@localhost
|
|
Frank reported, that the common irq_force_complete_move() breaks the out of
tree build of ia64. The reason is that ia64 uses the migration code, but
does not have hierarchical interrupt domains enabled.
This went unnoticed in mainline as both x86 and RISC-V have hierarchical
domains enabled. Not that it matters for mainline, but it's still
inconsistent.
Use irqd_get_parent_data() instead of accessing the parent_data field
directly. The helper returns NULL when hierarchical domains are disabled
otherwise it accesses the parent_data field of the domain.
No functional change.
Fixes: 751dc837dabd ("genirq: Introduce common irq_force_complete_move() implementation")
Reported-by: Frank Scheiner <frank.scheiner@web.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Frank Scheiner <frank.scheiner@web.de>
Link: https://lore.kernel.org/all/87h634ugig.ffs@tglx
|
|
Our CI expects output from the test at least once every 10 minutes.
The AMT test when running on debug kernel is just on the edge
of that time for the stress test. Improve the output:
- print the name of the test first, before starting it,
- output a dot every 10% of the way.
Output after:
TEST: amt discovery [ OK ]
TEST: IPv4 amt multicast forwarding [ OK ]
TEST: IPv6 amt multicast forwarding [ OK ]
TEST: IPv4 amt traffic forwarding torture .......... [ OK ]
TEST: IPv6 amt traffic forwarding torture .......... [ OK ]
Reviewed-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250403145636.2891166-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
It is confusing to see 'host' and 'domain' to be used as 'domain'. Given
this header is all about domains, switch the remaining 'host' uses to
'domain'.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-5-jirislaby@kernel.org
|
|
Naming interrupt domains host is confusing at best and the irqdomain code
uses both domain and host inconsistently.
Therefore rename irq_get_default_host() to irq_get_default_domain().
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-4-jirislaby@kernel.org
|
|
Naming interrupt domains host is confusing at best and the irqdomain code
uses both domain and host inconsistently.
Therefore rename irq_set_default_host() to irq_set_default_domain().
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-3-jirislaby@kernel.org
|
|
YAML specs don't normally include the C prefix name in the name
of the YAML attr. Remove the ifa- prefix from all attributes
in route-attrs and metrics and specify name-prefix instead.
This is a bit risky, hopefully there aren't many users out there.
Fixes: 023289b4f582 ("doc/netlink: Add spec for rt route messages")
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250403013706.2828322-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
YAML specs don't normally include the C prefix name in the name
of the YAML attr. Remove the ifa- prefix from all attributes
in addr-attrs and specify name-prefix instead.
This is a bit risky, hopefully there aren't many users out there.
Fixes: dfb0f7d9d979 ("doc/netlink: Add spec for rt addr messages")
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250403013706.2828322-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Command names should match C defines, codegens may depend on it.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Fixes: 4f280376e531 ("selftests/net: Add selftest for IPv4 RTM_GETMULTICAST support")
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250403013706.2828322-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The spec is mis-formatted, schema validation says:
Failed validating 'type' in schema['properties']['operations']['properties']['list']['items']['properties']['dump']['properties']['request']['properties']['value']:
{'minimum': 0, 'type': 'integer'}
On instance['operations']['list'][3]['dump']['request']['value']:
'58 - ifa-family'
The ifa-family clearly wants to be part of an attribute list.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Yuyang Huang <yuyanghuang@google.com>
Fixes: 4f280376e531 ("selftests/net: Add selftest for IPv4 RTM_GETMULTICAST support")
Link: https://patch.msgid.link/20250403013706.2828322-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit under Fixes solved the problem of spurious warnings when we
uninstall an MP from a device while its down. The __net_mp_close_rxq()
which is used by io_uring was not fixed. Move the fix over and reuse
__net_mp_close_rxq() in the devmem path.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250403013405.2827250-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
devmem code performs a number of safety checks to avoid having
to reimplement all of them in the drivers. Move those to
__net_mp_open_rxq() and reuse that function for binding to make
sure that io_uring ZC also benefits from them.
While at it rename the queue ID variable to rxq_idx in
__net_mp_open_rxq(), we touch most of the relevant lines.
The XArray insertion is reordered after the netdev_rx_queue_restart()
call, otherwise we'd need to duplicate the queue index check
or risk inserting an invalid pointer. The XArray allocation
failures should be extremely rare.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 6e18ed929d3b ("net: add helpers for setting a memory provider on an rx queue")
Link: https://patch.msgid.link/20250403013405.2827250-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
v2:
- Created a single error handling unlock and exit in veth_pool_store
- Greatly expanded commit message with previous explanatory-only text
Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
ibmveth_close and ibmveth_open, preventing multiple calls in a row to
napi_disable.
Background: Two (or more) threads could call veth_pool_store through
writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
with a little shell script. This causes a hang.
I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
kernel. I ran this test again and saw:
Setting pool0/active to 0
Setting pool1/active to 1
[ 73.911067][ T4365] ibmveth 30000002 eth0: close starting
Setting pool1/active to 1
Setting pool1/active to 0
[ 73.911367][ T4366] ibmveth 30000002 eth0: close starting
[ 73.916056][ T4365] ibmveth 30000002 eth0: close complete
[ 73.916064][ T4365] ibmveth 30000002 eth0: open starting
[ 110.808564][ T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
[ 230.808495][ T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
[ 243.683786][ T123] INFO: task stress.sh:4365 blocked for more than 122 seconds.
[ 243.683827][ T123] Not tainted 6.14.0-01103-g2df0c02dab82-dirty #8
[ 243.683833][ T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.683838][ T123] task:stress.sh state:D stack:28096 pid:4365 tgid:4365 ppid:4364 task_flags:0x400040 flags:0x00042000
[ 243.683852][ T123] Call Trace:
[ 243.683857][ T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable)
[ 243.683868][ T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0
[ 243.683878][ T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0
[ 243.683888][ T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210
[ 243.683896][ T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50
[ 243.683904][ T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0
[ 243.683913][ T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60
[ 243.683921][ T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc
[ 243.683928][ T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270
[ 243.683936][ T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0
[ 243.683944][ T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0
[ 243.683951][ T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650
[ 243.683958][ T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150
[ 243.683966][ T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340
[ 243.683973][ T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
...
[ 243.684087][ T123] Showing all locks held in the system:
[ 243.684095][ T123] 1 lock held by khungtaskd/123:
[ 243.684099][ T123] #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248
[ 243.684114][ T123] 4 locks held by stress.sh/4365:
[ 243.684119][ T123] #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
[ 243.684132][ T123] #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
[ 243.684143][ T123] #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
[ 243.684155][ T123] #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60
[ 243.684166][ T123] 5 locks held by stress.sh/4366:
[ 243.684170][ T123] #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
[ 243.684183][ T123] #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
[ 243.684194][ T123] #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
[ 243.684205][ T123] #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60
[ 243.684216][ T123] #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0
From the ibmveth debug, two threads are calling veth_pool_store, which
calls ibmveth_close and ibmveth_open. Here's the sequence:
T4365 T4366
----------------- ----------------- ---------
veth_pool_store veth_pool_store
ibmveth_close
ibmveth_close
napi_disable
napi_disable
ibmveth_open
napi_enable <- HANG
ibmveth_close calls napi_disable at the top and ibmveth_open calls
napi_enable at the top.
https://docs.kernel.org/networking/napi.html]] says
The control APIs are not idempotent. Control API calls are safe
against concurrent use of datapath APIs but an incorrect sequence of
control API calls may result in crashes, deadlocks, or race
conditions. For example, calling napi_disable() multiple times in a
row will deadlock.
In the normal open and close paths, rtnl_mutex is acquired to prevent
other callers. This is missing from veth_pool_store. Use rtnl_mutex in
veth_pool_store fixes these hangs.
Signed-off-by: Dave Marquardt <davemarq@linux.ibm.com>
Fixes: 860f242eb534 ("[PATCH] ibmveth change buffer pools dynamically")
Reviewed-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250402154403.386744-1-davemarq@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
devm_kasprintf() returns NULL when memory allocation fails. Currently,
com20020pci_probe() does not check for this case, which results in a
NULL pointer dereference.
Add NULL check after devm_kasprintf() to prevent this issue and ensure
no resources are left allocated.
Fixes: 6b17a597fc2f ("arcnet: restoring support for multiple Sohard Arcnet cards")
Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com>
Link: https://patch.msgid.link/20250402135036.44697-1-bsdhenrymartin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Nexthops whose link is down are not supposed to be considered during
path selection when the "ignore_routes_with_linkdown" sysctl is set.
This is done by assigning them a negative region boundary.
However, when comparing the computed hash (unsigned) with the region
boundary (signed), the negative region boundary is treated as unsigned,
resulting in incorrect nexthop selection.
Fix by treating the computed hash as signed. Note that the computed hash
is always in range of [0, 2^31 - 1].
Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250402114224.293392-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cited commit transitioned IPv6 path selection to use hash-threshold
instead of modulo-N. With hash-threshold, each nexthop is assigned a
region boundary in the multipath hash function's output space and a
nexthop is chosen if the calculated hash is smaller than the nexthop's
region boundary.
Hash-threshold does not work correctly if path selection does not start
with the first nexthop. For example, if fib6_select_path() is always
passed the last nexthop in the group, then it will always be chosen
because its region boundary covers the entire hash function's output
space.
Fix this by starting the selection process from the first nexthop and do
not consider nexthops for which rt6_score_route() provided a negative
score.
Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N")
Reported-by: Stanislav Fomichev <stfomichev@gmail.com>
Closes: https://lore.kernel.org/netdev/Z9RIyKZDNoka53EO@mini-arch/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250402114224.293392-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Missing usbnet_going_away Check in Critical Path.
The usb_submit_urb function lacks a usbnet_going_away
validation, whereas __usbnet_queue_skb includes this check.
This inconsistency creates a race condition where:
A URB request may succeed, but the corresponding SKB data
fails to be queued.
Subsequent processes:
(e.g., rx_complete → defer_bh → __skb_unlink(skb, list))
attempt to access skb->next, triggering a NULL pointer
dereference (Kernel Panic).
Fixes: 04e906839a05 ("usbnet: fix cyclical race on disconnect with work queue")
Cc: stable@vger.kernel.org
Signed-off-by: Ying Lu <luying1@xiaomi.com>
Link: https://patch.msgid.link/4c9ef2efaa07eb7f9a5042b74348a67e5a3a7aea.1743584159.git.luying1@xiaomi.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In the current implementation octeontx2 manages XDP_ABORTED and XDP
invalid as XDP_PASS forwarding the skb to the networking stack.
Align the behaviour to other XDP drivers handling XDP_ABORTED and XDP
invalid as XDP_DROP.
Please note this patch has just compile tested.
Fixes: 06059a1a9a4a5 ("octeontx2-pf: Add XDP support to netdev PF")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250401-octeontx2-xdp-abort-fix-v1-1-f0587c35a0b9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
struct geneve_opt uses 5 bit length for each single option, which
means every vary size option should be smaller than 128 bytes.
However, all current related Netlink policies cannot promise this
length condition and the attacker can exploit a exact 128-byte size
option to *fake* a zero length option and confuse the parsing logic,
further achieve heap out-of-bounds read.
One example crash log is like below:
[ 3.905425] ==================================================================
[ 3.905925] BUG: KASAN: slab-out-of-bounds in nla_put+0xa9/0xe0
[ 3.906255] Read of size 124 at addr ffff888005f291cc by task poc/177
[ 3.906646]
[ 3.906775] CPU: 0 PID: 177 Comm: poc-oob-read Not tainted 6.1.132 #1
[ 3.907131] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 3.907784] Call Trace:
[ 3.907925] <TASK>
[ 3.908048] dump_stack_lvl+0x44/0x5c
[ 3.908258] print_report+0x184/0x4be
[ 3.909151] kasan_report+0xc5/0x100
[ 3.909539] kasan_check_range+0xf3/0x1a0
[ 3.909794] memcpy+0x1f/0x60
[ 3.909968] nla_put+0xa9/0xe0
[ 3.910147] tunnel_key_dump+0x945/0xba0
[ 3.911536] tcf_action_dump_1+0x1c1/0x340
[ 3.912436] tcf_action_dump+0x101/0x180
[ 3.912689] tcf_exts_dump+0x164/0x1e0
[ 3.912905] fw_dump+0x18b/0x2d0
[ 3.913483] tcf_fill_node+0x2ee/0x460
[ 3.914778] tfilter_notify+0xf4/0x180
[ 3.915208] tc_new_tfilter+0xd51/0x10d0
[ 3.918615] rtnetlink_rcv_msg+0x4a2/0x560
[ 3.919118] netlink_rcv_skb+0xcd/0x200
[ 3.919787] netlink_unicast+0x395/0x530
[ 3.921032] netlink_sendmsg+0x3d0/0x6d0
[ 3.921987] __sock_sendmsg+0x99/0xa0
[ 3.922220] __sys_sendto+0x1b7/0x240
[ 3.922682] __x64_sys_sendto+0x72/0x90
[ 3.922906] do_syscall_64+0x5e/0x90
[ 3.923814] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 3.924122] RIP: 0033:0x7e83eab84407
[ 3.924331] Code: 48 89 fa 4c 89 df e8 38 aa 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 faf
[ 3.925330] RSP: 002b:00007ffff505e370 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
[ 3.925752] RAX: ffffffffffffffda RBX: 00007e83eaafa740 RCX: 00007e83eab84407
[ 3.926173] RDX: 00000000000001a8 RSI: 00007ffff505e3c0 RDI: 0000000000000003
[ 3.926587] RBP: 00007ffff505f460 R08: 00007e83eace1000 R09: 000000000000000c
[ 3.926977] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffff505f3c0
[ 3.927367] R13: 00007ffff505f5c8 R14: 00007e83ead1b000 R15: 00005d4fbbe6dcb8
Fix these issues by enforing correct length condition in related
policies.
Fixes: 925d844696d9 ("netfilter: nft_tunnel: add support for geneve opts")
Fixes: 4ece47787077 ("lwtunnel: add options setting and dumping for geneve")
Fixes: 0ed5269f9e41 ("net/sched: add tunnel option support to act_tunnel_key")
Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250402165632.6958-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Don't use a scoped guard that only protects the next statement.
Use a regular guard to make sure that the namespace semaphore is held
across the whole function.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reported-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/all/20250401170715.GA112019@unreal/
Fixes: db04662e2f4f ("fs: allow detached mounts in clone_private_mount()")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix io_uring zero copy rx selftest with updated netdev Python helpers.
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250402172414.895276-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Netdevsim has extra register_netdevice_notifier_dev_net notifiers,
use netdevim instead of dummy device to test them out.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-9-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|