Age | Commit message (Collapse) | Author | Files | Lines |
|
The SR-IOV specification requires both PFs and VFs to implement a PCIe
capability. Generally this is sufficient to assume extended config space
is present, but we generally also perform additional tests to make sure the
extended config space is reachable and not simply an alias of standard
config space. For a VF to exist extended config space must be accessible
on the PF, therefore we can also assume it to be accessible on the VF.
This enables a micro performance optimization previously implemented in
commit 975bb8b4dc93 ("PCI/IOV: Use VF0 cached config space size for other
VFs") to speed up probing of VFs.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: Hao Zheng <yinhe@linux.alibaba.com>
|
|
Revert 975bb8b4dc93 ("PCI/IOV: Use VF0 cached config space size for other
VFs"), which attempted to cache the config space size from the first VF to
re-use for subsequent VFs.
The cached value was determined prior to discovering the PCIe capability on
the VF, which resulted in the first VF reporting the correct config space
size (4K), as it has a special case through pci_cfg_space_size(), while all
the other VFs only reported 256 bytes. As this was only a performance
optimization, we're better off without it.
Fixes: 975bb8b4dc93 ("PCI/IOV: Use VF0 cached config space size for other VFs")
Link: https://lore.kernel.org/r/156046663197.29869.3633634445109057665.stgit@gimli.home
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: Hao Zheng <yinhe@linux.alibaba.com>
|
|
The driver name in /proc/bus/pci/devices can be printed without a printf
format specification, so use seq_puts() instead of seq_printf().
This issue was detected by using the Coccinelle software.
Link: https://lore.kernel.org/r/a6b110cb-0d0e-5dc3-9ca1-9041609cf74c@web.de
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
Drivers that use dma_virt_ops were meant to be rejected when testing
compatibility for P2PDMA.
This check got inadvertently dropped in one of the later versions of the
original patchset, so add it back.
Fixes: 52916982af48 ("PCI/P2PDMA: Support peer-to-peer memory")
Link: https://lore.kernel.org/r/20190702173544.21950-1-logang@deltatee.com
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
If "hotplug_bridges == 0", "!dev->is_hotplug_bridge" is always true, so the
loop that divides the remaining resources among hotplug-capable bridges
does nothing.
Check for "hotplug_bridges == 0" earlier, so we don't even have to compute
the amount of remaining resources. No functional change intended.
Link: https://lore.kernel.org/r/PS2P216MB0642C7A485649D2D787A1C6F80000@PS2P216MB0642.KORP216.PROD.OUTLOOK.COM
Link: https://lore.kernel.org/r/20190622210310.180905-3-helgaas@kernel.org
Signed-off-by: Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
|
|
Reorder pci_bus_distribute_available_resources() to group related code
together. No functional change intended.
Link: https://lore.kernel.org/r/PS2P216MB0642C7A485649D2D787A1C6F80000@PS2P216MB0642.KORP216.PROD.OUTLOOK.COM
Link: https://lore.kernel.org/r/20190622210310.180905-2-helgaas@kernel.org
Signed-off-by: Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
|
|
If we must preserve the firmware resource assignments, claim the existing
resources rather than reassigning everything.
Link: https://lore.kernel.org/r/20190615002359.29577-4-benh@kernel.crashing.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
|
|
Call pci_assign_unassigned_root_bus_resources() instead of the simpler:
pci_bus_size_bridges(bus);
pci_bus_assign_resources(bus);
pci_assign_unassigned_root_bus_resources() calls:
__pci_bus_size_bridges(bus, add_list);
__pci_bus_assign_resources(bus, add_list, &fail_head);
so this should be equivalent as long as we're able to assign everything.
If we were unable to assign something, previously we did nothing and left
it unassigned, but after this patch, we will attempt to do some
reallocation.
Once we start honoring FW resource allocations, this will bring up the
"reallocation" feature which can help making room for SR-IOV when
necessary.
Link: https://lore.kernel.org/r/20190615002359.29577-1-benh@kernel.crashing.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
|
|
Prevent auto-enabling of bridges reallocation when the FW tells us that the
initial configuration must be preserved for a given host bridge.
Link: https://lore.kernel.org/r/20190615002359.29577-3-benh@kernel.crashing.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
Evaluate _DSM Function #5, the "PCI Boot Configuration" function. If the
result is 0, the OS should preserve any resource assignments made by the
firmware.
Link: https://lore.kernel.org/r/20190615002359.29577-2-benh@kernel.crashing.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
With CONFIG_PROVE_LOCKING=y, using sysfs to remove a bridge with a device
below it causes a lockdep warning, e.g.,
# echo 1 > /sys/class/pci_bus/0000:00/device/0000:00:00.0/remove
============================================
WARNING: possible recursive locking detected
...
pci_bus 0000:01: busn_res: [bus 01] is released
The remove recursively removes the subtree below the bridge. Each call
uses a different lock so there's no deadlock, but the locks were all
created with the same lockdep key so the lockdep checker can't tell them
apart.
Mark the "remove" sysfs attribute with __ATTR_IGNORE_LOCKDEP() as it is
safe to ignore the lockdep check between different "remove" kernfs
instances.
There's discussion about a similar issue in USB at [1], which resulted in
356c05d58af0 ("sysfs: get rid of some lockdep false positives") and
e9b526fe7048 ("i2c: suppress lockdep warning on delete_device"), which do
basically the same thing for USB "remove" and i2c "delete_device" files.
[1] https://lore.kernel.org/r/Pine.LNX.4.44L0.1204251436140.1206-100000@iolanthe.rowland.org
Link: https://lore.kernel.org/r/20190526225151.3865-1-marek.vasut@gmail.com
Signed-off-by: Marek Vasut <marek.vasut+renesas@gmail.com>
[bhelgaas: trim commit log, details at above links]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Phil Edworthy <phil.edworthy@renesas.com>
Cc: Simon Horman <horms+renesas@verge.net.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wolfram Sang <wsa@the-dreams.de>
|
|
Stratix 10 PCIe controller does not support Type 1 to Type 0 conversion
as previous version (V1) does so the PCIe controller configuration
mechanism needs to send Type 0 config TLP if the target bus number
matches with the secondary bus number.
Implement a function to form a TLP header that depends on the PCIe
controller version, so that the header can be formed according to
specific host controller HW internals, fixing the type conversion issue.
Signed-off-by: Ley Foon Tan <ley.foon.tan@intel.com>
[lorenzo.pieralisi@arm.com: commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
|
|
Convert the PM documents to ReST, in order to allow them to
build with Sphinx.
The conversion is actually:
- add blank lines and indentation in order to identify paragraphs;
- fix tables markups;
- add some lists markups;
- mark literal blocks;
- adjust title markups.
At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
|
|
PCIe r5.0, sec 7.5.3.18, defines a new 32.0 GT/s bit in the Supported Link
Speeds Vector of Link Capabilities 2. Decode this new speed. This does
not affect the speed of the link, which should be negotiated automatically
by the hardware; it only adds decoding when showing the speed to the user.
Previously, reading the speed of a link operating at this speed showed
"Unknown speed" instead of "32.0 GT/s".
Link: https://lore.kernel.org/lkml/92365e3caf0fc559f9ab14bcd053bfc92d4f661c.1559664969.git.gustavo.pimentel@synopsys.com
Signed-off-by: Gustavo Pimentel <gustavo.pimentel@synopsys.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
Commit 0e7df22401a3 ("PCI: Add sysfs sriov_drivers_autoprobe to control
VF driver binding") introduced the sriov_drivers_autoprobe attribute
which allows users to prevent the kernel from automatically probing a
driver for new VFs as they are created. This allows VFs to be spawned
without automatically binding the new device to a host driver, such as
in cases where the user intends to use the device only with a meta
driver like vfio-pci. However, the current implementation prevents any
use of drivers_probe with the VF while sriov_drivers_autoprobe=0. This
blocks the now current general practice of setting driver_override
followed by using drivers_probe to bind a device to a specified driver.
The kernel never automatically sets a driver_override therefore it seems
we can assume a driver_override reflects the intent of the user. Also,
probing a device using a driver_override match seems outside the scope
of the 'auto' part of sriov_drivers_autoprobe. Therefore, let's allow
driver_override matches regardless of sriov_drivers_autoprobe, which we
can do by simply testing if a driver_override is set for a device as a
'can probe' condition.
Fixes: 0e7df22401a3 ("PCI: Add sysfs sriov_drivers_autoprobe to control VF driver binding")
Link: https://lore.kernel.org/lkml/155742996741.21878.569845487290798703.stgit@gimli.home
Link: https://lore.kernel.org/linux-pci/155672991496.20698.4279330795743262888.stgit@gimli.home/T/#u
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
The NVIDIA Turing GPU is a multi-function PCI device with the following
functions:
- Function 0: VGA display controller
- Function 1: Audio controller
- Function 2: USB xHCI Host controller
- Function 3: USB Type-C UCSI controller
Function 0 is tightly coupled with other functions in the hardware. When
function 0 is in D3, it gates power for hardware blocks used by other
functions, which means those functions only work when function 0 is in D0.
If any of these functions (1/2/3) are in D0, then function 0 should also be
in D0.
Commit 07f4f97d7b4b ("vga_switcheroo: Use device link for HDA controller")
already creates a device link to show the dependency of function 1 on
function 0 of this GPU. Create additional device links to express the
dependencies of functions 2 and 3 on function 0. This means function 0
will be in D0 if any other function is in D0.
[bhelgaas: I think the PCI spec expectation is that functions can be
power-managed independently, so I don't think this device is technically
compliant. For example, the PCIe r5.0 spec, sec 1.4, says "the PCI/PCIe
hardware/software model includes architectural constructs necessary to
discover, configure, and use a Function, without needing Function-specific
knowledge" and sec 5.1 says "D states are associated with a particular
Function" and "PM provides ... a mechanism to identify power management
capabilities of a given Function [and] the ability to transition a Function
into a certain power management state."]
Link: https://lore.kernel.org/lkml/20190606092225.17960-3-abhsahu@nvidia.com
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
Although not allowed by the PCI specs, some multi-function devices have
power dependencies between the functions. For example, function 1 may not
work unless function 0 is in the D0 power state.
The existing quirk_gpu_hda() adds a device link to express this dependency
for GPU and HDA devices, but it really is not specific to those device
types.
Generalize it and rename it to pci_create_device_link() so we can create
dependencies between any "consumer" and "producer" functions of a
multi-function device, where the consumer is only functional if the
producer is in D0. This reorganization should not affect any
functionality.
Link: https://lore.kernel.org/lkml/20190606092225.17960-2-abhsahu@nvidia.com
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
[bhelgaas: commit log, reword diagnostic]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree. No essential content change.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree. No essential content change.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree. No essential content change.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree. No essential content change.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree. No essential content change.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree. No essential content change.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree. No essential content change.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree. No essential content change.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree. No essential content change.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree. No essential content change.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree. No essential content change.
Move the description of struct pci_driver and struct pci_device_id into
in-source comments.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
[bhelgaas: fix kernel-doc warnings related to moving descriptions to
linux/pci.h, fix "space tab" whitespace errors in mod_devicetable.h]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Altera MSI IP is a soft IP and is only available after
an FPGA image (with design containing it) is programmed.
Make driver modulable to support use case FPGA image is programmed the
after kernel has booted, so that the driver can be loaded upon request.
Signed-off-by: Ley Foon Tan <ley.foon.tan@intel.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
|
|
Altera PCIe Rootport IP is a soft IP and is only available after
an FPGA image (whose design contains it) is programmed.
Make driver modulable to support use cases when FPGA image is
programmed after the kernel has booted, so that the driver
can be loaded upon request.
Signed-off-by: Ley Foon Tan <ley.foon.tan@intel.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
|
|
Commit 0e7df22401a3 ("PCI: Add sysfs sriov_drivers_autoprobe to control
VF driver binding") allows the user to specify that drivers for VFs of
a PF should not be probed, but it actually causes pci_device_probe() to
return success back to the driver core in this case. Therefore by all
sysfs appearances the device is bound to a driver, the driver link from
the device exists as does the device link back from the driver, yet the
driver's probe function is never called on the device. We also fail to
do any sort of cleanup when we're prohibited from probing the device,
the IRQ setup remains in place and we even hold a device reference.
Instead, abort with errno before any setup or references are taken when
pci_device_can_probe() prevents us from trying to probe the device.
Link: https://lore.kernel.org/lkml/155672991496.20698.4279330795743262888.stgit@gimli.home
Fixes: 0e7df22401a3 ("PCI: Add sysfs sriov_drivers_autoprobe to control VF driver binding")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
Add index.rst for PCI subsystem. More docs will be added later.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
|
|
'ifeq ... else ifneq ... endif' notation is supported by GNU Make 3.81
or later, which is the requirement for building the kernel since
commit 37d69ee30808 ("docs: bump minimal GNU Make version to 3.81").
Use it to improve the readability.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
|
|
Currently on panic, kernel will lower the loglevel and print out pending
printk msg only with console_flush_on_panic().
Add an option for users to configure the "panic_print" to replay all
dmesg in buffer, some of which they may have never seen due to the
loglevel setting, which will help panic debugging .
[feng.tang@intel.com: keep the original console_flush_on_panic() inside panic()]
Link: http://lkml.kernel.org/r/1556199137-14163-1-git-send-email-feng.tang@intel.com
[feng.tang@intel.com: use logbuf lock to protect the console log index]
Link: http://lkml.kernel.org/r/1556269868-22654-1-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1556095872-36838-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Aaro Koskinen <aaro.koskinen@nokia.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since commit 54c7a8916a88 ("initramfs: free initrd memory if opening
/initrd.image fails"), the kernel has unconditionally attempted to free
the initrd even if it doesn't exist.
In the non-existent case this causes a boot-time splat if
CONFIG_DEBUG_VIRTUAL is enabled due to a call to virt_to_phys() with a
NULL address.
Instead we should check that the initrd actually exists and only attempt
to free it if it does.
Link: http://lkml.kernel.org/r/20190516143125.48948-1-steven.price@arm.com
Fixes: 54c7a8916a88 ("initramfs: free initrd memory if opening /initrd.image fails")
Signed-off-by: Steven Price <steven.price@arm.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
synchronize_rcu() didn't wait for call_rcu() callbacks, so inode wb
switch may not go to the workqueue after synchronize_rcu(). Thus
previous scheduled switches was not finished even flushing the
workqueue, which will cause a NULL pointer dereferenced followed below.
VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds. Have a nice day...
BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
evict+0xb3/0x180
iput+0x1b0/0x230
inode_switch_wbs_work_fn+0x3c0/0x6a0
worker_thread+0x4e/0x490
? process_one_work+0x410/0x410
kthread+0xe6/0x100
ret_from_fork+0x39/0x50
Replace the synchronize_rcu() call with a rcu_barrier() to wait for all
pending callbacks to finish. And inc isw_nr_in_flight after call_rcu()
in inode_switch_wbs() to make more sense.
Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.com
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
syzbot reported the following error from a tree with a head commit of
baf76f0c58ae ("slip: make slhc_free() silently accept an error pointer")
BUG: unable to handle kernel paging request at ffffea0003348000
#PF error: [normal kernel read fault]
PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
Oops: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 <4d> 8b 2c 24 31 ff 49
c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
FS: 00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
Call Trace:
fast_isolate_around mm/compaction.c:1243 [inline]
fast_isolate_freepages mm/compaction.c:1418 [inline]
isolate_freepages mm/compaction.c:1438 [inline]
compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550
There is no reproducer and it is difficult to hit -- 1 crash every few
days. The issue is very similar to the fix in commit 6b0868c820ff
("mm/compaction.c: correct zone boundary handling when resetting pageblock
skip hints"). When isolating free pages around a target pageblock, the
boundary handling is off by one and can stray into the next pageblock.
Triggering the syzbot error requires that the end of pageblock is section
or zone aligned, and that the next section is unpopulated.
A more subtle consequence of the bug is that pageblocks were being
improperly used as migration targets which potentially hurts fragmentation
avoidance in the long-term one page at a time.
A debugging patch revealed that it's definitely possible to stray outside
of a pageblock which is not intended. While syzbot cannot be used to
verify this patch, it was confirmed that the debugging warning no longer
triggers with this patch applied. It has also been confirmed that the THP
allocation stress tests are not degraded by this patch.
Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
Fixes: e332f741a8dd ("mm, compaction: be selective about what pageblocks to clear skip hints")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> # v5.1+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This macro adds some debug code to check that vmap allocations are
happened in ascending order.
By default this option is set to 0 and not active. It requires
recompilation of the kernel to activate it. Set to 1, compile the
kernel.
[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This macro adds some debug code to check that the augment tree is
maintained correctly, meaning that every node contains valid
subtree_max_size value.
By default this option is set to 0 and not active. It requires
recompilation of the kernel to activate it. Set to 1, compile the
kernel.
[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "improve vmap allocation", v3.
Objective
---------
Please have a look for the description at:
https://lkml.org/lkml/2018/10/19/786
but let me also summarize it a bit here as well.
The current implementation has O(N) complexity. Requests with different
permissive parameters can lead to long allocation time. When i say
"long" i mean milliseconds.
Description
-----------
This approach organizes the KVA memory layout into free areas of the
1-ULONG_MAX range, i.e. an allocation is done over free areas lookups,
instead of finding a hole between two busy blocks. It allows to have
lower number of objects which represent the free space, therefore to have
less fragmented memory allocator. Because free blocks are always as large
as possible.
It uses the augment tree where all free areas are sorted in ascending
order of va->va_start address in pair with linked list that provides
O(1) access to prev/next elements.
Since the tree is augment, we also maintain the "subtree_max_size" of VA
that reflects a maximum available free block in its left or right
sub-tree. Knowing that, we can easily traversal toward the lowest (left
most path) free area.
Allocation: ~O(log(N)) complexity. It is sequential allocation method
therefore tends to maximize locality. The search is done until a first
suitable block is large enough to encompass the requested parameters.
Bigger areas are split.
I copy paste here the description of how the area is split, since i
described it in https://lkml.org/lkml/2018/10/19/786
<snip>
A free block can be split by three different ways. Their names are
FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they
correspond to how requested size and alignment fit to a free block.
FL_FIT_TYPE - in this case a free block is just removed from the free
list/tree because it fully fits. Comparing with current design there is
an extra work with rb-tree updating.
LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case what we do
is just cutting a free block. It is as fast as a current design. Most of
the vmalloc allocations just end up with this case, because the edge is
always aligned to 1.
NE_FIT_TYPE - Is much less common case. Basically it happens when
requested size and alignment does not fit left nor right edges, i.e. it
is between them. In this case during splitting we have to build a
remaining left free area and place it back to the free list/tree.
Comparing with current design there are two extra steps. First one is we
have to allocate a new vmap_area structure. Second one we have to insert
that remaining free block to the address sorted list/tree.
In order to optimize a first case there is a cache with free_vmap objects.
Instead of allocating from slab we just take an object from the cache and
reuse it.
Second one is pretty optimized. Since we know a start point in the tree
we do not do a search from the top. Instead a traversal begins from a
rb-tree node we split.
<snip>
De-allocation. ~O(log(N)) complexity. An area is not inserted straight
away to the tree/list, instead we identify the spot first, checking if it
can be merged around neighbors. The list provides O(1) access to
prev/next, so it is pretty fast to check it. Summarizing. If merged then
large coalesced areas are created, if not the area is just linked making
more fragments.
There is one more thing that i should mention here. After modification of
VA node, its subtree_max_size is updated if it was/is the biggest area in
its left or right sub-tree. Apart of that it can also be populated back
to upper levels to fix the tree. For more details please have a look at
the __augment_tree_propagate_from() function and the description.
Tests and stressing
-------------------
I use the "test_vmalloc.sh" test driver available under
"tools/testing/selftests/vm/" since 5.1-rc1 kernel. Just trigger "sudo
./test_vmalloc.sh" to find out how to deal with it.
Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
Regarding last one, i do not have any physical access to NUMA system,
therefore i emulated it. The time of stressing is days.
If you run the test driver in "stress mode", you also need the patch that
is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it:
http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c
After massive testing, i have not identified any problems like memory
leaks, crashes or kernel panics. I find it stable, but more testing would
be good.
Performance analysis
--------------------
I have used two systems to test. One is i5-3320M CPU @ 2.60GHz and
another is HiKey960(arm64) board. i5-3320M runs on 4.20 kernel, whereas
Hikey960 uses 4.15 kernel. I have both system which could run on 5.1-rc1
as well, but the results have not been ready by time i an writing this.
Currently it consist of 8 tests. There are three of them which correspond
to different types of splitting(to compare with default). We have 3
ones(see above). Another 5 do allocations in different conditions.
a) sudo ./test_vmalloc.sh performance
When the test driver is run in "performance" mode, it runs all available
tests pinned to first online CPU with sequential execution test order. We
do it in order to get stable and repeatable results. Take a look at time
difference in "long_busy_list_alloc_test". It is not surprising because
the worst case is O(N).
# i5-3320M
How many cycles all tests took:
CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles
# See detailed table with results here:
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt
# Hikey960 8x CPUs
How many cycles all tests took:
CPU0=3478683207 cycles vs CPU0=463767978 cycles
# See detailed table with results here:
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt
b) time sudo ./test_vmalloc.sh test_repeat_count=1
With this configuration, all tests are run on all available online CPUs.
Before running each CPU shuffles its tests execution order. It gives
random allocation behaviour. So it is rough comparison, but it puts in
the picture for sure.
# i5-3320M
<default> vs <patched>
real 101m22.813s real 0m56.805s
user 0m0.011s user 0m0.015s
sys 0m5.076s sys 0m0.023s
# See detailed table with results here:
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt
# Hikey960 8x CPUs
<default> vs <patched>
real unknown real 4m25.214s
user unknown user 0m0.011s
sys unknown sys 0m0.670s
I did not manage to complete this test on "default Hikey960" kernel
version. After 24 hours it was still running, therefore i had to cancel
it. That is why real/user/sys are "unknown".
This patch (of 3):
Currently an allocation of the new vmap area is done over busy list
iteration(complexity O(n)) until a suitable hole is found between two busy
areas. Therefore each new allocation causes the list being grown. Due to
over fragmented list and different permissive parameters an allocation can
take a long time. For example on embedded devices it is milliseconds.
This patch organizes the KVA memory layout into free areas of the
1-ULONG_MAX range. It uses an augment red-black tree that keeps blocks
sorted by their offsets in pair with linked list keeping the free space in
order of increasing addresses.
Nodes are augmented with the size of the maximum available free block in
its left or right sub-tree. Thus, that allows to take a decision and
traversal toward the block that will fit and will have the lowest start
address, i.e. it is sequential allocation.
Allocation: to allocate a new block a search is done over the tree until a
suitable lowest(left most) block is large enough to encompass: the
requested size, alignment and vstart point. If the block is bigger than
requested size - it is split.
De-allocation: when a busy vmap area is freed it can either be merged or
inserted to the tree. Red-black tree allows efficiently find a spot
whereas a linked list provides a constant-time access to previous and next
blocks to check if merging can be done. In case of merging of
de-allocated memory chunk a large coalesced area is created.
Complexity: ~O(log(N))
[urezki@gmail.com: v3]
Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In the recent build test of linux-next, Stephen saw a build error
caused by a broken .tmp_versions/*.mod file:
https://lkml.org/lkml/2019/5/13/991
drivers/net/phy/asix.ko and drivers/net/usb/asix.ko have the same
basename, and there is a race in generating .tmp_versions/asix.mod
Kbuild has not checked this before, and it suddenly shows up with
obscure error messages when this kind of race occurs.
Non-unique module names cause various sort of problems, but it is
not trivial to catch them by eyes.
Hence, this script.
It checks not only real modules, but also built-in modules (i.e.
controlled by tristate CONFIG option, but currently compiled with =y).
Non-unique names for built-in modules also cause problems because
/sys/modules/ would fall over.
For the latest kernel, I tested "make allmodconfig all" (or more
quickly "make allyesconfig modules"), and it detected the following:
warning: same basename if the following are built as modules:
drivers/regulator/88pm800.ko
drivers/mfd/88pm800.ko
warning: same basename if the following are built as modules:
drivers/gpu/drm/bridge/adv7511/adv7511.ko
drivers/media/i2c/adv7511.ko
warning: same basename if the following are built as modules:
drivers/net/phy/asix.ko
drivers/net/usb/asix.ko
warning: same basename if the following are built as modules:
fs/coda/coda.ko
drivers/media/platform/coda/coda.ko
warning: same basename if the following are built as modules:
drivers/net/phy/realtek.ko
drivers/net/dsa/realtek.ko
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
Currently menu blocks start with a pretty header but end with nothing in
the generated config. So next config options stick together with the
options from the menu block.
Let's terminate menu blocks in the generated config with a comment and
a newline if needed. Example:
...
CONFIG_BPF_STREAM_PARSER=y
CONFIG_NET_FLOW_LIMIT=y
#
# Network testing
#
CONFIG_NET_PKTGEN=y
CONFIG_NET_DROP_MONITOR=y
# end of Network testing
# end of Networking options
CONFIG_HAMRADIO=y
...
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
|
|
For *-pkg targets, the LICENSES directory should be included in the
source tarball.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
|
|
The 'addtree' and 'flags' in scripts/Kbuild.include are so compilecated
and ugly.
As I mentioned in [1], Kbuild should stop automatic prefixing of header
search path options.
I fixed up (almost) all Makefiles in the kernel. Now 'addtree' and
'flags' have been removed.
Kbuild still caters to add $(srctree)/$(src) and $(objtree)/$(obj)
to the header search path for O= building, but never touches extra
compiler options from ccflags-y etc.
[1]: https://patchwork.kernel.org/patch/9632347/
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
|
|
Currently, the Kbuild core manipulates header search paths in a crazy
way [1].
To fix this mess, I want all Makefiles to add explicit $(srctree)/ to
the search paths in the srctree. Some Makefiles are already written in
that way, but not all. The goal of this work is to make the notation
consistent, and finally get rid of the gross hacks.
Having whitespaces after -I does not matter since commit 48f6e3cf5bc6
("kbuild: do not drop -I without parameter").
[1]: https://patchwork.kernel.org/patch/9632347/
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
|
|
Currently, the Kbuild core manipulates header search paths in a crazy
way [1].
To fix this mess, I want all Makefiles to add explicit $(srctree)/ to
the search paths in the srctree. Some Makefiles are already written in
that way, but not all. The goal of this work is to make the notation
consistent, and finally get rid of the gross hacks.
Having whitespaces after -I does not matter since commit 48f6e3cf5bc6
("kbuild: do not drop -I without parameter").
[1]: https://patchwork.kernel.org/patch/9632347/
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com>
|
|
I was able to build without these extra header search paths.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
|
|
As of Linux 5.1, alpha and s390 are the last architectures that
have defconfig in arch/*/ instead of arch/*/configs/.
$ find arch -name defconfig | sort
arch/alpha/defconfig
arch/arm64/configs/defconfig
arch/csky/configs/defconfig
arch/nds32/configs/defconfig
arch/riscv/configs/defconfig
arch/s390/defconfig
The arch/$(ARCH)/defconfig is the hard-coded default in Kconfig,
and I want to deprecate it after evacuating the remaining defconfig
into the standard location, arch/*/configs/.
Define KBUILD_DEFCONFIG like other architectures, and move defconfig
into the configs/ subdirectory.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Paul Walmsley <paul@pwsan.com>
|
|
If the compiler specified by $(CC) is not present, the Kconfig stage
sprinkles 'not found' messages, then succeeds.
$ make CROSS_COMPILE=foo defconfig
/bin/sh: 1: foogcc: not found
/bin/sh: 1: foogcc: not found
*** Default configuration is based on 'x86_64_defconfig'
./scripts/gcc-version.sh: 17: ./scripts/gcc-version.sh: foogcc: not found
./scripts/gcc-version.sh: 18: ./scripts/gcc-version.sh: foogcc: not found
./scripts/gcc-version.sh: 19: ./scripts/gcc-version.sh: foogcc: not found
./scripts/gcc-version.sh: 17: ./scripts/gcc-version.sh: foogcc: not found
./scripts/gcc-version.sh: 18: ./scripts/gcc-version.sh: foogcc: not found
./scripts/gcc-version.sh: 19: ./scripts/gcc-version.sh: foogcc: not found
./scripts/clang-version.sh: 11: ./scripts/clang-version.sh: foogcc: not found
./scripts/gcc-plugin.sh: 11: ./scripts/gcc-plugin.sh: foogcc: not found
init/Kconfig:16:warning: 'GCC_VERSION': number is invalid
#
# configuration written to .config
#
Terminate parsing files immediately if $(CC) or $(LD) is not found.
"make *config" will fail more nicely.
$ make CROSS_COMPILE=foo defconfig
*** Default configuration is based on 'x86_64_defconfig'
scripts/Kconfig.include:34: compiler 'foogcc' not found
make[1]: *** [scripts/kconfig/Makefile;82: defconfig] Error 1
make: *** [Makefile;557: defconfig] Error 2
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
|