Age | Commit message (Collapse) | Author | Files | Lines |
|
Since Switchtec patch there has been a new topology added to
the NTB API. It's called NTB_TOPO_SWITCH and dedicated for
PCIe switch chips. Even though topo field isn't used within the
IDT driver much, lets set it for the sake of unification.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
ntb_perf driver has been also updated so to have the multi-port
interface support. User now must specify what peer port is going
to be used to perform the test.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
There are devices (like IDT PCIe switches), which outbound MWs xlat address
is setup on peer side. In this case local side is supposed to allocate
a memory buffer and somehow deliver the xlat DMA address to peer so one
could set the outbound MW up. The MW test is altered so to support both
previous Intel/AMD and new IDT-like devices.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Messages NTB API is now available. ntb_tool driver has been altered
to perform messages send and receive operation. The test of messages
read/write to/from peer device has been added to the script.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Scratchpad NTB API has changed so has the ntb_tool driver. Outbound
Scratchpad DebugFS files have been moved to peer specific directories.
Each scratchpad is now available via separate file. The test code
has been accordingly altered.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
DB interface of ntb_tool driver hasn't been changed much, but
db_valid_mask DebugFS file has still been added. In this case
it's much better to test all valid DB bits instead of using
the predefined mask, which may be incorrect in general.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Link Up and Down methods are used to change NTB link settings on
local side only for multi-port devices. Link is considered up
only if both sides local and peer set it up. Intel/AMD hardware
acts a bit different by assigning the Primary and Secondary roles,
so Primary device only is able to change the link state. Such behaviour
should be reflected in the test code.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Multi-port interface is now available in ntb_tool driver. According
to the new NTB API, there might be more than two devices connected over
NTB. It means each device can have multiple freely enumerated ports.
Each port got index assigned by NTB hardware driver. This test is
performed to determine the local and peer ports as well as their indexes.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
If some of variables like LOC/REM or LOCAL_*/REMOTE_* got
whitespaces, the script may fail with syntax error.
Fixes: a9c59ef77458 ("ntb_test: Add a selftest script for the NTB subsystem")
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Acked-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Former NTB Performance driver could only work with NTB devices, which
got Scratchpads available and had just two ports. Since there are
devices, which don't have Scratchpads and got more than two peer
ports, the performance measuring tool needs to be rewritten. This
patch adds the ability to test any available NTB peer.
Additionally it allows to set NTB memory windows up using any
available data exchange interface: Scratchpad or Message registers.
Some cleanups are also added here.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Former NTB Debugging tool driver supported only the limited
functionality of the recently updated NTB API, which is now available
to work with the truly NTB multi-port devices and devices, which
got NTB Message registers instead of Scratchpads. This patch
fully rewrites the driver so one would fully expose all the new
NTB API interfaces. Particularly it concerns the Message registers,
peer ports API, NTB link settings. Additional cleanups are also added
here.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Current Ping Pong driver can't truly work with multi-port devices.
Additionally it requires the Scratchpad registers being available
on NTB device. This patches rewrites the driver so one would
perform the cyclic Ping-Pong algorithm around all the available
NTB peers and makes it working with NTB hardware, which doesn't
support Scratchpads, but such alternative as NTB Message register.
Additional cleanups are also added here.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Simple (1 << pidx) operation causes undefined behaviour when
pidx >= 32. It must be casted to u64 to match the actual return
value of ntb_link_is_up() method, so to have all the possible
peer indexes covered and to get rid of undefined behaviour.
Additionally there are special macros in "linux/bitops.h" to perform
the bit-set-shift operations, so it's recommended to have them used
for proper bit setting.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
The dma_mask and dma_coherent_mask fields of the NTB struct device
weren't initialized in hardware drivers. In fact it should be done
instead of PCIe interface usage, since NTB clients are supposed to
use NTB API and left unaware of real hardware implementation.
In addition to that ntb_device_register() method shouldn't clear
the passed ntb_dev structure, since it dma_mask is initialized
by hardware drivers.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
There is a common methods signature form used over all the NTB API
like functions naming scheme, arguments names and order, etc.
Recently added NTB messaging API IO callbacks were named a bit
different so should be renamed to be in compliance with the rest
of the API.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Newer gcc (version 7 and 8 presumably) warn about a statement mixing
the << operator with logical and:
drivers/ntb/hw/mscc/ntb_hw_switchtec.c: In function 'switchtec_ntb_init_sndev':
drivers/ntb/hw/mscc/ntb_hw_switchtec.c:888:24: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]
My interpretation here is that the author must have intended a bitmask
rather than a comparison, so I'm changing the '&&' to '&', which makes
a lot more sense in the context.
Fixes: 1b249475275d ("ntb_hw_switchtec: Allow using Switchtec NTB in multi-partition setups")
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
With Switchtec hardware, the buffer used for a memory window must be
aligned to its size (the hardware only replaces the lower bits). In
certain circumstances dma_alloc_coherent() will not provide a buffer
that adheres to this requirement like when using the CMA and
CONFIG_CMA_ALIGNMENT is set lower than the buffer size.
When we get an unaligned buffer mw_set_trans() should return an error.
We also log an error so we know the cause of the problem.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
When using the max_mw_size parameter of ntb_transport to limit the size of
the Memory windows, communication cannot be established and the queues
freeze.
This is because the mw_size that's reported to the peer is correctly
limited but the size used locally is not. So the MW is initialized
with a buffer smaller than the window but the TX side is using the
full window. This means the TX side will be writing to a region of the
window that points nowhere.
This is easily fixed by applying the same limit to tx_size in
ntb_transport_init_queue().
Fixes: e26a5843f7f5 ("NTB: Split ntb_hw_intel and ntb_transport drivers")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Allen Hubbe <Allen.Hubbe@dell.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
I am no longer employed by Dell EMC. For the purposes of NTB driver
development and maintenance, please contact me via my personal email.
Signed-off-by: Allen Hubbe <allenbh@gmail.com>
Signed-off-by: Allen Hubbe <Allen.Hubbe@emc.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
If one host crashes and soft reboots, the other host may not see a
link down event. Then when the crashed host comes back up, the
surviving host may not know the link was reset and the NTB clients
may not work without being reset.
To solve this, we send a LINK_FORCE_DOWN message to each peer every
time we come up, before we register the NTB device. If a surviving
host still thinks the link is up it will take it down immediately.
In this way, once the crashed host comes up fully, it will send a
regular link up event as per usual and the link will be properly
restarted.
While we are in the area, this also fixes the MSG_LINK_UP message that
was in the link down function that was reported by Doug Meyers.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reported-by: ThanhTuThai <cruisethai@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
In a crosslink configuration doorbells and messages largely work the
same but the NTB registers must be accessed through the reserved LUT
window. Also, as a bonus, seeing there are now two independent sets of
NTB links, both partitions can actually use all 60 doorbell registers
instead of them having to be split into two for each partition.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Crosslink is a feature of the Switchtec switches that is similar to
the B2B mode of other NTB devices. It allows a system to be designed
that is perfectly symmetric with two identical switches that link
two hosts together.
In order for the system to be symmetric, there is an empty host-less
partition between the two switches which the host must enumerate and
assign BAR addresses to. The firmware in the switch manages this
specially so that the BAR addresses on both sides of the empty
partition will be identical despite being in the same partition with
the same address space.
The driver determines whether crosslink is enabled by a flag set in
the NTB partition info registers which are set by the switch's
configuration file.
When crosslink is enabled, a reserved LUT window is setup to point to
the peer's switch's NTB registers and the local MWs are set to forward
to the host-less partition's BARs. (Yes, this hurts my brain too.)
Once this is setup, largely the same NTB infrastructure is used to
communicate between the two hosts.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
The PFF CSR registers actual mirrors the PCI configuration space
for all the ports in the switch. Previously, this was not needed by
the driver but will be used by the crosslink code to enumerate the
bus in an host-less centre partition.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
This is a prep patch in order to support the crosslink feature which
will require the driver to setup the requester ID table in another
partition as well as it's own. To aid this, create a helper function
which sets up the requester IDs from an array.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
This is a prep patch in order to support the crosslink feature which
will require the driver to use another reserved LUT window. To
simplify this we move the code which sets up the reserved LUT window
into a helper function which will be used by the crosslink
initialization.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
This is a prep patch in order to support the crosslink feature which will
require the driver to use another reserved LUT window. To simplify this,
we add some code to track the number of reserved LUT windows in use
instead of assuming this is always 1.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Allow using Switchtec NTB in setups that have more than two partitions.
Note: this does not enable having multi-host communication, it only
allows for a single NTB link between two hosts in a network that might
have more than two.
Use following logic to determine the NT peer partition:
1) If there are 2 partitions, and the target vector is set in
the Switchtec configuration, use the partition specified in target
vector.
2) If there are 2 partitions and target vector is unset
use the only other partition as specified in the NT EP map.
3) If there are more than 2 partitions and target vector is set
use the other partition specified in target vector.
4) If there are more than 2 partitions and target vector is unset,
this is invalid and report an error.
Signed-off-by: Kelvin Cao <kelvin.cao@microsemi.com>
[logang@deltatee.com: commit message fleshed out]
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Trivial addition of "\n" to the dev_* prints where necessary
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Trivial fix to spelling mistake in dev_err error message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-By: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
Removing dead code since this is not being used.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
There is no need to #define the license of the driver, just put it in
the MODULE_LICENSE() line directly as a text string.
This allows tools that check that the module license matches the source
code license to work properly, as there is no need to unwind the
unneeded dereference, especially when the string is defined just a few
lines above the usage of it.
Reported-and-reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <Allen.Hubbe@emc.com>
Cc: Gary R Hook <gary.hook@amd.com>
Cc: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
This resolves a bug which may incorrectly configure the peer host's
LUT for shared memory window access. The code was using the local
host's first BAR number, rather than the peer hosts's first BAR
number, to determine what peer NT control register to program.
The bug will cause the Switchtec NTB link to work only if both peers
have the same first NTB BAR configured. In all other configurations,
the link will not come up, failing silently.
When both hosts have the same first BAR, the configuration works only
because the first BAR numbers happent to be the same. When the hosts
do not have the same first BAR, then the LUT translation will not be
configured in the correct peer LUT and will not give the peer the
shared memory window access required for the link to operate.
Signed-off-by: Doug Meyer <dmeyer@gigaio.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Fixes: 678784a44ae8 ("NTB: switchtec_ntb: Initialize hardware for memory windows")
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
|
|
When ORC support was added for the ftrace_64.S code, an ENDPROC
for function_hook() was missed. This results in the following warning:
arch/x86/kernel/ftrace_64.o: warning: objtool: .entry.text+0x0: unreachable instruction
Fixes: e2ac83d74a4d ("x86/ftrace: Fix ORC unwinding from ftrace handlers")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180128022150.dqierscqmt3uwwsr@treble
|
|
The hrtimer interrupt code contains a hang detection and mitigation
mechanism, which prevents that a long delayed hrtimer interrupt causes a
continous retriggering of interrupts which prevent the system from making
progress. If a hang is detected then the timer hardware is programmed with
a certain delay into the future and a flag is set in the hrtimer cpu base
which prevents newly enqueued timers from reprogramming the timer hardware
prior to the chosen delay. The subsequent hrtimer interrupt after the delay
clears the flag and resumes normal operation.
If such a hang happens in the last hrtimer interrupt before a CPU is
unplugged then the hang_detected flag is set and stays that way when the
CPU is plugged in again. At that point the timer hardware is not armed and
it cannot be armed because the hang_detected flag is still active, so
nothing clears that flag. As a consequence the CPU does not receive hrtimer
interrupts and no timers expire on that CPU which results in RCU stalls and
other malfunctions.
Clear the flag along with some other less critical members of the hrtimer
cpu base to ensure starting from a clean state when a CPU is plugged in.
Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
root cause of that hard to reproduce heisenbug. Once understood it's
trivial and certainly justifies a brown paperbag.
Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
|
|
Due to some unfortunate events, I have not been directly involved in
the x86 kernel patch flow for a while now. I have also not been able
to ramp back up by now like I had hoped to, and after reviewing what I
will need to work on both internally at Intel and elsewhere in the near
term, it is clear that I am not going to be able to ramp back up until
late 2018 at the very earliest.
It is not acceptable to not recognize that this load is currently
taken by Ingo and Thomas without my direct participation, so I mark
myself as R: (designated reviewer) rather than M: (maintainer) until
further notice. This is in fact recognizing the de facto situation
for the past few years.
I have obviously no intention of going away, and I will do everything
within my power to improve Linux on x86 and x86 for Linux. This,
however, puts credit where it is due and reflects a change of focus.
This patch also removes stale entries for portions of the x86
architecture which have not been maintained separately from arch/x86
for a long time. If there is a reason to re-introduce them then that
can happen later.
Signed-off-by: H. Peter Anvin <h.peter.anvin@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bruce Schlobohm <bruce.schlobohm@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180125195934.5253-1-hpa@zytor.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
select(2) with wfds but no rfds must return when the socket is shut down
by the peer. This way userspace notices socket activity and gets -EPIPE
from the next write(2).
Currently select(2) does not return for virtio-vsock when a SEND+RCV
shutdown packet is received. This is because vsock_poll() only sets
POLLOUT | POLLWRNORM for TCP_CLOSE, not the TCP_CLOSING state that the
socket is in when the shutdown is received.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ccid2_hc_tx_rto_expire() timer callback always restarts the timer
again and can run indefinitely (unless it is stopped outside), and after
commit 120e9dabaf55 ("dccp: defer ccid_hc_tx_delete() at dismantle time"),
which moved ccid_hc_tx_delete() (also includes sk_stop_timer()) from
dccp_destroy_sock() to sk_destruct(), this started to happen quite often.
The timer prevents releasing the socket, as a result, sk_destruct() won't
be called.
Found with LTP/dccp_ipsec tests running on the bonding device,
which later couldn't be unloaded after the tests were completed:
unregister_netdevice: waiting for bond0 to become free. Usage count = 148
Fixes: 2a91aa396739 ("[DCCP] CCID2: Initial CCID2 (TCP-Like) implementation")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now that we're upstream in Linux we've been able to make some
infrastructure changes so our port works a bit more like other ports.
Specifically:
* We now have a mailing list specific to the RISC-V Linux port, hosted
at lists.infreadead.org.
* We now have a kernel.org git tree where work on our port is
coordinated.
This patch changes the RISC-V maintainers entry to reflect these new
bits of infrastructure.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
|
|
On a 5-level kernel, if a non-init mm has a top-level entry, it needs to
match init_mm's, but the vmalloc_fault() code skipped over the BUG_ON()
that would have checked it.
While we're at it, get rid of the rather confusing 4-level folded "pgd"
logic.
Cleans-up: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Neil Berrington <neil.berrington@datacore.com>
Link: https://lkml.kernel.org/r/2ae598f8c279b0a29baf75df207e6f2fdddc0a1b.1516914529.git.luto@kernel.org
|
|
Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses
large amounts of vmalloc space with PTI enabled.
The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd
folding code into account, so, on a 4-level kernel, the pgd synchronization
logic compiles away to exactly nothing.
Interestingly, the problem doesn't trigger with nopti. I assume this is
because the kernel is mapped with global pages if we boot with nopti. The
sequence of operations when we create a new task is that we first load its
mm while still running on the old stack (which crashes if the old stack is
unmapped in the new mm unless the TLB saves us), then we call
prepare_switch_to(), and then we switch to the new stack.
prepare_switch_to() pokes the new stack directly, which will populate the
mapping through vmalloc_fault(). I assume that we're getting lucky on
non-PTI systems -- the old stack's TLB entry stays alive long enough to
make it all the way through prepare_switch_to() and switch_to() so that we
make it to a valid stack.
Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Reported-and-tested-by: Neil Berrington <neil.berrington@datacore.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: stable@vger.kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/346541c56caed61abbe693d7d2742b4a380c5001.1516914529.git.luto@kernel.org
|
|
Sukumar reported that sends to the local broadcast address
(255.255.255.255) are broken. Check for the address in vrf driver
and do not redirect to the VRF device - similar to multicast
packets.
With this change sockets can use SO_BINDTODEVICE to specify an
egress interface and receive responses. Note: the egress interface
can not be a VRF device but needs to be the enslaved device.
https://bugzilla.kernel.org/show_bug.cgi?id=198521
Reported-by: Sukumar Gopalakrishnan <sukumarg1973@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Hardware statistics retrieval hurts in tight invocation loops.
Avoid extraneous write and enforce strict ordering of writes targeted to
the tally counters dump area address registers.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Tested-by: Oliver Freyermuth <o.freyermuth@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After do_readv_writev, the inode cache is invalidated anyway, so i_size
will never be read. It will be fetched from the server which will also
know about updates from other machines.
Fixes deadlock on 32-bit SMP.
See https://marc.info/?l=linux-fsdevel&m=151268557427760&w=2
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For a while we've been having issues with seemingly random interrupts
coming from nvidia cards when resuming them. Originally the fix for this
was thought to be just re-arming the MSI interrupt registers right after
re-allocating our IRQs, however it seems a lot of what we do is both
wrong and not even nessecary.
This was made apparent by what appeared to be a regression in the
mainline kernel that started introducing suspend/resume issues for
nouveau:
a0c9259dc4e1 (irq/matrix: Spread interrupts on allocation)
After this commit was introduced, we started getting interrupts from the
GPU before we actually re-allocated our own IRQ (see references below)
and assigned the IRQ handler. Investigating this turned out that the
problem was not with the commit, but the fact that nouveau even
free/allocates it's irqs before and after suspend/resume.
For starters: drivers in the linux kernel haven't had to handle
freeing/re-allocating their IRQs during suspend/resume cycles for quite
a while now. Nouveau seems to be one of the few drivers left that still
does this, despite the fact there's no reason we actually need to since
disabling interrupts from the device side should be enough, as the
kernel is already smart enough to know to disable host-side interrupts
for us before going into suspend. Since we were tearing down our IRQs by
hand however, that means there was a short period during resume where
interrupts could be received before we re-allocated our IRQ which would
lead to us getting an unhandled IRQ. Since we never handle said IRQ and
re-arm the interrupt registers, this would cause us to miss all of the
interrupts from the GPU and cause our init process to start timing out
on anything requiring interrupts.
So, since this whole setup/teardown every suspend/resume cycle is
useless anyway, move irq setup/teardown into the pci subdev's ctor/dtor
functions instead so they're only called at driver load and driver
unload. This should fix most of the issues with pending interrupts on
resume, along with getting suspend/resume for nouveau to work again.
As well, this probably means we can also just remove the msi rearm call
inside nvkm_pci_init(). But since our main focus here is to fix
suspend/resume before 4.15, we'll save that for a later patch.
Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
|
|
Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to:
"BUG: unable to handle kernel NULL pointer dereference at (null)"
Let's add a helper to check if update_pmtu is available before calling it.
Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
CC: Roman Kapl <code@rkapl.cz>
CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.
For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open. However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open. In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence. The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:
unregister_netdevice: waiting for lo to become free. Usage count = 1
These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.
After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.
Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman <ddstreet@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
More lockdep gifts, a 5-way lockup race:
perf_event_create_kernel_counter()
perf_event_alloc()
perf_try_init_event()
x86_pmu_event_init()
__x86_pmu_event_init()
x86_reserve_hardware()
#0 mutex_lock(&pmc_reserve_mutex);
reserve_ds_buffer()
#1 get_online_cpus()
perf_event_release_kernel()
_free_event()
hw_perf_event_destroy()
x86_release_hardware()
#0 mutex_lock(&pmc_reserve_mutex)
release_ds_buffer()
#1 get_online_cpus()
#1 do_cpu_up()
perf_event_init_cpu()
#2 mutex_lock(&pmus_lock)
#3 mutex_lock(&ctx->mutex)
sys_perf_event_open()
mutex_lock_double()
#3 mutex_lock(ctx->mutex)
#4 mutex_lock_nested(ctx->mutex, 1);
perf_try_init_event()
#4 mutex_lock_nested(ctx->mutex, 1)
x86_pmu_event_init()
intel_pmu_hw_config()
x86_add_exclusive()
#0 mutex_lock(&pmc_reserve_mutex)
Fix it by using ordering constructs instead of locking.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Lockdep noticed the following 3-way lockup scenario:
sys_perf_event_open()
perf_event_alloc()
perf_try_init_event()
#0 ctx = perf_event_ctx_lock_nested(1)
perf_swevent_init()
swevent_hlist_get()
#1 mutex_lock(&pmus_lock)
perf_event_init_cpu()
#1 mutex_lock(&pmus_lock)
#2 mutex_lock(&ctx->mutex)
sys_perf_event_open()
mutex_lock_double()
#2 mutex_lock()
#0 mutex_lock_nested()
And while we need that perf_event_ctx_lock_nested() for HW PMUs such
that they can iterate the sibling list, trying to match it to the
available counters, the software PMUs need do no such thing. Exclude
them.
In particular the swevent triggers the above invertion, while the
tpevent PMU triggers a more elaborate one through their event_mutex.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Lockdep noticed the following 3-way lockup race:
perf_trace_init()
#0 mutex_lock(&event_mutex)
perf_trace_event_init()
perf_trace_event_reg()
tp_event->class->reg() := tracepoint_probe_register
#1 mutex_lock(&tracepoints_mutex)
trace_point_add_func()
#2 static_key_enable()
#2 do_cpu_up()
perf_event_init_cpu()
#3 mutex_lock(&pmus_lock)
#4 mutex_lock(&ctx->mutex)
perf_ioctl()
#4 ctx = perf_event_ctx_lock()
_perf_iotcl()
ftrace_profile_set_filter()
#0 mutex_lock(&event_mutex)
Fudge it for now by noting that the tracepoint state does not depend
on the event <-> context relation. Ugly though :/
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|