aboutsummaryrefslogtreecommitdiffstats
path: root/net/hsr/hsr_debugfs.c (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2019-12-25hsr: rename debugfs file when interface name is changedTaehee Yoo1-0/+13
hsr interface has own debugfs file, which name is same with interface name. So, interface name is changed, debugfs file name should be changed too. Fixes: fc4ecaeebd26 ("net: hsr: add debugfs support for display node list") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25hsr: add hsr root debugfs directoryTaehee Yoo4-3/+28
In current hsr code, when hsr interface is created, it creates debugfs directory /sys/kernel/debug/<interface name>. If there is same directory or file name in there, it fails. In order to reduce possibility of failure of creation of debugfs, this patch adds root directory. Test commands: ip link add dummy0 type dummy ip link add dummy1 type dummy ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1 Before this patch: /sys/kernel/debug/hsr0/node_table After this patch: /sys/kernel/debug/hsr/hsr0/node_table Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25hsr: fix error handling routine in hsr_dev_finalize()Taehee Yoo3-24/+21
hsr_dev_finalize() is called to create new hsr interface. There are some wrong error handling codes. 1. wrong checking return value of debugfs_create_{dir/file}. These function doesn't return NULL. If error occurs in there, it returns error pointer. So, it should check error pointer instead of NULL. 2. It doesn't unregister interface if it fails to setup hsr interface. If it fails to initialize hsr interface after register_netdevice(), it should call unregister_netdevice(). 3. Ignore failure of creation of debugfs If creating of debugfs dir and file is failed, creating hsr interface will be failed. But debugfs doesn't affect actual logic of hsr module. So, ignoring this is more correct and this behavior is more general. Fixes: c5a759117210 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25hsr: avoid debugfs warning message when module is removeTaehee Yoo1-1/+0
When hsr module is being removed, debugfs_remove() is called to remove both debugfs directory and file. When module is being removed, module state is changed to MODULE_STATE_GOING then exit() is called. At this moment, module couldn't be held so try_module_get() will be failed. debugfs's open() callback tries to hold the module if .owner is existing. If it fails, warning message is printed. CPU0 CPU1 delete_module() try_stop_module() hsr_exit() open() <-- WARNING debugfs_remove() In order to avoid the warning message, this patch makes hsr module does not set .owner. Unsetting .owner is safe because these are protected by inode_lock(). Test commands: #SHELL1 ip link add dummy0 type dummy ip link add dummy1 type dummy while : do ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1 modprobe -rv hsr done #SHELL2 while : do cat /sys/kernel/debug/hsr0/node_table done Splat looks like: [ 101.223783][ T1271] ------------[ cut here ]------------ [ 101.230309][ T1271] debugfs file owner did not clean up at exit: node_table [ 101.230380][ T1271] WARNING: CPU: 3 PID: 1271 at fs/debugfs/file.c:309 full_proxy_open+0x10f/0x650 [ 101.233153][ T1271] Modules linked in: hsr(-) dummy veth openvswitch nsh nf_conncount nf_nat nf_conntrack nf_d] [ 101.237112][ T1271] CPU: 3 PID: 1271 Comm: cat Tainted: G W 5.5.0-rc1+ #204 [ 101.238270][ T1271] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 101.240379][ T1271] RIP: 0010:full_proxy_open+0x10f/0x650 [ 101.241166][ T1271] Code: 48 c1 ea 03 80 3c 02 00 0f 85 c1 04 00 00 49 8b 3c 24 e8 04 86 7e ff 84 c0 75 2d 4c 8 [ 101.251985][ T1271] RSP: 0018:ffff8880ca22fa38 EFLAGS: 00010286 [ 101.273355][ T1271] RAX: dffffc0000000008 RBX: ffff8880cc6e6200 RCX: 0000000000000000 [ 101.274466][ T1271] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8880c4dd5c14 [ 101.275581][ T1271] RBP: 0000000000000000 R08: fffffbfff2922f5d R09: 0000000000000000 [ 101.276733][ T1271] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffffc0551bc0 [ 101.277853][ T1271] R13: ffff8880c4059a48 R14: ffff8880be50a5e0 R15: ffffffff941adaa0 [ 101.278956][ T1271] FS: 00007f8871cda540(0000) GS:ffff8880da800000(0000) knlGS:0000000000000000 [ 101.280216][ T1271] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 101.282832][ T1271] CR2: 00007f88717cfd10 CR3: 00000000b9440005 CR4: 00000000000606e0 [ 101.283974][ T1271] Call Trace: [ 101.285328][ T1271] do_dentry_open+0x63c/0xf50 [ 101.286077][ T1271] ? open_proxy_open+0x270/0x270 [ 101.288271][ T1271] ? __x64_sys_fchdir+0x180/0x180 [ 101.288987][ T1271] ? inode_permission+0x65/0x390 [ 101.289682][ T1271] path_openat+0x701/0x2810 [ 101.290294][ T1271] ? path_lookupat+0x880/0x880 [ 101.290957][ T1271] ? check_chain_key+0x236/0x5d0 [ 101.291676][ T1271] ? __lock_acquire+0xdfe/0x3de0 [ 101.292358][ T1271] ? sched_clock+0x5/0x10 [ 101.292962][ T1271] ? sched_clock_cpu+0x18/0x170 [ 101.293644][ T1271] ? find_held_lock+0x39/0x1d0 [ 101.305616][ T1271] do_filp_open+0x17a/0x270 [ 101.306061][ T1271] ? may_open_dev+0xc0/0xc0 [ ... ] Fixes: fc4ecaeebd26 ("net: hsr: add debugfs support for display node list") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25MAINTAINERS: Add additional maintainers to ENA Ethernet driverNetanel Belgazal1-0/+2
Signed-off-by: Netanel Belgazal <netanel@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24s390/qeth: fix initialization on old HWJulian Wiedmann1-3/+1
I stumbled over an old OSA model that claims to support DIAG_ASSIST, but then rejects the cmd to query its DIAG capabilities. In the old code this was ok, as the returned raw error code was > 0. Now that we translate the raw codes to errnos, the "rc < 0" causes us to fail the initialization of the device. The fix is trivial: don't bail out when the DIAG query fails. Such an error is not critical, we can still use the device (with a slightly reduced set of features). Fixes: 742d4d40831d ("s390/qeth: convert remaining legacy cmd callbacks") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24s390/qeth: vnicc Fix init to defaultAlexandra Winter1-1/+3
During vnicc_init wanted_char should be compared to cur_char and not to QETH_VNICC_DEFAULT. Without this patch there is no way to enforce the default values as desired values. Note, that it is expected, that a card comes online with default values. This patch was tested with private card firmware. Fixes: caa1f0b10d18 ("s390/qeth: add VNICC enable/disable support") Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24s390/qeth: Fix vnicc_is_in_use if rx_bcast not setAlexandra Winter1-2/+1
Symptom: After vnicc/rx_bcast has been manually set to 0, bridge_* sysfs parameters can still be set or written. Only occurs on HiperSockets, as OSA doesn't support changing rx_bcast. Vnic characteristics and bridgeport settings are mutually exclusive. rx_bcast defaults to 1, so manually setting it to 0 should disable bridge_* parameters. Instead it makes sense here to check the supported mask. If the card does not support vnicc at all, bridge commands are always allowed. Fixes: caa1f0b10d18 ("s390/qeth: add VNICC enable/disable support") Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24s390/qeth: fix false reporting of VNIC CHAR config failureAlexandra Winter1-1/+0
Symptom: Error message "Configuring the VNIC characteristics failed" in dmesg whenever an OSA interface on z15 is set online. The VNIC characteristics get re-programmed when setting a L2 device online. This follows the selected 'wanted' characteristics - with the exception that the INVISIBLE characteristic unconditionally gets switched off. For devices that don't support INVISIBLE (ie. OSA), the resulting IO failure raises a noisy error message ("Configuring the VNIC characteristics failed"). For IQD, INVISIBLE is off by default anyways. So don't unnecessarily special-case the INVISIBLE characteristic, and thereby suppress the misleading error message on OSA devices. Fixes: caa1f0b10d18 ("s390/qeth: add VNICC enable/disable support") Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24s390/qeth: lock the card while changing its hsuidJulian Wiedmann2-17/+28
qeth_l3_dev_hsuid_store() initially checks the card state, but doesn't take the conf_mutex to ensure that the card stays in this state while being reconfigured. Rework the code to take this lock, and drop a redundant state check in a helper function. Fixes: b333293058aa ("qeth: add support for af_iucv HiperSockets transport") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24s390/qeth: fix qdio teardown after early init errorJulian Wiedmann3-14/+10
qeth_l?_set_online() goes through a number of initialization steps, and on any error uses qeth_l?_stop_card() to tear down the residual state. The first initialization step is qeth_core_hardsetup_card(). When this fails after having established a QDIO context on the device (ie. somewhere after qeth_mpc_initialize()), qeth_l?_stop_card() doesn't shut down this QDIO context again (since the card state hasn't progressed from DOWN at this stage). Even worse, we then call qdio_free() as final teardown step to free the QDIO data structures - while some of them are still hooked into wider QDIO infrastructure such as the IRQ list. This is inevitably followed by use-after-frees and other nastyness. Fix this by unconditionally calling qeth_qdio_clear_card() to shut down the QDIO context, and also to halt/clear any pending activity on the various IO channels. Remove the naive attempt at handling the teardown in qeth_mpc_initialize(), it clearly doesn't suffice and we're handling it properly now in the wider teardown code. Fixes: 4a71df50047f ("qeth: new qeth device driver") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24net/dst: do not confirm neighbor for vxlan and geneve pmtu updateHangbin Liu1-1/+1
When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end, we should not call dst_confirm_neigh() as there is no two-way communication. So disable the neigh confirm for vxlan and geneve pmtu update. v5: No change. v4: No change. v3: Do not remove dst_confirm_neigh, but add a new bool parameter in dst_ops.update_pmtu to control whether we should do neighbor confirm. Also split the big patch to small ones for each area. v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu. Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path") Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path") Reviewed-by: Guillaume Nault <gnault@redhat.com> Tested-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24sit: do not confirm neighbor when do pmtu updateHangbin Liu1-1/+1
When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end, we should not call dst_confirm_neigh() as there is no two-way communication. v5: No change. v4: No change. v3: Do not remove dst_confirm_neigh, but add a new bool parameter in dst_ops.update_pmtu to control whether we should do neighbor confirm. Also split the big patch to small ones for each area. v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu. Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24vti: do not confirm neighbor when do pmtu updateHangbin Liu2-2/+2
When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end, we should not call dst_confirm_neigh() as there is no two-way communication. Although vti and vti6 are immune to this problem because they are IFF_NOARP interfaces, as Guillaume pointed. There is still no sense to confirm neighbour here. v5: Update commit description. v4: No change. v3: Do not remove dst_confirm_neigh, but add a new bool parameter in dst_ops.update_pmtu to control whether we should do neighbor confirm. Also split the big patch to small ones for each area. v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu. Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24tunnel: do not confirm neighbor when do pmtu updateHangbin Liu2-3/+3
When do tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end, we should not call dst_confirm_neigh() as there is no two-way communication. v5: No Change. v4: Update commit description v3: Do not remove dst_confirm_neigh, but add a new bool parameter in dst_ops.update_pmtu to control whether we should do neighbor confirm. Also split the big patch to small ones for each area. v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu. Fixes: 0dec879f636f ("net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP") Reviewed-by: Guillaume Nault <gnault@redhat.com> Tested-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24net/dst: add new function skb_dst_update_pmtu_no_confirmHangbin Liu1-0/+9
Add a new function skb_dst_update_pmtu_no_confirm() for callers who need update pmtu but should not do neighbor confirm. v5: No change. v4: No change. v3: Do not remove dst_confirm_neigh, but add a new bool parameter in dst_ops.update_pmtu to control whether we should do neighbor confirm. Also split the big patch to small ones for each area. v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu. Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24gtp: do not confirm neighbor when do pmtu updateHangbin Liu1-1/+1
When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end, we should not call dst_confirm_neigh() as there is no two-way communication. Although GTP only support ipv4 right now, and __ip_rt_update_pmtu() does not call dst_confirm_neigh(), we still set it to false to keep consistency with IPv6 code. v5: No change. v4: No change. v3: Do not remove dst_confirm_neigh, but add a new bool parameter in dst_ops.update_pmtu to control whether we should do neighbor confirm. Also split the big patch to small ones for each area. v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu. Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24ip6_gre: do not confirm neighbor when do pmtu updateHangbin Liu1-1/+1
When we do ipv6 gre pmtu update, we will also do neigh confirm currently. This will cause the neigh cache be refreshed and set to REACHABLE before xmit. But if the remote mac address changed, e.g. device is deleted and recreated, we will not able to notice this and still use the old mac address as the neigh cache is REACHABLE. Fix this by disable neigh confirm when do pmtu update v5: No change. v4: No change. v3: Do not remove dst_confirm_neigh, but add a new bool parameter in dst_ops.update_pmtu to control whether we should do neighbor confirm. Also split the big patch to small ones for each area. v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu. Reported-by: Jianlin Shi <jishi@redhat.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24net: add bool confirm_neigh parameter for dst_ops.update_pmtuHangbin Liu14-25/+42
The MTU update code is supposed to be invoked in response to real networking events that update the PMTU. In IPv6 PMTU update function __ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor confirmed time. But for tunnel code, it will call pmtu before xmit, like: - tnl_update_pmtu() - skb_dst_update_pmtu() - ip6_rt_update_pmtu() - __ip6_rt_update_pmtu() - dst_confirm_neigh() If the tunnel remote dst mac address changed and we still do the neigh confirm, we will not be able to update neigh cache and ping6 remote will failed. So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we should not be invoking dst_confirm_neigh() as we have no evidence of successful two-way communication at this point. On the other hand it is also important to keep the neigh reachability fresh for TCP flows, so we cannot remove this dst_confirm_neigh() call. To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu to choose whether we should do neigh update or not. I will add the parameter in this patch and set all the callers to true to comply with the previous way, and fix the tunnel code one by one on later patches. v5: No change. v4: No change. v3: Do not remove dst_confirm_neigh, but add a new bool parameter in dst_ops.update_pmtu to control whether we should do neighbor confirm. Also split the big patch to small ones for each area. v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu. Suggested-by: David Miller <davem@davemloft.net> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24net: dsa: bcm_sf2: Fix IP fragment location and behaviorFlorian Fainelli1-3/+3
The IP fragment is specified through user-defined field as the first bit of the first user-defined word. We were previously trying to extract it from the user-defined mask which could not possibly work. The ip_frag is also supposed to be a boolean, if we do not cast it as such, we risk overwriting the next fields in CFP_DATA(6) which would render the rule inoperative. Fixes: 7318166cacad ("net: dsa: bcm_sf2: Add support for ethtool::rxnfc") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24sctp: fix err handling of stream initializationMarcelo Ricardo Leitner1-15/+15
The fix on 951c6db954a1 fixed the issued reported there but introduced another. When the allocation fails within sctp_stream_init() it is okay/necessary to free the genradix. But it is also called when adding new streams, from sctp_send_add_streams() and sctp_process_strreset_addstrm_in() and in those situations it cannot just free the genradix because by then it is a fully operational association. The fix here then is to only free the genradix in sctp_stream_init() and on those other call sites move on with what it already had and let the subsequent error handling to handle it. Tested with the reproducers from this report and the previous one, with lksctp-tools and sctp-tests. Reported-by: syzbot+9a1bc632e78a1a98488b@syzkaller.appspotmail.com Fixes: 951c6db954a1 ("sctp: fix memleak on err handling of stream initialization") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24udp: fix integer overflow while computing available space in sk_rcvbufAntonio Messina1-1/+1
When the size of the receive buffer for a socket is close to 2^31 when computing if we have enough space in the buffer to copy a packet from the queue to the buffer we might hit an integer overflow. When an user set net.core.rmem_default to a value close to 2^31 UDP packets are dropped because of this overflow. This can be visible, for instance, with failure to resolve hostnames. This can be fixed by casting sk_rcvbuf (which is an int) to unsigned int, similarly to how it is done in TCP. Signed-off-by: Antonio Messina <amessina@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-22pipe: fix empty pipe check in pipe_write()Jan Stancek1-1/+1
LTP pipeio_1 test is hanging with v5.5-rc2-385-gb8e382a185eb, with read side observing empty pipe and sleeping and write side running out of space and then sleeping as well. In this scenario there are 5 writers and 1 reader. Problem is that after pipe_write() reacquires pipe lock, it re-checks for empty pipe with potentially stale 'head' and doesn't wake up read side anymore. pipe->tail can advance beyond 'head', because there are multiple writers. Use pipe->head for empty pipe check after reacquiring lock to observe current state. Testing: With patch, LTP pipeio_1 ran successfully in loop for 1 hour. Without patch it hanged within a minute. Fixes: 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic") Reported-by: Rachel Sibley <rasibley@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-22MAINTAINERS: remove Radim from KVM maintainersPaolo Bonzini1-2/+0
Radim's kernel.org email is bouncing, which I take as a signal that he is not really able to deal with KVM at this time. Make MAINTAINERS match the effective value of KVM's bus factor. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-22MAINTAINERS: Orphan KVM for MIPSJames Hogan1-2/+2
I haven't been active for 18 months, and don't have the hardware set up to test KVM for MIPS, so mark it as orphaned and remove myself as maintainer. Hopefully somebody from MIPS can pick this up. Signed-off-by: James Hogan <jhogan@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Paul Burton <paulburton@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: kvm@vger.kernel.org Cc: linux-mips@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-21ext4: clarify impact of 'commit' mount optionJan Kara1-8/+11
The description of 'commit' mount option dates back to ext3 times. Update the description to match current meaning for ext4. Reported-by: Paul Richards <paul.richards@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191218111210.14161-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-21ext4: fix unused-but-set-variable warning in ext4_add_entry()Yunfeng Ye1-1/+3
Warning is found when compile with "-Wunused-but-set-variable": fs/ext4/namei.c: In function ‘ext4_add_entry’: fs/ext4/namei.c:2167:23: warning: variable ‘sbi’ set but not used [-Wunused-but-set-variable] struct ext4_sb_info *sbi; ^~~ Fix this by moving the variable @sbi under CONFIG_UNICODE. Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/cb5eb904-224a-9701-c38f-cb23514b1fff@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-21tracing: Fix endianness bug in histogram triggerSven Schnelle1-1/+20
At least on PA-RISC and s390 synthetic histogram triggers are failing selftests because trace_event_raw_event_synth() always writes a 64 bit values, but the reader expects a field->size sized value. On little endian machines this doesn't hurt, but on big endian this makes the reader always read zero values. Link: http://lore.kernel.org/linux-trace-devel/20191218074427.96184-4-svens@linux.ibm.com Cc: stable@vger.kernel.org Fixes: 4b147936fa509 ("tracing: Add support for 'synthetic' events") Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-21samples/trace_printk: Wait for IRQ work to finishSven Schnelle1-0/+1
trace_printk schedules work via irq_work_queue(), but doesn't wait until it was processed. The kprobe_module.tc testcase does: :;: "Load module again, which means the event1 should be recorded";: modprobe trace-printk grep "event1:" trace so the grep which checks the trace file might run before the irq work was processed. Fix this by adding a irq_work_sync(). Link: http://lore.kernel.org/linux-trace-devel/20191218074427.96184-3-svens@linux.ibm.com Cc: stable@vger.kernel.org Fixes: af2a0750f3749 ("selftests/ftrace: Improve kprobe on module testcase to load/unload module") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-21tracing: Fix lock inversion in trace_event_enable_tgid_record()Prateek Sood2-4/+12
Task T2 Task T3 trace_options_core_write() subsystem_open() mutex_lock(trace_types_lock) mutex_lock(event_mutex) set_tracer_flag() trace_event_enable_tgid_record() mutex_lock(trace_types_lock) mutex_lock(event_mutex) This gives a circular dependency deadlock between trace_types_lock and event_mutex. To fix this invert the usage of trace_types_lock and event_mutex in trace_options_core_write(). This keeps the sequence of lock usage consistent. Link: http://lkml.kernel.org/r/0101016eef175e38-8ca71caf-a4eb-480d-a1e6-6f0bbc015495-000000@us-west-2.amazonses.com Cc: stable@vger.kernel.org Fixes: d914ba37d7145 ("tracing: Add support for recording tgid of tasks") Signed-off-by: Prateek Sood <prsood@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-22kbuild: clarify the difference between obj-y and obj-m w.r.t. descendingMasahiro Yamada1-3/+13
Kbuild descends into a directory by either 'y' or 'm', but there is an important difference. Kbuild combines the built-in objects into built-in.a in each directory. The built-in.a in the directory visited by obj-y is merged into the built-in.a in the parent directory. This merge happens recursively when Kbuild is ascending back towards the top directory, then built-in objects are linked into vmlinux eventually. This works properly only when the Makefile specifying obj-y is reachable by the chain of obj-y. On the other hand, Kbuild does not take built-in.a from the directory visited by obj-m. This it, all the objects in that directory are supposed to be modular. If Kbuild descends into a directory by obj-m, but the Makefile in the sub-directory specifies obj-y, those objects are just left orphan. The current statement "Kbuild only uses this information to decide that it needs to visit the directory" is misleading. Clarify the difference. Reported-by: Johan Hovold <johan@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Johan Hovold <johan@kernel.org>
2019-12-20sfc: Include XDP packet headroom in buffer step size.Charles McLachlan1-7/+7
Correct a mismatch between rx_page_buf_step and the actual step size used when filling buffer pages. This patch fixes the page overrun that occured when the MTU was set to anything bigger than 1692. Fixes: 3990a8fffbda ("sfc: allocate channels for XDP tx queues") Signed-off-by: Charles McLachlan <cmclachlan@solarflare.com> Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20sfc: fix channel allocation with brute forceEdward Cree2-22/+19
It was possible for channel allocation logic to get confused between what it had and what it wanted, and end up trying to use the same channel for both PTP and regular TX. This led to a kernel panic: BUG: unable to handle page fault for address: 0000000000047635 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 5.4.0-rc3-ehc14+ #900 Hardware name: Dell Inc. PowerEdge R710/0M233H, BIOS 6.4.0 07/23/2013 RIP: 0010:native_queued_spin_lock_slowpath+0x188/0x1e0 Code: f3 90 48 8b 32 48 85 f6 74 f6 eb e8 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 c0 98 02 00 48 03 04 f5 a0 c6 ed 81 <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32 RSP: 0018:ffffc90000003d28 EFLAGS: 00010006 RAX: 0000000000047635 RBX: 0000000000000246 RCX: 0000000000040000 RDX: ffff888627a298c0 RSI: 0000000000003ffe RDI: ffff88861f6b8dd4 RBP: ffff8886225c6e00 R08: 0000000000040000 R09: 0000000000000000 R10: 0000000616f080c6 R11: 00000000000000c0 R12: ffff88861f6b8dd4 R13: ffffc90000003dc8 R14: ffff88861942bf00 R15: ffff8886150f2000 FS: 0000000000000000(0000) GS:ffff888627a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000047635 CR3: 000000000200a000 CR4: 00000000000006f0 Call Trace: <IRQ> _raw_spin_lock_irqsave+0x22/0x30 skb_queue_tail+0x1b/0x50 sock_queue_err_skb+0x9d/0xf0 __skb_complete_tx_timestamp+0x9d/0xc0 efx_dequeue_buffer+0x126/0x180 [sfc] efx_xmit_done+0x73/0x1c0 [sfc] efx_ef10_ev_process+0x56a/0xfe0 [sfc] ? tick_sched_do_timer+0x60/0x60 ? timerqueue_add+0x5d/0x70 ? enqueue_hrtimer+0x39/0x90 efx_poll+0x111/0x380 [sfc] ? rcu_accelerate_cbs+0x50/0x160 net_rx_action+0x14a/0x400 __do_softirq+0xdd/0x2d0 irq_exit+0xa0/0xb0 do_IRQ+0x53/0xe0 common_interrupt+0xf/0xf </IRQ> In the long run we intend to rewrite the channel allocation code, but for 'net' fix this by allocating extra_channels, and giving them TX queues, even if we do not in fact need them (e.g. on NICs without MAC TX timestamping), and thereby using simpler logic to assign the channels once they're allocated. Fixes: 3990a8fffbda ("sfc: allocate channels for XDP tx queues") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: dst: Force 4-byte alignment of dst_metricsGeert Uytterhoeven1-1/+1
When storing a pointer to a dst_metrics structure in dst_entry._metrics, two flags are added in the least significant bits of the pointer value. Hence this assumes all pointers to dst_metrics structures have at least 4-byte alignment. However, on m68k, the minimum alignment of 32-bit values is 2 bytes, not 4 bytes. Hence in some kernel builds, dst_default_metrics may be only 2-byte aligned, leading to obscure boot warnings like: WARNING: CPU: 0 PID: 7 at lib/refcount.c:28 refcount_warn_saturate+0x44/0x9a refcount_t: underflow; use-after-free. Modules linked in: CPU: 0 PID: 7 Comm: ksoftirqd/0 Tainted: G W 5.5.0-rc2-atari-01448-g114a1a1038af891d-dirty #261 Stack from 10835e6c: 10835e6c 0038134f 00023fa6 00394b0f 0000001c 00000009 00321560 00023fea 00394b0f 0000001c 001a70f8 00000009 00000000 10835eb4 00000001 00000000 04208040 0000000a 00394b4a 10835ed4 00043aa8 001a70f8 00394b0f 0000001c 00000009 00394b4a 0026aba8 003215a4 00000003 00000000 0026d5a8 00000001 003215a4 003a4361 003238d6 000001f0 00000000 003215a4 10aa3b00 00025e84 003ddb00 10834000 002416a8 10aa3b00 00000000 00000080 000aa038 0004854a Call Trace: [<00023fa6>] __warn+0xb2/0xb4 [<00023fea>] warn_slowpath_fmt+0x42/0x64 [<001a70f8>] refcount_warn_saturate+0x44/0x9a [<00043aa8>] printk+0x0/0x18 [<001a70f8>] refcount_warn_saturate+0x44/0x9a [<0026aba8>] refcount_sub_and_test.constprop.73+0x38/0x3e [<0026d5a8>] ipv4_dst_destroy+0x5e/0x7e [<00025e84>] __local_bh_enable_ip+0x0/0x8e [<002416a8>] dst_destroy+0x40/0xae Fix this by forcing 4-byte alignment of all dst_metrics structures. Fixes: e5fd387ad5b30ca3 ("ipv6: do not overwrite inetpeer metrics prematurely") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20selftests: pmtu: fix init mtu value in descriptionHangbin Liu1-3/+3
There is no a_r3, a_r4 in the testing topology. It should be b_r1, b_r2. Also b_r1 mtu is 1400 and b_r2 mtu is 1500. Fixes: e44e428f59e4 ("selftests: pmtu: add basic IPv4 and IPv6 PMTU tests") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20hv_netvsc: Fix unwanted rx_table resetHaiyang Zhang3-6/+11
In existing code, the receive indirection table, rx_table, is in struct rndis_device, which will be reset when changing MTU, ringparam, etc. User configured receive indirection table values will be lost. To fix this, move rx_table to struct net_device_context, and check netif_is_rxfh_configured(), so rx_table will be set to default only if no user configured value. Fixes: ff4a44199012 ("netvsc: allow get/set of RSS indirection table") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: phy: ensure that phy IDs are correctly typedRussell King2-5/+5
PHY IDs are 32-bit unsigned quantities. Ensure that they are always treated as such, and not passed around as "int"s. Fixes: 13d0ab6750b2 ("net: phy: check return code when requesting PHY driver module") Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20mod_devicetable: fix PHY module formatRussell King1-2/+2
When a PHY is probed, if the top bit is set, we end up requesting a module with the string "mdio:-10101110000000100101000101010001" - the top bit is printed to a signed -1 value. This leads to the module not being loaded. Fix the module format string and the macro generating the values for it to ensure that we only print unsigned types and the top bit is always 0/1. We correctly end up with "mdio:10101110000000100101000101010001". Fixes: 8626d3b43280 ("phylib: Support phy module autoloading") Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20qede: Disable hardware gro when xdp prog is installedManish Chopra1-2/+2
commit 18c602dee472 ("qede: Use NETIF_F_GRO_HW.") introduced a regression in driver that when xdp program is installed on qede device, device's aggregation feature (hardware GRO) is not getting disabled, which is unexpected with xdp. Fixes: 18c602dee472 ("qede: Use NETIF_F_GRO_HW.") Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: ena: fix issues in setting interrupt moderation params in ethtoolArthur Kiyanovski1-14/+10
Issue 1: -------- Reproduction steps: 1. sudo ethtool -C eth0 rx-usecs 128 2. sudo ethtool -C eth0 adaptive-rx on 3. sudo ethtool -C eth0 adaptive-rx off 4. ethtool -c eth0 expected output: rx-usecs 128 actual output: rx-usecs 0 Reason for issue: In stage 3, ethtool userspace calls first the ena_get_coalesce() handler to get the current value of all properties, and then the ena_set_coalesce() handler. When ena_get_coalesce() is called the adaptive interrupt moderation is still on. There is an if in the code that returns the rx_coalesce_usecs only if the adaptive interrupt moderation is off. And since it is still on, rx_coalesce_usecs is not set, meaning it stays 0. Solution to issue: Remove this if static interrupt moderation intervals have nothing to do with dynamic ones. Issue 2: -------- Reproduction steps: 1. sudo ethtool -C eth0 adaptive-rx on 2. sudo ethtool -C eth0 rx-usecs 128 3. ethtool -c eth0 expected output: rx-usecs 128 actual output: rx-usecs 0 Reason for issue: In stage 2, when ena_set_coalesce() is called, the handler tests if rx adaptive interrupt moderation is on, and if it is, it returns before getting to the part in the function that sets the rx non-adaptive interrupt moderation interval. Solution to issue: Remove the return from the function when rx adaptive interrupt moderation is on. Also cleaned up the fixed code in ena_set_coalesce by grouping together adaptive interrupt moderation toggling, and using && instead of nested ifs. Fixes: b3db86dc4b82 ("net: ena: reimplement set/get_coalesce()") Fixes: 0eda847953d8 ("net: ena: fix retrieval of nonadaptive interrupt moderation intervals") Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: ena: fix default tx interrupt moderation intervalArthur Kiyanovski1-1/+1
Current default non-adaptive tx interrupt moderation interval is 196 us. This value is too high and might cause the tx queue to fill up. In this commit we set the default non-adaptive tx interrupt moderation interval to 64 us in order to: 1. Reduce the probability of the queue filling-up (when compared to the current default value of 196 us). 2. Reduce unnecessary tx interrupt overhead (which happens if we set the default tx interval to 0). We determined experimentally that 64 us is an optimal value that reduces interrupt rate by more than 20% without affecting performance. Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net/smc: unregister ib devices in reboot_eventKarsten Graul1-1/+1
In the reboot_event handler, unregister the ib devices and enable the IB layer to release the devices before the reboot. Fixes: a33a803cfe64 ("net/smc: guarantee removal of link groups in reboot") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: stmmac: platform: Fix MDIO init for platforms without PHYPadmanabhan Rajanbabu1-1/+1
The current implementation of "stmmac_dt_phy" function initializes the MDIO platform bus data, even in the absence of PHY. This fix will skip MDIO initialization if there is no PHY present. Fixes: 7437127 ("net: stmmac: Convert to phylink and remove phylib logic") Acked-by: Jayati Sahu <jayati.sahu@samsung.com> Signed-off-by: Sriram Dash <sriram.dash@samsung.com> Signed-off-by: Padmanabhan Rajanbabu <p.rajanbabu@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)Chan Shu Tak, Alex1-2/+2
When a frame with NULL DSAP is received, llc_station_rcv is called. In turn, llc_stat_ev_rx_null_dsap_xid_c is called to check if it is a NULL XID frame. The return statement of llc_stat_ev_rx_null_dsap_xid_c returns 1 when the incoming frame is not a NULL XID frame and 0 otherwise. Hence, a NULL XID response is returned unexpectedly, e.g. when the incoming frame is a NULL TEST command. To fix the error, simply remove the conditional operator. A similar error in llc_stat_ev_rx_null_dsap_test_c is also fixed. Signed-off-by: Chan Shu Tak, Alex <alexchan@task.com.hk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: hisilicon: Fix a BUG trigered by wrong bytes_complJiangfeng Xiao1-1/+1
When doing stress test, we get the following trace: kernel BUG at lib/dynamic_queue_limits.c:26! Internal error: Oops - BUG: 0 [#1] SMP ARM Modules linked in: hip04_eth CPU: 0 PID: 2003 Comm: tDblStackPcap0 Tainted: G O L 4.4.197 #1 Hardware name: Hisilicon A15 task: c3637668 task.stack: de3bc000 PC is at dql_completed+0x18/0x154 LR is at hip04_tx_reclaim+0x110/0x174 [hip04_eth] pc : [<c041abfc>] lr : [<bf0003a8>] psr: 800f0313 sp : de3bdc2c ip : 00000000 fp : c020fb10 r10: 00000000 r9 : c39b4224 r8 : 00000001 r7 : 00000046 r6 : c39b4000 r5 : 0078f392 r4 : 0078f392 r3 : 00000047 r2 : 00000000 r1 : 00000046 r0 : df5d5c80 Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user Control: 32c5387d Table: 1e189b80 DAC: 55555555 Process tDblStackPcap0 (pid: 2003, stack limit = 0xde3bc190) Stack: (0xde3bdc2c to 0xde3be000) [<c041abfc>] (dql_completed) from [<bf0003a8>] (hip04_tx_reclaim+0x110/0x174 [hip04_eth]) [<bf0003a8>] (hip04_tx_reclaim [hip04_eth]) from [<bf0012c0>] (hip04_rx_poll+0x20/0x388 [hip04_eth]) [<bf0012c0>] (hip04_rx_poll [hip04_eth]) from [<c04c8d9c>] (net_rx_action+0x120/0x374) [<c04c8d9c>] (net_rx_action) from [<c021eaf4>] (__do_softirq+0x218/0x318) [<c021eaf4>] (__do_softirq) from [<c021eea0>] (irq_exit+0x88/0xac) [<c021eea0>] (irq_exit) from [<c0240130>] (msa_irq_exit+0x11c/0x1d4) [<c0240130>] (msa_irq_exit) from [<c0267ba8>] (__handle_domain_irq+0x110/0x148) [<c0267ba8>] (__handle_domain_irq) from [<c0201588>] (gic_handle_irq+0xd4/0x118) [<c0201588>] (gic_handle_irq) from [<c0558360>] (__irq_svc+0x40/0x58) Exception stack(0xde3bdde0 to 0xde3bde28) dde0: 00000000 00008001 c3637668 00000000 00000000 a00f0213 dd3627a0 c0af6380 de00: c086d380 a00f0213 c0a22a50 de3bde6c 00000002 de3bde30 c0558138 c055813c de20: 600f0213 ffffffff [<c0558360>] (__irq_svc) from [<c055813c>] (_raw_spin_unlock_irqrestore+0x44/0x54) Kernel panic - not syncing: Fatal exception in interrupt Pre-modification code: int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev) { [...] [1] priv->tx_head = TX_NEXT(tx_head); [2] count++; [3] netdev_sent_queue(ndev, skb->len); [...] } An rx interrupt occurs if hip04_mac_start_xmit just executes to the line 2, tx_head has been updated, but corresponding 'skb->len' has not been added to dql_queue. And then hip04_mac_interrupt->__napi_schedule->hip04_rx_poll->hip04_tx_reclaim In hip04_tx_reclaim, because tx_head has been updated, bytes_compl will plus an additional "skb-> len" which has not been added to dql_queue. And then trigger the BUG_ON(bytes_compl > num_queued - dql->num_completed). To solve the problem described above, we put "netdev_sent_queue(ndev, skb->len);" before "priv->tx_head = TX_NEXT(tx_head);" Fixes: a41ea46a9a12 ("net: hisilicon: new hip04 ethernet driver") Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: dsa: ksz: use common define for tag lenMichael Grzeschik1-5/+3
Remove special taglen define KSZ8795_INGRESS_TAG_LEN and use generic KSZ_INGRESS_TAG_LEN instead. Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20s390/qeth: don't return -ENOTSUPP to userspaceJulian Wiedmann1-1/+1
ENOTSUPP is not uapi, use EOPNOTSUPP instead. Fixes: d66cb37e9664 ("qeth: Add new priority queueing options") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20s390/qeth: fix promiscuous mode after resetJulian Wiedmann3-1/+4
When managing the promiscuous mode during an RX modeset, qeth caches the current HW state to avoid repeated programming of the same state on each modeset. But while tearing down a device, we forget to clear the cached state. So when the device is later set online again, the initial RX modeset doesn't program the promiscuous mode since we believe it is already enabled. Fix this by clearing the cached state in the tear-down path. Note that for the SBP variant of promiscuous mode, this accidentally works right now because we unconditionally restore the SBP role while re-initializing. Fixes: 4a71df50047f ("qeth: new qeth device driver") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20s390/qeth: handle error due to unsupported transport modeJulian Wiedmann2-7/+12
Along with z/VM NICs, there's additional device types that only support a specific transport mode (eg. external-bridged IQD). Identify the corresponding error code, and raise a fitting error message so that the user knows to adjust their device configuration. On top of that also fix the subsequent error path, so that the rejected cmd doesn't need to wait for a timeout but gets cancelled straight away. Fixes: 4a71df50047f ("qeth: new qeth device driver") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20sbitmap: only queue kyber's wait callback if not already activeDavid Jeffery1-1/+1
Under heavy loads where the kyber I/O scheduler hits the token limits for its scheduling domains, kyber can become stuck. When active requests complete, kyber may not be woken up leaving the I/O requests in kyber stuck. This stuck state is due to a race condition with kyber and the sbitmap functions it uses to run a callback when enough requests have completed. The running of a sbt_wait callback can race with the attempt to insert the sbt_wait. Since sbitmap_del_wait_queue removes the sbt_wait from the list first then sets the sbq field to NULL, kyber can see the item as not on a list but the call to sbitmap_add_wait_queue will see sbq as non-NULL. This results in the sbt_wait being inserted onto the wait list but ws_active doesn't get incremented. So the sbitmap queue does not know there is a waiter on a wait list. Since sbitmap doesn't think there is a waiter, kyber may never be informed that there are domain tokens available and the I/O never advances. With the sbt_wait on a wait list, kyber believes it has an active waiter so cannot insert a new waiter when reaching the domain's full state. This race can be fixed by only adding the sbt_wait to the queue if the sbq field is NULL. If sbq is not NULL, there is already an action active which will trigger the re-running of kyber. Let it run and add the sbt_wait to the wait list if still needing to wait. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Jeffery <djeffery@redhat.com> Reported-by: John Pittman <jpittman@redhat.com> Tested-by: John Pittman <jpittman@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>