aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2019-06-14udp: Remove unused parameter (exact_dif)Tim Beale2-12/+11
Originally this was used by the VRF logic in compute_score(), but that was later replaced by udp_sk_bound_dev_eq() and the parameter became unused. Note this change adds an 'unused variable' compiler warning that will be removed in the next patch (I've split the removal in two to make review slightly easier). Signed-off-by: Tim Beale <timbeale@catalyst.net.nz> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14ipv4: tcp: fix ACK/RST sent with a transmit delayEric Dumazet5-11/+19
If we want to set a EDT time for the skb we want to send via ip_send_unicast_reply(), we have to pass a new parameter and initialize ipc.sockc.transmit_time with it. This fixes the EDT time for ACK/RST packets sent on behalf of a TIME_WAIT socket. Fixes: a842fe1425cb ("tcp: add optional per socket transmit delay") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: remove empty netlink_tap_exit_netLi RongQing1-5/+0
Pointer members of an object with static storage duration, if not explicitly initialized, will be initialized to a NULL pointer. The net namespace API checks if this pointer is not NULL before using it, it are safe to remove the function. Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14Merge branch 'nfp-flower-loosen-L4-checks-and-add-extack-to-flower-offload'David S. Miller6-103/+286
Jakub Kicinski says: ==================== nfp: flower: loosen L4 checks and add extack to flower offload Pieter says: This set allows the offload of filters that make use of an unknown ip protocol, given that layer 4 is being wildcarded. The set then aims to make use of extack messaging for flower offloads. It adds about 70 extack messages to the driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14nfp: flower: extend extack messaging for flower match and actionsPieter Jansen van Vuuren6-77/+196
Use extack messages in flower offload when compiling match and actions messages that will configure hardware. Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14nfp: flower: use extack messages in flower offloadPieter Jansen van Vuuren1-25/+80
Use extack messages in flower offload, specifically focusing on the extack use in add offload, remove offload and get stats paths. Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14nfp: flower: check L4 matches on unknown IP protocolsPieter Jansen van Vuuren1-2/+11
Matching on fields with a protocol that is unknown to hardware is not strictly unsupported. Determine if hardware can offload a filter with an unknown protocol by checking if any L4 fields are being matched as well. Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14Merge tag 'mlx5-updates-2019-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linuxDavid S. Miller19-144/+1516
Saeed Mahameed says: ==================== mlx5-updates-2019-06-13 Mlx5 devlink health fw reporters and sw reset support This series provides mlx5 firmware reset support and firmware devlink health reporters. 1) Add initial mlx5 kernel documentation and include devlink health reporters 2) Add CR-Space access and FW Crdump snapshot support via devlink region_snapshot 3) Issue software reset upon FW asserts 4) Add fw and fw_fatal devlink heath reporters to follow fw errors indication by dump and recover procedures and enable trigger these functionality by user. 4.1) fw reporter: The fw reporter implements diagnose and dump callbacks. It follows symptoms of fw error such as fw syndrome by triggering fw core dump and storing it and any other fw trace into the dump buffer. The fw reporter diagnose command can be triggered any time by the user to check current fw status. 4.2) fw_fatal repoter: The fw_fatal reporter implements dump and recover callbacks. It follows fatal errors indications by CR-space dump and recover flow. The CR-space dump uses vsc interface which is valid even if the FW command interface is not functional, which is the case in most FW fatal errors. The CR-space dump is stored as a memory region snapshot to ease read by address. The recover function runs recover flow which reloads the driver and triggers fw reset if needed. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14ipv4: Support multipath hashing on inner IP pkts for GRE tunnelStephen Suryaputra3-1/+19
Multipath hash policy value of 0 isn't distributing since the outer IP dest and src aren't varied eventhough the inner ones are. Since the flow is on the inner ones in the case of tunneled traffic, hashing on them is desired. This is done mainly for IP over GRE, hence only tested for that. But anything else supported by flow dissection should work. v2: Use skb_flow_dissect_flow_keys() directly so that other tunneling can be supported through flow dissection (per Nikolay Aleksandrov). v3: Remove accidental inclusion of ports in the hash keys and clarify the documentation (Nikolay Alexandrov). Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14virtio_net: enable napi_tx by defaultWillem de Bruijn1-1/+1
NAPI tx mode improves TCP behavior by enabling TCP small queues (TSQ). TSQ reduces queuing ("bufferbloat") and burstiness. Previous measurements have shown significant improvement for TCP_STREAM style workloads. Such as those in commit 86a5df1495cc ("Merge branch 'virtio-net-tx-napi'"). There has been uncertainty about smaller possible regressions in latency due to increased reliance on tx interrupts. The above results did not show that, nor did I observe this when rerunning TCP_RR on Linux 5.1 this week on a pair of guests in the same rack. This may be subject to other settings, notably interrupt coalescing. In the unlikely case of regression, we have landed a credible runtime solution. Ethtool can configure it with -C tx-frames [0|1] as of commit 0c465be183c7 ("virtio_net: ethtool tx napi configuration"). NAPI tx mode has been the default in Google Container-Optimized OS (COS) for over half a year, as of release M70 in October 2018, without any negative reports. Link: https://marc.info/?l=linux-netdev&m=149305618416472 Link: https://lwn.net/Articles/507065/ Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: sched: ingress: set 'unlocked' flag for clsact Qdisc opsVlad Buslov1-0/+1
To remove rtnl lock dependency in tc filter update API when using clsact Qdisc, set QDISC_CLASS_OPS_DOIT_UNLOCKED flag in clsact Qdisc_class_ops. Clsact Qdisc ops don't require any modifications to be used without rtnl lock on tc filter update path. Implementation never changes its q->block and only releases it when Qdisc is being destroyed. This means it is enough for RTM_{NEWTFILTER|DELTFILTER|GETTFILTER} message handlers to hold clsact Qdisc reference while using it without relying on rtnl lock protection. Unlocked Qdisc ops support is already implemented in filter update path by unlocked cls API patch set. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14Merge branch 'enable-and-use-static_branch_deferred_inc'David S. Miller2-3/+4
Willem de Bruijn says: ==================== enable and use static_branch_deferred_inc 1. make static_branch_deferred_inc available if !CONFIG_JUMP_LABEL 2. convert the existing STATIC_KEY_DEFERRED_FALSE user to this api ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14tcp: use static_branch_deferred_inc for clean_acked_data_enabledWillem de Bruijn1-1/+1
Deferred static key clean_acked_data_enabled uses the deferred variants of dec and flush. Do the same for inc. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14locking/static_key: always define static_branch_deferred_incWillem de Bruijn1-2/+3
This interface is currently only defined if CONFIG_JUMP_LABEL. Make it available also when jump labels are off. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14Merge branch 'hns3-next'David S. Miller15-308/+476
Huazhong Tan says: ==================== net: hns3: some code optimizations & cleanups & bugfixes This patch-set includes code optimizations, cleanups and bugfixes for the HNS3 ethernet controller driver. [patch 1/12 - 6/12] adds some code optimizations and bugfixes about RAS and MSI-X HW error. [patch 7/12] fixes a loading issue. [patch 8/12 - 11/12] adds some bugfixes. [patch 12/12] adds some cleanups, which does not change the logic of code. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: some variable modificationWeihang Li12-61/+95
This patch does following things: 1. add the keyword const before some variables which won't be modified in functions. 2. changes some variables from signed to unsigned to avoid bitwise operation on signed variables. 3. adds or removes initialization of some variables. 4. defines a new structure to help parsing mailbox messages instead of using an array which is harder to get the meaning of each element. Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: delay ring buffer clearing during resetYunsheng Lin1-5/+14
The driver may not be able to disable the ring through firmware when downing the netdev during reset process, which may cause hardware accessing freed buffer problem. This patch delays the ring buffer clearing to reset uninit process because hardware will not access the ring buffer after hardware reset is completed. Fixes: bb6b94a896d4 ("net: hns3: Add reset interface implementation in client") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: fix for skb leak when doing selftestYunsheng Lin1-2/+4
If hns3_nic_net_xmit does not return NETDEV_TX_BUSY when doing a loopback selftest, the skb is not freed in hns3_clean_tx_ring or hns3_nic_net_xmit, which causes skb not freed problem. This patch fixes it by freeing skb when hns3_nic_net_xmit does not return NETDEV_TX_OK. Fixes: c39c4d98dc65 ("net: hns3: Add mac loopback selftest support in hns3 driver") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: fix for dereferencing before null checkingYunsheng Lin1-2/+5
The netdev is dereferenced before null checking in the function hns3_setup_tc. This patch moves the dereferencing after the null checking. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: free irq when exit from abnormal branchYonglong Liu1-0/+1
In hns3_nic_init_irq(), if request irq fail at index i, the function return directly without releasing irq resources that already requested, and nowhere else will release them. Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: clear restting state when initializing HW devicePeng Li1-0/+18
IMP will set restting state for all function when PF FLR, driver just clear the restting state in resetting progress, but don't do it in initializing progress. As FLR is not created by driver, it is necessary to clear restting state when initializing HW device. Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: extract handling of mpf/pf msi-x errors into functionsWeihang Li1-39/+83
Function hclge_handle_all_hw_msix_error() contains four parts: 1. Query buffer descriptors for MSI-X errors. 2. Query and clear all main PF MSI-X errors. 3. Query and clear all PF MSI-X errors. 4. Handle mac tunnel interrupts. Part 2 and part 3 handle errors of some different modules respectively, this patch extracts them into dividual functions, which makes the logic clearer. Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: some changes of MSI-X bits in PPU(RCB)Weihang Li2-4/+3
This patch modifies print message of rx_q_search_miss from error to dfx to prevent misleading users, because this interrupt may occur if we receive packets during initialization of HNS3 driver. Otherwise, this patch masks 28th bit of PPU_MPF_ABNORMAL_SRC2 which is now meaningless. Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: add recovery for the H/W errors occurred before the HNS dev initializationShiju Jose1-2/+14
This patch adds the recovery for the HNS H/W errors which occurred before the driver initialization. Reported-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: process H/W errors occurred before HNS dev initializationShiju Jose3-8/+107
Presently the HNS driver enables the HNS H/W error interrupts after the dev initialization is completed. However some exceptions such as NCSI errors can occur when the network port driver is not loaded and those errors required reporting to the BMC. Therefore the firmware enabled all the HNS ras error interrupts before the driver is loaded. And in some cases, there will be some H/W errors remained unclear before reboot. Thus the HNS driver needs to process and recover those hw errors occurred before HNS driver is initialized. This patch adds processing of the HNS hw errors(RAS and MSI-X) which occurred before the driver initialization. For RAS, because they are enabled by firmware, so we can detect specific bits, then log and clear them. But for MSI-X which can not be enabled before open vector0 irq, we can't detect the specific error bits, so we just write 1 to all interrupt source registers to clear. Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: fix avoid unnecessary resetting for the H/W errors which do not require resetShiju Jose1-171/+109
HNS does not need to be reset when errors occur in some bits. However presently the HNAE3_FUNC_RESET is set in this case and as a result the default_reset is done when these errors are reported. This patch fix this issue. Also patch does some optimization in setting the reset level for the error recovery. Reported-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: hns3: delay setting of reset level for hw errors until slot_reset is calledShiju Jose4-42/+51
Presently the error handling code sets the reset level required for the recovery of the hw errors to the reset framework in the error_detected AER callback. However the rest_event would be called later from the slot_reset callback. This can cause issue of using the wrong reset_level if a high priority reset request occur before the slot_reset is called. This patch delays setting of the reset level, required for the hw errors, to the reset framework until the slot_reset is called. Reported-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14Merge branch 'qed-iWARP-fixes'David S. Miller1-10/+39
Michal Kalderon says: ==================== qed: iWARP fixes This series contains a few small fixes related to iWARP. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14qed: iWARP - Fix default window size to be based on chipMichal Kalderon1-5/+25
The default window size is calculated for best performance based on internal hw buffer sizes. The size differs between the different chips and modes. Fixes: 67b40dccc45f ("qed: Implement iWARP initialization, teardown and qp operations") Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14qed: iWARP - Fix tc for MPA ll2 connectionMichal Kalderon1-0/+2
The driver needs to assign a lossless traffic class for the MPA ll2 connection to ensure no packets are dropped when returning from the driver as they will never be re-transmitted by the peer. Fixes: ae3488ff37dc ("qed: Add ll2 connection for processing unaligned MPA packets") Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14qed: iWARP - fix uninitialized callbackMichal Kalderon1-0/+1
Fix uninitialized variable warning by static checker. Fixes: ae3488ff37dc ("qed: Add ll2 connection for processing unaligned MPA packets") Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14qed: iWARP - Use READ_ONCE and smp_store_release to access ep->stateMichal Kalderon1-5/+11
Destroy QP waits for it's ep object state to be set to CLOSED before proceeding. ep->state can be updated from a different context. Add smp_store_release/READ_ONCE to synchronize. Fixes: fc4c6065e661 ("qed: iWARP implement disconnect flows") Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: phy: sfp: clean up a conditionDan Carpenter1-1/+1
The acpi_node_get_property_reference() doesn't return ACPI error codes, it just returns regular negative kernel error codes. This patch doesn't affect run time, it's just a clean up. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Ruslan Babayev <ruslan@babayev.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14vsock: correct removal of socket from the listSunil Muthuswamy1-31/+7
The current vsock code for removal of socket from the list is both subject to race and inefficient. It takes the lock, checks whether the socket is in the list, drops the lock and if the socket was on the list, deletes it from the list. This is subject to race because as soon as the lock is dropped once it is checked for presence, that condition cannot be relied upon for any decision. It is also inefficient because if the socket is present in the list, it takes the lock twice. Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14Merge branch 'nfp-add-two-user-friendly-errors'David S. Miller2-1/+10
Jakub Kicinski says: ==================== nfp: add two user friendly errors This small series adds two error messages based on recent bug reports which turned out not to be bugs.. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14nfp: print a warning when binding VFs to PF driverJakub Kicinski1-0/+4
Users sometimes mistakenly try to manually bind the PF driver to the VFs, print a warning message in that case. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14nfp: update the old flash error messageJakub Kicinski1-1/+6
Apparently there are still cards in the wild with a very old management FW. Let's make the error message in that case indicate more clearly that management firmware has to be updated. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14Merge branch 'Microchip-KSZ-driver-enhancements'David S. Miller4-0/+72
Robert Hancock says: ==================== Microchip KSZ driver enhancements A couple of enhancements to the Microchip KSZ switch driver: one to add PHY register settings for errata workarounds for more stable operation, and another to add a device tree option to change the output clock rate as required by some board designs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: dsa: microchip: Support optional 125MHz SYNCLKO outputRobert Hancock4-0/+9
The KSZ9477 series chips have a SYNCLKO pin which by default outputs a 25MHz clock, but some board setups require a 125MHz clock instead. Added a microchip,synclko-125 device tree property to allow indicating a 125MHz clock output is required. Signed-off-by: Robert Hancock <hancock@sedsystems.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: dsa: microchip: Add PHY errata workaroundsRobert Hancock2-0/+63
The Silicon Errata and Data Sheet Clarification documents for the KSZ9477 series of chips describe a number of otherwise undocumented PHY register settings which are required to work around various chip errata. Apply these settings when initializing the PHY ports on these chips. Signed-off-by: Robert Hancock <hancock@sedsystems.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: stmmac: use GPIO descriptors in stmmac_mdio_resetMartin Blumenstingl2-15/+14
Switch stmmac_mdio_reset to use GPIO descriptors. GPIO core handles the "snps,reset-gpio" for GPIO descriptors so we don't need to take care of it inside the driver anymore. The advantage of this is that we now preserve the GPIO flags which are passed via devicetree. This is required on some newer Amlogic boards which use an Open Drain pin for the reset GPIO. This pin can only output a LOW signal or switch to input mode but it cannot output a HIGH signal. There are already devicetree bindings for these special cases and GPIO core already takes care of them but only if we use GPIO descriptors instead of GPIO numbers. Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14Merge branch 'packet-DDOS'David S. Miller2-41/+56
Eric Dumazet says: ==================== net/packet: better behavior under DDOS Using tcpdump (or other af_packet user) on a busy host can lead to catastrophic consequences, because suddenly, potentially all cpus are spinning on a contended spinlock. Both packet_rcv() and tpacket_rcv() grab the spinlock to eventually find there is no room for an additional packet. This patch series align packet_rcv() and tpacket_rcv() to both check if the queue is full before grabbing the spinlock. If the queue is full, they both increment a new atomic counter placed on a separate cache line to let readers drain the queue faster. There is still false sharing on this new atomic counter, we might in the future make it per cpu if there is interest. ==================== Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net/packet: introduce packet_rcv_try_clear_pressure() helperEric Dumazet1-4/+9
There are two places where we want to clear the pressure if possible, add a helper to make it more obvious. Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Willem de Bruijn <willemb@google.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net/packet: remove locking from packet_rcv_has_room()Eric Dumazet1-11/+9
__packet_rcv_has_room() can now be run without lock being held. po->pressure is only a non persistent hint, we can mark all read/write accesses with READ_ONCE()/WRITE_ONCE() to document the fact that the field could be written without any synchronization. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net/packet: implement shortcut in tpacket_rcv()Eric Dumazet1-0/+6
tpacket_rcv() can be hit under DDOS quite hard, since it will always grab a socket spinlock, to eventually find there is no room for an additional packet. Using tcpdump [1] on a busy host can lead to catastrophic consequences, because of all cpus spinning on a contended spinlock. This replicates a similar strategy used in packet_rcv() [1] Also some applications mistakenly use af_packet socket bound to ETH_P_ALL only to send packets. Receive queue is never drained and immediately full. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net/packet: make tp_drops atomicEric Dumazet2-9/+12
Under DDOS, we want to be able to increment tp_drops without touching the spinlock. This will help readers to drain the receive queue slightly faster :/ Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net/packet: constify __packet_rcv_has_room()Eric Dumazet1-5/+8
Goal is use the helper without lock being held. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net/packet: constify prb_lookup_block() and __tpacket_v3_has_room()Eric Dumazet1-7/+7
Goal is to be able to use __tpacket_v3_has_room() without holding a lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net/packet: constify packet_lookup_frame() and __tpacket_has_room()Eric Dumazet1-7/+7
Goal is to be able to use __tpacket_has_room() without holding a lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net/packet: constify __packet_get_status() argumentEric Dumazet1-1/+1
struct packet_sock is only read. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>