aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2018-12-04tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWATEric Dumazet3-8/+22
TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12 as a step to enable bigger tcp sndbuf limits. It works reasonably well, but the following happens : Once the limit is reached, TCP stack generates an [E]POLLOUT event for every incoming ACK packet. This causes a high number of context switches. This patch implements the strategy David Miller added in sock_def_write_space() : - If TCP socket has a notsent_lowat constraint of X bytes, allow sendmsg() to fill up to X bytes, but send [E]POLLOUT only if number of notsent bytes is below X/2 This considerably reduces TCP_NOTSENT_LOWAT overhead, while allowing to keep the pipe full. Tested: 100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM A:/# cat /proc/sys/net/ipv4/tcp_wmem 4096 262144 64000000 A:/# super_netperf 100 -H B -l 1000 -- -K bbr & A:/# grep TCP /proc/net/sockstat TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/ A:/# vmstat 5 5 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 0 0 0 256220672 13532 694976 0 0 10 0 28 14 0 1 99 0 0 2 0 0 256320016 13532 698480 0 0 512 0 715901 5927 0 10 90 0 0 0 0 0 256197232 13532 700992 0 0 735 13 771161 5849 0 11 89 0 0 1 0 0 256233824 13532 703320 0 0 512 23 719650 6635 0 11 89 0 0 2 0 0 256226880 13532 705780 0 0 642 4 775650 6009 0 12 88 0 0 A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat A:/# grep TCP /proc/net/sockstat TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow A:/# vmstat 5 5 # check that context switches have not inflated too much. procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 2 0 0 260386512 13592 662148 0 0 10 0 17 14 0 1 99 0 0 0 0 0 260519680 13592 604184 0 0 512 13 726843 12424 0 10 90 0 0 1 1 0 260435424 13592 598360 0 0 512 25 764645 12925 0 10 90 0 0 1 0 0 260855392 13592 578380 0 0 512 7 722943 13624 0 11 88 0 0 1 0 0 260445008 13592 601176 0 0 614 34 772288 14317 0 10 90 0 0 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04Merge branch 'act_tunnel_key-support-key-less-tunnels'David S. Miller1-11/+14
Or Gerlitz says: ==================== net/sched: act_tunnel_key: support key-less tunnels This short series from Adi Nissim allows to support key-less tunnels by the tc tunnel key actions, which is needed for some GRE use-cases. changes from V0: - addresses build warning spotted by kbuild, make sure to always init to zero the tunnel key ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04net/sched: act_tunnel_key: Don't dump dst port if it wasn't setAdi Nissim1-1/+3
It's possible to set a tunnel without a destination port. However, on dump(), a zero dst port is returned to user space even if it was not set, fix that. Note that so far it wasn't required, b/c key less tunnels were not supported and the UDP tunnels do require destination port. Signed-off-by: Adi Nissim <adin@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04net/sched: act_tunnel_key: Allow key-less tunnelsAdi Nissim1-10/+11
Allow setting a tunnel without a tunnel key. This is required for tunneling protocols, such as GRE, that define the key as an optional field. Signed-off-by: Adi Nissim <adin@mellanox.com> Acked-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04qed: fix spelling mistake "Dispalying" -> "Displaying"Colin Ian King1-1/+1
There is a spelling mistake in a DP_NOTICE message, fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04Merge branch 'mlxsw-Add-one-armed-router-support'David S. Miller8-9/+287
Ido Schimmel says: ==================== mlxsw: Add one-armed router support Up until now, when a packet was routed by the ASIC through the same router interface (RIF) from which it ingressed from, the ASIC passed the sole copy of the packet to the kernel. This allowed the kernel to route the packet and also potentially generate an ICMP redirect. There are scenarios (e.g., "one-armed router") where packets are intentionally routed this way and are therefore not deemed as exceptions. In such scenarios the current method of trapping packets to the CPU is problematic, as it results in major packet loss. This patchset solves the problem by having the ASIC forward the packet, but also send a copy to the CPU, which gives the kernel the opportunity to generate required exceptions. To prevent the kernel from forwarding such packets again, the driver marks them with 'offload_l3_fwd_mark', which causes the kernel to consume them in ip{,6}_forward_finish(). Patch #1 renames 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark'. When set, the field indicates that a packet was already forwarded in L3 (unicast / multicast) by a capable device. Patch #2 teaches the kernel to consume unicast packets that have 'offload_l3_fwd_mark' set. Patch #3 changes mlxsw to mirror loopbacked (iRIF == eRIF) packets, instead of trapping them. Patch #4 adds a test case for above mentioned scenario. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04selftests: mlxsw: Add one-armed router testIdo Schimmel1-0/+259
Construct a "one-armed router" topology and test that packets are forwarded by the ASIC and that a copy of the packet is sent to the kernel, which does not forward the packet again. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04mlxsw: spectrum: Mirror loopbacked packets instead of trapping themIdo Schimmel2-1/+4
When the ASIC detects that a unicast packet is routed through the same router interface (RIF) from which it ingressed (iRIF == eRIF), it raises a trap called loopback error (LBERROR). Thus far, this trap was configured to send a sole copy of the packet to the CPU so that ICMP redirect packets could be potentially generated by the kernel. This is problematic as the CPU cannot forward packets at 3.2Tb/s and there are scenarios (e.g., "one-armed router") where iRIF == eRIF is not an exception. Solve this by changing the trap to send a copy of the packet to the CPU. To prevent the kernel from forwarding the packet again, it is marked with 'offload_l3_fwd_mark'. The trap is configured in a trap group of its own with a dedicated policer in order not to prevent packets trapped by other traps from reaching the CPU. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04net: Do not route unicast IP packets twiceIdo Schimmel2-0/+14
Packets marked with 'offload_l3_fwd_mark' were already forwarded by a capable device and should not be forwarded again by the kernel. Therefore, have the kernel consume them. The check is performed in ip{,6}_forward_finish() in order to allow the kernel to process such packets in ip{,6}_forward() and generate required exceptions. For example, ICMP redirects. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04skbuff: Rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark'Ido Schimmel4-8/+10
Commit abf4bb6b63d0 ("skbuff: Add the offload_mr_fwd_mark field") added the 'offload_mr_fwd_mark' field to indicate that a packet has already undergone L3 multicast routing by a capable device. The field is used to prevent the kernel from forwarding a packet through a netdev through which the device has already forwarded the packet. Currently, no unicast packet is routed by both the device and the kernel, but this is about to change by subsequent patches and we need to be able to mark such packets, so that they will no be forwarded twice. Instead of adding yet another field to 'struct sk_buff', we can just rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet either has a multicast or a unicast destination IP. While at it, add a comment about both 'offload_fwd_mark' and 'offload_l3_fwd_mark'. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net: marvell: convert to DEFINE_SHOW_ATTRIBUTEYangtao Li2-26/+2
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net: qca_spi: convert to DEFINE_SHOW_ATTRIBUTEYangtao Li1-14/+2
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Acked-by: Stefan Wahren <stefan.wahren@i2se.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net: stmmac: convert to DEFINE_SHOW_ATTRIBUTEYangtao Li1-30/+4
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03nfp: convert to DEFINE_SHOW_ATTRIBUTEYangtao Li1-34/+8
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03Merge branch 'mlx4_core-cleanups'David S. Miller1-9/+8
Tariq Toukan says: ==================== mlx4_core cleanups This patchset by Erez contains cleanups to the mlx4_core driver. Patch 1 replaces -EINVAL with -EOPNOTSUPP for unsupported operations. Patch 2 fixes some coding style issues. Series generated against net-next commit: 97e6c858a26e net: usb: aqc111: Initialize wol_cfg with memset in aqc111_suspend ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net/mlx4_core: Fix several coding style errorsErez Alfasi1-3/+3
Fix 3 checkpatch errors within mlx4/main.c: - Unnecessary mlx4_debug_level global variable initialization to 0. - Prohibited space before comma. - Whitespaces instead of tab. Signed-off-by: Erez Alfasi <ereza@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net/mlx4_core: Fix return codes of unsupported operationsErez Alfasi1-6/+5
Functions __set_port_type and mlx4_check_port_params returned -EINVAL while the proper return code is -EOPNOTSUPP as a result of an unsupported operation. All drivers should generate this and all users should check for it when detecting an unsupported functionality. Signed-off-by: Erez Alfasi <ereza@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net: phy: Also request modules for C45 IDsJose Abreu1-1/+15
Logic of phy_device_create() requests PHY modules according to PHY ID but for C45 PHYs we use different field for the IDs. Let's also request the modules for these IDs. Changes from v1: - Only request C22 modules if C45 are not present (Andrew) Signed-off-by: Jose Abreu <joabreu@synopsys.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Joao Pinto <joao.pinto@synopsys.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03Merge branch 'octeontx2-next'David S. Miller10-184/+1095
Jerin Jacob says: ==================== octeontx2-af: NIX and NPC enhancements This patchset is a continuation to earlier submitted four patch series to add a new driver for Marvell's OcteonTX2 SOC's Resource virtualization unit (RVU) admin function driver. 1. octeontx2-af: Add RVU Admin Function driver https://www.spinics.net/lists/netdev/msg528272.html 2. octeontx2-af: NPA and NIX blocks initialization https://www.spinics.net/lists/netdev/msg529163.html 3. octeontx2-af: NPC parser and NIX blocks initialization https://www.spinics.net/lists/netdev/msg530252.html 4. octeontx2-af: NPC MCAM support and FLR handling https://www.spinics.net/lists/netdev/msg534392.html This patch series adds support for below NPC block: - Add NPC(mkex) profile support for various Key extraction configurations NIX block: - Enable dynamic RSS flow key algorithm configuration - Enhancements on Rx checksum and error checks - Add support for Tx packet marking support - TL1 schedule queue allocation enhancements - Add LSO format configuration mbox - VLAN TPID configuration - Skip multicast entry init for broadcast tables v2: - Rename FLOW_* to NIX_FLOW_* to avoid serious global namespace collisions, as we have various FLOW_* definitions coming from include/uapi/linux/pkt_cls.h for example.(David Miller) - Pack the arguments of rvu_get_tl1_schqs() function as 80 columns allows.(David Miller) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Enable mkex profileVamsi Attunuru8-2/+213
The following set of NPC registers allow the driver to configure NPC to generate different key value schemes to compare against packet payload in MCAM search. NPC_AF_INTF(0..1)_KEX_CFG NPC_AF_KEX_LDATA(0..1)_FLAGS_CFG NPC_AF_INTF(0..1)_LID(0..7)_LT(0..15)_LD(0..1)_CFG NPC_AF_INTF(0..1)_LDATA(0..1)_FLAGS(0..15)_CFG Currently, the AF driver populates these registers to configure the default values to address the most common use cases such as key generation for channel number + DMAC. The secure firmware stores different configuration value of these registers to enable different NPC use case along with the name for the lookup. Patch loads profile binary from secure firmware over the exiting CGX mailbox interface and apply the profile. AF driver shall fall back to the default configuration in case of any errors. The AF consumer driver can know the selected profile on response to NPC_GET_KEX_CFG mailbox by introducing mkex_pfl_name in the struct npc_get_kex_cfg_rsp. Signed-off-by: Vamsi Attunuru <vamsi.attunuru@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Add LSO format configuration mailboxNithin Dabilpuram3-7/+91
NIX_AF_LSO_FORMAT(0..31)_FIELD(0..7) register enables an SW defined means to define LSO packet modification formats. 0..31 works as an index to choose the algorithm, On success, the mailbox returns the index to the client of chosen LSO algorithm selection. This index will be used in configuring the transmit descriptors. Add mailbox interface to dynamically reserve and configure LSO format. This commit also fixes 'sizem1' for NIX_LSOALG_TCP_FLAGS to '1' i.e 2 Bytes. Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Add L3 and L4 packet verification mailboxVidhya Raman3-0/+55
Adds mailbox support for L4 checksum verification and L3 and L4 length verification configuration. Signed-off-by: Vidhya Raman <vraman@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Configure VLAN TPIDsNithin Dabilpuram1-0/+7
Setup TPID's for vlan0 and vlan1 for Tx VLAN insertion offloads. Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Add support for Tx packet markingKrzysztof Kanas3-0/+133
NIX_AF_MARK_FORMAT(0..127)_CTL register enables an SW defined means to mark/insert various data in the packet based on final packet color from traffic shaping HW. 0..127 works as an index to choose the algorithm. On success, the mailbox returns the index to the client. Add NIX_MARK_FORMAT_CFG mailbox which reserves mark format based on tuple (offset, y_mask, y_val, r_mask, r_val) If the tuple is requested again for mark format that was already reserved, then it will be reused. If not it will reserve a new entry if space is available. Also on AF init commonly used marker format such as VLAN DEI, IPv4 ECN, IPv4 DSCP are reserved for AF consumers. Signed-off-by: Krzysztof Kanas <kkanas@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Enable RSS with promiscuous modeVamsi Attunuru1-4/+31
This patch adds support for enabling RSS in promiscuous mode if RSS is already requested by the AF client. Signed-off-by: Vamsi Attunuru <vamsi.attunuru@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Define all NIX_AF_RX_DEF_* registersJerin Jacob1-5/+20
In order to support all NIX specific valid length errors and checksum errors on Rx, Update all NIX_AF_RX_DEF_* registers. Also sorted all registers in HRM definition order. Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Enable inner IPv4 checksum and its error codeJerin Jacob1-1/+8
This patch enables the inner IPv4 checksum and defines the error code for Rx inner and outer checksum errors. Setting ERRCODE as 1 so that CQE descriptor can be embedded valid checksum error code and the driver can interpret checksum error as ERRLEV = LID + 1 and ERRCODE = 1. Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Allow freeing single TLx Tx schedule queueNithin Dabilpuram1-1/+71
The default behavior was to free all the TLx Tx schedule queues. This patch adds support for freeing a single Tx schedule queue if TXSCHQ_FREE_ALL flag is not set. Signed-off-by: Krzysztof Kanas <kkanas@marvell.com> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Restrict TL1 allocation and configurationNithin Dabilpuram3-18/+218
TL1 is the root node in the scheduling hierarchy and it is a global resource with a limited number. This patch introduces restriction and validation on the allocation of the TL1 nodes for the effective resource sharing across the AF consumers. - Limit TL1 allocation to 2 per lmac. One could be for the normal link and one for IEEE802.3br express link (Express Send DMA). Effectively all the VF's of an RVU PF(lmac) share the two TL1 schqs. - TL1 cannot be freed once allocated. - Allow VF's to only apply default config to TL1 if not already applied. PF's can always overwrite the TL1 config. - Consider NIX_AQ_INSTOP_WRITE while validating txschq when sq.ena is set. Signed-off-by: Krzysztof Kanas <kkanas@marvell.com> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Add support for runtime RSS algo index reservationJerin Jacob4-94/+119
Introduced reserve_flowkey_alg_idx()to reserve RSS algorithm index, it would internally use set_flowkey_fields() to generate fields based on the flow key dynamically. On AF driver init, it would reserve a predefined set RSS algo indexes, which will be available all the time for all the AF driver consumers. The leftover algo indexes can be reserved at runtime through exiting nix_rss_flowkey_cfg mailbox message. The NIX_FLOW_KEY_TYPE_PORT is removed from predefined a set of RSS flow type as it is not used by any consumer. Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Add support for dynamic flow cfg to RSS field generationJerin Jacob2-27/+87
Introduce state-based algorithm to convert the flow_key value to RSS algo field used by NIX_AF_RX_FLOW_KEY_ALGX_FIELDX register. The outer `for loop` goes over _all_ protocol field and the following variables depict the state machine forward progress logic. a) keyoff_marker - Enabled when hash byte length needs to be accounted in field->key_offset update. b) field_marker - Enabled when a new field needs to be selected. c) group_member - Enabled when a protocol is part of a group. This would remove the existing hard coding and enable to add new protocol support seamlessly. Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Add response for RSS flow key cfg messageJerin Jacob4-73/+86
Added response for nix_rss_flowkey_cfg message to return selected RSS algorithm index. The FLOW_KEY_TYPE* definition is part of the mbox message and it will be used by the other consumers of AF driver hence moving to mbox.h. Also renamed FLOW_* definitions to NIX_FLOW_* to avoid global name space collisions, as we have various coming from include/uapi/linux/pkt_cls.h for example. Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03octeontx2-af: Skip NIXLF check for bcast MCE entrySunil Goutham2-6/+10
At the time of initial broadcast packet replication table init, NIXLFs are not yet attached to PF_FUNCs. Hence skipped checking NIXLF while submitting MCE entry init instruction to NIX admin queue. Also did a minor cleanup while installing bcast match entry in packet parser unit i.e NPC. Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03Merge branch 'udp-msg_zerocopy'David S. Miller9-27/+90
Willem de Bruijn says: ==================== udp msg_zerocopy Enable MSG_ZEROCOPY for udp sockets Patch 1/3 is the main patch, a rework of RFC patch http://patchwork.ozlabs.org/patch/899630/ more details in the patch commit message Patch 2/3 is an optimization to remove a branch from the UDP hot path and refcount_inc/refcount_dec_and_test pair when zerocopy is used. This used to be included in the first patch in v2. Patch 3/3 runs the already existing udp zerocopy tests as part of kselftest See also recent Linux Plumbers presentation https://linuxplumbersconf.org/event/2/contributions/106/attachments/104/128/willemdebruijn-lpc2018-udpgso-presentation-20181113.pdf Changes: v1 -> v2 - Fixup reverse christmas tree violation v2 -> v3 - Split refcount avoidance optimization into separate patch - Fix refcount leak on error in fragmented case (thanks to Paolo Abeni for pointing this one out!) - Fix refcount inc on zero v3 -> v4 - Move skb_zcopy_set below the only kfree_skb that might cause a premature uarg destroy before skb_zerocopy_put_abort - Move the entire skb_shinfo assignment block, to keep that cacheline access in one place ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03selftests: extend zerocopy tests to udpWillem de Bruijn3-1/+7
Both msg_zerocopy and udpgso_bench have udp zerocopy variants. Exercise these as part of the standard kselftest run. With udp, msg_zerocopy has no control channel. Ensure that the receiver exits after the sender by accounting for the initial delay in starting them (in msg_zerocopy.sh). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03udp: elide zerocopy operation in hot pathWillem de Bruijn5-31/+36
With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info. Release of its last reference triggers a completion notification. The TCP stack in tcp_sendmsg_locked holds an extra ref independent of the skbs, because it can build, send and free skbs within its loop, possibly reaching refcount zero and freeing the ubuf_info too soon. The UDP stack currently also takes this extra ref, but does not need it as all skbs are sent after return from __ip(6)_append_data. Avoid the extra refcount_inc and refcount_dec_and_test, and generally the sock_zerocopy_put in the common path, by passing the initial reference to the first skb. This approach is taken instead of initializing the refcount to 0, as that would generate error "refcount_t: increment on 0" on the next skb_zcopy_set. Changes v3 -> v4 - Move skb_zcopy_set below the only kfree_skb that might cause a premature uarg destroy before skb_zerocopy_put_abort - Move the entire skb_shinfo assignment block, to keep that cacheline access in one place Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03udp: msg_zerocopyWillem de Bruijn5-3/+55
Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and interpret flag MSG_ZEROCOPY. This patch was previously part of the zerocopy RFC patchsets. Zerocopy is not effective at small MTU. With segmentation offload building larger datagrams, the benefit of page flipping outweights the cost of generating a completion notification. tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on test patch and making skb_orphan_frags_rx same as skb_orphan_frags: ipv4 udp -t 1 tx=191312 (11938 MB) txc=0 zc=n rx=191312 (11938 MB) ipv4 udp -z -t 1 tx=304507 (19002 MB) txc=304507 zc=y rx=304507 (19002 MB) ok ipv6 udp -t 1 tx=174485 (10888 MB) txc=0 zc=n rx=174485 (10888 MB) ipv6 udp -z -t 1 tx=294801 (18396 MB) txc=294801 zc=y rx=294801 (18396 MB) ok Changes v1 -> v2 - Fixup reverse christmas tree violation v2 -> v3 - Split refcount avoidance optimization into separate patch - Fix refcount leak on error in fragmented case (thanks to Paolo Abeni for pointing this one out!) - Fix refcount inc on zero - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data. This is needed since commit 5cf4a8532c99 ("tcp: really ignore MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03Merge tag 'wireless-drivers-next-for-davem-2018-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-nextDavid S. Miller131-1144/+4620
Kalle Valo says: ==================== wireless-drivers-next patches for 4.21 First set of patches for 4.21. Most notable here is support for Quantenna's QSR1000/QSR2000 chipsets and more flexible ways to provide nvram files for brcmfmac. Major changes: brcmfmac * add support for first trying to get a board specific nvram file * add support for getting nvram contents from EFI variables qtnfmac * use single PCIe driver for all platforms and rename Kconfig option CONFIG_QTNFMAC_PEARL_PCIE to CONFIG_QTNFMAC_PCIE * add support for QSR1000/QSR2000 (Topaz) family of chipsets ath10k * add support for WCN3990 firmware crash recovery * add firmware memory dump support for QCA4019 wil6210 * add firmware error recovery while in AP mode ath9k * remove experimental notice from dynack feature iwlwifi * PCI IDs for some new 9000-series cards * improve antenna usage on connection problems * new firmware debugging infrastructure * some more work on 802.11ax * improve support for multiple RF modules with 22000 devices cordic * move cordic macros and defines to a public header file * convert brcmsmac and b43 to fully use cordic library ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03Merge branch 'davinci_emac-read-the-MAC-address-from-nvmem'David S. Miller6-51/+49
Bartosz Golaszewski says: ==================== davinci_emac: read the MAC address from nvmem This series is part of a bigger series that aims at removing the platform data structure from the at24 EEPROM driver[1]. We provide a generalized version of of_get_nvmem_mac_address(), switch the only user of the of_ variant to using it, remove the previous implementation and use the new routine in the davinci_emac driver. [1] https://lkml.org/lkml/2018/11/13/884 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net: davinci_emac: use nvmem_get_mac_address()Bartosz Golaszewski1-5/+9
All DaVinci boards still supported in board files now define nvmem cells containing the MAC address. We want to stop using the setup callback from at24 so the MAC address for those users will no longer be provided over platform data. If we didn't get a valid MAC in pdata, try nvmem before resorting to a random MAC. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03of: net: kill of_get_nvmem_mac_address()Bartosz Golaszewski2-45/+0
We've switched all users to nvmem_get_mac_address(). Remove the now dead code. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net: cadence: switch to using nvmem_get_mac_address()Bartosz Golaszewski1-1/+1
We now have a generalized helper routine to read the MAC address from nvmem which takes struct device as argument. The nvmem subsystem will then try device tree first before all other potential providers. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net: ethernet: provide nvmem_get_mac_address()Bartosz Golaszewski2-0/+39
We already have of_get_nvmem_mac_address() but some non-DT systems want to read the MAC address from NVMEM too. Implement a generalized routine that takes struct device as argument. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03rhashtable: detect when object movement between tables might have invalidated a lookupNeilBrown2-11/+31
Some users of rhashtables might need to move an object from one table to another - this appears to be the reason for the incomplete usage of NULLS markers. To support these, we store a unique NULLS_MARKER at the end of each chain, and when a search fails to find a match, we check if the NULLS marker found was the expected one. If not, the search may not have examined all objects in the target bucket, so it is repeated. The unique NULLS_MARKER is derived from the address of the head of the chain. As this cannot be derived at load-time the static rhnull in rht_bucket_nested() needs to be initialised at run time. Any caller of a lookup function must still be prepared for the possibility that the object returned is in a different table - it might have been there for some time. Note that this does NOT provide support for other uses of NULLS_MARKERs such as allocating with SLAB_TYPESAFE_BY_RCU or changing the key of an object and re-inserting it in the same table. These could only be done safely if new objects were inserted at the *start* of a hash chain, and that is not currently the case. Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03Merge branch 'hns3-ethtool-dump'David S. Miller4-5/+345
Salil Mehta says: ==================== Adds VF/PF PCIe reg dump(ethtool -d) support to HNS3 driver This patchset adds VF/PF PCIe register dump support to HNS3 VF and PF driver using "ethtool -d" command. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net: hns3: Adds support to dump(using ethool-d) PCIe regs in HNS3 PF driverJian Shen2-5/+171
This patch adds support to dump PF PCIe registers using ethtool -d for HNS3 PF Driver. Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net: hns3: Support "ethtool -d" for HNS3 VF driverJian Shen2-0/+174
This patch adds "ethtool -d" support for HNS3 VF Driver. Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03net: phy: improve generic EEE ethtool functionsHeiner Kallweit1-5/+10
So far the two functions consider neither member eee_enabled nor eee_active. Therefore network drivers have to do this in some kind of glue code. I think this can be avoided. Getting EEE parameters: When not advertising any EEE mode, we can't consider EEE to be enabled. Therefore interpret "EEE enabled" as "we advertise at least one EEE mode". It's similar with "EEE active": interpret it as "EEE modes advertised by both link partner have at least one mode in common". Setting EEE parameters: If eee_enabled isn't set, don't advertise any EEE mode and restart aneg if needed to switch off EEE. If eee_enabled is set and data->advertised is empty (e.g. because EEE was disabled), advertise everything we support as default. This way EEE can easily switched on/off by doing ethtool --set-eee <if> eee on/off, w/o any additional parameters. The changes to both functions shouldn't break any existing user. Once the changes have been applied, at least some users can be simplified. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03Merge branch 'VXLAN-underlay-VRF'David S. Miller8-9/+228
Alexis Bauvin says: ==================== net: Add VRF support for VXLAN underlay v6 -> v7: - proper locking for device in udp_tunnel following Sabrina Dubroca's advice v5 -> v6: - remove automatic rebinding patch following Roopa Prabhu's advice v4 -> v5: - move test script to its own patch (6/6) - add schematic for test script - apply David Ahern comments to the test script v3 -> v4: - rename vxlan_is_in_l3mdev_chain to netdev_is_upper master - move it to net/core/dev.c - make it return bool instead of int - check if remote_ifindex is zero before resolving the l3mdev - add testing script v2 -> v3: - fix build when CONFIG_NET_IPV6 is off - fix build "unused l3mdev_master_upper_ifindex_by_index" build error with some configs v1 -> v2: - move vxlan_get_l3mdev from vxlan driver to l3mdev driver as l3mdev_master_upper_ifindex_by_index - vxlan: rename variables named l3mdev_ifindex to ifindex v0 -> v1: - fix typos We are trying to isolate the VXLAN traffic from different VMs with VRF as shown in the schemas below: +-------------------------+ +----------------------------+ | +----------+ | | +------------+ | | | | | | | | | | | tap-red | | | | tap-blue | | | | | | | | | | | +----+-----+ | | +-----+------+ | | | | | | | | | | | | | | +----+---+ | | +----+----+ | | | | | | | | | | | br-red | | | | br-blue | | | | | | | | | | | +----+---+ | | +----+----+ | | | | | | | | | | | | | | | | | | | | +----+--------+ | | +--------------+ | | | | | | | | | | | vxlan-red | | | | vxlan-blue | | | | | | | | | | | +------+------+ | | +-------+------+ | | | | | | | | | VRF | | | VRF | | | red | | | blue | +-------------------------+ +----------------------------+ | | | | +---------------------------------------------------------+ | | | | | | | | | | +--------------+ | | | | | | | | | +---------+ eth0.2030 +---------+ | | | 10.0.0.1/24 | | | +-----+--------+ VRF | | | green| +---------------------------------------------------------+ | | +----+---+ | | | eth0 | | | +--------+ iproute2 commands to reproduce the setup: ip link add green type vrf table 1 ip link set green up ip link add eth0.2030 link eth0 type vlan id 2030 ip link set eth0.2030 master green ip addr add 10.0.0.1/24 dev eth0.2030 ip link set eth0.2030 up ip link add blue type vrf table 2 ip link set blue up ip link add br-blue type bridge ip link set br-blue master blue ip link set br-blue up ip link add vxlan-blue type vxlan id 2 local 10.0.0.1 dev eth0.2030 \ port 4789 ip link set vxlan-blue master br-blue ip link set vxlan-blue up ip link set tap-blue master br-blue ip link set tap-blue up ip link add red type vrf table 3 ip link set red up ip link add br-red type bridge ip link set br-red master red ip link set br-red up ip link add vxlan-red type vxlan id 3 local 10.0.0.1 dev eth0.2030 \ port 4789 ip link set vxlan-red master br-red ip link set vxlan-red up ip link set tap-red master br-red ip link set tap-red up We faced some issue in the datapath, here are the details: * Egress traffic: The vxlan packets are sent directly to the default VRF because it's where the socket is bound, therefore the traffic has a default route via eth0. the workaround is to force this traffic to VRF green with ip rules. * Ingress traffic: When receiving the traffic on eth0.2030 the vxlan socket is unreachable from VRF green. The workaround is to enable *udp_l3mdev_accept* sysctl, but this breaks isolation between overlay and underlay: packets sent from blue or red by e.g. a guest VM will be accepted by the socket, allowing injection of VXLAN packets from the overlay. This patch series fixes the issues describe above by allowing VXLAN socket to be bound to a specific VRF device therefore looking up in the correct table. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03test/net: Add script for VXLAN underlay in a VRFAlexis Bauvin2-1/+130
This script tests the support of a VXLAN underlay in a non-default VRF. It does so by simulating two hypervisors and two VMs, an extended L2 between the VMs with the hypervisors as VTEPs with the underlay in a VRF, and finally by pinging the two VMs. It also tests that moving the underlay from a VRF to another works when down/up the VXLAN interface. Signed-off-by: Alexis Bauvin <abauvin@scaleway.com> Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com> Reviewed-by: David Ahern <dsahern@gmail.com> Tested-by: Amine Kherbouche <akherbouche@scaleway.com> Signed-off-by: David S. Miller <davem@davemloft.net>