linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2018-09-23	ice: remove ndo_poll_controller	Eric Dumazet	1	-27/+0
	As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous on loaded hosts, since the cpu calling ndo_poll_controller() might steal all NAPI contexts (for all RX/TX queues of the NIC). This capture can last for unlimited amount of time, since one cpu is generally not able to drain all the queues under load. ice uses NAPI for TX completions, so we better let core networking stack call the napi->poll() to avoid the capture. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23	igb: remove ndo_poll_controller	Eric Dumazet	1	-30/+0
	As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous on loaded hosts, since the cpu calling ndo_poll_controller() might steal all NAPI contexts (for all RX/TX queues of the NIC). This capture can last for unlimited amount of time, since one cpu is generally not able to drain all the queues under load. igb uses NAPI for TX completions, so we better let core networking stack call the napi->poll() to avoid the capture. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23	ixgb: remove ndo_poll_controller	Eric Dumazet	1	-25/+0
	As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous on loaded hosts, since the cpu calling ndo_poll_controller() might steal all NAPI contexts (for all RX/TX queues of the NIC). This capture can last for unlimited amount of time, since one cpu is generally not able to drain all the queues under load. ixgb uses NAPI for TX completions, so we better let core networking stack call the napi->poll() to avoid the capture. This also removes a problematic use of disable_irq() in a context it is forbidden, as explained in commit af3e0fcf7887 ("8139too: Use disable_irq_nosync() in rtl8139_poll_controller()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23	fm10k: remove ndo_poll_controller	Eric Dumazet	3	-28/+0
	As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous on loaded hosts, since the cpu calling ndo_poll_controller() might steal all NAPI contexts (for all RX/TX queues of the NIC). This capture lasts for unlimited amount of time, since one cpu is generally not able to drain all the queues under load. fm10k uses NAPI for TX completions, so we better let core networking stack call the napi->poll() to avoid the capture. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23	ixgbevf: remove ndo_poll_controller	Eric Dumazet	1	-21/+0
	As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous on loaded hosts, since the cpu calling ndo_poll_controller() might steal all NAPI contexts (for all RX/TX queues of the NIC). This capture can last for unlimited amount of time, since one cpu is generally not able to drain all the queues under load. ixgbevf uses NAPI for TX completions, so we better let core networking stack call the napi->poll() to avoid the capture. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23	ixgbe: remove ndo_poll_controller	Eric Dumazet	1	-25/+0
	As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous on loaded hosts, since the cpu calling ndo_poll_controller() might steal all NAPI contexts (for all RX/TX queues of the NIC). This capture can last for unlimited amount of time, since one cpu is generally not able to drain all the queues under load. ixgbe uses NAPI for TX completions, so we better let core networking stack call the napi->poll() to avoid the capture. Reported-by: Song Liu <songliubraving@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Song Liu <songliubraving@fb.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23	bonding: use netpoll_poll_dev() helper	Eric Dumazet	1	-9/+2
	We want to allow NAPI drivers to no longer provide ndo_poll_controller() method, as it has been proven problematic. team driver must not look at its presence, but instead call netpoll_poll_dev() which factorize the needed actions. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23	netpoll: make ndo_poll_controller() optional	Eric Dumazet	2	-14/+10
	As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous on loaded hosts, since the cpu calling ndo_poll_controller() might steal all NAPI contexts (for all RX/TX queues of the NIC). This capture can last for unlimited amount of time, since one cpu is generally not able to drain all the queues under load. It seems that all networking drivers that do use NAPI for their TX completions, should not provide a ndo_poll_controller(). NAPI drivers have netpoll support already handled in core networking stack, since netpoll_poll_dev() uses poll_napi(dev) to iterate through registered NAPI contexts for a device. This patch allows netpoll_poll_dev() to process NAPI contexts even for drivers not providing ndo_poll_controller(), allowing for following patches in NAPI drivers. Also we export netpoll_poll_dev() so that it can be called by bonding/team drivers in following patches. Reported-by: Song Liu <songliubraving@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23	mlxsw: Make MLXSW_SP1_FWREV_MINOR a hard requirement	Petr Machata	1	-1/+4
	Up until now, mlxsw tolerated firmware versions that weren't exactly matching the required version, if the branch number matched. That allowed the users to test various firmware versions as long as they were on the right branch. On the other hand, it made it impossible for mlxsw to put a hard lower bound on a version that fixes all problems known to date. If a user had a somewhat older FW version installed, mlxsw would start up just fine, possibly performing non-optimally as it would use features that trigger problematic behavior. Therefore tweak the check to accept any FW version that is: - on the same branch as the preferred version, and - the same as or newer than the preferred version. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23	rds: Fix build regression.	David S. Miller	1	-1/+1
	Use DECLARE_* not DEFINE_* Fixes: 8360ed6745df ("RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats") Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23	Linux 4.19-rc5	Greg Kroah-Hartman	1	-1/+1

2018-09-23	Merge tag 'mfd-fixes-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd	Greg Kroah-Hartman	2	-12/+15
	Lee writes: "MFD fixes for v4.19 - Fix Dialog DA9063 regulator constraints issue causing failure in probe - Fix OMAP Device Tree compatible strings to match DT" * tag 'mfd-fixes-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: mfd: omap-usb-host: Fix dts probe of children mfd: da9063: Fix DT probing with constraints
2018-09-23	Merge tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip	Greg Kroah-Hartman	2	-7/+22
	Juergen writes: "xen: Two small fixes for xen drivers." * tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: issue warning message when out of grant maptrack entries xen/x86/vpmu: Zero struct pt_regs before calling into sample handling code
2018-09-23	Merge tag 'for-linus-20180922' of git://git.kernel.dk/linux-block	Greg Kroah-Hartman	5	-11/+12
	Jens writes: "Just a single fix in this pull request, fixing a regression in /proc/diskstats caused by the unification of timestamps." * tag 'for-linus-20180922' of git://git.kernel.dk/linux-block: block: use nanosecond resolution for iostat
2018-09-23	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	Greg Kroah-Hartman	16	-39/+245
	Thomas writes: "A set of fixes for x86: - Resolve the kvmclock regression on AMD systems with memory encryption enabled. The rework of the kvmclock memory allocation during early boot results in encrypted storage, which is not shareable with the hypervisor. Create a new section for this data which is mapped unencrypted and take care that the later allocations for shared kvmclock memory is unencrypted as well. - Fix the build regression in the paravirt code introduced by the recent spectre v2 updates. - Ensure that the initial static page tables cover the fixmap space correctly so early console always works. This worked so far by chance, but recent modifications to the fixmap layout can - depending on kernel configuration - move the relevant entries to a different place which is not covered by the initial static page tables. - Address the regressions and issues which got introduced with the recent extensions to the Intel Recource Director Technology code. - Update maintainer entries to document reality" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Expand static page table for fixmap space MAINTAINERS: Add X86 MM entry x86/intel_rdt: Add Reinette as co-maintainer for RDT MAINTAINERS: Add Borislav to the x86 maintainers x86/paravirt: Fix some warning messages x86/intel_rdt: Fix incorrect loop end condition x86/intel_rdt: Fix exclusive mode handling of MBA resource x86/intel_rdt: Fix incorrect loop end condition x86/intel_rdt: Do not allow pseudo-locking of MBA resource x86/intel_rdt: Fix unchecked MSR access x86/intel_rdt: Fix invalid mode warning when multiple resources are managed x86/intel_rdt: Global closid helper to support future fixes x86/intel_rdt: Fix size reporting of MBA resource x86/intel_rdt: Fix data type in parsing callbacks x86/kvm: Use __bss_decrypted attribute in shared variables x86/mm: Add .bss..decrypted section to hold shared variables
2018-09-23	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	Greg Kroah-Hartman	5	-12/+36
	Thomas writes: "- Provide a strerror_r wrapper so lib/bpf can be built on systems without _GNU_SOURCE - Unbreak the man page generator when building out of tree" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf Documentation: Fix out-of-tree asciidoctor man page generation tools lib bpf: Provide wrapper for strerror_r to build in !_GNU_SOURCE systems
2018-09-23	Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	Greg Kroah-Hartman	1	-3/+6
	Thomas writes: "Make the EFI arm stub device tree loader default on to unbreak existing EFI boot loaders which do not have DTB support." * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi/libstub/arm: default EFI_ARMSTUB_DTB_LOADER to y
2018-09-22	Merge branch 'hv_netvsc-Support-LRO-RSC-in-the-vSwitch'	David S. Miller	5	-39/+194
	Haiyang Zhang says: ==================== hv_netvsc: Support LRO/RSC in the vSwitch The patch adds support for LRO/RSC in the vSwitch feature. It reduces the per packet processing overhead by coalescing multiple TCP segments when possible. The feature is enabled by default on VMs running on Windows Server 2019 and later. The patch set also adds ethtool command handler and documents. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-22	hv_netvsc: Update document for LRO/RSC support	Haiyang Zhang	1	-0/+9
	Update document for LRO/RSC support, and the command line info to change the setting. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-22	hv_netvsc: Add handler for LRO setting change	Haiyang Zhang	3	-3/+42
	This patch adds the handler for LRO setting change, so that a user can use ethtool command to enable / disable LRO feature. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-22	hv_netvsc: Add support for LRO/RSC in the vSwitch	Haiyang Zhang	4	-38/+145
	LRO/RSC in the vSwitch is a feature available in Windows Server 2019 hosts and later. It reduces the per packet processing overhead by coalescing multiple TCP segments when possible. This patch adds netvsc driver support for this feature. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-22	net-ethtool: ETHTOOL_GUFO did not and should not require CAP_NET_ADMIN	Maciej Żenczykowski	1	-0/+1
	So it should not fail with EPERM even though it is no longer implemented... This is a fix for: (userns)$ egrep ^Cap /proc/self/status CapInh: 0000003fffffffff CapPrm: 0000003fffffffff CapEff: 0000003fffffffff CapBnd: 0000003fffffffff CapAmb: 0000003fffffffff (userns)$ tcpdump -i usb_rndis0 tcpdump: WARNING: usb_rndis0: SIOCETHTOOL(ETHTOOL_GUFO) ioctl failed: Operation not permitted Warning: Kernel filter failed: Bad file descriptor tcpdump: can't remove kernel filter: Bad file descriptor With this change it returns EOPNOTSUPP instead of EPERM. See also https://github.com/the-tcpdump-group/libpcap/issues/689 Fixes: 08a00fea6de2 "net: Remove references to NETIF_F_UFO from ethtool." Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	Merge branch 'net-dsa-b53-SGMII-modes-fixes'	David S. Miller	2	-5/+7
	Florian Fainelli says: ==================== net: dsa: b53: SGMII modes fixes Here are two additional fixes that are required in order for SGMII to work correctly. This was discovered with using a copper SFP which would make us use SGMII mode, we would actually leave the HW configured in its default mode: Fiber. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net: dsa: b53: Also include SGMII for mac_config and mac_link_state	Florian Fainelli	1	-4/+6
	In both 802.3z and SGMII modes we need to configure the MAC accordingly to flip between Fiber and SGMII modes, and we need to read the MAC status from the SGMII in-band control word. Fixes: 0e01491de646 ("net: dsa: b53: Add SerDes support") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net: dsa: b53: Fix B53_SERDES_DIGITAL_CONTROL offset	Florian Fainelli	1	-1/+1
	Maths went wrong, to get 0x20, we need to do 0x1e + (x) * 2, not 0x18, fix that offset so we access the correct registers. This would make us not access the correct SerDes Digital control words, status would be fine and so we would not be correctly flipping between Fiber and SGMII modes resulting in incorrect status words being pulled into the SerDes digital status register. Fixes: 0e01491de646 ("net: dsa: b53: Add SerDes support") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net: dsa: b53: Don't assign autonegotiation enabled	Florian Fainelli	1	-4/+1
	PHYLINK takes care of filing the right information into state->an_enabled, get rid of the read from the SerDes's BMCR register. Fixes: 0e01491de646 ("net: dsa: b53: Add SerDes support") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	decnet: Remove unnecessary check for dev->name	Nathan Chancellor	1	-1/+1
	Clang warns that the address of a pointer will always evaluated as true in a boolean context. net/decnet/dn_dev.c:1366:10: warning: address of array 'dev->name' will always evaluate to 'true' [-Wpointer-bool-conversion] dev->name ? dev->name : "???", ~~~~~^~~~ ~ 1 warning generated. Link: https://github.com/ClangBuiltLinux/linux/issues/116 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	selftests/net: add ipv6 tests to ip_defrag selftest	Peter Oskolkov	2	-98/+190
	This patch adds ipv6 defragmentation tests to ip_defrag selftest, to complement existing ipv4 tests. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net	Peter Oskolkov	3	-3/+0
	Currently, ip[6]frag_high_thresh sysctl values in new namespaces are hard-limited to those of the root/init ns. There are at least two use cases when it would be desirable to set the high_thresh values higher in a child namespace vs the global hard limit: - a security/ddos protection policy may lower the thresholds in the root/init ns but allow for a special exception in a child namespace - testing: a test running in a namespace may want to set these thresholds higher in its namespace than what is in the root/init ns The new behavior: # ip netns add testns # ip netns exec testns bash # sysctl -w net.ipv4.ipfrag_high_thresh=9000000 net.ipv4.ipfrag_high_thresh = 9000000 # sysctl net.ipv4.ipfrag_high_thresh net.ipv4.ipfrag_high_thresh = 9000000 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000 net.ipv6.ip6frag_high_thresh = 9000000 # sysctl net.ipv6.ip6frag_high_thresh net.ipv6.ip6frag_high_thresh = 9000000 The old behavior: # ip netns add testns # ip netns exec testns bash # sysctl -w net.ipv4.ipfrag_high_thresh=9000000 net.ipv4.ipfrag_high_thresh = 9000000 # sysctl net.ipv4.ipfrag_high_thresh net.ipv4.ipfrag_high_thresh = 4194304 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000 net.ipv6.ip6frag_high_thresh = 9000000 # sysctl net.ipv6.ip6frag_high_thresh net.ipv6.ip6frag_high_thresh = 4194304 Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	ipv6: discard IP frag queue on more errors	Peter Oskolkov	1	-5/+6
	This is similar to how ipv4 now behaves: commit 0ff89efb5246 ("ip: fail fast on IP defrag errors"). Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats	Nathan Chancellor	1	-1/+1
	Clang warns when two declarations' section attributes don't match. net/rds/ib_stats.c:40:1: warning: section does not match previous declaration [-Wsection] DEFINE_PER_CPU_SHARED_ALIGNED(struct rds_ib_statistics, rds_ib_stats); ^ ./include/linux/percpu-defs.h:142:2: note: expanded from macro 'DEFINE_PER_CPU_SHARED_ALIGNED' DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \ ^ ./include/linux/percpu-defs.h:93:9: note: expanded from macro 'DEFINE_PER_CPU_SECTION' extern __PCPU_ATTRS(sec) __typeof__(type) name; \ ^ ./include/linux/percpu-defs.h:49:26: note: expanded from macro '__PCPU_ATTRS' __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \ ^ net/rds/ib.h:446:1: note: previous attribute is here DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats); ^ ./include/linux/percpu-defs.h:111:2: note: expanded from macro 'DECLARE_PER_CPU' DECLARE_PER_CPU_SECTION(type, name, "") ^ ./include/linux/percpu-defs.h:87:9: note: expanded from macro 'DECLARE_PER_CPU_SECTION' extern __PCPU_ATTRS(sec) __typeof__(type) name ^ ./include/linux/percpu-defs.h:49:26: note: expanded from macro '__PCPU_ATTRS' __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \ ^ 1 warning generated. The initial definition was added in commit ec16227e1414 ("RDS/IB: Infiniband transport") and the cache aligned definition was added in commit e6babe4cc4ce ("RDS/IB: Stats and sysctls") right after. The definition probably should have been updated in net/rds/ib.h, which is what this patch does. Link: https://github.com/ClangBuiltLinux/linux/issues/114 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net/ipv4: avoid compile error in fib_info_nh_uses_dev	Eric Dumazet	1	-1/+1
	net/ipv4/fib_frontend.c: In function 'fib_info_nh_uses_dev': net/ipv4/fib_frontend.c:322:6: error: unused variable 'ret' [-Werror=unused-variable] cc1: all warnings being treated as errors Fixes: 78f2756c5fc0 ("net/ipv4: Move device validation to helper") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	Merge branch 'tcp-switch-to-Early-Departure-Time-model'	David S. Miller	13	-128/+103
	Eric Dumazet says: ==================== tcp: switch to Early Departure Time model In the early days, pacing has been implemented in sch_fq (FQ) in a generic way : - SO_MAX_PACING_RATE could be used by any sockets. - TCP would vary effective pacing rate based on CWND*MSS/SRTT - FQ would ensure delays between packets based on current sk->sk_pacing_rate, but with some quantum based artifacts. (inflating RPC tail latencies) - BBR then tweaked the pacing rate in its various phases (PROBE, DRAIN, ...) This worked reasonably well, but had the side effect that TCP RTT samples would be inflated by the sojourn time of the packets in FQ. Also note that when FQ is not used and TCP wants pacing, the internal pacing fallback has very different behavior, since TCP emits packets at the time they should be sent (with unreasonable assumptions about scheduling costs) Van Jacobson gave a talk at Netdev 0x12 in Montreal, about letting TCP (or applications for UDP messages) decide of the Earliest Departure Time, instead of letting packet schedulers derive it from pacing rate. https://www.netdevconf.org/0x12/session.html?evolving-from-afap-teaching-nics-about-time https://www.files.netdevconf.org/d/46def75c2ef345809bbe/files/?p=/Evolving%20from%20AFAP%20%E2%80%93%20Teaching%20NICs%20about%20time.pdf Recent additions in linux provided SO_TXTIME and a new ETF qdisc supporting the new skb->tstamp role This patch series converts TCP and FQ to the same model. This might in the future allow us to relax tight TSQ limits (if FQ is present in the output path), and thus lower number of callbacks to tcp_write_xmit(), thanks to batching. This will be followed by FQ change allowing SO_TXTIME support so that QUIC servers can let the pacing being done in FQ (or offloaded if network device permits) For example, a TCP flow rated at 24Mbps now shows a more meaningful RTT Before : ESTAB 0 211408 10.246.7.151:41558 10.246.7.152:33723 cubic wscale:8,8 rto:203 rtt:2.195/0.084 mss:1448 rcvmss:536 advmss:1448 cwnd:20 ssthresh:20 bytes_acked:36897937 segs_out:25488 segs_in:12454 data_segs_out:25486 send 105.5Mbps lastsnd:1 lastrcv:12851 lastack:1 pacing_rate 24.0Mbps/24.0Mbps delivery_rate 22.9Mbps busy:12851ms unacked:4 rcv_space:29200 notsent:205616 minrtt:0.026 After : ESTAB 0 192584 10.246.7.151:61612 10.246.7.152:34375 cubic wscale:8,8 rto:201 rtt:0.165/0.129 mss:1448 rcvmss:536 advmss:1448 cwnd:20 ssthresh:20 bytes_acked:170755401 segs_out:117931 segs_in:57651 data_segs_out:117929 send 1404.1Mbps lastsnd:1 lastrcv:56915 lastack:1 pacing_rate 24.0Mbps/24.0Mbps delivery_rate 24.2Mbps busy:56915ms unacked:4 rcv_space:29200 notsent:186792 minrtt:0.054 A nice side effect of this patch series is a reduction of max/p99 latencies of RPC workloads, since the FQ quantum no longer adds artifact. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net_sched: sch_fq: remove dead code dealing with retransmits	Eric Dumazet	1	-53/+5
	With the earliest departure time model, we no longer plan special casing TCP retransmits. We therefore remove dead code (since most compilers understood skb_is_retransmit() was false) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	tcp: switch tcp_internal_pacing() to tcp_wstamp_ns	Eric Dumazet	1	-13/+4
	Now TCP keeps track of tcp_wstamp_ns, recording the earliest departure time of next packet, we can remove duplicate code from tcp_internal_pacing() This removes one ktime_get_tai_ns() call, and a divide. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	tcp: switch tcp and sch_fq to new earliest departure time model	Eric Dumazet	3	-17/+33
	TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq no longer has to do it. Thanks to this model, TCP can get more accurate RTT samples, since pacing no longer inflates them. This has the nice effect of removing some delays caused by FQ quantum mechanism, causing inflated max/P99 latencies. Also we might relax TCP Small Queue tight limits in the future, since this new model allow TCP to build bigger batches, since sch_fq (or a device with earliest departure time offload) ensure these packets will be delivered on time. Note that other protocols are not converted (they will probably never be) so sch_fq has still support for SO_MAX_PACING_RATE Tested: Test showing FQ pacing quantum artifact for low-rate flows, adding unexpected throttles for RPC flows, inflating max and P99 latencies. The parameters chosen here are to show what happens typically when a TCP flow has a reduced pacing rate (this can be caused by a reduced cwin after few losses, or/and rtt above few ms) MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY" Before : $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds 19,82.78,5279,3825,482.02 After : $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds 20,49.94,128,63,3.18 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	tcp: switch internal pacing timer to CLOCK_TAI	Eric Dumazet	2	-2/+2
	Next patch will use tcp_wstamp_ns to feed internal TCP pacing timer, so switch to CLOCK_TAI to share same base. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	tcp: provide earliest departure time in skb->tstamp	Eric Dumazet	6	-14/+13
	Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns, from usec units to nsec units. Do not clear skb->tstamp before entering IP stacks in TX, so that qdisc or devices can implement pacing based on the earliest departure time instead of socket sk->sk_pacing_rate Packets are fed with tcp_wstamp_ns, and following patch will update tcp_wstamp_ns when both TCP and sch_fq switch to the earliest departure time mechanism. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	tcp: add tcp_wstamp_ns socket field	Eric Dumazet	3	-11/+19
	TCP will soon provide earliest departure time on TX skbs. It needs to track this in a new variable. tcp_mstamp_refresh() needs to update this variable, and became too big to stay an inline. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net_sched: sch_fq: switch to CLOCK_TAI	Eric Dumazet	1	-3/+3
	TCP will soon provide per skb->tstamp with earliest departure time, so that sch_fq does not have to determine departure time by looking at socket sk_pacing_rate. We chose in linux-4.19 CLOCK_TAI as the clock base for transports, qdiscs, and NIC offloads. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	tcp: introduce tcp_skb_timestamp_us() helper	Eric Dumazet	6	-17/+26
	There are few places where TCP reads skb->skb_mstamp expecting a value in usec unit. skb->tstamp (aka skb->skb_mstamp) will soon store CLOCK_TAI nsec value. Add tcp_skb_timestamp_us() to provide proper conversion when needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	tcp: switch tcp_clock_ns() to CLOCK_TAI base	Eric Dumazet	1	-1/+1
	TCP pacing is either implemented in sch_fq or internally. We have the goal of being able to offload pacing on the NICS. TCP will soon provide per skb skb->tstamp as early departure time. Like ETF in commit 25db26a91364 ("net/sched: Introduce the ETF Qdisc") we chose CLOCK_T as the clock base, so that TCP and pacers can share a common clock, to get better RTT samples (without pacing artificially inflating these samples). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	Merge branch 'hns3-next'	David S. Miller	9	-107/+120
	Salil Mehta says: ==================== Bug fixes, snall modifications & cleanup for HNS3 driver This patch presents some bug fixes, small modifications and cleanups to the HNS3 VF and PF driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net: hns3: Remove redundant hclge_get_port_type()	Peng Li	2	-23/+0
	This patch removes hclge_get_port_type which is redundant. Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net: hns3: Fix speed/duplex information loss problem when executing ethtool ethx cmd of VF	Fuyun Liang	1	-27/+39
	Our VF has not implemented the ops for get_port_type. So when we executing ethtool ethx cmd of VF, hns3_get_link_ksettings will return directly. And we can not query anything. To support get_link_ksettings for VF, this patch replaces get_port_type with get_media_type. If the media type is HNAE3_MEDIA_TYPE_NONE, hns3_get_link_ksettings will return link information of VF. Fixes: 12f46bc1d447 ("net: hns3: Refine hns3_get_link_ksettings()") Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net: hns3: Add get_media_type ops support for VF	Peng Li	3	-0/+13
	This patch adds the ops of get_media_type support for VF. Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net: hns3: Remove print messages for error packet	Jian Shen	1	-5/+0
	There are already multiple types packets statistics for error packets, it's unnecessary to print them, which may affect the rx performance if print too many. Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net: hns3: Add unlikely for dma_mapping_error check	Jian Shen	1	-1/+1
	For dma_mapping_error is unlikely happened, this patch adds unlikely for dma_mapping_error check. Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net: hns3: Add nic state check before calling netif_tx_wake_queue	Jian Shen	1	-1/+3
	When nic down, it firstly calls netif_tx_stop_all_queues(), then calls napi_disable(). But napi_disable() will wait current napi_poll finish, it may call netif_tx_wake_queue(). This patch fixes it by add nic state checking. Fixes: 424eb834a9be ("net: hns3: Unified HNS3 {VF\|PF} Ethernet Driver for hip08 SoC") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21	net: hns3: Add handle for default case	Jian Shen	5	-6/+20
	There are a few "switch-case" codes missed handle for default case. For some abnormal case, it should return error code instead of return 0. Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>