linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2016-05-04	ixgbevf: Use mac_ops instead of trying to identify NIC type	Alexander Duyck	4	-27/+13
	This change makes it so that we can just use function pointers instead of having to identify if a given VF is running on a Linux or Windows PF. By doing this we can avoid having to pull too much information out of the lower layers and can instead just make use of the mac_ops pointers since they should differ between the two types of VFs anyway. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-04	ixgbevf: Change the relaxed order settings in VF driver for sparc	Babu Moger	1	-0/+6
	We noticed performance issues with VF interface on sparc compared to PF. Setting the RX to IXGBE_DCA_RXCTRL_DATA_WRO_EN brings it on far with PF. Also this matches to the default sparc setting in PF driver. Signed-off-by: Babu Moger <babu.moger@oracle.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-04	ixgbe: Revise populating few registers and macro definitions	Preethi Banala	2	-20/+8
	Revise populating few registers in ixgbe_get_regs() and macro definitions. Before applying patch: $ du -k objs/drivers/net/ethernet/intel/ixgbe/ixgbe.ko 8572 objs/drivers/net/ethernet/intel/ixgbe/ixgbe.ko After applying patch: $ du -k objs/drivers/net/ethernet/intel/ixgbe/ixgbe.ko 8568 objs/drivers/net/ethernet/intel/ixgbe/ixgbe.ko Signed-off-by: Preethi Banala <preethi.banala@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-04	ixgbe: Return 64 bit stats values	Preethi Banala	1	-3/+6
	The code was ignoring higher 32 bits of stats registers. This patch correctly fills out 64 bit value in two 32 bit words. Signed-off-by: Preethi Banala <preethi.banala@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-04	ixgbe: Remove duplicate and unused device ID definitions	Preethi Banala	1	-6/+2
	Remove duplicate and unused device ID definitions. Signed-off-by: Preethi Banala <preethi.banala@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-04	ixgbe: check EEPROM for WOL support for X540 and above	Emil Tantilov	2	-28/+25
	This change aims to simplify the logic we use to determine WOL support by reading the EEPROM bits for MACs X540 and newer. Also some cleanups in ixgbe_wol_supported() - changed return type to bool and removed redundant return variable by simply using return after the checks. Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-04	ixgbe: add WoL support for some 82599 subdevice IDs	Emil Tantilov	2	-3/+11
	We had some 82599 subdevice IDs missing from the list of parts that support WoL. Reported-by: Neil Horman <nhorman@redhat.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-04	ixgbevf: Support Windows hosts (Hyper-V)	KY Srinivasan	5	-7/+263
	On Hyper-V, the VF/PF communication is a via software mediated path as opposed to the hardware mailbox. Make the necessary adjustments to support Hyper-V. Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-04	ixgbevf: Add the device ID's presented while running on Hyper-V	KY Srinivasan	1	-0/+5
	Intel SR-IOV cards present different ID when running on Hyper-V. Add the device IDs presented while running on Hyper-V. Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-04	ixgbe: Match on multiple headers for cls_u32 offloads	Amritha Nambiar	3	-68/+156
	Adds support to set filters with multiple header fields (L3,L4)to match on. This is achieved in the following order: 1. Create a leaf hash table for the next header. 2. Create a link to the leaf hash table from the base hash table with matches on next header type and current header fields. 3. Add filter in leaf hash table with match on next header fields and action. Verified with the following filters : Match TCP and DIP: handle 1: u32 divisor 1 u32 ht 800: order 1 link 1: \ offset at 0 mask 0f00 shift 6 plus 0 eat \ match ip protocol 6 ff match ip dst 10.0.0.1/32 match tcp src 28 ffff action drop Delete the filter: Match on DIP, SIP, UDP (SPort, DPort): handle 2: u32 divisor 1 u32 ht 800: order 2 link 2: \ offset at 0 mask 0f00 shift 6 plus 0 eat \ match ip dst 15.0.0.2/32 match ip protocol 17 ff \ match ip src 15.0.0.1/32 match udp src 30 ffff match udp dst 32 ffff action drop Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-04	ixgbe: Add support for redirect action to cls_u32 offloads	Sridhar Samudrala	1	-13/+83
	This patch enables 'redirect' to a SRIOV VF or a offloaded macvlan device queue via tc 'mirred' action. Verified with the following script that creates SRIOV VFs, offloaded macvlan and adds tc u32 filters with redirect action to the associated netdevs. # add ingress qdisc. tc qdisc add dev p4p1 ingress # enable hw tc offload. ethtool -K p4p1 hw-tc-offload on # create 4 sriov VFs and bring up the first one. echo 4 > /sys/class/net/p4p1/device/sriov_numvfs sleep 1 ip link set p4p1 up ip link set p4p1_0 up # create a offloaded macvlan device and bring it up. ethtool -K p4p1 l2-fwd-offload on ip link add link p4p1 name mvlan_1 type macvlan ip link set mvlan_1 up # add u32 filter with action to redirect to VF netdev tc filter add dev p4p1 parent ffff: protocol ip prio 99 \ handle 800:0:1 u32 ht 800: \ match ip src 192.168.1.3/32 \ action mirred egress redirect dev p4p1_0 # add u32 filter with action to redirect to macvlan netdev tc filter add dev p4p1 parent ffff: protocol ip prio 99 \ handle 800:0:2 u32 ht 800: \ match ip src 192.168.2.3/32 \ action mirred egress redirect dev mvlan_1 Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-04	net_sched: act_mirred: add helper inlines to access tcf_mirred info	Sridhar Samudrala	1	-0/+15
	Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-05-03	ipv6: add new struct ipcm6_cookie	Wei Wang	11	-109/+123
	In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local variables like hlimits, tclass, opt and dontfrag and pass them to corresponding functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames. This is not a good practice and makes it hard to add new parameters. This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in ipv4 and include the above mentioned variables. And we only pass the pointer to this structure to corresponding functions. This makes it easier to add new parameters in the future and makes the function cleaner. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	net: add __sock_wfree() helper	Eric Dumazet	3	-1/+26
	Hosts sending lot of ACK packets exhibit high sock_wfree() cost because of cache line miss to test SOCK_USE_WRITE_QUEUE We could move this flag close to sk_wmem_alloc but it is better to perform the atomic_sub_and_test() on a clean cache line, as it avoid one extra bus transaction. skb_orphan_partial() can also have a fast track for packets that either are TCP acks, or already went through another skb_orphan_partial() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	tipc: redesign connection-level flow control	Jon Paul Maloy	5	-62/+122
	There are two flow control mechanisms in TIPC; one at link level that handles network congestion, burst control, and retransmission, and one at connection level which' only remaining task is to prevent overflow in the receiving socket buffer. In TIPC, the latter task has to be solved end-to-end because messages can not be thrown away once they have been accepted and delivered upwards from the link layer, i.e, we can never permit the receive buffer to overflow. Currently, this algorithm is message based. A counter in the receiving socket keeps track of number of consumed messages, and sends a dedicated acknowledge message back to the sender for each 256 consumed message. A counter at the sending end keeps track of the sent, not yet acknowledged messages, and blocks the sender if this number ever reaches 512 unacknowledged messages. When the missing acknowledge arrives, the socket is then woken up for renewed transmission. This works well for keeping the message flow running, as it almost never happens that a sender socket is blocked this way. A problem with the current mechanism is that it potentially is very memory consuming. Since we don't distinguish between small and large messages, we have to dimension the socket receive buffer according to a worst-case of both. I.e., the window size must be chosen large enough to sustain a reasonable throughput even for the smallest messages, while we must still consider a scenario where all messages are of maximum size. Hence, the current fix window size of 512 messages and a maximum message size of 66k results in a receive buffer of 66 MB when truesize(66k) = 131k is taken into account. It is possible to do much better. This commit introduces an algorithm where we instead use 1024-byte blocks as base unit. This unit, always rounded upwards from the actual message size, is used when we advertise windows as well as when we count and acknowledge transmitted data. The advertised window is based on the configured receive buffer size in such a way that even the worst-case truesize/msgsize ratio always is covered. Since the smallest possible message size (from a flow control viewpoint) now is 1024 bytes, we can safely assume this ratio to be less than four, which is the value we are now using. This way, we have been able to reduce the default receive buffer size from 66 MB to 2 MB with maintained performance. In order to keep this solution backwards compatible, we introduce a new capability bit in the discovery protocol, and use this throughout the message sending/reception path to always select the right unit. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	tipc: propagate peer node capabilities to socket layer	Jon Paul Maloy	3	-2/+22
	During neighbor discovery, nodes advertise their capabilities as a bit map in a dedicated 16-bit field in the discovery message header. This bit map has so far only be stored in the node structure on the peer nodes, but we now see the need to keep a copy even in the socket structure. This commit adds this functionality. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	tipc: re-enable compensation for socket receive buffer double counting	Jon Paul Maloy	1	-1/+1
	In the refactoring commit d570d86497ee ("tipc: enqueue arrived buffers in socket in separate function") we did by accident replace the test if (sk->sk_backlog.len == 0) atomic_set(&tsk->dupl_rcvcnt, 0); with if (sk->sk_backlog.len) atomic_set(&tsk->dupl_rcvcnt, 0); This effectively disables the compensation we have for the double receive buffer accounting that occurs temporarily when buffers are moved from the backlog to the socket receive queue. Until now, this has gone unnoticed because of the large receive buffer limits we are applying, but becomes indispensable when we reduce this buffer limit later in this series. We now fix this by inverting the mentioned condition. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	rtl8152: correct speed testing	Oliver Neukum	1	-1/+2
	Allow for SS+ USB Signed-off-by: Oliver Neukum <ONeukum@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	usbnet: correct speed testing	Oliver Neukum	1	-0/+1
	Allow for SS+ USB Signed-off-by: Oliver Neukum <ONeukum@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	brcm80211: correct speed testing	Oliver Neukum	1	-1/+3
	Allow for SS+ USB Signed-off-by: Oliver Neukum <ONeukum@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	qed: Apply tunnel configurations after PF start	Manish Chopra	1	-1/+9
	Configure and enable various tunnels on the adapter after PF start. This change was missed as a part of 'commit 464f664501816ef5fbbc00b8de96f4ae5a1c9325 ("qed: Add infrastructure support for tunneling")' Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	stmmac: dwmac-socfpga: kill init() and rename setup() to set_phy_mode()	Joachim Eastwood	1	-14/+3
	Remove old init callback which now contains only a call to socfpga_dwmac_setup(). Also rename socfpga_dwmac_setup() to indicate what the function really does. Signed-off-by: Joachim Eastwood <manabian@gmail.com> Tested-by: Marek Vasut <marex@denx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	stmmac: dwmac-socfpga: call phy_resume() only in resume callback	Joachim Eastwood	1	-31/+19
	Calling phy_resume() should only be need during driver resume to workaround a hardware errata. Signed-off-by: Joachim Eastwood <manabian@gmail.com> Tested-by: Marek Vasut <marex@denx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	stmmac: dwmac-socfpga: keep a copy of stmmac_rst in driver priv data	Joachim Eastwood	1	-11/+22
	The dwmac-socfpga driver needs to control the reset usually managed by the core driver to set the PHY mode. Take a copy of the reset handle from core priv data so it can be used by the driver later. This also allow us to move reset handling into socfpga_dwmac_setup() where the code that needs it is located. Signed-off-by: Joachim Eastwood <manabian@gmail.com> Tested-by: Marek Vasut <marex@denx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	stmmac: dwmac-socfpga: add PM ops and resume function	Joachim Eastwood	1	-2/+16
	Implement the needed PM callbacks in the driver instead of relying on the init/exit hooks in stmmac_platform. This gives the driver more flexibility in how the code is organized. Eventually the init/exit callbacks will be deprecated in favor of the standard PM callbacks and driver remove function. Signed-off-by: Joachim Eastwood <manabian@gmail.com> Tested-by: Marek Vasut <marex@denx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	stmmac: let remove/resume/suspend functions take device pointer	Joachim Eastwood	4	-34/+17
	Change stmmac_remove/resume/suspend to take a device pointer so they can be used directly by drivers that doesn't need to perform anything device specific. This lets us remove the PCI pm functions and later simplifiy the platform drivers. Signed-off-by: Joachim Eastwood <manabian@gmail.com> Tested-by: Marek Vasut <marex@denx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	net: ethernet: fec_mpc52xx: move to new ethtool api {get\|set}_link_ksettings	Philippe Reynes	1	-6/+10
	The ethtool api {get\|set}_settings is deprecated. We move the fec_mpc52xx driver to new api {get\|set}_link_ksettings. Signed-off-by: Philippe Reynes <tremyfr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	net: ethernet: fs-enet: move to new ethtool api {get\|set}_link_ksettings	Philippe Reynes	1	-6/+10
	The ethtool api {get\|set}_settings is deprecated. We move the fs-enet driver to new api {get\|set}_link_ksettings. Signed-off-by: Philippe Reynes <tremyfr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	net: ethernet: ucc: move to new ethtool api {get\|set}_link_ksettings	Philippe Reynes	1	-10/+7
	The ethtool api {get\|set}_settings is deprecated. We move the ucc driver to new api {get\|set}_link_ksettings. Signed-off-by: Philippe Reynes <tremyfr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	net: ethernet: gianfar: move to new ethtool api {get\|set}_link_ksettings	Philippe Reynes	1	-17/+8
	The ethtool api {get\|set}_settings is deprecated. We move the gianfar driver to new api {get\|set}_link_ksettings. Signed-off-by: Philippe Reynes <tremyfr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	VSOCK: constify vsock_transport structure	Julia Lawall	1	-1/+1
	The vsock_transport structure is never modified, so declare it as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	drivers: net: xgene: constify xgene_cle_ops structure	Julia Lawall	3	-3/+3
	The xgene_cle_ops structure is never modified, so declare it as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Iyappan Subramanian <isubramanian@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	fq_codel: add batch ability to fq_codel_drop()	Eric Dumazet	2	-19/+46
	In presence of inelastic flows and stress, we can call fq_codel_drop() for every packet entering fq_codel qdisc. fq_codel_drop() is quite expensive, as it does a linear scan of 4 KB of memory to find a fat flow. Once found, it drops the oldest packet of this flow. Instead of dropping a single packet, try to drop 50% of the backlog of this fat flow, with a configurable limit of 64 packets per round. TCA_FQ_CODEL_DROP_BATCH_SIZE is the new attribute to make this limit configurable. With this strategy the 4 KB search is amortized to a single cache line per drop [1], so fq_codel_drop() no longer appears at the top of kernel profile in presence of few inelastic flows. [1] Assuming a 64byte cache line, and 1024 buckets Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dave Taht <dave.taht@gmail.com> Cc: Jonathan Morton <chromatix99@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Dave Taht Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	ravb: Remove rx buffer ALIGN	Kazuya Mizuguchi	1	-6/+4
	Aligning the reception data size is not required. Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com> Signed-off-by: Yoshihiro Kaneko <ykaneko0929@gmail.com> Tested-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03	net: relax expensive skb_unclone() in iptunnel_handle_offloads()	Eric Dumazet	2	-1/+11
	Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP tunnel have to go through an expensive skb_unclone() Reallocating skb->head is a lot of work. Test should really check if a 'real clone' of the packet was done. TCP does not care if the original gso_type is changed while the packet travels in the stack. This adds skb_header_unclone() which is a variant of skb_clone() using skb_header_cloned() check instead of skb_cloned(). This variant can probably be used from other points. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	netdevice: shrink size of struct netdev_queue	Florian Westphal	1	-7/+6
	- trans_timeout is incremented when tx queue timed out (tx watchdog). - tx_maxrate is set via sysfs Moving tx_maxrate to read-mostly part shrinks the struct by 64 bytes. While at it, also move trans_timeout (it is out-of-place in the 'write-mostly' part). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	bridge: netlink: export per-vlan stats	Nikolay Aleksandrov	5	-0/+118
	Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and get_linkxstats_size) in order to export the per-vlan stats. The paddings were added because soon these fields will be needed for per-port per-vlan stats (or something else if someone beats me to it) so avoiding at least a few more netlink attributes. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	bridge: vlan: learn to count	Nikolay Aleksandrov	5	-16/+110
	Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets allocated a per-cpu stats which is then set in each per-port vlan context for quick access. The br_allowed_ingress() common function is used to account for Rx packets and the br_handle_vlan() common function is used to account for Tx packets. Stats accounting is performed only if the bridge-wide vlan_stats_enabled option is set either via sysfs or netlink. A struct hole between vlan_enabled and vlan_proto is used for the new option so it is in the same cache line. Currently it is binary (on/off) but it is intentionally restricted to exactly 0 and 1 since other values will be used in the future for different purposes (e.g. per-port stats). Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	net: rtnetlink: add linkxstats callbacks and attribute	Nikolay Aleksandrov	3	-0/+49
	Add callbacks to calculate the size and fill link extended statistics which can be split into multiple messages and are dumped via the new rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute. Also add that attribute to the idx mask check since it is expected to be able to save state and resume dumping (e.g. future bridge per-vlan stats will be dumped via this attribute and callbacks). Each link type should nest its private attributes under the per-link type attribute. This allows to have any number of separated private attributes and to avoid one call to get the dev link type. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	net: rtnetlink: allow rtnl_fill_statsinfo to save private state counter	Nikolay Aleksandrov	1	-13/+31
	The new prividx argument allows the current dumping device to save a private state counter which would enable it to continue dumping from where it left off. And the idxattr is used to save the current idx user so multiple prividx using attributes can be requested at the same time as suggested by Roopa Prabhu. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	gre6: Cleanup GREv6 transmit path, call common GRE functions	Tom Herbert	1	-202/+50
	Changes in GREv6 transmit path: - Call gre_checksum, remove gre6_checksum - Rename ip6gre_xmit2 to __gre6_xmit - Call gre_build_header utility function - Call ip6_tnl_xmit common function - Call ip6_tnl_change_mtu, eliminate ip6gre_tunnel_change_mtu Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	ipv6: Generic tunnel cleanup	Tom Herbert	2	-3/+9
	A few generic changes to generalize tunnels in IPv6: - Export ip6_tnl_change_mtu so that it can be called by ip6_gre - Add tun_hlen to ip6_tnl structure. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	gre: Create common functions for transmit	Tom Herbert	2	-47/+49
	Create common functions for both IPv4 and IPv6 GRE in transmit. These are put into gre.h. Common functions are for: - GRE checksum calculation. Move gre_checksum to gre.h. - Building a GRE header. Move GRE build_header and rename gre_build_header. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	ipv6: Create ip6_tnl_xmit	Tom Herbert	2	-17/+32
	This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other users like GRE will be able to call this. The original ip6_tnl_xmit function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit function). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	gre6: Cleanup GREv6 receive path, call common GRE functions	Tom Herbert	1	-117/+23
	- Create gre_rcv function. This calls gre_parse_header and ip6gre_rcv. - Call ip6_tnl_rcv. Doing this and using gre_parse_header eliminates most of the code in ip6gre_rcv. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	gre: Move utility functions to common headers	Tom Herbert	3	-129/+144
	Several of the GRE functions defined in net/ipv4/ip_gre.c are usable for IPv6 GRE implementation (that is they are protocol agnostic). These include: - GRE flag handling functions are move to gre.h - GRE build_header is moved to gre.h and renamed gre_build_header - parse_gre_header is moved to gre_demux.c and renamed gre_parse_header - iptunnel_pull_header is taken out of gre_parse_header. This is now done by caller. The header length is returned from gre_parse_header in an int* argument. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	ipv6: Cleanup IPv6 tunnel receive path	Tom Herbert	2	-70/+146
	Some basic changes to make IPv6 tunnel receive path look more like IPv4 path: - Make ip6_tnl_rcv non-static so that GREv6 and others can call it - Make ip6_tnl_rcv look like ip_tunnel_rcv - Switch to gro_cells_receive - Make ip6_tnl_rcv non-static and export it Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	tcp: make tcp_sendmsg() aware of socket backlog	Eric Dumazet	3	-2/+24
	Large sendmsg()/write() hold socket lock for the duration of the call, unless sk->sk_sndbuf limit is hit. This is bad because incoming packets are parked into socket backlog for a long time. Critical decisions like fast retransmit might be delayed. Receivers have to maintain a big out of order queue with additional cpu overhead, and also possible stalls in TX once windows are full. Bidirectional flows are particularly hurt since the backlog can become quite big if the copy from user space triggers IO (page faults) Some applications learnt to use sendmsg() (or sendmmsg()) with small chunks to avoid this issue. Kernel should know better, right ? Add a generic sk_flush_backlog() helper and use it right before a new skb is allocated. Typically we put 64KB of payload per skb (unless MSG_EOR is requested) and checking socket backlog every 64KB gives good results. As a matter of fact, tests with TSO/GSO disabled give very nice results, as we manage to keep a small write queue and smaller perceived rtt. Note that sk_flush_backlog() maintains socket ownership, so is not equivalent to a {release_sock(sk); lock_sock(sk);}, to ensure implicit atomicity rules that sendmsg() was giving to (possibly buggy) applications. In this simple implementation, I chose to not call tcp_release_cb(), but we might consider this later. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	net: do not block BH while processing socket backlog	Eric Dumazet	1	-14/+8
	Socket backlog processing is a major latency source. With current TCP socket sk_rcvbuf limits, I have sampled __release_sock() holding cpu for more than 5 ms, and packets being dropped by the NIC once ring buffer is filled. All users are now ready to be called from process context, we can unblock BH and let interrupts be serviced faster. cond_resched_softirq() could be removed, as it has no more user. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02	sctp: prepare for socket backlog behavior change	Eric Dumazet	1	-0/+2
	sctp_inq_push() will soon be called without BH being blocked when generic socket code flushes the socket backlog. It is very possible SCTP can be converted to not rely on BH, but this needs to be done by SCTP experts. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>