Age  Commit message  Author  Files  Lines
2016-02-16ip_tunnel: replace dst_cache with generic implementationPaolo Abeni4-80/+25
The current ip_tunnel cache implementation is prone to a race that causes the wrong dst to be cached when a dst cache miss and an ip tunnel update via netlink happen concurrently. Replacing it with the generic implementation fixes the issue. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16net: replace dst_cache ip6_tunnel implementation with the generic onePaolo Abeni5-116/+16
This also fixes a potential race in the existing tunnel code, which could lead to the wrong dst being permanently cached:

CPU1:                                   CPU2:
  <xmit on ip6_tunnel>
  <cache lookup fails>
  dst = ip6_route_output(...)
                                        <tunnel params are changed via nl>
                                        dst_cache_reset() // no effect,
                                                          // the cache is empty
  dst_cache_set() // the wrong dst
                  // is permanently stored
                  // into the cache

With the new dst implementation the above race is not possible, since the first cache lookup after dst_cache_reset() will fail due to the timestamp check. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16net: add dst_cache supportPaolo Abeni4-0/+270
This patch adds a generic, lockless dst cache implementation. The need for a lock is avoided by updating the dst cache fields only in per-CPU scope, and by requiring that the cache manipulation functions are invoked with the local bh disabled. The refresh_ts and reset_ts fields are used to ensure cache consistency in case of a concurrent cache update (dst_cache_set*) and reset operation (dst_cache_reset). Consider the following scenario:

CPU1:                                       CPU2:
<cache lookup with empty cache: it fails>
<get dst via uncached route lookup>
                                            <related configuration changes>
                                            dst_cache_reset()
dst_cache_set()

The dst entry passed to dst_cache_set() should not be used for later dst cache lookups, because it was obtained using old configuration values. Since the refresh_ts is updated only on dst_cache lookup, the cached value in the above scenario will be discarded on the next lookup. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
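The timestamp check described above can be illustrated with a minimal userspace sketch. It is not the kernel code: the struct and the toy_* functions are hypothetical, a single slot stands in for the per-CPU slots, the clock is passed in by the caller, and the comparison is a plain "greater than" rather than a wrap-safe jiffies comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-slot model; the real cache keeps one slot per CPU. */
struct toy_dst_cache {
	unsigned long reset_ts;   /* time of the last reset */
	unsigned long refresh_ts; /* time of the last failed lookup */
	const void *dst;          /* cached entry, NULL when empty */
};

/* Configuration changed: only the timestamp moves, the slot is untouched. */
static void toy_cache_reset(struct toy_dst_cache *c, unsigned long now)
{
	assert(c != NULL);
	c->reset_ts = now;
}

/* Store an entry; note that refresh_ts is deliberately not updated here. */
static void toy_cache_set(struct toy_dst_cache *c, const void *dst)
{
	assert(c != NULL);
	c->dst = dst;
}

/*
 * Lookup: an entry is valid only if the last failed lookup happened
 * after the last reset. Otherwise the entry is dropped and the lookup
 * fails, which refreshes the timestamp.
 */
static const void *toy_cache_get(struct toy_dst_cache *c, unsigned long now)
{
	assert(c != NULL);
	if (c->dst != NULL && c->refresh_ts > c->reset_ts)
		return c->dst;
	c->dst = NULL;
	c->refresh_ts = now;
	return NULL;
}
```

In the scenario from the commit message, the entry stored after the reset is rejected by the next lookup because refresh_ts still predates reset_ts; only an entry stored after that failed lookup is served.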
2016-02-16Merge branch 'bnx2x-next'David S. Miller7-76/+201
Yuval Mintz says: ==================== bnx2x: driver updates This series contains several changes - the biggest change is the addition of Geneve NDO support [allows device to perform RSS according to inner-headers of encapsulated packet, similar to what it does for vxlan]. It also extends dcbx support, as well as introducing some minor changes. Dave, Please consider applying this series to `net-next'. [Do notice patch #3 fails checkpatch due to consistency with existing HSI] ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16bnx2x: Warn about grc timeouts in register dumpYuval Mintz1-0/+5
There are several scenarios where taking a register dump from a device might log benign GRC timeout attentions to system logs. The most common of those is when taking the dump from a 2-port device. Sadly, there's no easy way to mask the problematic attentions during the flow - changing this behavior would require a firmware update. For now, simply warn users to ignore the warnings. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16bnx2x: extend DCBx supportYuval Mintz2-15/+47
This adds support for default application priority. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16bnx2x: Add support for single-port DCBxYuval Mintz1-1/+3
The driver is currently looking at shared information for determining whether DCBx can be supported for a given port. On 4-port devices, up-to-date management firmware can support DCBx on each port of a given engine independently - but that would cause bnx2x to misinterpret the support and assume DCBx is supported on both. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16bnx2x: Add Geneve inner-RSS supportYuval Mintz3-59/+146
This adds the ability to perform RSS hashing based on encapsulated headers for a geneve-encapsulated packet. This also changes the Vxlan implementation in bnx2x to be uniform for both vxlan and geneve [from configuration perspective]. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16bnx2x: Remove unneccessary EXPORT_SYMBOLYuval Mintz1-1/+0
bnx2x_schedule_sp_rtnl is exported by bnx2x, although no other module uses it. Reported-by: Benjamin Poirier <bpoirier@suse.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16tipc: refactor node xmit and fix memory leaksRichard Alpe2-24/+38
Refactor tipc_node_xmit() to fail fast and fail early. Fix several potential memory leaks in unexpected error paths. Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16dmascc: Return correct error codesAmitoj Kaur Chawla1-3/+10
This change has been made with the goal that kernel functions should return something more descriptive than -1 on failure. A variable `err` has been introduced for storing error codes. On kzalloc failure the function should return -ENOMEM and not -1. This was found using Coccinelle. A simplified version of the semantic patch used is:

    //<smpl>
    @@
    expression *e;
    identifier l1;
    @@

    e = kzalloc(...);
    if (e == NULL) {
    ...
    goto l1;
    }
    l1:
    ...
    return
    - -1
    + -ENOMEM
    ;
    //</smpl>

Furthermore, set `err` to -ENOMEM on failure of alloc_netdev(), and to -ENODEV on failure of register_netdev() and probe_irq_off(). The single call site only checks that the return value is not 0, hence no change is required at the call site. Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
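As a rough illustration of the pattern the commit moves towards, the userspace sketch below stores a descriptive code in `err` before jumping to a shared exit label. The function toy_setup() and its parameters are hypothetical, and calloc() stands in for kzalloc(); this is not the dmascc code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical setup routine: allocate a buffer, then "probe" a device.
 * Every failure path records a specific errno-style code in err.
 */
static int toy_setup(size_t bytes, int device_present, void **out)
{
	int err;
	void *buf;

	assert(out != NULL);

	buf = calloc(1, bytes);
	if (buf == NULL) {
		err = -ENOMEM;  /* allocation failure, not a bare -1 */
		goto out_err;
	}

	if (!device_present) {
		err = -ENODEV;  /* probe failure */
		goto out_free;
	}

	*out = buf;
	return 0;

out_free:
	free(buf);
out_err:
	return err;
}
```

A caller that only tests for a non-zero return keeps working unchanged, which matches the commit's remark about the single call site.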
2016-02-16Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queueDavid S. Miller9-439/+637
Jeff Kirsher says: ==================== 1GbE Intel Wired LAN Driver Updates 2016-02-15 This series contains updates to igb only.

Shota Suzuki cleans up unnecessary flag setting for 82576 in igb_set_flag_queue_pairs(): the default block already sets IGB_FLAG_QUEUE_PAIRS to the correct value, so the e1000_82576 code block is not necessary and we can simply fall through. He then fixes an issue where IGB_FLAG_QUEUE_PAIRS can now be set by using the "ethtool -L" option but is never cleared unless the driver is reloaded, so clear the queue pairing if the pairing becomes unnecessary as a result of "ethtool -L".

Mitch stops igbvf from giving up if it fails to get the hardware mailbox lock. This can happen when the PF-VF communication channel is heavily loaded, and it causes complete communications failure between the PF and VF drivers, so add a counter and a delay so that the driver will now retry ten times before giving up on getting the mailbox lock.

The remaining patches in the series are from Alex Duyck, starting with cleaning up the code that sets the MAC address. He then refactors the VFTA and VLVF configuration, to simplify it and bring it closer to similar setups in the ixgbe driver. Fixed an issue where the VLAN header size was being added to the value programmed into the RLPML registers, yet these registers already take into account the size of the VLAN headers when determining the maximum packet length, so we can drop the code that adds the size to the RLPML registers. Cleaned up the VF port based VLAN configuration. Also fixed the igb driver so that we can fully support SR-IOV or the recently added NTUPLE filtering while allowing support for VLAN promiscuous mode. Also added the ability to use the bridge utility to add a FDB entry for the PF to an igb port. ====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16Merge branch 'ethtool-channels-rxfh-conflict'David S. Miller3-4/+82
Jacob Keller says: ==================== ethtool: correct {GS}CHANNELS and {GS}RXFH conflict This patch series fixes up ethtool_set_channels operation which allowed modifying the RXFH table indirectly by reducing the number of queues below the current max queue used by the Rx flow table. Most drivers incorrectly allowed this to destroy the Rx flow table and would then start by reinitializing it to default settings. However, drivers are not able to correctly handle the conflict since there was no way to differentiate between the default settings and the user requested explicit settings. To fix this, implement a new netdev private flag which we use to indicate whether the RXFH has been user configured. If someone has a better alternative of how to store this information, let me know. I am not sure that priv_flags is the best solution but I have not had any better idea. Secondly, we add a function which just calls the driver's get_rxfh callback to determine the current indirection table. Loop through this and we can determine the current highest queue that will be used by RSS. Now, modify ethtool_set_channels to add a check ensuring that if (a) we have had rxfh configured by user, (b) we can get the maximum RSS queue currently used, then we ensure that the newly requested Rx count (or combined count) is at least as high as this maximum RSS queue. The reasoning here is that we can always safely increase the number of queues. If we decrease the queues we must ensure that the decrease does not go lower than the highest in-use queue for the Rx flow table. Drivers may still need to be patched if they currently overwrite the Rx flow table during channel configuration. If the driver currently always resets Rx flow table when increasing number of queues it must be patched to only do this when netif_is_rxfh_configured returns false. The second patch simply adds a check to ensure that all provided channel counts fit within driver defined maximums. 
The third patch fixes fm10k to correctly reconfigure the RSS reta table whenever it is still unconfigured. This means that the default state will provide RSS to every queue. Once the user has configured RXFH, we should maintain it. In addition, since the case where we must reconfigure the RSS table should no longer occur, add a dev_err message to inform the user when we do so. I have also supplied an ethtool patch to enable setting the default Rx flow indirection table. Without this, current ethtool does not support sending an indir_size of 0, and thus does not correctly support configuring back to the default.

Changes in v2:
* fixed compile error
* fixed incorrect comparison with max_rx_in_use
* adjusted looping over dev_size
* removed inline on function
* dropped patch about separating combined vs asymmetric channels
* verified behavior using fm10k driver
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16fm10k: don't reinitialize RSS flow table when RXFH configuredKeller, Jacob E1-2/+8
Also print an error message in case we do have to reconfigure, as this should no longer happen due to the ethtool changes. If it somehow does occur, the user should be made aware of it. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16ethtool: ensure channel counts are within bounds during SCHANNELSKeller, Jacob E1-2/+11
Add a sanity check to ensure that all requested channel sizes are within bounds, which should reduce errors in driver implementation. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16ethtool: correctly ensure {GS}CHANNELS doesn't conflict with GS{RXFH}Keller, Jacob E2-0/+63
Ethernet drivers implementing both {GS}RXFH and {GS}CHANNELS ethtool ops incorrectly allow SCHANNELS when it would conflict with the settings from SRXFH. This occurs because it is not possible for drivers to understand whether their Rx flow indirection table has been configured or is in the default state. In addition, drivers currently behave in various ways when increasing the number of Rx channels. Some drivers will always destroy the Rx flow indirection table when this occurs, whether it has been set by the user or not. Other drivers will attempt to preserve the table even if the user has never modified it from the default driver settings. Neither of these situations is desirable because it leads to unexpected behavior or loss of user configuration. The correct behavior is to simply return -EINVAL when SCHANNELS would conflict with the current Rx flow table settings. However, it should only do so if the current settings were modified by the user. If we required that the new settings never conflict with the current (default) Rx flow settings, we would force users to first reduce their Rx flow settings and then reduce the number of Rx channels. This patch proposes a solution implemented in net/core/ethtool.c which ensures that all drivers behave correctly. It checks whether the RXFH table has been configured to non-default settings, and stores this information in a private netdev flag. When the number of channels is requested to change, it first ensures that the current Rx flow table is not going to assign flows to now disabled channels. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
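The check described above can be sketched in plain C as follows. This is only an illustration of the rule, not the code in net/core/ethtool.c: the toy_* names are hypothetical, the indirection table is passed in as an array, and the user-configured flag is a plain boolean.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Highest queue index referenced by the Rx flow indirection table. */
static uint32_t toy_max_rxfh_queue(const uint32_t *indir, size_t n)
{
	uint32_t max = 0;

	assert(indir != NULL);
	for (size_t i = 0; i < n; i++)
		if (indir[i] > max)
			max = indir[i];
	return max;
}

/*
 * Reject a new Rx channel count only when the user has configured the
 * table and the new count would disable a queue the table still uses.
 * Queue indices start at 0, so the count must exceed the highest index.
 */
static int toy_set_channels(bool rxfh_user_configured,
			    const uint32_t *indir, size_t n,
			    uint32_t new_rx_count)
{
	if (rxfh_user_configured && n > 0 &&
	    new_rx_count <= toy_max_rxfh_queue(indir, n))
		return -EINVAL;
	return 0;
}
```

With a default (not user-configured) table the request is always accepted, which leaves the driver free to rebuild the table for the new queue count.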
2016-02-16net: fec: Add "phy-reset-active-low" property to DTBernhard Walle2-2/+9
We need that for custom hardware that requires the reverse reset sequence. Signed-off-by: Bernhard Walle <bernhard@bwalle.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16Merge branch 'bcm7xxx-cleanups'David S. Miller1-38/+20
Florian Fainelli says: ==================== net: phy: bcm7xxx: Misc cleanups These two patches are cleanups to the BCM7xxx internal PHY driver: - fix a constant name missing a X (as in BCM7XXX) - add a macro to reduce the amount of code duplication to add new entries ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16net: phy: bcm7xxx: Reduce boilerplate code for 40nm EPHYFlorian Fainelli1-36/+18
Introduce a macro which helps adding new 40NM EPHY entries and reduces the amount of boilerplate code. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16net: phy: bcm7xxx: Make MII_BCM7XX_64CLK_MDIO naming consistentFlorian Fainelli1-2/+2
The driver is BCM7xxx, we were missing an additional X in the constant naming, fix that to be consistent. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-15igb: Add workaround for VLAN tag stripping on 82576Alexander Duyck2-12/+16
There was a workaround partially implemented for the 82576 that is needed in order for VLAN tag stripping to function correctly. The original code had side effects that would make it so the workaround was active on all MACs. I have updated the code so that the workaround is enabled, but limited to the 82576, or activated if we exceed the available unicast addresses. The workaround has a side effect of mirroring all of the traffic outgoing from the VFs back to the PF. As such it is not recommended to use the 82576 in promiscuous mode as it will take a performance hit, though this is now consistent with the performance as seen on the out-of-tree igb driver. I also limited the scope of the UTA bits all being set to only when the VMOLR register is enabled. This should limit the effects of the UTA register so that we don't pick up any excess traffic unless promiscuous mode has been enabled on the PF, whereas before the PF would have ended up in something equivalent to unicast promiscuous mode with VLAN filtering otherwise. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: Enable use of "bridge fdb add" to set unicast table entriesAlexander Duyck1-9/+30
This change makes it so that we can use the bridge utility to add a FDB entry for the PF to an igb port. By doing this we can enable the VFs to talk to virtual ports residing on top of the PF. In addition this should also address issues with MACVLANs trying to reside on top of the PF as well as they would have had similar issues when added to the PF with SR-IOV enabled. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: Drop unnecessary checks in transmit pathAlexander Duyck1-10/+0
This patch drops several checks that we dropped from ixgbe some time ago. It should not be possible for us to be called with either of the conditional statements returning true, so we can just drop them from the hot-path. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: Add support for VLAN promiscuous with SR-IOV and NTUPLEAlexander Duyck2-72/+242
This change fixes things so that we can fully support SR-IOV or the recently added NTUPLE filtering while allowing support for VLAN promiscuous mode. By making this change we are able to support possible scenarios such as SR-IOV with the PF connected to a Linux bridge hosting other VMs. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: Clean-up configuration of VF port VLANsAlexander Duyck1-71/+110
This patch is meant to clean-up the configuration of the VF port based VLAN configuration. The original logic was a bit muddled and had some undesirable side effects such as VLANs being either completely stripped from the port or VLANs being left when they shouldn't be. The idea behind this code is to avoid any events such as spurious spoof notifications when we are removing one VLAN tag and replacing it with another. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: Merge VLVF configuration into igb_vfta_setAlexander Duyck3-92/+135
This change makes it so that we can merge the configuration of the VLVF registers into the setting of the VFTA register. By doing this we simplify the logic and make use of similar functionality that we have already added for ixgbe making it easier to maintain both drivers. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: Always enable VLAN 0 even if 8021q is not loadedAlexander Duyck1-2/+3
This patch makes it so that we always add VLAN 0. This is important as we need to guarantee the PF can receive untagged frames in the case of SR-IOV being enabled but VLAN filtering not being enabled in the kernel. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: Do not factor VLANs into RLPML calculationAlexander Duyck2-42/+2
The RLPML registers already take the size of VLAN headers into account when determining the maximum packet length. This is called out in EAS documents for several parts including the 82576 and the i350. As such we can drop the addition of size to the value programmed into the RLPML registers. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: Allow asymmetric configuration of MTU versus Rx frame sizeAlexander Duyck2-68/+42
Since the igb driver is using page based receive there is no point in limiting the Rx capabilities of the device. The driver can receive 9K jumbo frames at all times. The only changes needed due to MTU changes are updates for the FIFO sizes and flow-control watermarks. Update the maximum frame size to reflect the 9.5K limitation of the hardware, and replace all instances of max_frame_size with MAX_JUMBO_FRAME_SIZE when referring to an Rx FIFO or frame. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: Refactor VFTA configurationAlexander Duyck4-76/+67
This patch starts the clean-up process on the VFTA configuration. Specifically, in this patch I attempt to address and simplify several items while also updating the code to bring it more in line with what is already in ixgbe. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: clean up code for setting MAC addressAlexander Duyck1-5/+4
Drop a bunch of hand written byte swapping code in favor of just doing the byte swapping ourselves. The registers are little endian registers storing a big endian value so if we read the MAC address array as little endian then we will get the CPU registers into the proper layout. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb/igbvf: don't give upMitch Williams2-13/+25
The driver shouldn't just give up if it fails to get the hardware mailbox lock. This can happen in a situation where the PF-VF communication channel is heavily loaded and causes complete communications failure between the PF and VF drivers. Add a counter and a delay. The driver will now retry ten times, waiting one millisecond between retries. Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: Unpair the queues when changing the number of queuesShota Suzuki1-0/+2
By the commit 72ddef0506da ("igb: Fix oops caused by missing queue pairing"), the IGB_FLAG_QUEUE_PAIRS flag can now be set when changing the number of queues by "ethtool -L", but it is never cleared unless the igb driver is reloaded. This patch clears it if queue pairing becomes unnecessary as a result of "ethtool -L". Signed-off-by: Shota Suzuki <suzuki_shota_t3@lab.ntt.co.jp> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-15igb: Remove unnecessary flag setting in igb_set_flag_queue_pairs()Shota Suzuki1-8/+0
If VFs are enabled (max_vfs >= 1), both max_rss_queues and adapter->rss_queues are set to 2 in the case of e1000_82576. In this case, IGB_FLAG_QUEUE_PAIRS is always set in the default block as a result of fall-through, thus setting it in the e1000_82576 block is not necessary. Signed-off-by: Shota Suzuki <suzuki_shota_t3@lab.ntt.co.jp> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-12Merge branch 'local-checksum-offload'David S. Miller14-77/+197
Edward Cree says: ==================== Local Checksum Offload Re-tested VxLAN; everything else is unchanged from v4.

Changes from v4:
* Rebased series to fix conflicts with vxlan/vxlan6 merge.

Changes from v3:
* Fixed inverted checksum values introduced in v3.
* Don't mangle zero checksums in GRE.
* Clear skb->encapsulation in iptunnel_handle_offloads when not using CHECKSUM_PARTIAL, lest drivers incorrectly interpret that as a request for inner checksum offload.

Changes from v2:
* Added support for IPv4 GRE.
* Split out 'always set up for checksum offload' into its own patch.
* Removed csum_help from iptunnel_handle_offloads.
* Rewrote LCO callers to only fold once.
* Simplified nocheck handling.

Changes from v1:
* Enabled support in more encapsulation protocols. I think it now covers everything except GRE.
* Wrote up some documentation covering TX checksum offload, LCO and RCO.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12Documentation/networking: add checksum-offloads.txt to explain LCOEdward Cree3-0/+123
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: ip_tunnel: remove 'csum_help' argument to iptunnel_handle_offloadsEdward Cree9-28/+17
All users now pass false, so we can remove it, and remove the code that was conditional upon it. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: gre: Implement LCO for GRE over IPv4Edward Cree1-3/+13
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12fou: enable LCO in FOU and GUEEdward Cree1-8/+6
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: vxlan: enable local checksum offloadEdward Cree1-4/+2
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: enable LCO for udp_tunnel_handle_offloads() usersEdward Cree1-1/+2
The only protocol affected at present is Geneve. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: udp: always set up for CHECKSUM_PARTIAL offloadEdward Cree2-25/+2
If the dst device doesn't support it, it'll get fixed up later anyway by validate_xmit_skb(). Also, this allows us to take advantage of LCO to avoid summing the payload multiple times. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: local checksum offload for encapsulationEdward Cree4-22/+46
The arithmetic properties of the ones-complement checksum mean that a correctly checksummed inner packet, including its checksum, has a ones complement sum depending only on whatever value was used to initialise the checksum field before checksumming (in the case of TCP and UDP, this is the ones complement sum of the pseudo header, complemented). Consequently, if we are going to offload the inner checksum with CHECKSUM_PARTIAL, we can compute the outer checksum based only on the packed data not covered by the inner checksum, and the initial value of the inner checksum field. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
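The arithmetic can be checked with a small model that works on arrays of 16-bit words. This is a sketch under stated assumptions, not the kernel helper: the toy_* and oc_sum names are made up, byte order and odd lengths are ignored, and the "device" is a function that fills in the inner checksum the way offload hardware would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones-complement sum of n 16-bit words, seeded with init. */
static uint16_t oc_sum(const uint16_t *w, size_t n, uint16_t init)
{
	uint32_t s = init;

	assert(w != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		s += w[i];
		s = (s & 0xffff) + (s >> 16);  /* end-around carry */
	}
	return (uint16_t)s;
}

/*
 * Stand-in for the offload engine: the checksum field initially holds
 * the seed value; replace it with the complement of the sum over the
 * whole inner packet.
 */
static void toy_finish_inner_csum(uint16_t *inner, size_t n, size_t csum_idx)
{
	assert(csum_idx < n);
	inner[csum_idx] = (uint16_t)~oc_sum(inner, n, 0);
}

/*
 * Local checksum offload: the outer sum needs only the words not
 * covered by the inner checksum plus the complement of the seed value,
 * because the finished inner packet always sums to that complement.
 */
static uint16_t toy_lco_sum(const uint16_t *outer_hdr, size_t hdr_n,
			    uint16_t inner_seed)
{
	return oc_sum(outer_hdr, hdr_n, (uint16_t)~inner_seed);
}
```

Summing the outer words together with the finished inner packet gives the same result as toy_lco_sum(), which never reads the payload.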
2016-02-12Merge branch 'tcp_dccp_ports'David S. Miller2-211/+199
Eric Dumazet says: ==================== tcp/dccp: better use of ephemeral ports Big servers have bloated bind tables, making it very hard for ephemeral port allocations to succeed without special containers/namespace tricks. This patch series extends the strategy added in commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range in connect()"). Since ports used by connect() are very likely to be shared among them, we give a hint to both bind() and connect() to keep the crowds separated if possible. Of course, if on a specific host an application needs to allocate ~30000 ports using bind(), it will still be able to do so. Same for ~30000 connect() to a unique 2-tuple (dst addr, dst port). The new implementation is also more friendly to softirqs and reschedules. v2: rebase after TCP SO_REUSEPORT changes ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12tcp/dccp: better use of ephemeral ports in bind()Eric Dumazet1-126/+114
Implement strategy used in __inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12tcp/dccp: better use of ephemeral ports in connect()Eric Dumazet1-85/+85
In commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range in connect()"), I added a very simple heuristic, so that we got better chances to use even ports, and allow bind() users to have more available slots. It gave nice results, but with more than 200,000 TCP sessions on a typical server, the ~30,000 ephemeral ports are still a rare resource. I chose to go a step further, by looking at all even ports, and if none was available, fallback to odd ports. The companion patch does the same in bind(), but in opposite way. I've seen exec times of up to 30ms on busy servers, so I no longer disable BH for the whole traversal, but only for each hash bucket. I also call cond_resched() to be gentle to other tasks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
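The parity split between connect() and bind() can be sketched as two passes over the port range. This is a simplified model with hypothetical names: the in_use array replaces the bind hash table, the scan starts at the bottom of the range instead of a random offset, and the per-bucket BH handling and cond_resched() calls are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* First free port of the given parity in [low, high], or -1 if none. */
static int toy_scan_parity(const bool *in_use, int low, int high, int parity)
{
	assert(in_use != NULL);
	for (int port = low; port <= high; port++) {
		if ((port & 1) != parity)
			continue;
		if (!in_use[port - low])
			return port;
	}
	return -1;
}

/* connect(): look at all even ports first, then fall back to odd ones. */
static int toy_pick_connect_port(const bool *in_use, int low, int high)
{
	int port = toy_scan_parity(in_use, low, high, 0);

	return port >= 0 ? port : toy_scan_parity(in_use, low, high, 1);
}

/* bind(): the opposite order, odd ports first, then even ones. */
static int toy_pick_bind_port(const bool *in_use, int low, int high)
{
	int port = toy_scan_parity(in_use, low, high, 1);

	return port >= 0 ? port : toy_scan_parity(in_use, low, high, 0);
}
```

As long as both parities have free ports, the two callers never compete for the same port, yet either can still use the whole range once its preferred half is exhausted.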
2016-02-11Merge branch 'net-mitigate-kmem_free-slowpath'David S. Miller4-10/+96
Jesper Dangaard Brouer says: ==================== net: mitigating kmem_cache free slowpath This patchset is the first real use-case for kmem_cache bulk _free_. The use of bulk _alloc_ is NOT included in this patchset. The full use has previously been posted here [1]. The bulk free side has the largest benefit for the network stack use-case, because the network stack is hitting the kmem_cache/SLUB slowpath when freeing SKBs, due to the amount of outstanding SKBs. This is solved by using the new API kmem_cache_free_bulk(). Introduce a new API, napi_consume_skb(), that hides/handles bulk freeing for the caller. The drivers simply need to use this call when freeing SKBs in NAPI context, e.g. replacing their calls to dev_kfree_skb() / dev_consume_skb_any(). Driver ixgbe is the first user of this new API. [1] http://thread.gmane.org/gmane.linux.network/384302/focus=397373 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11ixgbe: bulk free SKBs during TX completion cleanup cycleJesper Dangaard Brouer1-3/+3
There is an opportunity to bulk free SKBs during reclaiming of resources after DMA transmit completes in ixgbe_clean_tx_irq. Thus, bulk freeing at this point does not introduce any added latency. Simply use napi_consume_skb(), which was recently introduced. The napi_budget parameter is needed by napi_consume_skb() to detect if it is called from netpoll. Benchmarking IPv4-forwarding, on CPU i7-4790K @4.2GHz (no turbo boost) Single CPU/flow numbers: before: 1982144 pps -> after : 2064446 pps Improvement: +82302 pps, -20 nanosec, +4.1% (SLUB and GCC version 5.1.1 20150618 (Red Hat 5.1.1-4)) Joint work with Alexander Duyck. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11net: bulk free SKBs that were delay free'ed due to IRQ contextJesper Dangaard Brouer3-3/+14
The network stack defers freeing SKBs in case the free happens in IRQ context or when IRQs are disabled. This happens in __dev_kfree_skb_irq(), which writes SKBs that were free'ed during IRQ to the softirq completion queue (softnet_data.completion_queue). These SKBs are naturally delayed, and cleaned up during NET_TX_SOFTIRQ in function net_tx_action(). Take advantage of this and use the skb defer and flush API, as we are already in softirq context. For modern drivers this rarely happens, although most drivers do call dev_kfree_skb_any(), which detects the situation and calls __dev_kfree_skb_irq() when needed. This is because netpoll can call from IRQ context. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11net: bulk free infrastructure for NAPI context, use napi_consume_skbJesper Dangaard Brouer3-6/+81
Discovered that the network stack was hitting the kmem_cache/SLUB slowpath when freeing SKBs. Doing bulk free with kmem_cache_free_bulk can speed up this slowpath. NAPI context is a bit special, let's take advantage of that for bulk free'ing SKBs. In NAPI context we are running in softirq, which gives us certain protection. A softirq can run on several CPUs at once. BUT the important part is that a softirq will never preempt another softirq running on the same CPU. This gives us the opportunity to access per-cpu variables in softirq context. Extend napi_alloc_cache (which before only contained page_frag_cache) to be a struct with a small array based stack for holding SKBs. Introduce a SKB defer and flush API for accessing this. Introduce napi_consume_skb() as a replacement for e.g. dev_consume_skb_any() when running in NAPI context. A small trick to handle/detect if we are called from netpoll is to see if budget is 0. In that case, we need to invoke dev_consume_skb_irq(). Joint work with Alexander Duyck. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
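The defer-and-flush idea can be modelled in a few lines. The sketch below is not the kernel implementation: struct toy_free_cache, its size and the counters are hypothetical, nothing is actually freed, and the counters simply record when a bulk free or an immediate free would have happened.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_SIZE 4  /* hypothetical stack depth, not the kernel's */

/* Small array based stack of objects waiting to be freed in bulk. */
struct toy_free_cache {
	void *slots[TOY_CACHE_SIZE];
	size_t count;
	size_t bulk_flushes;  /* times a bulk free would have been issued */
	size_t single_frees;  /* objects freed immediately */
};

/* Flush: hand all deferred objects to one bulk free call. */
static void toy_flush(struct toy_free_cache *c)
{
	assert(c != NULL);
	if (c->count == 0)
		return;
	c->bulk_flushes++;  /* stand-in for kmem_cache_free_bulk() */
	c->count = 0;
}

/*
 * Consume: a budget of 0 signals a netpoll caller, where deferring is
 * not safe, so the object is freed right away. Otherwise the object is
 * pushed on the stack, which is flushed once it is full.
 */
static void toy_consume(struct toy_free_cache *c, void *obj, int budget)
{
	assert(c != NULL);
	if (budget == 0) {
		c->single_frees++;  /* stand-in for the immediate free path */
		return;
	}
	c->slots[c->count++] = obj;
	if (c->count == TOY_CACHE_SIZE)
		toy_flush(c);
}
```

A caller would also flush at the end of its poll cycle so that no object lingers on the stack.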