path: root/net
Age | Commit message | Author | Files | Lines
2015-02-04 | Revert "bridge: Let bridge not age 'externally' learnt FDB entries, they are removed when 'external' entity notifies the aging" | David S. Miller | 1 | -1/+1
This reverts commit 9a05dde59a35eee5643366d3d1e1f43fc9069adb. Requested by Scott Feldman. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 | pkt_sched: fq: better control of DDOS traffic | Eric Dumazet | 1 | -2/+17
FQ has a fast path for skbs attached to a socket, as it does not have to compute a flow hash. But for other packets, FQ being non-stochastic means that hosts exposed to random Internet traffic can allocate millions of flow structures (104 bytes each) pretty easily. Not only can the host OOM, but lookups in the RB trees can consume too much CPU and memory. This patch adds a new attribute, orphan_mask, which adds the possibility of using a stochastic hash for orphaned skbs. Its default value is 1024 slots, to mimic SFQ behavior. Note: this does not apply to locally generated TCP traffic, and no locally generated traffic will share a flow structure with another perfect or stochastic flow. This patch also handles the specific case of SYNACK messages: they are attached to the listener socket, and therefore all map to a single hash bucket. If the listener has set SO_MAX_PACING_RATE, hoping to have newly accepted sockets inherit this rate, SYNACKs might be paced and even dropped. This is very similar to an internal patch Google has used for more than one year. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
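[Editor's note] A minimal sketch of the orphan-hash idea described above, not taken from the patch itself; the helper name and the assumption that orphan_mask is (slots - 1) are illustrative:

    #include <linux/skbuff.h>

    /* Fold an orphaned (socket-less) skb into a bounded set of buckets.
     * skb_get_hash() computes (or reuses) the flow hash; masking it down to
     * orphan_mask + 1 slots makes the lookup stochastic, SFQ-style, instead
     * of strictly per-flow.
     */
    static u32 fq_orphan_bucket(struct sk_buff *skb, u32 orphan_mask)
    {
            return skb_get_hash(skb) & orphan_mask;
    }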
2015-02-04 | Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | David S. Miller | 15 | -382/+192
More iov_iter work from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 | tcp: do not pace pure ack packets | Eric Dumazet | 2 | -3/+17
When we added pacing to TCP, we decided to let sch_fq take care of the actual pacing. All TCP had to do was to compute sk->pacing_rate using the simple formula: sk->pacing_rate = 2 * cwnd * mss / rtt. It works well for senders (bulk flows), but not very well for receivers or even RPC: cwnd on the receiver can be less than 10, rtt can be around 100ms, so we can end up pacing ACK packets, slowing down the sender. Really, only the sender should pace, according to its own logic. Instead of adding a new bit in the skb, or calling yet another flow dissection, we tweak skb->truesize to a small value (2), and we instruct sch_fq to use a new helper and not pace pure acks. Note this also helps TCP small queues, as ack packets present in the qdisc/NIC do not prevent sending a data packet (RPC workload). This helps to reduce tx completion overhead; ack packets can use regular sock_wfree() instead of tcp_wfree(), which is a bit more expensive. This has no impact in the case packets are sent to the loopback interface, as we do not coalesce ack packets (where we would detect the skb->truesize lie). In case netem (with a delay) is used, skb_orphan_partial() also sets skb->truesize to 1. This patch is a combination of two patches we used for about one year at Google. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
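[Editor's note] A minimal sketch of the helper sch_fq could use to recognize such packets, assuming the convention described above (a pure ACK is flagged by setting skb->truesize to the impossible value 2):

    #include <linux/skbuff.h>

    /* Pure ACKs carry a deliberately tiny truesize; skip pacing for them. */
    static inline bool skb_is_tcp_pure_ack(const struct sk_buff *skb)
    {
            return skb->truesize == 2;
    }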
2015-02-04 | netfilter: Use rhashtable walk iterator | Herbert Xu | 1 | -17/+36
This patch gets rid of the manual rhashtable walk in nft_hash which touches rhashtable internals that should not be exposed. It does so by using the rhashtable iterator primitives. Note that I'm leaving nft_hash_destroy alone since it's only invoked on shutdown and it shouldn't be affected by changes to rhashtable internals (or at least not what I'm planning to change). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
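[Editor's note] For reference, a hedged sketch of the rhashtable walk iterator pattern referred to here; the exact function set has changed across kernel versions, so treat the names as an approximation of the 2015-era API:

    #include <linux/rhashtable.h>
    #include <linux/err.h>

    static int walk_all(struct rhashtable *ht)
    {
            struct rhashtable_iter iter;
            void *obj;
            int err;

            err = rhashtable_walk_init(ht, &iter);  /* rhashtable_walk_enter() in later kernels */
            if (err)
                    return err;

            rhashtable_walk_start(&iter);
            while ((obj = rhashtable_walk_next(&iter)) != NULL) {
                    if (IS_ERR(obj)) {
                            if (PTR_ERR(obj) == -EAGAIN)
                                    continue;       /* table was resized; iterator resyncs */
                            break;                  /* genuine error */
                    }
                    /* ... visit obj ... */
            }
            rhashtable_walk_stop(&iter);
            rhashtable_walk_exit(&iter);
            return 0;
    }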
2015-02-04 | netlink: Use rhashtable walk iterator | Herbert Xu | 1 | -66/+64
This patch gets rid of the manual rhashtable walk in netlink which touches rhashtable internals that should not be exposed. It does so by using the rhashtable iterator primitives. In fact the existing code was very buggy. Some sockets weren't shown at all while others were shown more than once. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 | net/core: Add event for a change in slave state | Moni Shoua | 2 | -0/+21
Add an event that provides an indication of a change in the state of a bonding slave. The event handler should cast the pointer to the appropriate type (struct netdev_bonding_info) in order to get the full info about the slave. Signed-off-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
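[Editor's note] A hedged sketch of how a driver's netdev notifier might consume such an event; the event name (NETDEV_BONDING_INFO) and the wrapper struct (struct netdev_notifier_bonding_info) are assumptions based on the description above:

    #include <linux/netdevice.h>
    #include <linux/notifier.h>

    static int my_netdev_event(struct notifier_block *nb,
                               unsigned long event, void *ptr)
    {
            struct net_device *dev = netdev_notifier_info_to_dev(ptr);

            if (event == NETDEV_BONDING_INFO) {
                    /* Assumed layout: the notifier info embeds the bonding info. */
                    struct netdev_notifier_bonding_info *info = ptr;
                    struct netdev_bonding_info *bonding_info = &info->bonding_info;

                    netdev_info(dev, "bonding slave state changed\n");
                    /* ... inspect *bonding_info and react to the change ... */
            }
            return NOTIFY_DONE;
    }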
2015-02-04 | tipc: separate link starting event from link timeout event | Jon Paul Maloy | 1 | -1/+3
When a new link instance is created, it is triggered to start by sending it a TIPC_STARTING_EVT, whereafter a regular link reset is applied to it. The starting event is codewise treated as a timeout event, and prompts a link RESET message to be sent to the peer node, carrying a link session identifier. The later link_reset() call nudges this session identifier, whereafter all subsequent RESET messages will be sent out with the new identifier. The latter session number overrides the former, causing the peer to unconditionally accept it irrespective of its current working state. We don't think that this causes any problem, but it is not in accordance with the protocol spec, and may cause confusion when debugging TIPC sessions. To avoid this, we make the starting event distinct from the subsequent timeout events, by not allowing the former to send out any RESET message. This eliminates the described problem. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 | tipc: eliminate race during node creation | Jon Paul Maloy | 2 | -13/+7
Instances of struct node are created in the function tipc_disc_rcv() under the assumption that there is no race between received discovery messages arriving from the same node. This assumption is wrong. When we use more than one bearer, it is possible that discovery messages from the same node arrive at the same moment, resulting in creation of two instances of struct tipc_node. This may later cause confusion during link establishment, and may result in one of the links never becoming activated. We fix this by making lookup and potential creation of nodes atomic. Instead of first looking up the node, and in case of failure, create it, we now start with looking up the node inside node_link_create(), and return a reference to that one if found. Otherwise, we go ahead and create the node as we did before. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 | tipc: avoid stale link after aborted failover | Jon Paul Maloy | 2 | -0/+5
During link failover it may happen that the remaining link goes down while it is still in the process of taking over traffic from a previously failed link. When this happens, we currently abort the failover procedure and reset the first failed link to non-failover mode, so that it will be ready to re-establish contact with its peer when it becomes available. However, if the first link goes down because its bearer was manually disabled, it is not enough to reset it; it must also be deleted, which is supposed to happen when the failover procedure is finished. Otherwise it will remain a zombie link: attached to the owner node structure, in mode LINK_STOPPED, and permanently blocking any re-establishment of the link to the peer via the interface in question. We fix this by amending the failover abort procedure. Apart from resetting the link to non-failover state, we test if the link is also in LINK_STOPPED mode. If so, we delete it, using the conditional tipc_link_delete() function introduced in the previous commit. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 | tipc: add reference count to struct tipc_link | Jon Paul Maloy | 2 | -31/+50
When a bearer is disabled, all pertaining links will be reset and deleted. However, if there is a second active link towards a killed link's destination, the delete has to be postponed until the failover is finished. During this interval, we currently put the link in zombie mode, i.e., we take it out of traffic, delete its timer, but leave it attached to the owner node structure until all missing packets have been received. When this is done, we detach the link from its node and delete it, assuming that the synchronous timer deletion that was initiated earlier in a different thread has finished. This is unsafe, as the failover may finish before del_timer_sync() has returned in the other thread. We fix this by adding an atomic reference counter of type kref in struct tipc_link. The counter keeps track of the references kept to the link by the owner node and the timer. We then do a conditional delete, based on the reference counter, both after the failover has been finished and when the timer expires, if applicable. Whoever comes last, will actually delete the link. This approach also implies that we can make the deletion of the timer asynchronous. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
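[Editor's note] A hedged sketch of the kref-based scheme described above; apart from the standard kref primitives, the names and layout are illustrative, not the actual tipc code:

    #include <linux/kref.h>
    #include <linux/slab.h>

    struct tipc_link {
            struct kref ref;        /* held by the owner node and by the timer */
            /* ... remaining link state elided ... */
    };

    static void tipc_link_release(struct kref *kref)
    {
            struct tipc_link *l = container_of(kref, struct tipc_link, ref);

            kfree(l);               /* whoever drops the last reference frees the link */
    }

    /* Conditional delete: called both when failover completes and when the
     * timer fires for the last time; the last caller actually deletes.
     */
    static void tipc_link_put(struct tipc_link *l)
    {
            kref_put(&l->ref, tipc_link_release);
    }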
2015-02-04 | Merge tag 'mac80211-next-for-davem-2015-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next | David S. Miller | 32 | -160/+1325
Last round of updates for net-next:
* revert a patch that caused a regression with mesh userspace (Bob)
* fix a number of suspend/resume related races (from Emmanuel, Luca and myself - we'll look at backporting later)
* add software implementations for new ciphers (Jouni)
* add a new ACPI ID for Broadcom's rfkill (Mika)
* allow using netns FD for wireless (Vadim)
* some other cleanups (various)
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 | Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next | David S. Miller | 9 | -165/+532
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-02-03
Here's what's likely the last bluetooth-next pull request for 3.20. Notable changes include:
- xHCI workaround + a new id for the ath3k driver
- Several new ids for the btusb driver
- Support for new Intel Bluetooth controllers
- Minor cleanups to ieee802154 code
- Nested sleep warning fix in socket accept() code path
- Fixes for Out of Band pairing handling
- Support for LE scan restarting for HCI_QUIRK_STRICT_DUPLICATE_FILTER
- Improvements to data we expose through debugfs
- Proper handling of Hardware Error HCI events
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 | net: add skb functions to process remote checksum offload | Tom Herbert | 1 | -16/+2
This patch adds skb_remcsum_process and skb_gro_remcsum_process to perform the appropriate adjustments to the skb when receiving remote checksum offload. Updated vxlan and gue to use these functions. Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did not see any change in performance. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 | bridge: Let bridge not age 'externally' learnt FDB entries, they are removed when 'external' entity notifies the aging | Siva Mannem | 1 | -1/+1
When 'learned_sync' flag is turned on, the offloaded switch port syncs learned MAC addresses to bridge's FDB via switchdev notifier (NETDEV_SWITCH_FDB_ADD). Currently, FDB entries learnt via this mechanism are wrongly being deleted by bridge aging logic. This patch ensures that FDB entries synced from offloaded switch ports are not deleted by bridging logic. Such entries can only be deleted via switchdev notifier (NETDEV_SWITCH_FDB_DEL). Signed-off-by: Siva Mannem <siva.mannem.lnx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 | xps: fix xps for stacked devices | Eric Dumazet | 2 | -1/+10
A typical qdisc setup is the following:
bond0: bonding device, using HTB hierarchy
eth1/eth2: slaves, multiqueue NIC, using MQ + FQ qdisc
XPS allows spreading packets on specific tx queues, based on the cpu doing the send. The problem is that dequeues from the bond0 qdisc can happen on random cpus, due to the fact that qdisc_run() can dequeue a batch of packets.
CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1
CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0
CPUB -> dequeue packet P1 from bond0, enqueue packet on eth1/eth2
CPUC -> dequeue packet P2 from bond0, enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0)
get_xps_queue() then might select the wrong queue for P1, since the current cpu might be different than CPUA. P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping), if CPUC runs a bit faster (or CPUB spins a bit on the qdisc lock). The effect of this bug is TCP reorders, and more generally not optimal TX queue placement. (A victim bulk flow can be migrated to the wrong TX queue for a while.) To fix this, we have to record the sender cpu number the first time dev_queue_xmit() is called for one tx skb. We can union napi_id (used on the receive path) and sender_cpu, granted we clear sender_cpu in skb_scrub_packet() (credit to Willem for this union idea). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
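[Editor's note] A hedged sketch of the recording step described above; the +1 offset so that zero can mean "not recorded" is an assumption consistent with the napi_id/sender_cpu union:

    #include <linux/skbuff.h>
    #include <linux/smp.h>

    /* Record, once, which CPU first queued this skb, so that later XPS queue
     * selection stays stable even if the dequeue runs on another CPU.
     */
    static void record_sender_cpu(struct sk_buff *skb)
    {
            if (!skb->sender_cpu)
                    skb->sender_cpu = raw_smp_processor_id() + 1;
    }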
2015-02-04 | net: switch sockets to ->read_iter/->write_iter | Al Viro | 1 | -29/+27
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 | net/socket.c: fold do_sock_{read,write} into callers | Al Viro | 1 | -35/+21
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 | net: bury net/core/iovec.c - nothing in there is used anymore | Al Viro | 2 | -138/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 | tipc: tipc ->sendmsg() conversion | Al Viro | 2 | -7/+14
This one needs to copy the same data from user potentially more than once. Sadly, MTU changes can trigger that ;-/ Cc: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 | net: switch memcpy_fromiovec()/memcpy_fromiovecend() users to copy_from_iter() | Al Viro | 5 | -14/+12
That takes care of the majority of ->sendmsg() instances - most of them via memcpy_to_msg() or assorted getfrag() callbacks. One place where we still keep memcpy_fromiovecend() is tipc - there we potentially read the same data over and over; separate patch, that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
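[Editor's note] For context, a hedged sketch of what a memcpy_from_msg()-style wrapper over copy_from_iter() looks like; this approximates the shape of the helper rather than quoting the tree:

    #include <linux/socket.h>
    #include <linux/uio.h>
    #include <linux/errno.h>

    /* copy_from_iter() advances msg->msg_iter and returns the bytes copied. */
    static inline int memcpy_from_msg_sketch(void *data, struct msghdr *msg, int len)
    {
            return copy_from_iter(data, len, &msg->msg_iter) == len ? 0 : -EFAULT;
    }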
2015-02-04 | ip: convert tcp_sendmsg() to iov_iter primitives | Al Viro | 3 | -131/+115
patch is actually smaller than it seems to be - most of it is unindenting the inner loop body in tcp_sendmsg() itself... the bit in tcp_input.c is going to get reverted very soon - that's what memcpy_from_msg() will become, but not in this commit; let's keep it reasonably contained... There's one potentially subtle change here: in case of short copy from userland, mainline tcp_send_syn_data() discards the skb it has allocated and falls back to normal path, where we'll send as much as possible after rereading the same data again. This patch trims SYN+data skb instead - that way we don't need to copy from the same place twice. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 | ip: stash a pointer to msghdr in struct ping_fakehdr | Al Viro | 2 | -6/+4
... instead of storing its ->msg_iter.iov there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 | rxrpc: make the users of rxrpc_kernel_send_data() set kvec-backed msg_iter properly | Al Viro | 1 | -3/+0
Use iov_iter_kvec() there, get rid of set_fs() games - now that rxrpc_send_data() uses iov_iter primitives, it'll handle ITER_KVEC just fine. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 | rxrpc: switch rxrpc_send_data() to iov_iter primitives | Al Viro | 1 | -33/+10
Convert skb_add_data() to iov_iter; this allows getting rid of the explicit messing with iovec in its only caller - skb_add_data() will keep advancing ->msg_iter for us, so there's no need to simulate that manually. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 | vmci: propagate msghdr all way down to __qp_memcpy_to_queue() | Al Viro | 1 | -2/+1
Switch from passing msg->msg_iter.iov to passing msg itself Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 | ipv6: rawv6_send_hdrinc(): pass msghdr | Al Viro | 1 | -4/+3
Switch from passing msg->msg_iter.iov to passing msg itself Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 | ipv4: raw_send_hdrinc(): pass msghdr | Al Viro | 1 | -4/+3
Switch from passing msg->msg_iter.iov to passing msg itself Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 | netlink: make the check for "send from tx_ring" deterministic | Al Viro | 1 | -0/+5
As it is, zero msg_iovlen means that the first iovec in the kernel array of iovecs is left uninitialized, so checking if its ->iov_base is NULL is random. Since the real users of that thing are doing sendto(fd, NULL, 0, ...), they are getting msg_iovlen = 1 and msg_iov[0] = {NULL, 0}, which is what this test is trying to catch. As suggested by davem, let's just check that msg_iovlen was 1 and msg_iov[0].iov_base was NULL - _that_ is well-defined and it catches what we want to catch. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
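[Editor's note] A hedged sketch of the deterministic check described; the field spelling follows the 2015-era struct msghdr/iov_iter and is an assumption, not a quote of the patch:

    #include <linux/socket.h>
    #include <linux/uio.h>

    /* Only the well-defined sendto(fd, NULL, 0, ...) form is treated as a
     * request to transmit from the mmap'ed tx_ring: exactly one iovec whose
     * base is NULL.
     */
    static bool netlink_tx_is_mmaped_send(const struct msghdr *msg)
    {
            return msg->msg_iter.nr_segs == 1 &&
                   msg->msg_iter.iov->iov_base == NULL;
    }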
2015-02-03 | netlabel: Less function calls in netlbl_mgmt_add_common() after error detection | Markus Elfring | 1 | -25/+24
The functions "cipso_v4_doi_putdef" and "kfree" could be called in some cases by the netlbl_mgmt_add_common() function during error handling even if the passed variables still contained a null pointer.
* This implementation detail could be improved by adjustments for jump labels.
* Let us return immediately after the first failed function call according to the current Linux coding style convention.
* Let us also delete an unnecessary check for the variable "entry" there.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
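[Editor's note] The jump-label pattern the commit moves toward, sketched generically; the struct and the setup/teardown helpers are hypothetical stand-ins, not the actual netlabel code:

    #include <linux/slab.h>
    #include <linux/errno.h>

    struct example_entry { int dummy; };

    static int setup_part_one(struct example_entry *e) { return 0; }
    static int setup_part_two(struct example_entry *e) { return 0; }
    static void teardown_part_one(struct example_entry *e) { }

    static int add_common_sketch(void)
    {
            struct example_entry *entry;
            int ret;

            entry = kzalloc(sizeof(*entry), GFP_KERNEL);
            if (!entry)
                    return -ENOMEM;         /* fail fast: nothing to undo yet */

            ret = setup_part_one(entry);
            if (ret != 0)
                    goto free_entry;        /* undo only what was actually done */

            ret = setup_part_two(entry);
            if (ret != 0)
                    goto undo_one;

            return 0;

    undo_one:
            teardown_part_one(entry);
    free_entry:
            kfree(entry);
            return ret;
    }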
2015-02-03netlabel: Deletion of an unnecessary check before the function call "cipso_v4_doi_free"Markus Elfring1-2/+1
The cipso_v4_doi_free() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
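[Editor's note] The shape of this class of cleanup, sketched under the assumption (stated in the log) that cipso_v4_doi_free() tolerates NULL:

    #include <net/cipso_ipv4.h>

    static void release_doi(struct cipso_v4_doi *doi_def)
    {
            /* The guard removed by this cleanup looked like:
             *      if (doi_def)
             *              cipso_v4_doi_free(doi_def);
             * Since cipso_v4_doi_free() already returns early on NULL,
             * the unconditional call is equivalent and shorter:
             */
            cipso_v4_doi_free(doi_def);
    }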
2015-02-03netlabel: Deletion of an unnecessary check before the function call "cipso_v4_doi_putdef"Markus Elfring1-2/+1
The cipso_v4_doi_putdef() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-03 | net: rfkill: Add Broadcom BCM2E40 bluetooth ACPI ID | Mika Westerberg | 1 | -0/+1
This is yet another Broadcom bluetooth chip with ACPI ID BCM2E40. Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-02-03 | Bluetooth: Fix potential NULL dereference | Johan Hedberg | 1 | -4/+3
The bnep_get_device function may be triggered by an ioctl just after a connection has gone down. In such a case the respective L2CAP chan->conn pointer will get set to NULL (by l2cap_chan_del). This patch adds a missing NULL check for this case in the bnep_get_device() function. Reported-by: Patrik Flykt <patrik.flykt@linux.intel.com> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-02 | net: sctp: Deletion of an unnecessary check before the function call "kfree" | Markus Elfring | 1 | -2/+1
The kfree() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-By: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02 | ipv6: Allow for partial checksums on non-ufo packets | Vlad Yasevich | 1 | -1/+10
Currently, if we are not doing UFO on the packet, all UDP packets will start with CHECKSUM_NONE and thus perform full checksum computations in software even if the device supports IPv6 checksum offloading. Let's start with CHECKSUM_PARTIAL if the device supports it and we are sending only a single packet at or below mtu size. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
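[Editor's note] A hedged sketch of what setting up a partial (offloaded) UDP checksum over IPv6 generally looks like; this approximates the standard pattern rather than quoting the patch, and the helper name is illustrative:

    #include <linux/skbuff.h>
    #include <net/udp.h>
    #include <net/ip6_checksum.h>

    /* Let the NIC finish the checksum: seed uh->check with the IPv6
     * pseudo-header sum and record where the final checksum must be written.
     */
    static void udp6_set_partial_csum(struct sk_buff *skb,
                                      const struct in6_addr *saddr,
                                      const struct in6_addr *daddr, int len)
    {
            struct udphdr *uh = udp_hdr(skb);

            skb->ip_summed = CHECKSUM_PARTIAL;
            skb->csum_start = skb_transport_header(skb) - skb->head;
            skb->csum_offset = offsetof(struct udphdr, check);
            uh->check = ~csum_ipv6_magic(saddr, daddr, len, IPPROTO_UDP, 0);
    }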
2015-02-02 | udpv6: Add lockless sendmsg() support | Vlad Yasevich | 1 | -4/+20
This commit adds the same functionality to IPv6 that commit 903ab86d195cca295379699299c5fc10beba31c7 ("udp: Add lockless transmit path", Herbert Xu, 2011-03-01) added to IPv4. The UDP transmit path can now run without a socket lock, thus allowing multiple threads to send to a single socket more efficiently. This is only used when corking/MSG_MORE is not used. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02 | ipv6: Introduce udpv6_send_skb() | Vlad Yasevich | 1 | -27/+40
Now that we can individually construct IPv6 skbs to send, add a udpv6_send_skb() function to populate the udp header and send the skb. This allows udp_v6_push_pending_frames() to re-use this function as well as enables us to add lockless sendmsg() support. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02 | ipv6: introduce ipv6_make_skb | Vlad Yasevich | 1 | -19/+84
This commit is very similar to commit 1c32c5ad6fac8cee1a77449f5abf211e911ff830 ("inet: Add ip_make_skb and ip_finish_skb", Herbert Xu, 2011-03-01). It adds the IPv6 versions of the helpers, ip6_make_skb and ip6_finish_skb. The job of ip6_make_skb is to collect messages into an ipv6 packet and populate the ipv6 header. The job of ip6_finish_skb is to transmit the generated skb. Together they replicate the job of ip6_push_pending_frames() while also providing the capability to be called independently. This will be needed to add lockless UDP sendmsg support. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02 | ipv6: Append sending data to arbitrary queue | Vlad Yasevich | 1 | -39/+67
Add the ability to append data to arbitrary queue. This will be needed later to implement lockless UDP sends. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02 | ipv6: pull cork initialization into its own function. | Vlad Yasevich | 1 | -69/+89
Pull IPv6 cork initialization into its own function that can be re-used. IPv6-specific cork data did not have an explicit data structure. This patch creates one so that just the ipv6 cork data can be passed as arguments. Also, since IPv6 tries to save the flow label into the inet_cork_full structure, pass the full cork. Adjust ip6_cork_release() to take the cork data structures. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02 | net: dctcp: loosen requirement to assert ECT(0) during 3WHS | Florian Westphal | 1 | -9/+5
One deployment requirement of DCTCP is to be able to run in a DC setting along with TCP traffic. As Glenn Judd's NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter" [1] (tba) explains, one way to solve this on the switch side is to split DCTCP and TCP traffic into two queues per switch port based on the DSCP: one queue solely intended for DCTCP traffic and one for non-DCTCP traffic. For the DCTCP queue, there's the marking threshold K as explained in commit e3118e8359bb ("net: tcp: add DCTCP congestion control algorithm") for RED marking ECT(0) packets with CE. For the non-DCTCP queue, there's e.g. a classic tail drop queue. As already explained in e3118e8359bb, running DCTCP at scale when not marking SYN/SYN-ACK packets with ECT(0) has severe consequences: for non-ECT(0) packets, traversing the RED-marking DCTCP queue results in a severe reduction of connection probability. This is due to the DCTCP queue being dominated by ECT(0) traffic, and switches handle non-ECT traffic in the RED-marking queue after passing K as drops, where K is usually a low watermark in order to leave enough tailroom for bursts. Splitting DCTCP traffic among several queues (ECN and non-ECN queue) is considered a terrible idea in the network community as it splits single flows across multiple network paths. Therefore, commit e3118e8359bb implements this on Linux as ECT(0)-marked traffic, as we argue that marking all packets of a DCTCP flow is the only viable solution and also doesn't speak against the draft. However, recently, a DCTCP implementation for FreeBSD also hit their mainline kernel [2]. In order to let them play well together with Linux' DCTCP, we would need to loosen the requirement that ECT(0) has to be asserted during the 3WHS, as this is not implemented in FreeBSD. This simplifies the ECN test and lets DCTCP work together with FreeBSD. Joint work with Daniel Borkmann. [1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd [2] https://github.com/freebsd/freebsd/commit/8ad879445281027858a7fa706d13e458095b595f Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Glenn Judd <glenn.judd@morganstanley.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02 | net-timestamp: no-payload only sysctl | Willem de Bruijn | 3 | -1/+32
Tx timestamps are looped onto the error queue on top of an skb. This mechanism leaks packet headers to processes unless the no-payload option SOF_TIMESTAMPING_OPT_TSONLY is set. Add a sysctl that optionally drops looped timestamps with data. This only affects processes without CAP_NET_RAW. The policy is checked when timestamps are generated in the stack. It is possible for timestamps with data to be reported after the sysctl is set, if these were queued internally earlier. No vulnerability is immediately known that exploits knowledge gleaned from packet headers, but it may still be preferable to allow administrators to lock down this path at the cost of possible breakage of legacy applications. Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Changes (v1 -> v2):
- test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
Changes (rfc -> v1):
- document the sysctl in Documentation/sysctl/net.txt
- fix access control race: read .._OPT_TSONLY only once, use the same value for the permission check and skb generation.
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02 | net-timestamp: no-payload option | Willem de Bruijn | 4 | -11/+25
Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit timestamps, this loops timestamps on top of empty packets. Doing so reduces the pressure on SO_RCVBUF. Payload inspection and cmsg reception (aside from timestamps) are no longer possible. This works together with a follow-on patch that allows administrators to only allow tx timestamping if it does not loop payload or metadata. Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Changes (rfc -> v1):
- add documentation
- remove unnecessary skb->len test (thanks to Richard Cochran)
Signed-off-by: David S. Miller <davem@davemloft.net>
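[Editor's note] From the application side, the new flag is requested together with the existing SO_TIMESTAMPING bits; a minimal userspace sketch (error handling elided, constants taken from linux/net_tstamp.h):

    #include <sys/socket.h>
    #include <linux/net_tstamp.h>

    /* Request software tx timestamps, looped back without the original payload. */
    static int enable_tx_tsonly(int fd)
    {
            int val = SOF_TIMESTAMPING_TX_SOFTWARE |   /* generate tx timestamps     */
                      SOF_TIMESTAMPING_SOFTWARE |      /* report software timestamps */
                      SOF_TIMESTAMPING_OPT_TSONLY;     /* loop them on empty skbs    */

            return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
    }
    /* Timestamps are then read from the error queue via recvmsg(fd, &msg,
     * MSG_ERRQUEUE); the looped skb no longer carries packet headers/payload.
     */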
2015-02-02 | Bluetooth: Remove mgmt_rp_read_local_oob_ext_data struct | Johan Hedberg | 1 | -16/+9
This extended return parameters struct conflicts with the new Read Local OOB Extended Data command definition. To avoid the conflict simply rename the old "extended" version to the normal one and update the code appropriately to take into account the two possible response PDU sizes. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-02 | Bluetooth: Add restarting to service discovery | Jakub Pawlowski | 1 | -5/+48
When using LE_SCAN_FILTER_DUP_ENABLE, some controllers would send an advertising report from each LE device only once. That means we don't get any updates on the RSSI value, which makes Service Discovery very slow. This patch adds restarting of the scan when we are in Service Discovery and a device with a filtered uuid is found but is not yet within the RSSI range required to send an event. This way, if the device moves into range, we will quickly get an RSSI update. Signed-off-by: Jakub Pawlowski <jpawlowski@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-02 | Bluetooth: Add le_scan_restart work for LE scan restarting | Jakub Pawlowski | 2 | -1/+92
Currently there is no way to restart an LE scan, and it is needed for the service scan method. The way it works: it disables, and then re-enables, the LE scan on the controller. During the restart, we must remember when the scan was started and its duration, in order to later re-schedule the le_scan_disable work that was stopped during the stop-scan phase. Signed-off-by: Jakub Pawlowski <jpawlowski@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-01 | bridge: offload bridge port attributes to switch asic if feature flag set | Roopa Prabhu | 1 | -3/+23
This patch adds support to set/del bridge port attributes in hardware from the bridge driver. With this, when the user sends a bridge setlink message with no flags or the master flag set:
- the bridge driver ndo_bridge_setlink handler sets the settings in the kernel
- it then calls the switchdev api to propagate the attrs to the switchdev hardware
You can still use the self flag to go to the switch hw or switch port driver directly. With this, it also makes sure a notification goes out only after the attributes are set both in the kernel and in hw. The patch calls the switchdev api only if BRIDGE_FLAGS_SELF is not set. This is because the offload cases with BRIDGE_FLAGS_SELF are handled in the caller (in rtnetlink.c). Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
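[Editor's note] A hedged sketch of the control flow described above; apart from netdev_switch_port_bridge_setlink and BRIDGE_FLAGS_SELF, which the log names explicitly, the helper names and the header location are assumptions:

    #include <linux/netdevice.h>
    #include <linux/netlink.h>
    #include <linux/if_bridge.h>
    #include <net/switchdev.h>

    /* Hypothetical kernel-side application of the port attributes. */
    static int br_apply_port_attrs(struct net_device *dev, struct nlmsghdr *nlh)
    {
            return 0;
    }

    static int br_setlink_sketch(struct net_device *dev, struct nlmsghdr *nlh,
                                 u16 flags)
    {
            int err;

            /* 1) apply the attributes in the software bridge first */
            err = br_apply_port_attrs(dev, nlh);
            if (err)
                    return err;

            /* 2) BRIDGE_FLAGS_SELF offload is handled by the caller in
             *    rtnetlink.c, so only propagate to the switch asic otherwise
             */
            if (!(flags & BRIDGE_FLAGS_SELF))
                    err = netdev_switch_port_bridge_setlink(dev, nlh, flags);

            /* 3) the notification is emitted only after both steps succeed */
            return err;
    }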
2015-02-01 | swdevice: add new apis to set and del bridge port attributes | Roopa Prabhu | 1 | -0/+110
This patch adds two new apis, netdev_switch_port_bridge_setlink and netdev_switch_port_bridge_dellink, to offload bridge port attributes to the switch port. (The names of the apis look odd with 'switch_port_bridge', but I am more inclined to change the prefix of the api to something else. Will take any suggestions.) The apis look at the NETIF_F_HW_SWITCH_OFFLOAD feature flag to pass bridge port attributes to the port device. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-01 | bridge: add flags argument to ndo_bridge_setlink and ndo_bridge_dellink | Roopa Prabhu | 3 | -8/+10
Bridge flags are needed inside the ndo_bridge_setlink/dellink handlers to avoid another call to parse IFLA_AF_SPEC inside these handlers. This is used later in this series. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>