aboutsummaryrefslogtreecommitdiffstats
path: root/net (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-06-15net_sched: sch_fq: defer skb freeingEric Dumazet1-6/+13
Both fq_change() and fq_reset() can use rtnl_kfree_skbs() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net_sched: sch_codel: defer skb freeing in codel_change()Eric Dumazet1-1/+1
codel_change() can use rtnl_qdisc_drop() to defer expensive skb freeing after locks are released. codel_reset() already has support for deferred skb freeing because it uses qdisc_reset_queue() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net_sched: sch_choke: defer skb freeingEric Dumazet1-4/+4
choke_reset() and choke_change() can use rtnl_qdisc_drop() to defer expensive skb freeing after locks are released. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net_sched: add the ability to defer skb freeingEric Dumazet2-1/+23
qdisc are changed under RTNL protection and often while blocking BH and root qdisc spinlock. When lots of skbs need to be dropped, we free them under these locks causing TX/RX freezes, and more generally latency spikes. This commit adds rtnl_kfree_skbs(), used to queue skbs for deferred freeing. Actual freeing happens right after RTNL is released, with appropriate scheduling points. rtnl_qdisc_drop() can also be used in place of disc_drop() when RTNL is held. qdisc_reset_queue() and __qdisc_reset_queue() get the new behavior, so standard qdiscs like pfifo, pfifo_fast... have their ->reset() method automatically handled. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15tipc: add neighbor monitoring frameworkJon Paul Maloy10-31/+797
TIPC based clusters are by default set up with full-mesh link connectivity between all nodes. Those links are expected to provide a short failure detection time, by default set to 1500 ms. Because of this, the background load for neighbor monitoring in an N-node cluster increases with a factor N on each node, while the overall monitoring traffic through the network infrastructure increases at a ~(N * (N - 1)) rate. Experience has shown that such clusters don't scale well beyond ~100 nodes unless we significantly increase failure discovery tolerance. This commit introduces a framework and an algorithm that drastically reduces this background load, while basically maintaining the original failure detection times across the whole cluster. Using this algorithm, background load will now grow at a rate of ~(2 * sqrt(N)) per node, and at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will now have to actively monitor 38 neighbors in a 400-node cluster, instead of as before 399. This "Overlapping Ring Supervision Algorithm" is completely distributed and employs no centralized or coordinated state. It goes as follows: - Each node makes up a linearly ascending, circular list of all its N known neighbors, based on their TIPC node identity. This algorithm must be the same on all nodes. - The node then selects the next M = sqrt(N) - 1 nodes downstream from itself in the list, and chooses to actively monitor those. This is called its "local monitoring domain". - It creates a domain record describing the monitoring domain, and piggy-backs this in the data area of all neighbor monitoring messages (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in the cluster eventually (default within 400 ms) will learn about its monitoring domain. - Whenever a node discovers a change in its local domain, e.g., a node has been added or has gone down, it creates and sends out a new version of its node record to inform all neighbors about the change. - A node receiving a domain record from anybody outside its local domain matches this against its own list (which may not look the same), and chooses to not actively monitor those members of the received domain record that are also present in its own list. Instead, it relies on indications from the direct monitoring nodes if an indirectly monitored node has gone up or down. If a node is indicated lost, the receiving node temporarily activates its own direct monitoring towards that node in order to confirm, or not, that it is actually gone. - Since each node is actively monitoring sqrt(N) downstream neighbors, each node is also actively monitored by the same number of upstream neighbors. This means that all non-direct monitoring nodes normally will receive sqrt(N) indications that a node is gone. - A major drawback with ring monitoring is how it handles failures that cause massive network partitionings. If both a lost node and all its direct monitoring neighbors are inside the lost partition, the nodes in the remaining partition will never receive indications about the loss. To overcome this, each node also chooses to actively monitor some nodes outside its local domain. Those nodes are called remote domain "heads", and are selected in such a way that no node in the cluster will be more than two direct monitoring hops away. Because of this, each node, apart from monitoring the member of its local domain, will also typically monitor sqrt(N) remote head nodes. - As an optimization, local list status, domain status and domain records are marked with a generation number. This saves senders from unnecessarily conveying unaltered domain records, and receivers from performing unneeded re-adaptations of their node monitoring list, such as re-assigning domain heads. - As a measure of caution we have added the possibility to disable the new algorithm through configuration. We do this by keeping a threshold value for the cluster size; a cluster that grows beyond this value will switch from full-mesh to ring monitoring, and vice versa when it shrinks below the value. This means that if the threshold is set to a value larger than any anticipated cluster size (default size is 32) the new algorithm is effectively disabled. A patch set for altering the threshold value and for listing the table contents will follow shortly. - This change is fully backwards compatible. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net_sched: make tcf_hash_check() booleanWANG Cong7-11/+16
Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net: vrf: Handle ipv6 multicast and link-local addressesDavid Ahern2-3/+4
IPv6 multicast and link-local addresses require special handling by the VRF driver: 1. Rather than using the VRF device index and full FIB lookups, packets to/from these addresses should use direct FIB lookups based on the VRF device table. 2. fail sends/receives on a VRF device to/from a multicast address (e.g, make ping6 ff02::1%<vrf> fail) 3. move the setting of the flow oif to the first dst lookup and revert the change in icmpv6_echo_reply made in ca254490c8dfd ("net: Add VRF support to IPv6 stack"). Linklocal/mcast addresses require use of the skb->dev. With this change connections into and out of a VRF enslaved device work for multicast and link-local addresses work (icmp, tcp, and udp) e.g., 1. packets into VM with VRF config: ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1 ping6 -c3 ff02::1%br1 ssh -6 fe80::e0:f9ff:fe1c:b974%br1 2. packets going out a VRF enslaved device: ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1 ping6 -c3 ff02::1%eth1 ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1 Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net: ipv6: Do not add multicast route for l3 master devicesDavid Ahern1-1/+1
L3 master devices are virtual devices similar to the loopback device. Link local and multicast routes for these devices do not make sense. The ipv6 addrconf code already skips adding a linklocal address; do the same for the mcast route. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net: l3mdev: Remove const from flowi6 arg to get_rt6_dstDavid Ahern1-1/+1
Allow drivers to pass flow arg to functions where the arg is not const and allow the driver to make updates as needed (eg., setting oif). Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15af_iucv: use paged SKBs for big inbound messagesEugene Crosser1-6/+50
When an inbound message is bigger than a page, allocate a paged SKB, and subsequently use IUCV receive primitive with IPBUFLST flag. This relaxes the pressure to allocate big contiguous kernel buffers. Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15af_iucv: remove fragment_skb() to use paged SKBsEugene Crosser1-56/+3
Before introducing paged skbs in the receive path, get rid of the function `iucv_fragment_skb()` that replaces one large linear skb with several smaller linear skbs. Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15af_iucv: use paged SKBs for big outbound messagesEugene Crosser1-47/+77
When an outbound message is bigger than a page, allocate and fill a paged SKB, and subsequently use IUCV send primitive with IPBUFLST flag. This relaxes the pressure to allocate big contiguous kernel buffers. Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15act_police: rename tcf_act_police_locate() to tcf_act_police_init()WANG Cong1-4/+4
This function is just ->init(), rename it to make it obvious. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net_sched: remove internal use of TC_POLICE_*WANG Cong2-3/+3
These should be gone when we removed CONFIG_NET_CLS_POLICE. We can not totally remove them since they are exposed to userspace. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Update rds_conn_destroy to be MP capableSowmini Varadhan1-20/+39
Refactor rds_conn_destroy() so that the per-path dismantling is done in rds_conn_path_destroy, and then iterate as needed over rds_conn_path_destroy(). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Update rds_conn_shutdown to work with rds_conn_pathSowmini Varadhan5-38/+44
This commit changes rds_conn_shutdown to take a rds_conn_path * argument, allowing it to shutdown paths other than c_path[0] for MP-capable transports. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Initialize all RDS_MPATH_WORKERS in __rds_conn_createSowmini Varadhan1-20/+45
Add a for() loop in __rds_conn_create to initialize all the conn_paths, in preparate for MP capable transports. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Add rds_conn_path_error()Sowmini Varadhan3-1/+18
rds_conn_path_error() is the MP-aware analog of rds_conn_error, to be used by multipath-capable callers. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: update rds-info related functions to traverse multiple conn_pathsSowmini Varadhan1-27/+82
This commit updates the callbacks related to the rds-info command so that they walk through all the rds_conn_path structures and report the requested info. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Add rds_conn_path_connect_if_down() for MP-aware callersSowmini Varadhan3-8/+14
rds_conn_path_connect_if_down() works on the rds_conn_path that it is passed. Callers who are not t_m_capable may continue calling rds_conn_connect_if_down, which will invoke rds_conn_path_connect_if_down() with the default c_path[0]. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Make rds_send_pong() take a rds_conn_path argumentSowmini Varadhan3-14/+14
This commit allows rds_send_pong() callers to send back the rds pong message on some path other than c_path[0] by passing in a struct rds_conn_path * argument. It also removes the last dependency on the #defines in rds_single.h from send.c Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Extract rds_conn_path from i_conn_path in rds_send_drop_to() for MP-capable transportsSowmini Varadhan1-3/+8
Explicitly set up rds_conn_path, either from i_conn_path (for MP capable transpots) or as c_path[0], and use this in rds_send_drop_to() Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Pass rds_conn_path to rds_send_xmit()Sowmini Varadhan4-70/+87
Pass a struct rds_conn_path to rds_send_xmit so that MP capable transports can transmit packets on something other than c_path[0]. The eventual goal for MP capable transports is to hash the rds socket to a path based on the bound local address/port, and use this path as the argument to rds_send_xmit() Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Make rds_send_queue_rm() rds_conn_path awareSowmini Varadhan1-6/+11
Pass the rds_conn_path to rds_send_queue_rm, and use it to initialize the i_conn_path field in struct rds_incoming. This commit also makes rds_send_queue_rm() MP capable, because it now takes locks specific to the rds_conn_path passed in, instead of defaulting to the c_path[0] based defines from rds_single_path.h Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Remove stale function rds_send_get_message()Sowmini Varadhan2-38/+0
The only caller of rds_send_get_message() was rds_iw_send_cq_comp_handler() which was removed as part of commit dcdede0406d3 ("RDS: Drop stale iWARP RDMA transport"), so remove rds_send_get_message() for the same reason. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Add rds_send_path_drop_acked()Sowmini Varadhan2-5/+15
rds_send_path_drop_acked() is the path-specific version of rds_send_drop_acked() to be invoked by MP capable callers. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: Add rds_send_path_reset()Sowmini Varadhan1-17/+22
rds_send_path_reset() is the path specific version of rds_send_reset() intended for MP capable callers. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: rds_inc_path_init() helper function for MP capable transportsSowmini Varadhan2-0/+16
t_mp_capable transports can use rds_inc_path_init to initialize all fields in struct rds_incoming, including the i_conn_path. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: recv path gets the conn_path from rds_incoming for MP capable transportsSowmini Varadhan2-4/+9
Transports that are t_mp_capable should set the rds_conn_path on which the datagram was recived in the ->i_conn_path field of struct rds_incoming. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: add t_mp_capable bit to be set by MP capable transportsSowmini Varadhan1-1/+6
The t_mp_capable bit will be used in the core rds module to support multipathing logic when the transport supports it. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14RDS: split out connection specific state from rds_connection to rds_conn_pathSowmini Varadhan19-93/+199
In preparation for multipath RDS, split the rds_connection structure into a base structure, and a per-path struct rds_conn_path. The base structure tracks information and locks common to all paths. The workqs for send/recv/shutdown etc are tracked per rds_conn_path. Thus the workq callbacks now work with rds_conn_path. This commit allows for one rds_conn_path per rds_connection, and will be extended into multiple conn_paths in subsequent commits. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14tcp: return sizeof tcp_dctcp_info in dctcp_get_info()Neal Cardwell1-2/+2
Make sure that dctcp_get_info() returns only the size of the info->dctcp struct that it zeroes out and fills in. Previously it had been returning the size of the enclosing tcp_cc_info union, sizeof(*info). There is no problem yet, but that union that may one day be larger than struct tcp_dctcp_info, in which case the TCP_CC_INFO code might accidentally copy uninitialized bytes from the stack. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14sctp: fix error return code in sctp_init()Wei Yongjun1-1/+2
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14Merge tag 'rxrpc-rewrite-20160613' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fsDavid S. Miller18-86/+86
David Howells says: ==================== rxrpc: Rename rxrpc source files Here's the next part of the AF_RXRPC rewrite. In this set I rename some of the files in the net/rxrpc/ directory and adjust the Makefile and ar-internal.h to reflect the changes. The aim is twofold: (1) Remove the "ar-" prefix on those files that have it as it's not really useful, especially now that I'm building rxkad in. (2) To aid splitting the local, peer, connection and call handling code into separate files for object and event handling in future patches by making it easier to come up with new filenames. There are two commits: (1) The first commit does a bunch of renames of .c files and alters the Makefile. ar-internal.h isn't renamed at this time to avoid having to change the contents of the files being renamed. (2) The second commit changes the section label comments in ar-internal.h to reflect the changed filenames and reorders the file so that the sections are back in filename order. The patches can be found here also: http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite Tagged thusly: git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git rxrpc-rewrite-20160613 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14net/sched: flower: Return error when hw can't offload and skip_sw is setAmir Vadai1-17/+25
When skip_sw is set and hardware fails to apply filter, return error to user. This will make error propagation logic similar to the one currently used in u32 classifier. Also, changed code to use tc_skip_sw() utility function. Signed-off-by: Amir Vadai <amirva@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13rxrpc: Update the comments in ar-internal.h to reflect renamesDavid Howells1-69/+69
Update the section comments in ar-internal.h that indicate the locations of the referenced items to reflect the renames done to the .c files in net/rxrpc/. This also involves some rearrangement to reflect keep the sections in order of filename. Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-13rxrpc: Rename files matching ar-*.c to git rid of the "ar-" prefixDavid Howells17-17/+17
Rename files matching net/rxrpc/ar-*.c to get rid of the "ar-" prefix. This will aid splitting those files by making easier to come up with new names. Note that the not all files are simply renamed from ar-X.c to X.c. The following exceptions are made: (*) ar-call.c -> call_object.c ar-ack.c -> call_event.c call_object.c is going to contain the core of the call object handling. Call event handling is all going to be in call_event.c. (*) ar-accept.c -> call_accept.c Incoming call handling is going to be here. (*) ar-connection.c -> conn_object.c ar-connevent.c -> conn_event.c The former file is going to have the basic connection object handling, but there will likely be some differentiation between client connections and service connections in additional files later. The latter file will have all the connection-level event handling. (*) ar-local.c -> local_object.c This will have the local endpoint object handling code. The local endpoint event handling code will later be split out into local_event.c. (*) ar-peer.c -> peer_object.c This will have the peer endpoint object handling code. Peer event handling code will be placed in peer_event.c (for the moment, there is none). (*) ar-error.c -> peer_event.c This will become the peer event handling code, though for the moment it's actually driven from the local endpoint's perspective. Note that I haven't renamed ar-transport.c to transport_object.c as the intention is to delete it when the rxrpc_transport struct is excised. The only file that actually has its contents changed is net/rxrpc/Makefile. net/rxrpc/ar-internal.h will need its section marker comments updating, but I'll do that in a separate patch to make it easier for git to follow the history across the rename. I may also want to rename ar-internal.h at some point - but that would mean updating all the #includes and I'd rather do that in a separate step. Signed-off-by: David Howells <dhowells@redhat.com.
2016-06-12sched: remove NET_XMIT_POLICEDFlorian Westphal5-7/+4
sch_atm returns this when TC_ACT_SHOT classification occurs. But all other schedulers that use tc_classify (htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS in this case so just do that in atm. BATMAN uses it as an intermediate return value to signal forwarding vs. buffering, but it did not return POLICED to callers outside of BATMAN. Reviewed-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-11ipv6: use TOS marks from sockets for routing decisionHannes Frederic Sowa6-11/+23
In IPv6 the ToS values are part of the flowlabel in flowi6 and get extracted during fib rule lookup, but we forgot to correctly initialize the flowlabel before the routing lookup. Reported-by: <liam.mcbirnie@boeing.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10net_sched: remove generic throttled managementEric Dumazet7-17/+4
__QDISC_STATE_THROTTLED bit manipulation is rather expensive for HTB and few others. I already removed it for sch_fq in commit f2600cf02b5b ("net: sched: avoid costly atomic operation in fq_dequeue()") and so far nobody complained. When one ore more packets are stuck in one or more throttled HTB class, a htb dequeue() performs two atomic operations to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc lock is held. Removing this pair of atomic operations bring me a 8 % performance increase on 200 TCP_RR tests, in presence of throttled classes. This patch has no side effect, since nothing actually uses disc_is_throttled() anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10net_sched: netem: remove qdisc_is_throttled() useEric Dumazet1-3/+0
Looks like it is only there as some optimization attempt. Since __QDISC_STATE_THROTTLED set/unset is way too expensive, and netem is the last user, just remove this check. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10net_sched: cbq: remove a flaky use of qdisc_is_throttled()Eric Dumazet1-1/+1
So far no qdisc ever unset the throttled bit at enqueue() time, so CBQ usage of qdisc_is_throttled() was flaky. Since __QDISC_STATE_THROTTLED set/unset is way too expensive considering that only CBQ was eventually caring for this status, it would make sense to implement a Qdisc ops ->is_throttled() if we find that this is needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10net_sched: sch_plug: use a private throttled statusEric Dumazet1-6/+8
We want to get rid of generic qdisc throttled management, so this qdisc has to use a private flag. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10sctp: sctp should change socket state when shutdown is receivedXin Long2-3/+9
Now sctp doesn't change socket state upon shutdown reception. It changes just the assoc state, even though it's a TCP-style socket. For some cases, if we really need to check sk->sk_state, it's necessary to fix this issue, at least when we use ss or netstat to dump, we can get a more exact information. As an improvement, we will change sk->sk_state when we change asoc->state to SHUTDOWN_RECEIVED, and also do it in sctp_shutdown to keep consistent with sctp_close. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10Merge tag 'mac80211-next-for-davem-2016-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-nextDavid S. Miller15-199/+745
Johannes Berg says: ==================== For the next cycle, we have the following: * the biggest change is MichaƂ's work on integrating FQ/codel with the mac80211 internal software queues * cfg80211 connect result gets clarified for the "no connection at all" case * advertisement of per-interface type capabilities, in case they differ (which makes a lot of sense for some capabilities) * most of the nl80211 & hwsim unprivileged namespace operation changes * human-readable VHT capabilities in debugfs * some other cleanups, like spelling ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10tcp: add NV congestion controlLawrence Brakmo3-0/+493
TCP-NV (New Vegas) is a major update to TCP-Vegas. An earlier version of NV was presented at 2010's LPC. It is a delayed based congestion avoidance for the data center. This version has been tested within a 10G rack where the HW RTTs are 20-50us and with 1 to 400 flows. A description of TCP-NV, including implementation details as well as experimental results, can be found at: http://www.brakmo.org/networking/tcp-nv/TCPNV.html Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10tcp: add in_flight to tcp_skb_cbLawrence Brakmo2-2/+7
Add in_flight (bytes in flight when packet was sent) field to tx component of tcp_skb_cb and make it available to congestion modules' pkts_acked() function through the ack_sample function argument. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10packet: use common code for virtio_net_hdr and skb GSO conversionMike Rapoport1-34/+2
Replace open coded conversion between virtio_net_hdr to skb GSO info with virtio_net_hdr_from_skb Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10RDS: IB: Remove deprecated create_workqueueBhaktipriya Shridhar1-1/+1
alloc_workqueue replaces deprecated create_workqueue(). Since the driver is infiniband which can be used as block device and the workqueue seems involved in regular operation of the device, so a dedicated workqueue has been used with WQ_MEM_RECLAIM set to guarantee forward progress under memory pressure. Since there are only a fixed number of work items, explicit concurrency limit is unnecessary here. Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10rxrpc: Limit the listening backlogDavid Howells4-8/+28
Limit the socket incoming call backlog queue size so that a remote client can't pump in sufficient new calls that the server runs out of memory. Note that this is partially theoretical at the moment since whilst the number of calls is limited, the number of packets trying to set up new calls is not. This will be addressed in a later patch. If the caller of listen() specifies a backlog INT_MAX, then they get the current maximum; anything else greater than max_backlog or anything negative incurs EINVAL. The limit on the maximum queue size can be set by: echo N >/proc/sys/net/rxrpc/max_backlog where 4<=N<=32. Further, set the default backlog to 0, requiring listen() to be called before we start actually queueing new calls. Whilst this kind of is a change in the UAPI, the caller can't actually *accept* new calls anyway unless they've first called listen() to put the socket into the LISTENING state - thus the aforementioned new calls would otherwise just sit there, eating up kernel memory. (Note that sockets that don't have a non-zero service ID bound don't get incoming calls anyway.) Given that the default backlog is now 0, make the AFS filesystem call kernel_listen() to set the maximum backlog for itself. Possible improvements include: (1) Trimming a too-large backlog to max_backlog when listen is called. (2) Trimming the backlog value whenever the value is used so that changes to max_backlog are applied to an open socket automatically. Note that the AFS filesystem opens one socket and keeps it open for extended periods, so would miss out on changes to max_backlog. (3) Having a separate setting for the AFS filesystem. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>