aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2020-01-27rxrpc: Fix use-after-free in rxrpc_receive_data()David Howells1-5/+7
The subpacket scanning loop in rxrpc_receive_data() references the subpacket count in the private data part of the sk_buff in the loop termination condition. However, when the final subpacket is pasted into the ring buffer, the function is no longer has a ref on the sk_buff and should not be looking at sp->* any more. This point is actually marked in the code when skb is cleared (but sp is not - which is an error). Fix this by caching sp->nr_subpackets in a local variable and using that instead. Also clear 'sp' to catch accesses after that point. This can show up as an oops in rxrpc_get_skb() if sp->nr_subpackets gets trashed by the sk_buff getting freed and reused in the meantime. Fixes: e2de6c404898 ("rxrpc: Use info in skbuff instead of reparsing a jumbo packet") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net_sched: ematch: reject invalid TCF_EM_SIMPLEEric Dumazet1-0/+3
It is possible for malicious userspace to set TCF_EM_SIMPLE bit even for matches that should not have this bit set. This can fool two places using tcf_em_is_simple() 1) tcf_em_tree_destroy() -> memory leak of em->data if ops->destroy() is NULL 2) tcf_em_tree_dump() wrongly report/leak 4 low-order bytes of a kernel pointer. BUG: memory leak unreferenced object 0xffff888121850a40 (size 32): comm "syz-executor927", pid 7193, jiffies 4294941655 (age 19.840s) hex dump (first 32 bytes): 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000f67036ea>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline] [<00000000f67036ea>] slab_post_alloc_hook mm/slab.h:586 [inline] [<00000000f67036ea>] slab_alloc mm/slab.c:3320 [inline] [<00000000f67036ea>] __do_kmalloc mm/slab.c:3654 [inline] [<00000000f67036ea>] __kmalloc_track_caller+0x165/0x300 mm/slab.c:3671 [<00000000fab0cc8e>] kmemdup+0x27/0x60 mm/util.c:127 [<00000000d9992e0a>] kmemdup include/linux/string.h:453 [inline] [<00000000d9992e0a>] em_nbyte_change+0x5b/0x90 net/sched/em_nbyte.c:32 [<000000007e04f711>] tcf_em_validate net/sched/ematch.c:241 [inline] [<000000007e04f711>] tcf_em_tree_validate net/sched/ematch.c:359 [inline] [<000000007e04f711>] tcf_em_tree_validate+0x332/0x46f net/sched/ematch.c:300 [<000000007a769204>] basic_set_parms net/sched/cls_basic.c:157 [inline] [<000000007a769204>] basic_change+0x1d7/0x5f0 net/sched/cls_basic.c:219 [<00000000e57a5997>] tc_new_tfilter+0x566/0xf70 net/sched/cls_api.c:2104 [<0000000074b68559>] rtnetlink_rcv_msg+0x3b2/0x4b0 net/core/rtnetlink.c:5415 [<00000000b7fe53fb>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477 [<00000000e83a40d0>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442 [<00000000d62ba933>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] [<00000000d62ba933>] netlink_unicast+0x223/0x310 net/netlink/af_netlink.c:1328 [<0000000088070f72>] netlink_sendmsg+0x2c0/0x570 net/netlink/af_netlink.c:1917 [<00000000f70b15ea>] sock_sendmsg_nosec net/socket.c:639 [inline] [<00000000f70b15ea>] sock_sendmsg+0x54/0x70 net/socket.c:659 [<00000000ef95a9be>] ____sys_sendmsg+0x2d0/0x300 net/socket.c:2330 [<00000000b650f1ab>] ___sys_sendmsg+0x8a/0xd0 net/socket.c:2384 [<0000000055bfa74a>] __sys_sendmsg+0x80/0xf0 net/socket.c:2417 [<000000002abac183>] __do_sys_sendmsg net/socket.c:2426 [inline] [<000000002abac183>] __se_sys_sendmsg net/socket.c:2424 [inline] [<000000002abac183>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2424 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+03c4738ed29d5d366ddf@syzkaller.appspotmail.com Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net: include struct nhmsg size in nh nlmsg sizeStephen Worley1-1/+3
Include the size of struct nhmsg size when calculating how much of a payload to allocate in a new netlink nexthop notification message. Without this, we will fail to fill the skbuff at certain nexthop group sizes. You can reproduce the failure with the following iproute2 commands: ip link add dummy1 type dummy ip link add dummy2 type dummy ip link add dummy3 type dummy ip link add dummy4 type dummy ip link add dummy5 type dummy ip link add dummy6 type dummy ip link add dummy7 type dummy ip link add dummy8 type dummy ip link add dummy9 type dummy ip link add dummy10 type dummy ip link add dummy11 type dummy ip link add dummy12 type dummy ip link add dummy13 type dummy ip link add dummy14 type dummy ip link add dummy15 type dummy ip link add dummy16 type dummy ip link add dummy17 type dummy ip link add dummy18 type dummy ip link add dummy19 type dummy ip ro add 1.1.1.1/32 dev dummy1 ip ro add 1.1.1.2/32 dev dummy2 ip ro add 1.1.1.3/32 dev dummy3 ip ro add 1.1.1.4/32 dev dummy4 ip ro add 1.1.1.5/32 dev dummy5 ip ro add 1.1.1.6/32 dev dummy6 ip ro add 1.1.1.7/32 dev dummy7 ip ro add 1.1.1.8/32 dev dummy8 ip ro add 1.1.1.9/32 dev dummy9 ip ro add 1.1.1.10/32 dev dummy10 ip ro add 1.1.1.11/32 dev dummy11 ip ro add 1.1.1.12/32 dev dummy12 ip ro add 1.1.1.13/32 dev dummy13 ip ro add 1.1.1.14/32 dev dummy14 ip ro add 1.1.1.15/32 dev dummy15 ip ro add 1.1.1.16/32 dev dummy16 ip ro add 1.1.1.17/32 dev dummy17 ip ro add 1.1.1.18/32 dev dummy18 ip ro add 1.1.1.19/32 dev dummy19 ip next add id 1 via 1.1.1.1 dev dummy1 ip next add id 2 via 1.1.1.2 dev dummy2 ip next add id 3 via 1.1.1.3 dev dummy3 ip next add id 4 via 1.1.1.4 dev dummy4 ip next add id 5 via 1.1.1.5 dev dummy5 ip next add id 6 via 1.1.1.6 dev dummy6 ip next add id 7 via 1.1.1.7 dev dummy7 ip next add id 8 via 1.1.1.8 dev dummy8 ip next add id 9 via 1.1.1.9 dev dummy9 ip next add id 10 via 1.1.1.10 dev dummy10 ip next add id 11 via 1.1.1.11 dev dummy11 ip next add id 12 via 1.1.1.12 dev dummy12 ip next add id 13 via 1.1.1.13 dev dummy13 ip next add id 14 via 1.1.1.14 dev dummy14 ip next add id 15 via 1.1.1.15 dev dummy15 ip next add id 16 via 1.1.1.16 dev dummy16 ip next add id 17 via 1.1.1.17 dev dummy17 ip next add id 18 via 1.1.1.18 dev dummy18 ip next add id 19 via 1.1.1.19 dev dummy19 ip next add id 1111 group 1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19 ip next del id 1111 Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net_sched: walk through all child classes in tc_bind_tclass()Cong Wang1-11/+30
In a complex TC class hierarchy like this: tc qdisc add dev eth0 root handle 1:0 cbq bandwidth 100Mbit \ avpkt 1000 cell 8 tc class add dev eth0 parent 1:0 classid 1:1 cbq bandwidth 100Mbit \ rate 6Mbit weight 0.6Mbit prio 8 allot 1514 cell 8 maxburst 20 \ avpkt 1000 bounded tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \ sport 80 0xffff flowid 1:3 tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \ sport 25 0xffff flowid 1:4 tc class add dev eth0 parent 1:1 classid 1:3 cbq bandwidth 100Mbit \ rate 5Mbit weight 0.5Mbit prio 5 allot 1514 cell 8 maxburst 20 \ avpkt 1000 tc class add dev eth0 parent 1:1 classid 1:4 cbq bandwidth 100Mbit \ rate 3Mbit weight 0.3Mbit prio 5 allot 1514 cell 8 maxburst 20 \ avpkt 1000 where filters are installed on qdisc 1:0, so we can't merely search from class 1:1 when creating class 1:3 and class 1:4. We have to walk through all the child classes of the direct parent qdisc. Otherwise we would miss filters those need reverse binding. Fixes: 07d79fc7d94e ("net_sched: add reverse binding for tc class") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net_sched: fix ops->bind_class() implementationsCong Wang12-44/+97
The current implementations of ops->bind_class() are merely searching for classid and updating class in the struct tcf_result, without invoking either of cl_ops->bind_tcf() or cl_ops->unbind_tcf(). This breaks the design of them as qdisc's like cbq use them to count filters too. This is why syzbot triggered the warning in cbq_destroy_class(). In order to fix this, we have to call cl_ops->bind_tcf() and cl_ops->unbind_tcf() like the filter binding path. This patch does so by refactoring out two helper functions __tcf_bind_filter() and __tcf_unbind_filter(), which are lockless and accept a Qdisc pointer, then teaching each implementation to call them correctly. Note, we merely pass the Qdisc pointer as an opaque pointer to each filter, they only need to pass it down to the helper functions without understanding it at all. Fixes: 07d79fc7d94e ("net_sched: add reverse binding for tc class") Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller13-70/+3896
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next This batch contains Netfilter updates for net-next: 1) Add nft_setelem_parse_key() helper function. 2) Add NFTA_SET_ELEM_KEY_END to specify a range with one single element. 3) Add NFTA_SET_DESC_CONCAT to describe the set element concatenation, from Stefano Brivio. 4) Add bitmap_cut() to copy n-bits from source to destination, from Stefano Brivio. 5) Add set to match on arbitrary concatenations, from Stefano Brivio. 6) Add selftest for this new set type. An extract of Stefano's description follows: "Existing nftables set implementations allow matching entries with interval expressions (rbtree), e.g. 192.0.2.1-192.0.2.4, entries specifying field concatenation (hash, rhash), e.g. 192.0.2.1:22, but not both. In other words, none of the set types allows matching on range expressions for more than one packet field at a time, such as ipset does with types bitmap:ip,mac, and, to a more limited extent (netmasks, not arbitrary ranges), with types hash:net,net, hash:net,port, hash:ip,port,net, and hash:net,port,net. As a pure hash-based approach is unsuitable for matching on ranges, and "proxying" the existing red-black tree type looks impractical as elements would need to be shared and managed across all employed trees, this new set implementation intends to fill the functionality gap by employing a relatively novel approach. The fundamental idea, illustrated in deeper detail in patch 5/9, is to use lookup tables classifying a small number of grouped bits from each field, and map the lookup results in a way that yields a verdict for the full set of specified fields. The grouping bit aspect is loosely inspired by the Grouper algorithm, by Jay Ligatti, Josh Kuhn, and Chris Gage (see patch 5/9 for the full reference). A reference, stand-alone implementation of the algorithm itself is available at: https://pipapo.lameexcu.se Some notes about possible future optimisations are also mentioned there. This algorithm reduces the matching problem to, essentially, a repetitive sequence of simple bitwise operations, and is particularly suitable to be optimised by leveraging SIMD instruction sets." ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27selftests: netfilter: Introduce tests for sets with range concatenationStefano Brivio2-1/+1483
This test covers functionality and stability of the newly added nftables set implementation supporting concatenation of ranged fields. For some selected set expression types, test: - correctness, by checking that packets match or don't - concurrency, by attempting races between insertion, deletion, lookup - timeout feature, checking that packets don't match expired entries and (roughly) estimate matching rates, comparing to baselines for simple drop on netdev ingress hook and for hash and rbtrees sets. In order to send packets, this needs one of sendip, netcat or bash. To flood with traffic, iperf3, iperf and netperf are supported. For performance measurements, this relies on the sample pktgen script pktgen_bench_xmit_mode_netif_receive.sh. If none of the tools suitable for a given test are available, specific tests will be skipped. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27nf_tables: Add set type for arbitrary concatenation of rangesStefano Brivio4-1/+2107
This new set type allows for intervals in concatenated fields, which are expressed in the usual way, that is, simple byte concatenation with padding to 32 bits for single fields, and given as ranges by specifying start and end elements containing, each, the full concatenation of start and end values for the single fields. Ranges are expanded to composing netmasks, for each field: these are inserted as rules in per-field lookup tables. Bits to be classified are divided in 4-bit groups, and for each group, the lookup table contains 4^2 buckets, representing all the possible values of a bit group. This approach was inspired by the Grouper algorithm: http://www.cse.usf.edu/~ligatti/projects/grouper/ Matching is performed by a sequence of AND operations between bucket values, with buckets selected according to the value of packet bits, for each group. The result of this sequence tells us which rules matched for a given field. In order to concatenate several ranged fields, per-field rules are mapped using mapping arrays, one per field, that specify which rules should be considered while matching the next field. The mapping array for the last field contains a reference to the element originally inserted. The notes in nft_set_pipapo.c cover the algorithm in deeper detail. A pure hash-based approach is of no use here, as ranges need to be classified. An implementation based on "proxying" the existing red-black tree set type, creating a tree for each field, was considered, but deemed impractical due to the fact that elements would need to be shared between trees, at least as long as we want to keep UAPI changes to a minimum. A stand-alone implementation of this algorithm is available at: https://pipapo.lameexcu.se together with notes about possible future optimisations (in pipapo.c). This algorithm was designed with data locality in mind, and can be highly optimised for SIMD instruction sets, as the bulk of the matching work is done with repetitive, simple bitwise operations. At this point, without further optimisations, nft_concat_range.sh reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB L2$): TEST: performance net,port [ OK ] baseline (drop from netdev hook): 10190076pps baseline hash (non-ranged entries): 6179564pps baseline rbtree (match on first field only): 2950341pps set with 1000 full, ranged entries: 2304165pps port,net [ OK ] baseline (drop from netdev hook): 10143615pps baseline hash (non-ranged entries): 6135776pps baseline rbtree (match on first field only): 4311934pps set with 100 full, ranged entries: 4131471pps net6,port [ OK ] baseline (drop from netdev hook): 9730404pps baseline hash (non-ranged entries): 4809557pps baseline rbtree (match on first field only): 1501699pps set with 1000 full, ranged entries: 1092557pps port,proto [ OK ] baseline (drop from netdev hook): 10812426pps baseline hash (non-ranged entries): 6929353pps baseline rbtree (match on first field only): 3027105pps set with 30000 full, ranged entries: 284147pps net6,port,mac [ OK ] baseline (drop from netdev hook): 9660114pps baseline hash (non-ranged entries): 3778877pps baseline rbtree (match on first field only): 3179379pps set with 10 full, ranged entries: 2082880pps net6,port,mac,proto [ OK ] baseline (drop from netdev hook): 9718324pps baseline hash (non-ranged entries): 3799021pps baseline rbtree (match on first field only): 1506689pps set with 1000 full, ranged entries: 783810pps net,mac [ OK ] baseline (drop from netdev hook): 10190029pps baseline hash (non-ranged entries): 5172218pps baseline rbtree (match on first field only): 2946863pps set with 1000 full, ranged entries: 1279122pps v4: - fix build for 32-bit architectures: 64-bit division needs div_u64() (kbuild test robot <lkp@intel.com>) v3: - rework interface for field length specification, NFT_SET_SUBKEY disappears and information is stored in description - remove scratch area to store closing element of ranges, as elements now come with an actual attribute to specify the upper range limit (Pablo Neira Ayuso) - also remove pointer to 'start' element from mapping table, closing key is now accessible via extension data - use bytes right away instead of bits for field lengths, this way we can also double the inner loop of the lookup function to take care of upper and lower bits in a single iteration (minor performance improvement) - make it clearer that set operations are actually atomic API-wise, but we can't e.g. implement flush() as one-shot action - fix type for 'dup' in nft_pipapo_insert(), check for duplicates only in the next generation, and in general take care of differentiating generation mask cases depending on the operation (Pablo Neira Ayuso) - report C implementation matching rate in commit message, so that AVX2 implementation can be compared (Pablo Neira Ayuso) v2: - protect access to scratch maps in nft_pipapo_lookup() with local_bh_disable/enable() (Florian Westphal) - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's already implied (Florian Westphal) - explain why partial allocation failures don't need handling in pipapo_realloc_scratch(), rename 'm' to clone and update related kerneldoc to make it clear we're not operating on the live copy (Florian Westphal) - add expicit check for priv->start_elem in nft_pipapo_insert() to avoid ending up in nft_pipapo_walk() with a NULL start element, and also zero it out in every operation that might make it invalid, so that insertion doesn't proceed with an invalid element (Florian Westphal) Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27bitmap: Introduce bitmap_cut(): cut bits and shift remainingStefano Brivio2-0/+70
The new bitmap function bitmap_cut() copies bits from source to destination by removing the region specified by parameters first and cut, and remapping the bits above the cut region by right shifting them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27netfilter: nf_tables: Support for sets with multiple ranged fieldsStefano Brivio4-1/+115
Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used to specify the length of each field in a set concatenation. This allows set implementations to support concatenation of multiple ranged items, as they can divide the input key into matching data for every single field. Such set implementations would be selected as they specify support for NFT_SET_INTERVAL and allow desc->field_count to be greater than one. Explicitly disallow this for nft_set_rbtree. In order to specify the interval for a set entry, userspace would include in NFTA_SET_DESC_CONCAT attributes field lengths, and pass range endpoints as two separate keys, represented by attributes NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END. While at it, export the number of 32-bit registers available for packet matching, as nftables will need this to know the maximum number of field lengths that can be specified. For example, "packets with an IPv4 address between 192.0.2.0 and 192.0.2.42, with destination port between 22 and 25", can be expressed as two concatenated elements: NFTA_SET_ELEM_KEY: 192.0.2.0 . 22 NFTA_SET_ELEM_KEY_END: 192.0.2.42 . 25 and NFTA_SET_DESC_CONCAT attribute would contain: NFTA_LIST_ELEM NFTA_SET_FIELD_LEN: 4 NFTA_LIST_ELEM NFTA_SET_FIELD_LEN: 2 v4: No changes v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY v2: No changes Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attributePablo Neira Ayuso4-24/+79
Add NFTA_SET_ELEM_KEY_END attribute to convey the closing element of the interval between kernel and userspace. This patch also adds the NFT_SET_EXT_KEY_END extension to store the closing element value in this interval. v4: No changes v3: New patch [sbrivio: refactor error paths and labels; add corresponding nft_set_ext_type for new key; rebase] Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27netfilter: nf_tables: add nft_setelem_parse_key()Pablo Neira Ayuso1-46/+45
Add helper function to parse the set element key netlink attribute. v4: No changes v3: New patch [sbrivio: refactor error paths and labels; use NFT_DATA_VALUE_MAXLEN instead of sizeof(*key) in helper, value can be longer than that; rebase] Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-26tcp: export count for rehash attemptsAbdul Kabbani7-1/+18
Using IPv6 flow-label to swiftly route around avoid congested or disconnected network path can greatly improve TCP reliability. This patch adds SNMP counters and a OPT_STATS counter to track both host-level and connection-level statistics. Network administrators can use these counters to evaluate the impact of this new ability better. Export count for rehash attempts to 1) two SNMP counters: TcpTimeoutRehash (rehash due to timeouts), and TcpDuplicateDataRehash (rehash due to receiving duplicate packets) 2) Timestamping API SOF_TIMESTAMPING_OPT_STATS. Signed-off-by: Abdul Kabbani <akabbani@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26Merge branch 's390-qeth-next'David S. Miller7-273/+180
Julian Wiedmann says: ==================== s390/qeth: updates 2020-01-25 please apply the following patch series for qeth to your net-next tree. This brings a number of cleanups for the init/teardown code paths. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26s390/qeth: remove HARDSETUP stateJulian Wiedmann5-21/+3
qeth_l?_stop_card() is _never_ called while in HARDSETUP state, and there's no other usage of the card state that relies on the DOWN -> HARDSETUP -> SOFTSETUP transition. As related cleanup, remove the check in qeth_realloc_buffer_pool() as it is already done by the callers. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26s390/qeth: make cmd/reply matching more flexibleJulian Wiedmann3-12/+39
When data is received on the READ channel, the matching logic for cmds that are waiting for a reply is currently hard-coded into the channel's main IO callback. Move this into a per-cmd callback, so that we can apply custom matching logic for each individual cmd. This also allows us to remove the coarse-grained check for unexpected non-IPA replies, since they will no longer match against _all_ pending cmds. Note that IDX cmds use _no_ matcher, since their reply is synchronously received as part of the cmd's IO. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26s390/qeth: consolidate online/offline codeJulian Wiedmann4-184/+107
Large parts of the online/offline code are identical now, and cleaning up the remaining stuff is easier with a shared core. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26s390/qeth: consolidate QDIO queue setupJulian Wiedmann4-17/+9
Move some duplicated logic into a shared code path. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26s390/qeth: shift some bridgeport code aroundJulian Wiedmann3-39/+22
qeth_l2_setup_bridgeport_attrs() is entirely unrelated to sysfs functionality, move it where it belongs. While at it merge all the bridgeport-specific code in the set-online path together. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queueDavid S. Miller17-43/+4723
Jeff Kirsher says: ==================== 100GbE Intel Wired LAN Driver Updates 2020-01-25 This series contains updates to the ice driver to add support for RSS. Henry and Tony enable the driver to write the filtering hardware tables to allow for changing of RSS rules, also introduced and initialized the structures for storing the configuration. Then followed it up by creating an extraction sequence based on the packet header protocols to be programmed. Next was storing the TCAM entry with the profile data and VSI group in the respective software structures. Md Fahad sets up the configuration to support RSS tables for the virtual function (VF) driver (iavf). Add support for flow types TCP4, TCP6, UDP4, UDP6, SCTP4 and SCTP6 for RSS via ethtool. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller191-1088/+2628
Minor conflict in mlx5 because changes happened to code that has moved meanwhile. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25ice: Bump versionTony Nguyen1-1/+1
Bump version to 0.8.2-k Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2020-01-25ice: Implement ethtool get/set rx-flow-hashMd Fahad Iqbal Polash3-0/+273
Provide support to change or retrieve RSS hash options for a flow type. The supported flow-types are: tcp4, tcp6, udp4, udp6, sctp4, sctp6. Signed-off-by: Md Fahad Iqbal Polash <md.fahad.iqbal.polash@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2020-01-25ice: Initilialize VF RSS tablesMd Fahad Iqbal Polash4-1/+225
Set configuration for hardware RSS tables for VFs. Signed-off-by: Md Fahad Iqbal Polash <md.fahad.iqbal.polash@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2020-01-25ice: Optimize table usageTony Nguyen5-1/+773
Attempt to optimize TCAM entries and reduce table resource usage by searching for profiles that can be reused. Provide resource cleanup of both hardware and software structures. Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Henry Tieman <henry.w.tieman@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2020-01-25ice: Enable writing filtering tablesTony Nguyen3-0/+651
Write the hardware tables based on the populated software structures. Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Henry Tieman <henry.w.tieman@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2020-01-25ice: Populate TCAM filter software structuresTony Nguyen8-10/+1459
Store the TCAM entry with the profile data and the VSI group in the respective SW structures. This will be subsequently used to write out the tables to hardware. Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Henry Tieman <henry.w.tieman@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2020-01-25Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds2-4/+20
Pull ARM fixes from Russell King: - fix ftrace relocation type filtering - relax arch timer version check * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: 8955/1: virt: Relax arch timer version check during early boot ARM: 8950/1: ftrace/recordmcount: filter relocation types
2020-01-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds118-603/+1469
Pull networking fixes from David Miller: 1) Off by one in mt76 airtime calculation, from Dan Carpenter. 2) Fix TLV fragment allocation loop condition in iwlwifi, from Luca Coelho. 3) Don't confirm neigh entries when doing ipsec pmtu updates, from Xu Wang. 4) More checks to make sure we only send TSO packets to lan78xx chips that they can actually handle. From James Hughes. 5) Fix ip_tunnel namespace move, from William Dauchy. 6) Fix unintended packet reordering due to cooperation between listification done by GRO and non-GRO paths. From Maxim Mikityanskiy. 7) Add Jakub Kicincki formally as networking co-maintainer. 8) Info leak in airo ioctls, from Michael Ellerman. 9) IFLA_MTU attribute needs validation during rtnl_create_link(), from Eric Dumazet. 10) Use after free during reload in mlxsw, from Ido Schimmel. 11) Dangling pointers are possible in tp->highest_sack, fix from Eric Dumazet. 12) Missing *pos++ in various networking seq_next handlers, from Vasily Averin. 13) CHELSIO_GET_MEM operation neds CAP_NET_ADMIN check, from Michael Ellerman. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (109 commits) firestream: fix memory leaks net: cxgb3_main: Add CAP_NET_ADMIN check to CHELSIO_GET_MEM net: bcmgenet: Use netif_tx_napi_add() for TX NAPI tipc: change maintainer email address net: stmmac: platform: fix probe for ACPI devices net/mlx5e: kTLS, Do not send decrypted-marked SKBs via non-accel path net/mlx5e: kTLS, Remove redundant posts in TX resync flow net/mlx5e: kTLS, Fix corner-case checks in TX resync flow net/mlx5e: Clear VF config when switching modes net/mlx5: DR, use non preemptible call to get the current cpu number net/mlx5: E-Switch, Prevent ingress rate configuration of uplink rep net/mlx5: DR, Enable counter on non-fwd-dest objects net/mlx5: Update the list of the PCI supported devices net/mlx5: Fix lowest FDB pool size net: Fix skb->csum update in inet_proto_csum_replace16(). netfilter: nf_tables: autoload modules from the abort path netfilter: nf_tables: add __nft_chain_type_get() netfilter: nf_tables_offload: fix check the chain offload flag netfilter: conntrack: sctp: use distinct states for new SCTP connections ipv6_route_seq_next should increase position index ...
2020-01-25Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/socLinus Torvalds3-0/+8
Pull ARM SoC fixes from Olof Johansson: "A couple of fixes have come in that would be good to include in this release: - A fix for amount of memory on Beaglebone Black. Surfaced now since GRUB2 doesn't update memory size in the booted kernel. - A fix to make SPI interfaces work on am43x-epos-evm. - Small Kconfig fix for OPTEE (adds a depend on MMU) to avoid build failures" * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: ARM: dts: am43x-epos-evm: set data pin directions for spi0 and spi1 tee: optee: Fix compilation issue with nommu ARM: dts: am335x-boneblack-common: fix memory size
2020-01-25firestream: fix memory leaksWenwen Wang1-0/+3
In fs_open(), 'vcc' is allocated through kmalloc() and assigned to 'atm_vcc->dev_data.' In the following execution, if an error occurs, e.g., there is no more free channel, an error code EBUSY or ENOMEM will be returned. However, 'vcc' is not deallocated, leading to memory leaks. Note that, in normal cases where fs_open() returns 0, 'vcc' will be deallocated in fs_close(). But, if fs_open() fails, there is no guarantee that fs_close() will be invoked. To fix this issue, deallocate 'vcc' before the error code is returned. Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller13-76/+146
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Missing netlink attribute sanity check for NFTA_OSF_DREG, from Florian Westphal. 2) Use bitmap infrastructure in ipset to fix KASAN slab-out-of-bounds reads, from Jozsef Kadlecsik. 3) Missing initial CLOSED state in new sctp connection through ctnetlink events, from Jiri Wiesner. 4) Missing check for NFT_CHAIN_HW_OFFLOAD in nf_tables offload indirect block infrastructure, from wenxu. 5) Add __nft_chain_type_get() to sanity check family and chain type. 6) Autoload modules from the nf_tables abort path to fix races reported by syzbot. 7) Remove unnecessary skb->csum update on inet_proto_csum_replace16(), from Praveen Chaudhary. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25Merge tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds2-9/+29
Pull btrfs fix from David Sterba: "Here's a last minute fix for a regression introduced in this development cycle. There's a small chance of a silent corruption when device replace and NOCOW data writes happen at the same time in one block group. Metadata or COW data writes are unaffected. The extra fixup patch is there to silence an unnecessary warning" * tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: dev-replace: remove warning for unknown return codes when finished btrfs: scrub: Require mandatory block group RO for dev-replace
2020-01-25Merge tag 'pinctrl-v5.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrlLinus Torvalds1-0/+1
Pull pin control fix from Linus Walleij: "A single fix for the Intel Sunrisepoint pin controller that makes the interrupts work properly on it" * tag 'pinctrl-v5.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: sunrisepoint: Add missing Interrupt Status register offset
2020-01-25Merge tag 'mlx5-fixes-2020-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linuxDavid S. Miller7-41/+80
Saeed Mahameed says: ==================== Mellanox, mlx5 fixes 2020-01-24 This series introduces some fixes to mlx5 driver. Please pull and let me know if there is any problem. Merge conflict: once merge with net-next, a contextual conflict will appear in drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c since the code moved in net-next. To resolve, just delete ALL of the conflicting hunk from net. So sorry for the small mess .. For -stable v5.4: ('net/mlx5: Update the list of the PCI supported devices') ('net/mlx5: Fix lowest FDB pool size') ('net/mlx5e: kTLS, Fix corner-case checks in TX resync flow') ('net/mlx5e: kTLS, Do not send decrypted-marked SKBs via non-accel path') ('net/mlx5: Eswitch, Prevent ingress rate configuration of uplink rep') ('net/mlx5e: kTLS, Remove redundant posts in TX resync flow') ('net/mlx5: DR, Enable counter on non-fwd-dest objects') ('net/mlx5: DR, use non preemptible call to get the current cpu number') ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25btrfs: dev-replace: remove warning for unknown return codes when finishedDavid Sterba1-4/+1
The fstests btrfs/011 triggered a warning at the end of device replace, [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28 [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28 [ 1892.059993] ------------[ cut here ]------------ [ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs] [ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942 [ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs] [ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286 [ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889 [ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080 [ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000 [ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000 [ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200 [ 1892.108642] FS: 00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000 [ 1892.111558] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0 [ 1892.115814] Call Trace: [ 1892.116896] btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs] [ 1892.118962] btrfs_ioctl+0x1d62/0x2550 [btrfs] caused by the previous patch ("btrfs: scrub: Require mandatory block group RO for dev-replace"). Hitting ENOSPC is possible and could happen when the block group is set read-only, preventing NOCOW writes to the area that's being accessed by dev-replace. This has happend with scratch devices of size 12G but not with 5G and 20G, so this is depends on timing and other activity on the filesystem. The whole replace operation is restartable, the space state should be examined by the user in any case. The error code is propagated back to the ioctl caller so the kernel warning is causing false alerts. Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-25Merge branch 'mlxsw-Offload-TBF'David S. Miller18-90/+793
Ido Schimmel says: ==================== mlxsw: Offload TBF Petr says: In order to allow configuration of shapers on Spectrum family of machines, recognize TBF either as root Qdisc, or as a child of ETS or PRIO. Configure rate of maximum shaper according to TBF rate setting, and maximum shaper burst size according to TBF burst setting. - Patches #1 and #2 make the TBF shaper suitable for offloading. - Patches #3, #4 and #5 are refactoring aimed at easier support of leaf Qdiscs in general. - Patches #6 to #10 gradually introduce TBF offload. - Patches #11 to #14 add selftests. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25selftests: mlxsw: Add a TBF selftestPetr Machata8-0/+344
Add a test that runs traffic across a port throttled with TBF. The test checks that the observed throughput is within +-5% from the installed shaper. To allow checking both the software datapath and the offloaded one, make the test suitable for inclusion from driver-specific wrapper. Introduce such wrappers for mlxsw. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25selftests: forwarding: lib: Allow reading TC rule byte countersPetr Machata1-1/+2
The function tc_rule_stats_get() fetches a packet counter of a given TC rule. Extend it to support byte counters as well by adding an optional argument with selector. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25selftests: forwarding: lib: Add helpers for busywaitingPetr Machata1-0/+18
The function busywait() is handy as a safety-latched variant of a while loop. Many selftests deal specifically with counter values, and busywaiting on them is likely to be rather common (it is not quite common now, but busywait() has not been around for very long). To facilitate expressing simply what is tested, introduce two helpers: - until_counter_is(), which can be used as a predicate passed to busywait(), which holds when expression, which is itself passed as an argument to until_counter_is(), reaches a desired value. - busywait_for_counter(), which is useful for waiting until a given counter changes "by" (as opposed to "to") a certain amount. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25selftests: Move two functions from mlxsw's qos_lib to libPetr Machata2-24/+24
The function humanize() is used for converting value in bits/s to a human-friendly approximate value in Kbps, Mbps or Gbps. There is nothing hardware-specific in that, so move the function to lib.sh. Similarly for the rate() function, which just does a bit of math to calculate a rate, given two counter values and a time interval. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25mlxsw: spectrum_qdisc: Support offloading of TBF QdiscPetr Machata3-0/+203
React to the TC messages that were introduced in a preceding patch and configure egress maximum shaper as appropriate. TBF can be used as a root qdisc or under one of PRIO or strict ETS bands. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25mlxsw: spectrum: Configure shaper rate and burst size togetherPetr Machata3-8/+10
In order to allow configuration of burst size together with shaper rate, extend mlxsw_sp_port_ets_maxrate_set() with a burst_size argument. Convert call sites to pass 0 (for default). Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25mlxsw: spectrum: Add lowest_shaper_bs to struct mlxsw_spPetr Machata2-0/+4
Lower limit of burst size configuration is dependent on system type. Add a datum to track the value. Initialize as appropriate in mlxsw_spX_init(). Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25mlxsw: reg: Increase MLXSW_REG_QEEC_MAS_DISPetr Machata1-2/+2
As the port speeds grow, the current value of "unlimited shaper", 200000000Kbps, might become lower than the actually supported speeds. Bump it to the maximum value that fits in the corresponding QEEC field, which is about 2.1Tbps. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25mlxsw: reg: Add max_shaper_bs to QoS ETS Element ConfigurationPetr Machata1-0/+15
The QEEC register configures scheduling elements. One of the bits of configuration is the burst size to use for the shaper installed on the element. Add the necessary fields to support this configuration. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25mlxsw: spectrum_qdisc: Extract a common leaf unoffload functionPetr Machata1-5/+14
When the RED Qdisc is unoffloaded, it needs to reduce the reported backlog by the amount that is in the HW, so that only the SW backlog is contained in the counter. The same thing will need to be done by TBF, and likely any other leaf Qdisc as well. Extract a helper mlxsw_sp_qdisc_leaf_unoffload() and call it from mlxsw_sp_qdisc_red_unoffload(). Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25mlxsw: spectrum_qdisc: Add mlxsw_sp_qdisc_get_class_stats()Petr Machata1-10/+19
Add a wrapper around mlxsw_sp_qdisc_collect_tc_stats() and mlxsw_sp_qdisc_update_stats() for the simple case of doing both in one go: mlxsw_sp_qdisc_get_class_stats(). Dispatch to that function from mlxsw_sp_qdisc_get_red_stats(). This new function will be useful for other leaf Qdiscs as well. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25mlxsw: spectrum_qdisc: Extract a per-TC stat functionPetr Machata1-49/+70
Extract from mlxsw_sp_qdisc_get_prio_stats() two new functions: mlxsw_sp_qdisc_collect_tc_stats() to accumulate stats for that one TC only, and mlxsw_sp_qdisc_update_stats() that makes the stats relative to base values stored earlier. Use them from mlxsw_sp_qdisc_get_red_stats(). Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25net: sched: Make TBF Qdisc offloadablePetr Machata3-0/+78
Invoke ndo_setup_tc as appropriate to signal init / replacement, destroying and dumping of TBF Qdisc. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>