aboutsummaryrefslogtreecommitdiffstats
path: root/net (follow)
AgeCommit message (Collapse)AuthorFilesLines
2018-09-21tcp: switch internal pacing timer to CLOCK_TAIEric Dumazet2-2/+2
Next patch will use tcp_wstamp_ns to feed internal TCP pacing timer, so switch to CLOCK_TAI to share same base. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21tcp: provide earliest departure time in skb->tstampEric Dumazet4-10/+9
Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns, from usec units to nsec units. Do not clear skb->tstamp before entering IP stacks in TX, so that qdisc or devices can implement pacing based on the earliest departure time instead of socket sk->sk_pacing_rate Packets are fed with tcp_wstamp_ns, and following patch will update tcp_wstamp_ns when both TCP and sch_fq switch to the earliest departure time mechanism. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21tcp: add tcp_wstamp_ns socket fieldEric Dumazet1-0/+16
TCP will soon provide earliest departure time on TX skbs. It needs to track this in a new variable. tcp_mstamp_refresh() needs to update this variable, and became too big to stay an inline. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21net_sched: sch_fq: switch to CLOCK_TAIEric Dumazet1-3/+3
TCP will soon provide per skb->tstamp with earliest departure time, so that sch_fq does not have to determine departure time by looking at socket sk_pacing_rate. We chose in linux-4.19 CLOCK_TAI as the clock base for transports, qdiscs, and NIC offloads. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21tcp: introduce tcp_skb_timestamp_us() helperEric Dumazet5-16/+19
There are few places where TCP reads skb->skb_mstamp expecting a value in usec unit. skb->tstamp (aka skb->skb_mstamp) will soon store CLOCK_TAI nsec value. Add tcp_skb_timestamp_us() to provide proper conversion when needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21net/tls: Add support for async encryption of records for performanceVakul Garg2-176/+464
In current implementation, tls records are encrypted & transmitted serially. Till the time the previously submitted user data is encrypted, the implementation waits and on finish starts transmitting the record. This approach of encrypt-one record at a time is inefficient when asynchronous crypto accelerators are used. For each record, there are overheads of interrupts, driver softIRQ scheduling etc. Also the crypto accelerator sits idle most of time while an encrypted record's pages are handed over to tcp stack for transmission. This patch enables encryption of multiple records in parallel when an async capable crypto accelerator is present in system. This is achieved by allowing the user space application to send more data using sendmsg() even while previously issued data is being processed by crypto accelerator. This requires returning the control back to user space application after submitting encryption request to accelerator. This also means that zero-copy mode of encryption cannot be used with async accelerator as we must be done with user space application buffer before returning from sendmsg(). There can be multiple records in flight to/from the accelerator. Each of the record is represented by 'struct tls_rec'. This is used to store the memory pages for the record. After the records are encrypted, they are added in a linked list called tx_ready_list which contains encrypted tls records sorted as per tls sequence number. The records from tx_ready_list are transmitted using a newly introduced function called tls_tx_records(). The tx_ready_list is polled for any record ready to be transmitted in sendmsg(), sendpage() after initiating encryption of new tls records. This achieves parallel encryption and transmission of records when async accelerator is present. There could be situation when crypto accelerator completes encryption later than polling of tx_ready_list by sendmsg()/sendpage(). Therefore we need a deferred work context to be able to transmit records from tx_ready_list. The deferred work context gets scheduled if applications are not sending much data through the socket. If the applications issue sendmsg()/sendpage() in quick succession, then the scheduling of tx_work_handler gets cancelled as the tx_ready_list would be polled from application's context itself. This saves scheduling overhead of deferred work. The patch also brings some side benefit. We are able to get rid of the concept of CLOSED record. This is because the records once closed are either encrypted and then placed into tx_ready_list or if encryption fails, the socket error is set. This simplifies the kernel tls sendpath. However since tls_device.c is still using macros, accessory functions for CLOSED records have been retained. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21ipv6: remove redundant null pointer check before kfree_skbzhong jiang1-4/+2
kfree_skb has taken the null pointer into account. hence it is safe to remove the redundant null pointer check before kfree_skb. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21net: nci: remove redundant null pointer check before kfree_skbzhong jiang1-4/+2
kfree_skb has taken the null pointer into account. hence it is safe to remove the redundant null pointer check before kfree_skb. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21ipv4: remove redundant null pointer check before kfree_skbzhong jiang1-2/+1
kfree_skb has taken the null pointer into account. hence it is safe to remove the redundant null pointer check before kfree_skb. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21net_sched: change tcf_del_walker() to take idrinfo->lockVlad Buslov2-3/+30
Action API was changed to work with actions and action_idr in concurrency safe manner, however tcf_del_walker() still uses actions without taking a reference or idrinfo->lock first, and deletes them directly, disregarding possible concurrent delete. Change tcf_del_walker() to take idrinfo->lock while iterating over actions and use new tcf_idr_release_unsafe() to release them while holding the lock. And the blocking function fl_hw_destroy_tmplt() could be called when we put a filter chain, so defer it to a work queue. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> [xiyou.wangcong@gmail.com: heavily modify the code and changelog] Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-20netfilter: nft_fib: Convert nft_fib4_eval to new dev helperDavid Ahern1-21/+6
Convert nft_fib4_eval to the new device checking helper and remove the duplicate code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-20netfilter: rpfilter: Convert rpfilter_lookup_reverse to new dev helperDavid Ahern1-16/+1
Convert rpfilter_lookup_reverse to the new device checking helper and remove the duplicate code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-20net/ipv4: Move device validation to helperDavid Ahern1-17/+27
Move the device matching check in __fib_validate_source to a helper and export it for use by netfilter modules. Code move only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19ipv6: Allow the l3mdev to be a loopbackRobert Shearman3-2/+5
There is no way currently for an IPv6 client connect using a loopback address in a VRF, whereas for IPv4 the loopback address can be added: $ sudo ip addr add dev vrfred 127.0.0.1/8 $ sudo ip -6 addr add ::1/128 dev vrfred RTNETLINK answers: Cannot assign requested address So allow ::1 to be configured on an L3 master device. In order for this to be usable ip_route_output_flags needs to not consider ::1 to be a link scope address (since oif == l3mdev and so it would be dropped), and ipv6_rcv needs to consider the l3mdev to be a loopback device so that it doesn't drop the packets. Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19net: linkwatch: add check for netdevice being present to linkwatch_do_devHeiner Kallweit1-1/+1
When bringing down the netdevice (incl. detaching it) and calling netif_carrier_off directly or indirectly the latter triggers an asynchronous linkwatch event. This linkwatch event eventually may fail to access chip registers in the ndo_get_stats/ndo_get_stats64 callback because the device isn't accessible any longer, see call trace in [0]. To prevent this scenario don't check for IFF_UP only, but also make sure that the netdevice is present. [0] https://lists.openwall.net/netdev/2018/03/15/62 Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19net: core: Use FIELD_SIZEOF directly instead of reimplementing its functionzhong jiang1-5/+5
FIELD_SIZEOF is defined as a macro to calculate the specified value. Therefore, We prefer to use the macro rather than calculating its value. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19net: sched: Use FIELD_SIZEOF directly instead of reimplementing its functionzhong jiang1-1/+1
FIELD_SIZEOF is defined as a macro to calculate the specified value. Therefore, We prefer to use the macro rather than calculating its value. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19net: iucv: Use FIELD_SIZEOF directly instead of reimplementing its functionzhong jiang1-1/+1
FIELD_SIZEOF is defined as a macro to calculate the specified value. Therefore, We prefer to use the macro rather than calculating its value. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19Merge tag 'batadv-next-for-davem-20180919' of git://git.open-mesh.org/linux-mergeDavid S. Miller15-411/+323
Simon Wunderlich says: ==================== This feature/cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - Inform users about debugfs interface deprecation, by Sven Eckelmann - Implement tracing, planned to replace debugfs log messages, by Sven Eckelmann - Move OGM rebroadcasts to per interface struct, by Sven Eckelmann - Enable LockLess TX to increase throughput, by Sven Eckelmann ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller15-106/+168
Two new tls tests added in parallel in both net and net-next. Used Stephen Rothwell's linux-next resolution. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17net/ipv6: do not copy dst flags on rt initPeter Oskolkov1-2/+0
DST_NOCOUNT in dst_entry::flags tracks whether the entry counts toward route cache size (net->ipv6.sysctl.ip6_rt_max_size). If the flag is NOT set, dst_ops::pcpuc_entries counter is incremented in dist_init() and decremented in dst_destroy(). This flag is tied to allocation/deallocation of dst_entry and should not be copied from another dst/route. Otherwise it can happen that dst_ops::pcpuc_entries counter grows until no new routes can be allocated because the counter reached ip6_rt_max_size due to DST_NOCOUNT not set and thus no counter decrements on gc-ed routes. Fixes: 3b6761d18bc1 ("net/ipv6: Move dst flags to booleans in fib entries") Cc: David Ahern <dsahern@gmail.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17net/ipv4: defensive cipso option parsingStefan Nuernberger1-4/+7
commit 40413955ee26 ("Cipso: cipso_v4_optptr enter infinite loop") fixed a possible infinite loop in the IP option parsing of CIPSO. The fix assumes that ip_options_compile filtered out all zero length options and that no other one-byte options beside IPOPT_END and IPOPT_NOOP exist. While this assumption currently holds true, add explicit checks for zero length and invalid length options to be safe for the future. Even though ip_options_compile should have validated the options, the introduction of new one-byte options can still confuse this code without the additional checks. Signed-off-by: Stefan Nuernberger <snu@amazon.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Simon Veith <sveith@amazon.de> Cc: stable@vger.kernel.org Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17net: caif: remove redundant null check on frontpktColin Ian King1-3/+0
It is impossible for frontpkt to be null at the point of the null check because it has been assigned from rearpkt and there is no way rearpkt can be null at the point of the assignment because of the sanity checking and exit paths taken previously. Remove the redundant null check. Detected by CoverityScan, CID#114434 ("Logically dead code") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17Revert "kcm: remove any offset before parsing messages"David S. Miller1-25/+1
This reverts commit 072222b488bc55cce92ff246bdc10115fd57d3ab. I just read that this causes regressions. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17kcm: remove any offset before parsing messagesDominique Martinet1-1/+25
The current code assumes kcm users know they need to look for the strparser offset within their bpf program, which is not documented anywhere and examples laying around do not do. The actual recv function does handle the offset well, so we can create a temporary clone of the skb and pull that one up as required for parsing. The pull itself has a cost if we are pulling beyond the head data, measured to 2-3% latency in a noisy VM with a local client stressing that path. The clone's impact seemed too small to measure. This bug can be exhibited easily by implementing a "trivial" kcm parser taking the first bytes as size, and on the client sending at least two such packets in a single write(). Note that bpf sockmap has the same problem, both for parse and for recv, so it would pulling twice or a real pull within the strparser logic if anyone cares about that. Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17net: dsa: remove redundant null pointer check before put_devicezhong jiang1-2/+1
put_device has taken the null pinter check into account. So it is safe to remove the duplicated check before put_device. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17net: rds: use memset to optimize the recvZhu Yanjun1-4/+1
The function rds_inc_init is in recv process. To use memset can optimize the function rds_inc_init. The test result: Before: 1) + 24.950 us | rds_inc_init [rds](); After: 1) + 10.990 us | rds_inc_init [rds](); Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17net: dsa: tag_gswip: Add gswip to dsa_tag_protocol_to_str()Hauke Mehrtens1-0/+3
The gswip tag was missing in the dsa_tag_protocol_to_str() function, add it. Fixes: 7969119293f5 ("net: dsa: Add Lantiq / Intel GSWIP tag support") Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17tls: fix currently broken MSG_PEEK behaviorDaniel Borkmann1-0/+8
In kTLS MSG_PEEK behavior is currently failing, strace example: [pid 2430] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3 [pid 2430] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 4 [pid 2430] bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 [pid 2430] listen(4, 10) = 0 [pid 2430] getsockname(4, {sa_family=AF_INET, sin_port=htons(38855), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0 [pid 2430] connect(3, {sa_family=AF_INET, sin_port=htons(38855), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 [pid 2430] setsockopt(3, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0 [pid 2430] setsockopt(3, 0x11a /* SOL_?? */, 1, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0 [pid 2430] accept(4, {sa_family=AF_INET, sin_port=htons(49636), sin_addr=inet_addr("127.0.0.1")}, [16]) = 5 [pid 2430] setsockopt(5, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0 [pid 2430] setsockopt(5, 0x11a /* SOL_?? */, 2, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0 [pid 2430] close(4) = 0 [pid 2430] sendto(3, "test_read_peek", 14, 0, NULL, 0) = 14 [pid 2430] sendto(3, "_mult_recs\0", 11, 0, NULL, 0) = 11 [pid 2430] recvfrom(5, "test_read_peektest_read_peektest"..., 64, MSG_PEEK, NULL, NULL) = 64 As can be seen from strace, there are two TLS records sent, i) 'test_read_peek' and ii) '_mult_recs\0' where we end up peeking 'test_read_peektest_read_peektest'. This is clearly wrong, and what happens is that given peek cannot call into tls_sw_advance_skb() to unpause strparser and proceed with the next skb, we end up looping over the current one, copying the 'test_read_peek' over and over into the user provided buffer. Here, we can only peek into the currently held skb (current, full TLS record) as otherwise we would end up having to hold all the original skb(s) (depending on the peek depth) in a separate queue when unpausing strparser to process next records, minimally intrusive is to return only up to the current record's size (which likely was what c46234ebb4d1 ("tls: RX path for ktls") originally intended as well). Thus, after patch we properly peek the first record: [pid 2046] wait4(2075, <unfinished ...> [pid 2075] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3 [pid 2075] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 4 [pid 2075] bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 [pid 2075] listen(4, 10) = 0 [pid 2075] getsockname(4, {sa_family=AF_INET, sin_port=htons(55115), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0 [pid 2075] connect(3, {sa_family=AF_INET, sin_port=htons(55115), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 [pid 2075] setsockopt(3, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0 [pid 2075] setsockopt(3, 0x11a /* SOL_?? */, 1, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0 [pid 2075] accept(4, {sa_family=AF_INET, sin_port=htons(45732), sin_addr=inet_addr("127.0.0.1")}, [16]) = 5 [pid 2075] setsockopt(5, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0 [pid 2075] setsockopt(5, 0x11a /* SOL_?? */, 2, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0 [pid 2075] close(4) = 0 [pid 2075] sendto(3, "test_read_peek", 14, 0, NULL, 0) = 14 [pid 2075] sendto(3, "_mult_recs\0", 11, 0, NULL, 0) = 11 [pid 2075] recvfrom(5, "test_read_peek", 64, MSG_PEEK, NULL, NULL) = 14 Fixes: c46234ebb4d1 ("tls: RX path for ktls") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17tls: async support causes out-of-bounds access in crypto APIsJohn Fastabend1-16/+23
When async support was added it needed to access the sk from the async callback to report errors up the stack. The patch tried to use space after the aead request struct by directly setting the reqsize field in aead_request. This is an internal field that should not be used outside the crypto APIs. It is used by the crypto code to define extra space for private structures used in the crypto context. Users of the API then use crypto_aead_reqsize() and add the returned amount of bytes to the end of the request memory allocation before posting the request to encrypt/decrypt APIs. So this breaks (with general protection fault and KASAN error, if enabled) because the request sent to decrypt is shorter than required causing the crypto API out-of-bounds errors. Also it seems unlikely the sk is even valid by the time it gets to the callback because of memset in crypto layer. Anyways, fix this by holding the sk in the skb->sk field when the callback is set up and because the skb is already passed through to the callback handler via void* we can access it in the handler. Then in the handler we need to be careful to NULL the pointer again before kfree_skb. I added comments on both the setup (in tls_do_decryption) and when we clear it from the crypto callback handler tls_decrypt_done(). After this selftests pass again and fixes KASAN errors/warnings. Fixes: 94524d8fc965 ("net/tls: Add support for async decryption of tls records") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Vakul Garg <Vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17ipv6: fix possible use-after-free in ip6_xmit()Eric Dumazet1-4/+2
In the unlikely case ip6_xmit() has to call skb_realloc_headroom(), we need to call skb_set_owner_w() before consuming original skb, otherwise we risk a use-after-free. Bring IPv6 in line with what we do in IPv4 to fix this. Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-1/+2
Daniel Borkmann says: ==================== pull-request: bpf 2018-09-16 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix end boundary calculation in BTF for the type section, from Martin. 2) Fix and revert subtraction of pointers that was accidentally allowed for unprivileged programs, from Alexei. 3) Fix bpf_msg_pull_data() helper by using __GFP_COMP in order to avoid a warning in linearizing sg pages into a single one for large allocs, from Tushar. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16ip6_gre: simplify gre header parsing in ip6gre_errHaishuang Yan1-22/+4
Same as ip_gre, use gre_parse_header to parse gre header in gre error handler code. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16ip_gre: fix parsing gre header in ipgre_errHaishuang Yan2-9/+7
gre_parse_header stops parsing when csum_err is encountered, which means tpi->key is undefined and ip_tunnel_lookup will return NULL improperly. This patch introduce a NULL pointer as csum_err parameter. Even when csum_err is encountered, it won't return error and continue parsing gre header as expected. Fixes: 9f57c67c379d ("gre: Remove support for sharing GRE protocol hook.") Reported-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16net/sched: act_police: don't use spinlock in the data pathDavide Caratti1-64/+92
use RCU instead of spinlocks, to protect concurrent read/write on act_police configuration. This reduces the effects of contention in the data path, in case multiple readers are present. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16net/sched: act_police: use per-cpu countersDavide Caratti1-24/+22
use per-CPU counters, instead of sharing a single set of stats with all cores. This removes the need of using spinlock when statistics are read or updated. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16udp6: add missing checks on edumux packet processingPaolo Abeni1-28/+37
Currently the UDPv6 early demux rx code path lacks some mandatory checks, already implemented into the normal RX code path - namely the checksum conversion and no_check6_rx check. Similar to the previous commit, we move the common processing to an UDPv6 specific helper and call it from both edemux code path and normal code path. In respect to the UDPv4, we need to add an explicit check for non zero csum according to no_check6_rx value. Reported-by: Jianlin Shi <jishi@redhat.com> Suggested-by: Xin Long <lucien.xin@gmail.com> Fixes: c9f2c1ae123a ("udp6: fix socket leak on early demux") Fixes: 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary conversion") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16udp4: fix IP_CMSG_CHECKSUM for connected socketsPaolo Abeni1-23/+26
commit 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary conversion") left out the early demux path for connected sockets. As a result IP_CMSG_CHECKSUM gives wrong values for such socket when GRO is not enabled/available. This change addresses the issue by moving the csum conversion to a common helper and using such helper in both the default and the early demux rx path. Fixes: 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary conversion") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-15batman-adv: Enable LockLess TX for softifSven Eckelmann1-0/+1
The batadv interfaces are virtual interfaces which just tunnel the traffic over other ethernet compatible interfaces. It doesn't need serialization during the tx phase and is using RCU for most of its internal datastructures. Since it doesn't have actual queues which could be locked independently, the throughput gets significantly reduced by the extra lock in the core net code. 8 parallel TCP connections forwarded by an IPQ4019 based hardware over 5GHz could reach: * without LLTX: 349 Mibit/s * with LLTX: 563 Mibit/s Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-15batman-adv: Move OGM rebroadcast stats to orig_ifinfoSven Eckelmann6-406/+145
B.A.T.M.A.N. IV requires the number of rebroadcast from a neighboring originator. These statistics are gathered per interface which transmitted the OGM (and then received it again). Since an originator is not interface specific, a resizable array was used in each originator. This resizable array had an entry for each interface and had to be resizes (for all OGMs) when the number of active interface was modified. This could cause problems when a large number of interface is added and not enough continuous memory is available to allocate the array. There is already a per interface originator structure "batadv_orig_ifinfo" which can be used to store this information. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-15batman-adv: Provide debug messages as trace eventsSven Eckelmann5-5/+127
A private debug logging infrastructure is currently provided via $debug_fs/batman_adv/*/log when CONFIG_BATMAN_ADV_DEBUG is enabled. This is not well integrated in the rest of the tracing infrastructure of the kernel. Other components (like mac80211 or ath10k) allow to gather the debug messages using generic trace events which are better integrated. This makes it possible to interact with them using the existing userspace tools. The tracepoint batadv:batadv_dbg will now be available when CONFIG_BATMAN_ADV_DEBUG and CONFIG_BATMAN_ADV_TRACING is activated. The log level mask is still used for filtering as usual. A full system trace for offline parsing can be created (and read) using: $ batctl ll all $ trace-cmd record -e batadv:batadv_dbg $ trace-cmd report The same can also be done without recording to a file $ batctl ll all $ trace-cmd stream -e batadv:batadv_dbg The trace infrastructure is especially helpful when tracing processes: $ batctl ll all $ ./tools/perf/perf trace --event "batadv:*" batctl p 10.204.32.1 0.000 batadv:batadv_dbg:batman_adv bat0 Parsing outgoing ARP REQUEST 0.045 batadv:batadv_dbg:batman_adv bat0 ARP MSG = [src: a2:64:14:53:f8:22-10.204.32.185 dst: 00:00:00:00:00:00-10.204.32.1] 0.067 batadv:batadv_dbg:batman_adv bat0 Entry updated: 10.204.32.185 a2:64:14:53:f8:22 (vid: -1) 0.099 batadv:batadv_dbg:batman_adv bat0 batadv_dat_select_candidates(): IP=10.204.32.1 hash(IP)=48902 0.757 batadv:batadv_dbg:batman_adv bat0 dat_select_candidates() 0: selected fe:2c:91:68:29:2b addr=48977 dist=65460 1.178 batadv:batadv_dbg:batman_adv bat0 dat_select_candidates() 1: selected fe:81:ab:c5:e3:03 addr=49181 dist=65256 1.809 batadv:batadv_dbg:batman_adv bat0 dat_select_candidates() 2: selected 66:25:a7:48:37:fb addr=49328 dist=65109 1.828 batadv:batadv_dbg:batman_adv bat0 DHT_SEND for 10.204.32.1 Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-14net/sched: act_sample: fix NULL dereference in the data pathDavide Caratti1-1/+1
Matteo reported the following splat, testing the datapath of TC 'sample': BUG: KASAN: null-ptr-deref in tcf_sample_act+0xc4/0x310 Read of size 8 at addr 0000000000000000 by task nc/433 CPU: 0 PID: 433 Comm: nc Not tainted 4.19.0-rc3-kvm #17 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014 Call Trace: kasan_report.cold.6+0x6c/0x2fa tcf_sample_act+0xc4/0x310 ? dev_hard_start_xmit+0x117/0x180 tcf_action_exec+0xa3/0x160 tcf_classify+0xdd/0x1d0 htb_enqueue+0x18e/0x6b0 ? deref_stack_reg+0x7a/0xb0 ? htb_delete+0x4b0/0x4b0 ? unwind_next_frame+0x819/0x8f0 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 __dev_queue_xmit+0x722/0xca0 ? unwind_get_return_address_ptr+0x50/0x50 ? netdev_pick_tx+0xe0/0xe0 ? save_stack+0x8c/0xb0 ? kasan_kmalloc+0xbe/0xd0 ? __kmalloc_track_caller+0xe4/0x1c0 ? __kmalloc_reserve.isra.45+0x24/0x70 ? __alloc_skb+0xdd/0x2e0 ? sk_stream_alloc_skb+0x91/0x3b0 ? tcp_sendmsg_locked+0x71b/0x15a0 ? tcp_sendmsg+0x22/0x40 ? __sys_sendto+0x1b0/0x250 ? __x64_sys_sendto+0x6f/0x80 ? do_syscall_64+0x5d/0x150 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 ? __sys_sendto+0x1b0/0x250 ? __x64_sys_sendto+0x6f/0x80 ? do_syscall_64+0x5d/0x150 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 ip_finish_output2+0x495/0x590 ? ip_copy_metadata+0x2e0/0x2e0 ? skb_gso_validate_network_len+0x6f/0x110 ? ip_finish_output+0x174/0x280 __tcp_transmit_skb+0xb17/0x12b0 ? __tcp_select_window+0x380/0x380 tcp_write_xmit+0x913/0x1de0 ? __sk_mem_schedule+0x50/0x80 tcp_sendmsg_locked+0x49d/0x15a0 ? tcp_rcv_established+0x8da/0xa30 ? tcp_set_state+0x220/0x220 ? clear_user+0x1f/0x50 ? iov_iter_zero+0x1ae/0x590 ? __fget_light+0xa0/0xe0 tcp_sendmsg+0x22/0x40 __sys_sendto+0x1b0/0x250 ? __ia32_sys_getpeername+0x40/0x40 ? _copy_to_user+0x58/0x70 ? poll_select_copy_remaining+0x176/0x200 ? __pollwait+0x1c0/0x1c0 ? ktime_get_ts64+0x11f/0x140 ? kern_select+0x108/0x150 ? core_sys_select+0x360/0x360 ? vfs_read+0x127/0x150 ? kernel_write+0x90/0x90 __x64_sys_sendto+0x6f/0x80 do_syscall_64+0x5d/0x150 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fefef2b129d Code: ff ff ff ff eb b6 0f 1f 80 00 00 00 00 48 8d 05 51 37 0c 00 41 89 ca 8b 00 85 c0 75 20 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b f3 c3 66 0f 1f 84 00 00 00 00 00 41 56 41 RSP: 002b:00007fff2f5350c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 000056118d60c120 RCX: 00007fefef2b129d RDX: 0000000000002000 RSI: 000056118d629320 RDI: 0000000000000003 RBP: 000056118d530370 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000002000 R13: 000056118d5c2a10 R14: 000056118d5c2a10 R15: 000056118d5303b8 tcf_sample_act() tried to update its per-cpu stats, but tcf_sample_init() forgot to allocate them, because tcf_idr_create() was called with a wrong value of 'cpustats'. Setting it to true proved to fix the reported crash. Reported-by: Matteo Croce <mcroce@redhat.com> Fixes: 65a206c01e8e ("net/sched: Change act_api and act_xxx modules to use IDR") Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action") Tested-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-14batman-adv: Mark debugfs functionality as deprecatedSven Eckelmann4-0/+50
CONFIG_BATMAN_ADV_DEBUGFS is disabled by default because debugfs is not supported for batman-adv interfaces in any non-default netns. Any remaining users of this interface should still be informed about the deprecation and the generic netlink alternative. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-14batman-adv: Start new development cycleSimon Wunderlich1-1/+1
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-13socket: fix struct ifreq size in compat ioctlJohannes Berg1-8/+14
As reported by Reobert O'Callahan, since Viro's commit to kill dev_ifsioc() we attempt to copy too much data in compat mode, which may lead to EFAULT when the 32-bit version of struct ifreq sits at/near the end of a page boundary, and the next page isn't mapped. Fix this by passing the approprate compat/non-compat size to copy and using that, as before the dev_ifsioc() removal. This works because only the embedded "struct ifmap" has different size, and this is only used in SIOCGIFMAP/SIOCSIFMAP which has a different handler. All other parts of the union are naturally compatible. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=199469. Fixes: bf4405737f9f ("kill dev_ifsioc()") Reported-by: Robert O'Callahan <robert@ocallahan.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13pktgen: Fix fall-through annotationGustavo A. R. Silva1-1/+1
Replace "fallthru" with a proper "fall through" annotation. This fix is part of the ongoing efforts to enabling -Wimplicit-fallthrough Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13gso_segment: Reset skb->mac_len after modifying network headerToke Høiland-Jørgensen2-0/+2
When splitting a GSO segment that consists of encapsulated packets, the skb->mac_len of the segments can end up being set wrong, causing packet drops in particular when using act_mirred and ifb interfaces in combination with a qdisc that splits GSO packets. This happens because at the time skb_segment() is called, network_header will point to the inner header, throwing off the calculation in skb_reset_mac_len(). The network_header is subsequently adjust by the outer IP gso_segment handlers, but they don't set the mac_len. Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6 gso_segment handlers, after they modify the network_header. Many thanks to Eric Dumazet for his help in identifying the cause of the bug. Acked-by: Dave Taht <dave.taht@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13gso_segment: Reset skb->mac_len after modifying network headerToke Høiland-Jørgensen2-0/+2
When splitting a GSO segment that consists of encapsulated packets, the skb->mac_len of the segments can end up being set wrong, causing packet drops in particular when using act_mirred and ifb interfaces in combination with a qdisc that splits GSO packets. This happens because at the time skb_segment() is called, network_header will point to the inner header, throwing off the calculation in skb_reset_mac_len(). The network_header is subsequently adjust by the outer IP gso_segment handlers, but they don't set the mac_len. Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6 gso_segment handlers, after they modify the network_header. Many thanks to Eric Dumazet for his help in identifying the cause of the bug. Acked-by: Dave Taht <dave.taht@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetoothDavid S. Miller1-3/+13
Johan Hedberg says: ==================== pull request: bluetooth 2018-09-13 A few Bluetooth fixes for the 4.19-rc series: - Fixed rw_semaphore leak in hci_ldisc - Fixed local Out-of-Band pairing data handling Let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tls: clear key material from kernel memory when do_tls_setsockopt_conf failsSabrina Dubroca1-1/+1
Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>