aboutsummaryrefslogtreecommitdiffstats
path: root/net/core (follow)
AgeCommit message (Collapse)AuthorFilesLines
2021-11-01ethtool: push the rtnl_lock into dev_ethtool()Jakub Kicinski1-2/+0
Don't take the lock in net/core/dev_ioctl.c, we'll have things to do outside rtnl_lock soon. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29cls_flower: Fix inability to match GRE/IPIP packetsYoshiki Komachi1-0/+15
When a packet of a new flow arrives in openvswitch kernel module, it dissects the packet and passes the extracted flow key to ovs-vswtichd daemon. If hw- offload configuration is enabled, the daemon creates a new TC flower entry to bypass openvswitch kernel module for the flow (TC flower can also offload flows to NICs but this time that does not matter). In this processing flow, I found the following issue in cases of GRE/IPIP packets. When ovs_flow_key_extract() in openvswitch module parses a packet of a new GRE (or IPIP) flow received on non-tunneling vports, it extracts information of the outer IP header for ip_proto/src_ip/dst_ip match keys. This means ovs-vswitchd creates a TC flower entry with IP protocol/addresses match keys whose values are those of the outer IP header. OTOH, TC flower, which uses flow_dissector (different parser from openvswitch module), extracts information of the inner IP header. The following flow is an example to describe the issue in more detail. <----------- Outer IP -----------------> <---------- Inner IP ----------> +----------+--------------+--------------+----------+----------+----------+ | ip_proto | src_ip | dst_ip | ip_proto | src_ip | dst_ip | | 47 (GRE) | 192.168.10.1 | 192.168.10.2 | 6 (TCP) | 10.0.0.1 | 10.0.0.2 | +----------+--------------+--------------+----------+----------+----------+ In this case, TC flower entry and extracted information are shown as below: - ovs-vswitchd creates TC flower entry with: - ip_proto: 47 - src_ip: 192.168.10.1 - dst_ip: 192.168.10.2 - TC flower extracts below for IP header matches: - ip_proto: 6 - src_ip: 10.0.0.1 - dst_ip: 10.0.0.2 Thus, GRE or IPIP packets never match the TC flower entry, as each dissector behaves differently. IMHO, the behavior of TC flower (flow dissector) does not look correct, as ip_proto/src_ip/dst_ip in TC flower match means the outermost IP header information except for GRE/IPIP cases. This patch adds a new flow_dissector flag FLOW_DISSECTOR_F_STOP_BEFORE_ENCAP which skips dissection of the encapsulated inner GRE/IPIP header in TC flower classifier. Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29devlink: make all symbols GPL-onlyJakub Kicinski1-4/+4
devlink_alloc() and devlink_register() are both GPL. A non-GPL module won't get far, so for consistency we can make all symbols GPL without risking any real life breakage. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29mctp: Add flow extension to skbJeremy Kerr1-0/+19
This change adds a new skb extension for MCTP, to represent a request/response flow. The intention is to use this in a later change to allow i2c controllers to correctly configure a multiplexer over a flow. Since we have a cleanup function in the core path (if an extension is present), we'll need to make CONFIG_MCTP a bool, rather than a tristate. Includes a fix for a build warning with clang: Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski6-17/+60
include/net/sock.h 7b50ecfcc6cd ("net: Rename ->stream_memory_read to ->sock_is_readable") 4c1e34c0dbff ("vsock: Enable y2038 safe timeval for timeout") drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c 0daa55d033b0 ("octeontx2-af: cn10k: debugfs for dumping LMTST map table") e77bcdd1f639 ("octeontx2-af: Display all enabled PF VF rsrc_alloc entries.") Adjacent code addition in both cases, keep both. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28devlink: Simplify internal devlink params implementationLeon Romanovsky1-123/+46
Reduce extra indirection from devlink_params_*() API. Such change makes it clear that we can drop devlink->lock from these flows, because everything is executed when the devlink is not registered yet. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-27xdp: Remove redundant warningYajun Deng1-2/+0
There is a warning in xdp_rxq_info_unreg_mem_model() when reg_state isn't equal to REG_STATE_REGISTERED, so the warning in xdp_rxq_info_unreg() is redundant. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/r/20211027013856.1866-1-yajun.deng@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27Revert "devlink: Remove not-executed trap policer notifications"Leon Romanovsky1-5/+7
This reverts commit 22849b5ea5952d853547cc5e0651f34a246b2a4f as it revealed that mlxsw and netdevsim (copy/paste from mlxsw) reregisters devlink objects during another devlink user triggered command. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27Revert "devlink: Remove not-executed trap group notifications"Leon Romanovsky1-5/+7
This reverts commit 8bbeed4858239ac956a78e5cbaf778bd6f3baef8 as it revealed that mlxsw and netdevsim (copy/paste from mlxsw) reregisters devlink objects during another devlink user triggered command. Fixes: 22849b5ea595 ("devlink: Remove not-executed trap policer notifications") Reported-by: syzbot+93d5accfaefceedf43c1@syzkaller.appspotmail.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26net: Prevent HW-GRO and LRO features operate togetherBen Ben-ishay1-0/+5
LRO and HW-GRO are mutually exclusive, this commit adds this restriction in netdev_fix_feature. HW-GRO is preferred, that means in case both HW-GRO and LRO features are requested, LRO is cleared. Signed-off-by: Ben Ben-ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-26Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski2-1/+15
Daniel Borkmann says: ==================== pull-request: bpf 2021-10-26 We've added 12 non-merge commits during the last 7 day(s) which contain a total of 23 files changed, 118 insertions(+), 98 deletions(-). The main changes are: 1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen. 2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang. 3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai. 4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer. 5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun. 6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian. 7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix potential race in tail call compatibility check bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET selftests/bpf: Use recv_timeout() instead of retries net: Implement ->sock_is_readable() for UDP and AF_UNIX skmsg: Extract and reuse sk_msg_is_readable() net: Rename ->stream_memory_read to ->sock_is_readable tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function cgroup: Fix memory leak caused by missing cgroup_bpf_offline bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch() bpf: Prevent increasing bpf_jit_limit above max bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT bpf: Define bpf_jit_alloc_exec_limit for riscv JIT ==================== Link: https://lore.kernel.org/r/20211026201920.11296-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26skmsg: Extract and reuse sk_msg_is_readable()Cong Wang1-0/+14
tcp_bpf_sock_is_readable() is pretty much generic, we can extract it and reuse it for non-TCP sockets. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211008203306.37525-3-xiyou.wangcong@gmail.com
2021-10-26net: multicast: calculate csum of looped-back and forwarded packetsCyril Strejc1-1/+2
During a testing of an user-space application which transmits UDP multicast datagrams and utilizes multicast routing to send the UDP datagrams out of defined network interfaces, I've found a multicast router does not fill-in UDP checksum into locally produced, looped-back and forwarded UDP datagrams, if an original output NIC the datagrams are sent to has UDP TX checksum offload enabled. The datagrams are sent malformed out of the NIC the datagrams have been forwarded to. It is because: 1. If TX checksum offload is enabled on the output NIC, UDP checksum is not calculated by kernel and is not filled into skb data. 2. dev_loopback_xmit(), which is called solely by ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY unconditionally. 3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except CHECKSUM_COMPLETE"), the ip_summed value is preserved during forwarding. 4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during a packet egress. The minimum fix in dev_loopback_xmit(): 1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the case when the original output NIC has TX checksum offload enabled. The effects are: a) If the forwarding destination interface supports TX checksum offloading, the NIC driver is responsible to fill-in the checksum. b) If the forwarding destination interface does NOT support TX checksum offloading, checksums are filled-in by kernel before skb is submitted to the NIC driver. c) For local delivery, checksum validation is skipped as in the case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary(). 2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It means, for CHECKSUM_NONE, the behavior is unmodified and is there to skip a looped-back packet local delivery checksum validation. Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-25net-sysfs: initialize uid and gid before calling net_ns_get_ownershipXin Long1-2/+2
Currently in net_ns_get_ownership() it may not be able to set uid or gid if make_kuid or make_kgid returns an invalid value, and an uninit-value issue can be triggered by this. This patch is to fix it by initializing the uid and gid before calling net_ns_get_ownership(), as it does in kobject_get_ownership() Fixes: e6dee9f3893c ("net-sysfs: add netdev_change_owner()") Reported-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-25net: Prevent infinite while loop in skb_tx_hash()Michael Chan1-0/+6
Drivers call netdev_set_num_tc() and then netdev_set_tc_queue() to set the queue count and offset for each TC. So the queue count and offset for the TCs may be zero for a short period after dev->num_tc has been set. If a TX packet is being transmitted at this time in the code path netdev_pick_tx() -> skb_tx_hash(), skb_tx_hash() may see nonzero dev->num_tc but zero qcount for the TC. The while loop that keeps looping while hash >= qcount will not end. Fix it by checking the TC's qcount to be nonzero before using it. Fixes: eadec877ce9c ("net: Add support for subordinate traffic classes to netdev_pick_tx") Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-24net: rtnetlink: use __dev_addr_set()Jakub Kicinski1-2/+2
Get it ready for constant netdev->dev_addr. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-24net: core: constify mac addrs in selftestsJakub Kicinski1-4/+4
Get it ready for constant netdev->dev_addr. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-22bpf: Prevent increasing bpf_jit_limit above maxLorenz Bauer1-1/+1
Restrict bpf_jit_limit to the maximum supported by the arch's JIT. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211014142554.53120-4-lmb@cloudflare.com
2021-10-22devlink: Clean not-executed param notificationsLeon Romanovsky1-4/+11
The parameters are registered before devlink_register() and all the notifications are delayed. This patch removes not-possible parameters notifications along with addition of code annotation logic. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-22devlink: Remove not-executed trap group notificationsLeon Romanovsky1-7/+5
The trap logic is registered before devlink_register() and all the notifications are delayed. This patch removes not-possible trap group notifications along with addition of code annotation logic. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-22devlink: Remove not-executed trap policer notificationsLeon Romanovsky1-7/+5
The trap policer logic is registered before devlink_register() and all the notifications are delayed. This patch removes not-possible notifications along with addition of code annotation logic. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-22devlink: Delete obsolete parameters publish APILeon Romanovsky1-49/+8
The change of devlink_register() to be last devlink command together with delayed notification logic made the publish API to be obsolete. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-22skb_expand_head() adjust skb->truesize incorrectlyVasily Averin2-13/+35
Christoph Paasch reports [1] about incorrect skb->truesize after skb_expand_head() call in ip6_xmit. This may happen because of two reasons: - skb_set_owner_w() for newly cloned skb is called too early, before pskb_expand_head() where truesize is adjusted for (!skb-sk) case. - pskb_expand_head() does not adjust truesize in (skb->sk) case. In this case sk->sk_wmem_alloc should be adjusted too. [1] https://lkml.org/lkml/2021/8/20/1082 Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()") Fixes: 2d85a1b31dde ("ipv6: ip6_finish_output2: set sk into newly allocated nskb") Reported-by: Christoph Paasch <christoph.paasch@gmail.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/644330dd-477e-0462-83bf-9f514c41edd1@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-21bpf: Add bpf_skc_to_unix_sock() helperHengqi Chen1-0/+23
The helper is used in tracing programs to cast a socket pointer to a unix_sock pointer. The return value could be NULL if the casting is illegal. Suggested-by: Yonghong Song <yhs@fb.com> Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211021134752.1223426-2-hengqi.chen@gmail.com
2021-10-21net/core: Remove unused assignment operations and variableluo penghao1-2/+1
Although if_info_size is assigned, it has not been used. And the variable should also be deleted. The clang_analyzer complains as follows: net/core/rtnetlink.c:3806: warning: Although the value stored to 'if_info_size' is used in the enclosing expression, the value is never actually read from 'if_info_size'. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-21net: stats: Read the statistics in ___gnet_stats_copy_basic() instead of adding.Sebastian Andrzej Siewior1-6/+37
Since the rework, the statistics code always adds up the byte and packet value(s). On 32bit architectures a seqcount_t is used in gnet_stats_basic_sync to ensure that the 64bit values are not modified during the read since two 32bit loads are required. The usage of a seqcount_t requires a lock to ensure that only one writer is active at a time. This lock leads to disabled preemption during the update. The lack of disabling preemption is now creating a warning as reported by Naresh since the query done by gnet_stats_copy_basic() is in preemptible context. For ___gnet_stats_copy_basic() there is no need to disable preemption since the update is performed on stack and can't be modified by another writer. Instead of disabling preemption, to avoid the warning, simply create a read function to just read the values and return as u64. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Fixes: 67c9e6270f301 ("net: sched: Protect Qdisc::bstats with u64_stats") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20net-core: use netdev_* calls for kernel messagesJesse Brandeburg1-12/+10
While loading a driver and changing the number of queues, I noticed this message in the kernel log: "[253489.070080] Number of in use tx queues changed invalidating tc mappings. Priority traffic classification disabled!" But I had no idea what interface was being talked about because this message used pr_warn(). After investigating, it appears we can use the netdev_* helpers already defined to create predictably formatted messages, and that already handle <unknown netdev> cases, in more of the messages in dev.c. After this change, this message (and others) will look like this: "[ 170.181093] ice 0000:3b:00.0 ens785f0: Number of in use tx queues changed invalidating tc mappings. Priority traffic classification disabled!" One goal here was not to change the message significantly from the original format so as to not break user's expectations, so I just changed messages that used pr_* and generally started with %s == dev->name. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19devlink: Remove extra device_lock assert checksLeon Romanovsky1-2/+0
PCI core code in the pci_call_probe() has a path that doesn't hold device_lock. It happens because the ->probe() is called through the workqueue mechanism. 349 static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev, 350 const struct pci_device_id *id) 351 { 352 .... 377 if (cpu < nr_cpu_ids) 378 error = work_on_cpu(cpu, local_pci_probe, &ddi); Luckily enough, the core still ensures that only single flow is executed, so it safe to remove the assert checks that anyway were added for annotations purposes. Fixes: b88f7b1203bf ("devlink: Annotate devlink API calls") Reported-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19net: sched: Allow statistics reads from softirq.Sebastian Andrzej Siewior1-1/+1
Eric reported that the rate estimator reads statics from the softirq which in turn triggers a warning introduced in the statistics rework. The warning is too cautious. The updates happen in the softirq context so reads from softirq are fine since the writes can not be preempted. The updates/writes happen during qdisc_run() which ensures one writer and the softirq context. The remaining bad context for reading statistics remains in hard-IRQ because it may preempt a writer. Fixes: 29cbcd8582837 ("net: sched: Remove Qdisc::running sequence counter") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller1-4/+15
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS for net-next: 1) Add new run_estimation toggle to IPVS to stop the estimation_timer logic, from Dust Li. 2) Relax superfluous dynset check on NFT_SET_TIMEOUT. 3) Add egress hook, from Lukas Wunner. 4) Nowadays, almost all hook functions in x_table land just call the hook evaluation loop. Remove remaining hook wrappers from iptables and IPVS. From Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18net: sched: Remove Qdisc::running sequence counterAhmed S. Darwish2-28/+38
The Qdisc::running sequence counter has two uses: 1. Reliably reading qdisc's tc statistics while the qdisc is running (a seqcount read/retry loop at gnet_stats_add_basic()). 2. As a flag, indicating whether the qdisc in question is running (without any retry loops). For the first usage, the Qdisc::running sequence counter write section, qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what is actually needed: the raw qdisc's bstats update. A u64_stats sync point was thus introduced (in previous commits) inside the bstats structure itself. A local u64_stats write section is then started and stopped for the bstats updates. Use that u64_stats sync point mechanism for the bstats read/retry loop at gnet_stats_add_basic(). For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag, accessed with atomic bitops, is sufficient. Using a bit flag instead of a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads to the SMP barriers implicitly added through raw_read_seqcount() and write_seqcount_begin/end() getting removed. All call sites have been surveyed though, and no required ordering was identified. Now that the qdisc->running sequence counter is no longer used, remove it. Note, using u64_stats implies no sequence counter protection for 64-bit architectures. This can lead to the qdisc tc statistics "packets" vs. "bytes" values getting out of sync on rare occasions. The individual values will still be valid. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data typesAhmed S. Darwish2-45/+53
The only factor differentiating per-CPU bstats data type (struct gnet_stats_basic_cpu) from the packed non-per-CPU one (struct gnet_stats_basic_packed) was a u64_stats sync point inside the former. The two data types are now equivalent: earlier commits added a u64_stats sync point to the latter. Combine both data types into "struct gnet_stats_basic_sync". This eliminates redundancy and simplifies the bstats read/write APIs. Use u64_stats_t for bstats "packets" and "bytes" data types. On 64-bit architectures, u64_stats sync points do not use sequence counter protection. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18net: sched: Use _bstats_update/set() instead of raw writesAhmed S. Darwish1-4/+5
The Qdisc::running sequence counter, used to protect Qdisc::bstats reads from parallel writes, is in the process of being removed. Qdisc::bstats read/writes will synchronize using an internal u64_stats sync point instead. Modify all bstats writes to use _bstats_update(). This ensures that the internal u64_stats sync point is always acquired and released as appropriate. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18net: sched: Protect Qdisc::bstats with u64_statsAhmed S. Darwish2-3/+13
The not-per-CPU variant of qdisc tc (traffic control) statistics, Qdisc::gnet_stats_basic_packed bstats, is protected with Qdisc::running sequence counter. This sequence counter is used for reliably protecting bstats reads from parallel writes. Meanwhile, the seqcount's write section covers a much wider area than bstats update: qdisc_run_begin() => qdisc_run_end(). That read/write section asymmetry can lead to needless retries of the read section. To prepare for removing the Qdisc::running sequence counter altogether, introduce a u64_stats sync point inside bstats instead. Modify _bstats_update() to start/end the bstats u64_stats write section. For bisectability, and finer commits granularity, the bstats read section is still protected with a Qdisc::running read/retry loop and qdisc_run_begin/end() still starts/ends that seqcount write section. Once all call sites are modified to use _bstats_update(), the Qdisc::running seqcount will be removed and bstats read/retry loop will be modified to utilize the internal u64_stats sync point. Note, using u64_stats implies no sequence counter protection for 64-bit architectures. This can lead to the statistics "packets" vs. "bytes" values getting out of sync on rare occasions. The individual values will still be valid. [bigeasy: Minor commit message edits, init all gnet_stats_basic_packed.] Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18gen_stats: Move remaining users to gnet_stats_add_queue().Sebastian Andrzej Siewior1-37/+2
The gnet_stats_queue::qlen member is only used in the SMP-case. qdisc_qstats_qlen_backlog() needs to add qdisc_qlen() to qstats.qlen to have the same value as that provided by qdisc_qlen_sum(). gnet_stats_copy_queue() needs to overwritte the resulting qstats.qlen field whith the caller submitted qlen value. It might be differ from the submitted value. Let both functions use gnet_stats_add_queue() and remove unused __gnet_stats_copy_queue(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18gen_stats: Add gnet_stats_add_queue().Sebastian Andrzej Siewior1-0/+32
This function will replace __gnet_stats_copy_queue(). It reads all arguments and adds them into the passed gnet_stats_queue argument. In contrast to __gnet_stats_copy_queue() it also copies the qlen member. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18gen_stats: Add instead Set the value in __gnet_stats_copy_basic().Sebastian Andrzej Siewior2-14/+17
__gnet_stats_copy_basic() always assigns the value to the bstats argument overwriting the previous value. The later added per-CPU version always accumulated the values in the returning gnet_stats_basic_packed argument. Based on review there are five users of that function as of today: - est_fetch_counters(), ___gnet_stats_copy_basic() memsets() bstats to zero, single invocation. - mq_dump(), mqprio_dump(), mqprio_dump_class_stats() memsets() bstats to zero, multiple invocation but does not use the function due to !qdisc_is_percpu_stats(). Add the values in __gnet_stats_copy_basic() instead overwriting. Rename the function to gnet_stats_add_basic() to make it more obvious. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net: make use of helper netif_is_bridge_master()Kyungrok Chung1-1/+1
Make use of netdev helper functions to improve code readability. Replace 'dev->priv_flags & IFF_EBRIDGE' with netif_is_bridge_master(dev). Signed-off-by: Kyungrok Chung <acadx0@gmail.com> Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net: stream: don't purge sk_error_queue in sk_stream_kill_queues()Jakub Kicinski1-3/+0
sk_stream_kill_queues() can be called on close when there are still outstanding skbs to transmit. Those skbs may try to queue notifications to the error queue (e.g. timestamps). If sk_stream_kill_queues() purges the queue without taking its lock the queue may get corrupted, and skbs leaked. This shows up as a warning about an rmem leak: WARNING: CPU: 24 PID: 0 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x... The leak is always a multiple of 0x300 bytes (the value is in %rax on my builds, so RAX: 0000000000000300). 0x300 is truesize of an empty sk_buff. Indeed if we dump the socket state at the time of the warning the sk_error_queue is often (but not always) corrupted. The ->next pointer points back at the list head, but not the ->prev pointer. Indeed we can find the leaked skb by scanning the kernel memory for something that looks like an skb with ->sk = socket in question, and ->truesize = 0x300. The contents of ->cb[] of the skb confirms the suspicion that it is indeed a timestamp notification (as generated in __skb_complete_tx_timestamp()). Removing purging of sk_error_queue should be okay, since inet_sock_destruct() does it again once all socket refs are gone. Eric suggests this may cause sockets that go thru disconnect() to maintain notifications from the previous incarnations of the socket, but that should be okay since the race was there anyway, and disconnect() is not exactly dependable. Thanks to Jonathan Lemon and Omar Sandoval for help at various stages of tracing the issue. Fixes: cb9eff097831 ("net: new user space API for time stamping of incoming and outgoing packets") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15page_pool: disable dma mapping support for 32-bit arch with 64-bit DMAYunsheng Lin1-4/+6
As the 32-bit arch with 64-bit DMA seems to rare those days, and page pool might carry a lot of code and complexity for systems that possibly. So disable dma mapping support for such systems, if drivers really want to work on such systems, they have to implement their own DMA-mapping fallback tracking outside page_pool. Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-14net, neigh: Reject creating NUD_PERMANENT with NTF_MANAGED entriesDaniel Borkmann1-3/+8
The combination of NUD_PERMANENT + NTF_MANAGED is not supported and does not make sense either given the former indicates a static/fixed neighbor entry whereas the latter a dynamically resolved one. While it is possible to transition from one over to the other, we should however reject such creation attempts. Fixes: 7482e3841d52 ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries") Suggested-by: David Ahern <dsahern@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14net, neigh: Use NLA_POLICY_MASK helper for NDA_FLAGS_EXT attributeDaniel Borkmann1-5/+1
Instead of open-coding a check for invalid bits in NTF_EXT_MASK, we can just use the NLA_POLICY_MASK() helper instead, and simplify NDA_FLAGS_EXT sanity check this way. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14net, neigh: Add build-time assertion to avoid neigh->flags overflowDaniel Borkmann1-0/+3
Currently, NDA_FLAGS_EXT flags allow a maximum of 24 bits to be used for extended neighbor flags. These are eventually fed into neigh->flags by shifting with NTF_EXT_SHIFT as per commit 2c611ad97a82 ("net, neigh: Extend neigh->flags to 32 bit to allow for extensions"). If really ever needed in future, the full 32 bits from NDA_FLAGS_EXT can be used, it would only require to move neigh->flags from u32 to u64 inside the kernel. Add a build-time assertion such that when extending the NTF_EXT_MASK with new bits, we'll trigger an error once we surpass the 24th bit. This assumes that no bit holes in new NTF_EXT_* flags will slip in from UAPI, but I think this is reasonable to assume. Suggested-by: David Ahern <dsahern@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-13/+11
tools/testing/selftests/net/ioam6.sh 7b1700e009cc ("selftests: net: modify IOAM tests for undef bits") bf77b1400a56 ("selftests: net: Test for the IOAM encapsulation with IPv6") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14netfilter: Introduce egress hookLukas Wunner1-2/+13
Support classifying packets with netfilter on egress to satisfy user requirements such as: * outbound security policies for containers (Laura) * filtering and mangling intra-node Direct Server Return (DSR) traffic on a load balancer (Laura) * filtering locally generated traffic coming in through AF_PACKET, such as local ARP traffic generated for clustering purposes or DHCP (Laura; the AF_PACKET plumbing is contained in a follow-up commit) * L2 filtering from ingress and egress for AVB (Audio Video Bridging) and gPTP with nftables (Pablo) * in the future: in-kernel NAT64/NAT46 (Pablo) The egress hook introduced herein complements the ingress hook added by commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key"). A patch for nftables to hook up egress rules from user space has been submitted separately, so users may immediately take advantage of the feature. Alternatively or in addition to netfilter, packets can be classified with traffic control (tc). On ingress, packets are classified first by tc, then by netfilter. On egress, the order is reversed for symmetry. Conceptually, tc and netfilter can be thought of as layers, with netfilter layered above tc. Traffic control is capable of redirecting packets to another interface (man 8 tc-mirred). E.g., an ingress packet may be redirected from the host namespace to a container via a veth connection: tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container) In this case, netfilter egress classifying is not performed when leaving the host namespace! That's because the packet is still on the tc layer. If tc redirects the packet to a physical interface in the host namespace such that it leaves the system, the packet is never subjected to netfilter egress classifying. That is only logical since it hasn't passed through netfilter ingress classifying either. Packets can alternatively be redirected at the netfilter layer using nft fwd. Such a packet *is* subjected to netfilter egress classifying since it has reached the netfilter layer. Internally, the skb->nf_skip_egress flag controls whether netfilter is invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may be called recursively by tunnel drivers such as vxlan, the flag is reverted to false after sch_handle_egress(). This ensures that netfilter is applied both on the overlay and underlying network. Interaction between tc and netfilter is possible by setting and querying skb->mark. If netfilter egress classifying is not enabled on any interface, it is patched out of the data path by way of a static_key and doesn't make a performance difference that is discernible from noise: Before: 1537 1538 1538 1537 1538 1537 Mb/sec After: 1536 1534 1539 1539 1539 1540 Mb/sec Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec After + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec Before + tc drop: 1620 1619 1619 1619 1620 1620 Mb/sec After + tc drop: 1616 1624 1625 1624 1622 1619 Mb/sec When netfilter egress classifying is enabled on at least one interface, a minimal performance penalty is incurred for every egress packet, even if the interface it's transmitted over doesn't have any netfilter egress rules configured. That is caused by checking dev->nf_hooks_egress against NULL. Measurements were performed on a Core i7-3615QM. Commands to reproduce: ip link add dev foo type dummy ip link set dev foo up modprobe pktgen echo "add_device foo" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1 Accept all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,' Drop all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,' Apply this patch when measuring packet drops to avoid errors in dmesg: https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/ Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Laura García Liébana <nevola@gmail.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: Generalize ingress hook include fileLukas Wunner1-1/+1
Prepare for addition of a netfilter egress hook by generalizing the ingress hook include file. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: Rename ingress hook include fileLukas Wunner1-1/+1
Prepare for addition of a netfilter egress hook by renaming <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>. The egress hook also necessitates a refactoring of the include file, but that is done in a separate commit to ease reviewing. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-13Revert "net: procfs: add seq_puts() statement for dev_mcast"Vladimir Oltean1-13/+11
This reverts commit ec18e8455484370d633a718c6456ddbf6eceef21. It turns out that there are user space programs which got broken by that change. One example is the "ifstat" program shipped by Debian: https://packages.debian.org/source/bullseye/ifstat which, confusingly enough, seems to not have anything in common with the much more familiar (at least to me) ifstat program from iproute2: https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/misc/ifstat.c root@debian:~# ifstat ifstat: /proc/net/dev: unsupported format. This change modified the header (first two lines of text) in /proc/net/dev so that it looks like this: root@debian:~# cat /proc/net/dev Interface| Receive | Transmit | bytes packets errs drop fifo frame compressed multicast| bytes packets errs drop fifo colls carrier compressed lo: 97400 1204 0 0 0 0 0 0 97400 1204 0 0 0 0 0 0 bond0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sit0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 eno2: 5002206 6651 0 0 0 0 0 0 105518642 1465023 0 0 0 0 0 0 swp0: 134531 2448 0 0 0 0 0 0 99599598 1464381 0 0 0 0 0 0 swp1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 swp2: 4867675 4203 0 0 0 0 0 0 58134 631 0 0 0 0 0 0 sw0p0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw0p1: 124739 2448 0 1422 0 0 0 0 93741184 1464369 0 0 0 0 0 0 sw0p2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p0: 4850863 4203 0 0 0 0 0 0 54722 619 0 0 0 0 0 0 sw2p1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p3: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 br0: 10508 212 0 212 0 0 0 212 61369558 958857 0 0 0 0 0 0 whereas before it looked like this: root@debian:~# cat /proc/net/dev Inter-| Receive | Transmit face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed lo: 13160 164 0 0 0 0 0 0 13160 164 0 0 0 0 0 0 bond0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sit0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 eno2: 30824 268 0 0 0 0 0 0 3332 37 0 0 0 0 0 0 swp0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 swp1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 swp2: 30824 268 0 0 0 0 0 0 2428 27 0 0 0 0 0 0 sw0p0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw0p1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw0p2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p0: 29752 268 0 0 0 0 0 0 1564 17 0 0 0 0 0 0 sw2p1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p3: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 The reason why the ifstat shipped by Debian (v1.1, with a Debian patch upgrading it to 1.1-8.1 at the time of writing) is broken is because its "proc" driver/backend parses the header very literally: main/drivers.c#L825 if (!data->checked && strncmp(buf, "Inter-|", 7)) goto badproc; and there's no way in which the header can be changed such that programs parsing like that would not get broken. Even if we fix this ancient and very "lightly" maintained program to parse the text output of /proc/net/dev in a more sensible way, this story seems bound to repeat again with other programs, and modifying them all could cause more trouble than it's worth. On the other hand, the reverted patch had no other reason than an aesthetic one, so reverting it is the simplest way out. I don't know what other distributions would be affected; the fact that Debian doesn't ship the iproute2 version of the program (a different code base altogether, which uses netlink and not /proc/net/dev) is surprising in itself. Fixes: ec18e8455484 ("net: procfs: add seq_puts() statement for dev_mcast") Link: https://lore.kernel.org/netdev/20211009163511.vayjvtn3rrteglsu@skbuf/ Cc: Yajun Deng <yajun.deng@linux.dev> Cc: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20211013001909.3164185-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12devlink: Delete reload enable/disable interfaceLeon Romanovsky1-44/+3
Commit a0c76345e3d3 ("devlink: disallow reload operation during device cleanup") added devlink_reload_{enable,disable}() APIs to prevent reload operation from racing with device probe/dismantle. After recent changes to move devlink_register() to the end of device probe and devlink_unregister() to the beginning of device dismantle, these races can no longer happen. Reload operations will be denied if the devlink instance is unregistered and devlink_unregister() will block until all in-flight operations are done. Therefore, remove these devlink_reload_{enable,disable}() APIs. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12devlink: Allow control devlink ops behavior through feature maskLeon Romanovsky1-2/+22
Introduce new devlink call to set feature mask to control devlink behavior during device initialization phase after devlink_alloc() is already called. This allows us to set reload ops based on device property which is not known at the beginning of driver initialization. For the sake of simplicity, this API lacks any type of locking and needs to be called before devlink_register() to make sure that no parallel access to the ops is possible at this stage. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>