aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/net (follow)
AgeCommit message (Collapse)AuthorFilesLines
2023-03-17inet: preserve const qualifier in inet_sk()Eric Dumazet4-5/+6
We can change inet_sk() to propagate const qualifier of its argument. This should avoid some potential errors caused by accidental (const -> not_const) promotion. Other helpers like tcp_sk(), udp_sk(), raw_sk() will be handled in separate patch series. v2: use container_of_const() as advised by Jakub and Linus Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/20230315142841.3a2ac99a@kernel.org/ Link: https://lore.kernel.org/netdev/CAHk-=wiOf12nrYEF2vJMcucKjWPN-Ns_SW9fA7LwST_2Dzp7rw@mail.gmail.com/ Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->pressure to an atomic flagEric Dumazet2-7/+9
Not only this removes some READ_ONCE()/WRITE_ONCE(), this also removes one integer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->running to an atomic flagEric Dumazet3-12/+12
Instead of consuming 32 bits for po->running, use one available bit in po->flags. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->has_vnet_hdr to an atomic flagEric Dumazet3-11/+12
po->has_vnet_hdr can be read locklessly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->tp_loss to an atomic flagEric Dumazet3-6/+6
tp_loss can be read locklessly. Convert it to an atomic flag to avoid races. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->tp_tx_has_off to an atomic flagEric Dumazet2-5/+5
This is to use existing space in po->flags, and reclaim the storage used by the non atomic bit fields. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: annotate accesses to po->tp_tstampEric Dumazet2-5/+6
tp_tstamp is read locklessly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->auxdata to an atomic flagEric Dumazet3-8/+6
po->auxdata can be read while another thread is changing its value, potentially raising KCSAN splat. Convert it to PACKET_SOCK_AUXDATA flag. Fixes: 8dc419447415 ("[PACKET]: Add optional checksum computation for recvmsg") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->origdev to an atomic flagEric Dumazet3-8/+26
syzbot/KCAN reported that po->origdev can be read while another thread is changing its value. We can avoid this splat by converting this field to an actual bit. Following patches will convert remaining 1bit fields. Fixes: 80feaacb8a64 ("[AF_PACKET]: Add option to return orig_dev to userspace.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: annotate accesses to po->xmitEric Dumazet1-4/+8
po->xmit can be set from setsockopt(PACKET_QDISC_BYPASS), while read locklessly. Use READ_ONCE()/WRITE_ONCE() to avoid potential load/store tearing issues. Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17af_unix: annotate lockless accesses to sk->sk_errEric Dumazet1-4/+5
unix_poll() and unix_dgram_poll() read sk->sk_err without any lock held. Add relevant READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17mptcp: annotate lockless accesses to sk->sk_errEric Dumazet3-7/+7
mptcp_poll() reads sk->sk_err without socket lock held/owned. Add READ_ONCE() and WRITE_ONCE() to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17tcp: annotate lockless access to sk->sk_errEric Dumazet6-14/+15
tcp_poll() reads sk->sk_err without socket lock held/owned. We should used READ_ONCE() here, and update writers to use WRITE_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net: annotate lockless accesses to sk->sk_err_softEric Dumazet6-6/+6
This field can be read/written without lock synchronization. tcp and dccp have been handled in different patches. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17dccp: annotate lockless accesses to sk->sk_err_softEric Dumazet3-11/+14
This field can be read/written without lock synchronization. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17tcp: annotate lockless accesses to sk->sk_err_softEric Dumazet4-12/+13
This field can be read/written without lock synchronization. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17vlan: partially enable SIOCSHWTSTAMP in containerVadim Fedorenko1-1/+1
Setting timestamp filter was explicitly disabled on vlan devices in containers because it might affect other processes on the host. But it's absolutely legit in case when real device is in the same namespace. Fixes: 873017af7784 ("vlan: disable SIOCSHWTSTAMP in container") Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17rtnetlink: bridge: mcast: Relax group address validation in common codeIdo Schimmel2-2/+9
In the upcoming VXLAN MDB implementation, the 0.0.0.0 and :: MDB entries will act as catchall entries for unregistered IP multicast traffic in a similar fashion to the 00:00:00:00:00:00 VXLAN FDB entry that is used to transmit BUM traffic. In deployments where inter-subnet multicast forwarding is used, not all the VTEPs in a tenant domain are members in all the broadcast domains. It is therefore advantageous to transmit BULL (broadcast, unknown unicast and link-local multicast) and unregistered IP multicast traffic on different tunnels. If the same tunnel was used, a VTEP only interested in IP multicast traffic would also pull all the BULL traffic and drop it as it is not a member in the originating broadcast domain [1]. Prepare for this change by allowing the 0.0.0.0 group address in the common rtnetlink MDB code and forbid it in the bridge driver. A similar change is not needed for IPv6 because the common code only validates that the group address is not the all-nodes address. [1] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast#section-2.6 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17rtnetlink: bridge: mcast: Move MDB handlers out of bridge driverIdo Schimmel5-318/+244
Currently, the bridge driver registers handlers for MDB netlink messages, making it impossible for other drivers to implement MDB support. As a preparation for VXLAN MDB support, move the MDB handlers out of the bridge driver to the core rtnetlink code. The rtnetlink code will call into individual drivers by invoking their previously added MDB net device operations. Note that while the diffstat is large, the change is mechanical. It moves code out of the bridge driver to rtnetlink code. Also note that a similar change was made in 2012 with commit 77162022ab26 ("net: add generic PF_BRIDGE:RTM_ FDB hooks") that moved FDB handlers out of the bridge driver to the core rtnetlink code. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17bridge: mcast: Implement MDB net device operationsIdo Schimmel3-0/+152
Implement the previously added MDB net device operations in the bridge driver so that they could be invoked by core rtnetlink code in the next patch. The operations are identical to the existing br_mdb_{dump,add,del} functions. The '_new' suffix will be removed in the next patch. The functions are re-implemented in this patch to make the conversion in the next patch easier to review. Add dummy implementations when 'CONFIG_BRIDGE_IGMP_SNOOPING' is disabled, so that an error will be returned to user space when it is trying to add or delete an MDB entry. This is consistent with existing behavior where the bridge driver does not even register rtnetlink handlers for RTM_{NEW,DEL,GET}MDB messages when this Kconfig option is disabled. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-15net/smc: Introduce explicit check for v2 supportStefan Raspl1-1/+1
Previously, v2 support was derived from a very specific format of the SEID as part of the SMC-D codebase. Make this part of the SMC-D device API, so implementers do not need to adhere to a specific SEID format. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-15ipv6: remove one read_lock()/read_unlock() pair in rt6_check_neigh()Eric Dumazet1-4/+4
rt6_check_neigh() uses read_lock() to protect n->nud_state reading. This seems overkill and causes false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-15neighbour: annotate lockless accesses to n->nud_stateEric Dumazet11-32/+33
We have many lockless accesses to n->nud_state. Before adding another one in the following patch, add annotations to readers and writers. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-13net: hsr: Don't log netdev_err message on unknown prp dst nodeKristian Overskeid1-1/+1
If no frames has been exchanged with a node for HSR_NODE_FORGET_TIME, the node will be deleted from the node_db list. If a frame is sent to the node after it is deleted, a netdev_err message for each slave interface is produced. This should not happen with dan nodes because of supervision frames, but can happen often with san nodes, which clutters the kernel log. Since the hsr protocol does not support sans, this is only relevant for the prp protocol. Signed-off-by: Kristian Overskeid <koverskeid@gmail.com> Link: https://lore.kernel.org/r/20230309092302.179586-1-koverskeid@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-13net: socket: suppress unused warningVincenzo Palazzo1-1/+1
suppress unused warnings and fix the error that there is with the W=1 enabled. Warning generated net/socket.c: In function ‘__sys_getsockopt’: net/socket.c:2300:13: error: variable ‘max_optlen’ set but not used [-Werror=unused-but-set-variable] 2300 | int max_optlen; Signed-off-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230310221851.304657-1-vincenzopalazzodev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-10Merge tag 'wireless-next-2023-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-nextJakub Kicinski17-187/+506
Johannes Berg says: ==================== wireless-next patches for 6.4 Major changes: cfg80211 * 6 GHz improvements * HW timestamping support * support for randomized auth/deauth TA for PASN privacy (also for mac80211) mac80211 * radiotap TLV and EHT support for the iwlwifi sniffer * HW timestamping support * per-link debugfs for multi-link brcmfmac * support for Apple (M1 Pro/Max) devices iwlwifi * support for a few new devices * EHT sniffer support rtw88 * better support for some SDIO devices (e.g. MAC address from efuse) rtw89 * HW scan support for 8852b * better support for 6 GHz scanning * tag 'wireless-next-2023-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (84 commits) wifi: iwlwifi: mvm: fix EOF bit reporting wifi: iwlwifi: Do not include radiotap EHT user info if not needed wifi: iwlwifi: mvm: add EHT RU allocation to radiotap wifi: iwlwifi: Update logs for yoyo reset sw changes wifi: iwlwifi: mvm: clean up duplicated defines wifi: iwlwifi: rs-fw: break out for unsupported bandwidth wifi: iwlwifi: Add support for B step of BnJ-Fm4 wifi: iwlwifi: mvm: make flush code a bit clearer wifi: iwlwifi: mvm: avoid UB shift of snif_queue wifi: iwlwifi: mvm: add primary 80 known for EHT radiotap wifi: iwlwifi: mvm: parse FW frame metadata for EHT sniffer mode wifi: iwlwifi: mvm: decode USIG_B1_B7 RU to nl80211 RU width wifi: iwlwifi: mvm: rename define to generic name wifi: iwlwifi: mvm: allow Microsoft to use TAS wifi: iwlwifi: mvm: add all EHT based on data0 info from HW wifi: iwlwifi: mvm: add EHT radiotap info based on rate_n_flags wifi: iwlwifi: mvm: add an helper function radiotap TLVs wifi: radiotap: separate vendor TLV into header/content wifi: iwlwifi: reduce verbosity of some logging events wifi: iwlwifi: Adding the code to get RF name for MsP device ... ==================== Link: https://lore.kernel.org/r/20230310120159.36518-1-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-10skbuff: Add likely to skb pointer in build_skb()Gal Pressman1-1/+1
Similarly to napi_build_skb(), it is likely the skb allocation in build_skb() succeeded. frag_size != 0 is also likely, as stated in __build_skb_around(). Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-10skbuff: Replace open-coded skb_propagate_pfmemalloc()sGal Pressman1-4/+2
Use skb_propagate_pfmemalloc() in build_skb()/build_skb_around() instead of open-coding it. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09udp: introduce __sk_mem_schedule() usageJason Xing1-11/+16
Keep the accounting schema consistent across different protocols with __sk_mem_schedule(). Besides, it adjusts a little bit on how to calculate forward allocated memory compared to before. After applied this patch, we could avoid receive path scheduling extra amount of memory. Link: https://lore.kernel.org/lkml/20230221110344.82818-1-kerneljasonxing@gmail.com/ Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230308021153.99777-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09Merge branch 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski6-77/+142
Florian Westphal says: ==================== Netfilter updates for net-next 1. nf_tables 'brouting' support, from Sriram Yagnaraman. 2. Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header, from Xin Long. This comes with a test BIG TCP test case, added to tools/testing/selftests/net/. 3. Fix spelling and indentation in conntrack, from Jeremy Sowden. * 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nat: fix indentation of function arguments netfilter: conntrack: fix typo selftests: add a selftest for big tcp netfilter: use nf_ip6_check_hbh_len in nf_ct_skb_network_trim netfilter: move br_nf_check_hbh_len to utils netfilter: bridge: move pskb_trim_rcsum out of br_nf_check_hbh_len netfilter: bridge: check len before accessing more nh data netfilter: bridge: call pskb_may_pull in br_nf_check_hbh_len netfilter: bridge: introduce broute meta statement ==================== Link: https://lore.kernel.org/r/20230308193033.13965-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09neighbour: delete neigh_lookup_nodev as not usedLeon Romanovsky1-31/+0
neigh_lookup_nodev isn't used in the kernel after removal of DECnet. So let's remove it. Fixes: 1202cdd66531 ("Remove DECnet support from kernel") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/eb5656200d7964b2d177a36b77efa3c597d6d72d.1678267343.git.leonro@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09net: sched: remove qdisc_watchdog->last_expiresEric Dumazet1-2/+4
This field mirrors hrtimer softexpires, we can instead use the existing helpers. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230308182648.1150762-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09netlink: remove unused 'compare' functionFlorian Westphal2-3/+0
No users in the tree. Tested with allmodconfig build. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230308142006.20879-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09mctp: remove MODULE_LICENSE in non-modulesNick Alcock1-1/+0
Since commit 8b41fc4454e ("kbuild: create modules.builtin without Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations are used to identify modules. As a consequence, uses of the macro in non-modules will cause modprobe to misidentify their containing object file as a module when it is not (false positives), and modprobe might succeed rather than failing with a suitable error message. So remove it in the files in this commit, none of which can be built as modules. Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Suggested-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com> Cc: Jeremy Kerr <jk@codeconstruct.com.au> Cc: Matt Johnston <matt@codeconstruct.com.au> Link: https://lore.kernel.org/r/20230308121230.5354-2-nick.alcock@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski32-105/+202
Documentation/bpf/bpf_devel_QA.rst b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info") d56b0c461d19 ("bpf, docs: Fix link to netdev-FAQ target") https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09Merge tag 'net-6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds28-77/+136
Pull networking fixes from Paolo Abeni: "Including fixes from netfilter and bpf. Current release - regressions: - core: avoid skb end_offset change in __skb_unclone_keeptruesize() - sched: - act_connmark: handle errno on tcf_idr_check_alloc - flower: fix fl_change() error recovery path - ieee802154: prevent user from crashing the host Current release - new code bugs: - eth: bnxt_en: fix the double free during device removal - tools: ynl: - fix enum-as-flags in the generic CLI - fully inherit attrs in subsets - re-license uniformly under GPL-2.0 or BSD-3-clause Previous releases - regressions: - core: use indirect calls helpers for sk_exit_memory_pressure() - tls: - fix return value for async crypto - avoid hanging tasks on the tx_lock - eth: ice: copy last block omitted in ice_get_module_eeprom() Previous releases - always broken: - core: avoid double iput when sock_alloc_file fails - af_unix: fix struct pid leaks in OOB support - tls: - fix possible race condition - fix device-offloaded sendpage straddling records - bpf: - sockmap: fix an infinite loop error - test_run: fix &xdp_frame misplacement for LIVE_FRAMES - fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR - netfilter: tproxy: fix deadlock due to missing BH disable - phylib: get rid of unnecessary locking - eth: bgmac: fix *initial* chip reset to support BCM5358 - eth: nfp: fix csum for ipsec offload - eth: mtk_eth_soc: fix RX data corruption issue Misc: - usb: qmi_wwan: add telit 0x1080 composition" * tag 'net-6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits) tools: ynl: fix enum-as-flags in the generic CLI tools: ynl: move the enum classes to shared code net: avoid double iput when sock_alloc_file fails af_unix: fix struct pid leaks in OOB support eth: fealnx: bring back this old driver net: dsa: mt7530: permit port 5 to work without port 6 on MT7621 SoC net: microchip: sparx5: fix deletion of existing DSCP mappings octeontx2-af: Unlock contexts in the queue context cache in case of fault detection net/smc: fix fallback failed while sendmsg with fastopen ynl: re-license uniformly under GPL-2.0 OR BSD-3-Clause mailmap: update entries for Stephen Hemminger mailmap: add entry for Maxim Mikityanskiy nfc: change order inside nfc_se_io error path ethernet: ice: avoid gcc-9 integer overflow warning ice: don't ignore return codes in VSI related code ice: Fix DSCP PFC TLV creation net: usb: qmi_wwan: add Telit 0x1080 composition net: usb: cdc_mbim: avoid altsetting toggling for Telit FE990 netfilter: conntrack: adopt safer max chain length net: tls: fix device-offloaded sendpage straddling records ...
2023-03-09sctp: add weighted fair queueing stream schedulerXin Long2-4/+47
As it says in rfc8260#section-3.6 about the weighted fair queueing scheduler: A Weighted Fair Queueing scheduler between the streams is used. The weight is configurable per outgoing SCTP stream. This scheduler considers the lengths of the messages of each stream and schedules them in a specific way to use the capacity according to the given weights. If the weight of stream S1 is n times the weight of stream S2, the scheduler should assign to stream S1 n times the capacity it assigns to stream S2. The details are implementation dependent. Interleaving user messages allows for a better realization of the capacity usage according to the given weights. This patch adds Weighted Fair Queueing Scheduler actually based on the code of Fair Capacity Scheduler by adding fc_weight into struct sctp_stream_out_ext and taking it into account when sorting stream-> fc_list in sctp_sched_fc_sched() and sctp_sched_fc_dequeue_done(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-09sctp: add fair capacity stream schedulerXin Long3-1/+186
As it says in rfc8260#section-3.5 about the fair capacity scheduler: A fair capacity distribution between the streams is used. This scheduler considers the lengths of the messages of each stream and schedules them in a specific way to maintain an equal capacity for all streams. The details are implementation dependent. interleaving user messages allows for a better realization of the fair capacity usage. This patch adds Fair Capacity Scheduler based on the foundations added by commit 5bbbbe32a431 ("sctp: introduce stream scheduler foundations"): A fc_list and a fc_length are added into struct sctp_stream_out_ext and a fc_list is added into struct sctp_stream. In .enqueue, when there are chunks enqueued into a stream, this stream will be linked into stream-> fc_list by its fc_list ordered by its fc_length. In .dequeue, it always picks up the 1st skb from stream->fc_list. In .dequeue_done, fc_length is increased by chunk's len and update its location in stream->fc_list according to the its new fc_length. Note that when the new fc_length overflows in .dequeue_done, instead of resetting all fc_lengths to 0, we only reduced them by U32_MAX / 4 to avoid a moment of imbalance in the scheduling, as Marcelo suggested. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-08net: avoid double iput when sock_alloc_file failsThadeu Lima de Souza Cascardo1-7/+4
When sock_alloc_file fails to allocate a file, it will call sock_release. __sys_socket_file should then not call sock_release again, otherwise there will be a double free. [ 89.319884] ------------[ cut here ]------------ [ 89.320286] kernel BUG at fs/inode.c:1764! [ 89.320656] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 89.321051] CPU: 7 PID: 125 Comm: iou-sqp-124 Not tainted 6.2.0+ #361 [ 89.321535] RIP: 0010:iput+0x1ff/0x240 [ 89.321808] Code: d1 83 e1 03 48 83 f9 02 75 09 48 81 fa 00 10 00 00 77 05 83 e2 01 75 1f 4c 89 ef e8 fb d2 ba 00 e9 80 fe ff ff c3 cc cc cc cc <0f> 0b 0f 0b e9 d0 fe ff ff 0f 0b eb 8d 49 8d b4 24 08 01 00 00 48 [ 89.322760] RSP: 0018:ffffbdd60068bd50 EFLAGS: 00010202 [ 89.323036] RAX: 0000000000000000 RBX: ffff9d7ad3cacac0 RCX: 0000000000001107 [ 89.323412] RDX: 000000000003af00 RSI: 0000000000000000 RDI: ffff9d7ad3cacb40 [ 89.323785] RBP: ffffbdd60068bd68 R08: ffffffffffffffff R09: ffffffffab606438 [ 89.324157] R10: ffffffffacb3dfa0 R11: 6465686361657256 R12: ffff9d7ad3cacb40 [ 89.324529] R13: 0000000080000001 R14: 0000000080000001 R15: 0000000000000002 [ 89.324904] FS: 00007f7b28516740(0000) GS:ffff9d7aeb1c0000(0000) knlGS:0000000000000000 [ 89.325328] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 89.325629] CR2: 00007f0af52e96c0 CR3: 0000000002a02006 CR4: 0000000000770ee0 [ 89.326004] PKRU: 55555554 [ 89.326161] Call Trace: [ 89.326298] <TASK> [ 89.326419] __sock_release+0xb5/0xc0 [ 89.326632] __sys_socket_file+0xb2/0xd0 [ 89.326844] io_socket+0x88/0x100 [ 89.327039] ? io_issue_sqe+0x6a/0x430 [ 89.327258] io_issue_sqe+0x67/0x430 [ 89.327450] io_submit_sqes+0x1fe/0x670 [ 89.327661] io_sq_thread+0x2e6/0x530 [ 89.327859] ? __pfx_autoremove_wake_function+0x10/0x10 [ 89.328145] ? __pfx_io_sq_thread+0x10/0x10 [ 89.328367] ret_from_fork+0x29/0x50 [ 89.328576] RIP: 0033:0x0 [ 89.328732] Code: Unable to access opcode bytes at 0xffffffffffffffd6. [ 89.329073] RSP: 002b:0000000000000000 EFLAGS: 00000202 ORIG_RAX: 00000000000001a9 [ 89.329477] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f7b28637a3d [ 89.329845] RDX: 00007fff4e4318a8 RSI: 00007fff4e4318b0 RDI: 0000000000000400 [ 89.330216] RBP: 00007fff4e431830 R08: 00007fff4e431711 R09: 00007fff4e4318b0 [ 89.330584] R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff4e441b38 [ 89.330950] R13: 0000563835e3e725 R14: 0000563835e40d10 R15: 00007f7b28784040 [ 89.331318] </TASK> [ 89.331441] Modules linked in: [ 89.331617] ---[ end trace 0000000000000000 ]--- Fixes: da214a475f8b ("net: add __sys_socket_file()") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230307173707.468744-1-cascardo@canonical.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-08af_unix: fix struct pid leaks in OOB supportEric Dumazet1-2/+8
syzbot reported struct pid leak [1]. Issue is that queue_oob() calls maybe_add_creds() which potentially holds a reference on a pid. But skb->destructor is not set (either directly or by calling unix_scm_to_skb()) This means that subsequent kfree_skb() or consume_skb() would leak this reference. In this fix, I chose to fully support scm even for the OOB message. [1] BUG: memory leak unreferenced object 0xffff8881053e7f80 (size 128): comm "syz-executor242", pid 5066, jiffies 4294946079 (age 13.220s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff812ae26a>] alloc_pid+0x6a/0x560 kernel/pid.c:180 [<ffffffff812718df>] copy_process+0x169f/0x26c0 kernel/fork.c:2285 [<ffffffff81272b37>] kernel_clone+0xf7/0x610 kernel/fork.c:2684 [<ffffffff812730cc>] __do_sys_clone+0x7c/0xb0 kernel/fork.c:2825 [<ffffffff849ad699>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff849ad699>] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 [<ffffffff84a0008b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: 314001f0bf92 ("af_unix: Add OOB support") Reported-by: syzbot+7699d9e5635c10253a27@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Rao Shoaib <rao.shoaib@oracle.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230307164530.771896-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-08Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski3-0/+34
Andrii Nakryiko says: ==================== pull-request: bpf-next 2023-03-08 We've added 23 non-merge commits during the last 2 day(s) which contain a total of 28 files changed, 414 insertions(+), 104 deletions(-). The main changes are: 1) Add more precise memory usage reporting for all BPF map types, from Yafang Shao. 2) Add ARM32 USDT support to libbpf, from Puranjay Mohan. 3) Fix BTF_ID_LIST size causing problems in !CONFIG_DEBUG_INFO_BTF, from Nathan Chancellor. 4) IMA selftests fix, from Roberto Sassu. 5) libbpf fix in APK support code, from Daniel Müller. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits) selftests/bpf: Fix IMA test libbpf: USDT arm arg parsing support libbpf: Refactor parse_usdt_arg() to re-use code libbpf: Fix theoretical u32 underflow in find_cd() function bpf: enforce all maps having memory usage callback bpf: offload map memory usage bpf, net: xskmap memory usage bpf, net: sock_map memory usage bpf, net: bpf_local_storage memory usage bpf: local_storage memory usage bpf: bpf_struct_ops memory usage bpf: queue_stack_maps memory usage bpf: devmap memory usage bpf: cpumap memory usage bpf: bloom_filter memory usage bpf: ringbuf memory usage bpf: reuseport_array memory usage bpf: stackmap memory usage bpf: arraymap memory usage bpf: hashtab memory usage ... ==================== Link: https://lore.kernel.org/r/20230308193533.1671597-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-08netfilter: nat: fix indentation of function argumentsJeremy Sowden1-2/+2
A couple of arguments to a function call are incorrectly indented. Fix them. Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08netfilter: conntrack: fix typoJeremy Sowden1-1/+1
There's a spelling mistake in a comment. Fix it. Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08netfilter: use nf_ip6_check_hbh_len in nf_ct_skb_network_trimXin Long1-2/+9
For IPv6 Jumbo packets, the ipv6_hdr(skb)->payload_len is always 0, and its real payload_len ( > 65535) is saved in hbh exthdr. With 0 length for the jumbo packets, all data and exthdr will be trimmed in nf_ct_skb_network_trim(). This patch is to call nf_ip6_check_hbh_len() to get real pkt_len of the IPv6 packet, similar to br_validate_ipv6(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08netfilter: move br_nf_check_hbh_len to utilsXin Long2-54/+53
Rename br_nf_check_hbh_len() to nf_ip6_check_hbh_len() and move it to netfilter utils, so that it can be used by other modules, like ovs and tc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08netfilter: bridge: move pskb_trim_rcsum out of br_nf_check_hbh_lenXin Long1-19/+14
br_nf_check_hbh_len() is a function to check the Hop-by-hop option header, and shouldn't do pskb_trim_rcsum() there. This patch is to pass pkt_len out to br_validate_ipv6() and do pskb_trim_rcsum() after calling br_validate_ipv6() instead. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08netfilter: bridge: check len before accessing more nh dataXin Long1-25/+20
In the while loop of br_nf_check_hbh_len(), similar to ip6_parse_tlv(), before accessing 'nh[off + 1]', it should add a check 'len < 2'; and before parsing IPV6_TLV_JUMBO, it should add a check 'optlen > len', in case of overflows. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08netfilter: bridge: call pskb_may_pull in br_nf_check_hbh_lenXin Long1-5/+9
When checking Hop-by-hop option header, if the option data is in nonlinear area, it should do pskb_may_pull instead of discarding the skb as a bad IPv6 packet. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08net: reclaim skb->scm_io_uring bitEric Dumazet2-1/+7
Commit 0091bfc81741 ("io_uring/af_unix: defer registered files gc to io_uring release") added one bit to struct sk_buff. This structure is critical for networking, and we try very hard to not add bloat on it, unless absolutely required. For instance, we can use a specific destructor as a wrapper around unix_destruct_scm(), to identify skbs that unix_gc() has to special case. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-08netfilter: bridge: introduce broute meta statementSriram Yagnaraman1-3/+68
nftables equivalent for ebtables -t broute. Implement broute meta statement to set br_netfilter_broute flag in skb to force a packet to be routed instead of being bridged. Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech> Signed-off-by: Florian Westphal <fw@strlen.de>