path: root/include
Age | Commit message | Author | Files | Lines
2021-04-01 | include: net: Remove repeated struct declaration | Wan Jiabing | 1 | -1/+0
struct ctl_table_header is declared twice. One declaration is on line 46; the one below it is not needed. Remove the duplicate. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 | ipv6: move ip6_dst_ops first in netns_ipv6 | Eric Dumazet | 1 | -1/+3
ip6_dst_ops has cache line alignment. Moving it to the beginning of netns_ipv6 removes a 48-byte hole and shrinks netns_ipv6 from 12 to 11 cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
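As a rough sketch of what such a reordering looks like (only ip6_dst_ops is named in the entry above; the other members are assumed for illustration, not copied from the patch):

  struct netns_ipv6 {
          /* moved first: this member is cacheline-aligned, so placing it
           * at offset 0 satisfies the alignment without creating a hole
           * in the middle of the structure
           */
          struct dst_ops ip6_dst_ops;

          struct netns_sysctl_ipv6 sysctl;
          struct ipv6_devconf *devconf_all;
          /* ... remaining members unchanged ... */
  };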
2021-03-31 | ipv6: convert eligible sysctls to u8 | Eric Dumazet | 1 | -12/+12
Convert most sysctls that can fit in a byte. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 | tcp: convert tcp_comp_sack_nr sysctl to u8 | Eric Dumazet | 1 | -1/+1
The tcp_comp_sack_nr max value was already 255. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 | ipv4: convert igmp_link_local_mcast_reports sysctl to u8 | Eric Dumazet | 1 | -1/+1
This sysctl is a bool, so it can use less storage. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 | ipv4: convert fib_multipath_{use_neigh|hash_policy} sysctls to u8 | Eric Dumazet | 1 | -2/+2
Make room for better packing of netns_ipv4. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 | ipv4: convert udp_l3mdev_accept sysctl to u8 | Eric Dumazet | 1 | -1/+1
Reduce footprint of sysctls. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 | ipv4: convert fib_notify_on_flag_change sysctl to u8 | Eric Dumazet | 1 | -1/+1
Reduce footprint of sysctls. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 | inet: shrink netns_ipv4 by another cache line | Eric Dumazet | 1 | -3/+3
By shuffling around some fields to remove 8 bytes of hole, we can save one cache line. pahole result before/after the patch:

Before:
  /* size: 768, cachelines: 12, members: 139 */
  /* sum members: 673, holes: 11, sum holes: 39 */
  /* padding: 56 */
  /* paddings: 2, sum paddings: 7 */
  /* forced alignments: 1 */

After:
  /* size: 704, cachelines: 11, members: 139 */
  /* sum members: 673, holes: 10, sum holes: 31 */
  /* paddings: 2, sum paddings: 7 */
  /* forced alignments: 1 */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 | inet: shrink inet_timewait_death_row by 48 bytes | Eric Dumazet | 1 | -2/+5
struct inet_timewait_death_row uses two cache lines, because we want tw_count to use a full cache line to avoid false sharing. Rework its definition and placement in netns_ipv4 so that:
1) We add 60 bytes of padding after tw_count to avoid false sharing, knowing that tcp_death_row will have the ____cacheline_aligned_in_smp attribute.
2) We do not risk padding before tcp_death_row, because we move it to the beginning of netns_ipv4, even if new fields are added later.
3) We do not waste 48 bytes of padding after it.
Note that I have not changed dccp. pahole result for struct netns_ipv4 before/after the patch:

Before:
  /* size: 832, cachelines: 13, members: 139 */
  /* sum members: 721, holes: 12, sum holes: 95 */
  /* padding: 16 */
  /* paddings: 2, sum paddings: 55 */

After:
  /* size: 768, cachelines: 12, members: 139 */
  /* sum members: 673, holes: 11, sum holes: 39 */
  /* padding: 56 */
  /* paddings: 2, sum paddings: 7 */
  /* forced alignments: 1 */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
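A minimal sketch of the padding scheme from point 1 above (assuming 64-byte L1 cache lines; field names follow the surrounding text, not the verbatim patch):

  struct inet_timewait_death_row {
          atomic_t tw_count;      /* hot: updated on every timewait create/destroy */

          /* explicit padding keeps the read-mostly fields below off
           * tw_count's cache line, avoiding false sharing without a
           * trailing ____cacheline_aligned_in_smp hole after the struct
           */
          char tw_pad[L1_CACHE_BYTES - sizeof(atomic_t)];

          struct inet_hashinfo *hashinfo;
          int sysctl_max_tw_buckets;
  };

With L1_CACHE_BYTES == 64 the pad is exactly the 60 bytes mentioned above.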
2021-03-31 | ethtool: support FEC settings over netlink | Jakub Kicinski | 1 | -0/+17
Add a FEC API to netlink. This is not a 1-to-1 conversion. FEC settings already depend on link modes to tell the user which modes are supported. Take this further and use link modes for manual configuration. The old struct ethtool_fecparam is still used to talk to the drivers, so we need to translate back and forth; we can revisit the internal API if the number of FEC encodings starts to grow. Enforce only one active FEC bit (by using a bit position rather than another mask). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 | vxlan: allow L4 GRO passthrough | Paolo Abeni | 1 | -0/+6
When passing up a UDP GSO packet with L4 aggregation, there is no need to segment it at the vxlan level. We can propagate the packet untouched and let it be segmented later, if needed. Introduce a helper to let the UDP socket accept any L4 aggregation, and use it in the vxlan driver. v1 -> v2: - updated to use the newly introduced UDP socket 'accept*' fields Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
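A sketch of what such a helper could look like, assuming the accept_udp_l4/accept_udp_fraglist bits described in the next entry (illustrative, not guaranteed verbatim):

  /* opt a tunnel socket in to receiving any UDP L4 aggregation,
   * so GRO'd packets can be propagated without segmentation
   */
  static inline void udp_allow_gso(struct sock *sk)
  {
          udp_sk(sk)->accept_udp_l4 = 1;
          udp_sk(sk)->accept_udp_fraglist = 1;
  }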
2021-03-30 | udp: never accept GSO_FRAGLIST packets | Paolo Abeni | 1 | -3/+13
Currently the UDP protocol delivers GSO_FRAGLIST packets to sockets without the expected segmentation. This change addresses the issue by introducing and maintaining a couple of new fields to explicitly accept SKB_GSO_UDP_L4 or GSO_FRAGLIST packets, and updates udp_unexpected_gso() accordingly. UDP sockets enabling UDP_GRO still keep accept_udp_fraglist zeroed. v1 -> v2: - use 2 bits instead of a whole GSO bitmask (Willem) Fixes: 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
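A sketch of the check the two new bits enable in udp_unexpected_gso() (field names as described above; illustrative):

  static inline bool udp_unexpected_gso(struct sock *sk, struct sk_buff *skb)
  {
          if (!skb_is_gso(skb))
                  return false;

          /* a GSO packet is "unexpected" if the socket has not opted
           * in to that particular kind of aggregation
           */
          if ((skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) &&
              !udp_sk(sk)->accept_udp_l4)
                  return true;

          if ((skb_shinfo(skb)->gso_type & SKB_GSO_FRAGLIST) &&
              !udp_sk(sk)->accept_udp_fraglist)
                  return true;

          return false;
  }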
2021-03-30 | udp: fixup csum for GSO receive slow path | Paolo Abeni | 1 | -0/+23
When UDP packets generated locally by a socket with UDP_SEGMENT traverse the following path:
  UDP tunnel (xmit) -> veth (segmentation) -> veth (gro) -> UDP tunnel (rx) -> UDP socket (no UDP_GRO)
ip_summed will be set to CHECKSUM_PARTIAL at creation time, and that checksum mode will be preserved along the above path up to the UDP tunnel receive code, where we have:
  __iptunnel_pull_header() -> skb_pull_rcsum() -> skb_postpull_rcsum() -> __skb_postpull_rcsum()
The latter will convert the skb to CHECKSUM_NONE. The UDP GSO packet will later be segmented as part of the rx socket receive operation, and will present CHECKSUM_NONE after segmentation. Additionally, the segmented packets' UDP CB still refers to the original GSO packet length. Overall this causes unexpected/wrong csum validation errors later in the UDP receive path. We could possibly address the issue with some additional checks and csum mangling in the UDP tunnel code, but since the issue affects only this UDP receive slow path, let's set a suitable csum status there instead. Note that SKB_GSO_UDP_L4 or SKB_GSO_FRAGLIST packets lacking UDP encapsulation present a valid checksum when landing in udp_queue_rcv_skb(), as the UDP checksum has been validated by the GRO engine. v2 -> v3: - even more verbose commit message and comments v1 -> v2: - restrict the csum update to the packets strictly needing it - hopefully clarify the commit message and code comments Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 | ipv6: add ipv6_dev_find to stubs | Andreas Roeseler | 1 | -0/+2
Add ipv6_dev_find to ipv6_stub to allow lookup of net_devices by IPV6 address in net/ipv4/icmp.c. Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
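A usage sketch from the IPv4 side; the stub indirection avoids a hard dependency on the ipv6 module. The NULL device argument (no device hint, search all interfaces) is an assumption based on the usual convention:

  /* in net/ipv4/icmp.c-style code: */
  struct net_device *dev;

  dev = ipv6_stub->ipv6_dev_find(net, &ipv6_addr, NULL);
  if (!dev)
          return;         /* no interface owns this address */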
2021-03-30 | net: add sysctl for enabling RFC 8335 PROBE messages | Andreas Roeseler | 1 | -0/+1
Section 8 of RFC 8335 specifies potential security concerns of responding to PROBE requests, and states that nodes supporting PROBE functionality MUST be able to enable/disable responses, and that responses MUST be disabled by default. Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 | ICMPV6: add support for RFC 8335 PROBE | Andreas Roeseler | 1 | -0/+3
Add definitions for the ICMPV6 type of Extended Echo Request and Extended Echo Reply, as defined by sections 2 and 3 of RFC 8335. Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 | icmp: add support for RFC 8335 PROBE | Andreas Roeseler | 1 | -0/+42
Add definitions for PROBE ICMP types and codes. Add AFI definitions for IP and IPV6 as specified by IANA. Add a struct to represent the additional header when probing by IP address (ctype == 3), for use in parsing incoming PROBE messages. Add a struct to represent the entire Interface Identification Object (IIO) section of an incoming PROBE packet. Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
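A sketch of the two structures, consistent with the RFC 8335 layout described above (illustrative; see the RFC for the normative format):

  /* ctype == 3: probing by IP address */
  struct icmp_ext_echo_ctype3_hdr {
          __be16 afi;             /* IANA address family identifier */
          __u8   addrlen;
          __u8   reserved;
  };

  /* the Interface Identification Object of a PROBE message */
  struct icmp_ext_echo_iio {
          struct icmp_extobj_hdr extobj_hdr;
          union {
                  char    name[IFNAMSIZ];         /* ctype == 1: by name */
                  __be32  ifindex;                /* ctype == 2: by index */
                  struct {
                          struct icmp_ext_echo_ctype3_hdr ctype3_hdr;
                          union {
                                  __be32          ipv4_addr;
                                  struct in6_addr ipv6_addr;
                          } ip_addr;
                  } addr;                         /* ctype == 3: by address */
          } ident;
  };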
2021-03-30 | can: bittiming: add CAN_KBPS, CAN_MBPS and CAN_MHZ macros | Vincent Mailhol | 1 | -0/+8
Add three macros to simplify the readability of big bit timing numbers:
- CAN_KBPS: kilobits per second (one thousand)
- CAN_MBPS: megabits per second (one million)
- CAN_MHZ: megahertz (one million)
Example:
  u32 bitrate_max = 8 * CAN_MBPS;
  struct can_clock clock = {.freq = 80 * CAN_MHZ};
instead of:
  u32 bitrate_max = 8000000;
  struct can_clock clock = {.freq = 80000000};
Apply the new macros to drivers/net/can/dev/bittiming.c.
Link: https://lore.kernel.org/r/20210306054040.76483-1-mailhol.vincent@wanadoo.fr
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
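The definitions themselves are presumably just scaling constants, along the lines of:

  #define CAN_KBPS 1000UL         /* kilobits per second */
  #define CAN_MBPS 1000000UL      /* megabits per second */
  #define CAN_MHZ  1000000UL      /* megahertz */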
2021-03-30 | can: bittiming: add calculation for CAN FD Transmitter Delay Compensation (TDC) | Vincent Mailhol | 1 | -0/+6
The logic for the tdco calculation is to just reuse the normal sample point: tdco = sp. Because the sample point is expressed in tenths of a percent and the tdco is expressed in time quanta, a conversion is needed. At the end, ssp = tdcv + tdco = tdcv + sp. Another popular method is to set tdco to the middle of the bit: tdc->tdco = can_bit_time(dbt) / 2. During benchmark tests, we could not find a clear advantage for either of the two methods. The tdco calculation is triggered each time the data_bittiming is changed, so that users relying on automated calculation can use the netlink interface the exact same way without the need for new parameters. For example, a command such as:
  ip link set canX type can bitrate 500000 dbitrate 4000000 fd on
would trigger the calculation. A user of CONFIG_CAN_CALC_BITTIMING who does not want the automated calculation needs to manually set tdco to zero. For example:
  ip link set canX type can tdco 0 bitrate 500000 dbitrate 4000000 fd on
(if the tdco parameter is provided in a previous command, it will be overwritten). If tdcv is set to zero (the default), it is automatically calculated by the transceiver for each frame. As such, there is no code in the kernel to calculate it. tdcf has no automated calculation function because we could not figure out a formula for this parameter.
Link: https://lore.kernel.org/r/20210224002008.4158-6-mailhol.vincent@wanadoo.fr
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
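A sketch of the sample-point-based calculation described above (can_bit_time() and the field names are taken from the surrounding entries; treat as illustrative, not the verbatim patch):

  static void can_calc_tdco(struct net_device *dev)
  {
          struct can_priv *priv = netdev_priv(dev);
          const struct can_bittiming *dbt = &priv->data_bittiming;
          const struct can_tdc_const *tdc_const = priv->tdc_const;

          if (!tdc_const)
                  return;         /* controller has no TDC support */

          /* sample_point is in tenths of a percent; scale the data bit
           * time (in time quanta) by sample_point / 1000 to convert,
           * so that ssp = tdcv + tdco = tdcv + sp
           */
          priv->tdc.tdco = min(can_bit_time(dbt) * dbt->sample_point / 1000,
                               tdc_const->tdco_max);
  }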
2021-03-30 | can: dev: reorder struct can_priv members for better packing | Vincent Mailhol | 1 | -6/+7
Save eight bytes of holes on x86-64 architectures by reordering struct can_priv members.

Before:
  $ pahole -C can_priv drivers/net/can/dev/dev.o
  struct can_priv {
          struct net_device *dev;                                  /*   0    8 */
          struct can_device_stats can_stats;                       /*   8   24 */
          struct can_bittiming bittiming;                          /*  32   32 */
          /* --- cacheline 1 boundary (64 bytes) --- */
          struct can_bittiming data_bittiming;                     /*  64   32 */
          const struct can_bittiming_const *bittiming_const;       /*  96    8 */
          const struct can_bittiming_const *data_bittiming_const;  /* 104    8 */
          struct can_tdc tdc;                                      /* 112   12 */
          /* XXX 4 bytes hole, try to pack */
          /* --- cacheline 2 boundary (128 bytes) --- */
          const struct can_tdc_const *tdc_const;                   /* 128    8 */
          const u16 *termination_const;                            /* 136    8 */
          unsigned int termination_const_cnt;                      /* 144    4 */
          u16 termination;                                         /* 148    2 */
          /* XXX 2 bytes hole, try to pack */
          const u32 *bitrate_const;                                /* 152    8 */
          unsigned int bitrate_const_cnt;                          /* 160    4 */
          /* XXX 4 bytes hole, try to pack */
          const u32 *data_bitrate_const;                           /* 168    8 */
          unsigned int data_bitrate_const_cnt;                     /* 176    4 */
          u32 bitrate_max;                                         /* 180    4 */
          struct can_clock clock;                                  /* 184    4 */
          enum can_state state;                                    /* 188    4 */
          /* --- cacheline 3 boundary (192 bytes) --- */
          u32 ctrlmode;                                            /* 192    4 */
          u32 ctrlmode_supported;                                  /* 196    4 */
          u32 ctrlmode_static;                                     /* 200    4 */
          int restart_ms;                                          /* 204    4 */
          struct delayed_work restart_work;                        /* 208  168 */
          /* XXX last struct has 4 bytes of padding */
          /* --- cacheline 5 boundary (320 bytes) was 56 bytes ago --- */
          int (*do_set_bittiming)(struct net_device *);            /* 376    8 */
          /* --- cacheline 6 boundary (384 bytes) --- */
          int (*do_set_data_bittiming)(struct net_device *);       /* 384    8 */
          int (*do_set_mode)(struct net_device *, enum can_mode);  /* 392    8 */
          int (*do_set_termination)(struct net_device *, u16);     /* 400    8 */
          int (*do_get_state)(const struct net_device *, enum can_state *); /* 408    8 */
          int (*do_get_berr_counter)(const struct net_device *, struct can_berr_counter *); /* 416    8 */
          unsigned int echo_skb_max;                               /* 424    4 */
          /* XXX 4 bytes hole, try to pack */
          struct sk_buff **echo_skb;                               /* 432    8 */

          /* size: 440, cachelines: 7, members: 31 */
          /* sum members: 426, holes: 4, sum holes: 14 */
          /* paddings: 1, sum paddings: 4 */
          /* last cacheline: 56 bytes */
  };

After:
  $ pahole -C can_priv drivers/net/can/dev/dev.o
  struct can_priv {
          struct net_device *dev;                                  /*   0    8 */
          struct can_device_stats can_stats;                       /*   8   24 */
          const struct can_bittiming_const *bittiming_const;       /*  32    8 */
          const struct can_bittiming_const *data_bittiming_const;  /*  40    8 */
          struct can_bittiming bittiming;                          /*  48   32 */
          /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
          struct can_bittiming data_bittiming;                     /*  80   32 */
          const struct can_tdc_const *tdc_const;                   /* 112    8 */
          struct can_tdc tdc;                                      /* 120   12 */
          /* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */
          unsigned int bitrate_const_cnt;                          /* 132    4 */
          const u32 *bitrate_const;                                /* 136    8 */
          const u32 *data_bitrate_const;                           /* 144    8 */
          unsigned int data_bitrate_const_cnt;                     /* 152    4 */
          u32 bitrate_max;                                         /* 156    4 */
          struct can_clock clock;                                  /* 160    4 */
          unsigned int termination_const_cnt;                      /* 164    4 */
          const u16 *termination_const;                            /* 168    8 */
          u16 termination;                                         /* 176    2 */
          /* XXX 2 bytes hole, try to pack */
          enum can_state state;                                    /* 180    4 */
          u32 ctrlmode;                                            /* 184    4 */
          u32 ctrlmode_supported;                                  /* 188    4 */
          /* --- cacheline 3 boundary (192 bytes) --- */
          u32 ctrlmode_static;                                     /* 192    4 */
          int restart_ms;                                          /* 196    4 */
          struct delayed_work restart_work;                        /* 200  168 */
          /* XXX last struct has 4 bytes of padding */
          /* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */
          int (*do_set_bittiming)(struct net_device *);            /* 368    8 */
          int (*do_set_data_bittiming)(struct net_device *);       /* 376    8 */
          /* --- cacheline 6 boundary (384 bytes) --- */
          int (*do_set_mode)(struct net_device *, enum can_mode);  /* 384    8 */
          int (*do_set_termination)(struct net_device *, u16);     /* 392    8 */
          int (*do_get_state)(const struct net_device *, enum can_state *); /* 400    8 */
          int (*do_get_berr_counter)(const struct net_device *, struct can_berr_counter *); /* 408    8 */
          unsigned int echo_skb_max;                               /* 416    4 */
          /* XXX 4 bytes hole, try to pack */
          struct sk_buff **echo_skb;                               /* 424    8 */

          /* size: 432, cachelines: 7, members: 31 */
          /* sum members: 426, holes: 2, sum holes: 6 */
          /* paddings: 1, sum paddings: 4 */
          /* last cacheline: 48 bytes */
  };

Link: https://lore.kernel.org/r/20210224002008.4158-3-mailhol.vincent@wanadoo.fr
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-30 | can: add new CAN FD bittiming parameters: Transmitter Delay Compensation (TDC) | Vincent Mailhol | 2 | -0/+68
At high bit rates, the propagation delay from the TX pin to the RX pin of the transceiver causes measurement errors: the sample point on the RX pin might occur on the previous bit. This issue is addressed in ISO 11898-1 section 11.3.3 "Transmitter delay compensation" (TDC). This patch adds two new structures, can_tdc and can_tdc_const, in order to implement this TDC. The structures are then added to can_priv. A controller supports TDC if and only if can_priv::tdc_const is not NULL. TDC is active if and only if:
- the fd flag is on
- can_priv::tdc.tdco is not zero
It is the driver's responsibility to check that those two conditions are met. No new controller modes are introduced (i.e. no CAN_CTRL_MODE_TDC) in order not to be redundant with the above logic. The names of the parameters are chosen to match existing CAN controller specifications. References:
- Bosch C_CAN FD8: https://www.bosch-semiconductors.com/media/ip_modules/pdf_2/c_can_fd8/users_manual_c_can_fd8_r210_1.pdf
- Microchip CAN FD Controller Module: http://ww1.microchip.com/downloads/en/DeviceDoc/MCP251XXFD-CAN-FD-Controller-Module-Family-Reference-Manual-20005678B.pdf
- SAM E701/S70/V70/V71 Family: https://www.mouser.com/datasheet/2/268/60001527A-1284321.pdf
Link: https://lore.kernel.org/r/20210224002008.4158-2-mailhol.vincent@wanadoo.fr
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
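A sketch of the two new structures (TDCV = value, TDCO = offset, TDCF = filter window; illustrative field layout):

  struct can_tdc {
          u32 tdcv;       /* Transmitter Delay Compensation Value */
          u32 tdco;       /* Transmitter Delay Compensation Offset */
          u32 tdcf;       /* Transmitter Delay Compensation Filter window */
  };

  /* controller limits; a non-NULL can_priv::tdc_const means TDC support */
  struct can_tdc_const {
          u32 tdcv_max;
          u32 tdco_max;
          u32 tdcf_max;
  };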
2021-03-30 | can: dev: can_free_echo_skb(): extend to return can frame length | Marc Kleine-Budde | 1 | -1/+2
In order to implement byte queue limits (bql) in CAN drivers, the length of the CAN frame needs to be passed into the networking stack even if the transmission failed for some reason. To avoid calculating this length twice, extend can_free_echo_skb() to return that value. Convert all users of this function, too. This patch is the natural extension of commit: | 9420e1d495e2 ("can: dev: can_get_echo_skb(): extend to return can | frame length") Link: https://lore.kernel.org/r/20210319142700.305648-3-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
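The extended signature would look roughly like this (mirroring the can_get_echo_skb() change the entry cites; the parameter name is an assumption):

  /* if frame_len_ptr is non-NULL, it receives the CAN frame length so
   * the caller can do byte queue limit accounting even when the
   * transmission failed
   */
  void can_free_echo_skb(struct net_device *dev, unsigned int idx,
                         unsigned int *frame_len_ptr);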
2021-03-28 | nexthop: Rename artifacts related to legacy multipath nexthop groups | Petr Machata | 1 | -2/+2
After resilient next-hop groups were added recently, there are two types of multipath next-hop groups: the legacy "mpath" and the new "resilient". Calling the legacy next-hop group type "mpath" is unfortunate, because it describes the fact that a packet could be forwarded along one of several paths, which is also true of resilient next-hop groups. To make the naming clearer, rename various artifacts to reflect the assumptions made. As of this patch:
- The flag for multipath groups is nh_grp_entry::is_multipath. This includes the legacy and resilient groups, as well as any future group types that behave as multipath groups. Functions that assume this have "mpath" in the name.
- The flag for legacy multipath groups is nh_grp_entry::hash_threshold. Functions that assume this have "hthr" in the name.
- The flag for resilient groups is nh_grp_entry::resilient. Functions that assume this have "res" in the name.
Besides the above, struct nh_grp_entry::mpath was renamed to ::hthr as well. UAPI artifacts were obviously left intact.
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | mld: add mc_lock for protecting per-interface mld data | Taehee Yoo | 1 | -0/+1
The purpose of this lock is to avoid a bottleneck in the query/report event handler logic. After the previous patches, almost all mld data is protected by RTNL, so the query and report event handlers, which are data-path logic, acquire RTNL too. Therefore, if a lot of query and report events are received, RTNL is held for a long time, which creates a control-plane bottleneck. In order to avoid this, mc_lock is added. mc_lock protects only per-interface mld data, and per-interface mld data is all the query/report event handler logic uses, so rtnl_lock is no longer needed there and the bottleneck disappears. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | mld: add new workqueues for processing mld events | Taehee Yoo | 2 | -1/+11
When query/report packets are received, the mld module processes them. But they are processed in BH context, so it can't use sleepable functions. In order to switch context, two workqueues are added, which process query and report events respectively. In struct inet6_dev, mc_{query | report}_queue are added, so the queues are per-interface, and mc_{query | report}_work are the workqueue structures. When a query or report event is received, the skb is queued to the proper queue and the worker function is scheduled immediately. The queues are protected by spinlocks (mc_{query | report}_lock), and the worker functions are protected by RTNL. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | mld: convert ifmcaddr6 to RCU | Taehee Yoo | 1 | -3/+4
The ifmcaddr6 has been protected by inet6_dev->lock (an rwlock), so the critical section is atomic context. In order to switch this context, the locking has to change. The ifmcaddr6 is actually already protected by RTNL, so if it's converted to use RCU, its control-path context can be switched to sleepable. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
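A generic sketch of the rwlock-to-RCU pattern these mld patches apply (not the mcast.c code itself; process() and new are hypothetical placeholders, and the write side relies on RTNL for serialization as the entry describes):

  /* read side: lockless traversal under RCU instead of read_lock_bh() */
  rcu_read_lock();
  for (mc = rcu_dereference(idev->mc_list); mc;
       mc = rcu_dereference(mc->next))
          process(mc);            /* placeholder per-entry handler */
  rcu_read_unlock();

  /* write side: already serialized by RTNL, hence free to sleep */
  ASSERT_RTNL();
  old = rtnl_dereference(idev->mc_list);
  rcu_assign_pointer(idev->mc_list, new);
  synchronize_rcu();              /* wait out readers before freeing */
  kfree(old);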
2021-03-26 | mld: convert ip6_sf_list to RCU | Taehee Yoo | 1 | -3/+4
The ip6_sf_list has been protected by mca_lock (a spinlock), so the critical section is atomic context. In order to switch this context, the locking has to change. The ip6_sf_list is actually already protected by RTNL, so if it's converted to use RCU, its control-path context can be switched to sleepable. But this doesn't remove mca_lock yet, because ifmcaddr6 isn't converted to RCU yet; so it's not fully converted to the sleepable context. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | mld: convert ipv6_mc_socklist->sflist to RCU | Taehee Yoo | 1 | -2/+2
The sflist has been protected by an rwlock, so the critical section is atomic context. In order to switch this context, the locking has to change. The sflist is actually already protected by RTNL, so if it's converted to use RCU, its control-path context can be switched to sleepable. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | mld: get rid of inet6_dev->mc_lock | Taehee Yoo | 1 | -1/+0
The purpose of mc_lock is to protect inet6_dev->mc_tomb. But mc_tomb is already protected by RTNL, and all functions that manipulate mc_tomb are called under RTNL. So mc_lock is not needed. Furthermore, it is a spinlock, so the critical section is atomic. In order to reduce atomic context, it should be removed. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | mld: convert from timer to delayed work | Taehee Yoo | 1 | -4/+4
mcast.c has several timers for delaying work. A timer's expiry handler runs in atomic context, so it can't use sleepable things such as GFP_KERNEL, mutexes, etc. In order to use sleepable APIs, convert from timers to delayed work. But there are some critical sections used by both process and BH context, so those still use spin_lock_bh() and rwlock. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
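A generic sketch of the timer-to-delayed-work conversion (illustrative, not the mcast.c code):

  #include <linux/workqueue.h>

  /* before: timer handler, runs in atomic (softirq) context */
  static void mc_timer_fn(struct timer_list *t)
  {
          /* no GFP_KERNEL allocations, no mutexes allowed here */
  }

  /* after: work handler, runs in process context and may sleep */
  static void mc_work_fn(struct work_struct *work)
  {
          /* sleepable APIs (GFP_KERNEL, mutex_lock(), ...) are fine */
  }

  static DECLARE_DELAYED_WORK(mc_work, mc_work_fn);

  /* mod_timer(&timer, jiffies + delay) becomes: */
  schedule_delayed_work(&mc_work, delay);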
2021-03-26 | ethtool: document the enum values, not defines | Jakub Kicinski | 1 | -10/+10
kdoc does not have good support for documenting defines, and we can't abuse the enum documentation because it generates warnings. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | ethtool: fec: add note about reuse of reserved | Jakub Kicinski | 1 | -0/+4
struct ethtool_fecparam::reserved can't be used in SET, because ethtool user space doesn't zero-initialize the structure. Make this clear. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | tcp: convert eligible sysctls to u8 | Eric Dumazet | 1 | -34/+34
Many tcp sysctls are either bools or small ints that can fit into a u8. Reducing the space taken by sysctls can save a few cache line misses when sending/receiving data while cpu caches are empty, for example after a cpu idle period. This is hard to measure with typical network performance tests, but after this patch, struct netns_ipv4 has shrunk by three cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | inet: convert tcp_early_demux and udp_early_demux to u8 | Eric Dumazet | 1 | -2/+2
For these sysctls, their dedicated helpers have to use proc_dou8vec_minmax(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | ipv4: convert ip_forward_update_priority sysctl to u8 | Eric Dumazet | 1 | -1/+1
This sysctl uses ip_fwd_update_priority() helper, so the conversion needs to change it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | ipv4: shrink netns_ipv4 with sysctl conversions | Eric Dumazet | 1 | -16/+16
These sysctls that can fit in one byte instead of one int are converted to save space and thus reduce cache line misses:
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | sysctl: add proc_dou8vec_minmax() | Eric Dumazet | 1 | -0/+2
Networking has many sysctls that could fit in one u8. This patch adds proc_dou8vec_minmax() for this purpose. Note that the .extra1 and .extra2 fields point to integers, because it makes conversions easier. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
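A sketch of a converted ctl_table entry, using one of the sysctls named in the entries above (illustrative, not a verbatim diff):

  static struct ctl_table example[] = {
          {
                  .procname     = "ip_no_pmtu_disc",
                  .data         = &init_net.ipv4.sysctl_ip_no_pmtu_disc, /* now a u8 */
                  .maxlen       = sizeof(u8),
                  .mode         = 0644,
                  .proc_handler = proc_dou8vec_minmax,
                  /* .extra1/.extra2 still point at ints, per the note above */
                  .extra1       = SYSCTL_ZERO,
                  .extra2       = SYSCTL_ONE,
          },
          { }
  };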
2021-03-25 | net: stmmac: use interrupt mode INTM=1 for multi-MSI | Wong, Vee Khee | 1 | -0/+1
In interrupt mode INTM=0, a TX/RX transfer-complete event triggers a signal not only on sbd_perch_[tx|rx]_intr_o (Transmit/Receive Per Channel) but also on sbd_intr_o (Common). For the multi-MSI implementation, setting interrupt mode INTM=1 is more efficient, as each TX and RX interrupt (TI/RI) is handled by the TX/RX ISR without the need to call the common MAC ISR. Update the TX/RX NORMAL interrupt status checking process, as the NIS status bit is not asserted for any RI/TI events when INTM=1. Signed-off-by: Wong, Vee Khee <vee.khee.wong@intel.com> Co-developed-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | net: stmmac: introduce MSI Interrupt routines for mac, safety, RX & TX | Ong Boon Leong | 1 | -0/+8
Introduce MSI interrupt service routines and hook them up if stmmac_open() sees a valid irq line being requested:
  stmmac_mac_interrupt()    :- MAC (dev->irq), WOL (wol_irq), LPI (lpi_irq)
  stmmac_safety_interrupt() :- Safety Feat Correctable Error (sfty_ce_irq) & Uncorrectable Error (sfty_ue_irq)
  stmmac_msi_intr_rx()      :- For all RX MSI irq (rx_irq)
  stmmac_msi_intr_tx()      :- For all TX MSI irq (tx_irq)
Each IRQ gets its own unique name so that we can differentiate them easily under /proc/interrupts.
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | ethtool: clarify the ethtool FEC interface | Jakub Kicinski | 1 | -7/+30
The definition of the FEC driver interface is quite unclear. Improve the documentation. This is based on current driver and user space code, as well as the discussions about the interface:

RFC v1 (24 Oct 2016): https://lore.kernel.org/netdev/1477363849-36517-1-git-send-email-vidya@cumulusnetworks.com/
  - this version has the autoneg field
  - no active_fec field
  - none vs off confusion is already present
RFC v2 (10 Feb 2017): https://lore.kernel.org/netdev/1486727004-11316-1-git-send-email-vidya@cumulusnetworks.com/
  - autoneg removed
  - active_fec added
v1 (10 Feb 2017): https://lore.kernel.org/netdev/1486751311-42019-1-git-send-email-vidya@cumulusnetworks.com/
  - no changes in the code
v1 (24 Jun 2017): https://lore.kernel.org/netdev/1498331985-8525-1-git-send-email-roopa@cumulusnetworks.com/
  - include in tree user
v2 (27 Jul 2017): https://lore.kernel.org/netdev/1501199248-24695-1-git-send-email-roopa@cumulusnetworks.com/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | ethtool: fec: sanitize ethtool_fecparam->active_fec | Jakub Kicinski | 1 | -1/+1
struct ethtool_fecparam::active_fec is a GET-only field; all in-tree drivers correctly ignore it on SET. Clear the field on SET to avoid any confusion. Again, we can't reject non-zero now, since ethtool user space does not zero-init the param correctly. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | ethtool: fec: sanitize ethtool_fecparam->reserved | Jakub Kicinski | 1 | -1/+1
struct ethtool_fecparam::reserved is never looked at by the core. Make sure it's actually 0. Unfortunately we can't return an error, because old ethtool doesn't zero-initialize the structure for SET. On GET we can be more verbose; there are no in-tree (ab)users. Fix up the kdoc on the structure. Remove the mention of FEC bypass; it seems like a niche thing to configure in the first place. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | ethtool: fec: remove long structure description | Jakub Kicinski | 1 | -4/+0
Digging through the mailing list archive, @autoneg was part of the first version of the RFC; this left-over comment was pointed out twice in review but wasn't removed. The sentence is an exact copy-paste from pauseparam. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | ethtool: fec: fix typo in kdoc | Jakub Kicinski | 1 | -1/+1
s/porte/the port/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next | David S. Miller | 1 | -0/+1
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-03-24 The following pull-request contains BPF updates for your *net-next* tree. We've added 37 non-merge commits during the last 15 day(s) which contain a total of 65 files changed, 3200 insertions(+), 738 deletions(-). The main changes are: 1) Static linking of multiple BPF ELF files, from Andrii. 2) Move drop error path to devmap for XDP_REDIRECT, from Lorenzo. 3) Spelling fixes from various folks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | David S. Miller | 56 | -115/+283
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds | 4 | -12/+35
Merge misc fixes from Andrew Morton: "14 patches. Subsystems affected by this patch series: mm (hugetlb, kasan, gup, selftests, z3fold, kfence, memblock, and highmem), squashfs, ia64, gcov, and mailmap"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mailmap: update Andrey Konovalov's email address
  mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
  mm: memblock: fix section mismatch warning again
  kfence: make compatible with kmemleak
  gcov: fix clang-11+ support
  ia64: fix format strings for err_inject
  ia64: mca: allocate early mca with GFP_ATOMIC
  squashfs: fix xattr id and id lookup sanity checks
  squashfs: fix inode lookup sanity checks
  z3fold: prevent reclaim/free race for headless pages
  selftests/vm: fix out-of-tree build
  mm/mmu_notifiers: ensure range_end() is paired with range_start()
  kasan: fix per-page tags for non-page_alloc pages
  hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
2021-03-25 | mm: memblock: fix section mismatch warning again | Mike Rapoport | 1 | -2/+2
Commit 34dc2efb39a2 ("memblock: fix section mismatch warning") marked memblock_bottom_up() and memblock_set_bottom_up() as __init, but they could be referenced from non-init functions like memblock_find_in_range_node() on architectures that enable CONFIG_ARCH_KEEP_MEMBLOCK. For such builds the kernel test robot reports:

  WARNING: modpost: vmlinux.o(.text+0x74fea4): Section mismatch in reference from the function memblock_find_in_range_node() to the function .init.text:memblock_bottom_up()
  The function memblock_find_in_range_node() references the function __init memblock_bottom_up().
  This is often because memblock_find_in_range_node lacks a __init annotation or the annotation of memblock_bottom_up is wrong.

Replace the __init annotations with __init_memblock annotations so that the appropriate section will be selected depending on CONFIG_ARCH_KEEP_MEMBLOCK.
Link: https://lore.kernel.org/lkml/202103160133.UzhgY0wt-lkp@intel.com
Link: https://lkml.kernel.org/r/20210316171347.14084-1-rppt@kernel.org
Fixes: 34dc2efb39a2 ("memblock: fix section mismatch warning")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
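The annotation works roughly like this (a sketch of the memblock.h convention the fix relies on):

  /* when the architecture keeps memblock after boot, the functions must
   * stay in regular .text; otherwise they can be discarded with .init
   */
  #ifdef CONFIG_ARCH_KEEP_MEMBLOCK
  #define __init_memblock
  #define __initdata_memblock
  #else
  #define __init_memblock __init
  #define __initdata_memblock __initdata
  #endif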
2021-03-25 | mm/mmu_notifiers: ensure range_end() is paired with range_start() | Sean Christopherson | 1 | -5/+5
If one or more notifiers fails .invalidate_range_start(), invoke .invalidate_range_end() for "all" notifiers. If there are multiple notifiers, those that did not fail are expecting _start() and _end() to be paired, e.g. KVM's mmu_notifier_count would become imbalanced. Disallow notifiers that can fail _start() from implementing _end() so that it's unnecessary to either track which notifiers rejected _start(), or had already succeeded prior to a failed _start(). Note, the existing behavior of calling _start() on all notifiers even after a previous notifier failed _start() was an unintended "feature". Make it canon now that the behavior is depended on for correctness. As of today, the bug is likely benign:
1. The only caller of the non-blocking notifier is OOM kill.
2. The only notifiers that can fail _start() are the i915 and Nouveau drivers.
3. The only notifiers that utilize _end() are the SGI UV GRU driver and KVM.
4. The GRU driver will never coincide with the i915/Nouveau drivers.
5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the _guest_, and the guest is already doomed due to being an OOM victim.
Fix the bug now to play nice with future usage, e.g. KVM has a potential use case for blocking memslot updates in KVM while an invalidation is in-progress, and failure to unblock would result in said updates being blocked indefinitely and hanging. Found by inspection. Verified by adding a second notifier in KVM that periodically returns -EAGAIN on non-blockable ranges, triggering OOM, and observing that KVM exits with an elevated notifier count.
Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com
Fixes: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>