<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-dev/include/net/netns, branch master</title>
<subtitle>Linux kernel development work - see feature branches</subtitle>
<id>https://git.zx2c4.com/linux-dev/atom/include/net/netns?h=master</id>
<link rel='self' href='https://git.zx2c4.com/linux-dev/atom/include/net/netns?h=master'/>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/'/>
<updated>2022-09-29T01:51:47Z</updated>
<entry>
<title>netns: Replace zero-length array with DECLARE_FLEX_ARRAY() helper</title>
<updated>2022-09-29T01:51:47Z</updated>
<author>
<name>Gustavo A. R. Silva</name>
<email>gustavoars@kernel.org</email>
</author>
<published>2022-09-26T23:02:20Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=63a8bf85568b30438431d4d78afb2fde4a280760'/>
<id>urn:sha1:63a8bf85568b30438431d4d78afb2fde4a280760</id>
<content type='text'>
Zero-length arrays are deprecated and we are moving towards adopting
C99 flexible-array members, instead. So, replace zero-length arrays
declarations in anonymous union with the new DECLARE_FLEX_ARRAY()
helper macro.

This helper allows for flexible-array members in unions.

Link: https://github.com/KSPP/linux/issues/193
Link: https://github.com/KSPP/linux/issues/225
Link: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
Signed-off-by: Gustavo A. R. Silva &lt;gustavoars@kernel.org&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/YzIvfGXxfjdXmIS3@work
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net/smc: Unbind r/w buffer size from clcsock and make them tunable</title>
<updated>2022-09-22T10:58:21Z</updated>
<author>
<name>Tony Lu</name>
<email>tonylu@linux.alibaba.com</email>
</author>
<published>2022-09-20T09:52:22Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=0227f058aa29f5ab6f6ec79c3a36ae41f1e03a13'/>
<id>urn:sha1:0227f058aa29f5ab6f6ec79c3a36ae41f1e03a13</id>
<content type='text'>
Currently, SMC uses smc-&gt;sk.sk_{rcv|snd}buf to create buffers for
send buffer and RMB. And the values of buffer size are from tcp_{w|r}mem
in clcsock.

The buffer size from TCP socket doesn't fit SMC well. Generally, buffers
are usually larger than TCP for SMC-R/-D to get higher performance, for
they are different underlay devices and paths.

So this patch unbinds buffer size from TCP, and introduces two sysctl
knobs to tune them independently. Also, these knobs are per net
namespace and work for containers.

Signed-off-by: Tony Lu &lt;tonylu@linux.alibaba.com&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
</content>
</entry>
<entry>
<title>net/smc: Introduce a specific sysctl for TEST_LINK time</title>
<updated>2022-09-22T10:58:21Z</updated>
<author>
<name>Wen Gu</name>
<email>guwen@linux.alibaba.com</email>
</author>
<published>2022-09-20T09:52:21Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=77eee32514314209961af5c2982e871ecb364445'/>
<id>urn:sha1:77eee32514314209961af5c2982e871ecb364445</id>
<content type='text'>
SMC-R tests the viability of link by sending out TEST_LINK LLC
messages over RoCE fabric when connections on link have been
idle for a time longer than keepalive interval (testlink time).

But using tcp_keepalive_time as testlink time maybe not quite
suitable because it is default no less than two hours[1], which
is too long for single link to find peer dead. The active host
will still use peer-dead link (QP) sending messages, and can't
find out until get IB_WC_RETRY_EXC_ERR error CQEs, which takes
more time than TEST_LINK timeout (SMC_LLC_WAIT_TIME) normally.

So this patch introduces a independent sysctl for SMC-R to set
link keepalive time, in order to detect link down in time. The
default value is 30 seconds.

[1] https://www.rfc-editor.org/rfc/rfc1122#page-101

Signed-off-by: Wen Gu &lt;guwen@linux.alibaba.com&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: Introduce optional per-netns ehash.</title>
<updated>2022-09-20T17:21:50Z</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-09-08T01:10:22Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=d1e5e6408b305ff78b825d437df8d3f77e82a4be'/>
<id>urn:sha1:d1e5e6408b305ff78b825d437df8d3f77e82a4be</id>
<content type='text'>
The more sockets we have in the hash table, the longer we spend looking
up the socket.  While running a number of small workloads on the same
host, they penalise each other and cause performance degradation.

The root cause might be a single workload that consumes much more
resources than the others.  It often happens on a cloud service where
different workloads share the same computing resource.

On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), after running iperf3 in different netns, creating 24Mi sockets
without data transfer in the root netns causes about 10% performance
regression for the iperf3's connection.

 thash_entries		sockets		length		Gbps
	524288		      1		     1		50.7
			   24Mi		    48		45.1

It is basically related to the length of the list of each hash bucket.
For testing purposes to see how performance drops along the length,
I set 131072 (1Mi / 8) to thash_entries, and here's the result.

 thash_entries		sockets		length		Gbps
        131072		      1		     1		50.7
			    1Mi		     8		49.9
			    2Mi		    16		48.9
			    4Mi		    32		47.3
			    8Mi		    64		44.6
			   16Mi		   128		40.6
			   24Mi		   192		36.3
			   32Mi		   256		32.5
			   40Mi		   320		27.0
			   48Mi		   384		25.0

To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.

With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours.  In addition, we can reduce lock contention.

We can control the ehash size by a new sysctl knob.  However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0).  Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash.  The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.

We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).

  # dmesg | cut -d ' ' -f 5- | grep "established hash"
  TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)

  # sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = -524288  # share the global ehash

  # sysctl -w net.ipv4.tcp_child_ehash_entries=100
  net.ipv4.tcp_child_ehash_entries = 100

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets

When more than two processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:

  1) Share the global ehash and create per-netns ehash

  First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
  netns sysctl knobs where we can safely change tcp_child_ehash_entries
  and clone()/unshare() to create a per-netns ehash.

  2) Control write on sysctl by BPF

  We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
  sysctl knobs.

Note that the global ehash allocated at the boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy.  By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node.  Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.

Note also that the default values of two sysctl knobs depend on the ehash
size and should be tuned carefully:

  tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
  tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)

As a bonus, we can dismantle netns faster.  Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns.  Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.

With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
keep it protocol-family-independent.

In the future, we could optimise ehash lookup/iteration further by removing
netns comparison for the per-netns ehash.

Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.</title>
<updated>2022-09-20T17:21:49Z</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-09-08T01:10:18Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=e9bd0cca09d13ac2f08d25e195203e42d4ad1ce8'/>
<id>urn:sha1:e9bd0cca09d13ac2f08d25e195203e42d4ad1ce8</id>
<content type='text'>
We will soon introduce an optional per-netns ehash and access hash
tables via net-&gt;ipv4.tcp_death_row-&gt;hashinfo instead of &amp;tcp_hashinfo
in most places.

It could harm the fast path because dereferences of two fields in net
and tcp_death_row might incur two extra cache line misses.  To save one
dereference, let's place tcp_death_row back in netns_ipv4 and fetch
hashinfo via net-&gt;ipv4.tcp_death_row"."hashinfo.

Note tcp_death_row was initially placed in netns_ipv4, and commit
fbb8295248e1 ("tcp: allocate tcp_death_row outside of struct netns_ipv4")
changed it to a pointer so that we can fire TIME_WAIT timers after freeing
net.  However, we don't do so after commit 04c494e68a13 ("Revert "tcp/dccp:
get rid of inet_twsk_purge()""), so we need not define tcp_death_row as a
pointer.

Also, we move refcount_dec_and_test(&amp;tw_refcount) from tcp_sk_exit() to
tcp_sk_exit_batch() as a debug check.

Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2022-09-08T16:38:30Z</updated>
<author>
<name>Paolo Abeni</name>
<email>pabeni@redhat.com</email>
</author>
<published>2022-09-08T16:34:54Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=9f8f1933dce555d3c246f447f54fca8de8889da9'/>
<id>urn:sha1:9f8f1933dce555d3c246f447f54fca8de8889da9</id>
<content type='text'>
drivers/net/ethernet/freescale/fec.h
  7d650df99d52 ("net: fec: add pm_qos support on imx6q platform")
  40c79ce13b03 ("net: fec: add stop mode support for imx8 platform")

Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf</title>
<updated>2022-09-03T02:38:25Z</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-09-03T02:38:25Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=05a5474efe93054893f1a7936398381e23ecea60'/>
<id>urn:sha1:05a5474efe93054893f1a7936398381e23ecea60</id>
<content type='text'>
Florian Westphal says:

====================
netfilter: bug fixes for net

1. Fix IP address check in irc DCC conntrack helper, this should check
   the opposite direction rather than the destination address of the
   packets' direction, from David Leadbeater.

2. bridge netfilter needs to drop dst references, from Harsh Modi.
   This was fine back in the day the code was originally written,
   but nowadays various tunnels can pre-set metadata dsts on packets.

3. Remove nf_conntrack_helper sysctl and the modparam toggle, users
   need to explicitily assign the helpers to use via nftables or
   iptables.  Conntrack helpers, by design, may be used to add dynamic
   port redirections to internal machines, so its necessary to restrict
   which hosts/peers are allowed to use them.
   It was discovered that improper checking in the irc DCC helper makes
   it possible to trigger the 'please do dynamic port forward'
   from outside by embedding a 'DCC' in a PING request; if the client
   echos that back a expectation/port forward gets added.
   The auto-assign-for-everything mechanism has been in "please don't do this"
   territory since 2012.  From Pablo.

4. Fix a memory leak in the netdev hook error unwind path, also from Pablo.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_conntrack_irc: Fix forged IP logic
  netfilter: nf_tables: clean up hook list when offload flags check fails
  netfilter: br_netfilter: Drop dst references before setting.
  netfilter: remove nf_conntrack_helper sysctl and modparam toggles
====================

Link: https://lore.kernel.org/r/20220901071238.3044-1-fw@strlen.de
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2022-09-01T19:58:02Z</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-09-01T19:58:02Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=60ad1100d525699bce83690757ff3077c6ab83ab'/>
<id>urn:sha1:60ad1100d525699bce83690757ff3077c6ab83ab</id>
<content type='text'>
tools/testing/selftests/net/.gitignore
  sort the net-next version and use it

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: make global challenge ack rate limitation per net-ns and default disabled</title>
<updated>2022-09-01T02:56:48Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2022-08-30T18:56:56Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=79e3602caa6f9d59c4f66a268407080496dae408'/>
<id>urn:sha1:79e3602caa6f9d59c4f66a268407080496dae408</id>
<content type='text'>
Because per host rate limiting has been proven problematic (side channel
attacks can be based on it), per host rate limiting of challenge acks ideally
should be per netns and turned off by default.

This is a long due followup of following commits:

083ae308280d ("tcp: enable per-socket rate limiting of all 'challenge acks'")
f2b2c582e824 ("tcp: mitigate ACK loops for connections as tcp_sock")
75ff39ccc1bd ("tcp: make challenge acks less predictable")

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>netfilter: remove nf_conntrack_helper sysctl and modparam toggles</title>
<updated>2022-08-31T10:12:32Z</updated>
<author>
<name>Pablo Neira Ayuso</name>
<email>pablo@netfilter.org</email>
</author>
<published>2022-08-26T06:49:16Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=b118509076b39cc5e616c0680312b5caaca535fe'/>
<id>urn:sha1:b118509076b39cc5e616c0680312b5caaca535fe</id>
<content type='text'>
__nf_ct_try_assign_helper() remains in place but it now requires a
template to configure the helper.

A toggle to disable automatic helper assignment was added by:

  a9006892643a ("netfilter: nf_ct_helper: allow to disable automatic helper assignment")

in 2012 to address the issues described in "Secure use of iptables and
connection tracking helpers". Automatic conntrack helper assignment was
disabled by:

  3bb398d925ec ("netfilter: nf_ct_helper: disable automatic helper assignment")

back in 2016.

This patch removes the sysctl and modparam toggles, users now have to
rely on explicit conntrack helper configuration via ruleset.

Update tools/testing/selftests/netfilter/nft_conntrack_helper.sh to
check that auto-assignment does not happen anymore.

Acked-by: Aaron Conole &lt;aconole@redhat.com&gt;
Signed-off-by: Pablo Neira Ayuso &lt;pablo@netfilter.org&gt;
</content>
</entry>
</feed>
