<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-dev/net/packet/internal.h, branch linus/master</title>
<subtitle>Linux kernel development work - see feature branches</subtitle>
<id>https://git.zx2c4.com/linux-dev/atom/net/packet/internal.h?h=linus%2Fmaster</id>
<link rel='self' href='https://git.zx2c4.com/linux-dev/atom/net/packet/internal.h?h=linus%2Fmaster'/>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/'/>
<updated>2021-04-14T21:34:38Z</updated>
<entry>
<title>net/packet: remove data races in fanout operations</title>
<updated>2021-04-14T21:34:38Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2021-04-14T19:36:44Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=94f633ea8ade8418634d152ad0931133338226f6'/>
<id>urn:sha1:94f633ea8ade8418634d152ad0931133338226f6</id>
<content type='text'>
af_packet fanout uses RCU rules to ensure f-&gt;arr elements
are not dismantled before RCU grace period.

However, it lacks rcu accessors to make sure KCSAN and other tools
wont detect data races. Stupid compilers could also play games.

Fixes: dc99f600698d ("packet: Add fanout support.")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: "Gong, Sishuai" &lt;sishuai@purdue.edu&gt;
Cc: Willem de Bruijn &lt;willemb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: packet: make pkt_sk() inline</title>
<updated>2021-01-30T01:38:55Z</updated>
<author>
<name>Menglong Dong</name>
<email>dong.menglong@zte.com.cn</email>
</author>
<published>2021-01-27T12:33:02Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=8c22475148a8d3222be712bd02a74d7279d50daf'/>
<id>urn:sha1:8c22475148a8d3222be712bd02a74d7279d50daf</id>
<content type='text'>
It's better make 'pkt_sk()' inline here, as non-inline function
shouldn't occur in headers. Besides, this function is simple
enough to be inline.

Signed-off-by: Menglong Dong &lt;dong.menglong@zte.com.cn&gt;
Link: https://lore.kernel.org/r/20210127123302.29842-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net/packet: make packet_fanout.arr size configurable up to 64K</title>
<updated>2020-11-10T00:41:40Z</updated>
<author>
<name>Tanner Love</name>
<email>tannerlove@google.com</email>
</author>
<published>2020-11-06T18:07:40Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=9c661b0b85444e426d3f23250305eeb16f6ffe88'/>
<id>urn:sha1:9c661b0b85444e426d3f23250305eeb16f6ffe88</id>
<content type='text'>
One use case of PACKET_FANOUT is lockless reception with one socket
per CPU. 256 is a practical limit on increasingly many machines.

Increase PACKET_FANOUT_MAX to 64K. Expand setsockopt PACKET_FANOUT to
take an extra argument max_num_members. Also explicitly define a
fanout_args struct, instead of implicitly casting to an integer. This
documents the API and simplifies the control flow.

If max_num_members is not specified or is set to 0, then 256 is used,
same as before.

Signed-off-by: Tanner Love &lt;tannerlove@google.com&gt;
Signed-off-by: Willem de Bruijn &lt;willemb@google.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>af_packet: TPACKET_V3: replace busy-wait loop</title>
<updated>2020-07-16T18:40:39Z</updated>
<author>
<name>John Ogness</name>
<email>john.ogness@linutronix.de</email>
</author>
<published>2020-07-07T15:22:04Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=632ca50f2cbd8809a2fc1934b4b2c9c49bd55e98'/>
<id>urn:sha1:632ca50f2cbd8809a2fc1934b4b2c9c49bd55e98</id>
<content type='text'>
A busy-wait loop is used to implement waiting for bits to be copied
from the skb to the kernel buffer before retiring a block. This is
a problem on PREEMPT_RT because the copying task could be preempted
by the busy-waiting task and thus live lock in the busy-wait loop.

Replace the busy-wait logic with an rwlock_t. This provides lockdep
coverage and makes the code RT ready.

Signed-off-by: John Ogness &lt;john.ogness@linutronix.de&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net/packet: tpacket_rcv: avoid a producer race condition</title>
<updated>2020-03-15T07:25:25Z</updated>
<author>
<name>Willem de Bruijn</name>
<email>willemb@google.com</email>
</author>
<published>2020-03-13T16:18:09Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=61fad6816fc10fb8793a925d5c1256d1c3db0cd2'/>
<id>urn:sha1:61fad6816fc10fb8793a925d5c1256d1c3db0cd2</id>
<content type='text'>
PACKET_RX_RING can cause multiple writers to access the same slot if a
fast writer wraps the ring while a slow writer is still copying. This
is particularly likely with few, large, slots (e.g., GSO packets).

Synchronize kernel thread ownership of rx ring slots with a bitmap.

Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
while holding the sk receive queue lock. They release this lock before
copying and set tp_status to TP_STATUS_USER to release to userspace
when done. During copying, another writer may take the lock, also see
TP_STATUS_KERNEL, and start writing to the same slot.

Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
slot, test and set with the lock held. To release race-free, update
tp_status and owner bit as a transaction, so take the lock again.

This is the one of a variety of discussed options (see Link below):

* instead of a shadow ring, embed the data in the slot itself, such as
in tp_padding. But any test for this field may match a value left by
userspace, causing deadlock.

* avoid the lock on release. This leaves a small race if releasing the
shadow slot before setting TP_STATUS_USER. The below reproducer showed
that this race is not academic. If releasing the slot after tp_status,
the race is more subtle. See the first link for details.

* add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
of two fields. But, legacy applications may interpret all non-zero
tp_status as owned by the user. As libpcap does. So this is possible
only opt-in by newer processes. It can be added as an optional mode.

* embed the struct at the tail of pg_vec to avoid extra allocation.
The implementation proved no less complex than a separate field.

The additional locking cost on release adds contention, no different
than scaling on multicore or multiqueue h/w. In practice, below
reproducer nor small packet tcpdump showed a noticeable change in
perf report in cycles spent in spinlock. Where contention is
problematic, packet sockets support mitigation through PACKET_FANOUT.
And we can consider adding opt-in state TP_KERNEL_OWNED.

Easy to reproduce by running multiple netperf or similar TCP_STREAM
flows concurrently with `tcpdump -B 129 -n greater 60000`.

Based on an earlier patchset by Jon Rosen. See links below.

I believe this issue goes back to the introduction of tpacket_rcv,
which predates git history.

Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html
Suggested-by: Jon Rosen &lt;jrosen@cisco.com&gt;
Signed-off-by: Willem de Bruijn &lt;willemb@google.com&gt;
Signed-off-by: Jon Rosen &lt;jrosen@cisco.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2019-06-28T04:06:39Z</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2019-06-28T04:06:39Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=d96ff269a04be286989ead13bf8b4be55bdee8ee'/>
<id>urn:sha1:d96ff269a04be286989ead13bf8b4be55bdee8ee</id>
<content type='text'>
The new route handling in ip_mc_finish_output() from 'net' overlapped
with the new support for returning congestion notifications from BPF
programs.

In order to handle this I had to take the dev_loopback_xmit() calls
out of the switch statement.

The aquantia driver conflicts were simple overlapping changes.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET</title>
<updated>2019-06-27T02:38:29Z</updated>
<author>
<name>Neil Horman</name>
<email>nhorman@tuxdriver.com</email>
</author>
<published>2019-06-25T21:57:49Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=89ed5b519004a7706f50b70f611edbd3aaacff2c'/>
<id>urn:sha1:89ed5b519004a7706f50b70f611edbd3aaacff2c</id>
<content type='text'>
When an application is run that:
a) Sets its scheduler to be SCHED_FIFO
and
b) Opens a memory mapped AF_PACKET socket, and sends frames with the
MSG_DONTWAIT flag cleared, its possible for the application to hang
forever in the kernel.  This occurs because when waiting, the code in
tpacket_snd calls schedule, which under normal circumstances allows
other tasks to run, including ksoftirqd, which in some cases is
responsible for freeing the transmitted skb (which in AF_PACKET calls a
destructor that flips the status bit of the transmitted frame back to
available, allowing the transmitting task to complete).

However, when the calling application is SCHED_FIFO, its priority is
such that the schedule call immediately places the task back on the cpu,
preventing ksoftirqd from freeing the skb, which in turn prevents the
transmitting task from detecting that the transmission is complete.

We can fix this by converting the schedule call to a completion
mechanism.  By using a completion queue, we force the calling task, when
it detects there are no more frames to send, to schedule itself off the
cpu until such time as the last transmitted skb is freed, allowing
forward progress to be made.

Tested by myself and the reporter, with good results

Change Notes:

V1-&gt;V2:
	Enhance the sleep logic to support being interruptible and
allowing for honoring to SK_SNDTIMEO (Willem de Bruijn)

V2-&gt;V3:
	Rearrage the point at which we wait for the completion queue, to
avoid needing to check for ph/skb being null at the end of the loop.
Also move the complete call to the skb destructor to avoid needing to
modify __packet_set_status.  Also gate calling complete on
packet_read_pending returning zero to avoid multiple calls to complete.
(Willem de Bruijn)

	Move timeo computation within loop, to re-fetch the socket
timeout since we also use the timeo variable to record the return code
from the wait_for_complete call (Neil Horman)

V3-&gt;V4:
	Willem has requested that the control flow be restored to the
previous state.  Doing so lets us eliminate the need for the
po-&gt;wait_on_complete flag variable, and lets us get rid of the
packet_next_frame function, but introduces another complexity.
Specifically, but using the packet pending count, we can, if an
applications calls sendmsg multiple times with MSG_DONTWAIT set, each
set of transmitted frames, when complete, will cause
tpacket_destruct_skb to issue a complete call, for which there will
never be a wait_on_completion call.  This imbalance will lead to any
future call to wait_for_completion here to return early, when the frames
they sent may not have completed.  To correct this, we need to re-init
the completion queue on every call to tpacket_snd before we enter the
loop so as to ensure we wait properly for the frames we send in this
iteration.

	Change the timeout and interrupted gotos to out_put rather than
out_status so that we don't try to free a non-existant skb
	Clean up some extra newlines (Willem de Bruijn)

Reviewed-by: Willem de Bruijn &lt;willemb@google.com&gt;
Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;
Reported-by: Matteo Croce &lt;mcroce@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net/packet: make tp_drops atomic</title>
<updated>2019-06-15T01:52:14Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2019-06-12T16:52:30Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=8e8e2951e3095732d7e780c241f61ea130955a57'/>
<id>urn:sha1:8e8e2951e3095732d7e780c241f61ea130955a57</id>
<content type='text'>
Under DDOS, we want to be able to increment tp_drops without
touching the spinlock. This will help readers to drain
the receive queue slightly faster :/

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Revert "packet: switch kvzalloc to allocate memory"</title>
<updated>2018-09-01T06:00:28Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2018-08-29T18:50:12Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=3a7ad0634f0986d807772ba74f66f7c3a73612e5'/>
<id>urn:sha1:3a7ad0634f0986d807772ba74f66f7c3a73612e5</id>
<content type='text'>
This reverts commit 71e41286203c017d24f041a7cd71abea7ca7b1e0.

mmap()/munmap() can not be backed by kmalloced pages :

We fault in :

    VM_BUG_ON_PAGE(PageSlab(page), page);

    unmap_single_vma+0x8a/0x110
    unmap_vmas+0x4b/0x90
    unmap_region+0xc9/0x140
    do_munmap+0x274/0x360
    vm_munmap+0x81/0xc0
    SyS_munmap+0x2b/0x40
    do_syscall_64+0x13e/0x1c0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: 71e41286203c ("packet: switch kvzalloc to allocate memory")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: John Sperbeck &lt;jsperbeck@google.com&gt;
Bisected-by: John Sperbeck &lt;jsperbeck@google.com&gt;
Cc: Zhang Yu &lt;zhangyu31@baidu.com&gt;
Cc: Li RongQing &lt;lirongqing@baidu.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>packet: switch kvzalloc to allocate memory</title>
<updated>2018-08-13T16:21:05Z</updated>
<author>
<name>Li RongQing</name>
<email>lirongqing@baidu.com</email>
</author>
<published>2018-08-13T02:42:46Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=71e41286203c017d24f041a7cd71abea7ca7b1e0'/>
<id>urn:sha1:71e41286203c017d24f041a7cd71abea7ca7b1e0</id>
<content type='text'>
The patches includes following change:

*Use modern kvzalloc()/kvfree() instead of custom allocations.

*Remove order argument for alloc_pg_vec, it can get from req.

*Remove order argument for free_pg_vec, free_pg_vec now uses
kvfree which does not need order argument.

*Remove pg_vec_order from struct packet_ring_buffer, no longer
need to save/restore 'order'

*Remove variable 'order' for packet_set_ring, it is now unused

Signed-off-by: Zhang Yu &lt;zhangyu31@baidu.com&gt;
Signed-off-by: Li RongQing &lt;lirongqing@baidu.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
