path: root/sys/kern/uipc_socket.c
* 2021-02-24 bluhm (1 file, -3/+2): In sorflush() use m_purge() instead of
  handrolling it.
  no objections mvs@
* 2021-02-18 mvs (1 file, -6/+7): Release the mbuf(9) chain with a simple
  m_freem(9) loop in sorflush().
  Passing a local copy of the socket to sbrelease() is too complicated just
  to free the receive buffer. We no longer allocate a large object on the
  stack, and we no longer pass an unlocked socket to soassertlocked() within
  sbdrop(); the latter was not triggered only because the whole layer is
  locked with one lock. Also, sorflush() is now private to
  kern/uipc_socket.c, so its definition was adjusted accordingly.
  ok claudio@ mpi@
* 2021-01-17 visa (1 file, -7/+1): Replace SB_KNOTE and sb_flagsintr with
  direct checking of klist.
  OK mpi@ as part of a larger diff
* 2021-01-09 bluhm (1 file, -3/+2): If the loop check in somove(9) goes to
  release without setting an error, a broadcast mbuf will stay in the socket
  buffer forever. This is bad as multiple mbufs can use up all the space.
  Better report ELOOP, dissolve splicing, and let userland handle it.
  OK anton@
* 2020-12-25 visa (1 file, -4/+4): Refactor klist insertion and removal.
  Rename klist_{insert,remove}() to klist_{insert,remove}_locked(). These
  functions assume that the caller has locked the klist. The current state
  of locking remains intact because the kernel lock is still used with all
  klists.
  Add new functions klist_insert() and klist_remove() that lock the klist
  internally. This allows some code simplification.
  OK mpi@
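  A minimal sketch of the resulting API split, assuming both variants take a
  (struct klist *, struct knote *) pair; the helper names are hypothetical:

      /* Caller does not hold the klist lock: the unsuffixed variant locks
       * the klist internally. */
      void
      example_attach_note(struct klist *list, struct knote *kn)
      {
              klist_insert(list, kn);
      }

      /* Caller already holds the klist lock, so use the _locked variant. */
      void
      example_attach_note_locked(struct klist *list, struct knote *kn)
      {
              klist_insert_locked(list, kn);
      }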
* 2020-12-12 jan (1 file, -3/+3): Rename the macro MCLGETI to MCLGETL and
  remove the dead parameter ifp.
  OK dlg@, bluhm@
  No Opinion mpi@
  Not against it claudio@
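  A hedged illustration of a call site after the rename; the helper name is
  hypothetical and the macro arguments follow the shape described above:

      /* Allocate a packet header mbuf backed by a cluster of len bytes. */
      struct mbuf *
      example_alloc_cluster(int len)
      {
              struct mbuf *m;

              MGETHDR(m, M_DONTWAIT, MT_DATA);
              if (m == NULL)
                      return (NULL);
              /* was: MCLGETI(m, M_DONTWAIT, NULL, len) -- ifp was unused */
              MCLGETL(m, M_DONTWAIT, len);
              if ((m->m_flags & M_EXT) == 0) {
                      m_freem(m);
                      return (NULL);
              }
              return (m);
      }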
* 2020-11-17 claudio (1 file, -6/+7): Fix handling of MSG_PEEK in soreceive()
  for the case where an empty mbuf is encountered in a seqpacket socket.
  This diff uses the fact that setting orig_resid to 0 causes soreceive() to
  return instead of looping back with the intent to sleep for more data.
  orig_resid is now always set to 0 in the control message case (instead of
  only if controlp is defined), which is the same behaviour as for the
  PR_NAME case. Additionally, orig_resid is set to 0 in the data reader when
  MSG_PEEK is used.
  Tested in snaps for a while and by anton@
  Reported-by: syzbot+4b0e9698b344b0028b14@syzkaller.appspotmail.com
* 2020-09-29 claudio (1 file, -7/+5): Move the solock() call outside of
  solisten(). The reason is that the so_state and splice checks were done
  without the proper lock, which is incorrect. This is similar to sobind()
  and soconnect(), which also require the callee to hold the socket lock.
  Found by, with and OK mvs@, OK mpi@
* 2020-08-07 cheloha (1 file, -2/+3): sosplice(9): fully validate idle
  timeout.
  The socket splice idle timeout is a timeval, so we need to check that
  tv_usec is both non-negative and less than one million. Otherwise it isn't
  in canonical form. We can check for this with timerisvalid(3).
  benno@ says this shouldn't break anything in base.
  ok benno@, bluhm@
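  A minimal sketch of the canonical-form check described above, assuming only
  what the commit states about timerisvalid(); the helper name is
  hypothetical:

      #include <sys/time.h>
      #include <sys/errno.h>

      /* Reject a user-supplied idle timeout that is not in canonical form:
       * tv_usec must be non-negative and less than one million. */
      int
      example_check_idle_timeout(const struct timeval *tv)
      {
              if (tv->tv_sec < 0 || !timerisvalid(tv))
                      return (EINVAL);
              return (0);
      }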
* 2020-06-22 mpi (1 file, -3/+19): Extend kqueue interface with
  EVFILT_EXCEPT filter.
  This filter, already implemented in macOS and Dragonfly BSD, returns
  exceptional conditions like the reception of out-of-band data.
  The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and it can
  be used by the kqfilter-based poll & select implementation.
  ok millert@ on a previous version, ok visa@
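  A userland sketch of registering the new filter; NOTE_OOB as the fflags
  value follows the macOS/Dragonfly convention the commit mentions, so check
  <sys/event.h> before relying on it:

      #include <sys/types.h>
      #include <sys/event.h>
      #include <err.h>

      /* Ask kqueue to report exceptional conditions (e.g. out-of-band data)
       * on a connected socket. */
      void
      example_watch_except(int kq, int sockfd)
      {
              struct kevent kev;

              EV_SET(&kev, sockfd, EVFILT_EXCEPT, EV_ADD, NOTE_OOB, 0, NULL);
              if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
                      err(1, "kevent");
      }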
* 2020-06-18 mvs (1 file, -5/+5): Compare `so' and `sosp' types just after
  obtaining `sosp'. We can't splice sockets from different domains, so there
  is no reason to have locking and memory allocation in this error path.
  Also, in this case only `so' will be locked by solock(), so we should
  avoid modifying `sosp'.
  ok mpi@
* 2020-06-15 mpi (1 file, -1/+9): Set __EV_HUP when the conditions matching
  poll(2)'s POLLHUP are found.
  This is only done in poll-compatibility mode, when __EV_POLL is set.
  ok visa@, millert@
* 2020-04-12 anton (1 file, -1/+9): In sosplice(), temporarily release the
  socket lock before calling FRELE(), as the last reference could be
  dropped, which in turn will cause soclose() to be called, where the socket
  lock is unconditionally acquired. Note that this is only a problem for
  sockets protected by the non-recursive NET_LOCK() right now.
  ok mpi@ visa@
  Reported-by: syzbot+7c805a09545d997b924d@syzkaller.appspotmail.com
* 2020-04-07 visa (1 file, -6/+6): Abstract the head of knote lists. This
  allows extending the lists, for example, with locking assertions.
  OK mpi@, anton@
* 2020-03-11 sashan (1 file, -3/+9): Fix unlimited recursion caused by local
  outbound bcast/mcast packets sent via a spliced socket.
  Reported-by: syzbot+2f9616f39d3f3b281cfb@syzkaller.appspotmail.com
  OK bluhm@
* 2020-02-20 visa (1 file, -4/+4): Replace field f_isfd with field f_flags
  in struct filterops to allow adding more filter properties without
  cluttering the struct.
  OK mpi@, anton@
* 2020-02-14 mpi (1 file, -3/+1): Push the KERNEL_LOCK() inside pgsigio()
  and selwakeup().
  The 3 subsystems (signal, poll/select and kqueue) can now be addressed
  separately.
  Note that bpf(4) and audio(4) currently delay the wakeups to a separate
  context in order to respect the KERNEL_LOCK() requirement. Sockets (UDP,
  TCP) and pipes spin to grab the lock for the same reasons.
  ok anton@, visa@
* 2020-01-15 mpi (1 file, -14/+30): Keep socket timeout intervals in nsecs
  and use them with tsleep_nsec(9).
  Introduce and use TIMEVAL_TO_NSEC() to convert SO_RCVTIMEO/SO_SNDTIMEO
  specified values into nanoseconds. As a side effect it is now possible to
  specify a timeout larger than (USHRT_MAX / 100) seconds.
  To keep the code simple, `so_linger' now represents a number of seconds
  with 0 meaning no timeout or 'infinity'.
  Yes, the 0 -> INFSLP API change makes conversions complicated as many
  timeout holders are still memset()'d.
  Inputs from cheloha@ and bluhm@, ok bluhm@
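  The arithmetic behind such a conversion, as a hedged userland-style sketch
  (this is not the kernel's TIMEVAL_TO_NSEC(), only the idea it implements;
  the names are hypothetical):

      #include <sys/time.h>
      #include <stdint.h>

      #define EXAMPLE_NSEC_PER_SEC 1000000000ULL

      /* Convert a timeval to nanoseconds, clamping on overflow so a huge
       * timeout degrades to "effectively infinite" instead of wrapping. */
      uint64_t
      example_timeval_to_nsec(const struct timeval *tv)
      {
              uint64_t secs = (uint64_t)tv->tv_sec;

              if (tv->tv_sec < 0 || tv->tv_usec < 0)
                      return (0);
              if (secs >= UINT64_MAX / EXAMPLE_NSEC_PER_SEC)
                      return (UINT64_MAX);
              return (secs * EXAMPLE_NSEC_PER_SEC +
                  (uint64_t)tv->tv_usec * 1000);
      }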
* 2019-12-31 visa (1 file, -7/+21): Use C99 designated initializers with
  struct filterops. In addition, make the structs const so that the data are
  put in .rodata.
  OK mpi@, deraadt@, anton@, bluhm@
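  A generic sketch of the style this moves to: C99 designated initializers
  plus const, so the table ends up in .rodata. The struct and names here are
  hypothetical stand-ins, not the real filterops definitions:

      struct example_ops {
              int     flags;
              int     (*attach)(void *);
              void    (*detach)(void *);
              int     (*event)(void *, long);
      };

      static int      example_attach(void *);
      static void     example_detach(void *);
      static int      example_event(void *, long);

      /* const + designated initializers: the object is read-only, and any
       * field not listed is zero-initialized. */
      const struct example_ops example_read_ops = {
              .attach = example_attach,
              .detach = example_detach,
              .event  = example_event,
      };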
* 2019-12-12 visa (1 file, -3/+16): Reintroduce socket locking inside socket
  event filters.
  Tested by anton@, sashan@
  OK mpi@, anton@, sashan@
* 2019-12-02 cheloha (1 file, -4/+3): Revert "timeout(9): switch to tickless
  backend".
  It appears to have caused major performance regressions all over the
  network stack.
  Reported by bluhm@
  ok deraadt@
* 2019-11-26 cheloha (1 file, -3/+4): timeout(9): switch to tickless backend.
  Rebase the timeout wheel on the system uptime clock. Timeouts are now set
  to run at or after an absolute time as returned by nanouptime(9). Timeouts
  are thus "tickless": they expire at a real time on that clock instead of
  at a particular value of the global "ticks" variable.
  To facilitate this change the timeout struct's .to_time member becomes a
  timespec. Hashing timeouts into a bucket on the wheel changes slightly: we
  build a 32-bit hash with 25 bits of seconds (.tv_sec) and 7 bits of
  subseconds (.tv_nsec). 7 bits of subseconds means the width of the lowest
  wheel level is now 2 seconds on all platforms and each bucket in that
  lowest level corresponds to 1/128 seconds on the uptime clock. These
  values were chosen to closely align with the current 100hz hardclock(9)
  typical on almost all of our platforms. At 100hz a bucket is currently
  ~1/100 seconds wide on the lowest level and the lowest level itself is
  ~2.56 seconds wide. Not a huge change, but a change nonetheless.
  Because a bucket no longer corresponds to a single tick, more than one
  bucket may be dumped during an average timeout_hardclock_update() call. On
  100hz platforms you now dump ~2 buckets. On 64hz machines (sh) you dump ~4
  buckets. On 1024hz machines (alpha) you dump only 1 bucket, but you are
  doing extra work in softclock() to reschedule timeouts that aren't due
  yet.
  To avoid changing current behavior, all timeout_add*(9) interfaces convert
  their timeout interval into ticks, compute an equivalent timespec
  interval, and then add that interval to the timestamp of the most recent
  timeout_hardclock_update() call to determine an absolute deadline. So all
  current timeouts still "use" ticks, but the ticks are faked in the timeout
  layer.
  A new interface, timeout_at_ts(9), is introduced here to bypass this
  backwardly compatible behavior. It will be used in subsequent diffs to add
  absolute timeout support for userland and to clean up some of the messier
  parts of kernel timekeeping, especially at the syscall layer.
  Because timeouts are based against the uptime clock they are subject to
  NTP adjustment via adjtime(2) and adjfreq(2). Unless you have a crazy
  adjfreq(2) adjustment set this will not change the expiration behavior of
  your timeouts.
  Tons of design feedback from mpi@, visa@, guenther@, and kettenis@.
  Additional amd64 testing from anton@ and visa@. Octeon testing from visa@.
  macppc testing from me. Positive feedback from deraadt@, ok visa@
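  A hedged sketch of the bucket hash described above (25 bits of seconds plus
  7 bits of subseconds, i.e. 1/128-second resolution); this mirrors the
  description only, not the actual kernel code:

      #include <sys/time.h>
      #include <stdint.h>

      /* Build a 32-bit wheel hash from an absolute timespec deadline. */
      uint32_t
      example_timeout_hash(const struct timespec *ts)
      {
              uint32_t secs = (uint32_t)ts->tv_sec & 0x01ffffff;  /* 25 bits */
              uint32_t frac = (uint32_t)(ts->tv_nsec / (1000000000 / 128));

              return ((secs << 7) | (frac & 0x7f));               /* 7 bits */
      }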
* 2019-07-22 robert (1 file, -1/+9): Implement SO_DOMAIN and SO_PROTOCOL so
  that the domain and the protocol can also be retrieved with getsockopt(3).
  It looks like these will also be in the next issue of POSIX:
  http://austingroupbugs.net/view.php?id=840#c2263
  ok claudio@, sthen@
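  A small userland example of the new options (error handling kept minimal):

      #include <sys/socket.h>
      #include <stdio.h>

      /* Print the domain and protocol of an existing socket. */
      void
      example_print_domain_proto(int s)
      {
              int domain, proto;
              socklen_t len;

              len = sizeof(domain);
              if (getsockopt(s, SOL_SOCKET, SO_DOMAIN, &domain, &len) == -1)
                      return;
              len = sizeof(proto);
              if (getsockopt(s, SOL_SOCKET, SO_PROTOCOL, &proto, &len) == -1)
                      return;
              printf("domain %d protocol %d\n", domain, proto);
      }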
* 2019-07-11 bluhm (1 file, -4/+3): listen(2) should return EINVAL if the
  socket is connected. This behavior matches NetBSD, POSIX, and our own man
  page. Fix whitespace while here.
  from Moritz Buhl; OK millert@
* 2019-07-04 bluhm (1 file, -11/+27): Remove a useless kernel lock from the
  TCP socket splicing path.
  When send buffer space in the drain socket becomes available, a task is
  added to move data, and also the userland was informed. The latter is not
  useful as this would mix a kernel and a user stream, so programs do not
  wait for this event. Avoid calling sowakeup() from sowwakeup(); this also
  reduces grabbing the kernel lock. Instead, inform the userland about the
  write event when the splicing is dissolved in sounsplice().
  OK claudio@
* 2018-12-17 bluhm (1 file, -3/+11): When using MSG_WAITALL, soreceive() can
  sleep while processing the receive buffer of a stream socket. Then a new
  pair of control and data mbufs can be appended to the mbuf queue. In this
  case, terminate the loop with a short read to prevent a panic. Userland
  should read the control message with the next system call.
  OK claudio@ deraadt@
* 2018-11-30 claudio (1 file, -2/+2): Trivial MH_ALIGN/M_ALIGN to m_align
  conversions.
  OK bluhm@
* 2018-11-21 claudio (1 file, -10/+12): When using MSG_PEEK to peek into
  packets, skip control messages holding SCM_RIGHTS instead of sending them
  to the userland, since they hold kernel-internal data and it does not make
  sense to externalize it.
  OK deraadt@, guenther@, visa@
* 2018-11-19 visa (1 file, -5/+7): Utilize sigio with sockets.
  OK mpi@
* 2018-08-21 bluhm (1 file, -5/+9): If the control message of IP_SENDSRCADDR
  did not fit into the socket buffer together with a UDP packet, sosend(9)
  returned EWOULDBLOCK. As it is a persistent problem, EMSGSIZE is the
  correct error code. Split the AF_UNIX case into a separate condition and
  do not change its logic. For atomic protocols, check that both data and
  control message length fit into the socket buffer.
  original bug report from Alexander Markert
  discussed with jca@; OK vgross@
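  A userland sketch of the IP_SENDSRCADDR usage involved here: the desired
  source address rides along as a control message on a UDP sendmsg(). The
  helper name is hypothetical and error handling is minimal:

      #include <sys/types.h>
      #include <sys/socket.h>
      #include <sys/uio.h>
      #include <netinet/in.h>
      #include <string.h>

      ssize_t
      example_send_with_srcaddr(int s, const void *data, size_t len,
          const struct sockaddr_in *dst, struct in_addr src)
      {
              struct msghdr msg;
              struct iovec iov;
              struct cmsghdr *cmsg;
              union {
                      struct cmsghdr hdr;
                      unsigned char buf[CMSG_SPACE(sizeof(struct in_addr))];
              } cmsgbuf;

              memset(&msg, 0, sizeof(msg));
              memset(&cmsgbuf, 0, sizeof(cmsgbuf));

              iov.iov_base = (void *)data;
              iov.iov_len = len;
              msg.msg_name = (void *)dst;
              msg.msg_namelen = sizeof(*dst);
              msg.msg_iov = &iov;
              msg.msg_iovlen = 1;
              msg.msg_control = cmsgbuf.buf;
              msg.msg_controllen = sizeof(cmsgbuf.buf);

              /* IPPROTO_IP/IP_SENDSRCADDR carries the source address. */
              cmsg = CMSG_FIRSTHDR(&msg);
              cmsg->cmsg_level = IPPROTO_IP;
              cmsg->cmsg_type = IP_SENDSRCADDR;
              cmsg->cmsg_len = CMSG_LEN(sizeof(struct in_addr));
              memcpy(CMSG_DATA(cmsg), &src, sizeof(src));

              return (sendmsg(s, &msg, 0));
      }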
* 2018-07-30 mpi (1 file, -11/+7): Use FNONBLOCK instead of SS_NBIO to
  check/indicate that the I/O mode for sockets is non-blocking.
  This allows us to G/C SS_NBIO. Having to keep the two flags in sync in a
  mp-safe way is complicated.
  This change introduces a behavior change in sosplice(): it can now always
  block. However this should not matter much due to the socket lock being
  taken beforehand.
  ok bluhm@, benno@, visa@
* 2018-07-05 visa (1 file, -4/+16): Serialize the sosplice taskq allocation.
  This prevents an unlikely duplicate allocation that could happen in the
  future when each socket has a dedicated lock. Right now, the code path is
  serialized also by the NET_LOCK() (and the KERNEL_LOCK()).
  OK mpi@
* 2018-06-14 bluhm (1 file, -4/+6): In soclose() and soaccept() convert the
  KASSERT(SS_NOFDREF) back to a panic message. The latter prints socket
  pointer and type to help debugging.
  OK mpi@
* 2018-06-06 mpi (1 file, -23/+28): Pass the socket to sounlock(); this
  prepares the terrain for per-socket locking.
  ok visa@, bluhm@
* 2018-06-06 mpi (1 file, -5/+3): Assert that a pfkey or routing socket is
  referenced by a `fp' instead of calling sofree() when its PCB is detached.
  This is different from TCP, which does not always detach `inpcb's from
  sockets. In the pfkey & routing case calling sofree() there is a no-op,
  whereas for TCP it's needed to free closed connections.
  Having fewer sofree() calls makes it easier to understand the code and
  move the locks down.
  ok visa@
* 2018-05-08 bluhm (1 file, -8/+20): Socket splicing can delay operations by
  task or timeout. Introduce soreaper() that is scheduled onto the timer
  thread. soput() is scheduled from there onto the sosplice task thread.
  After that it is safe to pool_put() the socket and splicing data
  structures.
  OK mpi@ visa@
* 2018-04-08 guenther (1 file, -4/+4): AF_LOCAL was a failed attempt (by
  POSIX?) to seem less UNIX-specific, but AF_UNIX is both the historical
  _and_ standard name, so prefer and recommend it in the headers, manpages,
  and kernel.
  ok miller@ deraadt@ schwarze@
* 2018-03-27 mpi (1 file, -5/+4): Use a goto to merge multiple error blocks
  in sosplice().
  ok bluhm@
* 2018-03-01 bluhm (1 file, -4/+27): When socket splicing is involved, delay
  the pool_put() until after the splicing thread has finished sotask() with
  the socket to be freed.
  Use after free reported and fix successfully tested by Rivo Nurges.
  discussed with mpi@
* 2018-02-19 mpi (1 file, -3/+4): Grab solock() inside soconnect2() instead
  of asserting for it to be held.
  ok millert@
* 2018-02-19 mpi (1 file, -3/+3): Remove the almost unused `flags' argument
  of suser().
  The account flag `ASU' will no longer be set, but that makes suser()
  mpsafe since it no longer messes with a per-process field.
  No objection from millert@, ok tedu@, bluhm@
* 2018-01-10 bluhm (1 file, -2/+3): Mark sosplice task mp safe, do not grab
  kernel lock for tcp output.
  OK mpi@
* 2018-01-09 mpi (1 file, -2/+2): Change `so_state' and `so_error' to
  unsigned int such that they can be atomically read from any context.
  ok bluhm@, visa@
* 2018-01-02 mpi (1 file, -7/+3): Do not memset() the whole structure in
  sorflush() to keep `sb_flagsintr' untouched.
  ok bluhm@, visa@
* 2017-12-19 mpi (1 file, -5/+5): Remove unnecessary unlock/lock dance when
  following a goto.
  ok bluhm@
* 2017-12-18 mpi (1 file, -16/+3): Revert grabbing the socket lock in
  kqueue(2) filters.
  This change exposed or created a situation where a CPU started to be
  unresponsive while holding the KERNEL_LOCK(). This led to lockups, and
  even with MP_LOCKDEBUG it was not clear what happened to this CPU.
  These situations have been experienced by dhill@ with dcrwallet and jcs@
  with syncthing. Both applications are written in Go and do kevent(2) &
  networking across multiple threads.
* 2017-12-10 mpi (1 file, -15/+16): Move SB_SPLICE, SB_WAIT and SB_SEL to
  `sb_flags', serialized by solock().
  SB_KNOTE remains the only bit set on `sb_flagsintr' as it is set/unset in
  contexts related to kqueue(2) where we'd like to avoid grabbing solock().
  While here add some KERNEL_LOCK()/UNLOCK() dances around selwakeup() and
  csignal() to mark which remaining functions need to be addressed in the
  socket layer.
  ok visa@, bluhm@
* 2017-11-23 mpi (1 file, -5/+5): Constify protocol tables and remove an
  assert now that ip_deliver() is mp-safe.
  ok bluhm@, visa@
* 2017-11-23 mpi (1 file, -12/+12): We want `sb_flags' to be protected by
  the socket lock rather than the KERNEL_LOCK(), so change asserts
  accordingly. This is now possible since sblock()/sbunlock() are always
  called with the socket lock held.
  ok bluhm@, visa@
* 2017-11-04 mpi (1 file, -3/+16): Make it possible for multiple threads to
  enter kqueue_scan() in parallel. This is a requirement to use a sleeping
  lock inside kqueue filters. It is now possible, but not recommended, to
  sleep inside ``f_event''.
  Threads iterating over the list of pending events are now recognizing and
  skipping other threads' markers. knote_acquire() and knote_release() must
  be used to "own" a knote to make sure no other thread is sleeping with a
  reference on it.
  Acquire and marker logic taken from DragonFly, but the KERNEL_LOCK() is
  still serializing the execution of the kqueue code.
  This also enables the NET_LOCK() in socket filters.
  Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@