Age | Commit message (Collapse) | Author | Files | Lines |
|
What we are trying to do is change the '=' character to a NUL terminator
and then at the end of the function we restore it back to an '='. The
problem is there are two error paths where we jump to the end of the
function before we have replaced the '=' with NUL.
We end up putting the '=' in the wrong place (possibly one element
before the start of the buffer).
Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain
Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Dmitry Vyukov <dvyukov@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that writeback ensures
that no other work queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).
Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shutdown. So when
one of these wb's are woken up to do delayed work, they try to
dereference their wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.
The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
On 32-bit platform the size of long is only 32 bits which makes wrong
offset in the array of 64 bit size.
Calculate offset based on BITS_PER_LONG.
Link: http://lkml.kernel.org/r/20200109103601.45929-1-andriy.shevchenko@linux.intel.com
Fixes: 30544ed5de43 ("lib/bitmap: introduce bitmap_replace() helper")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This adds IORING_OP_EPOLL_CTL, which can perform the same work as the
epoll_ctl(2) system call.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Also make it available outside of epoll, along with the helper that
decides if we need to copy the passed in epoll_event.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We're not consistent in how the file table is grabbed and assigned if we
have a command linked that requires the use of it.
Add ->file_table to the io_op_defs[] array, and use that to determine
when to grab the table instead of having the handlers set it if they
need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES
flag. We always initialize work->files, so io-wq can just check for
that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For personalities previously registered via IORING_REGISTER_PERSONALITY,
allow any command to select them. This is done through setting
sqe->personality to the id returned from registration, and then flagging
sqe->flags with IOSQE_PERSONALITY.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If an application wants to use a ring with different kinds of
credentials, it can register them upfront. We don't lookup credentials,
the credentials of the task calling IORING_REGISTER_PERSONALITY is used.
An 'id' is returned for the application to use in subsequent personality
support.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to
be a valid io_uring fd io-wq of which will be shared with the newly
created io_uring instance. If the flag is set but it can't share io-wq,
it fails.
This allows creation of "sibling" io_urings, where we prefer to keep the
SQ/CQ private, but want to share the async backend to minimize the amount
of overhead associated with having multiple rings that belong to the same
backend.
Reported-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Daurnimator <quae@daurnimator.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Export a helper to attach to an existing io-wq, rather than setting up
a new one. This is doable now that we have reference counted io_wq's.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We currently setup the io_wq with a static set of mm and creds. Even for
a single-use io-wq per io_uring, this is suboptimal as we have may have
multiple enters of the ring. For sharing the io-wq backend, it doesn't
work at all.
Switch to passing in the creds and mm when the work item is setup. This
means that async work is no longer deferred to the io_uring mm and creds,
it is done with the current mm and creds.
Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know
they can rely on the current personality (mm and creds) being the same
for direct issue and async issue.
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit 245595e83fbedda9e107eb0b37cec0ad07733778.
Guido Günther reported issues with this patch that broke existing
user space. Let's revert it for now and fix it properly later on.
Link: https://patchwork.kernel.org/patch/11291089/
https://lore.kernel.org/lkml/20200121114553.2667556-1-arnd@arndb.de/
Cc: Guido Günther <agx@sigxcpu.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
While the document is there, it is currently missing at the
index file.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/3b8e7783b1fcc71e4f94af5ea8e5fa264392f8c4.1580193653.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
|
It seems that some files were removed from USB documentation.
Update the links accordingly.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/00008303fde6b4e06d027d3b76ae7032614a7030.1580193653.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
|
test.d/ftrace/func-filter-glob.tc is failing on s390 because it has
ARCH_INLINE_SPIN_LOCK and friends set to 'y'. So the usual
__raw_spin_lock symbol isn't in the ftrace function list. Change
'*aw*lock' to '*spin*lock' which would hopefully match some of the
locking functions on all platforms.
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
|
|
Add default MDIO_BCM_IPROC Kconfig setting such that it is default
on for IPROC family of devices.
Signed-off-by: Scott Branden <scott.branden@broadcom.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Multicast and broadcast packets can be looped from egress to ingress
pre segmentation with dev_loopback_xmit. That function unconditionally
sets ip_summed to CHECKSUM_UNNECESSARY.
udp_rcv_segment segments gso packets in the udp rx path. Segmentation
usually executes on egress, and does not expect packets of this type.
__udp_gso_segment interprets !CHECKSUM_PARTIAL as CHECKSUM_NONE. But
the offsets are not correct for gso_make_checksum.
UDP GSO packets are of type CHECKSUM_PARTIAL, with their uh->check set
to the correct pseudo header checksum. Reset ip_summed to this type.
(CHECKSUM_PARTIAL is allowed on ingress, see comments in skbuff.h)
Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The old netem mailing list was inactive and recently was targeted by
spammers. Switch to just using netdev mailing list which is where all
the real change happens.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
amd nbd that have userspace components that can run in the IO path. For
example, iscsi and nbd's userspace deamons may need to recreate a socket
and/or send IO on it, and dm-multipath's daemon multipathd may need to
send SG IO or read/write IO to figure out the state of paths and re-set
them up.
In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
memalloc_*_save/restore functions to control the allocation behavior,
but for userspace we would end up hitting an allocation that ended up
writing data back to the same device we are trying to allocate for.
The device is then in a state of deadlock, because to execute IO the
device needs to allocate memory, but to allocate memory the memory
layers want execute IO to the device.
Here is an example with nbd using a local userspace daemon that performs
network IO to a remote server. We are using XFS on top of the nbd device,
but it can happen with any FS or other modules layered on top of the nbd
device that can write out data to free memory. Here a nbd daemon helper
thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute
a request. This kicks off a reclaim operation which results in a WRITE to
the nbd device and the nbd thread calling back into the mm layer.
[ 1626.609191] msgr-worker-1 D 0 1026 1 0x00004000
[ 1626.609193] Call Trace:
[ 1626.609195] ? __schedule+0x29b/0x630
[ 1626.609197] ? wait_for_completion+0xe0/0x170
[ 1626.609198] schedule+0x30/0xb0
[ 1626.609200] schedule_timeout+0x1f6/0x2f0
[ 1626.609202] ? blk_finish_plug+0x21/0x2e
[ 1626.609204] ? _xfs_buf_ioapply+0x2e6/0x410
[ 1626.609206] ? wait_for_completion+0xe0/0x170
[ 1626.609208] wait_for_completion+0x108/0x170
[ 1626.609210] ? wake_up_q+0x70/0x70
[ 1626.609212] ? __xfs_buf_submit+0x12e/0x250
[ 1626.609214] ? xfs_bwrite+0x25/0x60
[ 1626.609215] xfs_buf_iowait+0x22/0xf0
[ 1626.609218] __xfs_buf_submit+0x12e/0x250
[ 1626.609220] xfs_bwrite+0x25/0x60
[ 1626.609222] xfs_reclaim_inode+0x2e8/0x310
[ 1626.609224] xfs_reclaim_inodes_ag+0x1b6/0x300
[ 1626.609227] xfs_reclaim_inodes_nr+0x31/0x40
[ 1626.609228] super_cache_scan+0x152/0x1a0
[ 1626.609231] do_shrink_slab+0x12c/0x2d0
[ 1626.609233] shrink_slab+0x9c/0x2a0
[ 1626.609235] shrink_node+0xd7/0x470
[ 1626.609237] do_try_to_free_pages+0xbf/0x380
[ 1626.609240] try_to_free_pages+0xd9/0x1f0
[ 1626.609245] __alloc_pages_slowpath+0x3a4/0xd30
[ 1626.609251] ? ___slab_alloc+0x238/0x560
[ 1626.609254] __alloc_pages_nodemask+0x30c/0x350
[ 1626.609259] skb_page_frag_refill+0x97/0xd0
[ 1626.609274] sk_page_frag_refill+0x1d/0x80
[ 1626.609279] tcp_sendmsg_locked+0x2bb/0xdd0
[ 1626.609304] tcp_sendmsg+0x27/0x40
[ 1626.609307] sock_sendmsg+0x54/0x60
[ 1626.609308] ___sys_sendmsg+0x29f/0x320
[ 1626.609313] ? sock_poll+0x66/0xb0
[ 1626.609318] ? ep_item_poll.isra.15+0x40/0xc0
[ 1626.609320] ? ep_send_events_proc+0xe6/0x230
[ 1626.609322] ? hrtimer_try_to_cancel+0x54/0xf0
[ 1626.609324] ? ep_read_events_proc+0xc0/0xc0
[ 1626.609326] ? _raw_write_unlock_irq+0xa/0x20
[ 1626.609327] ? ep_scan_ready_list.constprop.19+0x218/0x230
[ 1626.609329] ? __hrtimer_init+0xb0/0xb0
[ 1626.609331] ? _raw_spin_unlock_irq+0xa/0x20
[ 1626.609334] ? ep_poll+0x26c/0x4a0
[ 1626.609337] ? tcp_tsq_write.part.54+0xa0/0xa0
[ 1626.609339] ? release_sock+0x43/0x90
[ 1626.609341] ? _raw_spin_unlock_bh+0xa/0x20
[ 1626.609342] __sys_sendmsg+0x47/0x80
[ 1626.609347] do_syscall_64+0x5f/0x1c0
[ 1626.609349] ? prepare_exit_to_usermode+0x75/0xa0
[ 1626.609351] entry_SYSCALL_64_after_hwframe+0x44/0xa9
This patch adds a new prctl command that daemons can use after they have
done their initial setup, and before they start to do allocations that
are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE
flags so both userspace block and FS threads can use it to avoid the
allocation recursion and try to prevent from being throttled while
writing out data to free up memory.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Masato Suzuki <masato.suzuki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20191112001900.9206-1-mchristi@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
|
In preparation for sharing an io-wq across different users, add a
reference count that manages destruction of it.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In case of out of memory the second argument of percpu_ref_put_many() in
io_submit_sqes() may evaluate into "nr - (-EAGAIN)", that is clearly
wrong.
Fixes: 2b85edfc0c90 ("io_uring: batch getting pcpu references")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Draining the middle of a link is tricky, so leave a comment there
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For the non-vectored variant of READV/WRITEV, we don't need to setup an
async io context, and we flag that appropriately in the io_op_defs
array. However, in fixing this for the 5.5 kernel in commit 74566df3a71c
we didn't have these opcodes, so the check there was added just for the
READ_FIXED and WRITE_FIXED opcodes. Replace that check with just a
single check for needing async context, that covers all four of these
read/write variants that don't use an iovec.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
scripts/find-unused-docs.sh invokes scripts/kernel-doc to find out if a
source file contains kerneldoc or not.
However, as it passes the no longer supported "-text" option to
scripts/kernel-doc, the latter prints out its help text, causing all
files to be considered containing kerneldoc.
Get rid of these false positives by removing the no longer supported
"-text" option from the scripts/kernel-doc invocation.
Cc: stable@vger.kernel.org # 4.16+
Fixes: b05142675310d2ac ("scripts: kernel-doc: get rid of unused output formats")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20200127093107.26401-1-geert+renesas@glider.be
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
|
Move blk_queue_make_request() to dm.c:alloc_dev() so that
q->make_request_fn is never NULL during the lifetime of a DM device
(even one that is created without a DM table).
Otherwise generic_make_request() will crash simply by doing:
dmsetup create -n test
mount /dev/dm-N /mnt
While at it, move ->congested_data initialization out of
dm.c:alloc_dev() and into the bio-based specific init method.
Reported-by: Stefan Bader <stefan.bader@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1860231
Fixes: ff36ab34583a ("dm: remove request-based logic from make_request_fn wrapper")
Depends-on: c12c9a3c3860c ("dm: various cleanups to md->queue initialization code")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
This macro is never used from it was introduced in commit e6b1db98cf4d5
("security: Support early LSMs"), better to remove it.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
Commit 852c8cbf34d3 ("selftests/kselftest/runner.sh: Add 45 second
timeout per test") adds support for a new per-test-directory "settings"
file. But this only works for tests not in a sub-subdirectories, e.g.
- tools/testing/selftests/rtc (rtc) is OK,
- tools/testing/selftests/net/mptcp (net/mptcp) is not.
We have to increase the timeout for net/mptcp tests which are not
upstreamed yet but this fix is valid for other tests if they need to add
a "settings" file, see the full list with:
tools/testing/selftests/*/*/**/Makefile
Note that this patch changes the text header message printed at the end
of the execution but this text is modified only for the tests that are
in sub-subdirectories, e.g.
ok 1 selftests: net/mptcp: mptcp_connect.sh
Before we had:
ok 1 selftests: mptcp: mptcp_connect.sh
But showing the full target name is probably better, just in case a
subsubdir has the same name as another one in another subdirectory.
Fixes: 852c8cbf34d3 (selftests/kselftest/runner.sh: Add 45 second timeout per test)
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
|