Age | Commit message | Author | Files | Lines |
|
lookup_one_len already has this check.
The only effect of this patch is to return nfserr_access instead of nfserr_perm in the
zero-length-filename case. I actually prefer nfserr_perm (or _inval?), but
I doubt anyone cares.
The isdotent check seems redundant too, but I worry that some client
might actually care about that strange nfserr_exist error.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
When doing a create (mkdir/mknod) on a name, it's worth
checking whether the name already exists before returning EACCES in case
the directory is not writable by the user.
This makes return values on the client more consistent
regardless of whether the entry is present in the client's local
cache or not.
Another positive side effect is that certain programs only expect
EEXIST in that case, despite POSIX allowing any valid
error to be returned.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
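A minimal sketch of the ordering described above, using standard VFS helpers; this is not the exact nfsd code:
/* Look the name up before the write-permission check, so an existing
 * entry yields nfserr_exist even when the directory is not writable. */
dchild = lookup_one_len(fname, dentry, flen);
if (!IS_ERR(dchild) && d_really_is_positive(dchild))
        return nfserr_exist;
/* only then fall through to the access/permission checks */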
|
|
This modification is useful for debugging issues that happen while
the socket is being initialised.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
We're seeing traces of the following form:
[10952.396347] svc: transport ffff88042ba4a000 dequeued, inuse=2
[10952.396351] svc: tcp_accept ffff88042ba4a000 sock ffff88042a6e4c80
[10952.396362] nfsd: connect from 10.2.6.1, port=187
[10952.396364] svc: svc_setup_socket ffff8800b99bcf00
[10952.396368] setting up TCP socket for reading
[10952.396370] svc: svc_setup_socket created ffff8803eb10a000 (inet ffff88042b75b800)
[10952.396373] svc: transport ffff8803eb10a000 put into queue
[10952.396375] svc: transport ffff88042ba4a000 put into queue
[10952.396377] svc: server ffff8800bb0ec000 waiting for data (to = 3600000)
[10952.396380] svc: transport ffff8803eb10a000 dequeued, inuse=2
[10952.396381] svc_recv: found XPT_CLOSE
[10952.396397] svc: svc_delete_xprt(ffff8803eb10a000)
[10952.396398] svc: svc_tcp_sock_detach(ffff8803eb10a000)
[10952.396399] svc: svc_sock_detach(ffff8803eb10a000)
[10952.396412] svc: svc_sock_free(ffff8803eb10a000)
i.e. an immediate close of the socket after initialisation.
The culprit appears to be the test at the end of svc_tcp_init, which
checks if the newly created socket is in the TCP_ESTABLISHED state,
and immediately closes it if not. The evidence appears to suggest that
the socket might still be in the SYN_RECV state at this time.
The fix is to check for both states, and then to add a check in
svc_tcp_state_change() to ensure we don't close the socket when
it transitions into TCP_ESTABLISHED.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
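A sketch of the adjusted state check; variable names follow sunrpc conventions and the snippet is illustrative rather than the literal patch:
/* Accept sockets still completing the handshake as well as fully
 * established ones, instead of closing them immediately. */
switch (newsk->sk_state) {
case TCP_SYN_RECV:
case TCP_ESTABLISHED:
        break;                                  /* keep the transport */
default:
        set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
}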
|
|
If the underlying filesystem supports multiple layout types, then there
is little reason not to advertise that fact to clients and let them
choose what type to use.
Turn the ex_layout_type field into a bitfield. For each supported
layout type, we set a bit in that field. When the client requests a
layout, ensure that the bit for that layout type is set. When the
client requests attributes, send back a list of supported types.
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Reviewed-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
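An illustrative sketch of the bitfield handling; the field and error names below are assumptions, not necessarily the exact ones used by the patch:
/* export setup: record every layout type the filesystem supports */
exp->ex_layout_types |= 1 << layout_type;

/* LAYOUTGET: reject types whose bit is not set for this export */
if (!(exp->ex_layout_types & (1 << requested_type)))
        return nfserr_unknown_layouttype;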
|
|
nfsd4_release_lockowner finds a lock owner that has no lock state,
and drops cl_lock. Then release_lockowner picks up cl_lock and
unhashes the lock owner.
During the window where cl_lock is dropped, I don't see anything
preventing a concurrent nfsd4_lock from finding that same lock owner
and adding lock state to it.
Move release_lockowner() into nfsd4_release_lockowner and hang onto
the cl_lock until after the lock owner's state cannot be found
again.
Found by inspection; we don't currently have a reproducer.
Fixes: 2c41beb0e5cf ("nfsd: reduce cl_lock thrashing in ... ")
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
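A rough sketch of the locking change; the helper names are illustrative, the point being that cl_lock is held from the lookup through the unhash:
spin_lock(&clp->cl_lock);
lo = find_lockowner(clp, owner);                /* illustrative helper */
if (lo && !lockowner_has_lock_state(lo))        /* illustrative helper */
        unhash_lockowner_locked(lo);            /* still under cl_lock */
spin_unlock(&clp->cl_lock);
/* only now release the remaining references */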
|
|
These values are all multiples of 4 already, so there's no change in
behavior from this patch. But perhaps this will prevent mistakes in the
future.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Instead of creeping pnfs layout configuration into filesystems, move the
definition of block-based export operations under a more abstract
configuration.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
The current server rpc tcp code attempts to predict how much writeable
socket space will be available to a given RPC call before accepting it
for processing. On a 40GigE network, we've found this throttles
individual clients long before the network or disk is saturated. The
server may handle more clients easily, but the bandwidth of individual
clients is still artificially limited.
Instead of trying (and failing) to predict how much writeable socket space
will be available to the RPC call, just fall back to the simple model of
deferring processing until the socket is uncongested.
This may increase the risk of fast clients starving slower clients; in
such cases, the previous patch allows setting a hard per-connection
limit.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Allow the user to limit the number of requests serviced through a single
connection, to help prevent faster clients from starving slower clients.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Don't call svc_xprt_enqueue() if the XPT_DATA flag is already set.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
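The usual pattern for that kind of optimisation (a sketch, not necessarily the literal patch):
/* test_and_set_bit() returns the old bit value, so the enqueue only
 * happens on the 0 -> 1 transition of XPT_DATA */
if (!test_and_set_bit(XPT_DATA, &xprt->xpt_flags))
        svc_xprt_enqueue(xprt);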
|
|
Rather than code up our own versions of the socket callbacks, just
call the defaults.
This also allows us to merge svc_udp_data_ready() and svc_tcp_data_ready().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Prevent callbacks from triggering while we're detaching the socket.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Dropping and/or deferring requests has an impact on performance. Let's
make sure we can trace those events.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Add a tracepoint to track when the processing of incoming RPC data gets
deferred due to out-of-space issues on the outgoing transport.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Silence a few smatch warnings about indentation
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Those are now defined in fs/nfsd/vfs.h
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Have a simple flex file server where the mds (NFSv4.1 or NFSv4.2)
is also the ds (NFSv3). I.e., the metadata and the data file are
the exact same file.
This will allow testing of the flex file client.
Simply add the "pnfs" export option to your export
in /etc/exports and mount from a client that supports
flex files.
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
GSS-Proxy doesn't produce very much debug logging at all. Printing out
the gss minor status will aid in troubleshooting if the
GSS_Accept_sec_context upcall fails.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
This addresses the conundrum referenced in RFC5661 18.35.3,
and will allow clients to return state to the server using the
machine credentials.
The biggest part of the problem is that we need to allow the client
to send a compound op with integrity/privacy on mounts that don't
have it enabled.
Add server support for properly decoding and using spo_must_enforce
and spo_must_allow bits. Add support for machine credentials to be
used for CLOSE, OPEN_DOWNGRADE, LOCKU, DELEGRETURN,
and TEST/FREE STATEID.
Implement a check so as to not throw WRONGSEC errors when these
operations are used if integrity/privacy isn't turned on.
Without this, Linux clients with credentials that expired while holding
delegations were getting stuck in an endless loop.
Signed-off-by: Andrew Elble <aweits@rit.edu>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
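A heavily simplified sketch of the added check; nfsd4_spo_must_allow() and the control flow here are assumptions for illustration only:
/* let the listed ops through with machine credentials rather than
 * returning WRONGSEC when integrity/privacy is not in use */
if (nfsd4_spo_must_allow(rqstp) && nfsd4_mach_creds_match(clp, rqstp))
        return nfs_ok;
return nfserr_wrongsec;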
|
|
Rename mach_creds_match() to nfsd4_mach_creds_match() and make it non-static.
Signed-off-by: Andrew Elble <aweits@rit.edu>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
So these may be used in nfsd as well
Signed-off-by: Andrew Elble <aweits@rit.edu>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
This field is not currently in use.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
We're always tracing IPv4 or IPv6 addresses, so we can save a lot
of space on the ringbuffer by allocating the correct sockaddr size.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Fixes: 83a712e0afef "sunrpc: add some tracepoints around ..."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
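The sizing decision boils down to the standard per-family pattern (sketch):
size_t salen = (sap->sa_family == AF_INET) ?
                sizeof(struct sockaddr_in) :
                sizeof(struct sockaddr_in6);
/* record only salen bytes instead of a full struct sockaddr_storage */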
|
|
(Another one for the f_path debacle.)
ltp fcntl33 testcase caused an Oops in selinux_file_send_sigiotask.
The reason is that generic_add_lease() used filp->f_path.dentry->d_inode
while all the others use file_inode(). This makes a difference for files
opened on overlayfs, since the former points to the overlay inode while the
latter points to the underlying inode.
So generic_add_lease() added the lease to the overlay inode and
generic_delete_lease() removed it from the underlying inode. When the file
was released, the lease remained on the overlay inode's lock list, resulting
in a use-after-free.
Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
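The safe pattern on stacked filesystems such as overlayfs, as the message describes:
struct inode *inode = file_inode(filp);   /* underlying (real) inode    */
/* rather than: filp->f_path.dentry->d_inode -- the overlay inode here  */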
|
|
If the lockd service fails to start up then we need to be sure that the
notifier blocks are not registered, otherwise a subsequent start of the
service could cause the same notifier to be registered twice, leading to
soft lockups.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 0751ddf77b6a "lockd: Register callbacks on the inetaddr_chain..."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
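A sketch of the error-path cleanup; the notifier block names are assumptions based on lockd's naming:
out_err:
        unregister_inetaddr_notifier(&lockd_inetaddr_notifier);
#if IS_ENABLED(CONFIG_IPV6)
        unregister_inet6addr_notifier(&lockd_inet6addr_notifier);
#endif
        return error;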
|
|
|
|
Commit fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes.
This is a temporary fix which upgrades the shared inode lock to an
exclusive lock only when we have delayed items until we come up with a
more complete solution. While we're here, rename the
btrfs_{get,put}_delayed_items functions to make it very clear that
they're just for readdir.
Tested with xfstests and by doing a parallel kernel build:
while make tinyconfig && make -j4 && git clean -dqfx; do
:
done
along with a bunch of parallel finds in another shell:
while true; do
for ((i=0; i<4; i++)); do
find . >/dev/null &
done
wait
done
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Commit b235beea9e99 ("Clarify naming of thread info/stack allocators")
breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:
kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
kernel/fork.c:355:8: error: assignment from incompatible pointer type
stack = alloc_thread_stack_node(tsk, node);
^
Fix it by renaming free_stack() to free_thread_stack(), and updating the
return type of alloc_thread_stack_node().
Fixes: b235beea9e99 ("Clarify naming of thread info/stack allocators")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When I replaced kasprintf("%pf") with a direct call to
sprint_symbol_no_offset I must have broken the initcall blacklisting
feature on the arches where dereference_function_descriptor() is
non-trivial.
Fixes: c8cdd2be213f (init/main.c: simplify initcall_blacklisted())
Link: http://lkml.kernel.org/r/1466027283-4065-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
__vfs_write() returns a negative value in an error case.
Link: http://lkml.kernel.org/r/20160616083108.6278.65815.stgit@pluto.themaw.net
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The code dereferences page_ext before checking it. Let's check it first
and then use it.
Fixes: f86e4271978b ("mm: check the return value of lookup_page_ext for all call sites")
Link: http://lkml.kernel.org/r/1465249059-7883-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
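The check-before-use pattern the fix amounts to (sketch):
page_ext = lookup_page_ext(page);
if (unlikely(!page_ext))
        return;
/* only dereference page_ext after the NULL check */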
|
|
Trivial fix to a spelling mistake.
Link: http://lkml.kernel.org/r/1466672144-831-1-git-send-email-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The value `bytes' comes from the filesystem which is about to be
mounted. We cannot trust that the value is always in the range we
expect it to be.
Check its value before using it to calculate the length for the crc32_le
call. Its value must be larger than (or equal to) sumoff + 4.
This fixes a kernel bug when accidentally mounting an image file which
had the nilfs2 magic value 0x3434 at the right offset 0x406 by chance.
The bytes 0x01 0x00 were stored at 0x408 and were interpreted as an
s_bytes value of 1. This caused an underflow when subtracting sumoff +
4 (20) in the call to crc32_le.
BUG: unable to handle kernel paging request at ffff88021e600000
IP: crc32_le+0x36/0x100
...
Call Trace:
nilfs_valid_sb.part.5+0x52/0x60 [nilfs2]
nilfs_load_super_block+0x142/0x300 [nilfs2]
init_nilfs+0x60/0x390 [nilfs2]
nilfs_mount+0x302/0x520 [nilfs2]
mount_fs+0x38/0x160
vfs_kern_mount+0x67/0x110
do_mount+0x269/0xe00
SyS_mount+0x9f/0x100
entry_SYSCALL_64_fastpath+0x16/0x71
Link: http://lkml.kernel.org/r/1466778587-5184-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
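A simplified sketch of the added sanity check; the field and macro names are based on the description above and may not match the patch exactly:
bytes = le16_to_cpu(sbp->s_bytes);
if (bytes < sumoff + 4 || bytes > BLOCK_SIZE)
        return 0;       /* reject: the crc32_le length would underflow */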
|
|
Tetsuo has reported the following potential oom_killer_disable vs.
oom_reaper race:
(1) freeze_processes() starts freezing user space threads.
(2) Somebody (maybe a kernel thread) calls out_of_memory().
(3) The OOM killer calls mark_oom_victim() on a user space thread
P1 which is already in __refrigerator().
(4) oom_killer_disable() sets oom_killer_disabled = true.
(5) P1 leaves __refrigerator() and enters do_exit().
(6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
exit_oom_victim(P1).
(7) oom_killer_disable() returns while P1 has not yet finished.
(8) P1 performs IO/interferes with the freezer.
This situation is unfortunate. We cannot move oom_killer_disable after
all the freezable kernel threads are frozen because the oom victim might
depend on some of those kthreads to make a forward progress to exit so
we could deadlock. It is also far from trivial to teach the oom_reaper
to not call exit_oom_victim() because then we would lose a guarantee of
the OOM killer and oom_killer_disable forward progress because
exit_mm->mmput might block and never call exit_oom_victim.
It seems the easiest way forward is to workaround this race by calling
try_to_freeze_tasks again after oom_killer_disable. This will make sure
that all the tasks are frozen or it bails out.
Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
According to some high-load testing, these two BUG assertions were
hit, which led to a system panic. There have actually been some
discussions about removing these two BUG() assertions, and doing so would
not bring any side effects.
So I made the following changes:
1) use the existing macro CATCH_BH_JBD_RACES to wrap BUG() in the
ocfs2_read_blocks_sync function like before.
2) disable the macro CATCH_BH_JBD_RACES in Makefile by default.
Link: http://lkml.kernel.org/r/1466574294-26863-1-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
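Change 1) amounts to the usual compile-time guard (a sketch; the exact assertion wrapped may differ):
#ifdef CATCH_BH_JBD_RACES
        BUG_ON(buffer_jbd(bh));
#endif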
|
|
If the memory compaction free scanner cannot successfully split a free
page (only possible due to per-zone low watermark), terminate the free
scanner rather than continuing to scan memory needlessly. If the
watermark is insufficient for a free page of order <= cc->order, then
terminate the scanner since all future splits will also likely fail.
This prevents the compaction freeing scanner from scanning all memory on
very large zones (very noticeable for zones > 128GB, for instance) when
all splits will likely fail while holding zone->lock.
compaction_alloc() iterating a 128GB zone has been benchmarked to take
over 400ms on some systems whereas any free page isolated and ready to
be split ends up failing in split_free_page() because of the low
watermark check and thus the iteration continues.
The next time compaction occurs, the freeing scanner will likely start
at the end of the zone again since no success was made previously and we
get the same lengthy iteration until the zone is brought above the low
watermark. All thp page faults can take >400ms in such a state without
this fix.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211820350.97086@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When kmemleak dumps contents of leaked objects it reads whole objects
regardless of user-requested size. This upsets KASAN. Disable KASAN
checks around object dump.
Link: http://lkml.kernel.org/r/1466617631-68387-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
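The usual way to make such an intentional read invisible to KASAN; the dump call here is only illustrative, not necessarily the one kmemleak uses:
kasan_disable_current();
print_hex_dump(KERN_DEBUG, "", DUMP_PREFIX_NONE, 16, 1, ptr, len, true);
kasan_enable_current();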
|
|
While working on s390 support for gigantic hugepages I ran into the
following "Bad page state" warning when freeing gigantic pages:
BUG: Bad page state in process bash pfn:580001
page:000003d116000040 count:0 mapcount:0 mapping:ffffffff00000000 index:0x0
flags: 0x7fffc0000000000()
page dumped because: non-NULL mapping
This is because page->compound_mapcount, which is part of a union with
page->mapping, is initialized with -1 in prep_compound_gigantic_page(),
and not cleared again during destroy_compound_gigantic_page(). Fix this
by clearing the compound_mapcount in destroy_compound_gigantic_page()
before clearing compound_head.
Interestingly enough, the warning will not show up on x86_64, although
this should not be architecture specific. Apparently there is an
endianness issue, combined with the fact that the union contains both a
64 bit ->mapping pointer and a 32 bit atomic_t ->compound_mapcount as
members. The resulting bogus page->mapping on x86_64 therefore contains
00000000ffffffff instead of ffffffff00000000 on s390, which will falsely
trigger the PageAnon() check in free_pages_prepare() because
page->mapping & PAGE_MAPPING_ANON is true on little-endian architectures
like x86_64 in this case (the page is not compound anymore,
->compound_head was already cleared before). As a result, page->mapping
will be cleared before doing the checks in free_pages_check().
Not sure if the bogus "PageAnon() returning true" on x86_64 for the
first tail page of a gigantic page (at this stage) has other theoretical
implications, but they would also be fixed with this patch.
Link: http://lkml.kernel.org/r/1466612719-5642-1-git-send-email-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently we can have compound pages held on per-CPU pagevecs, which
leads to a lot of memory being unavailable for reclaim when needed. On
systems with hundreds of processors it can be GBs of memory.
One of the ways of reproducing the problem is to not call munmap
explicitly on all mapped regions (i.e. after receiving SIGTERM). After
that some pages (with THP enabled, also huge pages) may end up on
lru_add_pvec, as in the example below.
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* build with: gcc -fopenmp */
int main(void)
{
#pragma omp parallel
    {
        size_t size = 55 * 1000 * 1000; // smaller than MEM/CPUS
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED)
            memset(p, 0, size);
        //munmap(p, size); // uncomment to make the problem go away
    }
    return 0;
}
When we run it with THP enabled it will leave a significant amount of
memory on lru_add_pvec. This memory will not be reclaimed if we hit
OOM, so when we run the above program in a loop:
for i in `seq 100`; do ./a.out; done
many processes (95% in my case) will be killed by OOM.
The primary point of the LRU add cache is to reduce zone lru_lock
contention, with the hope that more pages will belong to the same zone and
so their addition can be batched. A huge page is already a form of
batched addition (it adds 512 pages' worth of memory in one go), so skipping
the batching seems like a safer option when compared to a potential
excess in the caching, which can be quite large and much harder to fix
because lru_add_drain_all is way too expensive and it is not really clear
what would be a good moment to call it.
Similarly we can reproduce the problem on lru_deactivate_pvec by adding:
madvise(p, size, MADV_FREE); after memset.
This patch flushes lru pvecs on compound page arrival, making the problem
less severe - after applying it the kill rate of the above example drops to
0%, because the maximum amount of memory held on a pvec drops from 28MB
(with THP) to 56kB per CPU.
Suggested-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.com
Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Ming Li <mingli199x@qq.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
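A sketch of the flush-on-compound-page idea in __lru_cache_add()-style code (simplified):
get_page(page);
if (!pagevec_add(pvec, page) || PageCompound(page))
        __pagevec_lru_add(pvec);        /* drain instead of caching THP */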
|
|
mem_cgroup_css_alloc() was returning NULL on failure while cgroup core
expected it to return an ERR_PTR value, leading to the following NULL
deref after a css allocation failure. Fix it by returning
ERR_PTR(-ENOMEM) instead. I'll also update cgroup core so that it
can handle NULL returns.
mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO)
CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
...
Call Trace:
dump_stack+0x68/0xa1
warn_alloc_failed+0xd6/0x130
__alloc_pages_nodemask+0x4c6/0xf20
alloc_pages_current+0x66/0xe0
alloc_kmem_pages+0x14/0x80
kmalloc_order_trace+0x2a/0x1a0
__kmalloc+0x291/0x310
memcg_update_all_caches+0x6c/0x130
mem_cgroup_css_alloc+0x590/0x610
cgroup_apply_control_enable+0x18b/0x370
cgroup_mkdir+0x1de/0x2e0
kernfs_iop_mkdir+0x55/0x80
vfs_mkdir+0xb9/0x150
SyS_mkdir+0x66/0xd0
do_syscall_64+0x53/0x120
entry_SYSCALL64_slow_path+0x25/0x25
...
BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
IP: init_and_link_css+0x37/0x220
PGD 34b1e067 PUD 3a109067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in:
CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014
task: ffff88007cbc5200 ti: ffff8800666d4000 task.ti: ffff8800666d4000
RIP: 0010:[<ffffffff810f2ca7>] [<ffffffff810f2ca7>] init_and_link_css+0x37/0x220
RSP: 0018:ffff8800666d7d90 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffffff810f2499 RSI: 0000000000000000 RDI: 0000000000000008
RBP: ffff8800666d7db8 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88005a5fb400
R13: ffffffff81f0f8a0 R14: ffff88005a5fb400 R15: 0000000000000010
FS: 00007fc944689700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3aed0d2b80 CR3: 000000003a1e8000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
cgroup_apply_control_enable+0x1ac/0x370
cgroup_mkdir+0x1de/0x2e0
kernfs_iop_mkdir+0x55/0x80
vfs_mkdir+0xb9/0x150
SyS_mkdir+0x66/0xd0
do_syscall_64+0x53/0x120
entry_SYSCALL64_slow_path+0x25/0x25
Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 <48> c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8
RIP init_and_link_css+0x37/0x220
RSP <ffff8800666d7d90>
CR2: 00000000000000d0
---[ end trace a2d8836ae1e852d1 ]---
Link: http://lkml.kernel.org/r/20160621165740.GJ3262@mtj.duckdns.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
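The convention in question, sketched from both sides; the labels and caller shown here are for illustration only:
/* mem_cgroup_css_alloc() failure path */
fail:
        return ERR_PTR(-ENOMEM);        /* was: return NULL; */

/* cgroup core expects an ERR_PTR, so a bare NULL slips past this check */
css = ss->css_alloc(parent_css);
if (IS_ERR(css))
        return ERR_CAST(css);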
|
|
mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
with irqs disabled from migrate_page_copy(). This ends up enabling irqs
while holding an irq context lock, triggering the following lockdep
warning. Fix it by using irq_save/restore instead.
=================================
[ INFO: inconsistent lock state ]
4.7.0-rc1+ #52 Tainted: G W
---------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<000000000038fd96>] aio_migratepage+0x156/0x1e8
{IN-SOFTIRQ-W} state was registered at:
__lock_acquire+0x5b6/0x1930
lock_acquire+0xee/0x270
_raw_spin_lock_irqsave+0x66/0xb0
aio_complete+0x98/0x328
dio_complete+0xe4/0x1e0
blk_update_request+0xd4/0x450
scsi_end_request+0x48/0x1c8
scsi_io_completion+0x272/0x698
blk_done_softirq+0xca/0xe8
__do_softirq+0xc8/0x518
irq_exit+0xee/0x110
do_IRQ+0x6a/0x88
io_int_handler+0x11a/0x25c
__mutex_unlock_slowpath+0x144/0x1d8
__mutex_unlock_slowpath+0x140/0x1d8
kernfs_iop_permission+0x64/0x80
__inode_permission+0x9e/0xf0
link_path_walk+0x6e/0x510
path_lookupat+0xc4/0x1a8
filename_lookup+0x9c/0x160
user_path_at_empty+0x5c/0x70
SyS_readlinkat+0x68/0x140
system_call+0xd6/0x270
irq event stamp: 971410
hardirqs last enabled at (971409): migrate_page_move_mapping+0x3ea/0x588
hardirqs last disabled at (971410): _raw_spin_lock_irqsave+0x3c/0xb0
softirqs last enabled at (970526): __do_softirq+0x460/0x518
softirqs last disabled at (970519): irq_exit+0xee/0x110
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&ctx->completion_lock)->rlock);
<Interrupt>
lock(&(&ctx->completion_lock)->rlock);
*** DEADLOCK ***
3 locks held by kcompactd0/151:
#0: (&(&mapping->private_lock)->rlock){+.+.-.}, at: aio_migratepage+0x42/0x1e8
#1: (&ctx->ring_lock){+.+.+.}, at: aio_migratepage+0x5a/0x1e8
#2: (&(&ctx->completion_lock)->rlock){+.?.-.}, at: aio_migratepage+0x156/0x1e8
stack backtrace:
CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G W 4.7.0-rc1+ #52
Call Trace:
show_trace+0xea/0xf0
show_stack+0x72/0xf0
dump_stack+0x9a/0xd8
print_usage_bug.part.27+0x2d4/0x2e8
mark_lock+0x17e/0x758
mark_held_locks+0xa2/0xd0
trace_hardirqs_on_caller+0x140/0x1c0
mem_cgroup_migrate+0x266/0x370
aio_migratepage+0x16a/0x1e8
move_to_new_page+0xb0/0x260
migrate_pages+0x8f4/0x9f0
compact_zone+0x4dc/0xdc8
kcompactd_do_work+0x1aa/0x358
kcompactd+0xba/0x2c8
kthread+0x10a/0x110
kernel_thread_starter+0x6/0xc
kernel_thread_starter+0x0/0xc
INFO: lockdep is turned off.
Link: http://lkml.kernel.org/r/20160620184158.GO3262@mtj.duckdns.org
Link: http://lkml.kernel.org/g/5767CFE5.7080904@de.ibm.com
Fixes: 74485cf2bc85 ("mm: migrate: consolidate mem_cgroup_migrate() calls")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
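The interrupt-state-preserving variant the fix switches to (sketch):
unsigned long flags;

local_irq_save(flags);
/* ... update the memcg statistics for the migrated page ... */
local_irq_restore(flags);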
|
|
We account HugeTLB's shared page table to all processes who share it.
The accounting happens during huge_pmd_share().
If somebody populates the pud entry under us, we should decrease the page
table's refcount and decrease the nr_pmds of the process.
By mistake, I increased nr_pmds again in this case. :-/ It will lead to
"BUG: non-zero nr_pmds on freeing mm: 2" on the process' exit.
Let's fix this by increasing nr_pmds only when we're sure that the page
table will be used.
Link: http://lkml.kernel.org/r/20160617122506.GC6534@node.shutemov.name
Fixes: dc6c9a35b66b ("mm: account pmd page tables to the process")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: zhongjiang <zhongjiang@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
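A sketch of the corrected accounting in huge_pmd_share(), simplified and not guaranteed to match the patch line for line:
spin_lock(ptl);
if (pud_none(*pud)) {
        pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK));
        mm_inc_nr_pmds(mm);             /* count only when actually used */
} else {
        put_page(virt_to_page(spte));   /* lost the race: drop, don't count */
}
spin_unlock(ptl);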
|
|
This reverts commit d0834a6c2c5b0c76cfb806bd7dba6556d8b4edbb.
After revert of 5c0a85fad949 ("mm: make faultaround produce old ptes")
faultaround doesn't have dependencies on hardware accessed bit, so let's
revert this one too.
Link: http://lkml.kernel.org/r/1465893750-44080-3-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit 5c0a85fad949212b3e059692deecdeed74ae7ec7.
The commit causes ~6% regression in unixbench.
Let's revert it for now and consider other solution for reclaim problem
later.
Link: http://lkml.kernel.org/r/1465893750-44080-2-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There are different versions of Boris' name and email in the log, and
one typo. Add his emails in mailmap to have all of his contributions
under the same name/email tuple.
Link: http://lkml.kernel.org/r/20160609130323.27706-2-antoine.tenart@free-electrons.com
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I used "Antoine Ténart" at first but then moved to a name without accent
as this cause some issues from time to time... Add my email in the
mailmap file to have a consistent shortlog output.
Link: http://lkml.kernel.org/r/20160609130323.27706-1-antoine.tenart@free-electrons.com
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Cc: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable
to sleep, unwilling to sleep and avoiding waking kswapd") modified
__GFP_WAIT to explicitly identify the difference between atomic callers
and those that were unwilling to sleep. Later the definition was
removed entirely.
The GFP_RECLAIM_MASK is the set of flags that affect watermark checking
and reclaim behaviour but __GFP_ATOMIC was never added. Without it,
atomic users of the slab allocator strip the __GFP_ATOMIC flag and
cannot access the page allocator atomic reserves. This patch addresses
the problem.
The user-visible impact depends on the workload, but potentially atomic
allocations unnecessarily fail without this patch.
Link: http://lkml.kernel.org/r/20160610093832.GK2527@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Marcin Wojtas <mw@semihalf.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|