Age | Commit message (Collapse) | Author | Files | Lines |
|
Use kernel.h macro definition.
Thanks to Julia Lawall for Coccinelle scripting support.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The commit cf108bca465d: "ext4: Invert the locking order of page_lock
and transaction start" caused __ext4_journalled_writepage() to drop
the page lock before the page was written back, as part of changing
the locking order to jbd2_journal_start -> page_lock. However, this
introduced a potential race if there was a truncate racing with the
data=journalled writeback mode.
Fix this by grabbing the page lock after starting the journal handle,
and then checking to see if page had gotten truncated out from under
us.
This fixes a number of different warnings or BUG_ON's when running
xfstests generic/086 in data=journalled mode, including:
jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7
c0, 164), jh->b_transaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0
- and -
kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
...
Call Trace:
[<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
[<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
[<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
[<c027d883>] ? lock_buffer+0x36/0x36
[<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
[<c0229139>] do_invalidatepage+0x22/0x26
[<c0229198>] truncate_inode_page+0x5b/0x85
[<c022934b>] truncate_inode_pages_range+0x156/0x38c
[<c0229592>] truncate_inode_pages+0x11/0x15
[<c022962d>] truncate_pagecache+0x55/0x71
[<c02b913b>] ext4_setattr+0x4a9/0x560
[<c01ca542>] ? current_kernel_time+0x10/0x44
[<c026c4d8>] notify_change+0x1c7/0x2be
[<c0256a00>] do_truncate+0x65/0x85
[<c0226f31>] ? file_ra_state_init+0x12/0x29
- and -
WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
irty_metadata+0x14a/0x1ae()
...
Call Trace:
[<c01b879f>] ? console_unlock+0x3a1/0x3ce
[<c082cbb4>] dump_stack+0x48/0x60
[<c0178b65>] warn_slowpath_common+0x89/0xa0
[<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
[<c0178bef>] warn_slowpath_null+0x14/0x18
[<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
[<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
[<c02b2f44>] write_end_fn+0x40/0x53
[<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
[<c02b59e7>] ext4_writepage+0x354/0x3b8
[<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
[<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
[<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
[<c02b5a5b>] __writepage+0x10/0x2e
[<c0225956>] write_cache_pages+0x22d/0x32c
[<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
[<c02b6ee8>] ext4_writepages+0x102/0x607
[<c019adfe>] ? sched_clock_local+0x10/0x10e
[<c01a8a7c>] ? __lock_is_held+0x2e/0x44
[<c01a8ad5>] ? lock_is_held+0x43/0x51
[<c0226dff>] do_writepages+0x1c/0x29
[<c0276bed>] __writeback_single_inode+0xc3/0x545
[<c0277c07>] writeback_sb_inodes+0x21f/0x36d
...
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
We currently don't correctly handle the case where blocksize !=
pagesize, so disallow the mount in those cases.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch implements fallocate's FALLOC_FL_INSERT_RANGE for Ext4.
1) Make sure that both offset and len are block size aligned.
2) Update the i_size of inode by len bytes.
3) Compute the file's logical block number against offset. If the computed
block number is not the starting block of the extent, split the extent
such that the block number is the starting block of the extent.
4) Shift all the extents which are lying between [offset, last allocated extent]
towards right by len bytes. This step will make a hole of len bytes
at offset.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
|
|
jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
frequently called for buffers that are already part of the running
transaction - most frequently it is the case for bitmaps, inode table
blocks, and superblock. Since in such cases we have nothing to do, it is
unfortunate we still grab reference to journal head, lock the bh, lock
bh_state only to find out there's nothing to do.
Improving this is a bit subtle though since until we find out journal
head is attached to the running transaction, it can disappear from under
us because checkpointing / commit decided it's no longer needed. We deal
with this by protecting journal_head slab with RCU. We still have to be
careful about journal head being freed & reallocated within slab and
about exposing journal head in consistent state (in particular
b_modified and b_frozen_data must be in correct state before we allow
user to touch the buffer).
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Check for the simple case of unjournaled buffer first, handle it and
bail out. This allows us to remove one if and unindent the difficult case
by one tab. The result is easier to read.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We were acquiring bh_state_lock when allocation of buffer failed in
do_get_write_access() only to be able to jump to a label that releases
the lock and does all other checks that don't make sense for this error
path. Just jump into the right label instead.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
needs_copy is set only in one place in do_get_write_access(), just move
the frozen buffer copying into that place and factor it out to a
separate function to make do_get_write_access() slightly more readable.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
[ Added another sparse fix for EXT4_IOC_GET_ENCRYPTION_POLICY while
we're at it. --tytso ]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
During a source code review of fs/ext4/extents.c I noted identical
consecutive lines. An assertion is repeated for inode1 and never done
for inode2. This is not in keeping with the rest of the code in the
ext4_swap_extents function and appears to be a bug.
Assert that the inode2 mutex is not locked.
Signed-off-by: David Moore <dmoorefo@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|
|
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently ext4_mb_good_group() only returns 0 or 1 depending on whether
the allocation group is suitable for use or not. However we might get
various errors and fail while initializing new group including -EIO
which would never get propagated up the call chain. This might lead to
an endless loop at writeback when we're trying to find a good group to
allocate from and we fail to initialize new group (read error for
example).
Fix this by returning proper error code from ext4_mb_good_group() and
using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator()
we will always return only the first occurred error from
ext4_mb_good_group() and we only propagate it back to the caller if we
do not get any other errors and we fail to allocate any blocks.
Note that with other modes than errors=continue, we will fail
immediately in ext4_mb_good_group() in case of error, however with
errors=continue we should try to continue using the file system, that's
why we're not going to fail immediately when we see an error from
ext4_mb_good_group(), but rather when we fail to find a suitable block
group to allocate from due to an problem in group initialization.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Currently on the machines with page size > block size when initializing
block group buddy cache we initialize it for all the block group bitmaps
in the page. However in the case of read error, checksum error, or if
a single bitmap is in any way corrupted we would fail to initialize all
of the bitmaps. This is problematic because we will not have access to
the other allocation groups even though those might be perfectly fine
and usable.
Fix this by reading all the bitmaps instead of error out on the first
problem and simply skip the bitmaps which were either not read properly,
or are not valid.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
If we want to rely on the buffer_verified() flag of the block bitmap
buffer, we have to set it consistently. However currently if we're
initializing uninitialized block bitmap in
ext4_read_block_bitmap_nowait() we're not going to set buffer verified
at all.
We can do this by simply setting the flag on the buffer, but I think
it's actually better to run ext4_validate_block_bitmap() to make sure
that what we did in the ext4_init_block_bitmap() is right.
So run ext4_validate_block_bitmap() even after the block bitmap
initialization. Also bail out early from ext4_validate_block_bitmap() if
we see corrupt bitmap, since we already know it's corrupt and we do not
need to verify that.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This basically reverts 47def82672b3 (jbd2: Remove __GFP_NOFAIL from jbd2
layer). The deprecation of __GFP_NOFAIL was a bad choice because it led
to open coding the endless loop around the allocator rather than
removing the dependency on the non failing allocation. So the
deprecation was a clear failure and the reality tells us that
__GFP_NOFAIL is not even close to go away.
It is still true that __GFP_NOFAIL allocations are generally discouraged
and new uses should be evaluated and an alternative (pre-allocations or
reservations) should be considered but it doesn't make any sense to lie
the allocator about the requirements. Allocator can take steps to help
making a progress if it knows the requirements.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: David Rientjes <rientjes@google.com>
|
|
Previously we allocated bounce pages using a combination of
alloc_page() and mempool_alloc() with the __GFP_WAIT bit set.
Instead, use mempool_alloc() with GFP_NOWAIT. The mempool_alloc()
function will try using alloc_pages() initially, and then only use the
mempool reserve of pages if alloc_pages() is unable to fulfill the
request.
This minimizes the the impact on the mm layer when we need to do a
large amount of writeback of encrypted files, as Jaeguk Kim had
reported that under a heavy fio workload on a system with restricted
amounts memory (which unfortunately, includes many mobile handsets),
he had observed the the OOM killer getting triggered several times.
Using GFP_NOWAIT
If the mempool_alloc() function fails, we will retry the page
writeback at a later time; the function of the mempool is to ensure
that we can writeback at least 32 pages at a time, so we can more
efficiently dispatch I/O under high memory pressure situations. In
the future we should make this be a tunable so we can determine the
best tradeoff between permanently sequestering memory and the ability
to quickly launder pages so we can free up memory quickly when
necessary.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Crypto resource should be released when ext4 module exits, otherwise
it will cause memory leak.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Fix up attempts by users to try to write to a file when they don't
have access to the encryption key.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Previously we were taking the required padding when allocating space
for the on-disk symlink. This caused a buffer overrun which could
trigger a krenel crash when running fsstress.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Fix a potential memory leak where fname->crypto_buf.name wouldn't get
freed in some error paths, and also make the error handling easier to
understand/audit.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Thanks to Chao Yu <chao2.yu@samsung.com> for pointing out we were
missing this check.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Thanks to Chao Yu <chao2.yu@samsung.com> for pointing out the need for
this check.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Factor out calls to ext4_inherit_context() and move them to
__ext4_new_inode(); this fixes a problem where ext4_tmpfile() wasn't
calling calling ext4_inherit_context(), so the temporary file wasn't
getting protected. Since the blocks for the tmpfile could end up on
disk, they really should be protected if the tmpfile is created within
the context of an encrypted directory.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Set up the encryption information for newly created inodes immediately
after they inherit their encryption context from their parent
directories.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_encrypted_zeroout() could end up leaking a bio and bounce page.
Fortunately it's not used much. While we're fixing things up,
refactor out common code into the static function alloc_bounce_page()
and fix up error handling if mempool_alloc() fails.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
As suggested by Herbert Xu, we shouldn't allocate a new tfm each time
we read or write a page. Instead we can use a single tfm hanging off
the inode's crypt_info structure for all of our encryption needs for
that inode, since the tfm can be used by multiple crypto requests in
parallel.
Also use cmpxchg() to avoid races that could result in crypt_info
structure getting doubly allocated or doubly freed.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
On arm64 this is apparently needed for CTS mode to function correctly.
Otherwise attempts to use CTS return ENOENT.
Change-Id: I732ea9a5157acc76de5b89edec195d0365f4ca63
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Some fields are only used when the crypto_ctx is being used on the
read path, some are only used on the write path, and some are only
used when the structure is on free list. Optimize memory use by using
a union.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The ci_mode field was superfluous, and getting rid of it gets rid of
an unused hole in the structure.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use slab caches the ext4_crypto_ctx and ext4_crypt_info structures for
slighly better memory efficiency and debuggability.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The superblock fields s_file_encryption_mode and s_dir_encryption_mode
are vestigal, so remove them as a cleanup. While we're at it, allow
file systems with both encryption and inline_data enabled at the same
time to work correctly. We can't have encrypted inodes with inline
data, but there's no reason to prohibit unencrypted inodes from using
the inline data feature.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This is a pretty massive patch which does a number of different things:
1) The per-inode encryption information is now stored in an allocated
data structure, ext4_crypt_info, instead of directly in the node.
This reduces the size usage of an in-memory inode when it is not
using encryption.
2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode
encryption structure instead. This remove an unnecessary memory
allocation and free for the fname_crypto_ctx as well as allowing us
to reuse the ctfm in a directory for multiple lookups and file
creations.
3) We also cache the inode's policy information in the ext4_crypt_info
structure so we don't have to continually read it out of the
extended attributes.
4) We now keep the keyring key in the inode's encryption structure
instead of releasing it after we are done using it to derive the
per-inode key. This allows us to test to see if the key has been
revoked; if it has, we prevent the use of the derived key and free
it.
5) When an inode is released (or when the derived key is freed), we
will use memset_explicit() to zero out the derived key, so it's not
left hanging around in memory. This implies that when a user logs
out, it is important to first revoke the key, and then unlink it,
and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to
release any decrypted pages and dcache entries from the system
caches.
6) All this, and we also shrink the number of lines of code by around
100. :-)
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use struct ext4_encryption_key only for the master key passed via the
kernel keyring.
For internal kernel space users, we now use struct ext4_crypt_info.
This will allow us to put information from the policy structure so we
can cache it and avoid needing to constantly looking up the extended
attribute. We will do this in a spearate patch. This patch is mostly
mechnical to make it easier for patch review.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Encrypt the filename as soon it is passed in by the user. This avoids
our needing to encrypt the filename 2 or 3 times while in the process
of creating a filename.
Similarly, when looking up a directory entry, encrypt the filename
early, or if the encryption key is not available, base-64 decode the
file syystem so that the hash value and the last 16 bytes of the
encrypted filename is available in the new struct ext4_filename data
structure.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
|
|
Two watchdog changes that came through different trees had a non
conflicting conflict, that is, one changed the semantics of a variable
but no actual code conflict happened. So the merge appeared fine, but
the resulting code did not behave as expected.
Commit 195daf665a62 ("watchdog: enable the new user interface of the
watchdog mechanism") changes the semantics of watchdog_user_enabled,
which thereafter is only used by the functions introduced by
b3738d293233 ("watchdog: Add watchdog enable/disable all functions").
There further appears to be a distinct lack of serialization between
setting and using watchdog_enabled, so perhaps we should wrap the
{en,dis}able_all() things in watchdog_proc_mutex.
This patch fixes a s2r failure reported by Michal; which I cannot
readily explain. But this does make the code internally consistent
again.
Reported-and-tested-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In commit 8ff16cf77ce3 ("Documentation: devicetree: m25p80: add "nor-jedec"
binding"), we added a generic "nor-jedec" binding to catch all
mostly-compatible SPI NOR flash which can be detected via the READ ID
opcode (0x9F). This was discussed and reviewed at the time, however
objections have come up since then as part of this discussion:
http://lkml.kernel.org/g/20150511224646.GJ32500@ld-irv-0074
It seems the parties involved agree that "jedec,spi-nor" does a better
job of capturing the fact that this is SPI-specific, not just any NOR
flash.
This binding was only merged for v4.1-rc1, so it's still OK to change
the naming.
At the same time, let's move the documentation to a better name.
Next up: stop referring to code (drivers/mtd/devices/m25p80.c) from the
documentation.
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Cc: Marek Vasut <marex@denx.de>
Cc: Rafał Miłecki <zajec5@gmail.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Pawel Moll <pawel.moll@arm.com>
Cc: Ian Campbell <ijc+devicetree@hellion.org.uk>
Cc: Kumar Gala <galak@codeaurora.org>
Cc: devicetree@vger.kernel.org
Acked-by: Stephen Warren <swarren@nvidia.com>
Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Mark Rutland <mark.rutland@arm.com>
|
|
The ELPA bit in PageGrain is all about large *physical* addresses, so
correct the reference to "large virtual address" in the comment above
where it is set for MIPS64.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/10038/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
|
|
cpu_set was removed (along with a bunch of cpumask helpers) by
commit 2f0f267ea072 ("cpumask: remove deprecated functions.").
Fix this by replacing cpu_set with cpumask_set_cpu. Without this
fix the following error is triggered when CONFIG_MIPS_MT_FPAFF=y.
arch/mips/kernel/smp-cps.c: In function 'cps_smp_setup':
arch/mips/kernel/smp-cps.c:95:3: error: implicit declaration of function 'cpu_set' [-Werror=implicit-function-declaration]
Fixes: 90db024f140d ("MIPS: smp-cps: cpu_set FPU mask if FPU present")
Signed-off-by: Ezequiel Garcia <ezequiel.garcia@imgtec.com>
Acked-by: Niklas Cassel <niklass@axis.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/9912/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
|
|
Since many releases, the modifications of the mvebu and berlin device
tree files are merged through the mvebu subsystem. This patch makes it
official in order to help the contributors using the get_maintainer.pl
to find the accurate peoples.
In the same time, updated the mvebu description which now includes the
kirkwood SoCs and new Armada SoCs.
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Acked-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Acked-by: Jason Cooper <jason@lakedaemon.net>
Acked-by: Andrew Lunn <andrew@lunn.ch>
|
|
The xfstests test suite assumes that an attempt to collapse range on
the range (0, 1) will return EOPNOTSUPP if the file system does not
support collapse range. Commit 280227a75b56: "ext4: move check under
lock scope to close a race" broke this, and this caused xfstests to
fail when run when testing file systems that did not have the extents
feature enabled.
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
NUMA balancing is meant to be disabled by default on UMA machines but
the check is using nr_node_ids (highest node) instead of
num_online_nodes (online nodes).
The consequences are that a UMA machine with a node ID of 1 or higher
will enable NUMA balancing. This will incur useless overhead due to
minor faults with the impact depending on the workload. These are the
impact on the stats when running a kernel build on a single node machine
whose node ID happened to be 1:
vanilla patched
NUMA base PTE updates 5113158 0
NUMA huge PMD updates 643 0
NUMA page range updates 5442374 0
NUMA hint faults 2109622 0
NUMA hint local faults 2109622 0
NUMA hint local percent 100 100
NUMA pages migrated 0 0
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [3.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Change my private email address.
Signed-off-by: Jingoo Han <jingoohan1@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I had an issue:
Unable to handle kernel NULL pointer dereference at virtual address 0000082a
pgd = cc970000
[0000082a] *pgd=00000000
Internal error: Oops: 5 [#1] PREEMPT SMP ARM
PC is at get_pageblock_flags_group+0x5c/0xb0
LR is at unset_migratetype_isolate+0x148/0x1b0
pc : [<c00cc9a0>] lr : [<c0109874>] psr: 80000093
sp : c7029d00 ip : 00000105 fp : c7029d1c
r10: 00000001 r9 : 0000000a r8 : 00000004
r7 : 60000013 r6 : 000000a4 r5 : c0a357e4 r4 : 00000000
r3 : 00000826 r2 : 00000002 r1 : 00000000 r0 : 0000003f
Flags: Nzcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
Control: 10c5387d Table: 2cb7006a DAC: 00000015
Backtrace:
get_pageblock_flags_group+0x0/0xb0
unset_migratetype_isolate+0x0/0x1b0
undo_isolate_page_range+0x0/0xdc
__alloc_contig_range+0x0/0x34c
alloc_contig_range+0x0/0x18
This issue is because when calling unset_migratetype_isolate() to unset
a part of CMA memory, it try to access the buddy page to get its status:
if (order >= pageblock_order) {
page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
buddy_idx = __find_buddy_index(page_idx, order);
buddy = page + (buddy_idx - page_idx);
if (!is_migrate_isolate_page(buddy)) {
But the begin addr of this part of CMA memory is very close to a part of
memory that is reserved at boot time (not in buddy system). So add a
check before accessing it.
[akpm@linux-foundation.org: use conventional code layout]
Signed-off-by: Hui Zhu <zhuhui@xiaomi.com>
Suggested-by: Laura Abbott <labbott@redhat.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
{u,g}id_valid call {u,g}id_eq, which calls __k{u,g}id_val on both
arguments and compares. With !CONFIG_MULTIUSER, __k{u,g}id_val return a
constant 0, which makes {u,g}id_valid always return false. Change
{u,g}id_valid to compare their argument against -1 instead. That produces
identical results in the normal CONFIG_MULTIUSER=y case, but with
!CONFIG_MULTIUSER will make {u,g}id_valid constant-fold into "return
true;" rather than "return false;".
This fixes uses of devpts without CONFIG_MULTIUSER.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>,
Cc: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
root->ino_ida is used for kernfs inode number allocations. Since IDA has
a layered structure, different IDs can reside on the same layer, which
is currently accounted to some memory cgroup. The problem is that each
kmem cache of a memory cgroup has its own directory on sysfs (under
/sys/fs/kernel/<cache-name>/cgroup). If the inode number of such a
directory or any file in it gets allocated from a layer accounted to the
cgroup which the cache is created for, the cgroup will get pinned for
good, because one has to free all kmem allocations accounted to a cgroup
in order to release it and destroy all its kmem caches. That said we
must not account layers of ino_ida to any memory cgroup.
Since per net init operations may create new sysfs entries directly
(e.g. lo device) or indirectly (nf_conntrack creates a new kmem cache
per each namespace, which, in turn, creates new sysfs entries), an easy
way to reproduce this issue is by creating network namespace(s) from
inside a kmem-active memory cgroup.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org> [4.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Not all kmem allocations should be accounted to memcg. The following
patch gives an example when accounting of a certain type of allocations to
memcg can effectively result in a memory leak. This patch adds the
__GFP_NOACCOUNT flag which if passed to kmalloc and friends will force the
allocation to go through the root cgroup. It will be used by the next
patch.
Note, since in case of kmemleak enabled each kmalloc implies yet another
allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
gfp_kmemleak_mask.
Alternatively, we could introduce a per kmem cache flag disabling
accounting for all allocations of a particular kind, but (a) we would not
be able to bypass accounting for kmalloc then and (b) a kmem cache with
this flag set could not be merged with a kmem cache without this flag,
which would increase the number of global caches and therefore
fragmentation even if the memory cgroup controller is not used.
Despite its generic name, currently __GFP_NOACCOUNT disables accounting
only for kmem allocations while user page allocations are always charged.
To catch abusing of this flag, a warning is issued on an attempt of
passing it to mem_cgroup_try_charge.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org> [4.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
libabikfs.a doesn't exist anymore, so we now need to link with libapi.a.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|