Age | Commit message (Collapse) | Author | Files | Lines |
|
Every new architecture has to add itself to the growing list of those
that do not support the old "generic" RTC driver.
This replaces the long list of architectures that don't support it
with a shorter list of those that do.
The list is taken from those architectures that have a non-empty
asm/rtc.h header file and were not explicitly blacklisted.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
In commit 941943cf519f7cacbbcecee5c4ef4b77b466bd5c ("drivers/hwtracing:
make coresight-* explicitly non-modular") we removed all uses of
modular functions/macros in favour of their built-in equivlents in
this subsystem.
However that commit and commit 0bcbf2e30ff2271b54f54c8697a185f7d86ec6e4
("coresight: etm-perf: new PMU driver for ETM tracers") were in flight
at the same time, and hence one new non-modular user of module_init
crept back in. Fix it up like we did all the others.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
IS_ERR_VALUE macro should be used only with unsigned long type.
Specifically it works incorrectly with longer types.
The patch follows conclusion from discussion on LKML [1][2].
[1]: http://permalink.gmane.org/gmane.linux.kernel/2120927
[2]: http://permalink.gmane.org/gmane.linux.kernel/2150581
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
A couple of fields in a data structure, which is used by the driver only,
were not initialized properly during the driver's setup.
The primary issue with this bug was that channel->wr_buf_size remained zero,
so calls to dma_sync_single_for_cpu() took place with zero size, and
consequently did nothing.
This had a rather minimal practical impact, because
(a) these calls are NOPs on Intel/AMD platforms, as well as other platforms
with coherent cache, and
(b) it's extremely rare that any cache line would survive between two reads
from a given DMA buffer
Hence no significant practical difference is expected with this patch.
Signed-off-by: Eli Billauer <eli.billauer@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The error return err is not initialized for the case when pci_map_rom
fails and no ROM can me mapped. Fix this by setting ret to -ENODATA;
(this is the same error value that is returned if the ROM data is
successfully mapped but does not match the expected ROM signature.).
Issue found from static code analysis using CoverityScan.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
WS2012 R2 and above hosts can support kexec in that thay can support
reconnecting to the host (as would be needed in the kexec path)
on any CPU. Enable this. Pre ws2012 r2 hosts don't have this ability
and consequently cannot support kexec.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Starting with Windows 2012 R2, message inteerupts can be delivered
on any VCPU in the guest. Support this functionality.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
If util transport fails to initialize for any reason, the list of transport
handlers may become corrupted due to freeing the transport handler without
removing it from the list. Fix this by cleaning it up from the list.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pass the channel information to the util drivers that need to defer
reading the channel while they are processing a request. This would address
the following issue reported by Vitaly:
Commit 3cace4a61610 ("Drivers: hv: utils: run polling callback always in
interrupt context") removed direct *_transaction.state = HVUTIL_READY
assignments from *_handle_handshake() functions introducing the following
race: if a userspace daemon connects before we get first non-negotiation
request from the server hv_poll_channel() won't set transaction state to
HVUTIL_READY as (!channel) condition will fail, we set it to non-NULL on
the first real request from the server.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Message header is modified by the hypervisor and we read it in a loop,
we need to prevent compilers from optimizing accesses. There are no such
optimizations at this moment, this is just a future proof.
Suggested-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Radim Kr.má<rkrcmar@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We have 3 functions dealing with messages and they all implement
the same logic to finalize reads, move it to vmbus_signal_eom().
Suggested-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Radim Kr.má<rkrcmar@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
wait_for_completion() may sleep, it enables interrupts and this
is something we really want to avoid on crashes because interrupt
handlers can cause other crashes. Switch to the recently introduced
vmbus_wait_for_unload() doing busy wait instead.
Reported-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Radim Kr.má<rkrcmar@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We must handle HVMSG_TIMER_EXPIRED messages in the interrupt context
and we offload all the rest to vmbus_on_msg_dpc() tasklet. This functions
loops to see if there are new messages pending. In case we'll ever see
HVMSG_TIMER_EXPIRED message there we're going to lose it as we can't
handle it from there. Avoid looping in vmbus_on_msg_dpc(), we're OK
with handling one message per interrupt.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Radim Kr.má<rkrcmar@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Now that the AT24 uses the NVMEM framework, replace the
memory_accessor in the setup() callback with nvmem API calls.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Tested-by: Sekhar Nori <nsekhar@ti.com>
Acked-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add a regmap for accessing the EEPROM, and then use that with the
NVMEM framework. Enable backward compatibility in the NVMEM config
structure, so that the 'eeprom' file in sys is provided by the
framework.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add a regmap for accessing the EEPROM, and then use that with the
NVMEM framework. Enable backwards compatibility in the NVMEM config,
so that the 'eeprom' file in sys is provided by the framework.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The setup() callback is not used by any in kernel code. Remove it.
Any new code which requires access to the eeprom can use the NVMEM
API.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Acked-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add a regmap for accessing the EEPROM, and then use that with the
NVMEM framework. Set the NVMEM config structure to enable backward, so
that the 'eeprom' file in sys is provided by the framework.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Tested-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Older drivers made an 'eeprom' file available in the /sys device
directory. Have the NVMEM core provide this to retain backwards
compatibility.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Legacy AT24, AT25 EEPROMs are exported in sys so that only root can
read the contents. The EEPROMs may contain sensitive information. Add
a flag so the provide can indicate that NVMEM should also restrict
access to root only.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Currently writing the attributes with "echo" will result in comparing:
"enabled\n" with "enabled\0" and attribute is always set to false.
Use the sysfs_streq() instead because it treats both NUL and
new-line-then-NUL as equivalent string terminations.
Signed-off-by: Dan Bogdan Nechita <dan.bogdan.nechita@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Fix RDAC read back errors caused by a typo. Value must shift by 2.
Fixes: a4bd394956f2 ("drivers/misc/ad525x_dpot.c: new features")
Signed-off-by: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add device ids for Broxton SoC based devices.
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This improves the order of operations on the use-after-free tests to
try to make sure we've executed any available sanity-checking code,
and to report the poisoning that was found.
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
dmesg output of running this LKDTM test with PaX:
[187095.475573] lkdtm: No crash points registered, enable through debugfs
[187118.020257] lkdtm: Performing direct entry WRAP_ATOMIC
[187118.030045] lkdtm: attempting atomic underflow
[187118.030929] PAX: refcount overflow detected in: bash:1790, uid/euid: 0/0
[187118.071667] PAX: refcount overflow occured at: lkdtm_do_action+0x19e/0x400 [lkdtm]
[187118.081423] CPU: 3 PID: 1790 Comm: bash Not tainted 4.2.6-pax-refcount-split+ #2
[187118.083403] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[187118.102596] task: ffff8800da8de040 ti: ffff8800da8e4000 task.ti: ffff8800da8e4000
[187118.111321] RIP: 0010:[<ffffffffc00fc2fe>] [<ffffffffc00fc2fe>] lkdtm_do_action+0x19e/0x400 [lkdtm]
...
[187118.128074] lkdtm: attempting atomic overflow
[187118.128080] PAX: refcount overflow detected in: bash:1790, uid/euid: 0/0
[187118.128082] PAX: refcount overflow occured at: lkdtm_do_action+0x1b6/0x400 [lkdtm]
[187118.128085] CPU: 3 PID: 1790 Comm: bash Not tainted 4.2.6-pax-refcount-split+ #2
[187118.128086] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[187118.128088] task: ffff8800da8de040 ti: ffff8800da8e4000 task.ti: ffff8800da8e4000
[187118.128092] RIP: 0010:[<ffffffffc00fc316>] [<ffffffffc00fc316>] lkdtm_do_action+0x1b6/0x400 [lkdtm]
Signed-off-by: David Windsor <dave@progbits.org>
[cleaned up whitespacing, keescook]
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
The current tests for read/write after free work on slab
allocated memory. Memory straight from the buddy allocator
may behave slightly differently and have a different set
of parameters to test. Add tests for those cases as well.
On a basic x86 boot:
# echo WRITE_BUDDY_AFTER_FREE > /sys/kernel/debug/provoke-crash/DIRECT
[ 22.291950] lkdtm: Performing direct entry WRITE_BUDDY_AFTER_FREE
[ 22.292983] lkdtm: Writing to the buddy page before free
[ 22.293950] lkdtm: Attempting bad write to the buddy page after free
# echo READ_BUDDY_AFTER_FREE > /sys/kernel/debug/provoke-crash/DIRECT
[ 32.375601] lkdtm: Performing direct entry READ_BUDDY_AFTER_FREE
[ 32.379896] lkdtm: Value in memory before free: 12345678
[ 32.383854] lkdtm: Attempting to read from freed memory
[ 32.389309] lkdtm: Buddy page was not poisoned
On x86 with CONFIG_DEBUG_PAGEALLOC and debug_pagealloc=on:
# echo WRITE_BUDDY_AFTER_FREE > /sys/kernel/debug/provoke-crash/DIRECT
[ 17.475533] lkdtm: Performing direct entry WRITE_BUDDY_AFTER_FREE
[ 17.477360] lkdtm: Writing to the buddy page before free
[ 17.479089] lkdtm: Attempting bad write to the buddy page after free
[ 17.480904] BUG: unable to handle kernel paging request at
ffff88000ebd8000
# echo READ_BUDDY_AFTER_FREE > /sys/kernel/debug/provoke-crash/DIRECT
[ 14.606433] lkdtm: Performing direct entry READ_BUDDY_AFTER_FREE
[ 14.607447] lkdtm: Value in memory before free: 12345678
[ 14.608161] lkdtm: Attempting to read from freed memory
[ 14.608860] BUG: unable to handle kernel paging request at
ffff88000eba3000
Note that arches without ARCH_SUPPORTS_DEBUG_PAGEALLOC may not
produce the same crash.
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
The SLUB allocator may use the first word of a freed block to store the
freelist information. This may make it harder to test poisoning
features. Change the WRITE_AFTER_FREE test to better match what
the READ_AFTER_FREE test does and also print out a big more information.
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
In a similar manner to WRITE_AFTER_FREE, add a READ_AFTER_FREE
test to test free poisoning features. Sample output when
no sanitization is present:
# echo READ_AFTER_FREE > /sys/kernel/debug/provoke-crash/DIRECT
[ 17.542473] lkdtm: Performing direct entry READ_AFTER_FREE
[ 17.543866] lkdtm: Value in memory before free: 12345678
[ 17.545212] lkdtm: Attempting bad read from freed memory
[ 17.546542] lkdtm: Memory was not poisoned
with slub_debug=P:
# echo READ_AFTER_FREE > /sys/kernel/debug/provoke-crash/DIRECT
[ 22.415531] lkdtm: Performing direct entry READ_AFTER_FREE
[ 22.416366] lkdtm: Value in memory before free: 12345678
[ 22.417137] lkdtm: Attempting bad read from freed memory
[ 22.417897] lkdtm: Memory correctly poisoned, calling BUG
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Officially claim maintainership over the LKDTM code.
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Like a signal return, we should use synchronize_user_stack() rather
than flush_user_windows().
Reported-by: Ilya Malakhov <ilmalakhovthefirst@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Binutils used to be (erroneously) extremely permissive about
instruction usage. But that got fixed and if you don't properly tell
it to accept classes of instructions it will fail.
This uncovered a specs bug on sparc in gcc where it wouldn't pass the
proper options to binutils options.
Deal with this in the kernel build by adding -Wa,-Av8 to KBUILD_CFLAGS.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Palams extcon IRQs are nested threaded and wired to the Palmas
inerrupt controller. So, this flag is not required for nested irqs
anymore, since commit 3c646f2c6aa9 ("genirq: Don't suspend
nested_thread irqs over system suspend") was merged.
Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Roger Quadros <rogerq@ti.com>
Cc: Nishanth Menon <nm@ti.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
|
|
|
|
... or we risk seeing a bogus value of d_is_symlink() there.
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
... otherwise d_is_symlink() above might have nothing to do with
the inode value we've got.
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
both do_last() and walk_component() risk picking a NULL inode out
of dentry about to become positive, *then* checking its flags and
seeing that it's not negative anymore and using (already stale by
then) value they'd fetched earlier. Usually ends up oopsing soon
after that...
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
... into returning a positive to path_openat(), which would interpret that
as "symlink had been encountered" and proceed to corrupt memory, etc.
It can only happen due to a bug in some ->open() instance or in some LSM
hook, etc., so we report any such event *and* make sure it doesn't trick
us into further unpleasantness.
Cc: stable@vger.kernel.org # v3.6+, at least
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
-EBADF is a rather confusing error if an operations is not supported,
and nfsd gets rather upset about it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The delete opration can allocate additional space on the HPFS filesystem
due to btree split. The HPFS driver checks in advance if there is
available space, so that it won't corrupt the btree if we run out of space
during splitting.
If there is not enough available space, the HPFS driver attempted to
truncate the file, but this results in a deadlock since the commit
7dd29d8d865efdb00c0542a5d2c87af8c52ea6c7 ("HPFS: Introduce a global mutex
and lock it on every callback from VFS").
This patch removes the code that tries to truncate the file and -ENOSPC is
returned instead. If the user hits -ENOSPC on delete, he should try to
delete other files (that are stored in a leaf btree node), so that the
delete operation will make some space for deleting the file stored in
non-leaf btree node.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@vger.kernel.org # 2.6.39+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
As it is currently written ext4_dax_mkwrite() assumes that the call into
__dax_mkwrite() will not have to do a block allocation so it doesn't create
a journal entry. For a read that creates a zero page to cover a hole
followed by a write that actually allocates storage this is incorrect. The
ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
get_blocks() to allocate storage.
Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
as this function already has all the logic needed to allocate a journal
entry and call __dax_fault().
Also update the ext2 fault handlers in this same way to remove duplicate
code and keep the logic between ext2 and ext4 the same.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
dax_writeback_mapping_range() needs a struct block_device, and it used
to get that from inode->i_sb->s_bdev. This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.
Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response
to sync(2).
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
dax_clear_blocks() needs a valid struct block_device and previously it
was using inode->i_sb->s_bdev in all cases. This is correct for normal
inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
DAX raw block devices and for XFS real-time devices.
Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
its arguments to take a bdev and a sector instead of an inode and a
block. This better reflects what the function does, and it allows the
filesystem and raw block device code to pass in an appropriate struct
block_device.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Online defrag operations for ext4 are hard coded to use the page cache.
See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()
When combined with DAX I/O, which circumvents the page cache, this can
result in data corruption. This was observed with xfstests ext4/307 and
ext4/308.
Fix this by only allowing online defrag for non-DAX files.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes. Any dirty data associated
with the inode should be in the form of DAX exceptional entries
(mapping->nrexceptional) that is written back via
dax_writeback_mapping_range().
With the current code, though, this isn't always true. For example,
ext2 and ext4 directory inodes can have S_DAX set, but have their dirty
data stored as dirty page cache entries. For these types of inodes,
having S_DAX set doesn't really make sense since their I/O doesn't
actually happen through the DAX code path.
Instead, only allow S_DAX to be set for regular inodes for ext2 and
ext4. This allows us to have strict DAX vs non-DAX paths in the
writeback code.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The recent *sync enabling discovered that we are inserting into the
block_device pagecache counter to the expectations of the dirty data
tracking for dax mappings. This can lead to data corruption.
We want to support DAX for block devices eventually, but it requires
wider changes to properly manage the pagecache.
dump_stack+0x85/0xc2
dax_writeback_mapping_range+0x60/0xe0
blkdev_writepages+0x3f/0x50
do_writepages+0x21/0x30
__filemap_fdatawrite_range+0xc6/0x100
filemap_write_and_wait+0x4a/0xa0
set_blocksize+0x70/0xd0
sb_set_blocksize+0x1d/0x50
ext4_fill_super+0x75b/0x3360
mount_bdev+0x180/0x1b0
ext4_mount+0x15/0x20
mount_fs+0x38/0x170
Mark the support broken so its disabled by default, but otherwise still
available for testing.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When doing append direct io cleanup, if deleting inode fails, it goes
out without unlocking inode, which will cause the inode deadlock.
This issue was introduced by commit cf1776a9e834 ("ocfs2: fix a tiny
race when truncate dio orohaned entry").
Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Replace calls to get_random_int() followed by a cast to (unsigned long)
with calls to get_random_long(). Also address shifting bug which, in
case of x86 removed entropy mask for mmap_rnd_bits values > 31 bits.
Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit d07e22597d1d ("mm: mmap: add new /proc tunable for mmap_base
ASLR") added the ability to choose from a range of values to use for
entropy count in generating the random offset to the mmap_base address.
The maximum value on this range was set to 32 bits for 64-bit x86
systems, but this value could be increased further, requiring more than
the 32 bits of randomness provided by get_random_int(), as is already
possible for arm64. Add a new function: get_random_long() which more
naturally fits with the mmap usage of get_random_int() but operates
exactly the same as get_random_int().
Also, fix the shifting constant in mmap_rnd() to be an unsigned long so
that values greater than 31 bits generate an appropriate mask without
overflow. This is especially important on x86, as its shift instruction
uses a 5-bit mask for the shift operand, which meant that any value for
mmap_rnd_bits over 31 acts as a no-op and effectively disables mmap_base
randomization.
Finally, replace calls to get_random_int() with get_random_long() where
appropriate.
This patch (of 2):
Add get_random_long().
Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 4167e9b2cf10 ("mm: remove GFP_THISNODE") removed the GFP_THISNODE
flag combination due to confusing semantics. It noted that
alloc_misplaced_dst_page() was one such user after changes made by
commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify").
Unfortunately when GFP_THISNODE was removed, users of
alloc_misplaced_dst_page() started waking kswapd and entering direct
reclaim because the wrong GFP flags are cleared. The consequence is
that workloads that used to fit into memory now get reclaimed which is
addressed by this patch.
The problem can be demonstrated with "mutilate" that exercises memcached
which is software dedicated to memory object caching. The configuration
uses 80% of memory and is run 3 times for varying numbers of clients.
The results on a 4-socket NUMA box are
mutilate
4.4.0 4.4.0
vanilla numaswap-v1
Hmean 1 8394.71 ( 0.00%) 8395.32 ( 0.01%)
Hmean 4 30024.62 ( 0.00%) 34513.54 ( 14.95%)
Hmean 7 32821.08 ( 0.00%) 70542.96 (114.93%)
Hmean 12 55229.67 ( 0.00%) 93866.34 ( 69.96%)
Hmean 21 39438.96 ( 0.00%) 85749.21 (117.42%)
Hmean 30 37796.10 ( 0.00%) 50231.49 ( 32.90%)
Hmean 47 18070.91 ( 0.00%) 38530.13 (113.22%)
The metric is queries/second with the more the better. The results are
way outside of the noise and the reason for the improvement is obvious
from some of the vmstats
4.4.0 4.4.0
vanillanumaswap-v1r1
Minor Faults 1929399272 2146148218
Major Faults 19746529 3567
Swap Ins 57307366 9913
Swap Outs 50623229 17094
Allocation stalls 35909 443
DMA allocs 0 0
DMA32 allocs 72976349 170567396
Normal allocs 5306640898 5310651252
Movable allocs 0 0
Direct pages scanned 404130893 799577
Kswapd pages scanned 160230174 0
Kswapd pages reclaimed 55928786 0
Direct pages reclaimed 1843936 41921
Page writes file 2391 0
Page writes anon 50623229 17094
The vanilla kernel is swapping like crazy with large amounts of direct
reclaim and kswapd activity. The figures are aggregate but it's known
that the bad activity is throughout the entire test.
Note that simple streaming anon/file memory consumers also see this
problem but it's not as obvious. In those cases, kswapd is awake when
it should not be.
As there are at least two reclaim-related bugs out there, it's worth
spelling out the user-visible impact. This patch only addresses bugs
related to excessive reclaim on NUMA hardware when the working set is
larger than a NUMA node. There is a bug related to high kswapd CPU
usage but the reports are against laptops and other UMA hardware and is
not addressed by this patch.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were
introduced to locklessy (but atomically) detect when a pmd is a regular
(stable) pmd or when the pmd is unstable and can infinitely transition
from pmd_none() and pmd_trans_huge() from under us, while only holding
the mmap_sem for reading (for writing not).
While holding the mmap_sem only for reading, MADV_DONTNEED can run from
under us and so before we can assume the pmd to be a regular stable pmd
we need to compare it against pmd_none() and pmd_trans_huge() in an
atomic way, with pmd_trans_unstable(). The old pmd_trans_huge() left a
tiny window for a race.
Useful applications are unlikely to notice the difference as doing
MADV_DONTNEED concurrently with a page fault would lead to undefined
behavior.
[akpm@linux-foundation.org: tidy up comment grammar/layout]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|