Age | Commit message (Collapse) | Author | Files | Lines |
|
bio_free_pages is introduced in commit 1dfa0f68c040
("block: add a helper to free bio bounce buffer pages"),
we can reuse the func in other modules after it was
imported.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
device_add() may fail, and all callers are supposed to check the
return value, but one new user in lightnvm doesn't:
drivers/lightnvm/sysfs.c: In function 'nvm_sysfs_register_dev':
drivers/lightnvm/sysfs.c:184:2: error: ignoring return value of 'device_add',
declared with attribute warn_unused_result [-Werror=unused-result]
This changes the caller to propagate any error codes, which avoids
the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 38c9e260b9f9 ("lightnvm: expose device geometry through sysfs")
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
For a host to access an Open-Channel SSD, it has to know its geometry,
so that it writes and reads at the appropriate device bounds.
Currently, the geometry information is kept within the kernel, and not
exported to user-space for consumption. This patch exposes the
configuration through sysfs and enables user-space libraries, such as
liblightnvm, to use the sysfs implementation to get the geometry of an
Open-Channel SSD.
The sysfs entries are stored within the device hierarchy, and can be
found using the "lightnvm" device type.
An example configuration looks like this:
/sys/class/nvme/
└── nvme0n1
├── capabilities: 3
├── device_mode: 1
├── erase_max: 1000000
├── erase_typ: 1000000
├── flash_media_type: 0
├── media_capabilities: 0x00000001
├── media_type: 0
├── multiplane: 0x00010101
├── num_blocks: 1022
├── num_channels: 1
├── num_luns: 4
├── num_pages: 64
├── num_planes: 1
├── page_size: 4096
├── prog_max: 100000
├── prog_typ: 100000
├── read_max: 10000
├── read_typ: 10000
├── sector_oob_size: 0
├── sector_size: 4096
├── media_manager: gennvm
├── ppa_format: 0x380830082808001010102008
├── vendor_opcode: 0
├── max_phys_secs: 64
└── version: 1
Signed-off-by: Simon A. F. Lund <slund@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
LightNVM compatible device drivers does not have a method to expose
LightNVM specific sysfs entries.
To enable LightNVM sysfs entries to be exposed, lightnvm device
drivers require a struct device to attach it to. To allow both the
actual device driver and lightnvm sysfs entries to coexist, the device
driver tracks the lifetime of the nvm_dev structure.
This patch refactors NVMe and null_blk to handle the lifetime of struct
nvm_dev, which eliminates the need for struct gendisk when a lightnvm
compatible device is provided.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Enable devices without a gendisk instance to register itself with blk-mq
and expose the associated multi-queue sysfs entries.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
With LightNVM enabled devices, the gendisk structure is not exposed
to the user. This hides the device driver specific sysfs entries, and
prevents binding of LightNVM geometry information to the device.
Refactor the device registration process, so that gendisk and
non-gendisk devices are easily managed.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
With LightNVM enabled namespaces, the gendisk structure is not exposed
to the user. This prevents LightNVM users from accessing the NVMe device
driver specific sysfs entries, and LightNVM namespace geometry.
Refactor the revalidation process, so that a namespace, instead of a
gendisk, is revalidated. This later allows patches to wire up the
sysfs entries up to a non-gendisk namespace.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
If NO_DMA=y:
drivers/built-in.o: In function `nvme_nvm_dev_dma_free':
lightnvm.c:(.text+0x23df1a): undefined reference to `dma_pool_free'
drivers/built-in.o: In function `nvme_nvm_dev_dma_alloc':
lightnvm.c:(.text+0x23df38): undefined reference to `dma_pool_alloc'
drivers/built-in.o: In function `nvme_nvm_destroy_dma_pool':
lightnvm.c:(.text+0x23df4c): undefined reference to `dma_pool_destroy'
drivers/built-in.o: In function `nvme_nvm_create_dma_pool':
lightnvm.c:(.text+0x23df7e): undefined reference to `dma_pool_create'
and
ERROR: "dma_pool_destroy" [drivers/nvme/host/nvme-core.ko] undefined!
ERROR: "dma_pool_free" [drivers/nvme/host/nvme-core.ko] undefined!
ERROR: "dma_pool_alloc" [drivers/nvme/host/nvme-core.ko] undefined!
ERROR: "dma_pool_create" [drivers/nvme/host/nvme-core.ko] undefined!
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This patch only reserves two CPU hotplug states for block/mq so the block tree
can apply the conversion patches.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160906170457.32393-20-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Variable weight is not being initialized to zero before it is
used to compute the weight sum. Ensure it is initialized to zero.
Found with static analysis with cppcheck:
[lib/sbitmap.c:177]: (error) Uninitialized variable: weight
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
If we have a bunch of high-numbered bits allocated and then we resize
the struct sbitmap_queue, when those bits get cleared, we'll update the
hint and then have to re-randomize it repeatedly. Avoid that by checking
that the cleared bit is still a valid hint. No measurable performance
difference in the common case.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
After a struct sbitmap_queue is resized smaller, the allocation hints
may still be set to bits beyond the new depth of the bitmap. This means
that, for example, if the number of blk-mq tags is reduced through
sysfs, more requests than the nominal queue depth may be in flight.
It's tempting to fix this at resize time by doing a one-time
reinitialization of the hints, but this can race with
__sbitmap_queue_get() updating the hint. Instead, check the hint before
we use it. This caused no measurable performance difference in my
synthetic benchmarks.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
In order to get good cache behavior from a sbitmap, we want each CPU to
stick to its own cacheline(s) as much as possible. This might happen
naturally as the bitmap gets filled up and the alloc_hint values spread
out, but we really want this behavior from the start. blk-mq apparently
intended to do this, but the code to do this was never wired up. Get rid
of the dead code and make it part of the sbitmap library.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Again, there's no point in passing this in every time. Make it part of
struct sbitmap_queue and clean up the API.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Allocating your own per-cpu allocation hint separately makes for an
awkward API. Instead, allocate the per-cpu hint as part of the struct
sbitmap_queue. There's no point for a struct sbitmap_queue without the
cache, but you can still use a bare struct sbitmap.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The original bt_alloc() we converted from was using kzalloc(), not
kzalloc_node(), to allocate the wait queues. This was probably an
oversight, so fix it for sbitmap_queue_init_node().
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This is a generally useful data structure, so make it available to
anyone else who might want to use it. It's also a nice cleanup
separating the allocation logic from the rest of the tag handling logic.
The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
selected by CONFIG_BLOCK for now.
This should be a complete noop functionality-wise.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We currently account a '0' dispatch, and anything above that still falls
below the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more,
we don't account it.
Change the last bucket to be inclusive of anything above the range we
track, and have the sysfs file reflect that by including a '+' in the
output:
$ cat /sys/block/nvme0n1/mq/0/dispatched
0 1006
1 20229
2 1
4 0
8 0
16 0
32+ 0
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
|
|
blk_mq_delay_kick_requeue_list() provides the ability to kick the
q->requeue_list after a specified time. To do this the request_queue's
'requeue_work' member was changed to a delayed_work.
blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
requests while it doesn't make sense to immediately requeue them
(e.g. when all paths in a DM multipath have failed).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Since REQ_OP_BITS == 3 and __REQ_NR_BITS == 30 it is not that hard
to pass an op_flags argument to bio_set_op_attrs() that is larger
than the number of bits reserved for the op_flags argument. Complain
if this happens. Additionally, ensure that negative arguments trigger
a complaint (1 << ... is signed while 1U << ... is unsigned; adding
0U to an integer expression causes it to be promoted to an unsigned
type).
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Introduce the bio_flags() macro. Ensure that the second argument of
bio_set_op_attrs() only contains flags and no operation. This patch
does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Chris Mason <clm@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Josef Bacik <jbacik@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Make it clear that the sizeof(unsigned int) expression in BIO_OP_SHIFT
refers to the bi_opf member of struct bio.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
commit e1defc4ff0cf57aca6c5e3ff99fa503f5943c1f1
"block: Do away with the notion of hardsect_size"
removed the notion of "hardware sector size" from
the kernel in favor of logical block size, but
references remain in comments and documentation.
Update the remaining sites mentioning hardsect.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The blk_mq_alloc_single_hw_queue() is a prototype artifact that
should have been removed with
commit cdef54dd85ad66e77262ea57796a3e81683dd5d6
"blk-mq: remove alloc_hctx and free_hctx methods" where the last
users of it were deleted.
Fixes: cdef54dd85ad ("blk-mq: remove alloc_hctx and free_hctx methods")
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
DAX support for block devices was removed in commits 03cdad
("block: disable block device DAX by default") and 99a01cd
("block: remove BLK_DEV_DAX config option"), but we still kept a call to
dax_do_io and some uneeded i_flags manipulations introduced in commit
bbab37 ("block: Add support for DAX reads/writes to block devices").
Remove those leftovers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Allow the io_poll statistics to be zeroed to make for easier logging
of polling event.
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
In order to help determine the effectiveness of polling in a running
system it is usful to determine the ratio of how often the poll
function is called vs how often the completion is checked. For this
reason we add a poll_considered variable and add it to the sysfs entry
for io_poll.
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Instead of rolling our own timer, just utilize the blk mq req timeout and do the
disconnect if any of our commands timeout.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
In preparation for some future changes, change a few of the state bools over to
normal bits to set/clear properly.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We hit a warning when shutting down the nbd connection because we have irq's
disabled. We don't really need to do the shutdown under the lock, just clear
the nbd->sock. So do the shutdown outside of the irq. This gets rid of the
warning.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This moves NBD over to using blkmq, which allows us to get rid of the NBD
wide queue lock and the async submit kthread. We will start with 1 hw
queue for now, but I plan to add multiple tcp connection support in the
future and we'll fix how we set the hwqueue's.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This patch adds the ability for a given state to have multiple
instances. Until now all states have a single instance and the startup /
teardown callback use global variables.
A few drivers need to perform a the same callbacks on multiple
"instances". Currently we have three drivers in tree which all have a
global list which they iterate over. With multi instance they support
don't need their private list and the functionality has been moved into
core code. Plus we hold the hotplug lock in core so no cpus comes/goes
while instances are registered and we do rollback in error case :)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/1471024183-12666-3-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
This is preparation for the following patch.
This rework here changes the arguments of cpuhp_invoke_callback(). It
passes now `state' and whether `startup' or `teardown' callback should
be invoked. The callback then is looked up by the function.
The following is a clanup of callers:
- cpuhp_issue_call() has one argument less
- struct cpuhp_cpu_state (which is used by the hotplug thread) gets also
its callback removed. The decision if it is a single callback
invocation moved to the `single' variable. Also a `bringup' variable
has been added to distinguish between startup and teardown callback.
- take_cpu_down() needs to start one step earlier. We always get here
via CPUHP_TEARDOWN_CPU callback. Before that change cpuhp_ap_states +
CPUHP_TEARDOWN_CPU pointed to an empty entry because TEARDOWN is saved
in bp_states for this reason. Now that we use cpuhp_get_step() to
lookup the state we must explicitly skip it in order not to invoke it
twice.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/1471024183-12666-2-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
We get 1 warning when biuld kernel with W=1:
drivers/block/mtip32xx/mtip32xx.c:3689:6: warning: no previous prototype for
'mtip_block_release' [-Wmissing-prototypes]
In fact, this function is only used in the file in which it is declared
and don't need a declaration, but can be made static.
so this patch marks it 'static'.
Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
When drivers or the core calls this function, they usually
dereference the request shortly there after. Prefetch the first
cache line.
Profiling IO workloads shows that this is the most common cache
miss on the block side of things.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Various cache line optimizations:
- Move delay_work towards the end. It's huge, and we don't use it
a lot (only SCSI).
- Move the atomic state into the same cacheline as the the dispatch
list and lock.
- Rearrange a few members to pack it better.
- Shrink the max-order for dispatch accounting from 10 to 7. This
means that ->dispatched[] and ->run now take up their own
cacheline.
This shrinks struct blk_mq_hw_ctx down to 8 cachelines.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We don't need the larger delayed work struct, since we always run it
immediately.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Add a helper to schedule a regular struct work on a particular CPU.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Like cancel_delayed_work(), but for regular work.
Signed-off-by: Jens Axboe <axboe@fb.com>
Mehed-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
|
|
|
|
Due to assigning the 'replaced' value instead of or'ing it,
if drm_atomic_crtc_set_property() gets called multiple times,
the last call will define the color_mgmt_changed flag, so
a non-updating call to a property can reset the flag and
prevent actual hw state updates required by preceding
property updates.
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: <stable@vger.kernel.org> # v4.6+
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
For DAX inodes we need to be careful to never have page cache pages in
the mapping->page_tree. This radix tree should be composed only of DAX
exceptional entries and zero pages.
ltp's readahead02 test was triggering a warning because we were trying
to insert a DAX exceptional entry but found that a page cache page had
already been inserted into the tree. This page was being inserted into
the radix tree in response to a readahead(2) call.
Readahead doesn't make sense for DAX inodes, but we don't want it to
report a failure either. Instead, we just return success and don't do
any work.
Link: http://lkml.kernel.org/r/20160824221429.21158-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jan Kara <jack@suse.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The data offset for a dax region needs to account for a reservation in
the resource range. Otherwise, device-dax is allowing mappings directly
into the memmap or device-info-block area with crash signatures like the
following:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: get_zone_device_page+0x11/0x30
Call Trace:
follow_devmap_pmd+0x298/0x2c0
follow_page_mask+0x275/0x530
__get_user_pages+0xe3/0x750
__gfn_to_pfn_memslot+0x1b2/0x450 [kvm]
tdp_page_fault+0x130/0x280 [kvm]
kvm_mmu_page_fault+0x5f/0xf0 [kvm]
handle_ept_violation+0x94/0x180 [kvm_intel]
vmx_handle_exit+0x1d3/0x1440 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x81d/0x16a0 [kvm]
kvm_vcpu_ioctl+0x33c/0x620 [kvm]
do_vfs_ioctl+0xa2/0x5d0
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x1a/0xa4
Fixes: ab68f2622136 ("/dev/dax, pmem: direct access to persistent memory")
Link: http://lkml.kernel.org/r/147205536732.1606.8994275381938837346.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Abhilash Kumar Mulumudi <m.abhilash-kumar@hpe.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Tested-by: Toshi Kani <toshi.kani@hpe.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
seq_read() is a nasty piece of work, not to mention buggy.
It has (I think) an old bug which allows unprivileged userspace to read
beyond the end of m->buf.
I was getting these:
BUG: KASAN: slab-out-of-bounds in seq_read+0xcd2/0x1480 at addr ffff880116889880
Read of size 2713 by task trinity-c2/1329
CPU: 2 PID: 1329 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #96
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
Call Trace:
kasan_object_err+0x1c/0x80
kasan_report_error+0x2cb/0x7e0
kasan_report+0x4e/0x80
check_memory_region+0x13e/0x1a0
kasan_check_read+0x11/0x20
seq_read+0xcd2/0x1480
proc_reg_read+0x10b/0x260
do_loop_readv_writev.part.5+0x140/0x2c0
do_readv_writev+0x589/0x860
vfs_readv+0x7b/0xd0
do_readv+0xd8/0x2c0
SyS_readv+0xb/0x10
do_syscall_64+0x1b3/0x4b0
entry_SYSCALL64_slow_path+0x25/0x25
Object at ffff880116889100, in cache kmalloc-4096 size: 4096
Allocated:
PID = 1329
save_stack_trace+0x26/0x80
save_stack+0x46/0xd0
kasan_kmalloc+0xad/0xe0
__kmalloc+0x1aa/0x4a0
seq_buf_alloc+0x35/0x40
seq_read+0x7d8/0x1480
proc_reg_read+0x10b/0x260
do_loop_readv_writev.part.5+0x140/0x2c0
do_readv_writev+0x589/0x860
vfs_readv+0x7b/0xd0
do_readv+0xd8/0x2c0
SyS_readv+0xb/0x10
do_syscall_64+0x1b3/0x4b0
return_from_SYSCALL_64+0x0/0x6a
Freed:
PID = 0
(stack is not available)
Memory state around the buggy address:
ffff88011688a000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88011688a080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88011688a100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff88011688a180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88011688a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Disabling lock debugging due to kernel taint
This seems to be the same thing that Dave Jones was seeing here:
https://lkml.org/lkml/2016/8/12/334
There are multiple issues here:
1) If we enter the function with a non-empty buffer, there is an attempt
to flush it. But it was not clearing m->from after doing so, which
means that if we try to do this flush twice in a row without any call
to traverse() in between, we are going to be reading from the wrong
place -- the splat above, fixed by this patch.
2) If there's a short write to userspace because of page faults, the
buffer may already contain multiple lines (i.e. pos has advanced by
more than 1), but we don't save the progress that was made so the
next call will output what we've already returned previously. Since
that is a much less serious issue (and I have a headache after
staring at seq_read() for the past 8 hours), I'll leave that for now.
Link: http://lkml.kernel.org/r/1471447270-32093-1-git-send-email-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
A bugfix in v4.8-rc2 introduced a harmless warning when
CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:
mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
This moves the function inside of the #ifdef block that hides the
calling function, to avoid the warning.
Fixes: 1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
Link: http://lkml.kernel.org/r/20160824113733.2776701-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The current wording of the COMPACTION Kconfig help text doesn't
emphasise that disabling COMPACTION might cripple the page allocator
which relies on the compaction quite heavily for high order requests and
an unexpected OOM can happen with the lack of compaction. Make sure we
are vocal about that.
Link: http://lkml.kernel.org/r/20160823091726.GK23577@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 97f2645f358b ("tree-wide: replace config_enabled() with
IS_ENABLED()") mostly killed config_enabled(), but some new users have
appeared for v4.8-rc1. They are all used for a boolean option, so can
be replaced with IS_ENABLED() safely.
Link: http://lkml.kernel.org/r/1471970749-24867-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|