aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/md (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov1-1/+1
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01md/bitmap: clear bitmap if bitmap_create failedGuoqing Jiang1-8/+11
If bitmap_create returns an error, we need to call either bitmap_destroy or bitmap_free to do clean up, and the selection is based on mddev->bitmap is set or not. And the sysfs_put(bitmap->sysfs_can_clear) is moved from bitmap_destroy to bitmap_free, and the comment of bitmap_create is changed as well. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-31MD: add rdev reference for super writeShaohua Li1-0/+3
Xiao Ni reported below crash: [26396.335146] BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8 [26396.342990] IP: [<ffffffffa0425b00>] super_written+0x20/0x80 [md_mod] [26396.349449] PGD 0 [26396.351468] Oops: 0002 [#1] SMP [26396.354898] Modules linked in: ext4 mbcache jbd2 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_td [26396.408404] CPU: 5 PID: 3261 Comm: loop0 Not tainted 4.5.0 #1 [26396.414140] Hardware name: Dell Inc. PowerEdge R715/0G2DP3, BIOS 3.2.2 09/15/2014 [26396.421608] task: ffff8808339be680 ti: ffff8808365f4000 task.ti: ffff8808365f4000 [26396.429074] RIP: 0010:[<ffffffffa0425b00>] [<ffffffffa0425b00>] super_written+0x20/0x80 [md_mod] [26396.437952] RSP: 0018:ffff8808365f7c38 EFLAGS: 00010046 [26396.443252] RAX: ffffffffa0425ae0 RBX: ffff8804336a7900 RCX: ffffe8f9f7b41198 [26396.450371] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8804336a7900 [26396.457489] RBP: ffff8808365f7c50 R08: 0000000000000005 R09: 00001801e02ce3d7 [26396.464608] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [26396.471728] R13: ffff8808338d9a00 R14: 0000000000000000 R15: ffff880833f9fe00 [26396.478849] FS: 00007f9e5066d740(0000) GS:ffff880237b40000(0000) knlGS:0000000000000000 [26396.486922] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [26396.492656] CR2: 00000000000002a8 CR3: 00000000019ea000 CR4: 00000000000006e0 [26396.499775] Stack: [26396.501781] ffff8804336a7900 0000000000000000 0000000000000000 ffff8808365f7c68 [26396.509199] ffffffff81308cd0 ffff8804336a7900 ffff8808365f7ca8 ffffffff81310637 [26396.516618] 00000000a0233a00 ffff880833f9fe00 0000000000000000 ffff880833fb0000 [26396.524038] Call Trace: [26396.526485] [<ffffffff81308cd0>] bio_endio+0x40/0x60 [26396.531529] [<ffffffff81310637>] blk_update_request+0x87/0x320 [26396.537439] [<ffffffff8131a20a>] blk_mq_end_request+0x1a/0x70 [26396.543261] [<ffffffff81313889>] blk_flush_complete_seq+0xd9/0x2a0 [26396.549517] [<ffffffff81313ccf>] flush_end_io+0x15f/0x240 [26396.554993] [<ffffffff8131a22a>] blk_mq_end_request+0x3a/0x70 [26396.560815] [<ffffffff8131a314>] __blk_mq_complete_request+0xb4/0xe0 [26396.567246] [<ffffffff8131a35c>] blk_mq_complete_request+0x1c/0x20 [26396.573506] [<ffffffffa04182df>] loop_queue_work+0x6f/0x72c [loop] [26396.579764] [<ffffffff81697844>] ? __schedule+0x2b4/0x8f0 [26396.585242] [<ffffffff810a7812>] kthread_worker_fn+0x52/0x170 [26396.591065] [<ffffffff810a77c0>] ? kthread_create_on_node+0x1a0/0x1a0 [26396.597582] [<ffffffff810a7238>] kthread+0xd8/0xf0 [26396.602453] [<ffffffff810a7160>] ? kthread_park+0x60/0x60 [26396.607929] [<ffffffff8169bdcf>] ret_from_fork+0x3f/0x70 [26396.613319] [<ffffffff810a7160>] ? kthread_park+0x60/0x60 md_super_write() and corresponding md_super_wait() generally are called with reconfig_mutex locked, which prevents disk disappears. There is one case this rule is broken. write_sb_page of bitmap.c doesn't hold the mutex. next_active_rdev does increase rdev reference, but it decreases the reference too early (eg, before IO finish). disk can disappear at the window. We unconditionally increase rdev reference in md_super_write() to avoid the race. Reported-and-tested-by: Xiao Ni <xni@redhat.com> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-31md: fix a trivial typo in commentsWei Fang1-1/+1
Fix a trivial typo in md_ioctl(). Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-31md:raid1: fix a dead loop when read from a WriteMostly diskWei Fang1-1/+1
If first_bad == this_sector when we get the WriteMostly disk in read_balance(), valid disk will be returned with zero max_sectors. It'll lead to a dead loop in make_request(), and OOM will happen because of endless allocation of struct bio. Since we can't get data from this disk in this case, so continue for another disk. Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-21Merge tag 'md/4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds8-39/+57
Pull MD updates from Shaohua Li: "This update mainly fixes bugs. - a raid5 discard related fix from Jes - a MD multipath bio clone fix from Ming - raid1 error handling deadlock fix from Nate and corresponding raid10 fix from myself - a raid5 stripe batch fix from Neil - a patch from Sebastian to avoid unnecessary uevent - several cleanup/debug patches" * tag 'md/4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md/raid5: Cleanup cpu hotplug notifier raid10: include bio_end_io_list in nr_queued to prevent freeze_array hang raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang md: fix typos for stipe md/bitmap: remove redundant return in bitmap_checkpage md/raid1: remove unnecessary BUG_ON md: multipath: don't hardcopy bio in .make_request path md/raid5: output stripe state for debug md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list Update MD git tree URL md/bitmap: remove redundant check MD: warn for potential deadlock md: Drop sending a change uevent when stopping RAID5: revert e9e4c377e2f563 to fix a livelock RAID5: check_reshape() shouldn't call mddev_suspend md/raid5: Compare apples to apples (or sectors to sectors)
2016-03-18Merge branch 'for-4.6/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds1-13/+33
Pull block driver updates from Jens Axboe: "This is the block driver pull request for this merge window. It sits on top of for-4.6/core, that was just sent out. This contains: - A set of fixes for lightnvm. One from Alan, fixing an overflow, and the rest from the usual suspects, Javier and Matias. - A set of fixes for nbd from Markus and Dan, and a fixup from Arnd for correct usage of the signed 64-bit divider. - A set of bug fixes for the Micron mtip32xx, from Asai. - A fix for the brd discard handling from Bart. - Update the maintainers entry for cciss, since that hardware has transferred ownership. - Three bug fixes for bcache from Eric Wheeler. - Set of fixes for xen-blk{back,front} from Jan and Konrad. - Removal of the cpqarray driver. It has been disabled in Kconfig since 2013, and we were initially scheduled to remove it in 3.15. - Various updates and fixes for NVMe, with the most important being: - Removal of the per-device NVMe thread, replacing that with a watchdog timer instead. From Christoph. - Exposing the namespace WWID through sysfs, from Keith. - Set of cleanups from Ming Lin. - Logging the controller device name instead of the underlying PCI device name, from Sagi. - And a bunch of fixes and optimizations from the usual suspects in this area" * 'for-4.6/drivers' of git://git.kernel.dk/linux-block: (49 commits) NVMe: Expose ns wwid through single sysfs entry drivers:block: cpqarray clean up brd: Fix discard request processing cpqarray: remove it from the kernel cciss: update MAINTAINERS NVMe: Remove unused sq_head read in completion path bcache: fix cache_set_flush() NULL pointer dereference on OOM bcache: cleaned up error handling around register_cache() bcache: fix race of writeback thread starting before complete initialization NVMe: Create discard zero quirk white list nbd: use correct div_s64 helper mtip32xx: remove unneeded variable in mtip_cmd_timeout() lightnvm: generalize rrpc ppa calculations lightnvm: remove struct nvm_dev->total_blocks lightnvm: rename ->nr_pages to ->nr_sects lightnvm: update closed list outside of intr context xen/blback: Fit the important information of the thread in 17 characters lightnvm: fold get bb tbl when using dual/quad plane mode lightnvm: fix up nonsensical configure overrun checking xen-blkback: advertise indirect segment support earlier ...
2016-03-17md/raid5: Cleanup cpu hotplug notifierAnna-Maria Gleixner1-0/+2
The raid456_cpu_notify() hotplug callback lacks handling of the CPU_UP_CANCELED case. That means if CPU_UP_PREPARE fails, the scratch buffer is leaked. Add handling for CPU_UP_CANCELED[_FROZEN] hotplug notifier transitions to free the scratch buffer. CC: Shaohua Li <shli@kernel.org> CC: linux-raid@vger.kernel.org Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-17raid10: include bio_end_io_list in nr_queued to prevent freeze_array hangShaohua Li1-2/+5
This is the raid10 counterpart of the bug fixed by Nate (raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang) Fixes: 95af587e95(md/raid10: ensure device failure recorded before write request returns) Cc: stable@vger.kernel.org (V4.3+) Cc: Nate Dailey <nate.dailey@stratus.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-17raid1: include bio_end_io_list in nr_queued to prevent freeze_array hangNate Dailey1-2/+5
If raid1d is handling a mix of read and write errors, handle_read_error's call to freeze_array can get stuck. This can happen because, though the bio_end_io_list is initially drained, writes can be added to it via handle_write_finished as the retry_list is processed. These writes contribute to nr_pending but are not included in nr_queued. If a later entry on the retry_list triggers a call to handle_read_error, freeze array hangs waiting for nr_pending == nr_queued+extra. The writes on the bio_end_io_list aren't included in nr_queued so the condition will never be satisfied. To prevent the hang, include bio_end_io_list writes in nr_queued. There's probably a better way to handle decrementing nr_queued, but this seemed like the safest way to avoid breaking surrounding code. I'm happy to supply the script I used to repro this hang. Fixes: 55ce74d4bfe1b(md/raid1: ensure device failure recorded before write request returns.) Cc: stable@vger.kernel.org (v4.3+) Signed-off-by: Nate Dailey <nate.dailey@stratus.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-17Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds1-45/+48
Pull crypto update from Herbert Xu: "Here is the crypto update for 4.6: API: - Convert remaining crypto_hash users to shash or ahash, also convert blkcipher/ablkcipher users to skcipher. - Remove crypto_hash interface. - Remove crypto_pcomp interface. - Add crypto engine for async cipher drivers. - Add akcipher documentation. - Add skcipher documentation. Algorithms: - Rename crypto/crc32 to avoid name clash with lib/crc32. - Fix bug in keywrap where we zero the wrong pointer. Drivers: - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver. - Add PIC32 hwrng driver. - Support BCM6368 in bcm63xx hwrng driver. - Pack structs for 32-bit compat users in qat. - Use crypto engine in omap-aes. - Add support for sama5d2x SoCs in atmel-sha. - Make atmel-sha available again. - Make sahara hashing available again. - Make ccp hashing available again. - Make sha1-mb available again. - Add support for multiple devices in ccp. - Improve DMA performance in caam. - Add hashing support to rockchip" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits) crypto: qat - remove redundant arbiter configuration crypto: ux500 - fix checks of error code returned by devm_ioremap_resource() crypto: atmel - fix checks of error code returned by devm_ioremap_resource() crypto: qat - Change the definition of icp_qat_uof_regtype hwrng: exynos - use __maybe_unused to hide pm functions crypto: ccp - Add abstraction for device-specific calls crypto: ccp - CCP versioning support crypto: ccp - Support for multiple CCPs crypto: ccp - Remove check for x86 family and model crypto: ccp - memset request context to zero during import lib/mpi: use "static inline" instead of "extern inline" lib/mpi: avoid assembler warning hwrng: bcm63xx - fix non device tree compatibility crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode. crypto: qat - The AE id should be less than the maximal AE number lib/mpi: Endianness fix crypto: rockchip - add hash support for crypto engine in rk3288 crypto: xts - fix compile errors crypto: doc - add skcipher API documentation crypto: doc - update AEAD AD handling ...
2016-03-14dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request()Bryn M. Reeves1-1/+1
An "old" (.request_fn) DM 'struct request' stores a pointer to the associated 'struct dm_rq_target_io' in rq->special. dm_requeue_original_request(), previously named dm_requeue_unmapped_original_request(), called dm_unprep_request() to reset rq->special to NULL. But rq_end_stats() would go on to hit a NULL pointer deference because its call to tio_from_request() returned NULL. Fix this by calling rq_end_stats() _before_ dm_unprep_request() Signed-off-by: Bryn M. Reeves <bmr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: e262f34741 ("dm stats: add support for request-based DM devices") Cc: stable@vger.kernel.org # 4.2+
2016-03-14md: fix typos for stipeGuoqing Jiang1-2/+2
Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-14md/bitmap: remove redundant return in bitmap_checkpageGuoqing Jiang1-1/+0
The "return 0" is not needed since bitmap_checkpage will finally return 0 for the case. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-14md/raid1: remove unnecessary BUG_ONGuoqing Jiang1-1/+0
Since bitmap_start_sync will not return until sync_blocks is not less than PAGE_SIZE>>9, so the BUG_ON is not needed anymore. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-14md: multipath: don't hardcopy bio in .make_request pathMing Lei1-1/+3
Inside multipath_make_request(), multipath maps the incoming bio into low level device's bio, but it is totally wrong to copy the bio into mapped bio via '*mapped_bio = *bio'. For example, .__bi_remaining is kept in the copy, especially if the incoming bio is chained to via bio splitting, so .bi_end_io can't be called for the mapped bio at all in the completing path in this kind of situation. This patch fixes the issue by using clone style. Cc: stable@vger.kernel.org (v3.14+) Reported-and-tested-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-11dm thin: consistently return -ENOSPC if pool has run out of data spaceMike Snitzer1-4/+17
Commit 0a927c2f02 ("dm thin: return -ENOSPC when erroring retry list due to out of data space") was a step in the right direction but didn't go far enough. Add a new 'out_of_data_space' flag to 'struct pool' and set it if/when the pool runs of of data space. This fixes cell_error() and error_retry_list() to not blindly return -EIO. We cannot rely on the 'error_if_no_space' feature flag since it is transient (in that it can be reset once space is added, plus it only controls whether errors are issued, it doesn't reflect whether the pool is actually out of space). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm cache: bump the target versionMike Snitzer1-1/+1
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm cache: make sure every metadata function checks fail_ioJoe Thornber3-43/+71
Otherwise operations may be attempted that will only ever go on to crash (since the metadata device is either missing or unreliable if 'fail_io' is set). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2016-03-10dm: add missing newline between DM_DEBUG_BLOCK_STACK_TRACING and DM_BUFIOMike Snitzer1-0/+1
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm cache policy smq: clarify that mq registration failure was for 'mq'Mike Snitzer1-1/+1
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm: return error if bio_integrity_clone() fails in clone_bio()Mike Snitzer1-7/+20
clone_bio() now checks if bio_integrity_clone() returned an error rather than just drop it on the floor. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm thin metadata: don't issue prefetches if a transaction abort has failedJoe Thornber1-1/+4
If a transaction abort has failed then we can no longer use the metadata device. Typically this happens if the superblock is unreadable. This fix addresses a crash seen during metadata device failure testing. Fixes: 8a01a6af75 ("dm thin: prefetch missing metadata pages") Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm snapshot: disallow the COW and origin devices from being identicalDingXiang2-12/+33
Otherwise loading a "snapshot" table using the same device for the origin and COW devices, e.g.: echo "0 20971520 snapshot 253:3 253:3 P 8" | dmsetup create snap will trigger: BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 [ 1958.979934] IP: [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot] [ 1958.989655] PGD 0 [ 1958.991903] Oops: 0000 [#1] SMP ... [ 1959.059647] CPU: 9 PID: 3556 Comm: dmsetup Tainted: G IO 4.5.0-rc5.snitm+ #150 ... [ 1959.083517] task: ffff8800b9660c80 ti: ffff88032a954000 task.ti: ffff88032a954000 [ 1959.091865] RIP: 0010:[<ffffffffa040efba>] [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot] [ 1959.104295] RSP: 0018:ffff88032a957b30 EFLAGS: 00010246 [ 1959.110219] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000001 [ 1959.118180] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff880329334a00 [ 1959.126141] RBP: ffff88032a957b50 R08: 0000000000000000 R09: 0000000000000001 [ 1959.134102] R10: 000000000000000a R11: f000000000000000 R12: ffff880330884d80 [ 1959.142061] R13: 0000000000000008 R14: ffffc90001c13088 R15: ffff880330884d80 [ 1959.150021] FS: 00007f8926ba3840(0000) GS:ffff880333440000(0000) knlGS:0000000000000000 [ 1959.159047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1959.165456] CR2: 0000000000000098 CR3: 000000032f48b000 CR4: 00000000000006e0 [ 1959.173415] Stack: [ 1959.175656] ffffc90001c13040 ffff880329334a00 ffff880330884ed0 ffff88032a957bdc [ 1959.183946] ffff88032a957bb8 ffffffffa040f225 ffff880329334a30 ffff880300000000 [ 1959.192233] ffffffffa04133e0 ffff880329334b30 0000000830884d58 00000000569c58cf [ 1959.200521] Call Trace: [ 1959.203248] [<ffffffffa040f225>] dm_exception_store_create+0x1d5/0x240 [dm_snapshot] [ 1959.211986] [<ffffffffa040d310>] snapshot_ctr+0x140/0x630 [dm_snapshot] [ 1959.219469] [<ffffffffa0005c44>] ? dm_split_args+0x64/0x150 [dm_mod] [ 1959.226656] [<ffffffffa0005ea7>] dm_table_add_target+0x177/0x440 [dm_mod] [ 1959.234328] [<ffffffffa0009203>] table_load+0x143/0x370 [dm_mod] [ 1959.241129] [<ffffffffa00090c0>] ? retrieve_status+0x1b0/0x1b0 [dm_mod] [ 1959.248607] [<ffffffffa0009e35>] ctl_ioctl+0x255/0x4d0 [dm_mod] [ 1959.255307] [<ffffffff813304e2>] ? memzero_explicit+0x12/0x20 [ 1959.261816] [<ffffffffa000a0c3>] dm_ctl_ioctl+0x13/0x20 [dm_mod] [ 1959.268615] [<ffffffff81215eb6>] do_vfs_ioctl+0xa6/0x5c0 [ 1959.274637] [<ffffffff81120d2f>] ? __audit_syscall_entry+0xaf/0x100 [ 1959.281726] [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70 [ 1959.288814] [<ffffffff81216449>] SyS_ioctl+0x79/0x90 [ 1959.294450] [<ffffffff8167e4ae>] entry_SYSCALL_64_fastpath+0x12/0x71 ... [ 1959.323277] RIP [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot] [ 1959.333090] RSP <ffff88032a957b30> [ 1959.336978] CR2: 0000000000000098 [ 1959.344121] ---[ end trace b049991ccad1169e ]--- Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1195899 Cc: stable@vger.kernel.org Signed-off-by: Ding Xiang <dingxiang@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm cache: make the 'mq' policy an alias for 'smq'Joe Thornber4-1492/+85
smq seems to be performing better than the old mq policy in all situations, as well as using a quarter of the memory. Make 'mq' an alias for 'smq' when choosing a cache policy. The tunables that were present for the old mq are faked, and have no effect. mq should be considered deprecated now. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm: drop unnecessary assignment of md->queueBob Liu1-6/+1
md->queue and q are the same thing in dm_old_init_request_queue() and dm_mq_init_request_queue(). Also drop the temporary 'struct request_queue *q' in dm_old_init_request_queue(). Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm: reorder 'struct mapped_device' members to fix alignment and holesMike Snitzer1-16/+18
Saves 16 bytes by eliminating 4 4byte holes but more importantly: numerous members that crossed cachelines were fixed. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm: remove dummy definition of 'struct dm_table'Mike Snitzer1-11/+3
Change the map pointer in 'struct mapped_device' from 'struct dm_table __rcu *' to 'void __rcu *' to avoid the need for the dummy definition. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm: add 'dm_numa_node' module parameterMike Snitzer1-8/+44
Allows user to control which NUMA node the memory for DM device structures (e.g. mapped_device, request_queue, gendisk, blk_mq_tag_set) is allocated from. Defaults to NUMA_NO_NODE (-1). Allowable range is from -1 until the last online NUMA node id. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm thin metadata: remove needless newline from subtree_dec() DMERR messageMike Snitzer1-1/+1
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm mpath: cleanup reinstate_path() et al based on code reviewMike Snitzer1-6/+2
fail_path() will print a "Failing path ..." message but reinstate_path() doesn't print a "Reinstating path ...". Add that message to reinstate_path() to add symmetry and aid system debugging. Remove reinstate_path()'s check for the path_selector providing .reinstate_path hook. All path selectors provide this and any future ones must too. activate_path() calls pg_init_done() with SCSI_DH_DEV_OFFLINED but pg_init_done() doesn't expicitly handle it in its swicth statement. Add SCSI_DH_DEV_OFFLINED to the default case. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-09md/raid5: output stripe state for debugShaohua Li1-4/+6
Neil recently fixed an obscure race in break_stripe_batch_list. Debug would be quite convenient if we know the stripe state. This is what this patch does. Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-09md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_listNeilBrown1-1/+1
break_stripe_batch_list breaks up a batch and copies some flags from the batch head to the members, preserving others. It doesn't preserve or copy STRIPE_PREREAD_ACTIVE. This is not normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a stripe_head is added to a batch, and is not set on stripe_heads already in a batch. However there is no locking to ensure one thread doesn't set the flag after it has just been cleared in another. This does occasionally happen. md/raid5 maintains a count of the number of stripe_heads with STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes. When break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently this could becomes incorrect and will never again return to zero. md/raid5 delays the handling of some stripe_heads until preread_active_stripes becomes zero. So when the above mention race happens, those stripe_heads become blocked and never progress, resulting is write to the array handing. So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE in the members of a batch. URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741 URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153 URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz Reported-by: Martin Svec <martin.svec@zoner.cz> (and others) Tested-by: Tom Weber <linux@junkyard.4t2.com> Fixes: 1b956f7a8f9a ("md/raid5: be more selective about distributing flags across batch.") Cc: stable@vger.kernel.org (v4.1 and later) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-08bcache: fix cache_set_flush() NULL pointer dereference on OOMEric Wheeler1-0/+3
When bch_cache_set_alloc() fails to kzalloc the cache_set, the asyncronous closure handling tries to dereference a cache_set that hadn't yet been allocated inside of cache_set_flush() which is called by __cache_set_unregister() during cleanup. This appears to happen only during an OOM condition on bcache_register. Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: stable@vger.kernel.org
2016-03-08bcache: cleaned up error handling around register_cache()Eric Wheeler1-12/+22
Fix null pointer dereference by changing register_cache() to return an int instead of being void. This allows it to return -ENOMEM or -ENODEV and enables upper layers to handle the OOM case without NULL pointer issues. See this thread: http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3521 Fixes this error: gargamel:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register bcache: register_cache() error opening sdh2: cannot allocate memory BUG: unable to handle kernel NULL pointer dereference at 00000000000009b8 IP: [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache] PGD 120dff067 PUD 1119a3067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: veth ip6table_filter ip6_tables (...) CPU: 4 PID: 3371 Comm: kworker/4:3 Not tainted 4.4.2-amd64-i915-volpreempt-20160213bc1 #3 Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 Workqueue: events cache_set_flush [bcache] task: ffff88020d5dc280 ti: ffff88020b6f8000 task.ti: ffff88020b6f8000 RIP: 0010:[<ffffffffc05a7e8d>] [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache] Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Tested-by: Marc MERLIN <marc@merlins.org> Cc: <stable@vger.kernel.org>
2016-03-08bcache: fix race of writeback thread starting before complete initializationEric Wheeler1-1/+8
The bch_writeback_thread might BUG_ON in read_dirty() if dc->sb==BDEV_STATE_DIRTY and bch_sectors_dirty_init has not yet completed its related initialization. This patch downs the dc->writeback_lock until after initialization is complete, thus preventing bch_writeback_thread from proceeding prematurely. See this thread: http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3453 Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Tested-by: Marc MERLIN <marc@merlins.org> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-07md/bitmap: remove redundant checkEric Engestrom1-2/+1
daemon_sleep is an unsigned, so testing if it's 0 or less than 1 does the same thing. Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26MD: warn for potential deadlockShaohua Li1-0/+1
The personality thread shouldn't call mddev_suspend(). Because mddev_suspend() will for all IO finish, but IO is handled in personality thread, so this could cause deadlock. To trigger this early, add a warning if mddev_suspend() is called from personality thread. Suggested-by: NeilBrown <neilb@suse.com> Cc: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26md: Drop sending a change uevent when stoppingSebastian Parschauer1-1/+0
When stopping an MD device, then its device node /dev/mdX may still exist afterwards or it is recreated by udev. The next open() call can lead to creation of an inoperable MD device. The reason for this is that a change event (KOBJ_CHANGE) is sent to udev which races against the remove event (KOBJ_REMOVE) from md_free(). So drop sending the change event. A change is likely also required in mdadm as many versions send the change event to udev as well. Neil mentioned the change event is a workaround for old kernel Commit: 934d9c23b4c7 ("md: destroy partitions and notify udev when md array is stopped.") new mdadm can handle device remove now, so this isn't required any more. Cc: NeilBrown <neilb@suse.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26RAID5: revert e9e4c377e2f563 to fix a livelockShaohua Li2-20/+9
Revert commit e9e4c377e2f563(md/raid5: per hash value and exclusive wait_for_stripe) The problem is raid5_get_active_stripe waits on conf->wait_for_stripe[hash]. Assume hash is 0. My test release stripes in this order: - release all stripes with hash 0 - raid5_get_active_stripe still sleeps since active_stripes > max_nr_stripes * 3 / 4 - release all stripes with hash other than 0. active_stripes becomes 0 - raid5_get_active_stripe still sleeps, since nobody wakes up wait_for_stripe[0] The system live locks. The problem is active_stripes isn't a per-hash count. Revert the patch makes the live lock go away. Cc: stable@vger.kernel.org (v4.2+) Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26RAID5: check_reshape() shouldn't call mddev_suspendShaohua Li2-0/+20
check_reshape() is called from raid5d thread. raid5d thread shouldn't call mddev_suspend(), because mddev_suspend() waits for all IO finish but IO is handled in raid5d thread, we could easily deadlock here. This issue is introduced by 738a273 ("md/raid5: fix allocation of 'scribble' array.") Cc: stable@vger.kernel.org (v4.1+) Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-25md/raid5: Compare apples to apples (or sectors to sectors)Jes Sorensen1-2/+2
'max_discard_sectors' is in sectors, while 'stripe' is in bytes. This fixes the problem where DISCARD would get disabled on some larger RAID5 configurations (6 or more drives in my testing), while it worked as expected with smaller configurations. Fixes: 620125f2bf8 ("MD: raid5 trim support") Cc: stable@vger.kernel.org v3.7+ Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-22dm mpath: remove __pgpath_busy forward declaration, rename to pgpath_busyMike Snitzer1-4/+2
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: switch from 'unsigned' to 'bool' for flags where appropriateMike Snitzer1-50/+51
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm round robin: use percpu 'repeat_count' and 'current_path'Mike Snitzer1-12/+48
Now that dm-mpath core is lockless in the per-IO fast path it is critical, for performance, to have the .select_path hook (rr_select_path) also be as lockless as possible. The new percpu members of 'struct selector' allow for lockless support of 'repeat_count' governed repeat use of a previously selected path. If a path fails while it is 'current_path' the worst case is concurrent IO might be mapped to the failed path until the .fail_path hook (rr_fail_path) is called. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm path selector: remove 'repeat_count' return from .select_path hookMike Snitzer5-18/+4
If a path selector has any use for a repeat_count it should be handled locally and not depend on the dm-mpath core to be concerned with it. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: push path selector locking down to path selectorsMike Snitzer3-11/+59
Proper locking of the lists used by the path selectors should be handled within the selectors (relying on dm-mpath.c code's use of the m->lock spinlock was reckless). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: remove repeat_count support from multipath coreMike Snitzer4-10/+24
Preparation for making __multipath_map() avoid taking the m->lock spinlock -- in favor of using RCU locking. repeat_count was primarily for bio-based DM multipath's benefit. There is really no need for it anymore now that DM multipath is request-based. As such, repeat_count > 1 is no longer honored and a warning is displayed if the user attempts to use a value > 1. This is a temporary change for the round-robin path-selector (as a later commit will restore its support for repeat_count > 1). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: remove unnecessary casts in front of ti->privateMike Snitzer1-5/+5
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: use blk_mq_alloc_request() and blk_mq_free_request() directlyMike Snitzer1-3/+4
There isn't any need to support both old .request_fn and blk-mq paths in the blk-mq specific portion of __multipath_map(). Call blk_mq_alloc_request() directly rather than use blk_get_request(). Similarly, call blk_mq_free_request(), rather than blk_put_request(), in multipath_release_clone(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>