aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/md/md.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-05-19Merge tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds1-33/+53
Pull MD updates from Shaohua Li: "Several patches from Guoqing fixing md-cluster bugs and several patches from Heinz fixing dm-raid bugs" * tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md-cluster: check the return value of process_recvd_msg md-cluster: gather resync infos and enable recv_thread after bitmap is ready md: set MD_CHANGE_PENDING in a atomic region md: raid5: add prerequisite to run underneath dm-raid md: raid10: add prerequisite to run underneath dm-raid md: md.c: fix oops in mddev_suspend for raid0 md-cluster: fix ifnullfree.cocci warnings md-cluster/bitmap: unplug bitmap to sync dirty pages to disk md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit md-cluster/bitmap: fix wrong calcuation of offset md-cluster: sync bitmap when node received RESYNCING msg md-cluster: always setup in-memory bitmap md-cluster: wakeup thread if activated a spare disk md-cluster: change array_sectors and update size are not supported md-cluster: fix locking when node joins cluster during message broadcast md-cluster: unregister thread if err happened md-cluster: wake up thread to continue recovery md-cluser: make resync_finish only called after pers->sync_request md-cluster: change resync lock from asynchronous to synchronous
2016-05-17Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block driver updates from Jens Axboe: "On top of the core pull request, this is the drivers pull request for this merge window. This contains: - Switch drivers to the new write back cache API, and kill off the flush flags. From me. - Kill the discard support for the STEC pci-e flash driver. It's trivially broken, and apparently unmaintained, so it's safer to just remove it. From Jeff Moyer. - A set of lightnvm updates from the usual suspects (Matias/Javier, and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei Tao. - A set of updates for NVMe: - Turn the controller state management into a proper state machine. From Christoph. - Shuffling of code in preparation for NVMe-over-fabrics, also from Christoph. - Cleanup of the command prep part from Ming Lin. - Rewrite of the discard support from Ming Lin. - Deadlock fix for namespace removal from Ming Lin. - Use the now exported blk-mq tag helper for IO termination. From Sagi. - Various little fixes from Christoph, Guilherme, Keith, Ming Lin, Wang Sheng-Hui. - Convert mtip32xx to use the now exported blk-mq tag iter function, from Keith" * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits) lightnvm: reserved space calculation incorrect lightnvm: rename nr_pages to nr_ppas on nvm_rq lightnvm: add is_cached entry to struct ppa_addr lightnvm: expose gennvm_mark_blk to targets lightnvm: remove mgt targets on mgt removal lightnvm: pass dma address to hardware rather than pointer lightnvm: do not assume sequential lun alloc. nvme/lightnvm: Log using the ctrl named device lightnvm: rename dma helper functions lightnvm: enable metadata to be sent to device lightnvm: do not free unused metadata on rrpc lightnvm: fix out of bound ppa lun id on bb tbl lightnvm: refactor set_bb_tbl for accepting ppa list lightnvm: move responsibility for bad blk mgmt to target lightnvm: make nvm_set_rqd_ppalist() aware of vblks lightnvm: remove struct factory_blks lightnvm: refactor device ops->get_bb_tbl() lightnvm: introduce nvm_for_each_lun_ppa() macro lightnvm: refactor dev->online_target to global nvm_targets lightnvm: rename nvm_targets to nvm_tgt_type ...
2016-05-09md: set MD_CHANGE_PENDING in a atomic regionGuoqing Jiang1-13/+14
Some code waits for a metadata update by: 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN) 2. setting MD_CHANGE_PENDING and waking the management thread 3. waiting for MD_CHANGE_PENDING to be cleared If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING are atomicial, and bit_clear_unless (suggested by Neil) is introduced for the purpose. Cc: Martin Kepplinger <martink@posteo.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09md: md.c: fix oops in mddev_suspend for raid0Heinz Mauelshagen1-1/+1
Introduced by upstream commit 70d9798b95562abac005d4ba71d28820f9a201eb The raid0 personality does not create mddev->thread as oposed to other personalities leading to its unconditional access in mddev_suspend() causing an oops. Patch checks for mddev->thread in order to keep the intention of aforementioned commit. Fixes: 70d9798b9556 ("MD: warn for potential deadlock") Cc: stable@vger.kernel.org (4.5+) Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04md-cluster: wakeup thread if activated a spare diskGuoqing Jiang1-0/+5
When a device is re-added, it will ultimately need to be activated and that happens in md_check_recovery, so we need to set MD_RECOVERY_NEEDED right after remove_and_add_spares. A specifical issue without the change is that when one node perform fail/remove/readd on a disk, but slave nodes could not add the disk back to array as expected (added as missed instead of in sync). So give slave nodes a chance to do resync. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04md-cluster: change array_sectors and update size are not supportedGuoqing Jiang1-0/+8
Currently, some features are not supported yet, such as change array_sectors and update size, so return EINVAL for them and listed it in document. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04md-cluser: make resync_finish only called after pers->sync_requestGuoqing Jiang1-12/+13
It is not reasonable that cluster raid to release resync lock before the last pers->sync_request has finished. As the metadata will be changed when node performs resync, we need to inform other nodes to update metadata, so the MD_CHANGE_PENDING flag is set before finish resync. Then metadata_update_finish is move ahead to ensure that METADATA_UPDATED msg is sent before finish resync, and metadata_update_start need to be run after "repeat:" label accordingly. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04md-cluster: change resync lock from asynchronous to synchronousGuoqing Jiang1-9/+14
If multiple nodes choose to attempt do resync at the same time they need to be serialized so they don't duplicate effort. This serialization is done by locking the 'resync' DLM lock. Currently if a node cannot get the lock immediately it doesn't request notification when the lock becomes available (i.e. DLM_LKF_NOQUEUE is set), so it may not reliably find out when it is safe to try again. Rather than trying to arrange an async wake-up when the lock becomes available, switch to using synchronous locking - this is a lot easier to think about. As it is not permitted to block in the 'raid1d' thread, move the locking to the resync thread. So the rsync thread is forked immediately, but it blocks until the resync lock is available. Once the lock is locked it checks again if any resync action is needed. A particular symptom of the current problem is that a node can get stuck with "resync=pending" indefinitely. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-25MD: make bio mergeableShaohua Li1-0/+2
blk_queue_split marks bio unmergeable, which makes sense for normal bio. But if dispatching the bio to underlayer disk, the blk_queue_split checks are invalid, hence it's possible the bio becomes mergeable. In the reported bug, this bug causes trim against raid0 performance slash https://bugzilla.kernel.org/show_bug.cgi?id=117051 Reported-and-tested-by: Park Ju Hyung <qkrwngud825@gmail.com> Fixes: 6ac45aeb6bca(block: avoid to merge splitted bio) Cc: stable@vger.kernel.org (v4.3+) Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Reviewed-by: Jens Axboe <axboe@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-12md: update to using blk_queue_write_cache()Jens Axboe1-1/+1
Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-03-31MD: add rdev reference for super writeShaohua Li1-0/+3
Xiao Ni reported below crash: [26396.335146] BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8 [26396.342990] IP: [<ffffffffa0425b00>] super_written+0x20/0x80 [md_mod] [26396.349449] PGD 0 [26396.351468] Oops: 0002 [#1] SMP [26396.354898] Modules linked in: ext4 mbcache jbd2 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_td [26396.408404] CPU: 5 PID: 3261 Comm: loop0 Not tainted 4.5.0 #1 [26396.414140] Hardware name: Dell Inc. PowerEdge R715/0G2DP3, BIOS 3.2.2 09/15/2014 [26396.421608] task: ffff8808339be680 ti: ffff8808365f4000 task.ti: ffff8808365f4000 [26396.429074] RIP: 0010:[<ffffffffa0425b00>] [<ffffffffa0425b00>] super_written+0x20/0x80 [md_mod] [26396.437952] RSP: 0018:ffff8808365f7c38 EFLAGS: 00010046 [26396.443252] RAX: ffffffffa0425ae0 RBX: ffff8804336a7900 RCX: ffffe8f9f7b41198 [26396.450371] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8804336a7900 [26396.457489] RBP: ffff8808365f7c50 R08: 0000000000000005 R09: 00001801e02ce3d7 [26396.464608] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [26396.471728] R13: ffff8808338d9a00 R14: 0000000000000000 R15: ffff880833f9fe00 [26396.478849] FS: 00007f9e5066d740(0000) GS:ffff880237b40000(0000) knlGS:0000000000000000 [26396.486922] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [26396.492656] CR2: 00000000000002a8 CR3: 00000000019ea000 CR4: 00000000000006e0 [26396.499775] Stack: [26396.501781] ffff8804336a7900 0000000000000000 0000000000000000 ffff8808365f7c68 [26396.509199] ffffffff81308cd0 ffff8804336a7900 ffff8808365f7ca8 ffffffff81310637 [26396.516618] 00000000a0233a00 ffff880833f9fe00 0000000000000000 ffff880833fb0000 [26396.524038] Call Trace: [26396.526485] [<ffffffff81308cd0>] bio_endio+0x40/0x60 [26396.531529] [<ffffffff81310637>] blk_update_request+0x87/0x320 [26396.537439] [<ffffffff8131a20a>] blk_mq_end_request+0x1a/0x70 [26396.543261] [<ffffffff81313889>] blk_flush_complete_seq+0xd9/0x2a0 [26396.549517] [<ffffffff81313ccf>] flush_end_io+0x15f/0x240 [26396.554993] [<ffffffff8131a22a>] blk_mq_end_request+0x3a/0x70 [26396.560815] [<ffffffff8131a314>] __blk_mq_complete_request+0xb4/0xe0 [26396.567246] [<ffffffff8131a35c>] blk_mq_complete_request+0x1c/0x20 [26396.573506] [<ffffffffa04182df>] loop_queue_work+0x6f/0x72c [loop] [26396.579764] [<ffffffff81697844>] ? __schedule+0x2b4/0x8f0 [26396.585242] [<ffffffff810a7812>] kthread_worker_fn+0x52/0x170 [26396.591065] [<ffffffff810a77c0>] ? kthread_create_on_node+0x1a0/0x1a0 [26396.597582] [<ffffffff810a7238>] kthread+0xd8/0xf0 [26396.602453] [<ffffffff810a7160>] ? kthread_park+0x60/0x60 [26396.607929] [<ffffffff8169bdcf>] ret_from_fork+0x3f/0x70 [26396.613319] [<ffffffff810a7160>] ? kthread_park+0x60/0x60 md_super_write() and corresponding md_super_wait() generally are called with reconfig_mutex locked, which prevents disk disappears. There is one case this rule is broken. write_sb_page of bitmap.c doesn't hold the mutex. next_active_rdev does increase rdev reference, but it decreases the reference too early (eg, before IO finish). disk can disappear at the window. We unconditionally increase rdev reference in md_super_write() to avoid the race. Reported-and-tested-by: Xiao Ni <xni@redhat.com> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-31md: fix a trivial typo in commentsWei Fang1-1/+1
Fix a trivial typo in md_ioctl(). Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26MD: warn for potential deadlockShaohua Li1-0/+1
The personality thread shouldn't call mddev_suspend(). Because mddev_suspend() will for all IO finish, but IO is handled in personality thread, so this could cause deadlock. To trigger this early, add a warning if mddev_suspend() is called from personality thread. Suggested-by: NeilBrown <neilb@suse.com> Cc: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26md: Drop sending a change uevent when stoppingSebastian Parschauer1-1/+0
When stopping an MD device, then its device node /dev/mdX may still exist afterwards or it is recreated by udev. The next open() call can lead to creation of an inoperable MD device. The reason for this is that a change event (KOBJ_CHANGE) is sent to udev which races against the remove event (KOBJ_REMOVE) from md_free(). So drop sending the change event. A change is likely also required in mdadm as many versions send the change event to udev as well. Neil mentioned the change event is a workaround for old kernel Commit: 934d9c23b4c7 ("md: destroy partitions and notify udev when md array is stopped.") new mdadm can handle device remove now, so this isn't required any more. Cc: NeilBrown <neilb@suse.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-01-15Merge tag 'md/4.5' of git://neil.brown.name/mdLinus Torvalds1-66/+105
Pull md updates from Neil Brown: "Mostly clustered-raid1 and raid5 journal updates. one Y2038 fix and other minor stuff. One patch removes me from the MAINTAINERS file and adds a record of my md maintainership to Credits" Many thanks to Neil, who has been around for a _looong_ time. * tag 'md/4.5' of git://neil.brown.name/md: (26 commits) md/raid: only permit hot-add of compatible integrity profiles Remove myself as MD Maintainer, and add to Credits. raid5-cache: handle journal hotadd in quiesce MD: add journal with array suspended md: set MD_HAS_JOURNAL in correct places md: Remove 'ready' field from mddev. md: remove unnecesary md_new_event_inintr raid5: allow r5l_io_unit allocations to fail raid5-cache: use a mempool for the metadata block raid5-cache: use a bio_set raid5-cache: add journal hot add/remove support drivers: md: use ktime_get_real_seconds() md: avoid warning for 32-bit sector_t raid5-cache: free meta_page earlier raid5-cache: simplify r5l_move_io_unit_list md: update comment for md_allow_write md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY md-cluster: Protect communication with mutexes md-cluster: Defer MD reloading to mddev->thread md-cluster: update the documentation ...
2016-01-13Merge tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimmLinus Torvalds1-491/+25
Pull libnvdimm updates from Dan Williams: "The bulk of this has appeared in -next and independently received a build success notification from the kbuild robot. The 'for-4.5/block- dax' topic branch was rebased over the weekend to drop the "block device end-of-life" rework that Al would like to see re-implemented with a notifier, and to address bug reports against the badblocks integration. There is pending feedback against "libnvdimm: Add a poison list and export badblocks" received last week. Linda identified some localized fixups that we will handle incrementally. Summary: - Media error handling: The 'badblocks' implementation that originated in md-raid is up-levelled to a generic capability of a block device. This initial implementation is limited to being consulted in the pmem block-i/o path. Later, 'badblocks' will be consulted when creating dax mappings. - Raw block device dax: For virtualization and other cases that want large contiguous mappings of persistent memory, add the capability to dax-mmap a block device directly. - Increased /dev/mem restrictions: Add an option to treat all io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access while a driver is actively using an address range. This behavior is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be overridden by the existing "iomem=relaxed" kernel command line option. - Miscellaneous fixes include a 'pfn'-device huge page alignment fix, block device shutdown crash fix, and other small libnvdimm fixes" * tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits) block: kill disk_{check|set|clear|alloc}_badblocks libnvdimm, pmem: nvdimm_read_bytes() badblocks support pmem, dax: disable dax in the presence of bad blocks pmem: fail io-requests to known bad blocks libnvdimm: convert to statically allocated badblocks libnvdimm: don't fail init for full badblocks list block, badblocks: introduce devm_init_badblocks block: clarify badblocks lifetime badblocks: rename badblocks_free to badblocks_exit libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h libnvdimm: Add a poison list and export badblocks nfit_test: Enable DSMs for all test NFITs md: convert to use the generic badblocks code block: Add badblock management for gendisks badblocks: Add core badblock management code block: fix del_gendisk() vs blkdev_ioctl crash block: enable dax for raw block devices block: introduce bdev_file_inode() restrict /dev/mem to idle io memory ranges arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug ...
2016-01-14md/raid: only permit hot-add of compatible integrity profilesDan Williams1-12/+16
It is not safe for an integrity profile to be changed while i/o is in-flight in the queue. Prevent adding new disks or otherwise online spares to an array if the device has an incompatible integrity profile. The original change to the blk_integrity_unregister implementation in md, commmit c7bfced9a671 "md: suspend i/o during runtime blk_integrity_unregister" introduced an immediate hang regression. This policy of disallowing changes the integrity profile once one has been established is shared with DM. Here is an abbreviated log from a test run that: 1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [ 59.076127] 2/ Tries to add an integrity-disabled device (pmem1m) [ 90.489209] 3/ Retries with an integrity-enabled device (pmem1s) [ 205.671277] [ 59.076127] md/raid1:md0: active with 1 out of 2 mirrors [ 59.078302] md: data integrity enabled on md0 [..] [ 90.489209] md0: incompatible integrity profile for pmem1m [..] [ 205.671277] md: super_written gets error=-5 [ 205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device. [ 205.677386] md/raid1:md0: Operation continuing on 1 devices. [ 205.683037] RAID1 conf printout: [ 205.684699] --- wd:1 rd:2 [ 205.685972] disk 0, wo:0, o:1, dev:pmem0s [ 205.687562] disk 1, wo:1, o:1, dev:pmem1s [ 205.691717] md: recovery of RAID array md0 Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister") Cc: <stable@vger.kernel.org> Cc: Mike Snitzer <snitzer@redhat.com> Reported-by: NeilBrown <neilb@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14MD: add journal with array suspendedShaohua Li1-1/+6
Hot add journal disk in recovery thread context brings a lot of trouble as IO could be running. Unlike spare disk hot add, adding journal disk with array suspended makes more sense and implmentation is much easier. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14md: set MD_HAS_JOURNAL in correct placesShaohua Li1-4/+5
Set MD_HAS_JOURNAL when a array is loaded or journal is initialized. This is to avoid the flags set too early in journal disk hotadd. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-09badblocks: rename badblocks_free to badblocks_exitDan Williams1-1/+1
For symmetry with badblocks_init() make it clear that this path only destroys incremental allocations of a badblocks instance, and does not free the badblocks instance itself. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09md: convert to use the generic badblocks codeVishal Verma1-491/+25
Retain badblocks as part of rdev, but use the accessor functions from include/linux/badblocks for all manipulation. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-07md: Remove 'ready' field from mddev.NeilBrown1-4/+1
This field is always set in tandem with ->pers, and when it is tested ->pers is also tested. So ->ready is not needed. It was needed once, but code rearrangement and locking changes have removed that needed. Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-07md: remove unnecesary md_new_event_inintrGuoqing Jiang1-10/+1
md_new_event had removed sysfs_notify since 'commit 72a23c211e45 ("Make sure all changes to md/sync_action are notified.")', so we can use md_new_event and delete md_new_event_inintr. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06raid5-cache: add journal hot add/remove supportShaohua Li1-12/+30
Add support for journal disk hot add/remove. Mostly trival checks in md part. The raid5 part is a little tricky. For hot-remove, we can't wait pending write as it's called from raid5d. The wait will cause deadlock. We simplily fail the hot-remove. A hot-remove retry can success eventually since if journal disk is faulty all pending write will be failed and finish. For hot-add, since an array supporting journal but without journal disk will be marked read-only, we are safe to hot add journal without stopping IO (should be read IO, while journal only handles write IO). Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06drivers: md: use ktime_get_real_seconds()Deepa Dinamani1-9/+9
get_seconds() API is not y2038 safe on 32 bit systems and the API is deprecated. Replace it with calls to ktime_get_real_seconds() API instead. Change mddev structure types to time64_t accordingly. 32 bit signed timestamps will overflow in the year 2038. Change the user interface mdu_array_info_s structure timestamps: ctime and utime values used in ioctls GET_ARRAY_INFO and SET_ARRAY_INFO to unsigned int. This will extend the field to last until the year 2106. The long term plan is to get rid of ctime and utime values in this structure as this information can be read from the on-disk meta data directly. Clamp the tim64_t timestamps to positive values with a max of U32_MAX when returning from GET_ARRAY_INFO ioctl to accommodate above changes in the data type of timestamps to unsigned int. v0.90 on disk meta data uses u32 for maintaining time stamps. So this will also last until year 2106. Assumption is that the usage of v0.90 will be deprecated by year 2106. Timestamp fields in the on disk meta data for v1.0 version already use 64 bit data types. Remove the truncation of the bits while writing to or reading from these from the disk. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06md: avoid warning for 32-bit sector_tArnd Bergmann1-4/+6
When CONFIG_LBDAF is not set, sector_t is only 32-bits wide, which means we cannot have devices with more than 2TB, and the code that is trying to handle compatibility support for large devices in md version 0.90 is meaningless but also causes a compile-time warning: drivers/md/md.c: In function 'super_90_load': drivers/md/md.c:1029:19: warning: large integer implicitly truncated to unsigned type [-Woverflow] drivers/md/md.c: In function 'super_90_rdev_size_change': drivers/md/md.c:1323:17: warning: large integer implicitly truncated to unsigned type [-Woverflow] This adds a check for CONFIG_LBDAF to avoid even getting into this code path, and also adds an explicit cast to let the compiler know it doesn't have to warn about the truncation. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06md: update comment for md_allow_writeGuoqing Jiang1-1/+1
MD_CHANGE_CLEAN had been replaced with MD_CHANGE_PENDING after commit 070dc6 ("md: resolve confusion of MD_CHANGE_CLEAN"), so make the change accordingly. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06md-cluster: Defer MD reloading to mddev->threadGuoqing Jiang1-0/+4
Reloading of superblock must be performed under reconfig_mutex. However, this cannot be done with md_reload_sb because it would deadlock with the message DLM lock. So, we defer it in md_check_recovery() which is executed by mddev->thread. This introduces a new flag, MD_RELOAD_SB, which if set, will reload the superblock. And good_device_nr is also added to 'struct mddev' which is used to get the num of the good device within cluster raid. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06md-cluster: append some actions when change bitmap from clustered to noneGuoqing Jiang1-0/+13
For clustered raid, we need to do extra actions when change bitmap to none. 1. check if all the bitmap lock could be get or not, if yes then we can continue the change since cluster raid is only active in current node. Otherwise return fail and unlock the related bitmap locks 2. set nodes to 0 and then leave cluster environment. 3. release other nodes's bitmap lock. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06md-cluster: Allow spare devices to be marked as faultyGoldwyn Rodrigues1-1/+0
If a spare device was marked faulty, it would not be reflected in receiving nodes because it would mark it as activated and continue. Continue the operation, so it may be set as faulty. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06md-cluster: Fix the remove sequence with the new MD reload codeGoldwyn Rodrigues1-8/+1
The remove disk message does not need metadata_update_start(), but can be an independent message. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06md-cluster: remove a disk asynchronously from cluster environmentGuoqing Jiang1-0/+12
For cluster raid, if one disk couldn't be reach in one node, then other nodes would receive the REMOVE message for the disk. In receiving node, we can't call md_kick_rdev_from_array to remove the disk from array synchronously since the disk might still be busy in this node. So let's set a ClusterRemove flag on the disk, then let the thread to do the removal job eventually. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-21md: remove check for MD_RECOVERY_NEEDED in action_store.NeilBrown1-4/+7
md currently doesn't allow a 'sync_action' such as 'reshape' to be set while MD_RECOVERY_NEEDED is set. This s a problem, particularly since commit 738a273806ee as that can cause ->check_shape to call mddev_resume() which sets MD_RECOVERY_NEEDED. So by the time we come to start 'reshape' it is very likely that MD_RECOVERY_NEEDED is still set. Testing for this flag is not really needed and is in any case very racy as it can be set at any moment - asynchronously. Any race between setting a sync_action and setting MD_RECOVERY_NEEDED must already be handled properly in some locked code, probably md_check_recovery(), so remove the test here. The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case so we should test it again after getting mddev_lock(). As this fixes a race and a regression which can cause 'reshape' to fail, it is suitable for -stable kernels since 4.1 Reported-by: Xiao Ni <xni@redhat.com> Fixes: 738a273806ee ("md/raid5: fix allocation of 'scribble' array.") Cc: stable@vger.kernel.org (v4.1+) Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18Fix remove_and_add_spares removes drive added as spare in slot_storeGoldwyn Rodrigues1-3/+10
Commit 2910ff17d154baa5eb50e362a91104e831eb2bb6 introduced a regression which would remove a recently added spare via slot_store. Revert part of the patch which touches slot_store() and add the disk directly using pers->hot_add_disk() Fixes: 2910ff17d154 ("md: remove_and_add_spares() to activate specific rdev") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18md: fix bug due to nested suspendMikulas Patocka1-3/+4
The patch c7bfced9a6716ff66c9d61f934bb60af08d4688c committed to 4.4-rc causes crash in LVM test shell/lvchange-raid.sh. The kernel crashes with this BUG, the reason is that we attempt to suspend a device that is already suspended. See also https://bugzilla.redhat.com/show_bug.cgi?id=1283491 This patch fixes the bug by changing functions mddev_suspend and mddev_resume to always nest. The number of nested calls to mddev_nested_suspend is kept in the variable mddev->suspended. [neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend] kernel BUG at drivers/md/md.c:317! CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1 task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000 YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00001000000001000000000000001111 Not tainted r00-03 000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810 r04-07 0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810 r08-11 000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff r12-15 0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4 r16-19 00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001 r20-23 0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1 r24-27 00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000 r28-31 0000000000000001 0000000047014840 0000000047014930 0000000000000001 sr00-03 0000000007040800 0000000000000000 0000000000000000 0000000007040800 sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390 IIR: 03ffe01f ISR: 0000000000000000 IOR: 00000000102b2748 CPU: 3 CR30: 0000000047014000 CR31: 0000000000000000 ORIG_R28: 00000000000000b0 IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod] IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod] RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1] Backtrace: [<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1] [<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid] [<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod] [<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod] [<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod] [<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod] [<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod] [<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558 Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18MD: change journal disk role to disk 0Shaohua Li1-1/+1
Neil pointed out setting journal disk role to raid_disks will confuse reshape if we support reshape eventually. Switching the role to 0 (we should be fine as long as the value >=0) and skip sysfs file creation to avoid error. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-07block: change ->make_request_fn() and users to return a queue cookieJens Axboe1-3/+5
No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-04Merge tag 'md/4.4' of git://neil.brown.name/mdLinus Torvalds1-130/+365
Pull md updates from Neil Brown: "Two major components to this update. 1) The clustered-raid1 support from SUSE is nearly complete. There are a few outstanding issues being worked on. Maybe half a dozen patches will bring this to a usable state. 2) The first stage of journalled-raid5 support from Facebook makes an appearance. With a journal device configured (typically NVRAM or SSD), the "RAID5 write hole" should be closed - a crash during degraded operations cannot result in data corruption. The next stage will be to use the journal as a write-behind cache so that latency can be reduced and in some cases throughput increased by performing more full-stripe writes. * tag 'md/4.4' of git://neil.brown.name/md: (66 commits) MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW MD: set journal disk ->raid_disk MD: kick out journal disk if it's not fresh raid5-cache: start raid5 readonly if journal is missing MD: add new bit to indicate raid array with journal raid5-cache: IO error handling raid5: journal disk can't be removed raid5-cache: add trim support for log MD: fix info output for journal disk raid5-cache: use bio chaining raid5-cache: small log->seq cleanup raid5-cache: new helper: r5_reserve_log_entry raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta raid5-cache: take rdev->data_offset into account early on raid5-cache: refactor bio allocation raid5-cache: clean up r5l_get_meta raid5-cache: simplify state machine when caches flushes are not needed raid5-cache: factor out a helper to run all stripes for an I/O unit raid5-cache: rename flushed_ios to finished_ios raid5-cache: free I/O units earlier ...
2015-11-04Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-blockLinus Torvalds1-7/+4
Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk
2015-11-01MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RWSong Liu1-2/+25
When RAID-4/5/6 array suffers from missing journal device, we put the array in read only state. We should not allow trasition to read-write states (clean and active) before replacing journal device. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01MD: set journal disk ->raid_diskShaohua Li1-5/+22
Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of 0, because we already have a disk with ->raid_disk 0 and this causes sysfs entry creation conflict. A lot of places assumes disk with ->raid_disk >=0 is normal raid disk, so we add check for journal disk. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01MD: kick out journal disk if it's not freshSong Liu1-1/+2
When journal disk is faulty and we are reassemabling the raid array, the journal disk is old. We don't allow the journal disk added to the raid array. Since journal disk is missing in the array, the raid5 will mark the array readonly. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01MD: add new bit to indicate raid array with journalSong Liu1-3/+7
If a raid array has journal feature bit set, add a new bit to indicate this. If the array is started without journal disk existing, we know there is something wrong. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01MD: fix info output for journal diskShaohua Li1-5/+4
journal disk can be faulty. The Journal and Faulty aren't exclusive with each other. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01md: show journal for journal disk in disk state sysfsShaohua Li1-0/+4
Journal disk state sysfs entry should indicate it's journal Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01skip match_mddev_units check for special rolesSong Liu1-2/+12
match_mddev_units is used to check whether 2 RAID arrays share same disk(s). Arrays that share disk(s) will not do resync at the same time for better performance (fewer HDD seek). However, this check should not apply to Spare, Faulty, and Journal disks, as they do not paticipate in resync. In this patch, match_mddev_units skips check for disks with flag "Faulty" or "Journal" or raid_disk < 0. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01md: skip resync for raid array with journalShaohua Li1-0/+4
If a raid array has journal, the journal can guarantee the consistency, we can skip resync after a unclean shutdown. The exception is raid creation or user initiated resync, which we still do a raid resync. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-31Revert "md: allow a partially recovered device to be hot-added to an array."NeilBrown1-2/+1
This reverts commit 7eb418851f3278de67126ea0c427641ab4792c57. This commit is poorly justified, I can find not discusison in email, and it clearly causes a problem. If a device which is being recovered fails and is subsequently re-added to an array, there could easily have been changes to the array *before* the point where the recovery was up to. So the recovery must start again from the beginning. If a spare is being recovered and fails, then when it is re-added we really should do a bitmap-based recovery up to the recovery-offset, and then a full recovery from there. Before this reversion, we only did the "full recovery from there" which is not corect. After this reversion with will do a full recovery from the start, which is safer but not ideal. It will be left to a future patch to arrange the two different styles of recovery. Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com> Signed-off-by: NeilBrown <neilb@suse.com> Cc: stable@vger.kernel.org (3.14+) Fixes: 7eb418851f32 ("md: allow a partially recovered device to be hot-added to an array.")
2015-10-24md: override md superblock recovery_offset for journal deviceShaohua Li1-0/+6
Journal device stores data in a log structure. We need record the log start. Here we override md superblock recovery_offset for this purpose. This field of a journal device is meaningless otherwise. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24MD: add a new disk role to present write journal deviceSong Liu1-2/+21
Next patches will use a disk as raid5/6 journaling. We need a new disk role to present the journal device and add MD_FEATURE_JOURNAL to feature_map for backward compability. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>