aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2019-05-07dm integrity: don't report unused optionsMikulas Patocka1-3/+7
If we are not journaling, don't report journaling options in the table status. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07dm integrity: don't check null pointer before kvfree and vfreeMikulas Patocka1-4/+2
The functions kfree, vfree and kvfree do nothing if we pass a NULL pointer to them. So we don't need to test the pointer for NULL. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07dm integrity: correctly calculate the size of metadata areaMikulas Patocka1-2/+2
When we use separate devices for data and metadata, dm-integrity would incorrectly calculate the size of the metadata device as if it had 512-byte block size - and it would refuse activation with larger block size and smaller metadata device. Fix this so that it takes actual block size into account, which fixes the following reported issue: https://gitlab.com/cryptsetup/cryptsetup/issues/450 Fixes: 356d9d52e122 ("dm integrity: allow separate metadata device") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07dm dust: Make dm_dust_init and dm_dust_exit staticYueHaibing1-2/+2
Fix sparse warnings: drivers/md/dm-dust.c:495:12: warning: symbol 'dm_dust_init' was not declared. Should it be static? drivers/md/dm-dust.c:505:13: warning: symbol 'dm_dust_exit' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07dm dust: remove redundant unsigned comparison to less than zeroColin Ian King1-1/+1
Variable block is an unsigned long long hence the less than zero comparison is always false, hence it is redundant and can be removed. Addresses-Coverity: ("Unsigned compared against 0") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-30dm mpath: always free attached_handler_name in parse_path()Martin Wilck1-1/+1
Commit b592211c33f7 ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer") fixed a memory leak for the case where setup_scsi_dh() returns failure. But setup_scsi_dh may return success and not "use" attached_handler_name if the retain_attached_hwhandler flag is not set on the map. As setup_scsi_sh properly "steals" the pointer by nullifying it, freeing it unconditionally in parse_path() is safe. Fixes: b592211c33f7 ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer") Cc: stable@vger.kernel.org Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-30dm init: fix max devices/targets checksHelen Koike1-4/+4
dm-init should allow up to DM_MAX_{DEVICES,TARGETS} for devices/targets, and not DM_MAX_{DEVICES,TARGETS} - 1. Fix the checks and also fix the error message when the number of devices is surpassed. Fixes: 6bbc923dfcf57d ("dm: add support to directly boot to a mapped device") Cc: stable@vger.kernel.org Signed-off-by: Helen Koike <helen.koike@collabora.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-30dm: add dust targetBryan Gurney4-0/+797
Add the dm-dust target, which simulates the behavior of bad sectors at arbitrary locations, and the ability to enable the emulation of the read failures at an arbitrary time. This target behaves similarly to a linear target. At a given time, the user can send a message to the target to start failing read requests on specific blocks. When the failure behavior is enabled, reads of blocks configured "bad" will fail with EIO. Writes of blocks configured "bad" will result in the following: 1. Remove the block from the "bad block list". 2. Successfully complete the write. After this point, the block will successfully contain the written data, and will service reads and writes normally. This emulates the behavior of a "remapped sector" on a hard disk drive. dm-dust provides logging of which blocks have been added or removed to the "bad block list", as well as logging when a block has been removed from the bad block list. These messages can be used alongside the messages from the driver using a dm-dust device to analyze the driver's behavior when a read fails at a given time. (This logging can be reduced via a "quiet" mode, if desired.) NOTE: If the block size is larger than 512 bytes, only the first sector of each "dust block" is detected. Placing a limiting layer above a dust target, to limit the minimum I/O size to the dust block size, will ensure proper emulation of the given large block size. Signed-off-by: Bryan Gurney <bgurney@redhat.com> Co-developed-by: Joe Shimkus <jshimkus@redhat.com> Co-developed-by: John Dorminy <jdorminy@redhat.com> Co-developed-by: John Pittman <jpittman@redhat.com> Co-developed-by: Thomas Jaskiewicz <tjaskiew@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26dm writecache: avoid unnecessary lookups in writecache_find_entry()Mikulas Patocka1-6/+5
This is a small optimization in writecache_find_entry(). If we go past the condition "if (unlikely(!node))", we can be certain that there is no entry in the tree that has the block equal to the "block" variable. Consequently, we can return the next entry directly, we don't need to go to the second part of the function that finds the entry with lowest or highest seq number that matches the "block" variable. Also, add some whitespace and cleanup needless braces. Suggested-by: Huaisheng Ye <yehs1@lenovo.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26dm writecache: remove unused member page_offset in writeback_structHuaisheng Ye1-2/+0
The stucture member page_offset in writeback_struct never has been used actually. Remove it. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26dm delay: fix a crash when invalid device is specifiedMikulas Patocka1-1/+2
When the target line contains an invalid device, delay_ctr() will call delay_dtr() with NULL workqueue. Attempting to destroy the NULL workqueue causes a crash. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26dm: only initialize md->dax_dev if CONFIG_DAX_DRIVER is enabledPeng Wang1-4/+2
md->dax_dev defaults to NULL and there is no need to initialize it if CONFIG_DAX_DRIVER is disabled. Signed-off-by: Peng Wang <rocking@whu.edu.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-25dm mpath: fix missing call of path selector type->end_ioYufen Yu4-7/+24
After commit 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"), map_request() will requeue the tio when issued clone request return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. Thus, if device driver status is error, a tio may be requeued multiple times until the return value is not DM_MAPIO_REQUEUE. That means type->start_io may be called multiple times, while type->end_io is only called when IO complete. In fact, even without commit 396eaf21ee17, setup_clone() failure can also cause tio requeue and associated missed call to type->end_io. The service-time path selector selects path based on in_flight_size, which is increased by st_start_io() and decreased by st_end_io(). Missed calls to st_end_io() can lead to in_flight_size count error and will cause the selector to make the wrong choice. In addition, queue-length path selector will also be affected. To fix the problem, call type->end_io in ->release_clone_rq before tio requeue. map_info is passed to ->release_clone_rq() for map_request() error path that result in requeue. Fixes: 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") Cc: stable@vger.kernl.org Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm thin metadata: do not write metadata if no changes occurredMike Snitzer1-3/+23
Otherwise, just activating a thin-pool and thin device and then deactivating them will cause the thin-pool metadata to be changed (e.g. superblock written) -- even without any metadata being changed. Add 'in_service' flag to struct dm_pool_metadata and set it in pmd_write_lock() because all on-disk metadata changes must take a write lock of pmd->root_lock. Once 'in_service' is set it is never cleared. __commit_transaction() will return 0 if 'in_service' is not set. dm_pool_commit_metadata() is updated to use __pmd_write_lock() so that it isn't the sole reason for putting a thin-pool in service. Also fix dm_pool_commit_metadata() to open the next transaction if the return from __commit_transaction() is 0. Not seeing why the early return ever made since for a return of 0 given that dm-io's async_io(), as used by bufio, always returns 0. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm thin metadata: add wrappers for managing write locking of metadataMike Snitzer1-44/+64
No functional change, but this prepares to hook off of pmd_write_lock() with additional functionality (as provided in next commit). Suggested-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm thin metadata: check __commit_transaction()'s returnMike Snitzer1-1/+6
Fix __reserve_metadata_snap() to return early if __commit_transaction() fails. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm space map common: zero entire ll_diskMike Snitzer1-0/+2
Otherwise, memory that is allocated (and potentially not previously zeroed) will get written to disk as part of the space maps. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm writecache: add unlikely for returned value of rb_next/prevHuaisheng Ye1-2/+2
In functions writecache_discard() and writecache_find_entry() there is a high probablity that the pointer of structure rb_node won't equal NULL. Add unlikely for the pointer node NULL. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm writecache: remove needless dereferences in __writecache_writeback_pmem()Huaisheng Ye1-6/+6
bio is already available so there is no need to access it in terms of the wb pointer. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm snapshot: Use fine-grained locking schemeNikos Tsironis1-40/+44
Substitute the global locking scheme with a fine grained one, employing the read-write semaphore and the scalable exception tables with per-bucket locks introduced by the previous two commits. Summarizing, we now use a read-write semaphore to protect the mostly read fields of the snapshot structure, e.g., valid, active, etc., and per-bucket bit spinlocks to protect accesses to the complete and pending exception tables. Finally, we use an extra spinlock (pe_allocation_lock) to serialize the allocation of new exceptions by the exception store. This allocation is really fast, so the extra spinlock doesn't hurt the performance. This scheme allows dm-snapshot to scale better, resulting in increased IOPS and reduced latency. Following are some benchmark results using the null_blk device: modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \ queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1 * Benchmark fio_origin_randwrite_throughput_N, from the device mapper test suite [1] (direct IO, random 4K writes to origin device, IO engine libaio): +--------------+-------------+------------+ | # of workers | IOPS Before | IOPS After | +--------------+-------------+------------+ | 1 | 57708 | 66421 | | 2 | 63415 | 77589 | | 4 | 67276 | 98839 | | 8 | 60564 | 109258 | +--------------+-------------+------------+ * Benchmark fio_origin_randwrite_latency_N, from the device mapper test suite [1] (direct IO, random 4K writes to origin device, IO engine psync): +--------------+-----------------------+----------------------+ | # of workers | Latency (usec) Before | Latency (usec) After | +--------------+-----------------------+----------------------+ | 1 | 16.25 | 13.27 | | 2 | 31.65 | 25.08 | | 4 | 55.28 | 41.08 | | 8 | 121.47 | 74.44 | +--------------+-----------------------+----------------------+ * Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper test suite [1] (direct IO, random 4K writes to snapshot device, IO engine libaio): +--------------+-------------+------------+ | # of workers | IOPS Before | IOPS After | +--------------+-------------+------------+ | 1 | 72593 | 84938 | | 2 | 97379 | 134973 | | 4 | 90610 | 143077 | | 8 | 90537 | 180085 | +--------------+-------------+------------+ * Benchmark fio_snapshot_randwrite_latency_N, from the device mapper test suite [1] (direct IO, random 4K writes to snapshot device, IO engine psync): +--------------+-----------------------+----------------------+ | # of workers | Latency (usec) Before | Latency (usec) After | +--------------+-----------------------+----------------------+ | 1 | 12.53 | 10.6 | | 2 | 19.78 | 14.89 | | 4 | 40.37 | 23.47 | | 8 | 89.32 | 48.48 | +--------------+-----------------------+----------------------+ [1] https://github.com/jthornber/device-mapper-test-suite Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm snapshot: Make exception tables scalableNikos Tsironis2-24/+116
Use list_bl to implement the exception hash tables' buckets. This change permits concurrent access, to distinct buckets, by multiple threads. Also, implement helper functions to lock and unlock the exception tables based on the chunk number of the exception at hand. We retain the global locking, by means of down_write(), which is replaced by the next commit. Still, we must acquire the per-bucket spinlocks when accessing the hash tables, since list_bl does not allow modification on unlocked lists. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm snapshot: Replace mutex with rw semaphoreNikos Tsironis1-45/+43
dm-snapshot uses a single mutex to serialize every access to the snapshot state. This includes all accesses to the complete and pending exception tables, which occur at every origin write, every snapshot read/write and every exception completion. The lock statistics indicate that this mutex is a bottleneck (average wait time ~480 usecs for 8 processes doing random 4K writes to the origin device) preventing dm-snapshot to scale as the number of threads doing IO increases. The major contention points are __origin_write()/snapshot_map() and pending_complete(), i.e., the submission and completion of pending exceptions. Replace this mutex with a rw semaphore. We essentially revert commit ae1093be5a0ef9 ("dm snapshot: use mutex instead of rw_semaphore") and together with the next two patches we substitute the single mutex with a fine-grained locking scheme, where we use a read-write semaphore to protect the mostly read fields of the snapshot structure, e.g., valid, active, etc., and per-bucket bit spinlocks to protect accesses to the complete and pending exception tables. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm snapshot: Don't sleep holding the snapshot lockNikos Tsironis1-37/+65
When completing a pending exception, pending_complete() waits for all conflicting reads to drain, before inserting the final, completed exception. Conflicting reads are snapshot reads redirected to the origin, because the relevant chunk is not remapped to the COW device the moment we receive the read. The completed exception must be inserted into the exception table after all conflicting reads drain to ensure snapshot reads don't return corrupted data. This is required because inserting the completed exception into the exception table signals that the relevant chunk is remapped and both origin writes and snapshot merging will now overwrite the chunk in origin. This wait is done holding the snapshot lock to ensure that pending_complete() doesn't starve if new snapshot reads keep coming for this chunk. In preparation for the next commit, where we use a spinlock instead of a mutex to protect the exception tables, we remove the need for holding the lock while waiting for conflicting reads to drain. We achieve this in two steps: 1. pending_complete() inserts the completed exception before waiting for conflicting reads to drain and removes the pending exception after all conflicting reads drain. This ensures that new snapshot reads will be redirected to the COW device, instead of the origin, and thus pending_complete() will not starve. Moreover, we use the existence of both a completed and a pending exception to signify that the COW is done but there are conflicting reads in flight. 2. In __origin_write() we check first if there is a pending exception and then if there is a completed exception. If there is a pending exception any submitted BIO is delayed on the pe->origin_bios list and DM_MAPIO_SUBMITTED is returned. This ensures that neither writes to the origin nor snapshot merging can overwrite the origin chunk, until all conflicting reads drain, and thus snapshot reads will not return corrupted data. Summarizing, we now have the following possible combinations of pending and completed exceptions for a chunk, along with their meaning: A. No exceptions exist: The chunk has not been remapped yet. B. Only a pending exception exists: The chunk is currently being copied to the COW device. C. Both a pending and a completed exception exist: COW for this chunk has completed but there are snapshot reads in flight which had been redirected to the origin before the chunk was remapped. D. Only the completed exception exists: COW has been completed and there are no conflicting reads in flight. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18list_bl: Add hlist_bl_add_before/behind helpersNikos Tsironis1-0/+26
Add hlist_bl_add_before/behind helpers to add an element before/after an existing element in a bl_list. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18list: Don't use WRITE_ONCE() in hlist_add_behind()Nikos Tsironis1-1/+1
Commit 1c97be677f72b3 ("list: Use WRITE_ONCE() when adding to lists and hlists") introduced the use of WRITE_ONCE() to atomically write the list head's ->next pointer. hlist_add_behind() doesn't touch the hlist head's ->first pointer so there is no reason to use WRITE_ONCE() in this case. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm cache metadata: Fix loading discard bitsetNikos Tsironis1-1/+8
Add missing dm_bitset_cursor_next() to properly advance the bitset cursor. Otherwise, the discarded state of all blocks is set according to the discarded state of the first block. Fixes: ae4a46a1f6 ("dm cache metadata: use bitset cursor api to load discard bitset") Cc: stable@vger.kernel.org Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm zoned: Fix zone report handlingDamien Le Moal1-0/+5
The function blkdev_report_zones() returns success even if no zone information is reported (empty report). Empty zone reports can only happen if the report start sector passed exceeds the device capacity. The conditions for this to happen are either a bug in the caller code, or, a change in the device that forced the low level driver to change the device capacity to a value that is lower than the report start sector. This situation includes a failed disk revalidation resulting in the disk capacity being changed to 0. If this change happens while dm-zoned is in its initialization phase executing dmz_init_zones(), this function may enter an infinite loop and hang the system. To avoid this, add a check to disallow empty zone reports and bail out early. Also fix the function dmz_update_zone() to make sure that the report for the requested zone was correctly obtained. Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Shaun Tancheff <shaun@tancheff.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm zoned: Silence a static checker warningDan Carpenter1-1/+2
My static checker complains about this line from dmz_get_zoned_device() aligned_capacity = dev->capacity & ~(blk_queue_zone_sectors(q) - 1); The problem is that "aligned_capacity" and "dev->capacity" are sector_t type (which is a u64 under most configs) but blk_queue_zone_sectors(q) returns a u32 so the higher 32 bits in aligned_capacity are cleared to zero. This patch adds a cast to address the issue. Fixes: 114e025968b5 ("dm zoned: ignore last smaller runt zone") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18dm crypt: fix endianness annotations around org_sector_of_dmreqChristoph Hellwig1-4/+4
The sector used here is a little endian value, so use the right type for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-14Linux 5.1-rc5Linus Torvalds1-1/+1
2019-04-14Merge branch 'page-refs' (page ref overflow)Linus Torvalds8-28/+92
Merge page ref overflow branch. Jann Horn reported that he can overflow the page ref count with sufficient memory (and a filesystem that is intentionally extremely slow). Admittedly it's not exactly easy. To have more than four billion references to a page requires a minimum of 32GB of kernel memory just for the pointers to the pages, much less any metadata to keep track of those pointers. Jann needed a total of 140GB of memory and a specially crafted filesystem that leaves all reads pending (in order to not ever free the page references and just keep adding more). Still, we have a fairly straightforward way to limit the two obvious user-controllable sources of page references: direct-IO like page references gotten through get_user_pages(), and the splice pipe page duplication. So let's just do that. * branch page-refs: fs: prevent page refcount overflow in pipe_buf_get mm: prevent get_user_pages() from overflowing page refcount mm: add 'try_get_page()' helper function mm: make page ref count overflow check tighter and more explicit
2019-04-14fs: prevent page refcount overflow in pipe_buf_getMatthew Wilcox5-15/+29
Change pipe_buf_get() to return a bool indicating whether it succeeded in raising the refcount of the page (if the thing in the pipe is a page). This removes another mechanism for overflowing the page refcount. All callers converted to handle a failure. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-14mm: prevent get_user_pages() from overflowing page refcountLinus Torvalds2-12/+49
If the page refcount wraps around past zero, it will be freed while there are still four billion references to it. One of the possible avenues for an attacker to try to make this happen is by doing direct IO on a page multiple times. This patch makes get_user_pages() refuse to take a new page reference if there are already more than two billion references to the page. Reported-by: Jann Horn <jannh@google.com> Acked-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-14mm: add 'try_get_page()' helper functionLinus Torvalds1-0/+9
This is the same as the traditional 'get_page()' function, but instead of unconditionally incrementing the reference count of the page, it only does so if the count was "safe". It returns whether the reference count was incremented (and is marked __must_check, since the caller obviously has to be aware of it). Also like 'get_page()', you can't use this function unless you already had a reference to the page. The intent is that you can use this exactly like get_page(), but in situations where you want to limit the maximum reference count. The code currently does an unconditional WARN_ON_ONCE() if we ever hit the reference count issues (either zero or negative), as a notification that the conditional non-increment actually happened. NOTE! The count access for the "safety" check is inherently racy, but that doesn't matter since the buffer we use is basically half the range of the reference count (ie we look at the sign of the count). Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-14mm: make page ref count overflow check tighter and more explicitLinus Torvalds1-1/+5
We have a VM_BUG_ON() to check that the page reference count doesn't underflow (or get close to overflow) by checking the sign of the count. That's all fine, but we actually want to allow people to use a "get page ref unless it's already very high" helper function, and we want that one to use the sign of the page ref (without triggering this VM_BUG_ON). Change the VM_BUG_ON to only check for small underflows (or _very_ close to overflowing), and ignore overflows which have strayed into negative territory. Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-13Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-blockLinus Torvalds20-109/+174
Pull block fixes from Jens Axboe: "Set of fixes that should go into this round. This pull is larger than I'd like at this time, but there's really no specific reason for that. Some are fixes for issues that went into this merge window, others are not. Anyway, this contains: - Hardware queue limiting for virtio-blk/scsi (Dongli) - Multi-page bvec fixes for lightnvm pblk - Multi-bio dio error fix (Jason) - Remove the cache hint from the io_uring tool side, since we didn't move forward with that (me) - Make io_uring SETUP_SQPOLL root restricted (me) - Fix leak of page in error handling for pc requests (Jérôme) - Fix BFQ regression introduced in this merge window (Paolo) - Fix break logic for bio segment iteration (Ming) - Fix NVMe cancel request error handling (Ming) - NVMe pull request with two fixes (Christoph): - fix the initial CSN for nvme-fc (James) - handle log page offsets properly in the target (Keith)" * tag 'for-linus-20190412' of git://git.kernel.dk/linux-block: block: fix the return errno for direct IO nvmet: fix discover log page when offsets are used nvme-fc: correct csn initialization and increments on error block: do not leak memory in bio_copy_user_iov() lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs nvme: cancel request synchronously blk-mq: introduce blk_mq_complete_request_sync() scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids virtio-blk: limit number of hw queues by nr_cpu_ids block, bfq: fix use after free in bfq_bfqq_expire io_uring: restrict IORING_SETUP_SQPOLL to root tools/io_uring: remove IOCQE_FLAG_CACHEHIT block: don't use for-inside-for in bio_for_each_segment_all
2019-04-13Merge tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds7-53/+16
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable fix: - Fix a deadlock in close() due to incorrect draining of RDMA queues Bugfixes: - Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping" as it is causing stack overflows - Fix a regression where NFSv4 getacl and fs_locations stopped working - Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family. - Fix xfstests failures due to incorrect copy_file_range() return values" * tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping" NFSv4.1 fix incorrect return value in copy_file_range xprtrdma: Fix helper that drains the transport NFS: Fix handling of reply page vector NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
2019-04-13Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-1/+4
Pull SCSI fix from James Bottomley: "One obvious fix for a ciostor data corruption on error bug" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: csiostor: fix missing data copy in csio_scsi_err_handler()
2019-04-13Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linuxLinus Torvalds12-27/+99
Pull clk fixes from Stephen Boyd: "Here's more than a handful of clk driver fixes for changes that came in during the merge window: - Fix the AT91 sama5d2 programmable clk prescaler formula - A bunch of Amlogic meson clk driver fixes for the VPU clks - A DMI quirk for Intel's Bay Trail SoC's driver to properly mark pmc clks as critical only when really needed - Stop overwriting CLK_SET_RATE_PARENT flag in mediatek's clk gate implementation - Use the right structure to test for a frequency table in i.MX's PLL_1416x driver" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: imx: Fix PLL_1416X not rounding rates clk: mediatek: fix clk-gate flag setting platform/x86: pmc_atom: Drop __initconst on dmi table clk: x86: Add system specific quirk to mark clocks as critical clk: meson: vid-pll-div: remove warning and return 0 on invalid config clk: meson: pll: fix rounding and setting a rate that matches precisely clk: meson-g12a: fix VPU clock parents clk: meson: g12a: fix VPU clock muxes mask clk: meson-gxbb: round the vdec dividers to closest clk: at91: fix programmable clock for sama5d2
2019-04-13Merge tag 'pci-v5.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pciLinus Torvalds2-0/+6
Pull PCI fixes from Bjorn Helgaas: - Add a DMA alias quirk for another Marvell SATA device (Andre Przywara) - Fix a pciehp regression that broke safe removal of devices (Sergey Miroshnichenko) * tag 'pci-v5.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI: pciehp: Ignore Link State Changes after powering off a slot PCI: Add function 1 DMA alias quirk for Marvell 9170 SATA controller
2019-04-13Merge tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds4-10/+14
Pull powerpc fixes from Michael Ellerman: "A minor build fix for 64-bit FLATMEM configs. A fix for a boot failure on 32-bit powermacs. My commit to fix CLOCK_MONOTONIC across Y2038 broke the 32-bit VDSO on 64-bit kernels, ie. compat mode, which is only used on big endian. The rewrite of the SLB code we merged in 4.20 missed the fact that the 0x380 exception is also used with the Radix MMU to report out of range accesses. This could lead to an oops if userspace tried to read from addresses outside the user or kernel range. Thanks to: Aneesh Kumar K.V, Christophe Leroy, Larry Finger, Nicholas Piggin" * tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mm: Define MAX_PHYSMEM_BITS for all 64-bit configs powerpc/64s/radix: Fix radix segment exception handling powerpc/vdso32: fix CLOCK_MONOTONIC on PPC64 powerpc/32: Fix early boot failure with RTAS built-in
2019-04-13Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linuxLinus Torvalds4-16/+23
Pull arm64 fixes from Will Deacon: "The main thing is a fix to our FUTEX_WAKE_OP implementation which was unbelievably broken, but did actually work for the one scenario that GLIBC used to use. Summary: - Fix stack unwinding so we ignore user stacks - Fix ftrace module PLT trampoline initialisation checks - Fix terminally broken implementation of FUTEX_WAKE_OP atomics" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value arm64: backtrace: Don't bother trying to unwind the userspace stack arm64/ftrace: fix inadvertent BUG() in trampoline check
2019-04-12Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-26/+21
Pull x86 fixes from Ingo Molnar: "Fix typos in user-visible resctrl parameters, and also fix assembly constraint bugs that might result in miscompilation" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm: Use stricter assembly constraints in bitops x86/resctrl: Fix typos in the mba_sc mount option
2019-04-12Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull timer fix from Ingo Molnar: "Fix the alarm_timer_remaining() return value" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: alarmtimer: Return correct remaining time
2019-04-12Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-3/+3
Pull scheduler fix from Ingo Molnar: "Fix a NULL pointer dereference crash in certain environments" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Do not re-read ->h_load_next during hierarchical load calculation
2019-04-12Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds5-27/+190
Pull perf fixes from Ingo Molnar: "Six kernel side fixes: three related to NMI handling on AMD systems, a race fix, a kexec initialization fix and a PEBS sampling fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix perf_event_disable_inatomic() race x86/perf/amd: Remove need to check "running" bit in NMI handler x86/perf/amd: Resolve NMI latency issues for active PMCs x86/perf/amd: Resolve race condition when disabling PMC perf/x86/intel: Initialize TFA MSR perf/x86/intel: Fix handling of wakeup_events for multi-entry PEBS
2019-04-12Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-17/+12
Pull locking fix from Ingo Molnar: "Fixes a crash when accessing /proc/lockdep" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/lockdep: Zap lock classes even with lock debugging disabled
2019-04-12Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-0/+6
Pull irq fixes from Ingo Molnar: "Two genirq fixes, plus an irqchip driver error handling fix" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent() genirq: Initialize request_mutex if CONFIG_SPARSE_IRQ=n irqchip/irq-ls1x: Missing error code in ls1x_intc_of_init()
2019-04-12Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-2/+3
Pull core fixes from Ingo Molnar: "Fix an objtool warning plus fix a u64_to_user_ptr() macro expansion bug" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Add rewind_stack_do_exit() to the noreturn list linux/kernel.h: Use parentheses around argument in u64_to_user_ptr()
2019-04-12clk: imx: Fix PLL_1416X not rounding ratesLeonard Crestez1-1/+1
Code which initializes the "clk_init_data.ops" checks pll->rate_table before that field is ever assigned to so it always picks "clk_pll1416x_min_ops". This breaks dynamic rate rounding for features such as cpufreq. Fix by checking pll_clk->rate_table instead, here pll_clk refers to the constant initialization data coming from per-soc clk driver. Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com> Fixes: 8646d4dcc7fb ("clk: imx: Add PLLs driver for imx8mm soc") Signed-off-by: Stephen Boyd <sboyd@kernel.org>