aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2017-04-04nvme_fc: Sync FC-NVME header with standardJames Smart1-9/+59
Update FC-NVME definitions to match FC-NVME r1.14 (16-020vB) plus change voted in by 2/22 FC-NVME Adhoc (see HOSTID below). Includes the following: - Addition of "status_code" field to ERSP IU - Addition of FC-NVME LS RJT reason_codes and reason_explanations - CreateAssociation payload, HostID field shortened to 16 bytes Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-rdma: Support ctrl_loss_tmoSagi Grimberg1-13/+28
Before scheduling a reconnect attempt, check nr_reconnects against max_reconnects, if not exhausted (or max_reconnects is not -1), schedule a reconnect attempts, otherwise schedule ctrl removal. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-fabrics: Allow ctrl loss timeout configurationSagi Grimberg2-0/+38
When a host sense that its controller session is damaged, it tries to re-establish it periodically (reconnect every reconnect_delay). It may very well be that the controller is gone and never coming back, in this case the host will try to reconnect forever. Add a ctrl_loss_tmo to bound the number of reconnect attempts to a specific controller (default to a reasonable 10 minutes). The timeout configuration is actually translated into number of reconnect attempts and not a schedule on its own but rather divided with reconnect_delay. This is useful to prevent racing flows of remove and reconnect, and it doesn't really matter if we remove slightly sooner than what the user requested. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-rdma: get rid of local reconnect_delaySagi Grimberg1-5/+3
we already have it in opts. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-loop: retrieve iod from the cqe command_idSagi Grimberg1-6/+25
useful to validate that the we didn't mess up the command_id. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-loop: remove unneeded includesSagi Grimberg1-2/+0
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-fc: fix module_init (theoretical) error pathSagi Grimberg1-1/+10
If nvmf_register_transport happened to fail (it can't, but theoretically) we leak memory. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-loop: fix module_init (theoretical) error pathSagi Grimberg1-1/+6
if nvmf_register_transport happend to fail, we need to nvmet_unregister_transport as well. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-rdma: fix module_init (theoretical) error pathSagi Grimberg1-5/+13
If nvmf_register_transport happened to fail (it can't, but theoretically) we leak memory. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvmet: use symbolic constants for log identifiersMax Gurtovoy1-6/+6
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvmet: Introduced helper routine for controller status check.Parav Pandit6-34/+41
This patch introduces helper function for checking controller status during admin and io command processing which returns u16 status. As to bring consistency on returning status, other friend functions also now return u16 status instead of int to match the spec. As part of the theseerror log prints in also prints qid on which command error occured. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvmet: Fixed avoided printing nvmet: twice in error logs.Parav Pandit4-21/+20
This patch avoids printing "nvmet:" twice in error logs as its already coming through pr_fmt macro. Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04iscsi-target: use generic inet_pton_with_scopeSagi Grimberg1-34/+12
Instead of parsing address strings, use a generic helper. Acked-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-rdma: use inet_pton_with_scope helperSagi Grimberg1-44/+19
Both the destination and the host addresses are now parsed using inet_pton_with_scope helper. We also get ipv6 (with address scopes support). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvmet-rdma: use generic inet_pton_with_scopeSagi Grimberg1-13/+29
Instead of parsing address strings, use a generic helper. This also adds ipv6 (with address scopes) support. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04net/utils: generic inet_pton_with_scope helperSagi Grimberg2-0/+109
Several locations in the stack need to handle ipv4/ipv6 (with scope) and port strings conversion to sockaddr. Add a helper that takes either AF_INET, AF_INET6 or AF_UNSPEC (for wildcard) to centralize this handling. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-loop: remove some code duplicationSagi Grimberg1-12/+21
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-rdma: Give some more grace for rdma connection establishmentSagi Grimberg1-1/+1
The target might be occupied with multiple hosts so lets give it some more grace before failing the connection establishment. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvmet-rdma: occasionally flush ongoing controller teardownSagi Grimberg1-0/+5
If we are attacked with establishments/teradowns we need to make sure we do not consume too much system memory. Thus let ongoing controller teardowns complete before accepting new controller establishments. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-rdma: handle cpu unplug when re-establishing the controllerSagi Grimberg1-14/+14
If a cpu unplug event has occured, we need to take the minimum of the provided nr_io_queues and the number of online cpus, otherwise we won't be able to connect them as blk-mq mapping won't dispatch to those queues. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvmet-rdma: Fix a possible uninitialized variable dereferenceSagi Grimberg1-5/+3
When handling a new recv command, we grab a new rsp resource and check for the queue state being live. In case the queue is not in live state, we simply restore the rsp back to the free list. However in this flow we didn't set rsp->queue yet, so we cannot dereference it. Instead, make sure to initialize rsp->queue (and other rsp members) as soon as possible so we won't reference uninitialized variables. Reported-by: Yi Zhang <yizhan@redhat.com> Reported-by: Raju Rangoju <rajur@chelsio.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Raju Rangoju <rajur@chelsio.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvmet: confirm sq percpu has scheduled and switched to atomicSagi Grimberg2-1/+11
percpu_ref_kill is not enough to prevent subsequent percpu_ref_tryget_live from failing. Hence call perfcpu_ref_kill_confirm to make it safe. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04nvme-loop: handle cpu unplug when re-establishing the controllerSagi Grimberg1-38/+50
If a cpu unplug event has occured, we need to take the minimum of the provided nr_io_queues and the number of online cpus, otherwise we won't be able to connect them as blk-mq mapping won't dispatch to those queues. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-04-04nvme-loop: fix a possible use-after-free when destroying the admin queueSagi Grimberg1-1/+1
we need to destroy the nvmet sq and let it finish gracefully before continue to cleanup the queue. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-03-31blk-mq: constify struct blk_mq_opsEric Biggers14-18/+18
Constify all instances of blk_mq_ops, as they are never modified. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30null_blk: add blocking modeJens Axboe1-0/+9
This adds a new module parameter to null_blk, blocking. If set, null_blk will set the BLK_MQ_F_BLOCKING flag, indicating that it sometimes/always needs to block in its ->queue_rq() function. The intent is to help find regressions in blocking drivers, since not many of them exist. If null_blk is loaded with submit_queues > 1 and blocking=1, this shows the regression recently fixed by bf4907c05e61. Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30blk-mq: fix schedule-under-preempt for blocking driversJens Axboe1-3/+14
Commit a4d907b6a33b unified the single and multi queue request handlers, but in the process, it also screwed up the locking balance and calls blk_mq_try_issue_directly() with the ctx preempt lock held. This is a problem for drivers that have set BLK_MQ_F_BLOCKING, since now they can't reliably sleep. While in there, protect against similar issues in the future, by adding a might_sleep() trigger in the BLOCKING path for direct issue or queue run. Reported-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Josef Bacik <josef@toxicpanda.com> Fixes: a4d907b6a33b ("blk-mq: streamline blk_mq_make_request") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30block/sed-opal: fix spelling mistake: "Lifcycle" -> "Lifecycle"Colin Ian King1-1/+1
trivial fix to spelling mistake in pr_err error message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30block: do not put mq context in blk_mq_alloc_request_hctxMinchan Kim1-1/+0
In blk_mq_alloc_request_hctx, blk_mq_sched_get_request doesn't get sw context so we don't need to put the context with blk_mq_put_ctx. Unless, we will see preempt counter underflow. Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29blk-mq: include errors in did_work calculationJens Axboe1-3/+4
Currently we return true in blk_mq_dispatch_rq_list() if we queued IO successfully, but we really want to return whether or not the we made progress. Progress includes if we got an error return. If we don't, this can lead to a hang in blk_mq_sched_dispatch_requests() when a driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of manually ending the IO in error and return BLK_MQ_QUEUE_OK. Tested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29block-mq: don't re-queue if we get a queue errorJosef Bacik1-2/+1
When try to issue a request directly and we fail we will requeue the request, but call blk_mq_end_request() as well. This leads to the completed request being on a queuelist and getting ended twice, which causes list corruption in schedulers and other shenanigans. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29blkcg: allocate struct blkcg_gq outside request queue spinlockTahsin Erdogan1-25/+98
blkg_conf_prep() currently calls blkg_lookup_create() while holding request queue spinlock. This means allocating memory for struct blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM failures in call paths like below: pcpu_alloc+0x68f/0x710 __alloc_percpu_gfp+0xd/0x10 __percpu_counter_init+0x55/0xc0 cfq_pd_alloc+0x3b2/0x4e0 blkg_alloc+0x187/0x230 blkg_create+0x489/0x670 blkg_lookup_create+0x9a/0x230 blkg_conf_prep+0x1fb/0x240 __cfqg_set_weight_device.isra.105+0x5c/0x180 cfq_set_weight_on_dfl+0x69/0xc0 cgroup_file_write+0x39/0x1c0 kernfs_fop_write+0x13f/0x1d0 __vfs_write+0x23/0x120 vfs_write+0xc2/0x1f0 SyS_write+0x44/0xb0 entry_SYSCALL_64_fastpath+0x18/0xad In the code path above, percpu allocator cannot call vmalloc() due to queue spinlock. A failure in this call path gives grief to tools which are trying to configure io weights. We see occasional failures happen shortly after reboots even when system is not under any memory pressure. Machines with a lot of cpus are more vulnerable to this condition. Do struct blkcg_gq allocations outside the queue spinlock to allow blocking during memory allocations. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tahsin Erdogan <tahsin@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29Revert "blkcg: allocate struct blkcg_gq outside request queue spinlock"Jens Axboe2-91/+53
I inadvertently applied the v5 version of this patch, whereas the agreed upon version was v5. Revert this one so we can apply the right one. This reverts commit 7fc6b87a9ff537e7df32b1278118ce9c5bcd6788.
2017-03-29blk-mq: fix a typo and a spelling mistakeJens Axboe1-2/+2
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29blk-mq-pci: Fix two spelling mistakesSagi Grimberg1-1/+1
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29block: fix leak of q->rq_wbOmar Sandoval1-1/+3
CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb when a request queue is reregistered. This has been a problem since wbt was introduced, but the WARN_ON(!list_empty(&stats->callbacks)) in the blk-stat rework exposed it. Fix it by cleaning up wbt when we unregister the queue. Fixes: 87760e5eef35 ("block: hook up writeback throttling") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29blk-mq: fix leak of q->statsOmar Sandoval1-4/+0
blk_alloc_queue_node() already allocates q->stats, so blk_mq_init_allocated_queue() is overwriting it with a new allocation. Fixes: a83b576c9c25 ("block: fix stacked driver stats init and free") Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29block: warn if sharing request queue across gendisksOmar Sandoval2-0/+8
Now that the remaining drivers have been converted to one request queue per gendisk, let's warn if a request queue gets registered more than once. This will catch future drivers which might do it inadvertently or any old drivers that I may have missed. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29block: block new I/O just after queue is set as dyingMing Lei1-3/+10
Before commit 780db2071a(blk-mq: decouble blk-mq freezing from generic bypassing), the dying flag is checked before entering queue, and Tejun converts the checking into .mq_freeze_depth, and assumes the counter is increased just after dying flag is set. Unfortunately we doesn't do that in blk_set_queue_dying(). This patch calls blk_freeze_queue_start() in blk_set_queue_dying(), so that we can block new I/O coming once the queue is set as dying. Given blk_set_queue_dying() is always called in remove path of block device, and queue will be cleaned up later, we don't need to worry about undoing the counter. Cc: Tejun Heo <tj@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29block: rename blk_mq_freeze_queue_start()Ming Lei5-9/+9
As the .q_usage_counter is used by both legacy and mq path, we need to block new I/O if queue becomes dead in blk_queue_enter(). So rename it and we can use this function in both paths. Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29block: add a read barrier in blk_queue_enter()Ming Lei1-0/+9
Without the barrier, reading DEAD flag of .q_usage_counter and reading .mq_freeze_depth may be reordered, then the following wait_event_interruptible() may never return. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29blk-mq: comment on races related with timeout handlerMing Lei1-0/+22
This patch adds comment on two races related with timeout handler: - requeue from queue busy vs. timeout - rq free & reallocation vs. timeout Both the races themselves and current solution aren't explicit enough, so add comments on them. Cc: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29blk-mq: don't complete un-started request in timeout handlerMing Lei1-10/+1
When iterating busy requests in timeout handler, if the STARTED flag of one request isn't set, that means the request is being processed in block layer or driver, and isn't submitted to hardware yet. In current implementation of blk_mq_check_expired(), if the request queue becomes dying, un-started requests are handled as being completed/freed immediately. This way is wrong, and can cause rq corruption or double allocation[1][2], when doing I/O and removing&resetting NVMe device at the sametime. This patch fixes several issues reported by Yi Zhang. [1]. oops log 1 [ 581.789754] ------------[ cut here ]------------ [ 581.789758] kernel BUG at block/blk-mq.c:374! [ 581.789760] invalid opcode: 0000 [#1] SMP [ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror dm_region_hash dm_log dm_mod [ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4 [ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 581.789804] Workqueue: kblockd blk_mq_timeout_work [ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000 [ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70 [ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202 [ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88 [ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000 [ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00 [ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb [ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80 [ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000 [ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0 [ 581.789820] Call Trace: [ 581.789825] blk_mq_check_expired+0x76/0x80 [ 581.789828] bt_iter+0x45/0x50 [ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0 [ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70 [ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70 [ 581.789840] ? __switch_to+0x140/0x450 [ 581.789841] blk_mq_timeout_work+0x88/0x170 [ 581.789845] process_one_work+0x165/0x410 [ 581.789847] worker_thread+0x137/0x4c0 [ 581.789851] kthread+0x101/0x140 [ 581.789853] ? rescuer_thread+0x3b0/0x3b0 [ 581.789855] ? kthread_park+0x90/0x90 [ 581.789860] ret_from_fork+0x2c/0x40 [ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48 8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f> 0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00 [ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50 [ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]--- [2]. oops log2 [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme] [ 6984.857373] PGD 0 [ 6984.857374] [ 6984.857376] Oops: 0000 [#1] SMP [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror dm_region_hash dm_log dm_mod [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted 4.10.0-2.el7.bz1420297.x86_64 #1 [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000 [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme] [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246 [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000 [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950 [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000 [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000 [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780 [ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000 [ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0 [ 6984.857443] Call Trace: [ 6984.857451] ? mempool_free+0x2b/0x80 [ 6984.857455] ? bio_free+0x4e/0x60 [ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230 [ 6984.857462] blk_mq_process_rq_list+0x133/0x170 [ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0 [ 6984.857467] blk_mq_run_work_fn+0x12/0x20 [ 6984.857473] process_one_work+0x165/0x410 [ 6984.857475] worker_thread+0x137/0x4c0 [ 6984.857478] kthread+0x101/0x140 [ 6984.857480] ? rescuer_thread+0x3b0/0x3b0 [ 6984.857481] ? kthread_park+0x90/0x90 [ 6984.857489] ret_from_fork+0x2c/0x40 [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44 89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c> 8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50 [ 6984.857512] CR2: 0000000000000010 [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]--- Cc: stable@vger.kernel.org Reported-by: Yi Zhang <yizhan@redhat.com> Tested-by: Yi Zhang <yizhan@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28blkcg: allocate struct blkcg_gq outside request queue spinlockTahsin Erdogan2-53/+91
blkg_conf_prep() currently calls blkg_lookup_create() while holding request queue spinlock. This means allocating memory for struct blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM failures in call paths like below: pcpu_alloc+0x68f/0x710 __alloc_percpu_gfp+0xd/0x10 __percpu_counter_init+0x55/0xc0 cfq_pd_alloc+0x3b2/0x4e0 blkg_alloc+0x187/0x230 blkg_create+0x489/0x670 blkg_lookup_create+0x9a/0x230 blkg_conf_prep+0x1fb/0x240 __cfqg_set_weight_device.isra.105+0x5c/0x180 cfq_set_weight_on_dfl+0x69/0xc0 cgroup_file_write+0x39/0x1c0 kernfs_fop_write+0x13f/0x1d0 __vfs_write+0x23/0x120 vfs_write+0xc2/0x1f0 SyS_write+0x44/0xb0 entry_SYSCALL_64_fastpath+0x18/0xad In the code path above, percpu allocator cannot call vmalloc() due to queue spinlock. A failure in this call path gives grief to tools which are trying to configure io weights. We see occasional failures happen shortly after reboots even when system is not under any memory pressure. Machines with a lot of cpus are more vulnerable to this condition. Update blkg_create() function to temporarily drop the rcu and queue locks when it is allowed by gfp mask. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tahsin Erdogan <tahsin@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28jsflash: stop sharing request queue across multiple gendisksOmar Sandoval1-14/+36
Compile-tested only (by hacking it to compile on x86). Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28swim: stop sharing request queue across multiple gendisksOmar Sandoval1-19/+36
Compile-tested only (by hacking it to compile on x86). Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28parport/pf: stop sharing request queue across multiple gendisksOmar Sandoval1-19/+38
Compile-tested only. Cc: Tim Waugh <tim@cyberelk.net> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28parport/pcd: stop sharing request queue across multiple gendisksOmar Sandoval1-19/+38
Compile-tested only. Cc: Tim Waugh <tim@cyberelk.net> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28parport/pd: stop sharing request queue across multiple gendisksOmar Sandoval1-16/+34
Compile-tested only. Cc: Tim Waugh <tim@cyberelk.net> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28hd: stop sharing request queue across multiple gendisksOmar Sandoval1-23/+39
Compile-tested only. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>