aboutsummaryrefslogtreecommitdiffstats
path: root/.gitattributes (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2017-04-19dax: introduce dax_operationsDan Williams4-3/+23
Track a set of dax_operations per dax_device that can be set at alloc_dax() time. These operations will be used to stop the abuse of block_device_operations for communicating dax capabilities to filesystems. It will also be used to replace the "pmem api" and move pmem-specific cache maintenance, and other dax-driver-specific filesystem-dax operations, to dax device methods. In particular this allows us to stop abusing __copy_user_nocache(), via memcpy_to_pmem(), with a driver specific replacement. This is a standalone introduction of the operations. Follow on patches convert each dax-driver and teach fs/dax.c to use ->direct_access() from dax_operations instead of block_device_operations. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-19dax: add a facility to lookup a dax device by 'host' device nameDan Williams4-6/+86
For the current block_device based filesystem-dax path, we need a way for it to lookup the dax_device associated with a block_device. Add a 'host' property of a dax_device that can be used for this purpose. It is a free form string, but for a dax_device associated with a block device it is the bdev name. This is a stop-gap until filesystems are able to mount on a dax-inode directly. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-12dax: refactor dax-fs into a generic provider of 'struct dax_device' instancesDan Williams10-217/+404
We want dax capable drivers to be able to publish a set of dax operations [1]. However, we do not want to further abuse block_devices to advertise these operations. Instead we will attach these operations to a dax device and add a lookup mechanism to go from block device path to a dax device. A dax capable driver like pmem or brd is responsible for registering a dax device, alongside a block device, and then a dax capable filesystem is responsible for retrieving the dax device by path name if it wants to call dax_operations. For now, we refactor the dax pseudo-fs to be a generic facility, rather than an implementation detail, of the device-dax use case. Where a "dax device" is just an inode + dax infrastructure, and "Device DAX" is a mapping service layered on top of that base 'struct dax_device'. "Filesystem DAX" is then a mapping service that layers a filesystem on top of that same base device. Filesystem DAX is associated with a block_device for now, but perhaps directly to a dax device in the future, or for new pmem-only filesystems. [1]: https://lkml.org/lkml/2017/1/19/880 Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-12device-dax: rename 'dax_dev' to 'dev_dax'Dan Williams3-109/+109
In preparation for introducing a struct dax_device type to the kernel global type namespace, rename dax_dev to dev_dax. A 'dax_device' instance will be a generic device-driver object for any provider of dax functionality. A 'dev_dax' object is a device-dax-driver local / internal instance. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-12device-dax: improve fault handler debug outputOliver O'Halloran1-7/+10
A couple of minor improvements to the debug output in the fault handlers: a) Print the region alignment and fault size when we sent a SIGBUS because the region alignment is greater than the fault size. b) Fix the message in the PFN_{DEV|MAP} check. c) Additionally print the fault size enum value in the huge fault handler. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-12device-dax: fix dax_dev_huge_fault() unknown fault size handlingPushkar Jambhlekar1-1/+1
The default case for dax_dev_huge_fault() fault size handling mistakenly returns when it should unlock. This is not a problem in practice since the only three possible fault sizes are handled. Going forward, if the core mm adds a new fault size beyond pte, pmd, or pud device-dax should abort VM_FAULT_SIGBUS requests not VM_FAULT_FALLBACK since device-dax guarantees a configured fault granularity for all faults. Signed-off-by: Pushkar Jambhlekar <pushkar.iit@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-12x86, pmem: fix broken __copy_user_nocache cache-bypass assumptionsDan Williams1-11/+31
Before we rework the "pmem api" to stop abusing __copy_user_nocache() for memcpy_to_pmem() we need to fix cases where we may strand dirty data in the cpu cache. The problem occurs when copy_from_iter_pmem() is used for arbitrary data transfers from userspace. There is no guarantee that these transfers, performed by dax_iomap_actor(), will have aligned destinations or aligned transfer lengths. Backstop the usage __copy_user_nocache() with explicit cache management in these unaligned cases. Yes, copy_from_iter_pmem() is now too big for an inline, but addressing that is saved for a later patch that moves the entirety of the "pmem api" into the pmem driver directly. Fixes: 5de490daec8b ("pmem: add copy_from_iter_pmem() and clear_pmem()") Cc: <stable@vger.kernel.org> Cc: <x86@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-12device-dax: switch to srcu, fix rcu_read_lock() vs pte allocationDan Williams2-6/+8
The following warning triggers with a new unit test that stresses the device-dax interface. =============================== [ ERR: suspicious RCU usage. ] 4.11.0-rc4+ #1049 Tainted: G O ------------------------------- ./include/linux/rcupdate.h:521 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 0 2 locks held by fio/9070: #0: (&mm->mmap_sem){++++++}, at: [<ffffffff8d0739d7>] __do_page_fault+0x167/0x4f0 #1: (rcu_read_lock){......}, at: [<ffffffffc03fbd02>] dax_dev_huge_fault+0x32/0x620 [dax] Call Trace: dump_stack+0x86/0xc3 lockdep_rcu_suspicious+0xd7/0x110 ___might_sleep+0xac/0x250 __might_sleep+0x4a/0x80 __alloc_pages_nodemask+0x23a/0x360 alloc_pages_current+0xa1/0x1f0 pte_alloc_one+0x17/0x80 __pte_alloc+0x1e/0x120 __get_locked_pte+0x1bf/0x1d0 insert_pfn.isra.70+0x3a/0x100 ? lookup_memtype+0xa6/0xd0 vm_insert_mixed+0x64/0x90 dax_dev_huge_fault+0x520/0x620 [dax] ? dax_dev_huge_fault+0x32/0x620 [dax] dax_dev_fault+0x10/0x20 [dax] __do_fault+0x1e/0x140 __handle_mm_fault+0x9af/0x10d0 handle_mm_fault+0x16d/0x370 ? handle_mm_fault+0x47/0x370 __do_page_fault+0x28c/0x4f0 trace_do_page_fault+0x58/0x2a0 do_async_page_fault+0x1a/0xa0 async_page_fault+0x28/0x30 Inserting a page table entry may trigger an allocation while we are holding a read lock to keep the device instance alive for the duration of the fault. Use srcu for this keep-alive protection. Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap") Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-10libnvdimm: band aid btt vs clear poison lockingDan Williams1-1/+9
The following warning results from holding a lane spinlock, preempt_disable(), or the btt map spinlock and then trying to take the reconfig_mutex to walk the poison list and potentially add new entries. BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747 in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd [..] Call Trace: dump_stack+0x85/0xc8 ___might_sleep+0x184/0x250 __might_sleep+0x4a/0x90 __mutex_lock+0x58/0x9b0 ? nvdimm_bus_lock+0x21/0x30 [libnvdimm] ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm] ? acpi_nfit_forget_poison+0x79/0x80 [nfit] ? _raw_spin_unlock+0x27/0x40 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_forget_poison+0x25/0x50 [libnvdimm] nvdimm_clear_poison+0x106/0x140 [libnvdimm] nsio_rw_bytes+0x164/0x270 [libnvdimm] btt_write_pg+0x1de/0x3e0 [nd_btt] ? blk_queue_enter+0x30/0x290 btt_make_request+0x11a/0x310 [nd_btt] ? blk_queue_enter+0xb7/0x290 ? blk_queue_enter+0x30/0x290 generic_make_request+0x118/0x3b0 As a minimal fix, disable error clearing when the BTT is enabled for the namespace. For the final fix a larger rework of the poison list locking is needed. Note that this is not a problem in the blk case since that path never calls nvdimm_clear_poison(). Cc: <stable@vger.kernel.org> Fixes: 82bf1037f2ca ("libnvdimm: check and clear poison before writing to pmem") Cc: Dave Jiang <dave.jiang@intel.com> [jeff: dynamically disable error clearing in the btt case] Suggested-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-10libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splatDan Williams1-0/+6
Holding the reconfig_mutex over a potential userspace fault sets up a lockdep dependency chain between filesystem-DAX and the libnvdimm ioctl path. Move the user access outside of the lock. [ INFO: possible circular locking dependency detected ] 4.11.0-rc3+ #13 Tainted: G W O ------------------------------------------------------- fallocate/16656 is trying to acquire lock: (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffa00080b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm] but task is already holding lock: (jbd2_handle){++++..}, at: [<ffffffff813b4944>] start_this_handle+0x104/0x460 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (jbd2_handle){++++..}: lock_acquire+0xbd/0x200 start_this_handle+0x16a/0x460 jbd2__journal_start+0xe9/0x2d0 __ext4_journal_start_sb+0x89/0x1c0 ext4_dirty_inode+0x32/0x70 __mark_inode_dirty+0x235/0x670 generic_update_time+0x87/0xd0 touch_atime+0xa9/0xd0 ext4_file_mmap+0x90/0xb0 mmap_region+0x370/0x5b0 do_mmap+0x415/0x4f0 vm_mmap_pgoff+0xd7/0x120 SyS_mmap_pgoff+0x1c5/0x290 SyS_mmap+0x22/0x30 entry_SYSCALL_64_fastpath+0x1f/0xc2 -> #1 (&mm->mmap_sem){++++++}: lock_acquire+0xbd/0x200 __might_fault+0x70/0xa0 __nd_ioctl+0x683/0x720 [libnvdimm] nvdimm_ioctl+0x8b/0xe0 [libnvdimm] do_vfs_ioctl+0xa8/0x740 SyS_ioctl+0x79/0x90 do_syscall_64+0x6c/0x200 return_from_SYSCALL_64+0x0/0x7a -> #0 (&nvdimm_bus->reconfig_mutex){+.+.+.}: __lock_acquire+0x16b6/0x1730 lock_acquire+0xbd/0x200 __mutex_lock+0x88/0x9b0 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_forget_poison+0x25/0x50 [libnvdimm] nvdimm_clear_poison+0x106/0x140 [libnvdimm] pmem_do_bvec+0x1c2/0x2b0 [nd_pmem] pmem_make_request+0xf9/0x270 [nd_pmem] generic_make_request+0x118/0x3b0 submit_bio+0x75/0x150 Cc: <stable@vger.kernel.org> Fixes: 62232e45f4a2 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices") Cc: Dave Jiang <dave.jiang@intel.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-04libnvdimm: fix blk free space accountingDan Williams1-66/+11
Commit a1f3e4d6a0c3 "libnvdimm, region: update nd_region_available_dpa() for multi-pmem support" reworked blk dpa (DIMM Physical Address) accounting to comprehend multiple pmem namespace allocations aliasing with a given blk-dpa range. The following call trace is a result of failing to account for allocated blk capacity. WARNING: CPU: 1 PID: 2433 at tools/testing/nvdimm/../../../drivers/nvdimm/names 4 size_store+0x6f3/0x930 [libnvdimm] nd_region region5: allocation underrun: 0x0 of 0x1000000 bytes [..] Call Trace: dump_stack+0x86/0xc3 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 size_store+0x6f3/0x930 [libnvdimm] dev_attr_store+0x18/0x30 If a given blk-dpa allocation does not alias with any pmem ranges then the full allocation should be accounted as busy space, not the size of the current pmem contribution to the region. The thinkos that led to this confusion was not realizing that the struct resource management is already guaranteeing no collisions between pmem allocations and blk allocations on the same dimm. Also, we do not try to support blk allocations in aliased pmem holes. This patch also fixes a case where the available blk goes negative. Cc: <stable@vger.kernel.org> Fixes: a1f3e4d6a0c3 ("libnvdimm, region: update nd_region_available_dpa() for multi-pmem support"). Reported-by: Dariusz Dokupil <dariusz.dokupil@intel.com> Reported-by: Dave Jiang <dave.jiang@intel.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Tested-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-03-27acpi, nfit, libnvdimm: fix interleave set cookie calculation (64-bit comparison)Dan Williams1-1/+5
While reviewing the -stable patch for commit 86ef58a4e35e "nfit, libnvdimm: fix interleave set cookie calculation" Ben noted: "This is returning an int, thus it's effectively doing a 32-bit comparison and not the 64-bit comparison you say is needed." Update the compare operation to be immune to this integer demotion problem. Cc: <stable@vger.kernel.org> Cc: Nicholas Moulin <nicholas.w.moulin@linux.intel.com> Fixes: 86ef58a4e35e ("nfit, libnvdimm: fix interleave set cookie calculation") Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-03-26Linux 4.11-rc4Linus Torvalds1-1/+1
2017-03-25ext4: fix two spelling nitsTheodore Ts'o2-2/+2
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-03-25ext4: lock the xattr block before checksuming itTheodore Ts'o1-34/+31
We must lock the xattr block before calculating or verifying the checksum in order to avoid spurious checksum failures. https://bugzilla.kernel.org/show_bug.cgi?id=193661 Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2017-03-24IB/qib: fix false-postive maybe-uninitialized warningArnd Bergmann1-1/+1
aarch64-linux-gcc-7 complains about code it doesn't fully understand: drivers/infiniband/hw/qib/qib_iba7322.c: In function 'qib_7322_txchk_change': include/asm-generic/bitops/non-atomic.h:105:35: error: 'shadow' may be used uninitialized in this function [-Werror=maybe-uninitialized] The code is right, and despite trying hard, I could not come up with a version that I liked better than just adding a fake initialization here to shut up the warning. Fixes: f931551bafe1 ("IB/qib: Add new qib driver for QLogic PCIe InfiniBand adapters") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24RDMA/iser: Fix possible mr leak on device removal eventSagi Grimberg2-3/+7
When the rdma device is removed, we must cleanup all the rdma resources within the DEVICE_REMOVAL event handler to let the device teardown gracefully. When this happens with live I/O, some memory regions are occupied. Thus, track them too and dereg all the mr's. We are safe with mr access by iscsi_iser_cleanup_task. Reported-by: Raju Rangoju <rajur@chelsio.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24IB/device: Convert ib-comp-wq to be CPU-boundSagi Grimberg1-2/+1
This workqueue is used by our storage target mode ULPs via the new CQ API. Recent observations when working with very high-end flash storage devices reveal that UNBOUND workqueue threads can migrate between cpu cores and even numa nodes (although some numa locality is accounted for). While this attribute can be useful in some workloads, it does not fit in very nicely with the normal run-to-completion model we usually use in our target-mode ULPs and the block-mq irq<->cpu affinity facilities. The whole block-mq concept is that the completion will land on the same cpu where the submission was performed. The fact that our submitter thread is migrating cpus can break this locality. We assume that as a target mode ULP, we will serve multiple initiators/clients and we can spread the load enough without having to use unbound kworkers. Also, while we're at it, expose this workqueue via sysfs which is harmless and can be useful for debug. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>-- Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24IB/cq: Don't process more than the given budgetSagi Grimberg1-1/+7
The caller might not want this overhead. Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24IB/rxe: increment msn only when completing a requestDavid Marchand1-5/+4
According to C9-147, MSN should only be incremented when the last packet of a multi packet request has been received. "Logically, the requester associates a sequential Send Sequence Number (SSN) with each WQE posted to the send queue. The SSN bears a one- to-one relationship to the MSN returned by the responder in each re- sponse packet. Therefore, when the requester receives a response, it in- terprets the MSN as representing the SSN of the most recent request completed by the responder to determine which send WQE(s) can be completed." Fixes: 8700e3e7c485 ("Soft RoCE driver") Signed-off-by: David Marchand <david.marchand@6wind.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24uapi: fix rdma/mlx5-abi.h userspace compilation errorsDmitry V. Levin1-1/+2
Consistently use types from linux/types.h to fix the following rdma/mlx5-abi.h userspace compilation errors: /usr/include/rdma/mlx5-abi.h:69:25: error: 'u64' undeclared here (not in a function) MLX5_LIB_CAP_4K_UAR = (u64)1 << 0, /usr/include/rdma/mlx5-abi.h:69:29: error: expected ',' or '}' before numeric constant MLX5_LIB_CAP_4K_UAR = (u64)1 << 0, Include <linux/if_ether.h> to fix the following rdma/mlx5-abi.h userspace compilation error: /usr/include/rdma/mlx5-abi.h:286:12: error: 'ETH_ALEN' undeclared here (not in a function) __u8 dmac[ETH_ALEN]; Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24IB/core: Restore I/O MMU, s390 and powerpc supportBart Van Assche2-19/+37
Avoid that the following error message is reported on the console while loading an RDMA driver with I/O MMU support enabled: DMAR: Allocating domain for mlx5_0 failed Ensure that DMA mapping operations that use to_pci_dev() to access to struct pci_dev see the correct PCI device. E.g. the s390 and powerpc DMA mapping operations use to_pci_dev() even with I/O MMU support disabled. This patch preserves the following changes of the DMA mapping updates patch series: - Introduction of dma_virt_ops. - Removal of ib_device.dma_ops. - Removal of struct ib_dma_mapping_ops. - Removal of an if-statement from each ib_dma_*() operation. - IB HW drivers no longer set dma_device directly. Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Reported-by: Parav Pandit <parav@mellanox.com> Fixes: commit 99db9494035f ("IB/core: Remove ib_device.dma_device") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: parav@mellanox.com Tested-by: parav@mellanox.com Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24IB/rxe: Update documentation linkLeon Romanovsky1-1/+1
All Soft-RoCE (rxe) is handled now in rdma-core user space library, so the documentation. The patch below updates the documentation link to that new location. Reported-by: Josh Beavers <josh.beavers@gmail.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24RDMA/ocrdma: fix a type issue in ocrdma_put_pd_num()Dan Carpenter1-1/+1
We want to return zero on success or negative error codes. The type should be int and not u8. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24IB/rxe: double free on errorDan Carpenter1-1/+1
"goto err;" has it's own kfree_skb() call so it's a double free. We only need to free on the "goto exit;" path. Fixes: 8700e3e7c485 ("Soft RoCE driver") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24RDMA/vmw_pvrdma: Activate device on ethernet link upAditya Sarwade2-3/+12
Restore device state when ethernet link changes to active. Acked-by: George Zhang <georgezhang@vmware.com> Acked-by: Jorgen Hansen <jhansen@vmware.com> Acked-by: Bryan Tan <bryantan@vmware.com> Signed-off-by: Aditya Sarwade <asarwade@vmware.com> Signed-off-by: Adit Ranadive <aditr@vmware.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24RDMA/vmw_pvrdma: Dont hardcode QP header pageAdit Ranadive2-4/+6
Moved the header page count to a macro. Reported-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Adit Ranadive <aditr@vmware.com> Reviewed-by: Aditya Sarwade <asarwade@vmware.com> Tested-by: Andrew Boyer <andrew.boyer@dell.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24RDMA/vmw_pvrdma: Cleanup unused variablesAdit Ranadive3-22/+17
Removed the unused nreq and redundant index variables. Moved hardcoded async and cq ring pages number to macro. Reported-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Adit Ranadive <aditr@vmware.com> Reviewed-by: Aditya Sarwade <asarwade@vmware.com> Tested-by: Andrew Boyer <andrew.boyer@dell.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24infiniband: Fix alignment of mmap cookies to support VIPT cachingJason Gunthorpe2-4/+4
When vmalloc_user is used to create memory that is supposed to be mmap'd to user space, it is necessary for the mmap cookie (eg the offset) to be aligned to SHMLBA. This creates a situation where all virtual mappings of the same physical page share the same virtual cache index and guarantees VIPT coherence. Otherwise the cache is non-coherent and the kernel will not see writes by userspace when reading the shared page (or vice-versa). Reported-by: Josh Beavers <josh.beavers@gmail.com> Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24IB/core: Protect against self-requeue of a cq work itemSagi Grimberg1-1/+1
We need to make sure that the cq work item does not run when we are destroying the cq. Unlike flush_work, cancel_work_sync protects against self-requeue of the work item (which we can do in ib_cq_poll_work). Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>-- Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24i40iw: Receive netdev events post INET_NOTIFIER stateShiraz Saleem1-0/+8
Netdev notification events are de-registered only when all client iwdev instances are removed. If a single client is closed and re-opened, netdev events could arrive even before the Control Queue-Pair (CQP) is created, causing a NULL pointer dereference crash in i40iw_get_cqp_request. Fix this by allowing netdev event notification only after we have reached the INET_NOTIFIER state with respect to device initialization. Reported-by: Stefan Assmann <sassmann@redhat.com> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-23hwmon: (asus_atk0110) fix uninitialized data accessArnd Bergmann1-0/+3
The latest gcc-7 snapshot adds a warning to point out that when atk_read_value_old or atk_read_value_new fails, we copy uninitialized data into sensor->cached_value: drivers/hwmon/asus_atk0110.c: In function 'atk_input_show': drivers/hwmon/asus_atk0110.c:651:26: error: 'value' may be used uninitialized in this function [-Werror=maybe-uninitialized] Adding an error check avoids this. All versions of the driver are affected. Fixes: 2c03d07ad54d ("hwmon: Add Asus ATK0110 support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Luca Tettamanti <kronos.it@gmail.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2017-03-23xen/acpi: upload PM state from init-domain to XenAnkur Arora1-8/+26
This was broken in commit cd979883b9ed ("xen/acpi-processor: fix enabling interrupts on syscore_resume"). do_suspend (from xen/manage.c) and thus xen_resume_notifier never get called on the initial-domain at resume (it is if running as guest.) The rationale for the breaking change was that upload_pm_data() potentially does blocking work in syscore_resume(). This patch addresses the original issue by scheduling upload_pm_data() to execute in workqueue context. Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: stable@vger.kernel.org Based-on-patch-by: Konrad Wilk <konrad.wilk@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2017-03-23drm/fb-helper: Allow var->x/yres(_virtual) < fb->width/height againMichel Dänzer1-3/+3
Otherwise this can also prevent modesets e.g. for switching VTs, when multiple monitors with different native resolutions are connected. The depths must match though, so keep the != test for that. Also update the DRM_DEBUG output to be slightly more accurate, this doesn't only affect requests from userspace. Bugzilla: https://bugs.freedesktop.org/99841 Fixes: 865afb11949e ("drm/fb-helper: reject any changes to the fbdev") Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Reviewed-by: Daniel Stone <daniels@collabora.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: http://patchwork.freedesktop.org/patch/msgid/20170323085326.20185-1-michel@daenzer.net
2017-03-23xen/acpi: Replace hard coded "ACPI0007"Ankur Arora1-1/+1
Replace hard coded "ACPI0007" with ACPI_PROCESSOR_DEVICE_HID Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2017-03-23libceph: force GFP_NOIO for socket allocationsIlya Dryomov1-0/+6
sock_alloc_inode() allocates socket+inode and socket_wq with GFP_KERNEL, which is not allowed on the writeback path: Workqueue: ceph-msgr con_work [libceph] ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00 ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148 Call Trace: [<ffffffff816dd629>] schedule+0x29/0x70 [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200 [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120 [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70 [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180 [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390 [<ffffffff81086335>] flush_work+0x165/0x250 [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0 [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs] [<ffffffff816d6b42>] ? __slab_free+0xee/0x234 [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs] [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30 [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs] [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs] [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40 [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs] [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs] [<ffffffff811c0c18>] super_cache_scan+0x178/0x180 [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340 [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450 [<ffffffff8115af70>] shrink_slab+0x100/0x140 [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490 [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0 [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40 [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110 [<ffffffff811a0ac5>] new_slab+0x2c5/0x390 [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0 [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0 [<ffffffff811d8566>] alloc_inode+0x26/0xa0 [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70 [<ffffffff815b933e>] sock_alloc+0x1e/0x80 [<ffffffff815ba855>] __sock_create+0x95/0x220 [<ffffffff815baa04>] sock_create_kern+0x24/0x30 [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph] [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd] [<ffffffff81084c19>] process_one_work+0x159/0x4f0 [<ffffffff8108561b>] worker_thread+0x11b/0x530 [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0 [<ffffffff8108b6f9>] kthread+0xc9/0xe0 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 [<ffffffff816e1b98>] ret_from_fork+0x58/0x90 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here. Cc: stable@vger.kernel.org # 3.10+, needs backporting Link: http://tracker.ceph.com/issues/19309 Reported-by: Sergey Jerusalimov <wintchester@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-03-23ALSA: hda - Adding a group of pin definition to fix headset problemHui Wang1-0/+2
A new Dell laptop needs to apply ALC269_FIXUP_DELL1_MIC_NO_PRESENCE to fix the headset problem, and the pin definiton of this machine is not in the pin quirk table yet, now adding it to the table. Signed-off-by: Hui Wang <hui.wang@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-03-23mmc: sdhci-pci: Do not disable interrupts in sdhci_intel_set_powerAdrian Hunter1-0/+4
Disabling interrupts for even a millisecond can cause problems for some devices. That can happen when Intel host controllers wait for the present state to propagate. The spin lock is not necessary here. Anything that is racing with changes to the I/O state is already broken. The mmc core already provides synchronization via "claiming" the host. Although the spin lock probably should be removed from the code paths that lead to this point, such a patch would touch too much code to be suitable for stable trees. Consequently, for this patch, just drop the spin lock while waiting. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Ludovic Desroches <ludovic.desroches@microchip.com>
2017-03-23mmc: sdhci: Do not disable interrupts while waiting for clockAdrian Hunter1-1/+3
Disabling interrupts for even a millisecond can cause problems for some devices. That can happen when sdhci changes clock frequency because it waits for the clock to become stable under a spin lock. The spin lock is not necessary here. Anything that is racing with changes to the I/O state is already broken. The mmc core already provides synchronization via "claiming" the host. Although the spin lock probably should be removed from the code paths that lead to this point, such a patch would touch too much code to be suitable for stable trees. Consequently, for this patch, just drop the spin lock while waiting. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Ludovic Desroches <ludovic.desroches@microchip.com>
2017-03-22net:ethernet:aquantia: Fix for RX checksum offload.Pavel Belous2-0/+2
Since AQC-100/107/108 chips supports hardware checksums for RX we should indicate this via NETIF_F_RXCSUM flag. v1->v2: 'Signed-off-by' tag added. Signed-off-by: Pavel Belous <pavel.belous@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22amd-xgbe: Fix the ECC-related bit position definitionsLendacky, Thomas1-12/+12
The ECC bit positions that describe whether the ECC interrupt is for Tx, Rx or descriptor memory and whether the it is a single correctable or double detected error were defined in incorrectly (reversed order). Fix the bit position definitions for these settings so that the proper ECC handling is performed. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22sfc: cleanup a condition in efx_udp_tunnel_del()Dan Carpenter1-1/+1
Presumably if there is an "add" function, there is also a "del" function. But it causes a static checker warning because it looks like a common cut and paste bug. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22Bluetooth: btqcomsmd: fix compile-test dependencyArnd Bergmann1-1/+2
compile-testing fails when QCOM_SMD is a loadable module: drivers/bluetooth/built-in.o: In function `btqcomsmd_send': btqca.c:(.text+0xa8): undefined reference to `qcom_smd_send' drivers/bluetooth/built-in.o: In function `btqcomsmd_probe': btqca.c:(.text+0x3ec): undefined reference to `qcom_wcnss_open_channel' btqca.c:(.text+0x46c): undefined reference to `qcom_smd_set_drvdata' This clarifies the dependency to allow compile-testing only when SMD is completely disabled, otherwise the dependency on QCOM_SMD will make sure we can link against it. Fixes: e27ee2b16bad ("Bluetooth: btqcomsmd: Allow driver to build if COMPILE_TEST is enabled") Signed-off-by: Arnd Bergmann <arnd@arndb.de> [bjorn: Restructure and clarify dependency to QCOM_WCNSS_CTRL] Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Acked-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22inet: frag: release spinlock before calling icmp_send()Eric Dumazet1-8/+17
Dmitry reported a lockdep splat [1] (false positive) that we can fix by releasing the spinlock before calling icmp_send() from ip_expire() This is a false positive because sending an ICMP message can not possibly re-enter the IP frag engine. [1] [ INFO: possible circular locking dependency detected ] 4.10.0+ #29 Not tainted ------------------------------------------------------- modprobe/12392 is trying to acquire lock: (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] spin_lock include/linux/spinlock.h:299 [inline] (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] __netif_tx_lock include/linux/netdevice.h:3486 [inline] (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 but task is already holding lock: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock include/linux/spinlock.h:299 [inline] (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(&q->lock)->rlock){+.-...}: validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] ip_defrag+0x3a2/0x4130 net/ipv4/ip_fragment.c:669 ip_check_defrag+0x4e3/0x8b0 net/ipv4/ip_fragment.c:713 packet_rcv_fanout+0x282/0x800 net/packet/af_packet.c:1459 deliver_skb net/core/dev.c:1834 [inline] dev_queue_xmit_nit+0x294/0xa90 net/core/dev.c:1890 xmit_one net/core/dev.c:2903 [inline] dev_hard_start_xmit+0x16b/0xab0 net/core/dev.c:2923 sch_direct_xmit+0x31f/0x6d0 net/sched/sch_generic.c:182 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_resolve_output+0x6b9/0xb10 net/core/neighbour.c:1308 neigh_output include/net/neighbour.h:478 [inline] ip_finish_output2+0x8b8/0x15a0 net/ipv4/ip_output.c:228 ip_do_fragment+0x1d93/0x2720 net/ipv4/ip_output.c:672 ip_fragment.constprop.54+0x145/0x200 net/ipv4/ip_output.c:545 ip_finish_output+0x82d/0xe10 net/ipv4/ip_output.c:314 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 raw_sendmsg+0x26de/0x3a00 net/ipv4/raw.c:655 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 ___sys_sendmsg+0x4a3/0x9f0 net/socket.c:1985 __sys_sendmmsg+0x25c/0x750 net/socket.c:2075 SYSC_sendmmsg net/socket.c:2106 [inline] SyS_sendmmsg+0x35/0x60 net/socket.c:2101 do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281 return_from_SYSCALL_64+0x0/0x7a -> #0 (_xmit_ETHER#2){+.-...}: check_prev_add kernel/locking/lockdep.c:1830 [inline] check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940 validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] __netif_tx_lock include/linux/netdevice.h:3486 [inline] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_hh_output include/net/neighbour.h:468 [inline] neigh_output include/net/neighbour.h:476 [inline] ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228 ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394 icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754 ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239 call_timer_fn+0x241/0x820 kernel/time/timer.c:1268 expire_timers kernel/time/timer.c:1307 [inline] __run_timers+0x960/0xcf0 kernel/time/timer.c:1601 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284 invoke_softirq kernel/softirq.c:364 [inline] irq_exit+0x1cc/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:657 [inline] smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707 __read_once_size include/linux/compiler.h:254 [inline] atomic_read arch/x86/include/asm/atomic.h:26 [inline] rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline] __rcu_is_watching kernel/rcu/tree.c:1133 [inline] rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147 rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293 radix_tree_deref_slot include/linux/radix-tree.h:238 [inline] filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335 do_fault_around mm/memory.c:3231 [inline] do_read_fault mm/memory.c:3265 [inline] do_fault+0xbd5/0x2080 mm/memory.c:3370 handle_pte_fault mm/memory.c:3600 [inline] __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714 handle_mm_fault+0x1e2/0x480 mm/memory.c:3751 __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397 do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460 page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&q->lock)->rlock); lock(_xmit_ETHER#2); lock(&(&q->lock)->rlock); lock(_xmit_ETHER#2); *** DEADLOCK *** 10 locks held by modprobe/12392: #0: (&mm->mmap_sem){++++++}, at: [<ffffffff81329758>] __do_page_fault+0x2b8/0xb60 arch/x86/mm/fault.c:1336 #1: (rcu_read_lock){......}, at: [<ffffffff8188cab6>] filemap_map_pages+0x1e6/0x1570 mm/filemap.c:2324 #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] spin_lock include/linux/spinlock.h:299 [inline] #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] pte_alloc_one_map mm/memory.c:2944 [inline] #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] alloc_set_pte+0x13b8/0x1b90 mm/memory.c:3072 #3: (((&q->timer))){+.-...}, at: [<ffffffff81627e72>] lockdep_copy_map include/linux/lockdep.h:175 [inline] #3: (((&q->timer))){+.-...}, at: [<ffffffff81627e72>] call_timer_fn+0x1c2/0x820 kernel/time/timer.c:1258 #4: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock include/linux/spinlock.h:299 [inline] #4: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201 #5: (rcu_read_lock){......}, at: [<ffffffff8389a633>] ip_expire+0x1b3/0x6c0 net/ipv4/ip_fragment.c:216 #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] spin_trylock include/linux/spinlock.h:309 [inline] #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] icmp_xmit_lock net/ipv4/icmp.c:219 [inline] #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] icmp_send+0x803/0x1c80 net/ipv4/icmp.c:681 #7: (rcu_read_lock_bh){......}, at: [<ffffffff838ab9a1>] ip_finish_output2+0x2c1/0x15a0 net/ipv4/ip_output.c:198 #8: (rcu_read_lock_bh){......}, at: [<ffffffff836d1dee>] __dev_queue_xmit+0x23e/0x1e60 net/core/dev.c:3324 #9: (dev->qdisc_running_key ?: &qdisc_running_key){+.....}, at: [<ffffffff836d3a27>] dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 stack backtrace: CPU: 0 PID: 12392 Comm: modprobe Not tainted 4.10.0+ #29 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x2ee/0x3ef lib/dump_stack.c:52 print_circular_bug+0x307/0x3b0 kernel/locking/lockdep.c:1204 check_prev_add kernel/locking/lockdep.c:1830 [inline] check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940 validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] __netif_tx_lock include/linux/netdevice.h:3486 [inline] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_hh_output include/net/neighbour.h:468 [inline] neigh_output include/net/neighbour.h:476 [inline] ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228 ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394 icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754 ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239 call_timer_fn+0x241/0x820 kernel/time/timer.c:1268 expire_timers kernel/time/timer.c:1307 [inline] __run_timers+0x960/0xcf0 kernel/time/timer.c:1601 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284 invoke_softirq kernel/softirq.c:364 [inline] irq_exit+0x1cc/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:657 [inline] smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707 RIP: 0010:__read_once_size include/linux/compiler.h:254 [inline] RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline] RIP: 0010:rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline] RIP: 0010:__rcu_is_watching kernel/rcu/tree.c:1133 [inline] RIP: 0010:rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147 RSP: 0000:ffff8801c391f120 EFLAGS: 00000a03 ORIG_RAX: ffffffffffffff10 RAX: dffffc0000000000 RBX: ffff8801c391f148 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000055edd4374000 RDI: ffff8801dbe1ae0c RBP: ffff8801c391f1a0 R08: 0000000000000002 R09: 0000000000000000 R10: dffffc0000000000 R11: 0000000000000002 R12: 1ffff10038723e25 R13: ffff8801dbe1ae00 R14: ffff8801c391f680 R15: dffffc0000000000 </IRQ> rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293 radix_tree_deref_slot include/linux/radix-tree.h:238 [inline] filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335 do_fault_around mm/memory.c:3231 [inline] do_read_fault mm/memory.c:3265 [inline] do_fault+0xbd5/0x2080 mm/memory.c:3370 handle_pte_fault mm/memory.c:3600 [inline] __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714 handle_mm_fault+0x1e2/0x480 mm/memory.c:3751 __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397 do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460 page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011 RIP: 0033:0x7f83172f2786 RSP: 002b:00007fffe859ae80 EFLAGS: 00010293 RAX: 000055edd4373040 RBX: 00007f83175111c8 RCX: 000055edd4373238 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f8317510970 RBP: 00007fffe859afd0 R08: 0000000000000009 R09: 0000000000000000 R10: 0000000000000064 R11: 0000000000000000 R12: 000055edd4373040 R13: 0000000000000000 R14: 00007fffe859afe8 R15: 0000000000000000 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22tcp: initialize icsk_ack.lrcvtime at session start timeEric Dumazet2-1/+2
icsk_ack.lrcvtime has a 0 value at socket creation time. tcpi_last_data_recv can have bogus value if no payload is ever received. This patch initializes icsk_ack.lrcvtime for active sessions in tcp_finish_connect(), and for passive sessions in tcp_create_openreq_child() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22genetlink: fix counting regression on ctrl_dumpfamily()Stanislaw Gruszka1-1/+3
Commit 2ae0f17df1cd ("genetlink: use idr to track families") replaced if (++n < fams_to_skip) continue; into: if (n++ < fams_to_skip) continue; This subtle change cause that on retry ctrl_dumpfamily() call we omit one family that failed to do ctrl_fill_info() on previous call, because cb->args[0] = n number counts also family that failed to do ctrl_fill_info(). Patch fixes the problem and avoid confusion in the future just decrease n counter when ctrl_fill_info() fail. User visible problem caused by this bug is failure to get access to some genetlink family i.e. nl80211. However problem is reproducible only if number of registered genetlink families is big enough to cause second call of ctrl_dumpfamily(). Cc: Xose Vazquez Perez <xose.vazquez@gmail.com> Cc: Larry Finger <Larry.Finger@lwfinger.net> Cc: Johannes Berg <johannes@sipsolutions.net> Fixes: 2ae0f17df1cd ("genetlink: use idr to track families") Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22socket, bpf: fix sk_filter use after free in sk_clone_lockDaniel Borkmann1-0/+6
In sk_clone_lock(), we create a new socket and inherit most of the parent's members via sock_copy() which memcpy()'s various sections. Now, in case the parent socket had a BPF socket filter attached, then newsk->sk_filter points to the same instance as the original sk->sk_filter. sk_filter_charge() is then called on the newsk->sk_filter to take a reference and should that fail due to hitting max optmem, we bail out and release the newsk instance. The issue is that commit 278571baca2a ("net: filter: simplify socket charging") wrongly combined the dismantle path with the failure path of xfrm_sk_clone_policy(). This means, even when charging failed, we call sk_free_unlock_clone() on the newsk, which then still points to the same sk_filter as the original sk. Thus, sk_free_unlock_clone() calls into __sk_destruct() eventually where it tests for present sk_filter and calls sk_filter_uncharge() on it, which potentially lets sk_omem_alloc wrap around and releases the eBPF prog and sk_filter structure from the (still intact) parent. Fix it by making sure that when sk_filter_charge() failed, we reset newsk->sk_filter back to NULL before passing to sk_free_unlock_clone(), so that we don't mess with the parents sk_filter. Only if xfrm_sk_clone_policy() fails, we did reach the point where either the parent's filter was NULL and as a result newsk's as well or where we previously had a successful sk_filter_charge(), thus for that case, we do need sk_filter_uncharge() to release the prior taken reference on sk_filter. Fixes: 278571baca2a ("net: filter: simplify socket charging") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22ipv4: provide stronger user input validation in nl_fib_input()Eric Dumazet1-1/+2
Alexander reported a KMSAN splat caused by reads of uninitialized field (tb_id_in) from user provided struct fib_result_nl It turns out nl_fib_input() sanity tests on user input is a bit wrong : User can pretend nlh->nlmsg_len is big enough, but provide at sendmsg() time a too small buffer. Reported-by: Alexander Potapenko <glider@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22bpf: fix hashmap extra_elems logicAlexei Starovoitov2-76/+97
In both kmalloc and prealloc mode the bpf_map_update_elem() is using per-cpu extra_elems to do atomic update when the map is full. There are two issues with it. The logic can be misused, since it allows max_entries+num_cpus elements to be present in the map. And alloc_extra_elems() at map creation time can fail percpu alloc for large map values with a warn: WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60 illegal size (32824) or align (8) for percpu allocation The fixes for both of these issues are different for kmalloc and prealloc modes. For prealloc mode allocate extra num_possible_cpus elements and store their pointers into extra_elems array instead of actual elements. Hence we can use these hidden(spare) elements not only when the map is full but during bpf_map_update_elem() that replaces existing element too. That also improves performance, since pcpu_freelist_pop/push is avoided. Unfortunately this approach cannot be used for kmalloc mode which needs to kfree elements after rcu grace period. Therefore switch it back to normal kmalloc even when full and old element exists like it was prior to commit 6c9059817432 ("bpf: pre-allocate hash map elements"). Add tests to check for over max_entries and large map values. Reported-by: Dave Jones <davej@codemonkey.org.uk> Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22enic: update enic maintainersGovindarajulu Varadarajan1-1/+0
update enic maintainers Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>