linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2016-05-26	IB/hfi1: Move driver out of staging	Dennis Dalessandro	1	-325/+0
	The TODO list for the hfi1 driver was completed during 4.6. In addition other objections raised (which are far beyond what was in the TODO list) have been addressed as well. It is now time to remove the driver from staging and into the drivers/infiniband sub-tree. Reviewed-by: Jubin John <jubin.john@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-13	IB/hfi1: Improve performance of interval RB trees	Mitko Haralanov	1	-11/+11
	The interval RB tree management functions use handlers to store user-specific callback for the various tree operations. These handlers are put on a doubly-linked list. When a RB tree function is called, the list is searched for the handler of the particular tree. The list which holds the handlers is modified very rarely - when a handler is created and when a handler is removed. On the other hand, it is searched very often. This a perfect usage scenario for RCU. The result is a much lower overhead of traversing the list as most of the time no locking will be required. Reviewed-by: Dean Luick <dean.luick@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28	IB/hfi1: Fix buffer cache races which may cause corruption	Mitko Haralanov	1	-3/+3
	There are two possible causes for node/memory corruption both of which are related to the cache eviction algorithm. One way to cause corruption is due to the asynchronous nature of the MMU invalidation and the locking used when invalidating node. The MMU invalidation routine would temporarily release the RB tree lock to avoid a deadlock. However, this would allow the eviction function to take the lock resulting in the removal of cache nodes. If the node being removed by the eviction code is the same as the node being invalidated, the result is use after free. The same is true in the other direction due to the temporary release of the eviction list lock in the eviction loop. Another corner case exists when dealing with the SDMA buffer cache that could cause memory corruption of kernel memory. The most common way, in which this corruption exhibits itself is a linked list node corruption. In that case, the kernel will complain that a node with poisoned pointers is being removed. The fact that the pointers are already poisoned means that the node has already been removed from the list. To root cause of this corruption was a mishandling of the eviction list maintained by the driver. In order for this to happen four conditions need to be satisfied: 1. A node describing a user buffer already exists in the interval RB tree, 2. The beginning of the current user buffer matches that node but is bigger. This will cause the node to be extended. 3. The amount of cached buffers is close or at the limit of the buffer cache size. 4. The node has dropped close to the end of the eviction list. This will cause the node to be considered for eviction. If all of the above conditions have been satisfied, it is possible for the eviction algorithm to evict the current node, which will free the node without the driver knowing. To solve both issues described above: - the locking around the MMU invalidation loop and cache eviction loop has been improved so locks are not released in the loop body, - a new RB function is introduced which will "atomically" find and remove the matching node from the RB tree, preventing the MMU invalidation loop from touching it, and - the node being extended by the pin_vector_pages() function is removed from the eviction list prior to calling the eviction function. Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28	IB/hfi1: Extract and reinsert MMU RB node on lookup	Mitko Haralanov	1	-0/+19
	The page pinning function, which also maintains the pin cache, behaves one of two ways when an exact buffer match is not found: 1. If no node is not found (a buffer with the same starting address is not found in the cache), a new node is created, the buffer pages are pinned, and the node is inserted into the RB tree, or 2. If a node is found but the buffer in that node is a subset of the new user buffer, the node is extended with the new buffer pages. Both modes of operation require (re-)insertion into the interval RB tree. When the node being inserted is a new node, the operations are pretty simple. However, when the node is already existing and is being extended, special care must be taken. First, we want to guard against an asynchronous attempt to delete the node by the MMU invalidation notifier. The simplest way to do this is to remove the node from the RB tree, preventing the search algorithm from finding it. Second, the node needs to be re-inserted so it lands in the proper place in the tree and the tree is correctly re-balanced. This also requires the node to be removed from the RB tree. This commit adds the hfi1_mmu_rb_extract() function, which will search for a node in the interval RB tree matching an address and length and remove it from the RB tree if found. This allows for both of the above special cases be handled in a single step. Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28	IB/hfi1: Correctly compute node interval	Mitko Haralanov	1	-1/+1
	The computation of the interval of an interval RB node was incorrect leading to data corruption due to the RB search algorithm not properly finding the all RB nodes in an MMU invalidation interval. The problem stemmed from the fact that the beginning address of the node's range was being aligned to a page boundary. For certain buffer sizes, this would lead to a end address calculation that was off by 1 page. An important aspect of keeping the RB same is also updating the node's range in the case it's being extended. Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28	IB/hfi1: Protect the interval RB tree when cleaning up	Mitko Haralanov	1	-2/+6
	The current implementation of the clean up function for the interval RB trees has two flaws which may cause problems in cases of concurrent executing of the function and MMU notifier. The flaws were due to the fact that deregistration of the MMU callbacks was done after the tree was emptied and, furthermore, the tree was not being locked. This commit fixes both of these flaws by, first, switch the order of operations, and, second, locking the tree while traversing it to prevent any other operations. Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28	IB/hfi1: Fix deadlock caused by locking with wrong scope	Mitko Haralanov	1	-5/+11
	The locking around the interval RB tree is designed to prevent access to the tree while it's being modified. The locking in its current form is too overzealous, which is causing a deadlock in certain cases with the following backtrace: Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0 CPU: 0 PID: 5836 Comm: IMB-MPI1 Tainted: G O 3.12.18-wfr+ #1 0000000000000000 ffff88087f206c50 ffffffff814f1caa ffffffff817b53f0 ffff88087f206cc8 ffffffff814ecd56 0000000000000010 ffff88087f206cd8 ffff88087f206c78 0000000000000000 0000000000000000 0000000000001662 Call Trace: <NMI> [<ffffffff814f1caa>] dump_stack+0x45/0x56 [<ffffffff814ecd56>] panic+0xc2/0x1cb [<ffffffff810d4370>] ? restart_watchdog_hrtimer+0x50/0x50 [<ffffffff810d4432>] watchdog_overflow_callback+0xc2/0xd0 [<ffffffff81109b4e>] __perf_event_overflow+0x8e/0x2b0 [<ffffffff8110a714>] perf_event_overflow+0x14/0x20 [<ffffffff8101c906>] intel_pmu_handle_irq+0x1b6/0x390 [<ffffffff814f927b>] perf_event_nmi_handler+0x2b/0x50 [<ffffffff814f8ad8>] nmi_handle.isra.3+0x88/0x180 [<ffffffff814f8d39>] do_nmi+0x169/0x310 [<ffffffff814f8177>] end_repeat_nmi+0x1e/0x2e [<ffffffff81272600>] ? unmap_single+0x30/0x30 [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40 [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40 [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40 <<EOE>> <IRQ> [<ffffffffa056c4a8>] hfi1_mmu_rb_search+0x38/0x70 [hfi1] [<ffffffffa05919cb>] user_sdma_free_request+0xcb/0x120 [hfi1] [<ffffffffa0593393>] user_sdma_txreq_cb+0x263/0x350 [hfi1] [<ffffffffa057fad7>] ? sdma_txclean+0x27/0x1c0 [hfi1] [<ffffffffa0593130>] ? user_sdma_send_pkts+0x1710/0x1710 [hfi1] [<ffffffffa057fdd6>] sdma_make_progress+0x166/0x480 [hfi1] [<ffffffff810762c9>] ? ttwu_do_wakeup+0x19/0xd0 [<ffffffffa0581c7e>] sdma_engine_interrupt+0x8e/0x100 [hfi1] [<ffffffffa0546bdd>] sdma_interrupt+0x5d/0xa0 [hfi1] [<ffffffff81097e57>] handle_irq_event_percpu+0x47/0x1d0 [<ffffffff81098017>] handle_irq_event+0x37/0x60 [<ffffffff8109aa5f>] handle_edge_irq+0x6f/0x120 [<ffffffff810044af>] handle_irq+0xbf/0x150 [<ffffffff8104c9b7>] ? irq_enter+0x17/0x80 [<ffffffff8150168d>] do_IRQ+0x4d/0xc0 [<ffffffff814f7c6a>] common_interrupt+0x6a/0x6a <EOI> [<ffffffff81073524>] ? finish_task_switch+0x54/0xe0 [<ffffffff814f56c6>] __schedule+0x3b6/0x7e0 [<ffffffff810763a6>] __cond_resched+0x26/0x30 [<ffffffff814f5eda>] _cond_resched+0x3a/0x50 [<ffffffff814f4f82>] down_write+0x12/0x30 [<ffffffffa0591619>] hfi1_release_user_pages+0x69/0x90 [hfi1] [<ffffffffa059173a>] sdma_rb_remove+0x9a/0xc0 [hfi1] [<ffffffffa056c00d>] __mmu_rb_remove.isra.5+0x5d/0x70 [hfi1] [<ffffffffa056c536>] hfi1_mmu_rb_remove+0x56/0x70 [hfi1] [<ffffffffa059427b>] hfi1_user_sdma_process_request+0x74b/0x1160 [hfi1] [<ffffffffa055c763>] hfi1_aio_write+0xc3/0x100 [hfi1] [<ffffffff8116a14c>] do_sync_readv_writev+0x4c/0x80 [<ffffffff8116b58b>] do_readv_writev+0xbb/0x230 [<ffffffff811a9da1>] ? fsnotify+0x241/0x320 [<ffffffff81073524>] ? finish_task_switch+0x54/0xe0 [<ffffffff8116b795>] vfs_writev+0x35/0x60 [<ffffffff8116b8c9>] SyS_writev+0x49/0xc0 [<ffffffff810cd876>] ? __audit_syscall_exit+0x1f6/0x2a0 [<ffffffff814ff992>] system_call_fastpath+0x16/0x1b As evident from the backtrace above, the process was being put to sleep while holding the lock. Limiting the scope of the lock only to the RB tree operation fixes the above error allowing for proper locking and the process being put to sleep when needed. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28	IB/hfi1: Prevent NULL pointer deferences in caching code	Mitko Haralanov	1	-10/+14
	There is a potential kernel crash when the MMU notifier calls the invalidation routines in the hfi1 pinned page caching code for sdma. The invalidation routine could call the remove callback for the node, which in turn ends up dereferencing the current task_struct to get a pointer to the mm_struct. However, the mm_struct pointer could be NULL resulting in the following backtrace: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 IP: [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1] 15 task: ffff88085e66e080 ti: ffff88085c244000 task.ti: ffff88085c244000 RIP: 0010:[<ffffffffa041f75a>] [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1] RSP: 0000:ffff88085c245878 EFLAGS: 00010002 RAX: 0000000000000000 RBX: ffff88105b9bbd40 RCX: ffffea003931a830 RDX: 0000000000000004 RSI: ffff88105754a9c0 RDI: ffff88105754a9c0 RBP: ffff88085c245890 R08: ffff88105b9bbd70 R09: 00000000fffffffb R10: ffff88105b9bbd58 R11: 0000000000000013 R12: ffff88105754a9c0 R13: 0000000000000001 R14: 0000000000000001 R15: ffff88105b9bbd40 FS: 0000000000000000(0000) GS:ffff88107ef40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000a8 CR3: 0000000001a0b000 CR4: 00000000001407e0 Stack: ffff88105b9bbd40 ffff88080ec481a8 ffff88080ec481b8 ffff88085c2458c0 ffffffffa03fa00e ffff88080ec48190 ffff88080ed9cd00 0000000001024000 0000000000000000 ffff88085c245920 ffffffffa03fa0e7 0000000000000282 Call Trace: [<ffffffffa03fa00e>] __mmu_rb_remove.isra.5+0x5e/0x70 [hfi1] [<ffffffffa03fa0e7>] mmu_notifier_mem_invalidate+0xc7/0xf0 [hfi1] [<ffffffffa03fa143>] mmu_notifier_page+0x13/0x20 [hfi1] [<ffffffff81156dd0>] __mmu_notifier_invalidate_page+0x50/0x70 [<ffffffff81140bbb>] try_to_unmap_one+0x20b/0x470 [<ffffffff81141ee7>] try_to_unmap_anon+0xa7/0x120 [<ffffffff81141fad>] try_to_unmap+0x4d/0x60 [<ffffffff8111fd7b>] shrink_page_list+0x2eb/0x9d0 [<ffffffff81120ab3>] shrink_inactive_list+0x243/0x490 [<ffffffff81121491>] shrink_lruvec+0x4c1/0x640 [<ffffffff81121641>] shrink_zone+0x31/0x100 [<ffffffff81121b0f>] kswapd_shrink_zone.constprop.62+0xef/0x1c0 [<ffffffff811229e3>] kswapd+0x403/0x7e0 [<ffffffff811225e0>] ? shrink_all_memory+0xf0/0xf0 [<ffffffff81068ac0>] kthread+0xc0/0xd0 [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40 [<ffffffff814ff8ec>] ret_from_fork+0x7c/0xb0 [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40 To correct this, the mm_struct passed to us by the MMU notifier is used (which is what should have been done to begin with). This avoids the broken derefences and ensures that the correct mm_struct is used. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21	IB/hfi1: Adjust last address values for intervals	Mitko Haralanov	1	-3/+3
	Last address values for intervals in the interval RB tree nodes should be non-inclusive in order to avoid confusing ranges. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21	IB/hfi1: Add filter callback	Mitko Haralanov	1	-5/+14
	This commit adds a filter callback, which can be used to filter out interval RB nodes matching a certain interval down to a single one. This is needed for the upcoming SDMA-side caching where buffers will need to be filtered by their virtual address. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21	IB/hfi1: Remove compare callback	Mitko Haralanov	1	-1/+1
	Interval RB trees provide their own searching function, which also takes care of determining the path through the tree that should be taken. This make the compare callback unnecessary. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21	IB/hfi1: Add MMU tracing	Mitko Haralanov	1	-0/+10
	Add a new tracepoint type for the MMU functions and calls to that tracepoint to allow tracing of MMU functionality. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21	IB/hfi1: Use interval RB trees	Mitko Haralanov	1	-74/+32
	The interval RB trees can handle RB nodes which hold ranged information. This is exactly the usage for the buffer cache implemented in the expected receive code path. Convert the MMU/RB functions to use the interval RB tree API. This will help with future users of the caching API, as well. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21	IB/hfi1: Notify remove MMU/RB callback of calling context	Mitko Haralanov	1	-5/+5
	Tell the remove MMU/RB callback if it's being called as part of a memory invalidation or not. This can be important in preventing a deadlock if the remove callback attempts to take the map_sem semaphore because the kernel's MMU invalidation functions have already taken it. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21	IB/hfi1: Allow remove MMU callbacks to free nodes	Mitko Haralanov	1	-5/+7
	In order to allow the remove MMU callbacks to free the RB nodes, it is necessary to prevent any references to the nodes after the remove callback has been called. Therefore, remove the node from the tree prior to calling the callback. In other words, the MMU/RB API now guarantees that all RB node operations it performs will be done prior to calling the remove callback and that the RB node will not be touched afterwards. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21	IB/hfi1: Prevent NULL pointer dereference	Mitko Haralanov	1	-0/+3
	Prevent a potential NULL pointer dereference (found by code inspection) when unregistering an MMU handler. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21	IB/hfi1: Allow MMU function execution in IRQ context	Mitko Haralanov	1	-15/+21
	Future users of the MMU/RB functions might be searching or manipulating the MMU RB trees in interrupt context. Therefore, the MMU/RB functions need to be able to run in interrupt context. This requires that we use the IRQ-aware API for spin locks. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21	IB/hfi1: Re-factor MMU notification code	Mitko Haralanov	1	-0/+304
	The MMU notification code added to the expected receive side has been re-factored and split into it's own file. This was done in order to make the code more general and, therefore, usable by other parts of the driver. The caching behavior remains the same. However, the handling of the RB tree (insertion, deletions, and searching) as well as the MMU invalidation processing is now handled by functions in the mmu_rb.[ch] files. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>