aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/export-to-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2020-03-17RDMA/cm: Make it clearer how concurrency works in cm_req_handler()Jason Gunthorpe1-42/+57
ib_crate_cm_id() immediately places the id in the xarray, and publishes it into the remote_id and remote_qpn rbtrees. This makes it visible to other threads before it is fully set up. It appears the thinking here was that the states IB_CM_IDLE and IB_CM_REQ_RCVD do not allow any MAD handler or lookup in the remote_id and remote_qpn rbtrees to advance. However, cm_rej_handler() does take an action on IB_CM_REQ_RCVD, which is not really expected by the design. Make the whole thing clearer: - Keep the new cm_id out of the xarray until it is completely set up. This directly prevents MAD handlers and all rbtree lookups from seeing the pointer. - Move all the trivial setup right to the top so it is obviously done before any concurrency begins - Move the mutation of the cm_id_priv out of cm_match_id() and into the caller so the state transition is obvious - Place the manipulation of the work_list at the end, under lock, after the cm_id is placed in the xarray. The work_count cannot change on an ID outside the xarray. - Add some comments Link: https://lore.kernel.org/r/20200310092545.251365-9-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-17RDMA/cm: Make it clear that there is no concurrency in cm_sidr_req_handler()Jason Gunthorpe1-27/+37
ib_create_cm_id() immediately places the id in the xarray, so it is visible to network traffic. The state is initially set to IB_CM_IDLE and all the MAD handlers will test this state under lock and refuse to advance from IDLE, so adding to the xarray is harmless. Further, the set to IB_CM_SIDR_REQ_RCVD also excludes all MAD handlers. However, the local_id isn't even used for SIDR mode, and there will be no input MADs related to the newly created ID. So, make the whole flow simpler so it can be understood: - Do not put the SIDR cm_id in the xarray. This directly shows that there is no concurrency - Delete the confusing work_count and pending_list manipulations. This mechanism is only used by MAD handlers and timewait, neither of which apply to SIDR. - Add a few comments and rename 'cur_cm_id_priv' to 'listen_cm_id_priv' - Move other loose sets up to immediately after cm_id creation so that the cm_id is fully configured right away. This fixes an oversight where the service_id will not be returned back on a IB_SIDR_UNSUPPORTED reject. Link: https://lore.kernel.org/r/20200310092545.251365-8-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-17RDMA/cm: Read id.state under lock when doing pr_debug()Jason Gunthorpe1-4/+4
The lock should not be dropped before doing the pr_debug() print as it is accessing data protected by the lock, such as id.state. Fixes: 119bf81793ea ("IB/cm: Add debug prints to ib_cm") Link: https://lore.kernel.org/r/20200310092545.251365-7-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-17RDMA/cm: Simplify establishing a listen cm_idJason Gunthorpe1-83/+116
Any manipulation of cm_id->state must be done under the cm_id_priv->lock, the two routines that added listens did not follow this rule, because they never participate in any concurrent access around the state. However, since this exception makes the code hard to understand, simplify the flow so that it can be fully locked: - Move manipulation of listen_sharecount into cm_insert_listen() so it is trivially under the cm.lock without having to expose the cm.lock to the caller. - Push the cm.lock down into cm_insert_listen() and have the function increment the reference count before returning an existing pointer. - Split ib_cm_listen() into an cm_init_listen() and do not call ib_cm_listen() from ib_cm_insert_listen() - Make both ib_cm_listen() and ib_cm_insert_listen() directly call cm_insert_listen() under their cm_id_priv->lock which does both a collision detect and, if needed, the insert (atomically) - Enclose all state manipulation within the cm_id_priv->lock, notice this set can be done safely after cm_insert_listen() as no reader is allowed to read the state without holding the lock. - Do not set the listen cm_id in the xarray, as it is never correct to look it up. This makes the concurrency simpler to understand. Many needless error unwinds are removed in the process. Link: https://lore.kernel.org/r/20200310092545.251365-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-17RDMA/cm: Make the destroy_id flow more robustJason Gunthorpe1-5/+8
Too much of the destruction is very carefully sensitive to the state and various other things. Move more code to the unconditional path and add several WARN_ONs to check consistency. Link: https://lore.kernel.org/r/20200310092545.251365-5-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-17RDMA/cm: Remove a race freeing timewait_infoJason Gunthorpe1-10/+15
When creating a cm_id during REQ the id immediately becomes visible to the other MAD handlers, and shortly after the state is moved to IB_CM_REQ_RCVD This allows cm_rej_handler() to run concurrently and free the work: CPU 0 CPU1 cm_req_handler() ib_create_cm_id() cm_match_req() id_priv->state = IB_CM_REQ_RCVD cm_rej_handler() cm_acquire_id() spin_lock(&id_priv->lock) switch (id_priv->state) case IB_CM_REQ_RCVD: cm_reset_to_idle() kfree(id_priv->timewait_info); goto destroy destroy: kfree(id_priv->timewait_info); id_priv->timewait_info = NULL Causing a double free or worse. Do not free the timewait_info without also holding the id_priv->lock. Simplify this entire flow by making the free unconditional during cm_destroy_id() and removing the confusing special case error unwind during creation of the timewait_info. This also fixes a leak of the timewait if cm_destroy_id() is called in IB_CM_ESTABLISHED with an XRC TGT QP. The state machine will be left in ESTABLISHED while it needed to transition through IB_CM_TIMEWAIT to release the timewait pointer. Also fix a leak of the timewait_info if the caller mis-uses the API and does ib_send_cm_reqs(). Fixes: a977049dacde ("[PATCH] IB: Add the kernel CM implementation") Link: https://lore.kernel.org/r/20200310092545.251365-4-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-17RDMA/cm: Fix checking for allowed duplicate listensJason Gunthorpe1-1/+2
The test here typod the cm_id_priv to use, it used the one that was freshly allocated. By definition the allocated one has the matching cm_handler and zero context, so the condition was always true. Instead check that the existing listening ID is compatible with the proposed handler so that it can be shared, as was originally intended. Fixes: 067b171b8679 ("IB/cm: Share listening CM IDs") Link: https://lore.kernel.org/r/20200310092545.251365-3-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-17RDMA/cm: Fix ordering of xa_alloc_cyclic() in ib_create_cm_id()Jason Gunthorpe1-16/+11
xa_alloc_cyclic() is a SMP release to be paired with some later acquire during xa_load() as part of cm_acquire_id(). As such, xa_alloc_cyclic() must be done after the cm_id is fully initialized, in particular, it absolutely must be after the refcount_set(), otherwise the refcount_inc() in cm_acquire_id() may not see the set. As there are several cases where a reader will be able to use the id.local_id after cm_acquire_id in the IB_CM_IDLE state there needs to be an unfortunate split into a NULL allocate and a finalizing xa_store. Fixes: a977049dacde ("[PATCH] IB: Add the kernel CM implementation") Link: https://lore.kernel.org/r/20200310092545.251365-2-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/hns: Fix wrong judgments of udata->outlenWeihang Li1-4/+4
These judgments were used to keep the compatibility with older versions of userspace that don't have the field named "cap_flags" in structure hns_roce_ib_create_cq_resp. But it will be wrong to compare outlen with the size of resp if another new field were added in resp. oulen should be compared with the end offset of cap_flags in resp. Fixes: 4f8f0d5e33dd ("RDMA/hns: Package the flow of creating cq") Link: https://lore.kernel.org/r/1583845569-47257-1-git-send-email-liweihang@huawei.com Signed-off-by: Weihang Li <liweihang@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Allow MRs to be created in the cache synchronouslyJason Gunthorpe2-61/+87
If the cache is completely out of MRs, and we are running in cache mode, then directly, and synchronously, create an MR that is compatible with the cache bucket using a sleeping mailbox command. This ensures that the thread that is waiting for the MR absolutely will get one. When a MR allocated in this way becomes freed then it is compatible with the cache bucket and will be recycled back into it. Deletes the very buggy ent->compl scheme to create a synchronous MR allocation. Link: https://lore.kernel.org/r/20200310082238.239865-13-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Revise how the hysteresis scheme works for cache fillingJason Gunthorpe2-15/+27
Currently if the work queue is running then it is in 'hysteresis' mode and will fill until the cache reaches the high water mark. This implicit state is very tricky and doesn't interact with pending very well. Instead of self re-scheduling the work queue after the add_keys() has started to create the new MR, have the queue scheduled from reg_mr_callback() only after the requested MR has been added. This avoids the bad design of an in-rush of queue'd work doing back to back add_keys() until EAGAIN then sleeping. The add_keys() will be paced one at a time as they complete, slowly filling up the cache. Also, fix pending to be only manipulated under lock. Link: https://lore.kernel.org/r/20200310082238.239865-12-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Fix locking in MR cache work queueJason Gunthorpe2-46/+80
All of the members of mlx5_cache_ent must be accessed while holding the spinlock, add the missing spinlock in the __cache_work_func(). Using cache->stopped and flush_workqueue() is an inherently racy way to shutdown self-scheduling work on a queue. Replace it with ent->disabled under lock, and always check disabled before queuing any new work. Use cancel_work_sync() to shutdown the queue. Use READ_ONCE/WRITE_ONCE for dev->last_add to manage concurrency as coherency is less important here. Split fill_delay from the bitfield. C bitfield updates are not atomic and this is just a mess. Use READ_ONCE/WRITE_ONCE, but this could also use test_bit()/set_bit(). Link: https://lore.kernel.org/r/20200310082238.239865-11-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Lock access to ent->available_mrs/limit when doing queue_workJason Gunthorpe1-15/+25
Accesses to these members needs to be locked. There is no reason not to hold a spinlock while calling queue_work(), so move the tests into a helper and always call it under lock. The helper should be called when available_mrs is adjusted. Link: https://lore.kernel.org/r/20200310082238.239865-10-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Fix MR cache size and limit debugfsJason Gunthorpe1-64/+88
The size_write function is supposed to adjust the total_mr's to match the user's request, but lacks locking and safety checking. total_mrs can only be adjusted by at most available_mrs. mrs already assigned to users cannot be revoked. Ensure that the user provides a target value within the range of available_mrs and within the high/low water mark. limit_write has confusing and wrong sanity checking, and doesn't have the ability to deallocate on limit reduction. Since both functions use the same algorithm to adjust the available_mrs, consolidate it into one function and write it correctly. Fix the locking and by holding the spinlock for all accesses to ent->X. Always fail if the user provides a malformed string. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Link: https://lore.kernel.org/r/20200310082238.239865-9-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Always remove MRs from the cache before destroying themJason Gunthorpe1-6/+13
The cache bucket tracks the total number of MRs that exists, both inside and outside of the cache. Removing a MR from the cache (by setting cache_ent to NULL) without updating total_mrs will cause the tracking to leak and be inflated. Further fix the rereg_mr path to always destroy the MR. reg_create will always overwrite all the MR data in mlx5_ib_mr, so the MR must be completely destroyed, in all cases, before this function can be called. Detach the MR from the cache and unconditionally destroy it to avoid leaking HW mkeys. Fixes: afd1417404fb ("IB/mlx5: Use direct mkey destroy command upon UMR unreg failure") Fixes: 56e11d628c5d ("IB/mlx5: Added support for re-registration of MRs") Link: https://lore.kernel.org/r/20200310082238.239865-8-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Simplify how the MR cache bucket is locatedJason Gunthorpe3-98/+71
There are many bad APIs here that are accepting a cache bucket index instead of a bucket pointer. Many of the callers already have a bucket pointer, so this results in a lot of confusing uses of order2idx(). Pass the struct mlx5_cache_ent into add_keys(), remove_keys(), and alloc_cached_mr(). Once the MR is in the cache, store the cache bucket pointer directly in the MR, replacing the 'bool allocated_from cache'. In the end there is only one place that needs to form index from order, alloc_mr_from_cache(). Increase the safety of this function by disallowing it from accessing cache entries in the ODP special area. Link: https://lore.kernel.org/r/20200310082238.239865-7-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Rename the tracking variables for the MR cacheJason Gunthorpe2-31/+42
The old names do not clearly indicate the intent. Link: https://lore.kernel.org/r/20200310082238.239865-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Replace spinlock protected write with atomic varSaeed Mahameed3-10/+3
mkey variant calculation was spinlock protected to make it atomic, replace that with one atomic variable. Link: https://lore.kernel.org/r/20200310082238.239865-4-leon@kernel.org Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13{IB,net}/mlx5: Move asynchronous mkey creation to mlx5_ibMichael Guralnik3-28/+6
As mlx5_ib is the only user of the mlx5_core_create_mkey_cb, move the logic inside mlx5_ib and cleanup the code in mlx5_core. Signed-off-by: Michael Guralnik <michaelgur@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-03-13{IB,net}/mlx5: Assign mkey variant in mlx5_ib onlySaeed Mahameed6-22/+55
mkey variant is not required for mlx5_core use, move the mkey variant counter to mlx5_ib. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-03-13{IB,net}/mlx5: Setup mkey variant before mr create command invocationSaeed Mahameed2-6/+4
On reg_mr_callback() mlx5_ib is recalculating the mkey variant which is wrong and will lead to using a different key variant than the one submitted to firmware on create mkey command invocation. To fix this, we store the mkey variant before invoking the firmware command and use it later on completion (reg_mr_callback). Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-03-13RDMA/cm: Delete not implemented CM peer to peer communicationLeon Romanovsky2-8/+0
Peer to peer support was never implemented, so delete it to make code less clutter. Link: https://lore.kernel.org/r/20200310091438.248429-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Mark Zhang <markz@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Use offsetofend() instead of duplicated variantLeon Romanovsky2-31/+27
Convert mlx5 driver to use offsetofend() instead of its duplicated variant. Link: https://lore.kernel.org/r/20200310091438.248429-5-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx4: Delete duplicated offsetofend implementationLeon Romanovsky1-6/+3
Convert mlx4 to use in-kernel offsetofend() instead of its duplicated implementation. Link: https://lore.kernel.org/r/20200310091438.248429-3-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-10IB/mlx5: Replace tunnel mpls capability bits for tunnel_offloadsAlex Vesker2-5/+7
Until now the flex parser capability was used in ib_query_device() to indicate tunnel_offloads_caps support for mpls_over_gre/mpls_over_udp. Newer devices and firmware will have configurations with the flexparser but without mpls support. Testing for the flex parser capability was a mistake, the tunnel_stateless capability was intended for detecting mpls and was introduced at the same time as the flex parser capability. Otherwise userspace will be incorrectly informed that a future device supports MPLS when it does not. Link: https://lore.kernel.org/r/20200305123841.196086-1-leon@kernel.org Cc: <stable@vger.kernel.org> # 4.17 Fixes: e818e255a58d ("IB/mlx5: Expose MPLS related tunneling offloads") Signed-off-by: Alex Vesker <valex@mellanox.com> Reviewed-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-10RDMA/mlx5: Remove duplicate definitions of SW_ICM macrosErez Shitrit1-4/+0
Those macros are already defined in include/linux/mlx5/driver.h, so delete their duplicate variants. Link: https://lore.kernel.org/r/20200310075706.238592-1-leon@kernel.org Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Reviewed-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-10RDMA/core: Remove the duplicate header fileZhu Yanjun1-2/+0
The header file rdma_core.h is duplicate, so let's remove it. Fixes: 622db5b6439a ("RDMA/core: Add trace points to follow MR allocation") Link: https://lore.kernel.org/r/20200310091656.249696-1-leon@kernel.org Signed-off-by: Zhu Yanjun <yanjunz@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-10RDMA/bnxt_re: Remove a redundant 'memset'Christophe JAILLET1-4/+1
'wqe' is already zeroed at the top of the 'while' loop, just a few lines below, and is not used outside of the loop. So there is no need to zero it again, or for the variable to be declared outside the loop. Link: https://lore.kernel.org/r/20200308065442.5415-1-christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-10RDMA/cma: Teach lockdep about the order of rtnl and lockJason Gunthorpe1-0/+13
This lock ordering only happens when bonding is enabled and a certain bonding related event fires. However, since it can happen this is a global restriction on lock ordering. Teach lockdep about the order directly and unconditionally so bugs here are found quickly. See https://syzkaller.appspot.com/bug?extid=55de90ab5f44172b0c90 Link: https://lore.kernel.org/r/20200227203651.GA27185@ziepe.ca Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-10RDMA/rw: map P2P memory correctly for signature operationsMax Gurtovoy1-6/+6
Since RDMA rw API support operations with P2P memory sg list, make sure to map/unmap the scatter list for signature operation correctly. Link: https://lore.kernel.org/r/20200220100819.41860-2-maxg@mellanox.com Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-10IB/mlx5: Introduce UAPIs to manage packet pacingYishai Hadas6-0/+165
Introduce packet pacing uobject and its alloc and destroy methods. This uobject holds mlx5 packet pacing context according to the device specification and enables managing packet pacing device entries that are needed by DEVX applications. Link: https://lore.kernel.org/r/20200219190518.200912-3-leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-08Linux 5.6-rc5Linus Torvalds1-1/+1
2020-03-07net/mlx5: HW bit for goto chain offload supportEli Cohen1-1/+2
Add the HW bit definition indecating goto chain offload support. Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-07net/mlx5: Expose link speed directlyMark Bloch1-1/+2
Expose port rate as part of the port speed register fields. Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-07net/mlx5: Introduce TLS and IPSec objects enumsSaeed Mahameed2-2/+3
Expose the TLS encryption key general object type enum correctly, and add the IPSec encryption key general object type enum. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-07net/mlx5: Introduce egress acl forward-to-vport capabilityVu Pham1-1/+1
Add HCA_CAP.egress_acl_forward_to_vport field to check whether HW supports e-switch vport's egress acl to forward packets to other e-switch vport or not. By default E-Switch egress ACL forwards eswitch vports egress packets to their corresponding NIC/VF vports. With this cap enabled, the driver is allowed to alter this behavior and forward packets to arbitrary NIC/VF vports with the following limitations: a. Multiple processing paths are supported if all of the following conditions are met: - HCA_CAP.egress_acl_forward_to_vport is set ==1. - A destination of type Flow Table only appears once, as the last destination in the list. - Vport destination is supported if HCA_CAP.egress_acl_forward_to_vport==1. Vport must not be the Uplink. b. Flow_tag not supported. c. This table is only applicable after an FDB table is created. d. Push VLAN action is not supported. e. Pop VLAN action cannot be added concurrently to this table and FDB table. This feature will be used during port failover in bonding scenario where two VFs representors are bonded to handle failover egress traffic (VM's ingress/receive traffic). Signed-off-by: Vu Pham <vuhuong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-07io_uring: fix lockup with timeoutsPavel Begunkov1-0/+1
There is a recipe to deadlock the kernel: submit a timeout sqe with a linked_timeout (e.g. test_single_link_timeout_ception() from liburing), and SIGKILL the process. Then, io_kill_timeouts() takes @ctx->completion_lock, but the timeout isn't flagged with REQ_F_COMP_LOCKED, and will try to double grab it during io_put_free() to cancel the linked timeout. Probably, the same can happen with another io_kill_timeout() call site, that is io_commit_cqring(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-06parse-maintainers: Mark as executableJonathan Neuschäfer1-0/+0
This makes the script more convenient to run. Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-06vgacon: Fix a UAF in vgacon_invert_regionZhang Xiaoxu1-0/+3
When syzkaller tests, there is a UAF: BUG: KASan: use after free in vgacon_invert_region+0x9d/0x110 at addr ffff880000100000 Read of size 2 by task syz-executor.1/16489 page:ffffea0000004000 count:0 mapcount:-127 mapping: (null) index:0x0 page flags: 0xfffff00000000() page dumped because: kasan: bad access detected CPU: 1 PID: 16489 Comm: syz-executor.1 Not tainted Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 Call Trace: [<ffffffffb119f309>] dump_stack+0x1e/0x20 [<ffffffffb04af957>] kasan_report+0x577/0x950 [<ffffffffb04ae652>] __asan_load2+0x62/0x80 [<ffffffffb090f26d>] vgacon_invert_region+0x9d/0x110 [<ffffffffb0a39d95>] invert_screen+0xe5/0x470 [<ffffffffb0a21dcb>] set_selection+0x44b/0x12f0 [<ffffffffb0a3bfae>] tioclinux+0xee/0x490 [<ffffffffb0a1d114>] vt_ioctl+0xff4/0x2670 [<ffffffffb0a0089a>] tty_ioctl+0x46a/0x1a10 [<ffffffffb052db3d>] do_vfs_ioctl+0x5bd/0xc40 [<ffffffffb052e2f2>] SyS_ioctl+0x132/0x170 [<ffffffffb11c9b1b>] system_call_fastpath+0x22/0x27 Memory state around the buggy address: ffff8800000fff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8800000fff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff880000100000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff It can be reproduce in the linux mainline by the program: #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/ioctl.h> #include <linux/vt.h> struct tiocl_selection { unsigned short xs; /* X start */ unsigned short ys; /* Y start */ unsigned short xe; /* X end */ unsigned short ye; /* Y end */ unsigned short sel_mode; /* selection mode */ }; #define TIOCL_SETSEL 2 struct tiocl { unsigned char type; unsigned char pad; struct tiocl_selection sel; }; int main() { int fd = 0; const char *dev = "/dev/char/4:1"; struct vt_consize v = {0}; struct tiocl tioc = {0}; fd = open(dev, O_RDWR, 0); v.v_rows = 3346; ioctl(fd, VT_RESIZEX, &v); tioc.type = TIOCL_SETSEL; ioctl(fd, TIOCLINUX, &tioc); return 0; } When resize the screen, update the 'vc->vc_size_row' to the new_row_size, but when 'set_origin' in 'vgacon_set_origin', vgacon use 'vga_vram_base' for 'vc_origin' and 'vc_visible_origin', not 'vc_screenbuf'. It maybe smaller than 'vc_screenbuf'. When TIOCLINUX, use the new_row_size to calc the offset, it maybe larger than the vga_vram_size in vgacon driver, then bad access. Also, if set an larger screenbuf firstly, then set an more larger screenbuf, when copy old_origin to new_origin, a bad access may happen. So, If the screen size larger than vga_vram, resize screen should be failed. This alse fix CVE-2020-8649 and CVE-2020-8647. Linus pointed out that overflow checking seems absent. We're saved by the existing bounds checks in vc_do_resize() with rather strict limits: if (cols > VC_RESIZE_MAXCOL || lines > VC_RESIZE_MAXROW) return -EINVAL; Fixes: 0aec4867dca14 ("[PATCH] SVGATextMode fix") Reference: CVE-2020-8647 and CVE-2020-8649 Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> [danvet: augment commit message to point out overflow safety] Cc: stable@vger.kernel.org Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20200304022429.37738-1-zhangxiaoxu5@huawei.com
2020-03-06dt-bindings: arm: Fixup the DT bindings for hierarchical PSCI statesUlf Hansson1-15/+13
The hierarchical topology with power-domain should be described through child nodes, rather than as currently described in the PSCI root node. Fix this by adding a patternProperties with a corresponding reference to the power-domain DT binding. Additionally, update the example to conform to the new pattern, but also to the adjusted domain-idle-state DT binding. Fixes: a3f048b5424e ("dt: psci: Update DT bindings to support hierarchical PSCI states") Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> [robh: Add missing allOf, tweak power-domain node name] Signed-off-by: Rob Herring <robh@kernel.org>
2020-03-06dt-bindings: power: Extend nodename pattern for power-domain providersUlf Hansson1-1/+1
The existing binding requires the nodename to have a '@', which is a bit limiting for the wider use case. Therefore, let's extend the pattern to allow either '@' or '-'. Fixes: a3f048b5424e ("dt: psci: Update DT bindings to support hierarchical PSCI states") Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> [robh: drop example change] Signed-off-by: Rob Herring <robh@kernel.org>
2020-03-06io_uring: free fixed_file_data after RCU grace periodJens Axboe1-2/+22
The percpu refcount protects this structure, and we can have an atomic switch in progress when exiting. This makes it unsafe to just free the struct normally, and can trigger the following KASAN warning: BUG: KASAN: use-after-free in percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 Read of size 1 at addr ffff888181a19a30 by task swapper/0/0 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc4+ #5747 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: <IRQ> dump_stack+0x76/0xa0 print_address_description.constprop.0+0x3b/0x60 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 __kasan_report.cold+0x1a/0x3d ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 rcu_core+0x370/0x830 ? percpu_ref_exit+0x50/0x50 ? rcu_note_context_switch+0x7b0/0x7b0 ? run_rebalance_domains+0x11d/0x140 __do_softirq+0x10a/0x3e9 irq_exit+0xd5/0xe0 smp_apic_timer_interrupt+0x86/0x200 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:default_idle+0x26/0x1f0 Fix this by punting the final exit and free of the struct to RCU, then we know that it's safe to do so. Jann suggested the approach of using a double rcu callback to achieve this. It's important that we do a nested call_rcu() callback, as otherwise the free could be ordered before the atomic switch, even if the latter was already queued. Reported-by: syzbot+e017e49c39ab484ac87a@syzkaller.appspotmail.com Suggested-by: Jann Horn <jannh@google.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-06locks: fix a potential use-after-free problem when wakeup a waiteryangerkun1-14/+0
'16306a61d3b7 ("fs/locks: always delete_block after waiting.")' add the logic to check waiter->fl_blocker without blocked_lock_lock. And it will trigger a UAF when we try to wakeup some waiter: Thread 1 has create a write flock a on file, and now thread 2 try to unlock and delete flock a, thread 3 try to add flock b on the same file. Thread2 Thread3 flock syscall(create flock b) ...flock_lock_inode_wait flock_lock_inode(will insert our fl_blocked_member list to flock a's fl_blocked_requests) sleep flock syscall(unlock) ...flock_lock_inode_wait locks_delete_lock_ctx ...__locks_wake_up_blocks __locks_delete_blocks( b->fl_blocker = NULL) ... break by a signal locks_delete_block b->fl_blocker == NULL && list_empty(&b->fl_blocked_requests) success, return directly locks_free_lock b wake_up(&b->fl_waiter) trigger UAF Fix it by remove this logic, and this patch may also fix CVE-2019-19769. Cc: stable@vger.kernel.org Fixes: 16306a61d3b7 ("fs/locks: always delete_block after waiting.") Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2020-03-06block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()Carlo Nonato1-4/+5
The bfq_find_set_group() function takes as input a blkcg (which represents a cgroup) and retrieves the corresponding bfq_group, then it updates the bfq internal group hierarchy (see comments inside the function for why this is needed) and finally it returns the bfq_group. In the hierarchy update cycle, the pointer holding the correct bfq_group that has to be returned is mistakenly used to traverse the hierarchy bottom to top, meaning that in each iteration it gets overwritten with the parent of the current group. Since the update cycle stops at root's children (depth = 2), the overwrite becomes a problem only if the blkcg describes a cgroup at a hierarchy level deeper than that (depth > 2). In this case the root's child that happens to be also an ancestor of the correct bfq_group is returned. The main consequence is that processes contained in a cgroup at depth greater than 2 are wrongly placed in the group described above by BFQ. This commits fixes this problem by using a different bfq_group pointer in the update cycle in order to avoid the overwrite of the variable holding the original group reference. Reported-by: Kwon Je Oh <kwonje.oh2@gmail.com> Signed-off-by: Carlo Nonato <carlo.nonato95@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-06tty: serial: fsl_lpuart: free IDs allocated by IDAMichael Walle1-15/+24
Since commit 3bc3206e1c0f ("serial: fsl_lpuart: Remove the alias node dependence") the port line number can also be allocated by IDA, but in case of an error the ID will no be removed again. More importantly, any ID will be freed in remove(), even if it wasn't allocated but instead fetched by of_alias_get_id(). If it was not allocated by IDA there will be a warning: WARN(1, "ida_free called for id=%d which is not allocated.\n", id); Move the ID allocation more to the end of the probe() so that we still can use plain return in the first error cases. Fixes: 3bc3206e1c0f ("serial: fsl_lpuart: Remove the alias node dependence") Signed-off-by: Michael Walle <michael@walle.cc> Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20200303174306.6015-3-michael@walle.cc Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-06Revert "tty: serial: fsl_lpuart: drop EARLYCON_DECLARE"Michael Walle1-0/+2
This reverts commit a659652f6169240a5818cb244b280c5a362ef5a4. This broke the earlycon on LS1021A processors because the order of the earlycon_setup() functions were changed. Before the commit the normal lpuart32_early_console_setup() was called. After the commit the lpuart32_imx_early_console_setup() is called instead. Fixes: a659652f6169 ("tty: serial: fsl_lpuart: drop EARLYCON_DECLARE") Signed-off-by: Michael Walle <michael@walle.cc> Link: https://lore.kernel.org/r/20200303174306.6015-2-michael@walle.cc Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-06serdev: Fix detection of UART devices on Apple machines.Ronald Tschalär1-0/+10
On Apple devices the _CRS method returns an empty resource template, and the resource settings are instead provided by the _DSM method. But commit 33364d63c75d6182fa369cea80315cf1bb0ee38e (serdev: Add ACPI devices by ResourceSource field) changed the search for serdev devices to require valid, non-empty resource template, thereby breaking Apple devices and causing bluetooth devices to not be found. This expands the check so that if we don't find a valid template, and we're on an Apple machine, then just check for the device being an immediate child of the controller and having a "baud" property. Cc: <stable@vger.kernel.org> # 5.5 Fixes: 33364d63c75d ("serdev: Add ACPI devices by ResourceSource field") Signed-off-by: Ronald Tschalär <ronald@innovation.ch> Link: https://lore.kernel.org/r/20200211194723.486217-1-ronald@innovation.ch Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-06arch/Kconfig: update HAVE_RELIABLE_STACKTRACE descriptionMiroslav Benes1-2/+3
save_stack_trace_tsk_reliable() is not the only function providing the reliable stack traces anymore. Architecture might define ARCH_STACKWALK which provides a newer stack walking interface and has arch_stack_walk_reliable() function. Update the description accordingly. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: http://lkml.kernel.org/r/20200120154042.9934-1-mbenes@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-06mm, hotplug: fix page online with DEBUG_PAGEALLOC compiled but not enabledVlastimil Babka2-1/+11
Commit cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC") fixed memory hotplug with debug_pagealloc enabled, where onlining a page goes through page freeing, which removes the direct mapping. Some arches don't like when the page is not mapped in the first place, so generic_online_page() maps it first. This is somewhat wasteful, but better than special casing page freeing fast paths. The commit however missed that DEBUG_PAGEALLOC configured doesn't mean it's actually enabled. One has to test debug_pagealloc_enabled() since 031bc5743f15 ("mm/debug-pagealloc: make debug-pagealloc boottime configurable"), or alternatively debug_pagealloc_enabled_static() since 8e57f8acbbd1 ("mm, debug_pagealloc: don't rely on static keys too early"), but this is not done. As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled will crash: Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 0000000000000000 TEID: 0000000000000483 Fault in home space mode while using kernel ASCE. AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d Oops: 0004 ilc:2 [#1] SMP CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased) Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100 0000000000000001 0000000000000000 0000000000000002 0000000000000100 0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000 000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20 Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1 0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3 >0000001ecd281b9e: 94fb5006 ni 6(%r5),251 0000001ecd281ba2: 41505008 la %r5,8(%r5) 0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e 0000001ecd281bac: 1a07 ar %r0,%r7 0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e Call Trace: [<0000001ecd281b9e>] __kernel_map_pages+0x166/0x188 [<0000001ecd4d9516>] online_pages_range+0xf6/0x128 [<0000001ecd2a8186>] walk_system_ram_range+0x7e/0xd8 [<0000001ecda28aae>] online_pages+0x2fe/0x3f0 [<0000001ecd7d02a6>] memory_subsys_online+0x8e/0xc0 [<0000001ecd7add42>] device_online+0x5a/0xc8 [<0000001ecd7d0430>] state_store+0x88/0x118 [<0000001ecd5b9f62>] kernfs_fop_write+0xc2/0x200 [<0000001ecd5064b6>] vfs_write+0x176/0x1e0 [<0000001ecd50676a>] ksys_write+0xa2/0x100 [<0000001ecda315d4>] system_call+0xd8/0x2c8 Fix this by checking debug_pagealloc_enabled_static() before calling kernel_map_pages(). Backports for kernel before 5.5 should use debug_pagealloc_enabled() instead. Also add comments. Fixes: cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC") Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-06mm/z3fold.c: do not include rwlock.h directlySebastian Andrzej Siewior1-1/+0
rwlock.h should not be included directly. Instead linux/splinlock.h should be included. One thing it does is to break the RT build. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200224133631.1510569-1-bigeasy@linutronix.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>