|
Start to use the lockless version of netdev_notify_peers.
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are some use cases for netdev_notify_peers in contexts where the
rtnl lock is already held. Introduce a lockless version
of the netdev_notify_peers call to save the extra code needed to call
call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
After that, convert netdev_notify_peers to call the new helper.
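A minimal sketch of the resulting helper pair (the name __netdev_notify_peers
is an assumption here, not taken from this log):
  void __netdev_notify_peers(struct net_device *dev)
  {
          ASSERT_RTNL();          /* lockless variant: caller holds rtnl */
          call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
          call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
  }

  void netdev_notify_peers(struct net_device *dev)
  {
          rtnl_lock();
          __netdev_notify_peers(dev);
          rtnl_unlock();
  }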
Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
'irq_of_parse_and_map()' should be balanced by a corresponding
'irq_dispose_mapping()' call. Otherwise, some resources are leaked.
Add such a call in the error handling path of the probe function and in the
remove function.
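The error-handling pattern is roughly the following (labels and the
surrounding steps are illustrative, not the driver's exact code):
  ndev->irq = irq_of_parse_and_map(np, 0);
  /* ... later setup steps that can fail ... */
  ret = register_netdev(ndev);
  if (ret)
          goto out_dispose_mapping;
  return 0;

  out_dispose_mapping:
          irq_dispose_mapping(ndev->irq);
          return ret;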
Fixes: 492205050d77 ("net: Add EMAC ethernet driver found on Allwinner A10 SoC's")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20201214202117.146293-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Syzbot reported a shift of a u32 by more than 31 in strset_parse_request(),
which is undefined behavior. This is caused by the range check of the string
set id using the variable ret (which is always 0 at this point) instead of
id (the string set id from the request).
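An illustrative reduction of the bug class (not the exact ethtool code):
  ret = 0;
  id = nla_get_u32(attr);       /* string set id from the request */
  if (ret >= 32)                /* buggy: tests ret, which is always 0 here */
          return -EINVAL;
  mask |= 1U << id;             /* undefined behavior if id > 31 */
Range-checking id itself restores the intended bound on the shift count.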
Fixes: 71921690f974 ("ethtool: provide string sets with STRSET_GET request")
Reported-by: syzbot+96523fb438937cd01220@syzkaller.appspotmail.com
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/b54ed5c5fd972a59afea3e1badfb36d86df68799.1607952208.git.mkubecek@suse.cz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In case of error after calling 'ocelot_init()', it must be undone by a
corresponding 'ocelot_deinit()' call, as already done in the remove
function.
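The usual unwinding pattern for this looks roughly like (labels and the
later setup step are illustrative):
  err = ocelot_init(ocelot);
  if (err)
          return err;
  /* ... any later step that can fail ... */
  if (err)
          goto out_ocelot_deinit;
  return 0;

  out_ocelot_deinit:
          ocelot_deinit(ocelot);
          return err;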
Fixes: a556c76adc05 ("net: mscc: Add initial Ocelot switch support")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20201213114838.126922-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The connector driver never modifies any cb_id passed to it, so add a const
qualifier to those arguments so callers can declare their struct cb_id as a
constant object.
Fixes build warnings like these when passing a constant struct cb_id:
warning: passing argument 1 of ‘cn_add_callback’ discards ‘const’ qualifier from pointer target
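The prototype change this implies is roughly:
  /* before */
  int cn_add_callback(struct cb_id *id, const char *name,
                      void (*callback)(struct cn_msg *, struct netlink_skb_parms *));

  /* after */
  int cn_add_callback(const struct cb_id *id, const char *name,
                      void (*callback)(struct cn_msg *, struct netlink_skb_parms *));
With that, callers can declare their struct cb_id as static const.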
Signed-off-by: Geoff Levand <geoff@infradead.org>
Link: https://lore.kernel.org/r/a9e49c9e-67fa-16e7-0a6b-72f6bd30c58a@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If the 'register_netdev()' call fails, we must undo a previous
'bcmgenet_mii_init()' call.
Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20201212182005.120437-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch fixes an error condition triggered in the code path which
transmits an S/G frame descriptor when the skb's headroom is not enough
for DPAA2's needs.
We are greeted with a splat like the one below when an SGT structure is
recycled, and that is because even though a dma_unmap is performed on the
Tx confirmation path, the unmap is not done with the proper size.
[ 714.464927] WARNING: CPU: 13 PID: 0 at drivers/iommu/io-pgtable-arm.c:281 __arm_lpae_map+0x2d4/0x30c
(...)
[ 714.465343] Call trace:
[ 714.465348] __arm_lpae_map+0x2d4/0x30c
[ 714.465353] __arm_lpae_map+0x114/0x30c
[ 714.465357] __arm_lpae_map+0x114/0x30c
[ 714.465362] __arm_lpae_map+0x114/0x30c
[ 714.465366] arm_lpae_map+0xf4/0x180
[ 714.465373] arm_smmu_map+0x4c/0xc0
[ 714.465379] __iommu_map+0x100/0x2bc
[ 714.465385] iommu_map_atomic+0x20/0x30
[ 714.465391] __iommu_dma_map+0xb0/0x110
[ 714.465397] iommu_dma_map_page+0xb8/0x120
[ 714.465404] dma_map_page_attrs+0x1a8/0x210
[ 714.465413] __dpaa2_eth_tx+0x384/0xbd0 [fsl_dpaa2_eth]
[ 714.465421] dpaa2_eth_tx+0x84/0x134 [fsl_dpaa2_eth]
[ 714.465427] dev_hard_start_xmit+0x10c/0x2b0
[ 714.465433] sch_direct_xmit+0x1a0/0x550
(...)
The dpaa2-eth driver uses an area of software annotations to transmit
necessary information from the Tx path to the Tx confirmation one. This
SWA structure has a different layout for each kind of frame that we are
dealing with: linear, S/G or XDP.
The commit referenced was incorrectly setting up the 'sgt_size' field
for the S/G type of SWA even though we are dealing with a linear skb
here.
Fixes: d70446ee1f40 ("dpaa2-eth: send a scatter-gather FD instead of realloc-ing")
Reported-by: Daniel Thompson <daniel.thompson@linaro.org>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://lore.kernel.org/r/20201211171607.108034-1-ciorneiioana@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With lockdep enabled, we will get the following warning:
ar9331_switch ethernet.1:10 lan0 (uninitialized): PHY [!ahb!ethernet@1a000000!mdio!switch@10:00] driver [Qualcomm Atheros AR9331 built-in PHY] (irq=13)
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 18, name: kworker/0:1
INFO: lockdep is turned off.
irq event stamp: 602
hardirqs last enabled at (601): [<8073fde0>] _raw_spin_unlock_irq+0x3c/0x80
hardirqs last disabled at (602): [<8073a4f4>] __schedule+0x184/0x800
softirqs last enabled at (0): [<80080f60>] copy_process+0x578/0x14c8
softirqs last disabled at (0): [<00000000>] 0x0
CPU: 0 PID: 18 Comm: kworker/0:1 Not tainted 5.10.0-rc3-ar9331-00734-g7d644991df0c #31
Workqueue: events deferred_probe_work_func
Stack : 80980000 80980000 8089ef70 80890000 804b5414 80980000 00000002 80b53728
00000000 800d1268 804b5414 ffffffde 00000017 800afe08 81943860 0f5bfc32
00000000 00000000 8089ef70 819436c0 ffffffea 00000000 00000000 00000000
8194390c 808e353c 0000000f 66657272 80980000 00000000 00000000 80890000
804b5414 80980000 00000002 80b53728 00000000 00000000 00000000 80d40000
...
Call Trace:
[<80069ce0>] show_stack+0x9c/0x140
[<800afe08>] ___might_sleep+0x220/0x244
[<8073bfb0>] __mutex_lock+0x70/0x374
[<8073c2e0>] mutex_lock_nested+0x2c/0x38
[<804b5414>] regmap_update_bits_base+0x38/0x8c
[<804ee584>] regmap_update_bits+0x1c/0x28
[<804ee714>] ar9331_sw_unmask_irq+0x34/0x60
[<800d91f0>] unmask_irq+0x48/0x70
[<800d93d4>] irq_startup+0x114/0x11c
[<800d65b4>] __setup_irq+0x4f4/0x6d0
[<800d68a0>] request_threaded_irq+0x110/0x190
[<804e3ef0>] phy_request_interrupt+0x4c/0xe4
[<804df508>] phylink_bringup_phy+0x2c0/0x37c
[<804df7bc>] phylink_of_phy_connect+0x118/0x130
[<806c1a64>] dsa_slave_create+0x3d0/0x578
[<806bc4ec>] dsa_register_switch+0x934/0xa20
[<804eef98>] ar9331_sw_probe+0x34c/0x364
[<804eb48c>] mdio_probe+0x44/0x70
[<8049e3b4>] really_probe+0x30c/0x4f4
[<8049ea10>] driver_probe_device+0x264/0x26c
[<8049bc10>] bus_for_each_drv+0xb4/0xd8
[<8049e684>] __device_attach+0xe8/0x18c
[<8049ce58>] bus_probe_device+0x48/0xc4
[<8049db70>] deferred_probe_work_func+0xdc/0xf8
[<8009ff64>] process_one_work+0x2e4/0x4a0
[<800a0770>] worker_thread+0x2a8/0x354
[<800a774c>] kthread+0x16c/0x174
[<8006306c>] ret_from_kernel_thread+0x14/0x1c
ar9331_switch ethernet.1:10 lan1 (uninitialized): PHY [!ahb!ethernet@1a000000!mdio!switch@10:02] driver [Qualcomm Atheros AR9331 built-in PHY] (irq=13)
DSA: tree 0 setup
To fix it, it is better to move the MDIO register access to the
.irq_bus_sync_unlock callback.
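The usual shape of such a fix is to only record the change in the irq_chip
mask/unmask callbacks (which may run in atomic context) and to perform the
sleeping regmap/MDIO access from .irq_bus_sync_unlock, roughly as below
(register, field and struct member names are indicative only):
  static void ar9331_sw_irq_bus_lock(struct irq_data *d)
  {
          struct ar9331_sw_priv *priv = irq_data_get_irq_chip_data(d);

          mutex_lock(&priv->lock_irq);
  }

  static void ar9331_sw_irq_bus_sync_unlock(struct irq_data *d)
  {
          struct ar9331_sw_priv *priv = irq_data_get_irq_chip_data(d);

          /* Sleeping is allowed here, so do the real MDIO access now,
           * using the mask value cached by the mask/unmask callbacks. */
          regmap_update_bits(priv->regmap, AR9331_SW_REG_GINT_MASK,
                             AR9331_SW_GINT_PHY_INT, priv->irq_mask);
          mutex_unlock(&priv->lock_irq);
  }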
Fixes: ec6698c272de ("net: dsa: add support for Atheros AR9331 built-in switch")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20201211110317.17061-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
On the Rx side, the next_to_use index points to the next item in the
HW ring to be refilled/allocated, and next_to_clean points to the next
item to potentially be processed.
When the HW Rx ring is fully refilled, i.e. no packets have been
processed, the next_to_use will be next_to_clean - 1. When the ring is
fully processed next_to_clean will be equal to next_to_use. The latter
case is where a bug is triggered.
If the next_to_use bits are not cleared, and the "fully processed"
state is entered, a stale descriptor can be processed.
The skb path correctly clears the status bit for the next_to_use
descriptor, but the AF_XDP zero-copy path did not do that.
This change adds the status bits clearing of the next_to_use
descriptor.
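A sketch of the zero-copy refill change (the descriptor accessors below
follow the i40e style and are meant as an illustration):
  /* after placing new buffers, clear the status word of the next_to_use
   * descriptor so a fully processed ring cannot see a stale descriptor */
  rx_desc = I40E_RX_DESC(rx_ring, ntu);
  rx_desc->wb.qword1.status_error_len = 0;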
Fixes: 3b4f0b66c2b3 ("i40e, xsk: Migrate to new MEM_TYPE_XSK_BUFF_POOL")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
On the Rx side, the next_to_use index points to the next item in the
HW ring to be refilled/allocated, and next_to_clean points to the next
item to potentially be processed.
When the HW Rx ring is fully refilled, i.e. no packets have been
processed, the next_to_use will be next_to_clean - 1. When the ring is
fully processed next_to_clean will be equal to next_to_use. The latter
case is where a bug is triggered.
If the next_to_use bits are not cleared, and the "fully processed"
state is entered, a stale descriptor can be processed.
The skb path correctly clears the status bit for the next_to_use
descriptor, but the AF_XDP zero-copy path did not do that.
This change adds the status bits clearing of the next_to_use
descriptor.
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Even if there is more rx data waiting on the chip, the rx napi poll fn
will never run more than once - it will always read a few buffers, then
bail out and re-arm interrupts, which results in a ping-pong between napi
and interrupts.
This defeats the purpose of napi, and is bad for performance.
Fix by making the rx napi poll behave identically to other ethernet
drivers:
1. initialize rx napi polling with an arbitrary budget (64).
2. in the polling fn, return full weight if rx queue is not depleted,
this tells the napi core to "keep polling".
3. update the rx tail ("ring the doorbell") once for every 8 processed
rx ring buffers.
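A condensed sketch of such a poll function (function and helper names here
are illustrative, not taken from the driver):
  static int lan743x_rx_napi_poll(struct napi_struct *napi, int budget)
  {
          struct lan743x_rx *rx = container_of(napi, struct lan743x_rx, napi);
          int done = 0;

          while (done < budget && lan743x_rx_process_buffer(rx)) {
                  done++;
                  if (done % 8 == 0)
                          lan743x_rx_update_tail(rx);   /* ring the doorbell */
          }
          lan743x_rx_update_tail(rx);

          if (done == budget)
                  return budget;  /* queue not depleted: keep polling */

          napi_complete_done(napi, done);
          lan743x_rx_enable_interrupt(rx);      /* re-arm only when done */
          return done;
  }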
Thanks to Jakub Kicinski, Eric Dumazet and Andrew Lunn for their expert
opinions and suggestions.
Tested with 20 seconds of full bandwidth receive (iperf3):
          rx irqs   softirqs(NET_RX)
          --------------------------
before      23827              33620
after         129               4081
Tested-by: Sven Van Asbroeck <thesven73@gmail.com> # lan7430
Fixes: 23f0703c125be ("lan743x: Add main source files for new lan743x driver")
Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
Link: https://lore.kernel.org/r/20201215161954.5950-1-TheSven73@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Range checks can be folded into the proper conversion function. kstrto*() exist
for all arithmetic types.
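For example (illustrative, not a specific call site):
  /* before: manual parse plus open-coded range check */
  val = simple_strtoul(buf, &end, 0);
  if (*end || val > 255)
          return -EINVAL;

  /* after: kstrtou8() parses and range-checks in one call */
  err = kstrtou8(buf, 0, &u8val);
  if (err)
          return err;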
Link: https://lkml.kernel.org/r/20201122123759.GC92364@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple of
warnings by explicitly adding a break statement instead of just letting
the code fall through to the next, and by adding a fallthrough
pseudo-keyword in places where the code is intended to fall through.
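The two shapes of the fix look like this (illustrative switch):
  switch (cmd) {
  case FOO_RESET:
          foo_reset(dev);
          break;                /* was an unintended implicit fall-through */
  case FOO_START:
          foo_prepare(dev);
          fallthrough;          /* fall-through is intended here */
  case FOO_RESUME:
          foo_run(dev);
          break;
  }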
Link: https://github.com/KSPP/linux/issues/115
Link: https://lkml.kernel.org/r/f5756988b8842a3f10008fbc5b0a654f828920a9.1605896059.git.gustavoars@kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Convert the unbounded uses of sprintf to sysfs_emit.
A few conversions may now not end in a newline if the output buffer is
overflowed.
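The typical conversion is (illustrative attribute):
  static ssize_t objects_show(struct kobject *kobj,
                              struct kobj_attribute *attr, char *buf)
  {
          /* was: return sprintf(buf, "%lu\n", nr_objects); */
          return sysfs_emit(buf, "%lu\n", nr_objects);
  }
sysfs_emit() knows the buffer is PAGE_SIZE and truncates instead of
overflowing it.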
Link: https://lkml.kernel.org/r/0c90a90f466167f8c37de4b737553cf49c4a277f.1605376435.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Update the function to use sysfs_emit_at while neatening the uses of
sprintf and overwriting the last space char with a newline to avoid
possible output buffer overflow.
Miscellanea:
- in shmem_enabled_show, the removal of the indirected use of fmt
allows __printf verification
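The multi-item case uses sysfs_emit_at() to track the offset, roughly as
follows (inside a show() function; names are illustrative):
  int i, len = 0;

  for (i = 0; i < count; i++)
          len += sysfs_emit_at(buf, len, "%s ", names[i]);
  /* overwrite the trailing space with a newline instead of appending,
   * so the output can never exceed PAGE_SIZE */
  if (len)
          buf[len - 1] = '\n';
  return len;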
Link: https://lkml.kernel.org/r/b612a93825e5ea330cb68d2e8b516e9687a06cc6.1605376435.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The cocci script used in commit bdacbb8d04f ("mm: Use sysfs_emit for
struct kobject * uses") does not convert the name##_show macro because the
macro uses concatenation via ##.
Convert it by hand.
Link: https://lkml.kernel.org/r/45ec6cfc177d743f9c0ebaf35e43969dce43af42.1605376435.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Convert the only use of sprintf with struct kobject * that the cocci
script could not convert.
Miscellanea:
- Neaten the uses of a constant string with sysfs_emit to use a const
char * to reduce overall object size
Link: https://lkml.kernel.org/r/7df6be66bbd68e1a0bca9d35aca1341dbf94d2a7.1605376435.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm: Convert sysfs sprintf family to sysfs_emit", v2.
Use the new sysfs_emit family and not the sprintf family.
This patch (of 5):
Use the sysfs_emit function instead of the sprintf family.
Done with cocci script as in commit 3c6bff3cf988 ("RDMA: Convert sysfs
kobject * show functions to use sysfs_emit()")
Link: https://lkml.kernel.org/r/cover.1605376435.git.joe@perches.com
Link: https://lkml.kernel.org/r/9c249215bad6df616ba0410ad980042694970c1b.1605376435.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Kernel-doc markups should use this format:
identifier - description
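That is, a comment block of the form (a generic illustration):
  /**
   * function_name - short description
   * @arg1: description of the first argument
   * @arg2: description of the second argument
   *
   * A longer description can follow after the parameter lines.
   */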
Fix some issues on mm files:
1) The definition for get_user_pages_locked() doesn't follow it. Also,
it expects a short description at the header, followed by a long one,
after the parameters. Fix it.
2) Kernel-doc requires that a kernel-doc markup be immediately below
the function prototype, as otherwise it will rename it. So, move
get_pfnblock_flags_mask() description to the right place.
3) Make invalidate_mapping_pagevec() to also follow the expected
kernel-doc format.
While here, fix a few minor English syntax issues, as suggested
by Matthew:
will used -> will be used
similar with -> similar to
Link: https://lkml.kernel.org/r/80e85dddc92d333bc2159ee8a2294921612e8745.1605521731.git.mchehab+huawei@kernel.org
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Suggested-by: Matthew Wilcox <willy@infradead.org> [English fixes]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
From the beginning, the zram block device always enabled CRYPTO_LZO,
since lzo-rle is hardcoded as the fallback compression algorithm. As a
consequence, on systems where another compression algorithm is chosen
(e.g. CRYPTO_ZSTD), the lzo kernel module becomes unused, while still
having to be built/loaded.
This patch removes the hardcoded lzo-rle dependency and allows the user
to select the default compression algorithm for zram at build time. The
previous behaviour is kept, as the default algorithm is still lzo-rle.
Link: https://lkml.kernel.org/r/20201207121245.50529-1-rsalvaterra@gmail.com
Signed-off-by: Rui Salvaterra <rsalvaterra@gmail.com>
Suggested-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, zram supports the stat via /sys/block/zram/mm_stat to represent
how many incompressible pages are stored at the moment, but it couldn't
show how many times incompressible pages were written since zram was set
up. It's also a good indication to see how effective zram is in the system.
Link: https://lkml.kernel.org/r/20201130201907.1284910-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is demand to write back specific process pages to the backing store
instead of all idle pages in the system, due to storage wear-out concerns
and to the launch latency of apps which are idle most of the time but are
critical for resume latency.
This patch extends the writeback knob to support a specific page
writeback.
Link: https://lkml.kernel.org/r/20201020190506.3758660-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The pointer iov_r is being initialized with a value that is never read and
it is being updated later with a new value. The initialization is
redundant and can be removed.
Link: https://lkml.kernel.org/r/20201102120614.694917-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Rework the list_add code to make it more readable and simple.
Link: https://lkml.kernel.org/r/20201015130107.65195-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Right now, all new ZIP drivers are adapted to crypto_acomp APIs rather
than the legacy crypto_comp APIs. Traditional ZIP drivers like lz4, lzo etc.
have also been wrapped into acomp via the scomp backend. But zswap.c is still
using the old APIs. That means zswap won't be able to work on any new ZIP
drivers in the kernel.
This patch moves to the crypto_acomp APIs to fix the disconnected bridge
between new ZIP drivers and zswap. It is probably the first real user to
use acomp but perhaps not a good example to demonstrate how multiple acomp
requests can be executed in parallel in one acomp instance. frontswap is
doing page load and store page by page synchronously. swap_writepage()
depends on the completion of frontswap_store() to decide if it should call
__swap_writepage() to swap to disk.
However, this patch creates multiple acomp instances, so multiple threads
running on multiple different cpus can actually do (de)compression
in parallel, leveraging the power of multiple ZIP hardware queues. This is
also consistent with frontswap's page management model.
The old zswap code uses atomic context and avoids race conditions
while shared resources like zswap_dstmem are accessed. Here, since acomp
can sleep, a per-cpu mutex is used to replace preemption-disable.
While it is possible to make mm/page_io.c and mm/frontswap.c support async
(de)compression in some way, the entire design requires careful thinking
and performance evaluation. For the first step, the base with fixed
connection between ZIP drivers and zswap should be built.
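A condensed sketch of the acomp calling pattern this implies (buffer setup
and error handling trimmed; the pool/context/mutex names are assumptions,
not taken from zswap):
  struct scatterlist input, output;
  DECLARE_CRYPTO_WAIT(wait);
  struct crypto_acomp *tfm = crypto_alloc_acomp(pool->tfm_name, 0, 0);
  struct acomp_req *req = acomp_request_alloc(tfm);

  acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                             crypto_req_done, &wait);

  mutex_lock(acomp_ctx->mutex);   /* acomp may sleep: mutex, not preempt-off */
  sg_init_one(&input, src, PAGE_SIZE);
  sg_init_one(&output, dst, dlen);
  acomp_request_set_params(req, &input, &output, PAGE_SIZE, dlen);
  ret = crypto_wait_req(crypto_acomp_compress(req), &wait);
  dlen = req->dlen;               /* compressed length on success */
  mutex_unlock(acomp_ctx->mutex);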
Link: https://lkml.kernel.org/r/20201107065332.26992-1-song.bao.hua@hisilicon.com
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Acked-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mahipal Challa <mahipalreddy2006@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Zhou Wang <wangzhou1@hisilicon.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix smatch warning:
mm/zswap.c:425 zswap_cpu_comp_prepare() warn: passing zero to 'PTR_ERR'
crypto_alloc_comp() never returns NULL, so use IS_ERR instead of
IS_ERR_OR_NULL to fix this.
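The shape of the fix is roughly:
  struct crypto_comp *tfm = crypto_alloc_comp(pool->tfm_name, 0, 0);

  if (IS_ERR(tfm)) {      /* was IS_ERR_OR_NULL(tfm); NULL is never returned */
          pr_err("could not alloc crypto comp %s : %ld\n",
                 pool->tfm_name, PTR_ERR(tfm));
          return -ENOMEM;
  }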
Link: https://lkml.kernel.org/r/20201031055615.28080-1-yuehaibing@huawei.com
Fixes: f1c54846ee45 ("zswap: dynamic pool creation")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
These should be const, so make it so.
Link: https://lkml.kernel.org/r/1791535ee0b00f4a5c68cc4a8adada06593ad8f1.1601770305.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now the userfaultfd test program requires either root or ptrace privilege due
to the signal/event tests. When UFFDIO_API fails, hint the test runner
about this fact verbosely.
Link: https://lkml.kernel.org/r/20201208024709.7701-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
userfaultfd_open() returns 1 for errors rather than negatives. Fix it in
all the callers so that when UFFDIO_API fails the test will bail out.
Link: https://lkml.kernel.org/r/20201208024709.7701-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "userfaultfd: selftests: Small fixes".
Some very trivial fixes that I kept locally to userfaultfd selftest
program.
This patch (of 3):
BOUNCE_POLL is a special bit: if it is cleared, it means "READ" instead.
Dump that too, otherwise we'll see tests with empty modes.
Link: https://lkml.kernel.org/r/20201208024709.7701-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20201208024709.7701-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
On certain platforms (powerpcle is the one on which I ran into this),
"%Ld" and "%Lu" are unsuitable for printing __s64 and __u64, respectively,
resulting in build warnings. Cast to {u,}int64_t, and use the PRI{d,u}64
macros defined in inttypes.h to print them. This ought to be portable to
all platforms.
Splitting this off into a separate macro lets us remove some lines, and
get rid of some (I would argue) stylistically odd cases where we joined
printf() and exit() into a single statement with a comma operator.
Finally, this also fixes a "missing braces around initializer" warning
when we initialize prms in wp_range().
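For instance (a generic illustration, not the selftest's exact macro):
  #include <inttypes.h>
  #include <stdio.h>

  static void report(uint64_t faults, int64_t delta)
  {
          /* "%Lu"/"%Ld" are not portable for __u64/__s64; the PRI*64
           * macros expand to the right conversion on every platform */
          printf("faults: %" PRIu64 ", delta: %" PRId64 "\n", faults, delta);
  }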
[axelrasmussen@google.com: v2]
Link: https://lkml.kernel.org/r/20201203180244.1811601-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20201202211542.1121189-1-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Alan Gilbert <dgilbert@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With this change, when the knob is set to 0, it allows unprivileged users
to call userfaultfd, like when it is set to 1, but with the restriction
that only page faults from user mode can be handled. In this mode, an
unprivileged user (without the CAP_SYS_PTRACE capability) must pass
UFFD_USER_MODE_ONLY to userfaultfd or the API will fail with EPERM.
This enables administrators to reduce the likelihood that an attacker with
access to userfaultfd can delay faulting kernel code to widen timing
windows for other exploits.
The default value of this knob is changed to 0. This is required for
correct functioning of pipe mutex. However, this will fail postcopy live
migration, which will be unnoticeable to the VM guests. To avoid this,
set 'vm.unprivileged_userfaultfd = 1' in /etc/sysctl.conf.
The main reason this change is desirable in the short term is that the
Android userland will behave as with the sysctl set to zero. So without
this commit, any Linux binary using userfaultfd to manage its memory would
behave differently if run within the Android userland. For more details,
refer to Andrea's reply [1].
[1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/
Link: https://lkml.kernel.org/r/20201120030411.2690816-3-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: <calin@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "Control over userfaultfd kernel-fault handling", v6.
This patch series is split from [1]. The other series enables SELinux
support for userfaultfd file descriptors so that its creation and movement
can be controlled.
It has been demonstrated on various occasions that suspending kernel code
execution for an arbitrary amount of time at any access to userspace
memory (copy_from_user()/copy_to_user()/...) can be exploited to change
the intended behavior of the kernel. For instance, handling page faults
in kernel-mode using userfaultfd has been exploited in [2, 3]. Likewise,
FUSE, which is similar to userfaultfd in this respect, has been exploited
in [4, 5] for similar outcome.
This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object. It then adds a 'user-mode only' option to the
unprivileged_userfaultfd sysctl knob to require unprivileged callers to
use this new flag.
The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to enhance
security vulnerabilities by lengthening the race window in kernel code.
[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
This patch (of 2):
userfaultfd handles page faults from both user and kernel code. Add a new
UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
userfaultfd object refuse to handle faults from kernel mode, treating
these faults as if SIGBUS were always raised, causing the kernel code to
fail with EFAULT.
A future patch adds a knob allowing administrators to give some processes
the ability to create userfaultfd file objects only if they pass
UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
exploit userfaultfd's ability to delay kernel page faults to open timing
windows for future exploits.
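From userspace, the opt-in looks roughly like:
  #include <fcntl.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/userfaultfd.h>

  static int open_uffd_user_mode_only(void)
  {
          /* refuse kernel-mode faults up front; they will then behave
           * as if SIGBUS were raised and the kernel access gets EFAULT */
          return syscall(__NR_userfaultfd,
                         O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
  }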
Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <calin@google.com>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
CONFIG_PAGE_POISONING_ZERO uses the zero pattern instead of 0xAA. It was
introduced by commit 1414c7f4f7d7 ("mm/page_poisoning.c: allow for zero
poisoning"), noting that using zeroes retains the benefit of sanitizing
content of freed pages, with the benefit of not having to zero them again
on alloc, and the downside of making some forms of corruption (stray
writes of NULLs) harder to detect than with the 0xAA pattern. Together
with CONFIG_PAGE_POISONING_NO_SANITY it made it possible to sanitize the
contents on free without checking them back on alloc.
These days we have the init_on_free() option to achieve sanitization with
zeroes and to save clearing on alloc (and without checking on alloc).
Arguably if someone does choose to check the poison for corruption on
alloc, the savings of not clearing the page are secondary, and it makes
sense to always use the 0xAA poison pattern. Thus, remove the
CONFIG_PAGE_POISONING_ZERO option for being redundant.
Link: https://lkml.kernel.org/r/20201113104033.22907-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Laura Abbott <labbott@kernel.org>
Cc: Mateusz Nosek <mateusznosek0@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
CONFIG_PAGE_POISONING_NO_SANITY skips the check on page alloc whether the
poison pattern was corrupted, suggesting a use-after-free. The motivation
to introduce it in commit 8823b1dbc05f ("mm/page_poison.c: enable
PAGE_POISONING as a separate option") was to simply sanitize freed pages,
optimally together with CONFIG_PAGE_POISONING_ZERO.
These days we have an init_on_free=1 boot option, which makes this use
case of page poisoning redundant. For sanitizing, writing zeroes is
sufficient; there is pretty much no benefit from writing the 0xAA poison
pattern to freed pages without checking it back on alloc. Thus, remove
this option and suggest init_on_free instead in the main config's help.
Link: https://lkml.kernel.org/r/20201113104033.22907-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Laura Abbott <labbott@kernel.org>
Cc: Mateusz Nosek <mateusznosek0@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Page poisoning used to be incompatible with hibernation, as the state of
poisoned pages was lost after resume, thus enabling CONFIG_HIBERNATION
forces CONFIG_PAGE_POISONING_NO_SANITY. For the same reason, the
poisoning with zeroes variant CONFIG_PAGE_POISONING_ZERO used to disable
hibernation. The latter restriction was removed by commit 1ad1410f632d
("PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO") and
similarly for init_on_free by commit 18451f9f9e58 ("PM: hibernate: fix
crashes with init_on_free=1") by making sure free pages are cleared after
resume.
We can use the same mechanism to instead poison free pages with
PAGE_POISON after resume. This covers both zero and 0xAA patterns. Thus
we can remove the Kconfig restriction that disables page poison sanity
checking when hibernation is enabled.
Link: https://lkml.kernel.org/r/20201113104033.22907-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [hibernation]
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Laura Abbott <labbott@kernel.org>
Cc: Mateusz Nosek <mateusznosek0@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 11c9c7edae06 ("mm/page_poison.c: replace bool variable with static
key") changed page_poisoning_enabled() to a static key check. However,
the function is not inlined, so each check still involves a function call
with overhead not eliminated when page poisoning is disabled.
Analogously to how debug_pagealloc is handled, this patch converts
page_poisoning_enabled() back to boolean check, and introduces
page_poisoning_enabled_static() for fast paths. Both functions are
inlined.
The function kernel_poison_pages() is also called unconditionally and does
the static key check inside. Remove it from there and put it to callers.
Also split it to two functions kernel_poison_pages() and
kernel_unpoison_pages() instead of the confusing bool parameter.
Also optimize the check that enables page poisoning instead of
debug_pagealloc for architectures without proper debug_pagealloc support.
Move the check to init_mem_debugging_and_hardening() to enable a single
static key instead of having two static branches in
page_poisoning_enabled_static().
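The resulting split looks roughly like this (simplified; exact symbol names
may differ):
  DECLARE_STATIC_KEY_FALSE(_page_poisoning_enabled);
  extern bool _page_poisoning_enabled_early;

  /* boolean check: accurate even before static keys can be patched */
  static inline bool page_poisoning_enabled(void)
  {
          return _page_poisoning_enabled_early;
  }

  /* fast-path check: compiles to a patched jump for hot paths */
  static inline bool page_poisoning_enabled_static(void)
  {
          return static_branch_unlikely(&_page_poisoning_enabled);
  }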
Link: https://lkml.kernel.org/r/20201113104033.22907-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Laura Abbott <labbott@kernel.org>
Cc: Mateusz Nosek <mateusznosek0@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "cleanup page poisoning", v3.
I have identified a number of issues and opportunities for cleanup with
CONFIG_PAGE_POISON and friends:
- interaction with init_on_alloc and init_on_free parameters depends on
the order of parameters (Patch 1)
- the boot time enabling uses a static key, but inefficiently (Patch 2)
- sanity checking is incompatible with hibernation (Patch 3)
- CONFIG_PAGE_POISONING_NO_SANITY can be removed now that we have
init_on_free (Patch 4)
- CONFIG_PAGE_POISONING_ZERO can be most likely removed now that we
have init_on_free (Patch 5)
This patch (of 5):
Enabling page_poison=1 together with init_on_alloc=1 or init_on_free=1
produces a warning in dmesg that page_poison takes precedence. However,
as these warnings are printed in early_param handlers for
init_on_alloc/free, they are not printed if page_poison is enabled later
on the command line (handlers are called in the order of their
parameters), or when init_on_alloc/free is always enabled by the
respective config option - before the page_poison early param handler is
called, it is not considered to be enabled. This is inconsistent.
We can remove the dependency on order by making the init_on_* parameters
only set a boolean variable, and postponing the evaluation after all early
params have been processed. Introduce a new
init_mem_debugging_and_hardening() function for that, and move the related
debug_pagealloc processing there as well.
As a result init_mem_debugging_and_hardening() always knows accurately
whether the init_on_* and/or page_poison options were enabled. Thus we can also
optimize want_init_on_alloc() and want_init_on_free(). We don't need to
check page_poisoning_enabled() there, we can instead not enable the
init_on_* static keys at all if page poisoning is enabled. This results
in simpler and more effective code.
Link: https://lkml.kernel.org/r/20201113104033.22907-1-vbabka@suse.cz
Link: https://lkml.kernel.org/r/20201113104033.22907-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mateusz Nosek <mateusznosek0@gmail.com>
Cc: Laura Abbott <labbott@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
It is required to print the 'count' of pages, along with the pages, passed
to cma_release() to debug cases of a mismatched count value passed between
cma_alloc() and cma_release() from a code path.
As an example, consider the below scenario:
1) CMA pool size is 4MB and
2) User doing the erroneous step of allocating 2 pages but freeing 1
page in a loop from this CMA pool. The step 2 causes cma_alloc() to
return NULL at one point of time because of -ENOMEM condition.
And the current pr_debug logs do not give info about these types of
allocation patterns because the count value is not printed in
cma_release().
We are already printing the count value in the trace logs; just extend the
same to the pr_debug logs too.
[akpm@linux-foundation.org: fix printk warning]
Link: https://lkml.kernel.org/r/1606318341-29521-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The cma_mutex which protects alloc_contig_range() first appeared in
commit 7ee793a62fa8c ("cma: Remove potential deadlock situation"); at that
time, there was no guarantee about the behavior of concurrency inside
alloc_contig_range().
After commit 2c7452a075d4db2dc ("mm/page_isolation.c: make
start_isolate_page_range() fail if already isolated")
> However, two subsystems (CMA and gigantic
> huge pages for example) could attempt operations on the same range. If
> this happens, one thread may 'undo' the work another thread is doing.
> This can result in pageblocks being incorrectly left marked as
> MIGRATE_ISOLATE and therefore not available for page allocation.
The concurrency inside alloc_contig_range() was clarified.
Now we can find that hugepage and virtio call alloc_contig_range() without
any lock, thus cma_mutex is "redundant" in cma_alloc() now.
Link: https://lkml.kernel.org/r/20201020102241.3729-1-lecopzer.chen@mediatek.com
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
"dst" parameter to migrate_vma_insert_page() is not used anymore.
Link: https://lkml.kernel.org/r/CANubcdUwCAMuUyamG2dkWP=cqSR9MAS=tHLDc95kQkqU-rEnAg@mail.gmail.com
Signed-off-by: Stephen Zhang <starzhangzsd@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In the current implementation unmap_and_move() would return -ENOMEM if THP
migration is unsupported, then the THP will be split. If the split fails,
it just exits without trying to migrate other pages. That doesn't make too
much sense since there may be enough free memory to migrate other pages and
there may be a lot of base pages on the list.
Return -ENOSYS to make it consistent with hugetlb. And if the THP split
fails, just skip it and try the other pages on the list.
Just skip the whole list and exit when free memory is really low.
Link: https://lkml.kernel.org/r/20201113205359.556831-6-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
migrate_prep{_local}() never fails, so it is pointless to have a return
value and check it.
Link: https://lkml.kernel.org/r/20201113205359.556831-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
NUMA balancing skips shared exec base pages. Since
CONFIG_READ_ONLY_THP_FOR_FS was introduced, there are probably shared exec
THPs, so skip such THPs for NUMA balancing as well.
And Willy's regular filesystem THP support patches could create shared
exec THPs even without that config.
In addition, page_is_file_lru() is used to tell if the page is file
cache or not, but it filters out shmem pages. Putting executables in shmem
to achieve a performance gain via shmem-THP sounds like a typical use case,
so it seems worth skipping migration for such a case too.
Link: https://lkml.kernel.org/r/20201113205359.556831-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When unmap_and_move{_huge_page}() returns !-EAGAIN and
!MIGRATEPAGE_SUCCESS, the page would be put back to LRU or proper list if
it is non-LRU movable page. But, the callers always call
putback_movable_pages() to put the failed pages back later on, so it seems
not very efficient to put every single page back immediately, and the code
looks convoluted.
Put the failed page on a separate list, then splice the list to migrate
list when all pages are tried. It is the caller's responsibility to call
putback_movable_pages() to handle failures. This also makes the code
simpler and more readable.
After the change the rules are:
* Success: non hugetlb page will be freed, hugetlb page will be put
back
* -EAGAIN: stay on the from list
* -ENOMEM: stay on the from list
* Other errno: put on ret_pages list then splice to from list
The from list will be empty iff all pages are migrated successfully; it
was not so before. This has no impact on current existing callsites.
Link: https://lkml.kernel.org/r/20201113205359.556831-3-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm: misc migrate cleanup and improvement", v3.
This patch (of 5):
The commit 9f4e41f4717832e ("mm: refactor truncate_complete_page()")
refactored truncate_complete_page(), which does not exist anymore, so
correct the comment in vmscan and migrate to avoid confusion.
Link: https://lkml.kernel.org/r/20201113205359.556831-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20201113205359.556831-2-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We can only kmap() one subpage of a THP at a time, so loop over all
relevant subpages, skipping ones which don't need to be zeroed. This is
too large to inline when THPs are enabled and we actually need highmem, so
put it in highmem.c.
[willy@infradead.org: start1 was allowed to be less than start2]
Link: https://lkml.kernel.org/r/20201124041507.28996-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When migrating a zero page or pte_none() anonymous page to device private
memory, migrate_vma_setup() will initialize the src[] array with a NULL
PFN. This lets the device driver allocate device private memory and clear
it instead of DMAing a page of zeros over the device bus.
Since the source page didn't exist at the time, no struct page was locked
nor a migration PTE inserted into the CPU page tables. The actual PTE
insertion happens in migrate_vma_pages() when it tries to insert the
device private struct page PTE into the CPU page tables.
migrate_vma_pages() has to call the mmu notifiers again since another
device could fault on the same page before the page table locks are
acquired.
Allow device drivers to optimize the invalidation similar to
migrate_vma_setup() by calling mmu_notifier_range_init() which sets struct
mmu_notifier_range event type to MMU_NOTIFY_MIGRATE and the
migrate_pgmap_owner field.
Link: https://lkml.kernel.org/r/20201021191335.10916-1-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The word in the comment is misspelled, it should be "include".
Link: https://lkml.kernel.org/r/20201024114144.GA20552@lilong
Signed-off-by: Long Li <lonuxli.64@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|