aboutsummaryrefslogtreecommitdiffstats
path: root/include (follow)
AgeCommit message (Collapse)AuthorFilesLines
2018-12-28Merge tag 'gpio-v4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpioLinus Torvalds3-10/+49
Pull GPIO updates from Linus Walleij: "This is the bulk of GPIO changes for the v4.21 kernel series. Core changes: - Some core changes are already in outside of this pull request as they came through the regulator tree, most notably devm_gpiod_unhinge() that removes devres refcount management from a GPIO descriptor. This is needed in subsystems such as regulators where the regulator core need to take over the reference counting and lifecycle management for a GPIO descriptor. - We dropped devm_gpiochip_remove() and devm_gpio_chip_match() as nothing needs it. We can bring it back if need be. - Add a global TODO so people see where we are going. This helps setting the direction now that we are two GPIO maintainers. - Handle the MMC CD/WP properties in the device tree core. (The bulk of patches activating this code is already merged through the MMC/SD tree.) - Augment gpiochip_request_own_desc() to pass a flag so we as gpiochips can request lines as active low or open drain etc even from ourselves. New drivers: - New driver for Cadence GPIO blocks. - New driver for Atmel SAMA5D2 PIOBU GPIO lines. Driver improvements: - A major refactoring of the PCA953x driver - this driver has been around for ages, and is now modernized to reduce code duplication that has stacked up and is using regmap to read write and cache registers. - Intel drivers are now maintained in a separate tree and start with a round of cleanups and unifications" * tag 'gpio-v4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (99 commits) gpio: sama5d2-piobu: Depend on OF_GPIO gpio: Add Cadence GPIO driver dt-bindings: gpio: Add bindings for Cadence GPIO gpiolib-acpi: remove unused variable 'err', cleans up build warning gpio: mxs: read pin level directly instead of using .get gpio: aspeed: remove duplicated statement gpio: add driver for SAMA5D2 PIOBU pins dt-bindings: arm: atmel: describe SECUMOD usage as a GPIO controller gpio/mmc/of: Respect polarity in the device tree dt-bindings: gpio: rcar: Add r8a774c0 (RZ/G2E) support memory: omap-gpmc: Get the header of the enum ARM: omap1: Fix new user of gpiochip_request_own_desc() gpio: pca953x: Add regmap dependency for PCA953x driver gpio: raspberrypi-exp: decrease refcount on firmware dt node gpiolib: Fix return value of gpio_to_desc() stub if !GPIOLIB gpio: pca953x: Restore registers after suspend/resume cycle gpio: pca953x: Zap single use of pca953x_read_single() gpio: pca953x: Zap ad-hoc reg_output cache gpio: pca953x: Zap ad-hoc reg_direction cache gpio: pca953x: Perform basic regmap conversion ...
2018-12-28Merge tag 'drm-next-2018-12-27' of git://anongit.freedesktop.org/drm/drmLinus Torvalds1-0/+10
Pull more drm updates from Dave Airlie: "Daniel collected a couple of pulls after I want on holidays, back for a couple of days, so may as well send them out. This has exynos and etnaviv work for 4.21. exynos: - plane alpha and blending configurability etnaviv: - mostly cleanups in prep for new features" * tag 'drm-next-2018-12-27' of git://anongit.freedesktop.org/drm/drm: drm/etnaviv: remove lastctx member from gpu struct drm/etnaviv: replace header include with forward declaration drm/etnaviv: remove unnecessary local irq disable drm/exynos: fimd: Make pixel blend mode configurable drm/exynos: fimd: Make plane alpha configurable drm/etnaviv: Replace drm_dev_unref with drm_dev_put drm/etnaviv: consolidate hardware fence handling in etnaviv_gpu drm/etnaviv: kill active fence tracking
2018-12-28Merge tag 'hwmon-for-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-stagingLinus Torvalds3-2/+67
Pull hwmon updates from Guenter Roeck: "The big change in this series is for the most part automatic: Introducing SENSOR[_DEVICE]_ATTR_{RO,RW,WO} variants and conversion of various drivers to use it. This is similar to DEVICE_ATTR variants. Other than that, we have - Some conversions of S_<PERMS> with octal values, also automated - Added support for Hygon Dhyana CPUs to k10temp driver - Added support for STLM75 to lm75 driver - B57891S0103 to ntc_thermistor - Added pm-runtime support to ina3221 driver - Support for PowerPC On-Chip Controller (OCC) - Various minor bug fices and improvements" * tag 'hwmon-for-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: (80 commits) hwmon: (lm80) fix a missing check of bus read in lm80 probe hwmon: (lm80) fix a missing check of the status of SMBus read hwmon: (asus_atk0110) Fix debugfs_simple_attr.cocci warnings hwmon: (ftsteutates) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (fschmd) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (emc6w201) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (emc2103) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (emc1403) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (ds620) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (ds1621) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (dell-smm-hwmon) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (da9055-hwmon) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (da9052-hwmon) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (coretemp) Replace S_<PERMS> with octal values hwmon: (asus_atk0110) Replace S_<PERMS> with octal values hwmon: (aspeed-pwm-tacho) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (applesmc) Replace S_<PERMS> with octal values hwmon: (amc6821) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (adt7x10) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (adt7475) Use permission specific SENSOR[_DEVICE]_ATTR variants ...
2018-12-28Merge branch 'akpm' (patches from Andrew)Linus Torvalds31-188/+437
Merge misc updates from Andrew Morton: - large KASAN update to use arm's "software tag-based mode" - a few misc things - sh updates - ocfs2 updates - just about all of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits) kernel/fork.c: mark 'stack_vm_area' with __maybe_unused memcg, oom: notify on oom killer invocation from the charge path mm, swap: fix swapoff with KSM pages include/linux/gfp.h: fix typo mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization memory_hotplug: add missing newlines to debugging output mm: remove __hugepage_set_anon_rmap() include/linux/vmstat.h: remove unused page state adjustment macro mm/page_alloc.c: allow error injection mm: migrate: drop unused argument of migrate_page_move_mapping() blkdev: avoid migration stalls for blkdev pages mm: migrate: provide buffer_migrate_page_norefs() mm: migrate: move migrate_page_lock_buffers() mm: migrate: lock buffers before migrate_page_move_mapping() mm: migration: factor out code to compute expected number of page references mm, page_alloc: enable pcpu_drain with zone capability kmemleak: add config to select auto scan mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init ...
2018-12-28Merge tag 'mmc-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmcLinus Torvalds9-41/+296
Pull MMC updates from Ulf Hansson: "This time, this pull request contains changes crossing subsystems and archs/platforms, which is mainly because of a bigger modernization of moving from legacy GPIO to GPIO descriptors for MMC (by Linus Walleij). Additionally, once again, I am funneling changes to drivers/misc/cardreader/* and drivers/memstick/* through my MMC tree, mostly due to that we lack a maintainer for these. Summary: MMC core: - Cleanup BKOPS support - Introduce MMC_CAP_SYNC_RUNTIME_PM - slot-gpio: Delete legacy slot GPIO handling MMC host: - alcor: Add new mmc host driver for Alcor Micro PCI based cardreader - bcm2835: Several improvements to better recover from errors - jz4740: Rework and fixup pre|post_req support - mediatek: Add support for SDIO IRQs - meson-gx: Improve clock phase management - meson-gx: Stop descriptor on errors - mmci: Complete the sbc error path by sending a stop command - renesas_sdhi/tmio: Fixup reset/resume operations - renesas_sdhi: Add support for r8a774c0 and R7S9210 - renesas_sdhi: Whitelist R8A77990 SDHI - renesas_sdhi: Fixup eMMC HS400 compatibility issues for H3 and M3-W - rtsx_usb_sdmmc: Re-work card detection/removal support - rtsx_usb_sdmmc: Re-work runtime PM support - sdhci: Fix timeout loops for some variant drivers - sdhci: Improve support for error handling due to failing commands - sdhci-acpi/pci: Disable LED control for Intel BYT-based controllers - sdhci_am654: Add new SDHCI variant driver to support TI's AM654 SOCs - sdhci-of-esdhc: Add support for eMMC HS400 mode - sdhci-omap: Fixup reset support - sdhci-omap: Workaround errata regarding SDR104/HS200 tuning failures - sdhci-msm: Fixup sporadic write transfers issues for SDR104/HS200 - sdhci-msm: Fixup dynamical clock gating issues - various: Complete converting all hosts into using slot GPIO descriptors Other: - Move GPIO mmc platform data for mips/sh/arm to GPIO descriptors - Add new Alcor Micro cardreader PCI driver - Support runtime power management for memstick rtsx_usb_ms driver - Use USB remote wakeups for card detection for rtsx_usb misc driver" * tag 'mmc-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (99 commits) mmc: mediatek: Add MMC_CAP_SDIO_IRQ support mmc: renesas_sdhi_internal_dmac: Whitelist r8a774c0 dt-bindings: mmc: renesas_sdhi: Add r8a774c0 support mmc: core: Cleanup BKOPS support mmc: core: Drop redundant check in mmc_send_hpi_cmd() mmc: sdhci-omap: Workaround errata regarding SDR104/HS200 tuning failures (i929) dt-bindings: sdhci-omap: Add note for cpu_thermal mmc: sdhci-acpi: Disable LED control for Intel BYT-based controllers mmc: sdhci-pci: Disable LED control for Intel BYT-based controllers mmc: sdhci: Add quirk to disable LED control mmc: mmci: add variant property to set command stop bit misc: alcor_pci: fix spelling mistake "invailid" -> "invalid" mmc: meson-gx: add signal resampling mmc: meson-gx: align default phase on soc vendor tree mmc: meson-gx: remove useless lock mmc: meson-gx: make sure the descriptor is stopped on errors mmc: sdhci_am654: Add Initial Support for AM654 SDHCI driver dt-bindings: mmc: sdhci-of-arasan: Add deprecated message for AM65 dt-bindings: mmc: sdhci-am654: Document bindings for the host controllers on TI's AM654 SOCs mmc: sdhci-msm: avoid unused function warning ...
2018-12-28Merge tag 'libnvdimm-for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimmLinus Torvalds2-5/+74
Pull libnvdimm updates from Dan Williams: "The vast bulk of this update is the new support for the security capabilities of some nvdimms. The userspace tooling for this capability is still a work in progress, but the changes survive the existing libnvdimm unit tests. The changes also pass manual checkout on hardware and the new nfit_test emulation of the security capability. The touches of the security/keys/ files have received the necessary acks from Mimi and David. Those changes were necessary to allow for a new generic encrypted-key type, and allow the nvdimm sub-system to lookup key material referenced by the libnvdimm-sysfs interface. Summary: - Add support for the security features of nvdimm devices that implement a security model similar to ATA hard drive security. The security model supports locking access to the media at device-power-loss, to be unlocked with a passphrase, and secure-erase (crypto-scramble). Unlike the ATA security case where the kernel expects device security to be managed in a pre-OS environment, the libnvdimm security implementation allows key provisioning and key-operations at OS runtime. Keys are managed with the kernel's encrypted-keys facility to provide data-at-rest security for the libnvdimm key material. The usage model mirrors fscrypt key management, but is driven via libnvdimm sysfs. - Miscellaneous updates for api usage and comment fixes" * tag 'libnvdimm-for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits) libnvdimm/security: Quiet security operations libnvdimm/security: Add documentation for nvdimm security support tools/testing/nvdimm: add Intel DSM 1.8 support for nfit_test tools/testing/nvdimm: Add overwrite support for nfit_test tools/testing/nvdimm: Add test support for Intel nvdimm security DSMs acpi/nfit, libnvdimm/security: add Intel DSM 1.8 master passphrase support acpi/nfit, libnvdimm/security: Add security DSM overwrite support acpi/nfit, libnvdimm: Add support for issue secure erase DSM to Intel nvdimm acpi/nfit, libnvdimm: Add enable/update passphrase support for Intel nvdimms acpi/nfit, libnvdimm: Add disable passphrase support to Intel nvdimm. acpi/nfit, libnvdimm: Add unlock of nvdimm support for Intel DIMMs acpi/nfit, libnvdimm: Add freeze security support to Intel nvdimm acpi/nfit, libnvdimm: Introduce nvdimm_security_ops keys-encrypted: add nvdimm key format type to encrypted keys keys: Export lookup_user_key to external users acpi/nfit, libnvdimm: Store dimm id as a member to struct nvdimm libnvdimm, namespace: Replace kmemdup() with kstrndup() libnvdimm, label: Switch to bitmap_zalloc() ACPI/nfit: Adjust annotation for why return 0 if fail to find NFIT at start libnvdimm, bus: Check id immediately following ida_simple_get ...
2018-12-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds19-406/+793
Pull rdma updates from Jason Gunthorpe: "This has been a fairly typical cycle, with the usual sorts of driver updates. Several series continue to come through which improve and modernize various parts of the core code, and we finally are starting to get the uAPI command interface cleaned up. - Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4, mlx5, qib, rxe, usnic - Rework the entire syscall flow for uverbs to be able to run over ioctl(). Finally getting past the historic bad choice to use write() for command execution - More functional coverage with the mlx5 'devx' user API - Start of the HFI1 series for 'TID RDMA' - SRQ support in the hns driver - Support for new IBTA defined 2x lane widths - A big series to consolidate all the driver function pointers into a big struct and have drivers provide a 'static const' version of the struct instead of open coding initialization - New 'advise_mr' uAPI to control device caching/loading of page tables - Support for inline data in SRPT - Modernize how umad uses the driver core and creates cdev's and sysfs files - First steps toward removing 'uobject' from the view of the drivers" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (193 commits) RDMA/srpt: Use kmem_cache_free() instead of kfree() RDMA/mlx5: Signedness bug in UVERBS_HANDLER() IB/uverbs: Signedness bug in UVERBS_HANDLER() IB/mlx5: Allocate the per-port Q counter shared when DEVX is supported IB/umad: Start using dev_groups of class IB/umad: Use class_groups and let core create class file IB/umad: Refactor code to use cdev_device_add() IB/umad: Avoid destroying device while it is accessed IB/umad: Simplify and avoid dynamic allocation of class IB/mlx5: Fix wrong error unwind IB/mlx4: Remove set but not used variable 'pd' RDMA/iwcm: Don't copy past the end of dev_name() string IB/mlx5: Fix long EEH recover time with NVMe offloads IB/mlx5: Simplify netdev unbinding IB/core: Move query port to ioctl RDMA/nldev: Expose port_cap_flags2 IB/core: uverbs copy to struct or zero helper IB/rxe: Reuse code which sets port state IB/rxe: Make counters thread safe IB/mlx5: Use the correct commands for UMEM and UCTX allocation ...
2018-12-28Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds6-34/+151
Pull SCSI updates from James Bottomley: "This is mostly update of the usual drivers: smarpqi, lpfc, qedi, megaraid_sas, libsas, zfcp, mpt3sas, hisi_sas. Additionally, we have a pile of annotation, unused variable and minor updates. The big API change is the updates for Christoph's DMA rework which include removing the DISABLE_CLUSTERING flag. And finally there are a couple of target tree updates" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (259 commits) scsi: isci: request: mark expected switch fall-through scsi: isci: remote_node_context: mark expected switch fall-throughs scsi: isci: remote_device: Mark expected switch fall-throughs scsi: isci: phy: Mark expected switch fall-through scsi: iscsi: Capture iscsi debug messages using tracepoints scsi: myrb: Mark expected switch fall-throughs scsi: megaraid: fix out-of-bound array accesses scsi: mpt3sas: mpt3sas_scsih: Mark expected switch fall-through scsi: fcoe: remove set but not used variable 'port' scsi: smartpqi: call pqi_free_interrupts() in pqi_shutdown() scsi: smartpqi: fix build warnings scsi: smartpqi: update driver version scsi: smartpqi: add ofa support scsi: smartpqi: increase fw status register read timeout scsi: smartpqi: bump driver version scsi: smartpqi: add smp_utils support scsi: smartpqi: correct lun reset issues scsi: smartpqi: correct volume status scsi: smartpqi: do not offline disks for transient did no connect conditions scsi: smartpqi: allow for larger raid maps ...
2018-12-28Merge tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds8-310/+186
Pull DMA mapping updates from Christoph Hellwig: "A huge update this time, but a lot of that is just consolidating or removing code: - provide a common DMA_MAPPING_ERROR definition and avoid indirect calls for dma_map_* error checking - use direct calls for the DMA direct mapping case, avoiding huge retpoline overhead for high performance workloads - merge the swiotlb dma_map_ops into dma-direct - provide a generic remapping DMA consistent allocator for architectures that have devices that perform DMA that is not cache coherent. Based on the existing arm64 implementation and also used for csky now. - improve the dma-debug infrastructure, including dynamic allocation of entries (Robin Murphy) - default to providing chaining scatterlist everywhere, with opt-outs for the few architectures (alpha, parisc, most arm32 variants) that can't cope with it - misc sparc32 dma-related cleanups - remove the dma_mark_clean arch hook used by swiotlb on ia64 and replace it with the generic noncoherent infrastructure - fix the return type of dma_set_max_seg_size (Niklas Söderlund) - move the dummy dma ops for not DMA capable devices from arm64 to common code (Robin Murphy) - ensure dma_alloc_coherent returns zeroed memory to avoid kernel data leaks through userspace. We already did this for most common architectures, but this ensures we do it everywhere. dma_zalloc_coherent has been deprecated and can hopefully be removed after -rc1 with a coccinelle script" * tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping: (73 commits) dma-mapping: fix inverted logic in dma_supported dma-mapping: deprecate dma_zalloc_coherent dma-mapping: zero memory returned from dma_alloc_* sparc/iommu: fix ->map_sg return value sparc/io-unit: fix ->map_sg return value arm64: default to the direct mapping in get_arch_dma_ops PCI: Remove unused attr variable in pci_dma_configure ia64: only select ARCH_HAS_DMA_COHERENT_TO_PFN if swiotlb is enabled dma-mapping: bypass indirect calls for dma-direct vmd: use the proper dma_* APIs instead of direct methods calls dma-direct: merge swiotlb_dma_ops into the dma_direct code dma-direct: use dma_direct_map_page to implement dma_direct_map_sg dma-direct: improve addressability error reporting swiotlb: remove dma_mark_clean swiotlb: remove SWIOTLB_MAP_ERROR ACPI / scan: Refactor _CCA enforcement dma-mapping: factor out dummy DMA ops dma-mapping: always build the direct mapping code dma-mapping: move dma_cache_sync out of line dma-mapping: move various slow path functions out of line ...
2018-12-28Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-blockLinus Torvalds31-604/+664
Pull block updates from Jens Axboe: "This is the main pull request for block/storage for 4.21. Larger than usual, it was a busy round with lots of goodies queued up. Most notable is the removal of the old IO stack, which has been a long time coming. No new features for a while, everything coming in this week has all been fixes for things that were previously merged. This contains: - Use atomic counters instead of semaphores for mtip32xx (Arnd) - Cleanup of the mtip32xx request setup (Christoph) - Fix for circular locking dependency in loop (Jan, Tetsuo) - bcache (Coly, Guoju, Shenghui) * Optimizations for writeback caching * Various fixes and improvements - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith) * host and target support for NVMe over TCP * Error log page support * Support for separate read/write/poll queues * Much improved polling * discard OOM fallback * Tracepoint improvements - lightnvm (Hans, Hua, Igor, Matias, Javier) * Igor added packed metadata to pblk. Now drives without metadata per LBA can be used as well. * Fix from Geert on uninitialized value on chunk metadata reads. * Fixes from Hans and Javier to pblk recovery and write path. * Fix from Hua Su to fix a race condition in the pblk recovery code. * Scan optimization added to pblk recovery from Zhoujie. * Small geometry cleanup from me. - Conversion of the last few drivers that used the legacy path to blk-mq (me) - Removal of legacy IO path in SCSI (me, Christoph) - Removal of legacy IO stack and schedulers (me) - Support for much better polling, now without interrupts at all. blk-mq adds support for multiple queue maps, which enables us to have a map per type. This in turn enables nvme to have separate completion queues for polling, which can then be interrupt-less. Also means we're ready for async polled IO, which is hopefully coming in the next release. - Killing of (now) unused block exports (Christoph) - Unification of the blk-rq-qos and blk-wbt wait handling (Josef) - Support for zoned testing with null_blk (Masato) - sx8 conversion to per-host tag sets (Christoph) - IO priority improvements (Damien) - mq-deadline zoned fix (Damien) - Ref count blkcg series (Dennis) - Lots of blk-mq improvements and speedups (me) - sbitmap scalability improvements (me) - Make core inflight IO accounting per-cpu (Mikulas) - Export timeout setting in sysfs (Weiping) - Cleanup the direct issue path (Jianchao) - Export blk-wbt internals in block debugfs for easier debugging (Ming) - Lots of other fixes and improvements" * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits) kyber: use sbitmap add_wait_queue/list_del wait helpers sbitmap: add helpers for add/del wait queue handling block: save irq state in blkg_lookup_create() dm: don't reuse bio for flushes nvme-pci: trace SQ status on completions nvme-rdma: implement polling queue map nvme-fabrics: allow user to pass in nr_poll_queues nvme-fabrics: allow nvmf_connect_io_queue to poll nvme-core: optionally poll sync commands block: make request_to_qc_t public nvme-tcp: fix spelling mistake "attepmpt" -> "attempt" nvme-tcp: fix endianess annotations nvmet-tcp: fix endianess annotations nvme-pci: refactor nvme_poll_irqdisable to make sparse happy nvme-pci: only set nr_maps to 2 if poll queues are supported nvmet: use a macro for default error location nvmet: fix comparison of a u16 with -1 blk-mq: enable IO poll if .nr_queues of type poll > 0 blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight() blk-mq: skip zero-queue maps in blk_mq_map_swqueue ...
2018-12-28Merge tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playgroundLinus Torvalds8-70/+60
Pull y2038 updates from Arnd Bergmann: "More syscalls and cleanups This concludes the main part of the system call rework for 64-bit time_t, which has spread over most of year 2018, the last six system calls being - ppoll - pselect6 - io_pgetevents - recvmmsg - futex - rt_sigtimedwait As before, nothing changes for 64-bit architectures, while 32-bit architectures gain another entry point that differs only in the layout of the timespec structure. Hopefully in the next release we can wire up all 22 of those system calls on all 32-bit architectures, which gives us a baseline version for glibc to start using them. This does not include the clock_adjtime, getrusage/waitid, and getitimer/setitimer system calls. I still plan to have new versions of those as well, but they are not required for correct operation of the C library since they can be emulated using the old 32-bit time_t based system calls. Aside from the system calls, there are also a few cleanups here, removing old kernel internal interfaces that have become unused after all references got removed. The arch/sh cleanups are part of this, there were posted several times over the past year without a reaction from the maintainers, while the corresponding changes made it into all other architectures" * tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: timekeeping: remove obsolete time accessors vfs: replace current_kernel_time64 with ktime equivalent timekeeping: remove timespec_add/timespec_del timekeeping: remove unused {read,update}_persistent_clock sh: remove board_time_init() callback sh: remove unused rtc_sh_get/set_time infrastructure sh: sh03: rtc: push down rtc class ops into driver sh: dreamcast: rtc: push down rtc class ops into driver y2038: signal: Add compat_sys_rt_sigtimedwait_time64 y2038: signal: Add sys_rt_sigtimedwait_time32 y2038: socket: Add compat_sys_recvmmsg_time64 y2038: futex: Add support for __kernel_timespec y2038: futex: Move compat implementation into futex.c io_pgetevents: use __kernel_timespec pselect6: use __kernel_timespec ppoll: use __kernel_timespec signal: Add restore_user_sigmask() signal: Add set_user_sigmask()
2018-12-28include/linux/gfp.h: fix typoKyle Spiers1-1/+1
Fix misspelled "satisfied" Link: http://lkml.kernel.org/r/20181227232354.64562-1-ksspiers@google.com Signed-off-by: Kyle Spiers <ksspiers@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmmDan Williams2-32/+24
The kbuild robot reported the following on a development branch that used memremap.h in a new path: In file included from arch/m68k/include/asm/pgtable_mm.h:148:0, from arch/m68k/include/asm/pgtable.h:5, from include/linux/memremap.h:7, from drivers//dax/bus.c:3: arch/m68k/include/asm/motorola_pgtable.h: In function 'pgd_offset': >> arch/m68k/include/asm/motorola_pgtable.h:199:11: error: dereferencing pointer to incomplete type 'const struct mm_struct' return mm->pgd + pgd_index(address); ^~ The ->page_fault() callback is specific to HMM. Move it to 'struct hmm_devmem' where the unusual asm/pgtable.h dependency can be contained in include/linux/hmm.h. Longer term refactoring this dependency out of HMM is recommended, but in the meantime memremap.h remains generic. Link: http://lkml.kernel.org/r/154534090899.3120190.6652620807617715272.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 5042db43cc26 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE memory...") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28include/linux/vmstat.h: remove unused page state adjustment macroWei Yang1-5/+0
These four macro are not used anymore. Just remove them. Link: http://lkml.kernel.org/r/20181214063211.2290-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/page_alloc.c: allow error injectionBenjamin Poirier1-0/+1
Model call chain after should_failslab(). Likewise, we can now use a kprobe to override the return value of should_fail_alloc_page() and inject allocation failures into alloc_page*(). This will allow injecting allocation failures using the BCC tools even without building kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with a fail_page_alloc= parameter, which incurs some overhead even when failures are not being injected. On the other hand, this patch adds an unconditional call to should_fail_alloc_page() from page allocation hotpath. That overhead should be rather negligible with CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though. [vbabka@suse.cz: changelog addition] Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: migrate: drop unused argument of migrate_page_move_mapping()Jan Kara1-2/+1
All callers of migrate_page_move_mapping() now pass NULL for 'head' argument. Drop it. Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: migrate: provide buffer_migrate_page_norefs()Jan Kara1-0/+4
Provide a variant of buffer_migrate_page() that also checks whether there are no unexpected references to buffer heads. This function will then be safe to use for block device pages. [akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)] Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPESPingfan Liu1-1/+2
Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated by code. If someone adds extra migrate type, then he may forget to enlarge the NR_PAGEBLOCK_BITS. Hence it requires some way to fix. NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, while these macro spread on two different .h file with reverse dependency, it is a little hard to refer to MIGRATE_TYPES in pageblock-flag.h. This patch tries to remind such relation in compiling-time. Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, thp, proc: report THP eligibility for each vmaMichal Hocko1-1/+12
Userspace falls short when trying to find out whether a specific memory range is eligible for THP. There are usecases that would like to know that http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com : This is used to identify heap mappings that should be able to fault thp : but do not, and they normally point to a low-on-memory or fragmentation : issue. The only way to deduce this now is to query for hg resp. nh flags and confronting the state with the global setting. Except that there is also PR_SET_THP_DISABLE that might change the picture. So the final logic is not trivial. Moreover the eligibility of the vma depends on the type of VMA as well. In the past we have supported only anononymous memory VMAs but things have changed and shmem based vmas are supported as well these days and the query logic gets even more complicated because the eligibility depends on the mount option and another global configuration knob. Simplify the current state and report the THP eligibility in /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled for this purpose. The original implementation of this function assumes that the caller knows that the vma itself is supported for THP so make the core checks into __transparent_hugepage_enabled and use it for existing callers. __show_smap just use the new transparent_hugepage_enabled which also checks the vma support status (please note that this one has to be out of line due to include dependency issues). [mhocko@kernel.org: fix oops with NULL ->f_mapping] Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28include/linux/memory_hotplug.h: remove duplicate declaration of offline_pages()Wei Yang1-1/+0
offline_pages() is already declared in this file. Just remove the duplicated one. Link: http://lkml.kernel.org/r/20181205031357.24769-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/mmu_notifier: use structure for invalidate_range_start/end calls v2Jérôme Glisse2-31/+62
To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> From: Jérôme Glisse <jglisse@redhat.com> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/mmu_notifier: use structure for invalidate_range_start/end callbackJérôme Glisse1-5/+9
Patch series "mmu notifier contextual informations", v2. This patchset adds contextual information, why an invalidation is happening, to mmu notifier callback. This is necessary for user of mmu notifier that wish to maintains their own data structure without having to add new fields to struct vm_area_struct (vma). For instance device can have they own page table that mirror the process address space. When a vma is unmap (munmap() syscall) the device driver can free the device page table for the range. Today we do not have any information on why a mmu notifier call back is happening and thus device driver have to assume that it is always an munmap(). This is inefficient at it means that it needs to re-allocate device page table on next page fault and rebuild the whole device driver data structure for the range. Other use case beside munmap() also exist, for instance it is pointless for device driver to invalidate the device page table when the invalidation is for the soft dirtyness tracking. Or device driver can optimize away mprotect() that change the page table permission access for the range. This patchset enables all this optimizations for device drivers. I do not include any of those in this series but another patchset I am posting will leverage this. The patchset is pretty simple from a code point of view. The first two patches consolidate all mmu notifier arguments into a struct so that it is easier to add/change arguments. The last patch adds the contextual information (munmap, protection, soft dirty, clear, ...). This patch (of 3): To avoid having to change many callback definition everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end callback. No functional changes with this patch. [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc] Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jason Gunthorpe <jgg@mellanox.com> [infiniband] Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28lib/ioremap: ensure break-before-make is used for huge p4d mappingsWill Deacon1-0/+5
Whilst no architectures actually enable support for huge p4d mappings in the vmap area, the code that is implemented should be using break-before-make, as we do for pud and pmd huge entries. Link: http://lkml.kernel.org/r/1544120495-17438-6-git-send-email-will.deacon@arm.com Signed-off-by: Will Deacon <will.deacon@arm.com> Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Cc: Chintan Pandya <cpandya@codeaurora.org> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Michal Hocko <mhocko@suse.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28/proc/kpagecount: return 0 for special pages that are never mappedAnthony Yznaga1-0/+6
Certain pages that are never mapped to userspace have a type indicated in the page_type field of their struct pages (e.g. PG_buddy). page_type overlaps with _mapcount so set the count to 0 and avoid calling page_mapcount() for these pages. [anthony.yznaga@oracle.com: incorporate feedback from Matthew Wilcox] Link: http://lkml.kernel.org/r/1544481313-27318-1-git-send-email-anthony.yznaga@oracle.com Link: http://lkml.kernel.org/r/1543963526-27917-1-git-send-email-anthony.yznaga@oracle.com Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, hotplug: move init_currently_empty_zone() under zone_span_lock protectionWei Yang1-2/+2
During online_pages phase, pgdat->nr_zones will be updated in case this zone is empty. Currently the online_pages phase is protected by the global locks (device_device_hotplug_lock and mem_hotplug_lock), which ensures there is no contention during the update of nr_zones. These global locks introduces scalability issues (especially the second one), which slow down code relying on get_online_mems(). This is also a preparation for not having to rely on get_online_mems() but instead some more fine grained locks. The patch moves init_currently_empty_zone under both zone_span_writelock and pgdat_resize_lock because both the pgdat state is changed (nr_zones) and the zone's start_pfn. Also this patch changes the documentation of node_size_lock to include the protection of nr_zones. Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, sparse: pass nid instead of pgdat to sparse_add_one_section()Wei Yang1-2/+2
Since the information needed in sparse_add_one_section() is node id to allocate proper memory, it is not necessary to pass its pgdat. This patch changes the prototype of sparse_add_one_section() to pass node id directly. This is intended to reduce misleading that sparse_add_one_section() would touch pgdat. Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section()Wei Yang1-2/+1
pgdat_resize_lock is used to protect pgdat's memory region information like: node_start_pfn, node_present_pages, etc. While in function sparse_add/remove_one_section(), pgdat_resize_lock is used to protect initialization/release of one mem_section. This looks not proper. These code paths are currently protected by mem_hotplug_lock currently but should there ever be any reason for locking at the sparse layer a dedicated lock should be introduced. Following is the current call trace of sparse_add/remove_one_section() mem_hotplug_begin() arch_add_memory() add_pages() __add_pages() __add_section() sparse_add_one_section() mem_hotplug_done() mem_hotplug_begin() arch_remove_memory() __remove_pages() __remove_section() sparse_remove_one_section() mem_hotplug_done() The comment above the pgdat_resize_lock also mentions "Holding this will also guarantee that any pfn_valid() stays that way.", which is true with the current implementation and false after this patch. But current implementation doesn't meet this comment. There isn't any pfn walkers to take the lock so this looks like a relict from the past. This patch also removes this comment. [richard.weiyang@gmail.com: v4] Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com [mhocko@suse.com: changelog suggestion] Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: remove pte_lock_deinit()Yu Zhao1-9/+2
Pagetable page doesn't touch page->mapping or have any used field that overlaps with it. No need to clear mapping in dtor. In fact, doing so might mask problems that otherwise would be detected by bad_page(). Link: http://lkml.kernel.org/r/20181128235525.58780-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/memblock.c: skip kmemleak for kasan_init()Qian Cai1-0/+1
Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and Huawei TaiShan 2280 aarch64 servers). After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early log buffer went from something like 280 to 260000 which caused kmemleak disabled and crash dump memory reservation failed. The multitude of kmemleak_alloc() calls is from nested loops while KASAN is setting up full memory mappings, so let early kmemleak allocations skip those memblock_alloc_internal() calls came from kasan_init() given that those early KASAN memory mappings should not reference to other memory. Hence, no kmemleak false positives. kasan_init kasan_map_populate [1] kasan_pgd_populate [2] kasan_pud_populate [3] kasan_pmd_populate [4] kasan_pte_populate [5] kasan_alloc_zeroed_page memblock_alloc_try_nid memblock_alloc_internal kmemleak_alloc [1] for_each_memblock(memory, reg) [2] while (pgdp++, addr = next, addr != end) [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp))) [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp))) [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep))) Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us Signed-off-by: Qian Cai <cai@gmx.us> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, memory_hotplug: add nid parameter to arch_remove_memoryOscar Salvador1-2/+2
Patch series "Do not touch pages in hot-remove path", v2. This patchset aims for two things: 1) A better definition about offline and hot-remove stage 2) Solving bugs where we can access non-initialized pages during hot-remove operations [2] [3]. This is achieved by moving all page/zone handling to the offline stage, so we do not need to access pages when hot-removing memory. [1] https://patchwork.kernel.org/cover/10691415/ [2] https://patchwork.kernel.org/patch/10547445/ [3] https://www.spinics.net/lists/linux-mm/msg161316.html This patch (of 5): This is a preparation for the following-up patches. The idea of passing the nid is that it will allow us to get rid of the zone parameter afterwards. Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: check nr_initialised with PAGES_PER_SECTION directly in defer_init()Wei Yang1-2/+0
When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section of each node's highest zone is initialized before defer stage. static_init_pgcnt is used to store the number of pages like this: pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION, pgdat->node_spanned_pages); because we don't want to overflow zone's range. But this is not necessary, since defer_init() is called like this: memmap_init_zone() for pfn in [start_pfn, end_pfn) defer_init(pfn, end_pfn) In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would stop before calling defer_init(). BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct, since nr_initialised is zone based instead of node based. Even node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone would have pages less than PAGES_PER_SECTION. Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: put_and_wait_on_page_locked() while page is migratedHugh Dickins1-0/+2
Waiting on a page migration entry has used wait_on_page_locked() all along since 2006: but you cannot safely wait_on_page_locked() without holding a reference to the page, and that extra reference is enough to make migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults on the entry before migrate_page_move_mapping() gets there. And that failure is retried nine times, amplifying the pain when trying to migrate a popular page. With a single persistent faulter, migration sometimes succeeds; with two or three concurrent faulters, success becomes much less likely (and the more the page was mapped, the worse the overhead of unmapping and remapping it on each try). This is especially a problem for memory offlining, where the outer level retries forever (or until terminated from userspace), because a heavy refault workload can trigger an endless loop of migration failures. wait_on_page_locked() is the wrong tool for the job. David Herrmann (but was he the first?) noticed this issue in 2014: https://marc.info/?l=linux-mm&m=140110465608116&w=2 Tim Chen started a thread in August 2017 which appears relevant: https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went on to implicate __migration_entry_wait(): https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit") Baoquan He reported "Memory hotplug softlock issue" 14 November 2018: https://marc.info/?l=linux-mm&m=154217936431300&w=2 We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking. But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function(). Add interface put_and_wait_on_page_locked() to do this, using "behavior" enum in place of "lock" arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic. __migration_entry_wait() still has to take a brief reference to the page, prior to calling put_and_wait_on_page_locked(): but now that it is dropped before waiting, the chance of impeding page migration is very much reduced. Should we perhaps disable preemption across this? shrink_page_list()'s __ClearPageLocked(): that was a surprise! This survived a lot of testing before that showed up. PageWaiters may have been set by wait_on_page_bit_common(), and the reference dropped, just before shrink_page_list() succeeds in freezing its last page reference: in such a case, unlock_page() must be used. Follow the suggestion from Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now: that optimization predates PageWaiters, and won't buy much these days; but we can reinstate it for the !PageWaiters case if anyone notices. It does raise the question: should vmscan.c's is_page_cache_freeable() and __remove_mapping() now treat a PageWaiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only used for page migration. [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Baoquan He <bhe@redhat.com> Tested-by: Baoquan He <bhe@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, oom: add oom victim's memcg to the oom context informationyuzhoujian1-2/+9
The current oom report doesn't display victim's memcg context during the global OOM situation. While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process. Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg. Below is the single line output in the oom report after this patch. - global oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> - memcg oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()] Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, oom: reorganize the oom report in dump_headeryuzhoujian1-0/+10
OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: make free_reserved_area() return "const char *"Alexey Dobriyan1-1/+1
and propagate through down the call stack. Link: http://lkml.kernel.org/r/20181124091411.GC10969@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/debug.c: make "migrate_reason_names[]" const char *Alexey Dobriyan1-1/+1
Those strings are immutable as well. Link: http://lkml.kernel.org/r/20181124090508.GB10877@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/mmzone.c: make "migratetype_names" const char *Alexey Dobriyan1-1/+1
Those strings are immutable in fact. Link: http://lkml.kernel.org/r/20181124090327.GA10877@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: reclaim small amounts of memory when an external fragmentation event occursMel Gorman2-4/+8
An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: move zone watermark accesses behind an accessorMel Gorman1-4/+5
This is a preparation patch only, no functional change. Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/memory_hotplug: drop "online" parameter from add_memory_resource()David Hildenbrand1-1/+1
Userspace should always be in charge of how to online memory and if memory should be onlined automatically in the kernel. Let's drop the parameter to overwrite this - XEN passes memhp_auto_online, just like add_memory(), so we can directly use that instead internally. Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Mathieu Malaterre <malat@debian.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28memblock: replace usage of __memblock_free_early() with memblock_free()Mike Rapoport1-3/+2
__memblock_free_early() is only used by the convenience wrappers, so essentially we wrap a call to memblock_free() twice. Replace calls of __memblock_free_early() with calls to memblock_free() and drop the former. Link: http://lkml.kernel.org/r/20181125102940.GE28634@rapoport-lnx Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Wentao Wang <witallwang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, hmm: use devm semantics for hmm_devmem_{add, remove}Dan Williams1-3/+1
devm semantics arrange for resources to be torn down when device-driver-probe fails or when device-driver-release completes. Similar to devm_memremap_pages() there is no need to support an explicit remove operation when the users properly adhere to devm semantics. Note that devm_kzalloc() automatically handles allocating node-local memory. Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, devm_memremap_pages: fix shutdown handlingDan Williams1-0/+2
The last step before devm_memremap_pages() returns success is to allocate a release action, devm_memremap_pages_release(), to tear the entire setup down. However, the result from devm_add_action() is not checked. Checking the error from devm_add_action() is not enough. The api currently relies on the fact that the percpu_ref it is using is killed by the time the devm_memremap_pages_release() is run. Rather than continue this awkward situation, offload the responsibility of killing the percpu_ref to devm_memremap_pages_release() directly. This allows devm_memremap_pages() to do the right thing relative to init failures and shutdown. Without this change we could fail to register the teardown of devm_memremap_pages(). The likelihood of hitting this failure is tiny as small memory allocations almost always succeed. However, the impact of the failure is large given any future reconfiguration, or disable/enable, of an nvdimm namespace will fail forever as subsequent calls to devm_memremap_pages() will fail to setup the pgmap_radix since there will be stale entries for the physical address range. An argument could be made to require that the ->kill() operation be set in the @pgmap arg rather than passed in separately. However, it helps code readability, tracking the lifetime of a given instance, to be able to grep the kill routine directly at the devm_memremap_pages() call site. Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...") Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com> Reported-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/swap: use nr_node_ids for avail_lists in swap_info_structAaron Lu1-1/+10
Since a2468cc9bfdf ("swap: choose swap device according to numa node"), avail_lists field of swap_info_struct is changed to an array with MAX_NUMNODES elements. This made swap_info_struct size increased to 40KiB and needs an order-4 page to hold it. This is not optimal in that: 1 Most systems have way less than MAX_NUMNODES(1024) nodes so it is a waste of memory; 2 It could cause swapon failure if the swap device is swapped on after system has been running for a while, due to no order-4 page is available as pointed out by Vasily Averin. Solve the above two issues by using nr_node_ids(which is the actual possible node number the running system has) for avail_lists instead of MAX_NUMNODES. nr_node_ids is unknown at compile time so can't be directly used when declaring this array. What I did here is to declare avail_lists as zero element array and allocate space for it when allocating space for swap_info_struct. The reason why keep using array but not pointer is plist_for_each_entry needs the field to be part of the struct, so pointer will not work. This patch is on top of Vasily Averin's fix commit. I think the use of kvzalloc for swap_info_struct is still needed in case nr_node_ids is really big on some systems. Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is nWei Yang1-6/+0
Commit fa5e084e43eb ("vmscan: do not unconditionally treat zones that fail zone_reclaim() as full") changed the return value of node_reclaim(). The original return value 0 means NODE_RECLAIM_SOME after this commit. While the return value of node_reclaim() when CONFIG_NUMA is n is not changed. This will leads to call zone_watermark_ok() again. This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN. Since node_reclaim() is only called in page_alloc.c, move it to mm/internal.h. Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: remove managed_page_count_lock spinlockArun KS1-6/+0
Now that totalram_pages and managed_pages are atomic varibles, no need of managed_page_count spinlock. The lock had really a weak consistency guarantee. It hasn't been used for anything but the update but no reader actually cares about all the values being updated to be in sync. Link: http://lkml.kernel.org/r/1542090790-21750-5-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: convert totalram_pages and totalhigh_pages variables to atomicArun KS3-4/+52
totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: convert zone->managed_pages to atomic variableArun KS1-2/+7
totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. This patch converts zone->managed_pages. Subsequent patches will convert totalram_panges, totalhigh_pages and eventually managed_page_count_lock will be removed. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28xxHash: create arch dependent 32/64-bit xxhash()Timofey Titovets1-0/+23
Patch series "Currently used jhash are slow enough and replace it allow as to make KSM", v8. Apeed (in kernel): ksm: crc32c hash() 12081 MB/s ksm: xxh64 hash() 8770 MB/s ksm: xxh32 hash() 4529 MB/s ksm: jhash2 hash() 1569 MB/s Sioh Lee's testing (copy from other mail): Test platform: openstack cloud platform (NEWTON version) Experiment node: openstack based cloud compute node (CPU: xeon E5-2620 v3, memory 64gb) VM: (2 VCPU, RAM 4GB, DISK 20GB) * 4 Linux kernel: 4.14 (latest version) KSM setup - sleep_millisecs: 200ms, pages_to_scan: 200 Experiment process: Firstly, we turn off KSM and launch 4 VMs. Then we turn on the KSM and measure the checksum computation time until full_scans become two. The experimental results (the experimental value is the average of the measured values) crc32c_intel: 1084.10ns crc32c (no hardware acceleration): 7012.51ns xxhash32: 2227.75ns xxhash64: 1413.16ns jhash2: 5128.30ns In summary, the result shows that crc32c_intel has advantages over all of the hash function used in the experiment. (decreased by 84.54% compared to crc32c, 78.86% compared to jhash2, 51.33% xxhash32, 23.28% compared to xxhash64) the results are similar to those of Timofey. But, use only xxhash for now, because for using crc32c, cryptoapi must be initialized first - that require some tricky solution to work good in all situations. So: - First patch implement compile time pickup of fastest implementation of xxhash for target platform. - The second patch replaces jhash2 with xxhash This patch (of 2): xxh32() - fast on both 32/64-bit platforms xxh64() - fast only on 64-bit platform Create xxhash() which will pick up the fastest version at compile time. Link: http://lkml.kernel.org/r/20181023182554.23464-2-nefelim4ag@gmail.com Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: leesioh <solee@os.korea.ac.kr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: only report isolation failures when offlining memoryMichal Hocko1-2/+9
Heiko has complained that his log is swamped by warnings from has_unmovable_pages [ 20.536664] page dumped because: has_unmovable_pages [ 20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0 [ 20.536794] flags: 0x3fffe0000010200(slab|head) [ 20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600 [ 20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000 [ 20.536797] page dumped because: has_unmovable_pages [ 20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0 [ 20.536815] flags: 0x7fffe0000000000() [ 20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000 [ 20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000 which are not triggered by the memory hotplug but rather CMA allocator. The original idea behind dumping the page state for all call paths was that these messages will be helpful debugging failures. From the above it seems that this is not the case for the CMA path because we are lacking much more context. E.g the second reported page might be a CMA allocated page. It is still interesting to see a slab page in the CMA area but it is hard to tell whether this is bug from the above output alone. Address this issue by dumping the page state only on request. Both start_isolate_page_range and has_unmovable_pages already have an argument to ignore hwpoison pages so make this argument more generic and turn it into flags and allow callers to combine non-default modes into a mask. While we are at it, has_unmovable_pages call from is_pageblock_removable_nolock (sysfs removable file) is questionable to report the failure so drop it from there as well. Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>