aboutsummaryrefslogtreecommitdiffstats
path: root/include/linux (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-10-11include/linux/ctype.h: make isdigit() table lookuplessAlexey Dobriyan1-1/+4
Make isdigit into a simple range checking inline function: return '0' <= c && c <= '9'; This code is 1 branch, not 2 because any reasonable compiler can optimize this code into SUB+CMP, so the code while (isdigit((c = *s++))) ... remains 1 branch per iteration HOWEVER it suddenly doesn't do table lookup priming cacheline nobody cares about. Link: http://lkml.kernel.org/r/20160826190047.GA12536@p183.telecom.by Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11radix-tree: 'slot' can be NULL in radix_tree_next_slot()Ross Zwisler1-0/+8
There are four cases I can see where we could end up with a NULL 'slot' in radix_tree_next_slot(). Yet radix_tree_next_slot() never actually checks whether 'slot' is NULL. It just happens that for the cases where 'slot' is NULL, some other combination of factors prevents us from dereferencing it. It would be very easy for someone to unwittingly change one of these factors without realizing that we are implicitly depending on it to save us from a NULL pointer dereference. Add a comment documenting the things that allow 'slot' to be safely passed as NULL to radix_tree_next_slot(). Here are details on the four cases: 1) radix_tree_iter_retry() via a non-tagged iteration like radix_tree_for_each_slot(). In this case we currently aren't seeing a bug because radix_tree_iter_retry() sets iter->next_index = iter->index; which means that in in the else case in radix_tree_next_slot(), 'count' is zero, so we skip over the while() loop and effectively just return NULL without ever dereferencing 'slot'. 2) radix_tree_iter_retry() via tagged iteration like radix_tree_for_each_tagged(). This case was giving us NULL pointer dereferences in testing, and was fixed with this commit: commit 3cb9185c6730 ("radix-tree: fix radix_tree_iter_retry() for tagged iterators.") This fix doesn't explicitly check for 'slot' being NULL, though, it works around the NULL pointer dereference by instead zeroing iter->tags in radix_tree_iter_retry(), which makes us bail out of the if() case in radix_tree_next_slot() before we dereference 'slot'. 3) radix_tree_iter_next() via via a non-tagged iteration like radix_tree_for_each_slot(). This currently happens in shmem_tag_pins() and shmem_partial_swap_usage(). As with non-tagged iteration, 'count' in the else case of radix_tree_next_slot() is zero, so we skip over the while() loop and effectively just return NULL without ever dereferencing 'slot'. 4) radix_tree_iter_next() via tagged iteration like radix_tree_for_each_tagged(). This happens in shmem_wait_for_pins(). radix_tree_iter_next() zeros out iter->tags, so we end up exiting radix_tree_next_slot() here: if (flags & RADIX_TREE_ITER_TAGGED) { void *canon = slot; iter->tags >>= 1; if (unlikely(!iter->tags)) return NULL; Link: http://lkml.kernel.org/r/20160815194237.25967-2-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-1/+1
Pull networking fixes from David Miller: 1) Netfilter list handling fix, from Linus. 2) RXRPC/AFS bug fixes from David Howells (oops on call to serviceless endpoints, build warnings, missing notifications, etc.) From David Howells. 3) Kernel log message missing newlines, from Colin Ian King. 4) Don't enter direct reclaim in netlink dumps, the idea is to use a high order allocation first and fallback quickly to a 0-order allocation if such a high-order one cannot be done cheaply and without reclaim. From Eric Dumazet. 5) Fix firmware download errors in btusb bluetooth driver, from Ethan Hsieh. 6) Missing Kconfig deps for QCOM_EMAC, from Geert Uytterhoeven. 7) Fix MDIO_XGENE dup Kconfig entry. From Laura Abbott. 8) Constrain ipv6 rtr_solicits sysctl values properly, from Maciej Żenczykowski. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits) netfilter: Fix slab corruption. be2net: Enable VF link state setting for BE3 be2net: Fix TX stats for TSO packets be2net: Update Copyright string in be_hw.h be2net: NCSI FW section should be properly updated with ethtool for BE3 be2net: Provide an alternate way to read pf_num for BEx chips wan/fsl_ucc_hdlc: Fix size used in dma_free_coherent() net: macb: NULL out phydev after removing mdio bus xen-netback: make sure that hashes are not send to unaware frontends Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion MAINTAINERS: add myself as a maintainer of xen-netback ipv6 addrconf: disallow rtr_solicits < -1 Bluetooth: btusb: Fix atheros firmware download error drivers: net: phy: Correct duplicate MDIO_XGENE entry ethernet: qualcomm: QCOM_EMAC should depend on HAS_DMA and HAS_IOMEM net: ethernet: mediatek: remove hwlro property in the device tree net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi net: ethernet: mediatek: get the chip id by ETHDMASYS registers net: bgmac: Fix errant feature flag check netlink: do not enter direct reclaim from netlink_dump() ...
2016-10-10Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-3/+3
Pull more vfs updates from Al Viro: ">rename2() work from Miklos + current_time() from Deepa" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: Replace current_fs_time() with current_time() fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps fs: Replace CURRENT_TIME with current_time() for inode timestamps fs: proc: Delete inode time initializations in proc_alloc_inode() vfs: Add current_time() api vfs: add note about i_op->rename changes to porting fs: rename "rename2" i_op to "rename" vfs: remove unused i_op->rename fs: make remaining filesystems use .rename2 libfs: support RENAME_NOREPLACE in simple_rename() fs: support RENAME_NOREPLACE for local filesystems ncpfs: fix unused variable warning
2016-10-10Merge remote-tracking branch 'ovl/rename2' into for-linusAl Viro12-41/+146
2016-10-10Merge tag 'for-linus-20161008' of git://git.infradead.org/linux-mtdLinus Torvalds2-69/+272
Pull MTD updates from Brian Norris: "I've not been very active this cycle, so these are mostly from Boris, for the NAND flash subsystem. NAND: - Add the infrastructure to automate NAND timings configuration - Provide a generic DT property to maximize ECC strength - Some refactoring in the core bad block table handling, to help with improving some of the logic in error cases. - Minor cleanups and fixes MTD: - Add APIs for handling page pairing; this is necessary for reliably supporting MLC and TLC NAND flash, where paired-page disturbance affects reliability. Upper layers (e.g., UBI) should make use of these in the near future" * tag 'for-linus-20161008' of git://git.infradead.org/linux-mtd: (35 commits) mtd: nand: fix trivial spelling error mtdpart: Propagate _get/put_device() mtd: nand: Provide nand_cleanup() function to free NAND related resources mtd: Kill the OF_MTD Kconfig option mtd: nand: mxc: Test CONFIG_OF instead of CONFIG_OF_MTD mtd: nand: Fix nand_command_lp() for 8bits opcodes mtd: nand: sunxi: Support ECC maximization mtd: nand: Support maximizing ECC when using software BCH mtd: nand: Add an option to maximize the ECC strength mtd: nand: mxc: Add timing setup for v2 controllers mtd: nand: mxc: implement onfi get/set features mtd: nand: sunxi: switch from manual to automated timing config mtd: nand: automate NAND timings selection mtd: nand: Expose data interface for ONFI mode 0 mtd: nand: Add function to convert ONFI mode to data_interface mtd: nand: convert ONFI mode into data interface mtd: nand: Introduce nand_data_interface mtd: nand: Create a NAND reset function mtd: nand: remove unnecessary 'extern' from function declarations MAINTAINERS: Add maintainer entry for Ingenic JZ4780 NAND driver ...
2016-10-10Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2-13/+17
Pull vfs xattr updates from Al Viro: "xattr stuff from Andreas This completes the switch to xattr_handler ->get()/->set() from ->getxattr/->setxattr/->removexattr" * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: Remove {get,set,remove}xattr inode operations xattr: Stop calling {get,set,remove}xattr inode operations vfs: Check for the IOP_XATTR flag in listxattr xattr: Add __vfs_{get,set,remove}xattr helpers libfs: Use IOP_XATTR flag for empty directory handling vfs: Use IOP_XATTR flag for bad-inode handling vfs: Add IOP_XATTR inode operations flag vfs: Move xattr_resolve_name to the front of fs/xattr.c ecryptfs: Switch to generic xattr handlers sockfs: Get rid of getxattr iop sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names kernfs: Switch to generic xattr handlers hfs: Switch to generic xattr handlers jffs2: Remove jffs2_{get,set,remove}xattr macros xattr: Remove unnecessary NULL attribute name check
2016-10-10Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds2-4/+3
Pull crypto updates from Herbert Xu: "Here is the crypto update for 4.9: API: - The crypto engine code now supports hashes. Algorithms: - Allow keys >= 2048 bits in FIPS mode for RSA. Drivers: - Memory overwrite fix for vmx ghash. - Add support for building ARM sha1-neon in Thumb2 mode. - Reenable ARM ghash-ce code by adding import/export. - Reenable img-hash by adding import/export. - Add support for multiple cores in omap-aes. - Add little-endian support for sha1-powerpc. - Add Cavium HWRNG driver for ThunderX SoC" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (137 commits) crypto: caam - treat SGT address pointer as u64 crypto: ccp - Make syslog errors human-readable crypto: ccp - clean up data structure crypto: vmx - Ensure ghash-generic is enabled crypto: testmgr - add guard to dst buffer for ahash_export crypto: caam - Unmap region obtained by of_iomap crypto: sha1-powerpc - little-endian support crypto: gcm - Fix IV buffer size in crypto_gcm_setkey crypto: vmx - Fix memory corruption caused by p8_ghash crypto: ghash-generic - move common definitions to a new header file crypto: caam - fix sg dump hwrng: omap - Only fail if pm_runtime_get_sync returns < 0 crypto: omap-sham - shrink the internal buffer size crypto: omap-sham - add support for export/import crypto: omap-sham - convert driver logic to use sgs for data xmit crypto: omap-sham - change the DMA threshold value to a define crypto: omap-sham - add support functions for sg based data handling crypto: omap-sham - rename sgl to sgl_tmp for deprecation crypto: omap-sham - align algorithms on word offset crypto: omap-sham - add context export/import stubs ...
2016-10-10Merge tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds6-3/+89
Pull Ceph updates from Ilya Dryomov: "The big ticket item here is support for rbd exclusive-lock feature, with maintenance operations offloaded to userspace (Douglas Fuller, Mike Christie and myself). Another block device bullet is a series fixing up layering error paths (myself). On the filesystem side, we've got patches that improve our handling of buffered vs dio write races (Neil Brown) and a few assorted fixes from Zheng. Also included a couple of random cleanups and a minor CRUSH update" * tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client: (39 commits) crush: remove redundant local variable crush: don't normalize input of crush_ln iteratively libceph: ceph_build_auth() doesn't need ceph_auth_build_hello() libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello() ceph: fix description for rsize and rasize mount options rbd: use kmalloc_array() in rbd_header_from_disk() ceph: use list_move instead of list_del/list_add ceph: handle CEPH_SESSION_REJECT message ceph: avoid accessing / when mounting a subpath ceph: fix mandatory flock check ceph: remove warning when ceph_releasepage() is called on dirty page ceph: ignore error from invalidate_inode_pages2_range() in direct write ceph: fix error handling of start_read() rbd: add rbd_obj_request_error() helper rbd: img_data requests don't own their page array rbd: don't call rbd_osd_req_format_read() for !img_data requests rbd: rework rbd_img_obj_exists_submit() error paths rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback() rbd: move bumping img_request refcount into rbd_obj_request_submit() rbd: mark the original request as done if stat request fails ...
2016-10-10Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-2/+2
Pull splice fixups from Al Viro: "A couple of fixups for interaction of pipe-backed iov_iter with O_DIRECT reads + constification of a couple of primitives in uio.h missed by previous rounds. Kudos to davej - his fuzzing has caught those bugs" * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [btrfs] fix check_direct_IO() for non-iovec iterators constify iov_iter_count() and iter_is_iovec() fix ITER_PIPE interaction with direct_IO
2016-10-10Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds9-114/+43
Pull misc vfs updates from Al Viro: "Assorted misc bits and pieces. There are several single-topic branches left after this (rename2 series from Miklos, current_time series from Deepa Dinamani, xattr series from Andreas, uaccess stuff from from me) and I'd prefer to send those separately" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits) proc: switch auxv to use of __mem_open() hpfs: support FIEMAP cifs: get rid of unused arguments of CIFSSMBWrite() posix_acl: uapi header split posix_acl: xattr representation cleanups fs/aio.c: eliminate redundant loads in put_aio_ring_file fs/internal.h: add const to ns_dentry_operations declaration compat: remove compat_printk() fs/buffer.c: make __getblk_slow() static proc: unsigned file descriptors fs/file: more unsigned file descriptors fs: compat: remove redundant check of nr_segs cachefiles: Fix attempt to read i_blocks after deleting file [ver #2] cifs: don't use memcpy() to copy struct iov_iter get rid of separate multipage fault-in primitives fs: Avoid premature clearing of capabilities fs: Give dentry to inode_change_ok() instead of inode fuse: Propagate dentry down to inode_change_ok() ceph: Propagate dentry down to inode_change_ok() xfs: Propagate dentry down to inode_change_ok() ...
2016-10-10Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-15/+34
Pull protection keys syscall interface from Thomas Gleixner: "This is the final step of Protection Keys support which adds the syscalls so user space can actually allocate keys and protect memory areas with them. Details and usage examples can be found in the documentation. The mm side of this has been acked by Mel" * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/pkeys: Update documentation x86/mm/pkeys: Do not skip PKRU register if debug registers are not used x86/pkeys: Fix pkeys build breakage for some non-x86 arches x86/pkeys: Add self-tests x86/pkeys: Allow configuration of init_pkru x86/pkeys: Default to a restrictive init PKRU pkeys: Add details of system call use to Documentation/ generic syscalls: Wire up memory protection keys syscalls x86: Wire up protection keys system calls x86/pkeys: Allocation/free syscalls x86/pkeys: Make mprotect_key() mask off additional vm_flags mm: Implement new pkey_mprotect() system call x86/pkeys: Add fault handling for PF_PK page fault bit
2016-10-10constify iov_iter_count() and iter_is_iovec()Al Viro1-2/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-10Merge branch 'printk-cleanups'Linus Torvalds2-1/+2
Merge my system logging cleanups, triggered by the broken '\n' patches. The line continuation handling has been broken basically forever, and the code to handle the system log records was both confusing and dubious. And it would do entirely the wrong thing unless you always had a terminating newline, partly because it couldn't actually see whether a message was marked KERN_CONT or not (but partly because the LOG_CONT handling in the recording code was rather confusing too). This re-introduces a real semantically meaningful KERN_CONT, and fixes the few places I noticed where it was missing. There are probably more missing cases, since KERN_CONT hasn't actually had any semantic meaning for at least four years (other than the checkpatch meaning of "no log level necessary, this is a continuation line"). This also allows the combination of KERN_CONT and a log level. In that case the log level will be ignored if the merging with a previous line is successful, but if a new record is needed, that new record will now get the right log level. That also means that you can at least in theory combine KERN_CONT with the "pr_info()" style helpers, although any use of pr_fmt() prefixing would make that just result in a mess, of course (the prefix would end up in the middle of a continuing line). * printk-cleanups: printk: make reading the kernel log flush pending lines printk: re-organize log_output() to be more legible printk: split out core logging code into helper function printk: reinstate KERN_CONT for printing continuation lines
2016-10-09Merge branch 'for-4.9/block-smp' of git://git.kernel.dk/linux-blockLinus Torvalds1-7/+1
Pull blk-mq CPU hotplug update from Jens Axboe: "This is the conversion of blk-mq to the new hotplug state machine" * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block: blk-mq: fixup "Convert to new hotplug state machine" blk-mq: Convert to new hotplug state machine blk-mq/cpu-notif: Convert to new hotplug state machine
2016-10-09Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-blockLinus Torvalds2-8/+13
Pull blk-mq irq/cpu mapping updates from Jens Axboe: "This is the block-irq topic branch for 4.9-rc. It's mostly from Christoph, and it allows drivers to specify their own mappings, and more importantly, to share the blk-mq mappings with the IRQ affinity mappings. It's a good step towards making this work better out of the box" * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block: blk_mq: linux/blk-mq.h does not include all the headers it depends on blk-mq: kill unused blk_mq_create_mq_map() blk-mq: get rid of the cpumask in struct blk_mq_tags nvme: remove the post_scan callout nvme: switch to use pci_alloc_irq_vectors blk-mq: provide a default queue mapping for PCI device blk-mq: allow the driver to pass in a queue mapping blk-mq: remove ->map_queue blk-mq: only allocate a single mq_map per tag_set blk-mq: don't redistribute hardware queues on a CPU hotplug event
2016-10-09Merge tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dmLinus Torvalds1-0/+1
Pull device mapper updates from Mike Snitzer: - various fixes and cleanups for request-based DM core - add support for delaying the requeue of requests; used by DM multipath when all paths have failed and 'queue_if_no_path' is enabled - DM cache improvements to speedup the loading metadata and the writing of the hint array - fix potential for a dm-crypt crash on device teardown - remove dm_bufio_cond_resched() and just using cond_resched() - change DM multipath to return a reservation conflict error immediately; rather than failing the path and retrying (potentially indefinitely) * tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits) dm mpath: always return reservation conflict without failing over dm bufio: remove dm_bufio_cond_resched() dm crypt: fix crash on exit dm cache metadata: switch to using the new cursor api for loading metadata dm array: introduce cursor api dm btree: introduce cursor api dm cache policy smq: distribute entries to random levels when switching to smq dm cache: speed up writing of the hint array dm array: add dm_array_new() dm mpath: delay the requeue of blk-mq requests while all paths down dm mpath: use dm_mq_kick_requeue_list() dm rq: introduce dm_mq_kick_requeue_list() dm rq: reduce arguments passed to map_request() and dm_requeue_original_request() dm rq: add DM_MAPIO_DELAY_REQUEUE to delay requeue of blk-mq requests dm: convert wait loops to use autoremove_wake_function() dm: use signal_pending_state() in dm_wait_for_completion() dm: rename task state function arguments dm: add two lockdep_assert_held() statements dm rq: simplify dm_old_stop_queue() dm mpath: check if path's request_queue is dying in activate_path() ...
2016-10-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdmaLinus Torvalds1-1/+12
Pull main rdma updates from Doug Ledford: "This is the main pull request for the rdma stack this release. The code has been through 0day and I had it tagged for linux-next testing for a couple days. Summary: - updates to mlx5 - updates to mlx4 (two conflicts, both minor and easily resolved) - updates to iw_cxgb4 (one conflict, not so obvious to resolve, proper resolution is to keep the code in cxgb4_main.c as it is in Linus' tree as attach_uld was refactored and moved into cxgb4_uld.c) - improvements to uAPI (moved vendor specific API elements to uAPI area) - add hns-roce driver and hns and hns-roce ACPI reset support - conversion of all rdma code away from deprecated create_singlethread_workqueue - security improvement: remove unsafe ib_get_dma_mr (breaks lustre in staging)" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits) staging/lustre: Disable InfiniBand support iw_cxgb4: add fast-path for small REG_MR operations cxgb4: advertise support for FR_NSMR_TPTE_WR IB/core: correctly handle rdma_rw_init_mrs() failure IB/srp: Fix infinite loop when FMR sg[0].offset != 0 IB/srp: Remove an unused argument IB/core: Improve ib_map_mr_sg() documentation IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets IB/mthca: Move user vendor structures IB/nes: Move user vendor structures IB/ocrdma: Move user vendor structures IB/mlx4: Move user vendor structures IB/cxgb4: Move user vendor structures IB/cxgb3: Move user vendor structures IB/mlx5: Move and decouple user vendor structures IB/{core,hw}: Add constant for node_desc ipoib: Make ipoib_warn ratelimited IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue IB/ipoib: Remove deprecated create_singlethread_workqueue ...
2016-10-09printk: reinstate KERN_CONT for printing continuation linesLinus Torvalds2-1/+2
Long long ago the kernel log buffer was a buffered stream of bytes, very much like stdio in user space. It supported log levels by scanning the stream and noticing the log level markers at the beginning of each line, but if you wanted to print a partial line in multiple chunks, you just did multiple printk() calls, and it just automatically worked. Except when it didn't, and you had very confusing output when different lines got all mixed up with each other. Then you got fragment lines mixing with each other, or with non-fragment lines, because it was traditionally impossible to tell whether a printk() call was a continuation or not. To at least help clarify the issue of continuation lines, we added a KERN_CONT marker back in 2007 to mark continuation lines: 474925277671 ("printk: add KERN_CONT annotation"). That continuation marker was initially an empty string, and didn't actuall make any semantic difference. But it at least made it possible to annotate the source code, and have check-patch notice that a printk() didn't need or want a log level marker, because it was a continuation of a previous line. To avoid the ambiguity between a continuation line that had that KERN_CONT marker, and a printk with no level information at all, we then in 2009 made KERN_CONT be a real log level marker which meant that we could now reliably tell the difference between the two cases. 5fd29d6ccbc9 ("printk: clean up handling of log-levels and newlines") and we could take advantage of that to make sure we didn't mix up continuation lines with lines that just didn't have any loglevel at all. Then, in 2012, the kernel log buffer was changed to be a "record" based log, where each line was a record that has a loglevel and a timestamp. You can see the beginning of that conversion in commits e11fea92e13f ("kmsg: export printk records to the /dev/kmsg interface") 7ff9554bb578 ("printk: convert byte-buffer to variable-length record buffer") with a number of follow-up commits to fix some painful fallout from that conversion. Over all, it took a couple of months to sort out most of it. But the upside was that you could have concurrent readers (and writers) of the kernel log and not have lines with mixed output in them. And one particular pain-point for the record-based kernel logging was exactly the fragmentary lines that are generated in smaller chunks. In order to still log them as one recrod, the continuation lines need to be attached to the previous record properly. However the explicit continuation record marker that is actually useful for this exact case was actually removed in aroundm the same time by commit 61e99ab8e35a ("printk: remove the now unnecessary "C" annotation for KERN_CONT") due to the incorrect belief that KERN_CONT wasn't meaningful. The ambiguity between "is this a continuation line" or "is this a plain printk with no log level information" was reintroduced, and in fact became an even bigger pain point because there was now the whole record-level merging of kernel messages going on. This patch reinstates the KERN_CONT as a real non-empty string marker, so that the ambiguity is fixed once again. But it's not a plain revert of that original removal: in the four years since we made KERN_CONT an empty string again, not only has the format of the log level markers changed, we've also had some usage changes in this area. For example, some ACPI code seems to use KERN_CONT _together_ with a log level, and now uses both the KERN_CONT marker and (for example) a KERN_INFO marker to show that it's an informational continuation of a line. Which is actually not a bad idea - if the continuation line cannot be attached to its predecessor, without the log level information we don't know what log level to assign to it (and we traditionally just assigned it the default loglevel). So having both a log level and the KERN_CONT marker is not necessarily a bad idea, but it does mean that we need to actually iterate over potentially multiple markers, rather than just a single one. Also, since KERN_CONT was still conceptually needed, and encouraged, but didn't actually _do_ anything, we've also had the reverse problem: rather than having too many annotations it has too few, and there is bit rot with code that no longer marks the continuation lines with the KERN_CONT marker. So this patch not only re-instates the non-empty KERN_CONT marker, it also fixes up the cases of bit-rot I noticed in my own logs. There are probably other cases where KERN_CONT will be needed to be added, either because it is new code that never dealt with the need for KERN_CONT, or old code that has bitrotted without anybody noticing. That said, we should strive to avoid the need for KERN_CONT. It does result in real problems for logging, and should generally not be seen as a good feature. If we some day can get rid of the feature entirely, because nobody does any fragmented printk calls, that would be lovely. But until that point, let's at mark the code that relies on the hacky multi-fragment kernel printk's. Not only does it avoid the ambiguity, it also annotates code as "maybe this would be good to fix some day". (That said, particularly during single-threaded bootup, the downsides of KERN_CONT are very limited. Things get much hairier when you have multiple threads going on and user level reading and writing logs too). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-08Merge tag '4.9/mtd-pairing-scheme' of github.com:linux-nand/linuxBrian Norris1-0/+107
Introduction of the MTD pairing scheme concept.
2016-10-08Merge remote-tracking branch 'jk/vfs' into work.miscAl Viro2-1/+2
2016-10-08Merge remote-tracking branch 'ovl/misc' into work.miscAl Viro3-10/+27
2016-10-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds24-152/+240
Merge updates from Andrew Morton: - fsnotify updates - ocfs2 updates - all of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits) console: don't prefer first registered if DT specifies stdout-path cred: simpler, 1D supplementary groups CREDITS: update Pavel's information, add GPG key, remove snail mail address mailmap: add Johan Hovold .gitattributes: set git diff driver for C source code files uprobes: remove function declarations from arch/{mips,s390} spelling.txt: "modeled" is spelt correctly nmi_backtrace: generate one-line reports for idle cpus arch/tile: adopt the new nmi_backtrace framework nmi_backtrace: do a local dump_stack() instead of a self-NMI nmi_backtrace: add more trigger_*_cpu_backtrace() methods min/max: remove sparse warnings when they're nested Documentation/filesystems/proc.txt: add more description for maps/smaps mm, proc: fix region lost in /proc/self/smaps proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self proc: add LSM hook checks to /proc/<tid>/timerslack_ns proc: relax /proc/<tid>/timerslack_ns capability requirements meminfo: break apart a very long seq_printf with #ifdefs seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char proc: faster /proc/*/status ...
2016-10-07Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-socLinus Torvalds3-4/+60
Pull ARM SoC driver updates from Arnd Bergmann: "Driver updates for ARM SoCs, including a couple of newly added drivers: - The Qualcomm external bus interface 2 (EBI2), used in some of their mobile phone chips for connecting flash memory, LCD displays or other peripherals - Secure monitor firmware for Amlogic SoCs, and an NVMEM driver for the EFUSE based on that firmware interface. - Perf support for the AppliedMicro X-Gene performance monitor unit - Reset driver for STMicroelectronics STM32 - Reset driver for SocioNext UniPhier SoCs Aside from these, there are minor updates to SoC-specific bus, clocksource, firmware, pinctrl, reset, rtc and pmic drivers" * tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (50 commits) bus: qcom-ebi2: depend on HAS_IOMEM pinctrl: mvebu: orion5x: Generalise mv88f5181l support for 88f5181 clk: mvebu: Add clk support for the orion5x SoC mv88f5181 dt-bindings: EXYNOS: Add Exynos5433 PMU compatible clocksource: exynos_mct: Add the support for ARM64 perf: xgene: Add APM X-Gene SoC Performance Monitoring Unit driver Documentation: Add documentation for APM X-Gene SoC PMU DTS binding MAINTAINERS: Add entry for APM X-Gene SoC PMU driver bus: qcom: add EBI2 driver bus: qcom: add EBI2 device tree bindings rtc: rtc-pm8xxx: Add support for pm8018 rtc nvmem: amlogic: Add Amlogic Meson EFUSE driver firmware: Amlogic: Add secure monitor driver soc: qcom: smd: Reset rx tail rather than tx memory: atmel-sdramc: fix a possible NULL dereference reset: hi6220: allow to compile test driver on other architectures reset: zynq: add driver Kconfig option reset: sunxi: add driver Kconfig option reset: stm32: add driver Kconfig option reset: socfpga: add driver Kconfig option ...
2016-10-07Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversionAlex Sidorenko1-1/+1
Roundrobin runner of team driver uses 'unsigned int' variable to count the number of sent_packets. Later it is passed to a subroutine team_num_to_port_index(struct team *team, int num) as 'num' and when we reach MAXINT (2**31-1), 'num' becomes negative. This leads to using incorrect hash-bucket for port lookup and as a result, packets are dropped. The fix consists of changing 'int num' to 'unsigned int num'. Testing of a fixed kernel shows that there is no packet drop anymore. Signed-off-by: Alex Sidorenko <alexandre.sidorenko@hpe.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-07Merge branch 'parisc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linuxLinus Torvalds1-15/+4
Pull parisc updates from Helge Deller: "Changes include: - Fix boot of 32bit SMP kernel (initial kernel mapping was too small) - Added hardened usercopy checks - Drop bootmem and switch to memblock and NO_BOOTMEM implementation - Drop the BROKEN_RODATA config option (and thus remove the relevant code from the generic headers and files because parisc was the last architecture which used this config option) - Improve segfault reporting by printing human readable error strings - Various smaller changes, e.g. dwarf debug support for assembly code, update comments regarding copy_user_page_asm, switch to kmalloc_array()" * 'parisc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Increase KERNEL_INITIAL_SIZE for 32-bit SMP kernels parisc: Drop bootmem and switch to memblock parisc: Add hardened usercopy feature parisc: Add cfi_startproc and cfi_endproc to assembly code parisc: Move hpmc stack into page aligned bss section parisc: Fix self-detected CPU stall warnings on Mako machines parisc: Report trap type as human readable string parisc: Update comment regarding implementation of copy_user_page_asm parisc: Use kmalloc_array() in add_system_map_addresses() parisc: Check return value of smp_boot_one_cpu() parisc: Drop BROKEN_RODATA config option
2016-10-07vfs: Remove {get,set,remove}xattr inode operationsAndreas Gruenbacher2-9/+0
These inode operations are no longer used; remove them. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07console: don't prefer first registered if DT specifies stdout-pathPaul Burton1-0/+6
If a device tree specifies a preferred device for kernel console output via the stdout-path or linux,stdout-path chosen node properties or the stdout alias then the kernel ought to honor it & output the kernel console to that device. As it stands, this isn't the case. Whilst we parse the stdout-path properties & set an of_stdout variable from of_alias_scan(), and use that from of_console_check() to determine whether to add a console device as a preferred console whilst registering it, we also prefer the first registered console if no other has been selected at the time of its registration. This means that if a console other than the one the device tree selects via stdout-path is registered first, we will switch to using it & when the stdout-path console is later registered the call to add_preferred_console() via of_console_check() is too late to do anything useful. In practice this seems to mean that we switch to the dummy console device fairly early & see no further console output: Console: colour dummy device 80x25 console [tty0] enabled bootconsole [ns16550a0] disabled Fix this by not automatically preferring the first registered console if one is specified by the device tree. This allows consoles to be registered but not enabled, and once the driver for the console selected by stdout-path calls of_console_check() the driver will be added to the list of preferred consoles before any other console has been enabled. When that console is then registered via register_console() it will be enabled as expected. Link: http://lkml.kernel.org/r/20160809151937.26118-1-paul.burton@imgtec.com Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Tejun Heo <tj@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Ivan Delalande <colona@arista.com> Cc: Thierry Reding <treding@nvidia.com> Cc: Borislav Petkov <bp@suse.de> Cc: Jan Kara <jack@suse.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Frank Rowand <frowand.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07cred: simpler, 1D supplementary groupsAlexey Dobriyan1-10/+1
Current supplementary groups code can massively overallocate memory and is implemented in a way so that access to individual gid is done via 2D array. If number of gids is <= 32, memory allocation is more or less tolerable (140/148 bytes). But if it is not, code allocates full page (!) regardless and, what's even more fun, doesn't reuse small 32-entry array. 2D array means dependent shifts, loads and LEAs without possibility to optimize them (gid is never known at compile time). All of the above is unnecessary. Switch to the usual trailing-zero-len-array scheme. Memory is allocated with kmalloc/vmalloc() and only as much as needed. Accesses become simpler (LEA 8(gi,idx,4) or even without displacement). Maximum number of gids is 65536 which translates to 256KB+8 bytes. I think kernel can handle such allocation. On my usual desktop system with whole 9 (nine) aux groups, struct group_info shrinks from 148 bytes to 44 bytes, yay! Nice side effects: - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing, - fix little mess in net/ipv4/ping.c should have been using GROUP_AT macro but this point becomes moot, - aux group allocation is persistent and should be accounted as such. Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.by Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07nmi_backtrace: generate one-line reports for idle cpusChris Metcalf1-0/+5
When doing an nmi backtrace of many cores, most of which are idle, the output is a little overwhelming and very uninformative. Suppress messages for cpus that are idling when they are interrupted and just emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN". We do this by grouping all the cpuidle code together into a new .cpuidle.text section, and then checking the address of the interrupted PC to see if it lies within that section. This commit suitably tags x86 and tile idle routines, and only adds in the minimal framework for other architectures. Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm] Tested-by: Petr Mladek <pmladek@suse.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07nmi_backtrace: add more trigger_*_cpu_backtrace() methodsChris Metcalf1-5/+26
Patch series "improvements to the nmi_backtrace code" v9. This patch series modifies the trigger_xxx_backtrace() NMI-based remote backtracing code to make it more flexible, and makes a few small improvements along the way. The motivation comes from the task isolation code, where there are scenarios where we want to be able to diagnose a case where some cpu is about to interrupt a task-isolated cpu. It can be helpful to see both where the interrupting cpu is, and also an approximation of where the cpu that is being interrupted is. The nmi_backtrace framework allows us to discover the stack of the interrupted cpu. I've tested that the change works as desired on tile, and build-tested x86, arm, mips, and sparc64. For x86 I confirmed that the generic cpuidle stuff as well as the architecture-specific routines are in the new cpuidle section. For arm, mips, and sparc I just build-tested it and made sure the generic cpuidle routines were in the new cpuidle section, but I didn't attempt to figure out which the platform-specific idle routines might be. That might be more usefully done by someone with platform experience in follow-up patches. This patch (of 4): Currently you can only request a backtrace of either all cpus, or all cpus but yourself. It can also be helpful to request a remote backtrace of a single cpu, and since we want that, the logical extension is to support a cpumask as the underlying primitive. This change modifies the existing lib/nmi_backtrace.c code to take a cpumask as its basic primitive, and modifies the linux/nmi.h code to use the new "cpumask" method instead. The existing clients of nmi_backtrace (arm and x86) are converted to using the new cpumask approach in this change. The other users of the backtracing API (sparc64 and mips) are converted to use the cpumask approach rather than the all/allbutself approach. The mips code ignored the "include_self" boolean but with this change it will now also dump a local backtrace if requested. Link: http://lkml.kernel.org/r/1472487169-14923-2-git-send-email-cmetcalf@mellanox.com Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm] Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07min/max: remove sparse warnings when they're nestedJohannes Berg1-20/+28
Currently, when min/max are nested within themselves, sparse will warn: warning: symbol '_min1' shadows an earlier one originally declared here warning: symbol '_min1' shadows an earlier one originally declared here warning: symbol '_min2' shadows an earlier one originally declared here This also immediately happens when min3() or max3() are used. Since sparse implements __COUNTER__, we can use __UNIQUE_ID() to generate unique variable names, avoiding this. Link: http://lkml.kernel.org/r/1471519773-29882-1-git-send-email-johannes@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not charJoe Perches1-2/+2
Allow some seq_puts removals by taking a string instead of a single char. [akpm@linux-foundation.org: update vmstat_show(), per Joe] Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Cc: Joe Perches <joe@perches.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07linux/mm.h: canonicalize macro PAGE_ALIGNED() definitionzijun_hu1-1/+1
The macro PAGE_ALIGNED() is prone to cause error because it doesn't follow convention to parenthesize parameter @addr within macro body, for example unsigned long *ptr = kmalloc(...); PAGE_ALIGNED(ptr + 16); for the left parameter of macro IS_ALIGNED(), (unsigned long)(ptr + 16) is desired but the actual one is (unsigned long)ptr + 16. It is fixed by simply canonicalizing macro PAGE_ALIGNED() definition. Link: http://lkml.kernel.org/r/57EA6AE7.7090807@zoho.com Signed-off-by: zijun_hu <zijun_hu@htc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm: remove unnecessary condition in remove_inode_hugepageszhong jiang1-1/+1
When the huge page is added to the page cahce (huge_add_to_page_cache), the page private flag will be cleared. since this code (remove_inode_hugepages) will only be called for pages in the page cahce, PagePrivate(page) will always be false. The patch remove the code without any functional change. Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm: consolidate warn_alloc_failed usersMichal Hocko1-3/+2
warn_alloc_failed is currently used from the page and vmalloc allocators. This is a good reuse of the code except that vmalloc would appreciate a slightly different warning message. This is already handled by the fmt parameter except that "%s: page allocation failure: order:%u, mode:%#x(%pGg)" is printed anyway. This might be quite misleading because it might be a vmalloc failure which leads to the warning while the page allocator is not the culprit here. Fix this by always using the fmt string and only print the context that makes sense for the particular context (e.g. order makes only very little sense for the vmalloc context). Rename the function to not miss any user and also because a later patch will reuse it also for !failure cases. Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walkAndrea Arcangeli1-2/+8
The rmap_walk can access vm_page_prot (and potentially vm_flags in the pte/pmd manipulations). So it's not safe to wait the caller to update the vm_page_prot/vm_flags after vma_merge returned potentially removing the "next" vma and extending the "current" vma over the next->vm_start,vm_end range, but still with the "current" vma vm_page_prot, after releasing the rmap locks. The vm_page_prot/vm_flags must be transferred from the "next" vma to the current vma while vma_merge still holds the rmap locks. The side effect of this race condition is pte corruption during migrate as remove_migration_ptes when run on a address of the "next" vma that got removed, used the vm_page_prot of the current vma. migrate mprotect ------------ ------------- migrating in "next" vma vma_merge() # removes "next" vma and # extends "current" vma # current vma is not with # vm_page_prot updated remove_migration_ptes read vm_page_prot of current "vma" establish pte with wrong permissions vm_set_page_prot(vma) # too late! change_protection in the old vma range only, next range is not updated This caused segmentation faults and potentially memory corruption in heavy mprotect loads with some light page migration caused by compaction in the background. Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge which confirms the case 8 is only buggy one where the race can trigger, in all other vma_merge cases the above cannot happen. This fix removes the oddness factor from case 8 and it converts it from: AAAA PPPPNNNNXXXX -> PPPPNNNNNNNN to: AAAA PPPPNNNNXXXX -> PPPPXXXXXXXX XXXX has the right vma properties for the whole merged vma returned by vma_adjust, so it solves the problem fully. It has the added benefits that the callers could stop updating vma properties when vma_merge succeeds however the callers are not updated by this patch (there are bits like VM_SOFTDIRTY that still need special care for the whole range, as the vma merging ignores them, but as long as they're not processed by rmap walks and instead they're accessed with the mmap_sem at least for reading, they are fine not to be updated within vma_adjust before releasing the rmap_locks). Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Aditya Mandaleeka <adityam@microsoft.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm: vm_page_prot: update with WRITE_ONCE/READ_ONCEAndrea Arcangeli1-1/+1
vma->vm_page_prot is read lockless from the rmap_walk, it may be updated concurrently and this prevents the risk of reading intermediate values. Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm/hugetlb: check for reserved hugepages during memory offlineGerald Schaefer1-3/+3
In dissolve_free_huge_pages(), free hugepages will be dissolved without making sure that there are enough of them left to satisfy hugepage reservations. Fix this by adding a return value to dissolve_free_huge_pages() and checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may lead to the situation where dissolve_free_huge_page() returns an error and all free hugepages that were dissolved before that error are lost, while the memory block still cannot be set offline. Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rui Teng <rui.teng@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm: memcontrol: consolidate cgroup socket trackingJohannes Weiner1-2/+4
The cgroup core and the memory controller need to track socket ownership for different purposes, but the tracking sites being entirely different is kind of ugly. Be a better citizen and rename the memory controller callbacks to match the cgroup core callbacks, then move them to the same place. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm, compaction: restrict full priority to non-costly ordersVlastimil Babka1-0/+1
The new ultimate compaction priority disables some heuristics, which may result in excessive cost. This is fine for non-costly orders where we want to try hard before resulting for OOM, but might be disruptive for costly orders which do not trigger OOM and should generally have some fallback. Thus, we disable the full priority for costly orders. Suggested-by: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20160906135258.18335-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm: remove page_file_indexHuang Ying2-13/+1
After using the offset of the swap entry as the key of the swap cache, the page_index() becomes exactly same as page_file_index(). So the page_file_index() is removed and the callers are changed to use page_index() instead. Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm, swap: use offset of swap entry as key of swap cacheHuang Ying1-4/+4
This patch is to improve the performance of swap cache operations when the type of the swap device is not 0. Originally, the whole swap entry value is used as the key of the swap cache, even though there is one radix tree for each swap device. If the type of the swap device is not 0, the height of the radix tree of the swap cache will be increased unnecessary, especially on 64bit architecture. For example, for a 1GB swap device on the x86_64 architecture, the height of the radix tree of the swap cache is 11. But if the offset of the swap entry is used as the key of the swap cache, the height of the radix tree of the swap cache is 4. The increased height causes unnecessary radix tree descending and increased cache footprint. This patch reduces the height of the radix tree of the swap cache via using the offset of the swap entry instead of the whole swap entry value as the key of the swap cache. In 32 processes sequential swap out test case on a Xeon E5 v3 system with RAM disk as swap, the lock contention for the spinlock of the swap cache is reduced from 20.15% to 12.19%, when the type of the swap device is 1. Use the whole swap entry as key, perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37, perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78, Use the swap offset as key, perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25, perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94, Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07thp: reduce usage of huge zero page's atomic counterAaron Lu2-4/+5
The global zero page is used to satisfy an anonymous read fault. If THP(Transparent HugePage) is enabled then the global huge zero page is used. The global huge zero page uses an atomic counter for reference counting and is allocated/freed dynamically according to its counter value. CPU time spent on that counter will greatly increase if there are a lot of processes doing anonymous read faults. This patch proposes a way to reduce the access to the global counter so that the CPU load can be reduced accordingly. To do this, a new flag of the mm_struct is introduced: MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only need to touch the global counter in two cases: 1 The first time it uses the global huge zero page; 2 The time when mm_user of its mm_struct reaches zero. Note that right now, the huge zero page is eligible to be freed as soon as its last use goes away. With this patch, the page will not be eligible to be freed until the exit of the last process from which it was ever used. And with the use of mm_user, the kthread is not eligible to use huge zero page either. Since no kthread is using huge zero page today, there is no difference after applying this patch. But if that is not desired, I can change it to when mm_count reaches zero. Case used for test on Haswell EP: usemem -n 72 --readonly -j 0x200000 100G Which spawns 72 processes and each will mmap 100G anonymous space and then do read only access to that space sequentially with a step of 2MB. CPU cycles from perf report for base commit: 54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page CPU cycles from perf report for this commit: 0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page Performance(throughput) of the workload for base commit: 1784430792 Performance(throughput) of the workload for this commit: 4726928591 164% increase. Runtime of the workload for base commit: 707592 us Runtime of the workload for this commit: 303970 us 50% drop. Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07thp, dax: add thp_get_unmapped_area for pmd mappingsToshi Kani1-0/+7
When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page size. This feature relies on both mmap virtual address and FS block (i.e. physical address) to be aligned by the pmd page size. Users can use mkfs options to specify FS to align block allocations. However, aligning mmap address requires code changes to existing applications for providing a pmd-aligned address to mmap(). For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1]. It calls mmap() with a NULL address, which needs to be changed to provide a pmd-aligned address for testing with DAX pmd mappings. Changing all applications that call mmap() with NULL is undesirable. Add thp_get_unmapped_area(), which can be called by filesystem's get_unmapped_area to align an mmap address by the pmd size for a DAX file. It calls the default handler, mm->get_unmapped_area(), to find a range and then aligns it for a DAX file. The patch is based on Matthew Wilcox's change that allows adding support of the pud page size easily. [1]: https://github.com/axboe/fio/blob/master/engines/mmap.c Link: http://lkml.kernel.org/r/1472497881-9323-2-git-send-email-toshi.kani@hpe.com Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm: don't use radix tree writeback tags for pages in swap cacheHuang Ying1-0/+12
File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK, etc.) to accelerate finding the pages with a specific tag in the radix tree during inode writeback. But for anonymous pages in the swap cache, there is no inode writeback. So there is no need to find the pages with some writeback tags in the radix tree. It is not necessary to touch radix tree writeback tags for pages in the swap cache. Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is introduced for address spaces which don't need to update the writeback tags. The flag is set for swap caches. It may be used for DAX file systems, etc. With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to ~1.48GBps) in the vm-scalability swap-w-seq test case with 8 processes. The test is done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. The improvement comes from the reduced contention on the swap cache radix tree lock. To test sequential swapping out, the test case uses 8 processes, which sequentially allocate and write to the anonymous pages until RAM and part of the swap device is used up. Details of comparison is as follow, base base+patch ---------------- -------------------------- %stddev %change %stddev \ | \ 2506952 ± 2% +28.1% 3212076 ± 7% vm-scalability.throughput 1207402 ± 7% +22.3% 1476578 ± 6% vmstat.swap.so 10.86 ± 12% -23.4% 8.31 ± 16% perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list 10.82 ± 13% -33.1% 7.24 ± 14% perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg 10.36 ± 11% -100.0% 0.00 ± -1% perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage 10.52 ± 12% -100.0% 0.00 ± -1% perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm/nobootmem.c: remove duplicate macro ARCH_LOW_ADDRESS_LIMIT statementszijun_hu1-4/+5
Fix the following bugs: - the same ARCH_LOW_ADDRESS_LIMIT statements are duplicated between header and relevant source - don't ensure ARCH_LOW_ADDRESS_LIMIT perhaps defined by ARCH in asm/processor.h is preferred over default in linux/bootmem.h completely since the former header isn't included by the latter Link: http://lkml.kernel.org/r/e046aeaa-e160-6d9e-dc1b-e084c2fd999f@zoho.com Signed-off-by: zijun_hu <zijun_hu@htc.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm/memblock.c: expose total reserved memorySrikar Dronamraju1-0/+1
The total reserved memory in a system is accounted but not available for use use outside mm/memblock.c. By exposing the total reserved memory, systems can better calculate the size of large hashes. Link: http://lkml.kernel.org/r/1472476010-4709-3-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm: introduce arch_reserved_kernel_pages()Srikar Dronamraju1-0/+3
Currently arch specific code can reserve memory blocks but alloc_large_system_hash() may not take it into consideration when sizing the hashes. This can lead to bigger hash than required and lead to no available memory for other purposes. This is specifically true for systems with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled. One approach to solve this problem would be to walk through the memblock regions and calculate the available memory and base the size of hash system on the available memory. The other approach would be to depend on the architecture to provide the number of pages that are reserved. This change provides hooks to allow the architecture to provide the required info. Link: http://lkml.kernel.org/r/1472476010-4709-2-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07mm: make sure that kthreads will not refault oom reaped memoryMichal Hocko1-0/+1
There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: "Michael S. Tsirkin" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>