2016-08-01 | net: ethernet: ax88796: avoid null pointer dereference | xypron.glpk@gmx.de | 1 file | -1/+2
If platform_get_resource fails, mem2 is null. Do not dereference null. Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-01 | net: caif: use correct format specifier | xypron.glpk@gmx.de | 1 file | -2/+2
%u is the wrong format specifier for int. size_t cannot be converted to int without possible loss of information. So leave the result as size_t and use %zu as format specifier. cf. Documentation/printk-formats.txt Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
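A minimal userspace sketch of the point made above, assuming a hypothetical helper `format_len()` (it is not part of the caif code): the length stays a `size_t` and is printed with `%zu` instead of being narrowed to `int`.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only. */
int format_len(char *buf, size_t bufsz, const char *payload)
{
    size_t len = strlen(payload);   /* keep the result as size_t, no cast to int */

    /* %zu is the conversion specifier that matches size_t */
    return snprintf(buf, bufsz, "payload length: %zu", len);
}
```

Compilers that check format strings (for example with `-Wformat`) generally warn when the specifier and the argument type disagree, which is how this class of bug is usually found.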
2016-07-31 | phy/micrel: Change phy_id_mask for KSZ8721 | Alexander Stein | 1 file | -2/+2
There are KSZ8721 PHYs with phy_id 0x00221619. To detect them as PHY_ID_KSZ8001 compatible while keeping them distinct from PHY_ID_KSZ9021, ignore the last two bits when matching the PHY ID. Signed-off-by: Alexander Stein <alexanders83@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>
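The masked comparison can be sketched as follows. Only the ID 0x00221619 comes from the commit message; the table ID and both mask values below are assumptions chosen for illustration, and `phy_id_matches()` is a hypothetical helper rather than the phylib function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values: only 0x00221619 appears in the commit message. */
#define EXAMPLE_PHY_ID_KSZ8001  0x0022161Au  /* assumed driver table ID */
#define EXAMPLE_MASK_EXACT      0x00ffffffu  /* assumed mask: compare every low bit */
#define EXAMPLE_MASK_RELAXED    0x00fffffcu  /* assumed mask: ignore the last two bits */

/* A device matches a driver entry when the masked IDs are equal. */
bool phy_id_matches(uint32_t hw_id, uint32_t table_id, uint32_t mask)
{
    return (hw_id & mask) == (table_id & mask);
}
```

Under these assumed values, 0x00221619 does not match the table entry with the exact mask but does match once the two low bits are masked out.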
2016-07-31 | r8169: fix nic may not work after changing mac address. | Chun-Hao Lin | 1 file | -1/+8
When there is no AC power, the NIC may not work after its MAC address is changed. See http://www.spinics.net/lists/netdev/msg356572.html for the report. The issue is caused by runtime power management. When there is no AC power and the NIC is brought down (ifconfig down), the driver enters the runtime suspend state and the hardware is put into the D3 state. During this time the driver cannot access hardware registers, so a new MAC address set during this time is not written to the hardware. After resume, the NIC keeps using the old MAC address and the network does not work normally. This patch checks the runtime PM status when setting the MAC address. If the driver is in the runtime suspend state, it skips programming the hardware, keeps the new MAC address, and programs it during runtime resume. Signed-off-by: Chunhao Lin <hau@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
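The deferral described above can be modelled in plain C. This is a toy sketch, not the r8169 code: `struct toy_nic`, `toy_set_mac()` and `toy_runtime_resume()` are hypothetical names, and the "hardware" is just a byte array.

```c
#include <stdbool.h>
#include <string.h>

#define MAC_LEN 6

/* Toy model of a NIC; none of these names come from the r8169 driver. */
struct toy_nic {
    bool suspended;                   /* true while runtime-suspended (D3) */
    unsigned char dev_addr[MAC_LEN];  /* address the stack believes is set */
    unsigned char hw_addr[MAC_LEN];   /* address programmed into the "hardware" */
};

void toy_set_mac(struct toy_nic *nic, const unsigned char *addr)
{
    memcpy(nic->dev_addr, addr, MAC_LEN);     /* always remember the new address */
    if (!nic->suspended)
        memcpy(nic->hw_addr, addr, MAC_LEN);  /* registers are reachable right now */
}

void toy_runtime_resume(struct toy_nic *nic)
{
    nic->suspended = false;
    memcpy(nic->hw_addr, nic->dev_addr, MAC_LEN);  /* replay the saved address */
}
```

The design choice is to treat the software copy as the source of truth and to make resume responsible for bringing the hardware back in sync.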
2016-07-31 | r8169: add checking driver's runtime pm status in rtl8169_get_ethtool_stats() | Chun-Hao Lin | 1 file | -1/+7
Do not call rtl8169_update_counters() to dump the tally counter when the driver is in the runtime suspend state. Calling rtl8169_update_counters() in the runtime suspend state produces the warning message "rtl_counters_cond == 1 (loop: 1000, delay: 10)". Signed-off-by: Chunhao Lin <hau@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-31 | r8169: fix kernel log spam when set or get hardware wol setting. | Chun-Hao Lin | 1 file | -2/+18
The NIC is put into the D3 state during runtime suspend. Setting or getting the hardware WoL setting makes the driver write or read hardware registers. If this happens in the runtime suspend state, the NIC is in D3 and the registers read by the driver return all 0xff. The driver then concludes that the register flag has not toggled and prints the warning message "rtl_counters_cond == 1 (loop: 1000, delay: 10)" to the kernel log. To fix this, check the driver's runtime PM status in rtl8169_get_wol() and rtl8169_set_wol(). Signed-off-by: Chunhao Lin <hau@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | net: dsa: bcm_sf2: Unwind errors in correct order | Florian Fainelli | 1 file | -2/+3
If bcm_sf2_sw_setup() cannot complete for any reason and we go to the out_unmap label before the MDIO bus has been registered, we hit the BUG condition in drivers/net/phy/mdio_bus.c about the bus not being registered. Fix this by dedicating a specific label for failures that occur after the MDIO bus has been successfully registered. Fixes: 461cd1b03e32 ("net: dsa: bcm_sf2: Register our slave MDIO bus") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
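The unwind pattern can be sketched with a toy setup function. Everything here is hypothetical (`struct toy_switch`, `toy_setup()`, the `out_bus` label name); only the idea of a dedicated label for failures after bus registration comes from the commit message.

```c
#include <stdbool.h>

/* Toy resources standing in for "registers mapped" and "MDIO bus registered". */
struct toy_switch {
    bool mapped;
    bool bus_registered;
    int  unregister_calls;  /* counts how often the bus teardown ran */
};

/* fail_at: 0 = no failure, 1 = fail before bus registration, 2 = fail after. */
int toy_setup(struct toy_switch *sw, int fail_at)
{
    sw->mapped = true;
    if (fail_at == 1)
        goto out_unmap;      /* bus not registered yet: skip its teardown */

    sw->bus_registered = true;
    if (fail_at == 2)
        goto out_bus;        /* bus registered: unregister it before unmapping */

    return 0;

out_bus:
    sw->bus_registered = false;
    sw->unregister_calls++;
out_unmap:
    sw->mapped = false;
    return -1;
}
```

Labels are ordered so that each one falls through to the teardown of everything acquired before it, which keeps the release order the reverse of the acquisition order.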
2016-07-30 | net: tulip: fix spelling mistake: "attemping" -> "attempting" | Colin Ian King | 1 file | -1/+1
trivial fix to spelling mistake in printk message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | macsec: fix negative refcnt on parent link | Sabrina Dubroca | 1 file | -2/+2
When creation of a macsec device fails because an identical device already exists on this link, the current code decrements the refcnt on the parent link (in ->destructor for the macsec device), but it had not been incremented yet. Move the dev_hold(parent_link) call earlier during macsec device creation. Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
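A toy model of the ordering problem, under the assumption that the error path always runs a destructor that drops the parent reference. The names (`toy_dev`, `toy_create()`, the `hold_early` switch) are hypothetical and exist only to contrast the two orderings.

```c
/* Toy refcounted parent; names are illustrative, not the macsec code. */
struct toy_dev { int refcnt; };

void toy_hold(struct toy_dev *d) { d->refcnt++; }
void toy_put(struct toy_dev *d)  { d->refcnt--; }

/* The destructor always drops the reference on the parent. */
void toy_destructor(struct toy_dev *parent) { toy_put(parent); }

/* Returns 0 on success, -1 if creation fails (e.g. a duplicate device). */
int toy_create(struct toy_dev *parent, int duplicate, int hold_early)
{
    if (hold_early)
        toy_hold(parent);        /* fixed ordering: hold before anything can fail */

    if (duplicate) {
        toy_destructor(parent);  /* the error path runs the destructor */
        return -1;
    }

    if (!hold_early)
        toy_hold(parent);        /* old ordering: hold only after the checks pass */
    return 0;
}
```

With the late hold, a failed creation leaves the parent's count one lower than before the call; with the early hold, the put in the destructor is balanced.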
2016-07-30 | macsec: RXSAs don't need to hold a reference on RXSCs | Sabrina Dubroca | 1 file | -2/+1
Following the previous patch, RXSCs are held and properly refcounted in the RX path (instead of being implicitly held by their SA), so the SA doesn't need to hold a reference on its parent RXSC. This also avoids panics on module unload caused by the double layer of RCU callbacks (call_rcu frees the RXSA, which puts the final reference on the RXSC and allows to free it in its own call_rcu) that commit b196c22af5c3 ("macsec: add rcu_barrier() on module exit") didn't protect against. There were also some refcounting bugs in macsec_add_rxsa where I didn't put the reference on the RXSC on the error paths, which would lead to memory leaks. Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | macsec: fix reference counting on RXSC in macsec_handle_frame | Sabrina Dubroca | 1 file | -1/+8
Currently, we look up the RXSC without taking a reference on it. The RXSA holds a reference on the RXSC, but the SA and SC could still both disappear before we take a reference on the SA. Take a reference on the RXSC in macsec_handle_frame. Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | drivers: net: cpsw: use of_platform_depopulate() | Grygorii Strashko | 1 file | -10/+1
Use of_platform_depopulate() in cpsw_remove() instead of of_device_unregister(), because otherwise the CPSW child devices will not be recreated on the next insmod. of_platform_depopulate() is now the correct way, as it ensures that all steps done in of_platform_populate() are reverted, including clearing the OF_POPULATED flag. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | drivers: net: cpsw: fix wrong regs access in cpsw_remove | Grygorii Strashko | 1 file | -1/+9
An L3 error is generated and the system crashes while unloading the CPSW driver if CPSW is used as a module and the ethX devices are down. This happens because CPSW can now be powered off by PM runtime when the ethX devices are down. Hence, ensure that CPSW is powered up by PM runtime before performing any deinitialization actions that require CPSW register access. In case of a PM runtime error, just leave cpsw_remove(), as we can't do anything more. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | net: ethernet: ti: cpdma: fix lockup in cpdma_ctlr_destroy() | Grygorii Strashko | 1 file | -3/+0
Fix deadlock in cpdma_ctlr_destroy() which is triggered now on cpsw module removal:
cpsw_remove()
 - cpdma_ctlr_destroy()
   - spin_lock_irqsave(&ctlr->lock, flags)
   - cpdma_ctlr_stop()
     - spin_lock_irqsave(&ctlr->lock, flags);
   - cpdma_chan_destroy()
     - spin_lock_irqsave(&ctlr->lock, flags);
The issue has not been observed before because CPDMA channels have been destroyed manually by CPSW until commit d941ebe88a41 ("net: ethernet: ti: cpsw: use destroy ctlr to destroy channels") was merged. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Mugunthan V N <mugunthanvnm@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | cxgb4/cxgb4vf: Fixes regression in perf when tx vlan offload is disabled | Hariprasad Shenai | 2 files | -2/+2
Commit 637d3e997351 ("cxgb4: Discard the packet if the length is greater than mtu") introduced a regression in VLAN interface performance when Tx VLAN offload is disabled. Check whether the skb is tagged, regardless of whether it is hardware accelerated or not. Previously we checked only for the hardware accelerated case, which caused performance to drop to ~0.17Mbps on a 10GbE adapter for a VLAN interface when Tx VLAN offload is turned off using ethtool. The Ethernet header length calculation went wrong in this case, and the driver ended up dropping packets. Fixes: 637d3e997351 ("cxgb4: Discard the packet if the length is greater than mtu") Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | drivers: net: phy: xgene: Remove redundant dev_err call in xgene_mdio_probe() | Wei Yongjun | 1 file | -3/+1
There is already an error message within devm_ioremap_resource, so remove the dev_err call to avoid a redundant error message. Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com> Acked-By: Iyappan Subramanian <isubramanian@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | qed: Prevent over-usage of vlan credits by PF | Yuval Mintz | 1 file | -1/+8
Each PF/VF has a limited number of vlan filters for configuration purposes; This information is passed to qede and is used to prevent over-usage - once a vlan is to be configured and no filter credit is available, the driver would switch into working in vlan-promisc mode. Problem is the credit pool is shared by both PFs and VFs, and currently PFs aren't deducting the filters that are reserved for their VFs from their quota, which may lead to some vlan filters failing unknowingly due to lack of credit. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | qed: Correct min bandwidth for 100g | Yuval Mintz | 1 file | -1/+1
Driver uses reverse logic when checking if minimum bandwidth configuration applied, causing it to configure the guarantee only on the first hw-function. Fixes: a0d26d5a4fc8 ("qed*: Don't reset statistics on inner reload") Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | qede: Reset statistics on explicit down | Yuval Mintz | 1 file | -0/+1
Adding the necessary logic to prevent statistics reset on inner-reload introduced a bug, and now statistics are reset only when re-probing the driver. Fixes: a0d26d5a4fc8e ("qed*: Don't reset statistics on inner reload") Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | qed: Don't over-do producer cleanup for Rx | Yuval Mintz | 2 files | -4/+4
Before requesting the firmware to start Rx queues, driver goes and sets the queue producer in the device to 0. But while the producer is 32-bit, the driver currently clears 64 bits, effectively zeroing an additional CID's producer as well. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
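The width mismatch can be illustrated with a toy layout in which 32-bit producers sit back to back in an array. This is an assumption made for the sketch; the real device memory layout and the qed helpers are not shown, and both function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Toy layout: one 32-bit producer per queue, stored back to back. */

/* Clears 8 bytes starting at prods[idx], so prods[idx + 1] is zeroed too.
 * The caller must guarantee that idx + 1 is a valid element. */
void clear_producer_wide(uint32_t *prods, unsigned int idx)
{
    uint64_t zero = 0;
    memcpy(&prods[idx], &zero, sizeof(zero));
}

/* Clears exactly the 4 bytes that belong to prods[idx]. */
void clear_producer_exact(uint32_t *prods, unsigned int idx)
{
    uint32_t zero = 0;
    memcpy(&prods[idx], &zero, sizeof(zero));
}
```

Sizing the write from the type of the field being cleared, rather than from a wider scratch variable, avoids touching the neighbouring entry.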
2016-07-30 | qed: Fix removal of spoof checking for VFs | Yuval Mintz | 1 file | -1/+1
Driver has reverse logic for checking the result of the spoof-checking configuration. As a result, it would log that the configuration failed [even though it succeeded], and will no longer do anything when requested to remove the configuration, as its accounting of the feature will be incorrect. Fixes: 6ddc7608258d5 ("qed*: IOV support spoof-checking") Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 | qede: Don't try removing unconfigured vlans | Yuval Mintz | 1 file | -4/+7
As part of ndo_vlan_rx_kill_vid() implementation, qede is requesting firmware to remove the vlan filter. This currently happens even if the vlan wasn't previously added [In case device ran out of vlan credits]. For PFs this doesn't cause any issues as the firmware would simply ignore the removal request. But for VFs their parent PF is holding an accounting of the configured vlans, and such a request would cause the PF to fail the VF's removal request. Simply fix this for both PFs & VFs and don't remove filters that were not previously added. Fixes: 7c1bfcad9f3c8 ("qede: Add vlan filtering offload support") Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
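A small sketch of the guard, assuming the driver keeps a per-VLAN flag that records whether the filter was actually configured. `struct toy_vlan`, `toy_kill_vid()` and the request counter are hypothetical stand-ins for the driver's bookkeeping and the firmware call.

```c
#include <stdbool.h>

/* Toy VLAN entry; "configured" records whether a filter was really added. */
struct toy_vlan {
    unsigned short vid;
    bool configured;
};

/* Sends a removal request (modelled as a counter) only for configured filters. */
void toy_kill_vid(struct toy_vlan *vlan, int *fw_requests)
{
    if (!vlan->configured)
        return;              /* never added, e.g. out of credits: nothing to remove */

    (*fw_requests)++;        /* stands in for the firmware removal request */
    vlan->configured = false;
}
```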
2016-07-29 | Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security | Linus Torvalds | 30 files | -1452/+2968
Pull security subsystem updates from James Morris: "Highlights: - TPM core and driver updates/fixes - IPv6 security labeling (CALIPSO) - Lots of Apparmor fixes - Seccomp: remove 2-phase API, close hole where ptrace can change syscall #" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (156 commits) apparmor: fix SECURITY_APPARMOR_HASH_DEFAULT parameter handling tpm: Add TPM 2.0 support to the Nuvoton i2c driver (NPCT6xx family) tpm: Factor out common startup code tpm: use devm_add_action_or_reset tpm2_i2c_nuvoton: add irq validity check tpm: read burstcount from TPM_STS in one 32-bit transaction tpm: fix byte-order for the value read by tpm2_get_tpm_pt tpm_tis_core: convert max timeouts from msec to jiffies apparmor: fix arg_size computation for when setprocattr is null terminated apparmor: fix oops, validate buffer size in apparmor_setprocattr() apparmor: do not expose kernel stack apparmor: fix module parameters can be changed after policy is locked apparmor: fix oops in profile_unpack() when policy_db is not present apparmor: don't check for vmalloc_addr if kvzalloc() failed apparmor: add missing id bounds check on dfa verification apparmor: allow SYS_CAP_RESOURCE to be sufficient to prlimit another task apparmor: use list_next_entry instead of list_entry_next apparmor: fix refcount race when finding a child profile apparmor: fix ref count leak when profile sha1 hash is read apparmor: check that xindex is in trans_table bounds ...
2016-07-29 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace | Linus Torvalds | 1 file | -1/+1
Pull userns vfs updates from Eric Biederman:
"This tree contains some very long awaited work on generalizing the user namespace support for mounting filesystems to include filesystems with a backing store. The real world target is fuse but the goal is to update the vfs to allow any filesystem to be supported. This patchset is based on a lot of code review and testing to approach that goal.
While looking at what is needed to support the fuse filesystem it became clear that there were things like xattrs for security modules that needed special treatment, that the resolution of those concerns would not be fuse specific, and that sorting out these general issues made most sense at the generic level, where the right people could be drawn into the conversation, and the issues could be solved for everyone.
At a high level this patchset does a couple of simple things:
- Add a user namespace owner (s_user_ns) to struct super_block.
- Teach the vfs to handle filesystem uids and gids not mapping into kuids and kgids and being reported as INVALID_UID and INVALID_GID in vfs data structures.
By assigning a user namespace owner, filesystems that are mounted with only user namespace privilege can be detected. This allows security modules and the like to know which mounts may not be trusted. This also allows the set of uids and gids that are communicated to the filesystem to be capped at the set of kuids and kgids that are in the owning user namespace of the filesystem.
One of the crazier corner cases this handles is the case of inodes whose i_uid or i_gid are not mapped into the vfs. Most of the code simply doesn't care but it is easy to confuse the inode writeback path so no operation that could cause an inode write-back is permitted for such inodes (aka only reads are allowed).
This set of changes starts out by cleaning up the code paths involved in user namespace permitted mounts. Then when things are clean enough adds code that cleanly sets s_user_ns. Then additional restrictions are added that are possible now that the filesystem superblock contains owner information.
These changes should not affect anyone in practice, but there are some parts of these restrictions that are changes in behavior.
- Andy's restriction on suid executables that does not honor the suid bit when the path is from another mount namespace (think /proc/[pid]/fd/) or when the filesystem was mounted by a less privileged user.
- The replacement of the user namespace implicit setting of MNT_NODEV with implicitly setting SB_I_NODEV on the filesystem superblock instead. Using SB_I_NODEV is a stronger form that happens to make this state user invisible. The user visibility can be managed but it caused problems when it was introduced from applications reasonably expecting mount flags to be what they were set to.
There is a little bit of work remaining before it is safe to support mounting filesystems with backing store in user namespaces, beyond what is in this set of changes.
- Verifying the mounter has permission to read/write the block device during mount.
- Teaching the integrity modules IMA and EVM to handle filesystems mounted with only user namespace root and to reduce trust in their security xattrs accordingly.
- Capturing the mounter's credentials and using that for permission checks in d_automount and the like. (Given that overlayfs already does this, and we need the work in d_automount it makes sense to generalize this case).
Furthermore there are a few changes that are on the wishlist:
- Get all filesystems supporting posix acls using the generic posix acls so that posix_acl_fix_xattr_from_user and posix_acl_fix_xattr_to_user may be removed. [Maintainability]
- Reducing the permission checks in places such as remount to allow the superblock owner to perform them.
- Allowing the superblock owner to chown files with unmapped uids and gids to something that is mapped so the files may be treated normally.
I am not considering even obvious relaxations of permission checks until it is clear there are no more corner cases that need to be locked down and handled generically.
Many thanks to Seth Forshee who kept this code alive, for putting up with me rewriting substantial portions of what he did to handle more corner cases, and for his diligent testing and reviewing of my changes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
  fs: Call d_automount with the filesystems creds
  fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
  evm: Translate user/group ids relative to s_user_ns when computing HMAC
  dquot: For now explicitly don't support filesystems outside of init_user_ns
  quota: Handle quota data stored in s_user_ns in quota_setxquota
  quota: Ensure qids map to the filesystem
  vfs: Don't create inodes with a uid or gid unknown to the vfs
  vfs: Don't modify inodes with a uid or gid unknown to the vfs
  cred: Reject inodes with invalid ids in set_create_file_as()
  fs: Check for invalid i_uid in may_follow_link()
  vfs: Verify acls are valid within superblock's s_user_ns.
  userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
  fs: Refuse uid/gid changes which don't map into s_user_ns
  selinux: Add support for unprivileged mounts from user namespaces
  Smack: Handle labels consistently in untrusted mounts
  Smack: Add support for unprivileged mounts from user namespaces
  fs: Treat foreign mounts as nosuid
  fs: Limit file caps to the user namespace of the super block
  userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
  userns: Remove implicit MNT_NODEV fragility.
  ...
2016-07-29 | Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 21 files | -549/+357
Pull smp hotplug updates from Thomas Gleixner:
"This is the next part of the hotplug rework.
- Convert all notifiers with a priority assigned
- Convert all CPU_STARTING/DYING notifiers
The final removal of the STARTING/DYING infrastructure will happen when the merge window closes.
Another 700 lines of impenetrable maze gone :)"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  timers/core: Correct callback order during CPU hot plug
  leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
  powerpc/numa: Convert to hotplug state machine
  arm/perf: Fix hotplug state machine conversion
  irqchip/armada: Avoid unused function warnings
  ARC/time: Convert to hotplug state machine
  clocksource/atlas7: Convert to hotplug state machine
  clocksource/armada-370-xp: Convert to hotplug state machine
  clocksource/exynos_mct: Convert to hotplug state machine
  clocksource/arm_global_timer: Convert to hotplug state machine
  rcu: Convert rcutree to hotplug state machine
  KVM/arm/arm64/vgic-new: Convert to hotplug state machine
  smp/cfd: Convert core to hotplug state machine
  x86/x2apic: Convert to CPU hotplug state machine
  profile: Convert to hotplug state machine
  timers/core: Convert to hotplug state machine
  hrtimer: Convert to hotplug state machine
  x86/tboot: Convert to hotplug state machine
  arm64/armv8 deprecated: Convert to hotplug state machine
  hwtracing/coresight-etm4x: Convert to hotplug state machine
  ...
2016-07-29 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide | Linus Torvalds | 4 files | -5/+6
Pull IDE updates from David Miller:
"Just a couple small bug fixes, nothing overly exciting in here"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide:
  ide: missing break statement in set_timings_mdma()
  ide: hpt366: fix incorrect mask when checking at cmd_high_time
  ide-tape: fix misprint in failure handling in idetape_init()
  cmd640: add __init attribute
2016-07-29 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc | Linus Torvalds | 1 file | -6/+0
Pull sparc updates from David Miller:
1) Double spin lock bug in sunhv serial driver, from Dan Carpenter.
2) Use correct RSS estimate when determining whether to grow the huge TSB or not, from Mike Kravetz.
3) Don't use full three level page tables for hugepages, PMD level is sufficient. From Nitin Gupta.
4) Mask out extraneous bits from TSB_TAG_ACCESS register, we only want the address bits.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Trim page tables for 8M hugepages
  sparc64 mm: Fix base TSB sizing when hugetlb pages are used
  sparc: serial: sunhv: fix a double lock bug
  sparc32: off by ones in BUG_ON()
  sparc: Don't leak context bits into thread->fault_address
2016-07-28 | Merge tag 'vfio-v4.8-rc1' of git://github.com/awilliam/linux-vfio | Linus Torvalds | 7 files | -44/+267
Pull VFIO updates from Alex Williamson:
- Enable no-iommu mode for platform devices (Peng Fan)
- Sub-page mmap for exclusive pages (Yongji Xie)
- Use-after-free fix (Ilya Lesokhin)
- Support for ACPI-based platform devices (Sinan Kaya)
* tag 'vfio-v4.8-rc1' of git://github.com/awilliam/linux-vfio:
  vfio: platform: check reset call return code during release
  vfio: platform: check reset call return code during open
  vfio, platform: make reset driver a requirement by default
  vfio: platform: call _RST method when using ACPI
  vfio: platform: add extra debug info argument to call reset
  vfio: platform: add support for ACPI probe
  vfio: platform: determine reset capability
  vfio: platform: move reset call to a common function
  vfio: platform: rename reset function
  vfio: fix possible use after free of vfio group
  vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
  vfio: platform: support No-IOMMU mode
2016-07-28 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md | Linus Torvalds | 7 files | -213/+328
Pull MD updates from Shaohua Li:
- A bunch of patches from Neil Brown to fix RCU usage
- Two performance improvement patches from Tomasz Majchrzak
- Alexey Obitotskiy fixes module refcount issue
- Arnd Bergmann fixes time granularity
- Cong Wang fixes a list corruption issue
- Guoqing Jiang fixes a deadlock in md-cluster
- A null pointer dereference fix from me
- Song Liu fixes misuse of raid6 rmw
- Other trivial/cleanup fixes from Guoqing Jiang and Xiao Ni
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (28 commits)
  MD: fix null pointer deference
  raid10: improve random reads performance
  md: add missing sysfs_notify on array_state update
  Fix kernel module refcount handling
  md: use seconds granularity for error logging
  md: reduce the number of synchronize_rcu() calls when multiple devices fail.
  md: be extra careful not to take a reference to a Faulty device.
  md/multipath: add rcu protection to rdev access in multipath_status.
  md/raid5: add rcu protection to rdev accesses in raid5_status.
  md/raid5: add rcu protection to rdev accesses in want_replace
  md/raid5: add rcu protection to rdev accesses in handle_failed_sync.
  md/raid1: add rcu protection to rdev in fix_read_error
  md/raid1: small code cleanup in end_sync_write
  md/raid1: small cleanup in raid1_end_read/write_request
  md/raid10: simplify print_conf a little.
  md/raid10: minor code improvement in fix_read_error()
  md/raid10: add rcu protection to rdev access during reshape.
  md/raid10: add rcu protection to rdev access in raid10_sync_request.
  md/raid10: add rcu protection in raid10_status.
  md/raid10: fix refounct imbalance when resyncing an array with a replacement device.
  ...
2016-07-28 | Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm | Linus Torvalds | 30 files | -572/+1113
Pull libnvdimm updates from Dan Williams: - Replace pcommit with ADR / directed-flushing. The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. - On-demand ARS (address range scrub). Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. - Support for the Microsoft DSM (device specific method) command format. - Support for EDK2/OVMF virtual disk device memory ranges. - Various fixes and cleanups across the subsystem. 
* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits) libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register" nfit: do an ARS scrub on hitting a latent media error nfit: move to nfit/ sub-directory nfit, libnvdimm: allow an ARS scrub to be triggered on demand libnvdimm: register nvdimm_bus devices with an nd_bus driver pmem: clarify a debug print in pmem_clear_poison x86/insn: remove pcommit Revert "KVM: x86: add pcommit support" nfit, tools/testing/nvdimm/: unify shutdown paths libnvdimm: move ->module to struct nvdimm_bus_descriptor nfit: cleanup acpi_nfit_init calling convention nfit: fix _FIT evaluation memory leak + use after free tools/testing/nvdimm: add manufacturing_{date|location} dimm properties tools/testing/nvdimm: add virtual ramdisk range acpi, nfit: treat virtual ramdisk SPA as pmem region pmem: kill __pmem address space pmem: kill wmb_pmem() libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes fs/dax: remove wmb_pmem() libnvdimm, pmem: flush posted-write queues on shutdown ...
2016-07-28 | Merge tag 'pinctrl-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl | Linus Torvalds | 105 files | -2681/+10726
Pull pin control updates from Linus Walleij:
"This is the bulk of pin control changes for the v4.8 kernel cycle. Nothing stands out as especially exciting: new drivers, new subdrivers, lots of cleanups and incremental features. Business as usual.
New drivers:
- New driver for Oxnas pin control and GPIO. This ARM-based chipset is used in a few storage (NAS) type devices.
- New driver for the MAX77620/MAX20024 pin controller portions.
- New driver for the Intel Merrifield pin controller.
New subdrivers:
- New subdriver for the Qualcomm MDM9615
- New subdriver for the STM32F746 MCU
- New subdriver for the Broadcom NSP SoC.
Cleanups:
- Demodularization of bool compiled-in drivers.
Apart from this there are just regular incremental improvements to a lot of drivers, especially Uniphier and PFC"
* tag 'pinctrl-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (131 commits)
  pinctrl: fix pincontrol definition for marvell
  pinctrl: xway: fix typo
  Revert "pinctrl: amd: make it explicitly non-modular"
  pinctrl: iproc: Add NSP and Stingray GPIO support
  pinctrl: Update iProc GPIO DT bindings
  pinctrl: bcm: add OF dependencies
  pinctrl: ns2: remove redundant dev_err call in ns2_pinmux_probe()
  pinctrl: Add STM32F746 MCU support
  pinctrl: intel: Protect set wake flow by spin lock
  pinctrl: nsp: remove redundant dev_err call in nsp_pinmux_probe()
  pinctrl: uniphier: add Ethernet pin-mux settings
  sh-pfc: Use PTR_ERR_OR_ZERO() to simplify the code
  pinctrl: ns2: fix return value check in ns2_pinmux_probe()
  pinctrl: qcom: update DT bindings with ebi2 groups
  pinctrl: qcom: establish proper EBI2 pin groups
  pinctrl: imx21: Remove the MODULE_DEVICE_TABLE() macro
  Documentation: dt: Add new compatible to STM32 pinctrl driver bindings
  includes: dt-bindings: Add STM32F746 pinctrl DT bindings
  pinctrl: sunxi: fix nand0 function name for sun8i
  pinctrl: uniphier: remove pointless pin-mux settings for PH1-LD11
  ...
2016-07-28 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds | 3 files | -44/+52
Merge more updates from Andrew Morton: "The rest of MM" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (101 commits) mm, compaction: simplify contended compaction handling mm, compaction: introduce direct compaction priority mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations mm, page_alloc: make THP-specific decisions more generic mm, page_alloc: restructure direct compaction handling in slowpath mm, page_alloc: don't retry initial attempt in slowpath mm, page_alloc: set alloc_flags only once in slowpath lib/stackdepot.c: use __GFP_NOWARN for stack allocations mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB mm, kasan: account for object redzone in SLUB's nearest_obj() mm: fix use-after-free if memory allocation failed in vma_adjust() zsmalloc: Delete an unnecessary check before the function call "iput" mm/memblock.c: fix index adjustment error in __next_mem_range_rev() mem-hotplug: alloc new page from a nearest neighbor node when mem-offline mm: optimize copy_page_to/from_iter_iovec mm: add cond_resched() to generic_swapfile_activate() Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements" mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode mm: hwpoison: remove incorrect comments make __section_nr() more efficient ...
2016-07-28 | mm: track NR_KERNEL_STACK in KiB instead of number of stacks | Andy Lutomirski | 1 file | -2/+1
Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone. This only makes sense if each kernel stack exists entirely in one zone, and allowing vmapped stacks could break this assumption. Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all architectures. Keep it simple and use KiB. Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
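The choice of unit can be checked with simple arithmetic. The sizes used in the usage note are made-up examples, not values taken from any particular architecture, and both helpers are hypothetical.

```c
/* Accounts one kernel stack in KiB. */
unsigned long stack_kib(unsigned long thread_size)
{
    return thread_size / 1024;
}

/* Accounts one kernel stack in whole pages; integer division truncates,
 * so the result is 0 whenever the stack is smaller than a page. */
unsigned long stack_pages(unsigned long thread_size, unsigned long page_size)
{
    return thread_size / page_size;
}
```

With an assumed 8192-byte stack and a 16384-byte page, counting in pages yields 0 while counting in KiB yields 8, so KiB remains exact both when the stack is smaller than a page and when it spans several pages.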
2016-07-28mm: move most file-based accounting to the nodeMel Gorman3-12/+14
There are now a number of accounting oddities such as mapped file pages being accounted for on the node while the total number of file pages is accounted on the zone. This can be coped with to some extent but it's confusing so this patch moves the relevant file-based accounting to the node. Due to throttling logic in the page allocator for reliable OOM detection, it is still necessary to track dirty and writeback pages on a per-zone basis. [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting] Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: rename NR_ANON_PAGES to NR_ANON_MAPPEDMel Gorman1-1/+1
NR_FILE_PAGES is the number of file pages. NR_FILE_MAPPED is the number of mapped file pages. NR_ANON_PAGES is the number of mapped anon pages. This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and NR_ANON_PAGES for mapped pages. This patch renames NR_ANON_PAGES so we have NR_FILE_PAGES is the number of file pages. NR_FILE_MAPPED is the number of mapped file pages. NR_ANON_MAPPED is the number of mapped anon pages. Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: move page mapped accounting to the nodeMel Gorman1-2/+2
Reclaim makes decisions based on the number of pages that are mapped but it's mixing node and zone information. Account NR_FILE_MAPPED and NR_ANON_PAGES pages on the node. Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, vmscan: move LRU lists to nodeMel Gorman2-13/+14
This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions:

1. Long-term isolation of highmem pages when reclaim is lowmem

   When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate, with the potential corner case that highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

   This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages.
Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, vmstat: add infrastructure for per-node vmstatsMel Gorman1-35/+41
Patchset: "Move LRU page reclaim from zones to nodes v9"

This series moves LRUs from the zones to the node. While this is a current rebase, the test results were based on mmotm as of June 23rd. Conceptually, this series is simple but there are a lot of details. Some of the broad motivations for this are:

1. The residency of a page partially depends on what zone the page was allocated from. This is partially combatted by the fair zone allocation policy but that is a partial solution that introduces overhead in the page allocator paths.

2. Currently, reclaim on node 0 behaves slightly differently from node 1. For example, direct reclaim scans in zonelist order and reclaims even if the zone is over the high watermark regardless of the age of pages in that LRU. Kswapd on the other hand starts reclaim on the highest unbalanced zone. A difference in distribution of file/anon pages due to when they were allocated can result in a difference in aging. While the fair zone allocation policy mitigates some of the problems here, the page reclaim results on a multi-zone node will always be different to a single-zone node. it was scheduled on as a result.

3. kswapd and the page allocator scan zones in the opposite order to avoid interfering with each other. In the ideal case this mitigates the page allocator using pages that were allocated very recently, but it's sensitive to timing. When kswapd is allocating from lower zones then it's great but during the rebalancing of the highest zone, the page allocator and kswapd interfere with each other. It's worse if the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have large highmem zones in common configurations and it was necessary to quickly find ZONE_NORMAL pages for reclaim.
Today, this is much less of a concern as machines with lots of memory will (or should) use 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are rare. Machines that do use highmem should have relatively lower highmem:lowmem ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The page allocator plays fewer tricks to game reclaim and reclaim behaves similarly on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48 core NUMA machine. The UMA results are presented in most cases as the NUMA machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone allocation policy. It was tested up to order-4 but only orders 0 and 1 are shown as the other orders were comparable.

                               4.7.0-rc4             4.7.0-rc4
                          mmotm-20160623            nodelru-v9
Min  total-odr0-1        490.00 (  0.00%)    457.00 (  6.73%)
Min  total-odr0-2        347.00 (  0.00%)    329.00 (  5.19%)
Min  total-odr0-4        288.00 (  0.00%)    273.00 (  5.21%)
Min  total-odr0-8        251.00 (  0.00%)    239.00 (  4.78%)
Min  total-odr0-16       234.00 (  0.00%)    222.00 (  5.13%)
Min  total-odr0-32       223.00 (  0.00%)    211.00 (  5.38%)
Min  total-odr0-64       217.00 (  0.00%)    208.00 (  4.15%)
Min  total-odr0-128      214.00 (  0.00%)    204.00 (  4.67%)
Min  total-odr0-256      250.00 (  0.00%)    230.00 (  8.00%)
Min  total-odr0-512      271.00 (  0.00%)    269.00 (  0.74%)
Min  total-odr0-1024     291.00 (  0.00%)    282.00 (  3.09%)
Min  total-odr0-2048     303.00 (  0.00%)    296.00 (  2.31%)
Min  total-odr0-4096     311.00 (  0.00%)    309.00 (  0.64%)
Min  total-odr0-8192     316.00 (  0.00%)    314.00 (  0.63%)
Min  total-odr0-16384    317.00 (  0.00%)    315.00 (  0.63%)
Min  total-odr1-1        742.00 (  0.00%)    712.00 (  4.04%)
Min  total-odr1-2        562.00 (  0.00%)    530.00 (  5.69%)
Min  total-odr1-4        457.00 (  0.00%)    433.00 (  5.25%)
Min  total-odr1-8        411.00 (  0.00%)    381.00 (  7.30%)
Min  total-odr1-16       381.00 (  0.00%)    356.00 (  6.56%)
Min  total-odr1-32       372.00 (  0.00%)    346.00 (  6.99%)
Min  total-odr1-64       372.00 (  0.00%)    343.00 (  7.80%)
Min  total-odr1-128      375.00 (  0.00%)    351.00 (  6.40%)
Min  total-odr1-256      379.00 (  0.00%)    351.00 (  7.39%)
Min  total-odr1-512      385.00 (  0.00%)    355.00 (  7.79%)
Min  total-odr1-1024     386.00 (  0.00%)    358.00 (  7.25%)
Min  total-odr1-2048     390.00 (  0.00%)    362.00 (  7.18%)
Min  total-odr1-4096     390.00 (  0.00%)    362.00 (  7.18%)
Min  total-odr1-8192     388.00 (  0.00%)    363.00 (  6.44%)

This shows a steady improvement throughout. The primary benefit is from reduced system CPU usage which is obvious from the overall times:

               4.7.0-rc4     4.7.0-rc4
          mmotm-20160623    nodelru-v8
User              189.19        191.80
System           2604.45       2533.56
Elapsed          2855.30       2786.39

The vmstats also showed that the fair zone allocation policy was definitely removed as can be seen here:

                        4.7.0-rc3      4.7.0-rc3
                   mmotm-20160623     nodelru-v8
DMA32 allocs          28794729769              0
Normal allocs         48432501431    77227309877
Movable allocs                  0              0

tiobench on ext4
----------------

tiobench is a benchmark that artificially benefits if old pages remain resident while new pages get reclaimed. The fair zone allocation policy mitigates this problem so pages age fairly. While the benchmark has problems, it is important that tiobench performance remains constant as it implies that page aging problems that the fair zone allocation policy fixes are not re-introduced.
                                   4.7.0-rc4            4.7.0-rc4
                              mmotm-20160623           nodelru-v9
Min  PotentialReadSpeed      89.65 (  0.00%)     90.21 (  0.62%)
Min  SeqRead-MB/sec-1        82.68 (  0.00%)     82.01 ( -0.81%)
Min  SeqRead-MB/sec-2        72.76 (  0.00%)     72.07 ( -0.95%)
Min  SeqRead-MB/sec-4        75.13 (  0.00%)     74.92 ( -0.28%)
Min  SeqRead-MB/sec-8        64.91 (  0.00%)     65.19 (  0.43%)
Min  SeqRead-MB/sec-16       62.24 (  0.00%)     62.22 ( -0.03%)
Min  RandRead-MB/sec-1        0.88 (  0.00%)      0.88 (  0.00%)
Min  RandRead-MB/sec-2        0.95 (  0.00%)      0.92 ( -3.16%)
Min  RandRead-MB/sec-4        1.43 (  0.00%)      1.34 ( -6.29%)
Min  RandRead-MB/sec-8        1.61 (  0.00%)      1.60 ( -0.62%)
Min  RandRead-MB/sec-16       1.80 (  0.00%)      1.90 (  5.56%)
Min  SeqWrite-MB/sec-1       76.41 (  0.00%)     76.85 (  0.58%)
Min  SeqWrite-MB/sec-2       74.11 (  0.00%)     73.54 ( -0.77%)
Min  SeqWrite-MB/sec-4       80.05 (  0.00%)     80.13 (  0.10%)
Min  SeqWrite-MB/sec-8       72.88 (  0.00%)     73.20 (  0.44%)
Min  SeqWrite-MB/sec-16      75.91 (  0.00%)     76.44 (  0.70%)
Min  RandWrite-MB/sec-1       1.18 (  0.00%)      1.14 ( -3.39%)
Min  RandWrite-MB/sec-2       1.02 (  0.00%)      1.03 (  0.98%)
Min  RandWrite-MB/sec-4       1.05 (  0.00%)      0.98 ( -6.67%)
Min  RandWrite-MB/sec-8       0.89 (  0.00%)      0.92 (  3.37%)
Min  RandWrite-MB/sec-16      0.92 (  0.00%)      0.93 (  1.09%)

               4.7.0-rc4     4.7.0-rc4
          mmotm-20160623     approx-v9
User              645.72        525.90
System            403.85        331.75
Elapsed          6795.36       6783.67

This shows that the series has little or no impact on tiobench, which is desirable, and that it reduces system CPU usage. It indicates that the fair zone allocation policy was removed in a manner that didn't reintroduce one class of page aging bug.
There were only minor differences in overall reclaim activity:

                               4.7.0-rc4     4.7.0-rc4
                          mmotm-20160623    nodelru-v8
Minor Faults                      645838        647465
Major Faults                         573           640
Swap Ins                               0             0
Swap Outs                              0             0
DMA allocs                             0             0
DMA32 allocs                    46041453      44190646
Normal allocs                   78053072      79887245
Movable allocs                         0             0
Allocation stalls                     24            67
Stall zone DMA                         0             0
Stall zone DMA32                       0             0
Stall zone Normal                      0             2
Stall zone HighMem                     0             0
Stall zone Movable                     0            65
Direct pages scanned               10969         30609
Kswapd pages scanned            93375144      93492094
Kswapd pages reclaimed          93372243      93489370
Direct pages reclaimed             10969         30609
Kswapd efficiency                    99%           99%
Kswapd velocity                13741.015     13781.934
Direct efficiency                   100%          100%
Direct velocity                    1.614         4.512
Percentage direct scans               0%            0%

kswapd activity was roughly comparable. There were differences in direct reclaim activity but negligible in the context of the overall workload (velocity of 4 pages per second with the patches applied, 1.6 pages per second in the baseline kernel).

pgbench read-only large configuration on ext4
---------------------------------------------

pgbench is a database benchmark that can be sensitive to page reclaim decisions. This also checks if removing the fair zone allocation policy is safe.

pgbench Transactions
                     4.7.0-rc4             4.7.0-rc4
                mmotm-20160623            nodelru-v8
Hmean  1       188.26 (  0.00%)    189.78 (  0.81%)
Hmean  5       330.66 (  0.00%)    328.69 ( -0.59%)
Hmean  12      370.32 (  0.00%)    380.72 (  2.81%)
Hmean  21      368.89 (  0.00%)    369.00 (  0.03%)
Hmean  30      382.14 (  0.00%)    360.89 ( -5.56%)
Hmean  32      428.87 (  0.00%)    432.96 (  0.95%)

Negligible differences again. As with tiobench, overall reclaim activity was comparable.

bonnie++ on ext4
----------------

No interesting performance difference, negligible differences on reclaim stats.

paralleldd on ext4
------------------

This workload uses varying numbers of dd instances to read large amounts of data from disk.
                          4.7.0-rc3             4.7.0-rc3
                     mmotm-20160623            nodelru-v9
Amean  Elapsd-1     186.04 (  0.00%)    189.41 ( -1.82%)
Amean  Elapsd-3     192.27 (  0.00%)    191.38 (  0.46%)
Amean  Elapsd-5     185.21 (  0.00%)    182.75 (  1.33%)
Amean  Elapsd-7     183.71 (  0.00%)    182.11 (  0.87%)
Amean  Elapsd-12    180.96 (  0.00%)    181.58 ( -0.35%)
Amean  Elapsd-16    181.36 (  0.00%)    183.72 ( -1.30%)

               4.7.0-rc4     4.7.0-rc4
          mmotm-20160623    nodelru-v9
User             1548.01       1552.44
System           8609.71       8515.08
Elapsed          3587.10       3594.54

There is little or no change in performance but some drop in system CPU usage.

                               4.7.0-rc3     4.7.0-rc3
                          mmotm-20160623    nodelru-v9
Minor Faults                      362662        367360
Major Faults                        1204          1143
Swap Ins                              22             0
Swap Outs                           2855          1029
DMA allocs                             0             0
DMA32 allocs                    31409797      28837521
Normal allocs                   46611853      49231282
Movable allocs                         0             0
Direct pages scanned                   0             0
Kswapd pages scanned            40845270      40869088
Kswapd pages reclaimed          40830976      40855294
Direct pages reclaimed                 0             0
Kswapd efficiency                    99%           99%
Kswapd velocity                11386.711     11369.769
Direct efficiency                   100%          100%
Direct velocity                    0.000         0.000
Percentage direct scans               0%            0%
Page writes by reclaim              2855          1029
Page writes file                       0             0
Page writes anon                    2855          1029
Page reclaim immediate               771          1628
Sector Reads                   293312636     293536360
Sector Writes                   18213568      18186480
Page rescued immediate                 0             0
Slabs scanned                     128257        132747
Direct inode steals                  181            56
Kswapd inode steals                   59          1131

It basically shows that kswapd was active at roughly the same rate in both kernels. There was also comparable slab scanning activity and direct reclaim was avoided in both cases. There appears to be a large difference in numbers of inodes reclaimed but the workload has few active inodes and is likely a timing artifact.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous memory, a second measures mmap latency and a third copies a large file. The primary metric is checking for mmap latency.
stutter
                              4.7.0-rc4              4.7.0-rc4
                         mmotm-20160623             nodelru-v8
Min         mmap      16.6283 (  0.00%)     13.4258 ( 19.26%)
1st-qrtle   mmap      54.7570 (  0.00%)     34.9121 ( 36.24%)
2nd-qrtle   mmap      57.3163 (  0.00%)     46.1147 ( 19.54%)
3rd-qrtle   mmap      58.9976 (  0.00%)     47.1882 ( 20.02%)
Max-90%     mmap      59.7433 (  0.00%)     47.4453 ( 20.58%)
Max-93%     mmap      60.1298 (  0.00%)     47.6037 ( 20.83%)
Max-95%     mmap      73.4112 (  0.00%)     82.8719 (-12.89%)
Max-99%     mmap      92.8542 (  0.00%)     88.8870 (  4.27%)
Max         mmap    1440.6569 (  0.00%)    121.4201 ( 91.57%)
Mean        mmap      59.3493 (  0.00%)     42.2991 ( 28.73%)
Best99%Mean mmap      57.2121 (  0.00%)     41.8207 ( 26.90%)
Best95%Mean mmap      55.9113 (  0.00%)     39.9620 ( 28.53%)
Best90%Mean mmap      55.6199 (  0.00%)     39.3124 ( 29.32%)
Best50%Mean mmap      53.2183 (  0.00%)     33.1307 ( 37.75%)
Best10%Mean mmap      45.9842 (  0.00%)     20.4040 ( 55.63%)
Best5%Mean  mmap      43.2256 (  0.00%)     17.9654 ( 58.44%)
Best1%Mean  mmap      32.9388 (  0.00%)     16.6875 ( 49.34%)

This shows a number of improvements with the worst-case outlier greatly improved.

Some of the vmstats are interesting:

                               4.7.0-rc4     4.7.0-rc4
                          mmotm-20160623    nodelru-v8
Swap Ins                             163           502
Swap Outs                              0             0
DMA allocs                             0             0
DMA32 allocs                   618719206    1381662383
Normal allocs                  891235743     564138421
Movable allocs                         0             0
Allocation stalls                   2603             1
Direct pages scanned              216787             2
Kswapd pages scanned            50719775      41778378
Kswapd pages reclaimed          41541765      41777639
Direct pages reclaimed            209159             0
Kswapd efficiency                    81%           99%
Kswapd velocity                16859.554     14329.059
Direct efficiency                    96%            0%
Direct velocity                   72.061         0.001
Percentage direct scans               0%            0%
Page writes by reclaim           6215049             0
Page writes file                 6215049             0
Page writes anon                       0             0
Page reclaim immediate             70673            90
Sector Reads                    81940800      81680456
Sector Writes                  100158984      98816036
Page rescued immediate                 0             0
Slabs scanned                    1366954         22683

While this is not guaranteed in all cases, this particular test showed a large reduction in direct reclaim activity. It's also worth noting that no page writes were issued from reclaim context.

This series is not without its hazards.
There are at least three areas that I'm concerned with even though I could not reproduce any problems in those areas.

1. Reclaim/compaction is going to be affected because the amount of reclaim is no longer targeted at a specific zone. Compaction works on a per-zone basis so there is no guarantee that reclaiming a few THPs' worth of pages will have a positive impact on compaction success rates.

2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers are called is now different. This may or may not be a problem but if it is, it'll be because shrinkers are not called enough and some balancing is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are distributed between zones and the fair zone allocation policy used to do something very similar for anon. The distribution is now different but not necessarily in any way that matters but it's still worth bearing in mind.

VM statistic counters for reclaim decisions are zone-based. If the kernel is to reclaim on a per-node basis then we need to track per-node statistics but there is no infrastructure for that. The most notable change is that the old node_page_state is renamed to sum_zone_node_page_state. The new node_page_state takes a pglist_data and uses per-node stats but none exist yet. There is some renaming such as vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming of mod_state to mod_zone_state. Otherwise, this is mostly a mechanical patch with no functional change. There is a lot of similarity between the node and zone helpers which is unfortunate but there was no obvious way of reusing the code and maintaining type safety.
Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
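The split between the two helpers named in the commit message above can be sketched in userspace C. This is a simplified model under stated assumptions: the struct layouts, the `*_sketch` type names, the enum items and the function signatures are all hypothetical and do not match the kernel's. It only shows the idea that one helper sums per-zone counters across a node while the other reads a counter stored on the node itself, with separate enum types standing in for the type-safety concern the message mentions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ZONES 4                      /* illustrative limit */

enum zstat_item { ZSTAT_EXAMPLE, NR_ZSTAT };    /* hypothetical zone counter */
enum nstat_item { NSTAT_EXAMPLE, NR_NSTAT };    /* hypothetical node counter */

struct zone_sketch {
    long vm_stat[NR_ZSTAT];                     /* per-zone counters */
};

struct pgdat_sketch {
    struct zone_sketch node_zones[SKETCH_MAX_ZONES];
    int nr_zones;
    long vm_stat[NR_NSTAT];                     /* per-node counters */
};

/* Old behaviour under its new name: add up one zone counter over the node. */
long sum_zone_node_page_state(const struct pgdat_sketch *pgdat,
                              enum zstat_item item)
{
    long sum = 0;

    assert(pgdat->nr_zones <= SKETCH_MAX_ZONES);
    for (int i = 0; i < pgdat->nr_zones; i++)
        sum += pgdat->node_zones[i].vm_stat[item];
    return sum;
}

/* New behaviour: read a counter that is tracked on the node directly. */
long node_page_state(const struct pgdat_sketch *pgdat, enum nstat_item item)
{
    return pgdat->vm_stat[item];
}
```

The two functions are nearly identical in shape, which mirrors the duplication the commit message calls unfortunate. Keeping two index types makes it harder to pass a zone item to the node helper by mistake, although C enums only document that intent rather than enforce it.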
2016-07-28Merge tag 'dmaengine-4.8-rc1' of git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds49-251/+3108
Pull dmaengine updates from Vinod Koul: "This time we have bit of largish changes: two new drivers, bunch of updates and cleanups to existing set. Nothing super exciting though. New drivers: - Xilinx zynqmp dma engine driver - Marvell xor2 driver Updates: - dmatest sg support - updates and enhancements to Xilinx drivers, adding of cyclic mode - clock handling fixes across drivers - removal of OOM messages on kzalloc across subsystem - interleaved transfers support in omap driver - runtime pm support in qcom bam dma - tasklet kill freeup across drivers - irq cleanup on remove across drivers" * tag 'dmaengine-4.8-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (94 commits) dmaengine: k3dma: add missing clk_disable_unprepare() on error in k3_dma_probe() dmaengine: zynqmp_dma: add missing MODULE_LICENSE dmaengine: qcom_hidma: use for_each_matching_node() macro dmaengine: zynqmp_dma: Fix static checker warning dmaengine: omap-dma: Support for interleaved transfer dmaengine: ioat: statify symbol dmaengine: pxa_dma: implement device_synchronize dmaengine: imx-sdma: remove assignment never used dmaengine: imx-sdma: remove dummy assignment dmaengine: cppi: remove unused and bogus check dmaengine: qcom_hidma_lli: kill the tasklets upon exit dmaengine: pxa_dma: remove owner assignment dmaengine: fsl_raid: remove owner assignment dmaengine: coh901318: remove owner assignment dmaengine: qcom_hidma: kill the tasklets upon exit dmaengine: txx9dmac: explicitly freeup irq dmaengine: sirf-dma: kill the tasklets upon exit dmaengine: s3c24xx: kill the tasklets upon exit dmaengine: s3c24xx: explicitly freeup irq dmaengine: pl330: explicitly freeup irq ...
2016-07-28Merge tag 'hwlock-v4.8' of git://github.com/andersson/remoteprocLinus Torvalds1-0/+1
Pull hwspinlock updates from Bjorn Andersson: "Add missing of_node_put() in the Qualcomm driver and update MAINTAINERS to make sure all hwspinlock related files have a maintainer listed" * tag 'hwlock-v4.8' of git://github.com/andersson/remoteproc: MAINTAINERS: Update hwspinlock paths hwspinlock: qcom_hwspinlock: add missing of_node_put after calling of_parse_phandle
2016-07-28Merge tag 'rproc-v4.8' of git://github.com/andersson/remoteprocLinus Torvalds6-6/+1125
Pull remoteproc updates from Bjorn Andersson: "Introduce remoteproc driver for controlling the modem/DSP Hexagon CPU found in a multitude of Qualcomm platforms. Also cleans up a race condition/potential leak during registration of remoteprocs and includes devicetree bindings in the MAINTAINERS entry" * tag 'rproc-v4.8' of git://github.com/andersson/remoteproc: remoteproc: qcom: hexagon: Clean up mpss validation remoteproc: qcom: remove redundant dev_err call in q6v5_init_mem() remoteproc: qcom: Driver for the self-authenticating Hexagon v5 dt-binding: remoteproc: Introduce Hexagon loader binding remoteproc: Fix potential race condition in rproc_add MAINTAINERS: Add file patterns for remoteproc device tree bindings
2016-07-28Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hidLinus Torvalds13-564/+1111
Pull HID updates from Jiri Kosina: - new hid-alps driver for ALPS Touchpad-Stick device, from Masaki Ota - much improved and generalized HID led handling, and merge of specialized hid-thingm driver into this generic hid-led one, from Heiner Kallweit - i2c-hid power management improvements from Fu Zhonghui and Guohua Zhong - uhid initialization race fix from Roderick Colenbrander * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (21 commits) HID: add usb device id for Apple Magic Keyboard HID: hid-led: fix Delcom support on big endian systems HID: hid-led: add support for Greynut Luxafor HID: hid-led: add support for Delcom Visual Signal Indicator G2 HID: hid-led: remove report id from struct hidled_config HID: alps: a few cleanups HID: remove ThingM blink(1) driver HID: hid-led: add support for ThingM blink(1) HID: hid-led: add support for reading from LED devices HID: hid-led: add support for devices with multiple independent LEDs HID: i2c-hid: set power sleep before shutdown HID: alps: match alps devices in core HID: thingm: simplify debug output code HID: alps: pass correct sizes to hid_hw_raw_request() HID: alps: struct u1_dev *priv is internal to the driver HID: add Alps I2C HID Touchpad-Stick support HID: led: fix config usb: misc: remove outdated USB LED driver HID: migrate USB LED driver from usb misc to hid HID: i2c_hid: enable i2c-hid devices to suspend/resume asynchronously ...
2016-07-28Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivialLinus Torvalds1-1/+0
Pull trivial tree updates from Jiri Kosina. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: fat: fix error message for bogus number of directory entries fat: fix typo s/supeblock/superblock/ ASoC: max9877: Remove unused function declaration dw2102: don't output spurious blank lines to the kernel log init: fix Kconfig text ARM: io: fix comment grammar ocfs: fix ocfs2_xattr_user_get() argument name scsi/qla2xxx: Remove erroneous unused macro qla82xx_get_temp_val1()
2016-07-28Merge tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/randomLinus Torvalds1-2/+1
Pull random driver fix from Ted Ts'o: "Fix a boot failure on systems with non-contiguous NUMA id's" * tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: random: use for_each_online_node() to iterate over NUMA nodes
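Why iterating over online nodes matters when ids are non-contiguous can be shown with a small userspace sketch. The bitmask representation, `SKETCH_MAX_NODES` and `collect_online_nodes` are hypothetical and are not the kernel's node mask API. The sketch also assumes, without confirmation from the message above, that the failure came from treating node ids as a dense range.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 8      /* illustrative upper bound on node ids */

/*
 * Store the id of every online node (bit set in online_mask) into ids[]
 * and return how many were found. Ids whose bit is clear are skipped,
 * so gaps in the id space are never visited.
 */
size_t collect_online_nodes(unsigned long online_mask, int *ids, size_t cap)
{
    size_t n = 0;

    for (int nid = 0; nid < SKETCH_MAX_NODES; nid++) {
        if (!(online_mask & (1UL << nid)))
            continue;           /* offline or absent node id */
        assert(n < cap);
        ids[n++] = nid;
    }
    return n;
}
```

With nodes 0 and 2 online (mask 0x5), this visits ids 0 and 2. A loop that counted from 0 up to the number of online nodes would instead touch ids 0 and 1, and id 1 has no backing node.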
2016-07-28Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds7-16/+12
Pull vfs updates from Al Viro: "Assorted cleanups and fixes. Probably the most interesting part long-term is ->d_init() - that will have a bunch of followups in (at least) ceph and lustre, but we'll need to sort the barrier-related rules before it can get used for really non-trivial stuff. Another fun thing is the merge of ->d_iput() callers (dentry_iput() and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all except the one in __d_lookup_lru())" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) fs/dcache.c: avoid soft-lockup in dput() vfs: new d_init method vfs: Update lookup_dcache() comment bdev: get rid of ->bd_inodes Remove last traces of ->sync_page new helper: d_same_name() dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends() vfs: clean up documentation vfs: document ->d_real() vfs: merge .d_select_inode() into .d_real() unify dentry_iput() and dentry_unlink_inode() binfmt_misc: ->s_root is not going anywhere drop redundant ->owner initializations ufs: get rid of redundant checks orangefs: constify inode_operations missed comment updates from ->direct_IO() prototype change file_inode(f)->i_mapping is f->f_mapping trim fsnotify hooks a bit 9p: new helper - v9fs_parent_fid() debugfs: ->d_parent is never NULL or negative ...
2016-07-28Merge branch 'salted-string-hash'Linus Torvalds1-3/+5
This changes the vfs dentry hashing to mix in the parent pointer at the _beginning_ of the hash, rather than at the end. That actually improves both the hash and the code generation, because we can move more of the computation to the "static" part of the dcache setup, and do less at lookup runtime. It turns out that a lot of other hash users also really wanted to mix in a base pointer as a 'salt' for the hash, and so the slightly extended interface ends up working well for other cases too. Users that want a string hash that is purely about the string pass in a 'salt' pointer of NULL. * merge branch 'salted-string-hash': fs/dcache.c: Save one 32-bit multiply in dcache lookup vfs: make the string hashes salt the hash
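The idea of mixing a base pointer in at the beginning of a string hash can be sketched as follows. This is not the kernel's hash: the function name `salted_str_hash` is hypothetical and the mixing step is a plain FNV-1a-style round chosen only for brevity. What it does illustrate is that the salt-dependent part is computed once, before any character is seen, and that a NULL salt yields a hash that depends on the string alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hash `len` bytes of `s`, seeded with `salt` (for example a parent
 * object's address). Passing NULL gives a hash of the string alone.
 */
uint32_t salted_str_hash(const void *salt, const char *s, size_t len)
{
    /* Fold the pointer into 32 bits; this happens before the string loop. */
    uint64_t p = (uint64_t)(uintptr_t)salt;
    uint32_t h = (uint32_t)(p ^ (p >> 32));

    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        h = (h ^ (unsigned char)s[i]) * 16777619u;  /* FNV-1a-style step */
    return h;
}
```

Because the seed is fixed before the loop starts, a caller that hashes many names under the same parent could precompute the folded salt once, which is in the spirit of moving work to the "static" part of the setup as described above. The same name under two different parents hashes differently, since each step of this particular mixing function is invertible.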
2016-07-28Merge branch 'mymd/for-next' into mymd/for-linusShaohua Li7-213/+328
2016-07-28MD: fix null pointer deferenceShaohua Li1-2/+4
The md device might not have personality (for example, ddf raid array). The issue is introduced by 8430e7e0af9a15(md: disconnect device from personality before trying to remove it) Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-28Merge branch 'for-4.8/hid-led' into for-linusJiri Kosina9-550/+554
Conflicts: drivers/hid/hid-thingm.c
2016-07-28Merge branches 'for-4.8/alps', 'for-4.8/apple', 'for-4.8/i2c-hid', 'for-4.8/uhid-offload-hid-device-add' and 'for-4.8/upstream' into for-linusJiri Kosina3515-93601/+230842