linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2016-03-23	intel_idle: Support for Intel Xeon Phi Processor x200 Product Family	Dasaratharaman Chandramouli	1	-0/+25
	Enables "Intel(R) Xeon Phi(TM) Processor x200 Product Family" support, formerly code-named KNL. It is based on modified Intel Atom Silvermont microarchitecture. Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com> [micah.barany@intel.com: adjusted values of residency and latency] Signed-off-by: Micah Barany <micah.barany@intel.com> [hubert.chrzaniuk@intel.com: removed deprecated CPUIDLE_FLAG_TIME_VALID flag] Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com> Signed-off-by: Pawel Karczewski <pawel.karczewski@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2016-03-23	intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled	Len Brown	1	-22/+86
	Some SKL-H configurations require "intel_idle.max_cstate=7" to boot. While that is an effective workaround, it disables C10. This patch detects the problematic configuration, and disables C8 and C9, keeping C10 enabled. Note that enabling SGX in BIOS SETUP can also prevent this issue, if the system BIOS provides that option. https://bugzilla.kernel.org/show_bug.cgi?id=109081 "Freezes with Intel i7 6700HQ (Skylake), unless intel_idle.max_cstate=7" Signed-off-by: Len Brown <len.brown@intel.com> Cc: stable@vger.kernel.org
2016-03-23	ACPI / PM: Runtime resume devices when waking from hibernate	Lukas Wunner	1	-0/+1
	Commit 58a1fbbb2ee8 ("PM / PCI / ACPI: Kick devices that might have been reset by firmware") added a runtime resume for devices that were runtime suspended when the system entered suspend-to-RAM. Briefly, the motivation was to ensure that devices did not remain in a reset-power-on state after resume, potentially preventing deep SoC-wide low-power states from being entered on idle. Currently we're not doing the same when leaving suspend-to-disk and this asymmetry is a problem if drivers rely on the automatic resume triggered by pm_complete_with_resume_check(). Fix it. Fixes: 58a1fbbb2ee8 (PM / PCI / ACPI: Kick devices that might have been reset by firmware) Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: 4.4+ <stable@vger.kernel.org> # 4.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-23	PM / sleep: Clear pm_suspend_global_flags upon hibernate	Lukas Wunner	1	-0/+1
	When suspending to RAM, waking up and later suspending to disk, we gratuitously runtime resume devices after the thaw phase. This does not occur if we always suspend to RAM or always to disk. pm_complete_with_resume_check(), which gets called from pci_pm_complete() among others, schedules a runtime resume if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during a suspend-to-RAM cycle. It is cleared at the beginning of the suspend-to-RAM cycle but not afterwards and it is not cleared during a suspend-to-disk cycle at all. Fix it. Fixes: ef25ba047601 (PM / sleep: Add flags to indicate platform firmware involvement) Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: 4.4+ <stable@vger.kernel.org> # 4.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-22	cpufreq: governor: Always schedule work on the CPU running update	Rafael J. Wysocki	1	-1/+1
	Modify dbs_irq_work() to always schedule the process-context work on the current CPU which also ran the dbs_update_util_handler() that the irq_work being handled came from. This causes the entire frequency update handling (involving the "ondemand" or "conservative" governors) to be carried out by the CPU whose frequency is to be updated and reduces the overall amount of inter-CPU noise related to cpufreq. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-22	cpufreq: Always update current frequency before startig governor	Rafael J. Wysocki	1	-11/+3
	Make policy->cur match the current frequency returned by the driver's ->get() callback before starting the governor in case they went out of sync in the meantime and drop the piece of code attempting to resync policy->cur with the real frequency of the boot CPU from cpufreq_resume() as it serves no purpose any more (and it's racy and super-ugly anyway). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-22	cpufreq: Introduce cpufreq_update_current_freq()	Rafael J. Wysocki	1	-9/+19
	Move the part of cpufreq_update_policy() that obtains the current frequency from the driver and updates policy->cur if necessary to a separate function, cpufreq_get_current_freq(). That should not introduce functional changes and subsequent change set will need it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-22	cpufreq: Introduce cpufreq_start_governor()	Rafael J. Wysocki	1	-22/+22
	Starting a governor in cpufreq always follows the same pattern involving two calls to cpufreq_governor(), one with the event argument set to CPUFREQ_GOV_START and one with that argument set to CPUFREQ_GOV_LIMITS. Introduce cpufreq_start_governor() that will carry out those two operations and make all places where governors are started use it. That slightly modifies the behavior of cpufreq_set_policy() which now also will go back to the old governor if the second call to cpufreq_governor() (the one with event equal to CPUFREQ_GOV_LIMITS) fails, but that really is how it should work in the first place. Also cpufreq_resume() will now pring an error message if the CPUFREQ_GOV_LIMITS call to cpufreq_governor() fails, but that makes it follow cpufreq_add_policy_cpu() and cpufreq_offline() in that respect. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-22	cpufreq: powernv: Add sysfs attributes to show throttle stats	Shilpasri G Bhat	2	-2/+141
	Create sysfs attributes to export throttle information in /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats directory. The newly added sysfs files are as follows: 1)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/turbo_stat 2)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/sub-turbo_stat 3)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/unthrottle 4)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/powercap 5)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/overtemp 6)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/supply_fault 7)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/overcurrent 8)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/occ_reset Detailed explanation of each attribute is added to Documentation/ABI/testing/sysfs-devices-system-cpu Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-22	cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static	Jisheng Zhang	1	-6/+6
	These frequency register read/write operations' implementations for the given processor (Intel/AMD MSR access or I/O port access) are only used internally in acpi-cpufreq, so make them static. Signed-off-by: Jisheng Zhang <jszhang@marvell.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-22	PCI: ACPI: IA64: fix IO port generic range check	Lorenzo Pieralisi	1	-1/+13
	The [0 - 64k] ACPI PCI IO port resource boundary check in: acpi_dev_ioresource_flags() is currently applied blindly in the ACPI resource parsing to all architectures, but only x86 suffers from that IO space limitation. On arches (ie IA64 and ARM64) where IO space is memory mapped, the PCI root bridges IO resource windows are firstly initialized from the _CRS (in acpi_decode_space()) and contain the CPU physical address at which a root bridge decodes IO space in the CPU physical address space with the offset value representing the offset required to translate the PCI bus address into the CPU physical address. The IO resource windows are then parsed and updated in arch code before creating and enumerating PCI buses (eg IA64 add_io_space()) to map in an arch specific way the obtained CPU physical address range to a slice of virtual address space reserved to map PCI IO space, ending up with PCI bridges resource windows containing IO resources like the following on a working IA64 configuration: PCI host bridge to bus 0000:00 pci_bus 0000:00: root bus resource [io 0x1000000-0x100ffff window] (bus address [0x0000-0xffff]) pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000fffff window] pci_bus 0000:00: root bus resource [mem 0x80000000-0x8fffffff window] pci_bus 0000:00: root bus resource [mem 0x80004000000-0x800ffffffff window] pci_bus 0000:00: root bus resource [bus 00] This implies that the [0 - 64K] check in acpi_dev_ioresource_flags() leaves platforms with memory mapped IO space (ie IA64) broken (ie kernel can't claim IO resources since the host bridge IO resource is disabled and discarded by ACPI core code, see log on IA64 with missing root bridge IO resource, silently filtered by current [0 - 64k] check in acpi_dev_ioresource_flags()): PCI host bridge to bus 0000:00 pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000fffff window] pci_bus 0000:00: root bus resource [mem 0x80000000-0x8fffffff window] pci_bus 0000:00: root bus resource [mem 0x80004000000-0x800ffffffff window] pci_bus 0000:00: root bus resource [bus 00] [...] pci 0000:00:03.0: [1002:515e] type 00 class 0x030000 pci 0000:00:03.0: reg 0x10: [mem 0x80000000-0x87ffffff pref] pci 0000:00:03.0: reg 0x14: [io 0x1000-0x10ff] pci 0000:00:03.0: reg 0x18: [mem 0x88020000-0x8802ffff] pci 0000:00:03.0: reg 0x30: [mem 0x88000000-0x8801ffff pref] pci 0000:00:03.0: supports D1 D2 pci 0000:00:03.0: can't claim BAR 1 [io 0x1000-0x10ff]: no compatible bridge window For this reason, the IO port resources boundaries check in generic ACPI parsing code should be guarded with a CONFIG_X86 guard so that more arches (ie ARM64) can benefit from the generic ACPI resources parsing interface without incurring in unexpected resource filtering, fixing at the same time current breakage on IA64. This patch factors out IO ports boundary [0 - 64k] check in generic ACPI code and makes the IO space check X86 specific to make sure that IO space resources are usable on other arches too. Fixes: 3772aea7d6f3 (ia64/PCI/ACPI: Use common ACPI resource parsing interface for host bridge) Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: 4.4+ <stable@vger.kernel.org> # 4.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-22	ACPI / util: cast data to u64 before shifting to fix sign extension	Colin Ian King	1	-1/+1
	obj->buffer.pointer[i] should be cast to u64 to prevent an unintentional sign extension. For example, if pointer[7] is 0x80, then the value 0xffffffffff000000 is or'd into mask rather than the intended value 0xff00000000000000 Detected with static analysis by CoverityScan Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-22	cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path	Michael Neuling	1	-33/+17
	Commit 96c4726f01cd "cpufreq: powernv: Remove cpu_to_chip_id() from hot-path" introduced a 'core_to_chip_map' array to cache the chip-ids of all cores. Replace this with a per-CPU variable that stores the pointer to the chip-array. This removes the linear lookup and provides a neater and simpler solution. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-21	cpuidle: menu: Fall back to polling if next timer event is near	Rafael J. Wysocki	1	-4/+8
	Commit a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable polling) changed the behavior of the fallback state selection part of menu_select() so it looks at interactivity_req instead of data->next_timer_us when it makes its decision. That effectively caused polling to be used more often as fallback idle which led to significant increases of energy consumption in some cases. Commit e132b9b3bc7f (cpuidle: menu: use high confidence factors only when considering polling) changed that logic again to be more predictable, but that didn't help with the increased energy consumption problem. For this reason, go back to making decisions on which state to fall back to based on data->next_timer_us which is the time we know for sure something will happen rather than a prediction (which may be inaccurate and turns out to be so often enough to be problematic). However, take the target residency of the first proper idle state (C1) into account, so that state is not used as the fallback one if its target residency is greater than data->next_timer_us. Fixes: a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable polling) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
2016-03-20	cpufreq: acpi-cpufreq: Clean up hot plug notifier callback	Richard Cochran	1	-2/+4
	This driver has two issues. First, it tries to fiddle with the hot plugged CPU's MSR on the UP_PREPARE event, at a time when the CPU is not yet online. Second, the driver sets the "boost-disable" bit for a CPU when going down, but does not clear the bit again if the CPU comes up again due to DOWN_FAILED. This patch fixes the issues by changing the driver to react to the ONLINE/DOWN_FAILED events instead of UP_PREPARE. As an added benefit, the driver also becomes symmetric with respect to the hot plug mechanism. Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-20	intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts	Rafael J. Wysocki	1	-30/+43
	After commit a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks) wrmsrl_on_cpu() cannot be called in the intel_pstate_adjust_busy_pstate() path as that is executed with disabled interrupts. However, atom_set_pstate() called from there via intel_pstate_set_pstate() uses wrmsrl_on_cpu() to update the IA32_PERF_CTL MSR which triggers the WARN_ON_ONCE() in smp_call_function_single(). The reason why wrmsrl_on_cpu() is used by atom_set_pstate() is because intel_pstate_set_pstate() calling it is also invoked during the initialization and cleanup of the driver and in those cases it is not guaranteed to be run on the CPU that is being updated. However, in the case when intel_pstate_set_pstate() is called by intel_pstate_adjust_busy_pstate(), wrmsrl() can be used to update the register safely. Moreover, intel_pstate_set_pstate() already contains code that only is executed if the function is called by intel_pstate_adjust_busy_pstate() and there is a special argument passed to it because of that. To fix the problem at hand, rearrange the code taking the above observations into account. First, replace the ->set() callback in struct pstate_funcs with a ->get_val() one that will return the value to be written to the IA32_PERF_CTL MSR without updating the register. Second, split intel_pstate_set_pstate() into two functions, intel_pstate_update_pstate() to be called by intel_pstate_adjust_busy_pstate() that will contain all of the intel_pstate_set_pstate() code which only needs to be executed in that case and will use wrmsrl() to update the MSR (after obtaining the value to write to it from the ->get_val() callback), and intel_pstate_set_min_pstate() to be invoked during the initialization and cleanup that will set the P-state to the minimum one and will update the MSR using wrmsrl_on_cpu(). Finally, move the code shared between intel_pstate_update_pstate() and intel_pstate_set_min_pstate() to a new static inline function intel_pstate_record_pstate() and make them both call it. Of course, that unifies the handling of the IA32_PERF_CTL MSR writes between Atom and Core. Fixes: a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks) Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-18	cpufreq: Make cpufreq_quick_get() safe to call	Richard Cochran	1	-2/+10
	The function, cpufreq_quick_get, accesses the global 'cpufreq_driver' and its fields without taking the associated lock, cpufreq_driver_lock. Without the locking, nothing guarantees that 'cpufreq_driver' remains consistent during the call. This patch fixes the issue by taking the lock before accessing the data structure. Signed-off-by: Richard Cochran <rcochran@linutronix.de> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-17	ACPI / property: fix data node parsing in acpi_get_next_subnode()	Irina Tirdea	1	-0/+1
	When an ACPI node has both ACPI device nodes and ACPI data nodes, acpi_get_next_subnode() will return the ACPI data nodes of its last parsed child. To avoid that, make acpi_get_next_subnode() go back to the original ACPI device object when all of the device node children of it have been found already. Signed-off-by: Irina Tirdea <irina.tirdea@intel.com> [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-17	ACPI / APD: Add device HID for future AMD UART controller	Wang Hongcheng	1	-0/+1
	Add device HID AMDI0020 to match the AMD ACPI Vendor ID (AMDI) as registered in http://www.uefi.org/acpi_id_list, and the UART controller on future AMD paltform will use the HID instead of AMD0020. Signed-off-by: Wang Hongcheng <annie.wang@amd.com> Acked-by: Ken Xue <Ken.Xue@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-17	cpuidle: menu: use high confidence factors only when considering polling	Rik van Riel	1	-18/+24
	The menu governor uses five different factors to pick the idle state: - the user configured latency_req - the time until the next timer (next_timer_us) - the typical sleep interval, as measured recently - an estimate of sleep time by dividing next_timer_us by an observed factor - a load corrected version of the above, divided again by load Only the first three items are known with enough confidence that we can use them to consider polling, instead of an actual CPU idle state, because the cost of being wrong about polling can be excessive power use. The latter two are used in the menu governor's main selection loop, and can result in choosing a shallower idle state when the system is expected to be busy again soon. This pushes a busy system in the "performance" direction of the performance<>power tradeoff, when choosing between idle states, but stays more strictly on the "power" state when deciding between polling and C1. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-17	PM / clk: Add support for obtaining clocks from device-tree	Jon Hunter	2	-0/+98
	The PM clocks framework requires clients to pass either a con-id or a valid clk pointer in order to add a clock to a device. Add a new function of_pm_clk_add_clks() to allows device clocks to be retrieved from device-tree and populated for a given device. Note that it is not necessary to make the compilation of this new function dependent upon CONFIG_OF because there are stubs functions for the device-tree APIs used. In order to handle errors encountered when adding clocks from device-tree, add a function pm_clk_remove_clk() to remove any clocks (using a pointer to the clk structure) that have been added successfully before the error occurred. Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-17	PM / devfreq: Spelling s/frequnecy/frequency/	Geert Uytterhoeven	1	-1/+1
	Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-15	autofs4: fix string.h include in auto_dev-ioctl.h	Ian Kent	1	-5/+0
	Since including linux/string.h will now do the right thing remove the conditional check. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	autofs4: use pr_xxx() macros directly for logging	Ian Kent	6	-75/+68
	Use the standard pr_xxx() log macros directly for log prints instead of the AUTOFS_XXX() macros. Signed-off-by: Ian Kent <ikent@redhat.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	autofs4: change log print macros to not insert newline	Ian Kent	6	-49/+49
	Common kernel coding practice is to include the newline of log prints within the log text rather than hidden away in a macro. To avoid introducing inconsistencies as changes are made change the log macros to not include the newline. Signed-off-by: Ian Kent <raven@themaw.net> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	autofs4: make autofs log prints consistent	Ian Kent	4	-12/+12
	Use the pr_() print in AUTOFS_() macros instead of printks and include the module name in log message macros. Also use the AUTOFS_*() macros everywhere instead of raw printks. Signed-off-by: Ian Kent <raven@themaw.net> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	autofs4: fix some white space errors	Ian Kent	6	-10/+8
	Fix some white space format errors. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	autofs4: fix invalid ioctl return in autofs4_root_ioctl_unlocked()	Ian Kent	1	-1/+1
	The return from an ioctl if an invalid ioctl is passed in should be EINVAL not ENOSYS. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	autofs4: fix coding style line length in autofs4_wait()	Ian Kent	1	-2/+4
	The need for this is questionable but checkpatch.pl complains about the line length and it's a straightfoward change. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	autofs4: fix coding style problem in autofs4_get_set_timeout()	Ian Kent	1	-8/+20
	Refactor autofs4_get_set_timeout() to eliminate coding style error. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	autofs4: coding style fixes	Ian Kent	12	-188/+190
	Try and make the coding style completely consistent throughtout the autofs module and inline with kernel coding style recommendations. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	autofs: show pipe inode in mount options	Stanislav Kinsburskiy	1	-1/+6
	This is required for CRIU (Checkpoint Restart In Userspace) to migrate a mount point when write end in user space is closed. Below is a brief description of the problem. To migrate a non-catatonic autofs mount point, one has to restore the control pipe between kernel and autofs master process. One of the autofs masters is systemd, which closes pipe write end after passing it to the kernel with mount call. To be able to restore the systemd control pipe one has to know which read pipe end in systemd corresponds to the write pipe end in the kernel. The pipe "fd" in mount options is not enough because it was closed and probably replaced by some other descriptor. Thus, some other attribute is required to be able to find the read pipe end. The best attribute to use to find the correct pipe end is inode number becuase it's unique for the whole system and can't be reused while the autofs mount exists. This attribute can also be used to recognize a situation where an autofs mount has no master (no process with specified "pgrp" or no file descriptor with "pipe_ino", specified in autofs mount options). Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	kallsyms: add support for relative offsets in kallsyms address table	Ard Biesheuvel	5	-19/+126
	Similar to how relative extables are implemented, it is possible to emit the kallsyms table in such a way that it contains offsets relative to some anchor point in the kernel image rather than absolute addresses. On 64-bit architectures, it cuts the size of the kallsyms address table in half, since offsets between kernel symbols can typically be expressed in 32 bits. This saves several hundreds of kilobytes of permanent .rodata on average. In addition, the kallsyms address table is no longer subject to dynamic relocation when CONFIG_RELOCATABLE is in effect, so the relocation work done after decompression now doesn't have to do relocation updates for all these values. This saves up to 24 bytes (i.e., the size of a ELF64 RELA relocation table entry) per value, which easily adds up to a couple of megabytes of uncompressed __init data on ppc64 or arm64. Even if these relocation entries typically compress well, the combined size reduction of 2.8 MB uncompressed for a ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500 KB space saving in the compressed image. Since it is useful for some architectures (like x86) to retain the ability to emit absolute values as well, this patch also adds support for capturing both absolute and relative values when KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu addresses as positive 32-bit values, and addresses relative to the lowest encountered relative symbol as negative values, which are subtracted from the runtime address of this base symbol to produce the actual address. Support for the above is enabled by default for all architectures except IA-64 and Tile-GX, whose symbols are too far apart to capture in this manner. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	kallsyms: don't overload absolute symbol type for percpu symbols	Ard Biesheuvel	1	-2/+12
	Commit c6bda7c988a5 ("kallsyms: fix percpu vars on x86-64 with relocation") overloaded the 'A' (absolute) symbol type to signify that a symbol is not subject to dynamic relocation. However, the original A type does not imply that at all, and depending on the version of the toolchain, many A type symbols are emitted that are in fact relative to the kernel text, i.e., if the kernel is relocated at runtime, these symbols should be updated as well. For instance, on sparc32, the following symbols are emitted as absolute (kindly provided by Guenter Roeck): f035a420 A _etext f03d9000 A _sdata f03de8c4 A jiffies f03f8860 A _edata f03fc000 A __init_begin f041bdc8 A __init_text_end f0423000 A __bss_start f0423000 A __init_end f044457d A __bss_stop f044457d A _end On x86_64, similar behavior can be observed: ffffffff81a00000 A __end_rodata_hpage_align ffffffff81b19000 A __vvar_page ffffffff81d3d000 A _end Even if only a couple of them pass the symbol range check that results in them to be taken into account for the final kallsyms symbol table, it is obvious that 'A' does not mean the symbol does not need to be updated at relocation time, and overloading its meaning to signify that is perhaps not a good idea. So instead, add a new percpu_absolute member to struct sym_entry, and when --absolute-percpu is in effect, use it to record symbols whose addresses should be emitted as final values rather than values that still require relocation at runtime. That way, we can drop the check against the 'A' type. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michal Marek <mmarek@suse.cz> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	x86: kallsyms: disable absolute percpu symbols on !SMP	Ard Biesheuvel	2	-1/+5
	scripts/kallsyms.c has a special --absolute-percpu command line option which deals with the zero based per cpu offsets that are used when building for SMP on x86_64. This means that the option should only be passed in that case, so add a Kconfig symbol with the correct predicate, and use that instead. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	checkpatch: fix another left brace warning	Geyslan G. Bem	1	-1/+1
	This patch escapes a regex that uses left brace. Using checkpatch.pl with Perl 5.22.0 generates the warning: "Unescaped left brace in regex is deprecated, passed through in regex;" Comment from regcomp.c in Perl source: "Currently we don't warn when the lbrace is at the start of a construct. This catches it in the middle of a literal string, or when it's the first thing after something like "\b"." This works as a complement to 4e5d56bd ("checkpatch: fix left brace warning"). Signed-off-by: Geyslan G. Bem <geyslan@gmail.com> Signed-off-by: Joe Perches <joe@perches.com> Suggested-by: Peter Senna Tschudin <peter.senna@gmail.com> Cc: Eddie Kovsky <ewk@edkovsky.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	checkpatch: improve UNSPECIFIED_INT test for bare signed/unsigned uses	Joe Perches	1	-4/+8
	Improve the test to allow casts to (unsigned) or (signed) to be found and fixed if desired. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	checkpatch: warn on bare unsigned or signed declarations without int	Joe Perches	1	-0/+20
	Kernel style prefers "unsigned int <foo>" over "unsigned <foo>" and "signed int <foo>" over "signed <foo>". Emit a warning for these simple signed/unsigned <foo> declarations. Fix it too if desired. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	checkpatch: exclude asm volatile from complex macro check	Joe Perches	1	-0/+3
	asm volatile and all its variants like __asm__ __volatile__ ("<foo>") are reported as errors with "Macros with with complex values should be enclosed in parentheses". Make an exception for these asm volatile macro definitions by converting the "asm volatile" to "asm_volatile" so it appears as a single function call and the error isn't reported. Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Jeff Merkey <linux.mdb@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	mm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate()	Johannes Weiner	1	-2/+1
	Migration accounting in the memory controller used to have to handle both oldpage and newpage being on the LRU already; fuse's page cache replacement used to pass a recycled newpage that had been uncharged but not freed and removed from the LRU, and the memcg migration code used to uncharge oldpage to "pass on" the existing charge to newpage. Nowadays, pages are no longer uncharged when truncated from the page cache, but rather only at free time, so if a LRU page is recycled in page cache replacement it'll also still be charged. And we bail out of the charge transfer altogether in that case. Tell commit_charge() that we know newpage is not on the LRU, to avoid taking the zone->lru_lock unnecessarily from the migration path. But also, oldpage is no longer uncharged inside migration. We only use oldpage for its page->mem_cgroup and page size, so we don't care about its LRU state anymore either. Remove any mention from the kernel doc. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	mm: migrate: consolidate mem_cgroup_migrate() calls	Johannes Weiner	1	-7/+2
	Rather than scattering mem_cgroup_migrate() calls all over the place, have a single call from a safe place where every migration operation eventually ends up in - migrate_page_copy(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous	Joonsoo Kim	7	-52/+105
	There is a performance drop report due to hugepage allocation and in there half of cpu time are spent on pageblock_pfn_to_page() in compaction [1]. In that workload, compaction is triggered to make hugepage but most of pageblocks are un-available for compaction due to pageblock type and skip bit so compaction usually fails. Most costly operations in this case is to find valid pageblock while scanning whole zone range. To check if pageblock is valid to compact, valid pfn within pageblock is required and we can obtain it by calling pageblock_pfn_to_page(). This function checks whether pageblock is in a single zone and return valid pfn if possible. Problem is that we need to check it every time before scanning pageblock even if we re-visit it and this turns out to be very expensive in this workload. Although we have no way to skip this pageblock check in the system where hole exists at arbitrary position, we can use cached value for zone continuity and just do pfn_to_page() in the system where hole doesn't exist. This optimization considerably speeds up in above workload. Before vs After Max: 1096 MB/s vs 1325 MB/s Min: 635 MB/s 1015 MB/s Avg: 899 MB/s 1194 MB/s Avg is improved by roughly 30% [2]. [1]: http://www.spinics.net/lists/linux-mm/msg97378.html [2]: https://lkml.org/lkml/2015/12/9/23 [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Aaron Lu <aaron.lu@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	mm/compaction: pass only pageblock aligned range to pageblock_pfn_to_page	Joonsoo Kim	1	-11/+30
	pageblock_pfn_to_page() is used to check there is valid pfn and all pages in the pageblock is in a single zone. If there is a hole in the pageblock, passing arbitrary position to pageblock_pfn_to_page() could cause to skip whole pageblock scanning, instead of just skipping the hole page. For deterministic behaviour, it's better to always pass pageblock aligned range to pageblock_pfn_to_page(). It will also help further optimization on pageblock_pfn_to_page() in the following patch. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	mm/compaction: fix invalid free_pfn and compact_cached_free_pfn	Joonsoo Kim	1	-4/+5
	free_pfn and compact_cached_free_pfn are the pointer that remember restart position of freepage scanner. When they are reset or invalid, we set them to zone_end_pfn because freepage scanner works in reverse direction. But, because zone range is defined as [zone_start_pfn, zone_end_pfn), zone_end_pfn is invalid to access. Therefore, we should not store it to free_pfn and compact_cached_free_pfn. Instead, we need to store zone_end_pfn - 1 to them. There is one more thing we should consider. Freepage scanner scan reversely by pageblock unit. If free_pfn and compact_cached_free_pfn are set to middle of pageblock, it regards that sitiation as that it already scans front part of pageblock so we lose opportunity to scan there. To fix-up, this patch do round_down() to guarantee that reset position will be pageblock aligned. Note that thanks to the current pageblock_pfn_to_page() implementation, actual access to zone_end_pfn doesn't happen until now. But, following patch will change pageblock_pfn_to_page() so this patch is needed from now on. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	mm/memblock.c: remove unnecessary memblock_type variable	Alexander Kuleshov	1	-6/+2
	We define struct memblock_type *type in the memblock_add_region() and memblock_reserve_region() functions only for passing it to the memlock_add_range() and memblock_reserve_range() functions. Let's remove these variables and will pass a type directly. Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	x86: also use debug_pagealloc_enabled() for free_init_pages	Christian Borntraeger	1	-14/+15
	we want to couple all debugging features with debug_pagealloc_enabled() and not with the config option CONFIG_DEBUG_PAGEALLOC. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Laura Abbott <labbott@fedoraproject.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	s390: query dynamic DEBUG_PAGEALLOC setting	Christian Borntraeger	2	-9/+7
	We can use debug_pagealloc_enabled() to check if we can map the identity mapping with 1MB/2GB pages as well as to print the current setting in dump_stack. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Laura Abbott <labbott@fedoraproject.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	x86: query dynamic DEBUG_PAGEALLOC setting	Christian Borntraeger	3	-16/+10
	We can use debug_pagealloc_enabled() to check if we can map the identity mapping with 2MB pages. We can also add the state into the dump_stack output. The patch does not touch the code for the 1GB pages, which ignored CONFIG_DEBUG_PAGEALLOC. Do we need to fence this as well? Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Laura Abbott <labbott@fedoraproject.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	thp: cleanup split_huge_page()	Kirill A. Shutemov	1	-13/+7
	After one of bugfixes to freeze_page(), we don't have freezed pages in rmap, therefore mapcount of all subpages of freezed THP is zero. And we have assert for that. Let's drop code which deal with non-zero mapcount of subpages. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15	mm: use linear_page_index() in do_fault()	Matthew Wilcox	1	-2/+1
	do_fault() assumes that PAGE_SIZE is the same as PAGE_CACHE_SIZE. Use linear_page_index() to calculate pgoff in the correct units. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>