linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2013-09-26	ALSA : hda - not use assigned converters for all unused pins	Mengdong Lin	1	-18/+29
	BIOS can mark a pin as "no physical connection" if the port is used by an integrated display which is not audio capable. And audio driver will overlook such pins. On Haswell, such a disconneted pin will keep muted and connected to the 1st converter by default. But if the 1st convertor is assigned to a connected pin for audio streaming. The muted disconnected pin can make the connected pin no sound output. So this patch avoids using assigned converters for all unused pins for Haswell, including the disconected pins. Signed-off-by: Mengdong Lin <mengdong.lin@intel.com> Reviewed-by: David Henningsson <david.henningsson@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2013-09-26	ALSA: Fix assignment of 0/1 to bool variables	Peter Senna Tschudin	2	-9/+9
	Convert 0 to false and 1 to true when assigning values to bool variables. Inspired by commit 3db1cd5c05f35fb43eb134df6f321de4e63141f2. The simplified semantic patch that find this problem is as follows (http://coccinelle.lip6.fr/): @@ bool b; @@ ( -b = 0 +b = false \| -b = 1 +b = true ) Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2013-09-26	ALSA: compress: Make sure we trigger STOP before closing the stream.	Liam Girdwood	1	-0/+12
	Currently we assume that userspace will shut down the compressed stream correctly. However, if userspcae dies (e.g. cplay & ctrl-C) we dont stop the stream before freeing it. This now checks that the stream is stopped before freeing. Signed-off-by: Liam Girdwood <liam.r.girdwood@linux.intel.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2013-09-19	ALSA: compress: Fix compress device unregister.	Liam Girdwood	1	-1/+2
	snd_unregister_device() should return the device type and not stream direction. Signed-off-by: Liam Girdwood <liam.r.girdwood@linux.intel.com> Acked-by: Vinod Koul <vinod.koul@intel.com> Tested-by: Vinod Koul <vinod.koul@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2013-09-16	Linux 3.12-rc1	Linus Torvalds	1	-3/+3

2013-09-15	vfs: fix typo in comment in recent dentry work	Linus Torvalds	1	-1/+1
	Sedat points out that I transposed some letters in "LRU" and wrote "RLU" instead in one of the new comments explaining the flow. Let's just fix it. Reported-by: Sedat Dilek <sedat.dilek@jpberlin.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-15	partitions/efi: loosen check fot pmbr size in lba	Davidlohr Bueso	1	-2/+6
	Matt found that commit 27a7c642174e ("partitions/efi: account for pmbr size in lba") caused his GPT formatted eMMC device not to boot. The reason is that this commit enforced Linux to always check the lesser of the whole disk or 2Tib for the pMBR size in LBA. While most disk partitioning tools out there create a pMBR with these characteristics, Microsoft does not, as it always sets the entry to the maximum 32-bit limitation - even though a drive may be smaller than that[1]. Loosen this check and only verify that the size is either the whole disk or 0xFFFFFFFF. No tool in its right mind would set it to any value other than these. [1] http://thestarman.pcministry.com/asm/mbr/GPT.htm#GPTPT Reported-and-tested-by: Matt Porter <matt.porter@linaro.org> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-13	vfs: fix dentry LRU list handling and nr_dentry_unused accounting	Linus Torvalds	1	-27/+101
	The LRU list changes interacted badly with our nr_dentry_unused accounting, and even worse with the new DCACHE_LRU_LIST bit logic. This introduces helper functions to make sure everything follows the proper dcache d_lru list rules: the dentry cache is complicated by the fact that some of the hotpaths don't even want to look at the LRU list at all, and the fact that we use the same list entry in the dentry for both the LRU list and for our temporary shrinking lists when removing things from the LRU. The helper functions temporarily have some extra sanity checking for the flag bits that have to match the current LRU state of the dentry. We'll remove that before the final 3.12 release, but considering how easy it is to get wrong, this first cleanup version has some very particular sanity checking. Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-13	cifs: update cifs.txt and remove some outdated infos	Björn Jacke	1	-31/+11
	Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Björn JACKE <bj@sernet.de> Signed-off-by: Steve French <smfrench@gmail.com>
2013-09-13	cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache	Sachin Prabhu	1	-3/+7
	When reading a single page with cifs_readpage(), we make a call to fscache_read_or_alloc_page() which once done, asynchronously calls the completion function cifs_readpage_from_fscache_complete(). This completion function unlocks the page once it has been populated from cache. The module then attempts to unlock the page a second time in cifs_readpage() which leads to warning messages. In case of a successful call to fscache_read_or_alloc_page() we should skip the second unlock_page() since this will be called by the cifs_readpage_from_fscache_complete() once the page has been populated by fscache. With the modifications to cifs_readpage_worker(), we will need to re-grab the page lock in cifs_write_begin(). The problem was first noticed when testing new fscache patches for cifs. https://bugzilla.redhat.com/show_bug.cgi?id=1005737 Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-09-13	cifs: Do not take a reference to the page in cifs_readpage_worker()	Sachin Prabhu	1	-2/+3
	We do not need to take a reference to the pagecache in cifs_readpage_worker() since the calling function will have already taken one before passing the pointer to the page as an argument to the function. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-09-13	MIPS: kernel: vpe: Make vpe_attrs an array of pointers.	Markos Chandras	1	-1/+1
	Commit 567b21e973ccf5b0d13776e408d7c67099749eb8 "mips: convert vpe_class to use dev_groups" broke the build on MIPS since vpe_attrs should be an array of 'struct device_attribute' pointers. Fixes the following build problem: arch/mips/kernel/vpe.c:1372:2: error: missing braces around initializer [-Werror=missing-braces] arch/mips/kernel/vpe.c:1372:2: error: (near initialization for 'vpe_attrs[0]') [-Werror=missing-braces] Cc: Ralf Baechle <ralf@linux-mips.org> Cc: John Crispin <blogic@openwrt.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Markos Chandras <markos.chandras@imgtec.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5819/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2013-09-13	Remove GENERIC_HARDIRQ config option	Martin Schwidefsky	81	-337/+100
	After the last architecture switched to generic hard irqs the config options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code for !CONFIG_GENERIC_HARDIRQS can be removed. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-09-13	ALSA: rme9652: Remove redundant break	Sachin Kamat	1	-5/+0
	'break' after return statement is not necessary. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2013-09-13	ALSA: au88x0: Remove redundant break	Sachin Kamat	1	-19/+10
	'break' after a return statement is redundant. Remove it. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2013-09-13	ALSA: hda/ca0132: Staticize codec_send_command	Sachin Kamat	1	-1/+1
	'codec_send_command' is used only in this file. Make it static. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2013-09-13	ALSA: ctxfi: Staticize local symbols	Sachin Kamat	1	-2/+2
	Local symbols used only in this file are made static. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2013-09-13	ALSA: asihpi: a couple array out of bounds issues	Dan Carpenter	1	-2/+7
	These ->put() functions are called from snd_ctl_elem_write() with user supplied data. snd_asihpi_tuner_band_put() is missing a limit check and the check in snd_asihpi_clksrc_put() can underflow. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2013-09-13	scripts/config: fix variable substitution command	Clement Chauplannaz	1	-1/+1
	Commit 229455bc02b87f7128f190c4491b4ceffff38648 accidentally changed the separator between sed `s' command and its parameters from ':' to '/'. Revert this change. Reported-and-tested-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Clement Chauplannaz <chauplac@gmail.com> Signed-off-by: Michal Marek <mmarek@suse.cz>
2013-09-13	MIPS: Fix SMP core calculations when using MT support.	Leonid Yegoshin	1	-2/+11
	The TCBIND register is only available if the core has MT support. It should not be read otherwise. Secondly, the number of TCs (siblings) are calculated differently depending on if the kernel is configured as SMVP or SMTC. Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5822/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2013-09-13	MIPS: DECstation I/O ASIC DMA interrupt handling fix	Maciej W. Rozycki	3	-0/+11
	This change complements commit d0da7c002f7b2a93582187a9e3f73891a01d8ee4 and brings clear_ioasic_irq back, renaming it to clear_ioasic_dma_irq at the same time, to make I/O ASIC DMA interrupts functional. Unlike ordinary I/O ASIC interrupts DMA interrupts need to be deasserted by software by writing 0 to the respective bit in I/O ASIC's System Interrupt Register (SIR), similarly to how CP0.Cause.IP0 and CP0.Cause.IP1 bits are handled in the CPU (the difference is SIR DMA interrupt bits are R/W0C so there's no need for an RMW cycle). Otherwise the handler is reentered over and over again. The only current user is the DEC LANCE Ethernet driver and its extremely uncommon DMA memory error handler that does not care when exactly the interrupt is cleared. Anticipating the use of DMA interrupts by the Zilog SCC driver this change however exports clear_ioasic_dma_irq for device drivers to choose the right application-specific sequence to clear the request explicitly rather than calling it implicitly in the .irq_eoi handler of `struct irq_chip'. Previously these interrupts were cleared in the .end handler of the said structure, before it was removed. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5826/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2013-09-13	MIPS: DECstation HRT initialization rearrangement	Maciej W. Rozycki	3	-5/+27
	Not all I/O ASIC versions have the free-running counter implemented, an early revision used in the 5000/1xx models aka 3MIN and 4MIN did not have it. Therefore we cannot unconditionally use it as a clock source. Fortunately if not implemented its register slot has a fixed value so it is enough if we check for the value at the end of the calibration period being the same as at the beginning. This also means we need to look for another high-precision clock source on the systems affected. The 5000/1xx can have an R4000SC processor installed where the CP0 Count register can be used as a clock source. Unfortunately all the R4k DECstations suffer from the missed timer interrupt on CP0 Count reads erratum, so we cannot use the CP0 timer as a clock source and a clock event both at a time. However we never need an R4k clock event device because all DECstations have a DS1287A RTC chip whose periodic interrupt can be used as a clock source. This gives us the following four configuration possibilities for I/O ASIC DECstations: 1. No I/O ASIC counter and no CP0 timer, e.g. R3k 5000/1xx (3MIN). 2. No I/O ASIC counter but the CP0 timer, i.e. R4k 5000/150 (4MIN). 3. The I/O ASIC counter but no CP0 timer, e.g. R3k 5000/240 (3MAX+). 4. The I/O ASIC counter and the CP0 timer, e.g. R4k 5000/260 (4MAX+). For #1 and #2 this change stops the I/O ASIC free-running counter from being installed as a clock source of a 0Hz frequency. For #2 it also arranges for the CP0 timer to be used as a clock source rather than a clock event device, because having an accurate wall clock is more important than a high-precision interval timer. For #3 there is no change. For #4 the change makes the I/O ASIC free-running counter installed as a clock source so that the CP0 timer can be used as a clock event device. Unfortunately the use of the CP0 timer as a clock event device relies on a succesful completion of c0_compare_interrupt. That never happens, because while waiting for a CP0 Compare interrupt to happen the function spins in a loop reading the CP0 Count register. This makes the CP0 Count erratum trigger reliably causing the interrupt waited for to be lost in all cases. As a result #4 resorts to using the CP0 timer as a clock source as well, just as #2. However we want to keep this separate arrangement in case (hope) c0_compare_interrupt is eventually rewritten such that it avoids the erratum. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5825/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2013-09-13	blackfin: Ignore generated uImages	Mark Brown	1	-0/+1
	We have the build infrastructure to generate uImages so we should ignore the resulting generated files. Signed-off-by: Mark Brown <broonie@linaro.org> Acked-by: Mike Frysinger <vapier@gentoo.org>
2013-09-13	blackfin: Add STMMAC platform data to enable dwmac1000 driver on BF60x.	Sonic Zhang	2	-0/+26
	- Enable GMAC - Set propler DMA PBL - Disable DMA store and forward mode - Select PTP input clock from MII clock. Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Signed-off-by: Steven Miao <realmz6@gmail.com>
2013-09-13	bf609: adv7343: add S-Video and Component output support	Scott Jiang	1	-1/+22
	Signed-off-by: Scott Jiang <scott.jiang.linux@gmail.com>
2013-09-13	bf609: add adv7343 video encoder support	Scott Jiang	1	-0/+54
	Signed-off-by: Scott Jiang <scott.jiang.linux@gmail.com>
2013-09-13	clock: add stmmac clock for ethernet driver	Steven Miao	1	-0/+17
	Signed-off-by: Steven Miao <realmz6@gmail.com>
2013-09-13	blackfin: scb: Add SCB1 to SCB9 config options and data.	Sonic Zhang	2	-13/+782
	Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
2013-09-13	blackfin: scb: Add system crossbar init code.	Steven Miao	7	-0/+1331
	If SCB exists in select blackfin cpu, developer can change the SCB priority in kernel configuration. Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Signed-off-by: Steven Miao <realmz6@gmail.com>
2013-09-12	mm/Kconfig: add MMU dependency for MIGRATION.	Chen Gang	1	-2/+2
	MIGRATION must depend on MMU, or allmodconfig for the nommu sh architecture fails to build: CC mm/migrate.o mm/migrate.c: In function 'remove_migration_pte': mm/migrate.c:134:3: error: implicit declaration of function 'pmd_trans_huge' [-Werror=implicit-function-declaration] if (pmd_trans_huge(*pmd)) ^ mm/migrate.c:149:2: error: implicit declaration of function 'is_swap_pte' [-Werror=implicit-function-declaration] if (!is_swap_pte(pte)) ^ ... Also let CMA depend on MMU, or when NOMMU, if we select CMA, it will select MIGRATION by force. Signed-off-by: Chen Gang <gang.chen@asianux.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	kernel: replace strict_strto() with kstrto()	Jingoo Han	3	-9/+9
	The usage of strict_strto() is not preferred, because strict_strto() is obsolete. Thus, kstrto*() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	mm, thp: count thp_fault_fallback anytime thp fault fails	David Rientjes	1	-3/+7
	Currently, thp_fault_fallback in vmstat only gets incremented if a hugepage allocation fails. If current's memcg hits its limit or the page fault handler returns an error, it is incorrectly accounted as a successful thp_fault_alloc. Count thp_fault_fallback anytime the page fault handler falls back to using regular pages and only count thp_fault_alloc when a hugepage has actually been faulted. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()	Kirill A. Shutemov	4	-33/+13
	do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault() to handle fallback path. Let's consolidate code back by introducing VM_FAULT_FALLBACK return code. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	thp: do_huge_pmd_anonymous_page() cleanup	Kirill A. Shutemov	1	-42/+41
	Minor cleanup: unindent most code of the fucntion by inverting one condition. It's preparation for the next patch. No functional changes. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()	Kirill A. Shutemov	1	-6/+8
	It's confusing that mk_huge_pmd() has semantics different from mk_pte() or mk_pmd(). I spent some time on debugging issue cased by this inconsistency. Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust prototype to match mk_pte(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	mm: cleanup add_to_page_cache_locked()	Kirill A. Shutemov	1	-23/+25
	Make add_to_page_cache_locked() cleaner: - unindent most code of the function by inverting one condition; - streamline code no-error path; - move insert error path outside normal code path; - call radix_tree_preload_end() earlier; No functional changes. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	thp: account anon transparent huge pages into NR_ANON_PAGES	Kirill A. Shutemov	4	-22/+9
	We use NR_ANON_PAGES as base for reporting AnonPages to user. There's not much sense in not accounting transparent huge pages there, but add them on printing to user. Let's account transparent huge pages in NR_ANON_PAGES in the first place. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Ning Qu <quning@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	truncate: drop 'oldsize' truncate_pagecache() parameter	Kirill A. Shutemov	28	-44/+31
	truncate_pagecache() doesn't care about old size since commit cedabed49b39 ("vfs: Fix vmtruncate() regression"). Let's drop it. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	mm: make lru_add_drain_all() selective	Chris Metcalf	2	-6/+40
	make lru_add_drain_all() only selectively interrupt the cpus that have per-cpu free pages that can be drained. This is important in nohz mode where calling mlockall(), for example, otherwise will interrupt every core unnecessarily. This is important on workloads where nohz cores are handling 10 Gb traffic in userspace. Those CPUs do not enter the kernel and place pages into LRU pagevecs and they really, really don't want to be interrupted, or they drop packets on the floor. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	memcg: document cgroup dirty/writeback memory statistics	Sha Zhengju	1	-0/+2
	Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	memcg: add per cgroup writeback pages accounting	Sha Zhengju	3	-7/+39
	Add memcg routines to count writeback pages, later dirty pages will also be accounted. After Kame's commit 89c06bd52fb9 ("memcg: use new logic for page stat accounting"), we can use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg has a feature to move a page from a cgroup to another one and may have race between "move" and "page stat accounting". So in order to avoid the race we have designed a new lock: mem_cgroup_begin_update_page_stat() modify page information -->(a) mem_cgroup_update_page_stat() -->(b) mem_cgroup_end_update_page_stat() It requires both (a) and (b)(writeback pages accounting) to be pretected in mem_cgroup_{begin/end}_update_page_stat(). It's full no-op for !CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu read lock in the most cases (no task is moving), and spin_lock_irqsave on top in the slow path. There're two writeback interfaces to modify: test_{clear/set}_page_writeback(). And the lock order is: --> memcg->move_lock --> mapping->tree_lock Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	memcg: check for proper lock held in mem_cgroup_update_page_stat	Sha Zhengju	1	-0/+1
	We should call mem_cgroup_begin_update_page_stat() before mem_cgroup_update_page_stat() to get proper locks, however the latter doesn't do any checking that we use proper locking, which would be hard. Suggested by Michal Hock we could at least test for rcu_read_lock_held() because RCU is held if !mem_cgroup_disabled(). Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	memcg: remove MEMCG_NR_FILE_MAPPED	Sha Zhengju	3	-34/+22
	While accounting memcg page stat, it's not worth to use MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the complexity and presumed performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	memcg: reduce function dereference	Sha Zhengju	1	-8/+11
	This function dereferences res far too often, so optimize it. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	memcg: avoid overflow caused by PAGE_ALIGN	Sha Zhengju	1	-1/+5
	Since PAGE_ALIGN is aligning up(the next page boundary), so after PAGE_ALIGN, the value might be overflow, such as write the MAX value to *.limit_in_bytes. $ cat /cgroup/memory/memory.limit_in_bytes 18446744073709551615 # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes bash: echo: write error: Invalid argument Some user programs might depend on such behaviours(like libcg, we read the value in snapshot, then use the value to reset cgroup later), and that will cause confusion. So we need to fix it. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	memcg: rename RESOURCE_MAX to RES_COUNTER_MAX	Sha Zhengju	4	-12/+12
	RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	memcg: correct RESOURCE_MAX to ULLONG_MAX	Sha Zhengju	1	-1/+1
	Current RESOURCE_MAX is ULONG_MAX, but the value we used to set resource limit is unsigned long long, so we can set bigger value than that which is strange. The XXX_MAX should be reasonable max value, bigger than that should be overflow. Notice that this change will affect user output of default .limit_in_bytes: before change: $ cat /cgroup/memory/memory.limit_in_bytes 9223372036854775807 after change: $ cat /cgroup/memory/memory.limit_in_bytes 18446744073709551615 But it doesn't alter the API in term of input - we can still use "echo -1 > .limit_in_bytes" to reset the numbers to "unlimited". Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	mm: memcg: do not trap chargers with full callstack on OOM	Johannes Weiner	5	-49/+140
	The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	mm: memcg: rework and document OOM waiting and wakeup	Johannes Weiner	1	-37/+46
	The memcg OOM handler open-codes a sleeping lock for OOM serialization (trylock, wait, repeat) because the required locking is so specific to memcg hierarchies. However, it would be nice if this construct would be clearly recognizable and not be as obfuscated as it is right now. Clean up as follows: 1. Remove the return value of mem_cgroup_oom_unlock() 2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock(). 3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This makes it more obvious that the task has to be on the waitqueue before attempting to OOM-trylock the hierarchy, to not miss any wakeups before going to sleep. It just didn't matter until now because it was all lumped together into the global memcg_oom_lock spinlock section. 4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope. It is proctected by the hierarchical OOM-lock. 5. The memcg_oom_lock spinlock is only required to propagate the OOM lock in any given hierarchy atomically. Restrict its scope to mem_cgroup_oom_(trylock\|unlock). 6. Do not wake up the waitqueue unconditionally at the end of the function. Only the lockholder has to wake up the next in line after releasing the lock. Note that the lockholder kicks off the OOM-killer, which in turn leads to wakeups from the uncharges of the exiting task. But a contender is not guaranteed to see them if it enters the OOM path after the OOM kills but before the lockholder releases the lock. Thus there has to be an explicit wakeup after releasing the lock. 7. Put the OOM task on the waitqueue before marking the hierarchy as under OOM as that is the point where we start to receive wakeups. No point in listening before being on the waitqueue. 8. Likewise, unmark the hierarchy before finishing the sleep, for symmetry. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12	mm: memcg: enable memcg OOM killer only for user faults	Johannes Weiner	5	-12/+88
	System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. Enable the memcg OOM killer only for user faults, where it's really the only option available. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>