linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2010-08-09	tmpfs: make tmpfs scalable with percpu_counter for used blocks	Tim Chen	1	-23/+17
	The current implementation of tmpfs is not scalable. We found that stat_lock is contended by multiple threads when we need to get a new page, leading to useless spinning inside this spin lock. This patch makes use of the percpu_counter library to maintain local count of used blocks to speed up getting and returning of pages. So the acquisition of stat_lock is unnecessary for getting and returning blocks, improving the performance of tmpfs on system with large number of cpus. On a 4 socket 32 core NHM-EX system, we saw improvement of 270%. The implementation below has a slight chance of race between threads causing a slight overshoot of the maximum configured blocks. However, any overshoot is small, and is bounded by the number of cpus. This happens when the number of used blocks is slightly below the maximum configured blocks when a thread checks the used block count, and another thread allocates the last block before the current thread does. This should not be a problem for tmpfs, as the overshoot is most likely to be a few blocks and bounded. If a strict limit is really desired, then configured the max blocks to be the limit less the number of cpus in system. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	gcc-4.6: mm: fix unused but set warnings	Andi Kleen	3	-5/+1
	No real bugs, just some dead code and some fixups. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	mempolicy: reduce stack size of migrate_pages()	KOSAKI Motohiro	1	-13/+25
	migrate_pages() is using >500 bytes stack. Reduce it. mm/mempolicy.c: In function 'sys_migrate_pages': mm/mempolicy.c:1344: warning: the frame size of 528 bytes is larger than 512 bytes [akpm@linux-foundation.org: don't play with a might-be-NULL pointer] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	mm: use for_each_online_cpu() in vmstat	Minchan Kim	1	-3/+3
	The sum_vm_events passes cpumask for for_each_cpu(). But it's useless since we have for_each_online_cpu. Althougth it's tirival overhead, it's not good about coding consistency. Let's use for_each_online_cpu instead of for_each_cpu with cpumask argument. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: fold __out_of_memory into out_of_memory	David Rientjes	1	-36/+29
	__out_of_memory() only has a single caller, so fold it into out_of_memory() and add a comment about locking for its call to oom_kill_process(). Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: remove constraint argument from select_bad_process and __out_of_memory	David Rientjes	1	-10/+8
	select_bad_process() and __out_of_memory() doe not need their enum oom_constraint arguments: it's possible to pass a NULL nodemask if constraint == CONSTRAINT_MEMORY_POLICY in the caller, out_of_memory(). Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	mm: rename try_set_zone_oom() to try_set_zonelist_oom()	Minchan Kim	2	-3/+3
	We have been used naming try_set_zone_oom and clear_zonelist_oom. The role of functions is to lock of zonelist for preventing parallel OOM. So clear_zonelist_oom makes sense but try_set_zone_oome is rather awkward and unmatched with clear_zonelist_oom. Let's change it with try_set_zonelist_oom. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: remove unnecessary code and cleanup	David Rientjes	1	-46/+10
	Remove the redundancy in __oom_kill_task() since: - init can never be passed to this function: it will never be PF_EXITING or selectable from select_bad_process(), and - it will never be passed a task from oom_kill_task() without an ->mm and we're unconcerned about detachment from exiting tasks, there's no reason to protect them against SIGKILL or access to memory reserves. Also moves the kernel log message to a higher level since the verbosity is not always emitted here; we need not print an error message if an exiting task is given a longer timeslice. __oom_kill_task() only has a single caller, so it can be merged into that function at the same time. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: remove special handling for pagefault ooms	David Rientjes	1	-29/+57
	It is possible to remove the special pagefault oom handler by simply oom locking all system zones and then calling directly into out_of_memory(). All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a parallel oom killing in progress that will lead to eventual memory freeing so it's not necessary to needlessly kill another task. The context in which the pagefault is allocating memory is unknown to the oom killer, so this is done on a system-wide level. If a task has already been oom killed and hasn't fully exited yet, this will be a no-op since select_bad_process() recognizes tasks across the system with TIF_MEMDIE set. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: extract panic helper function	David Rientjes	1	-24/+29
	There are various points in the oom killer where the kernel must determine whether to panic or not. It's better to extract this to a helper function to remove all the confusion as to its semantics. Also fix a call to dump_header() where tasklist_lock is not read- locked, as required. There's no functional change with this patch. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: avoid oom killer for lowmem allocations	David Rientjes	1	-9/+20
	If memory has been depleted in lowmem zones even with the protection afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that killing current users will help. The memory is either reclaimable (or migratable) already, in which case we should not invoke the oom killer at all, or it is pinned by an application for I/O. Killing such an application may leave the hardware in an unspecified state and there is no guarantee that it will be able to make a timely exit. Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is not used so that the task can perhaps recover or try again later. Previously, the heuristic provided some protection for those tasks with CAP_SYS_RAWIO, but this is no longer necessary since we will not be killing tasks for the purposes of ISA allocations. high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the default for all allocations that are not __GFP_DMA, __GFP_DMA32, __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those flags. Testing for high_zoneidx being less than ZONE_NORMAL will only return true for allocations that have either __GFP_DMA or __GFP_DMA32. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: enable oom tasklist dump by default	David Rientjes	1	-1/+1
	The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is very helpful information in diagnosing why a user's task has been killed. It emits useful information such as each eligible thread's memory usage that can determine why the system is oom, so it should be enabled by default. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: select task from tasklist for mempolicy ooms	David Rientjes	2	-36/+112
	The oom killer presently kills current whenever there is no more memory free or reclaimable on its mempolicy's nodes. There is no guarantee that current is a memory-hogging task or that killing it will free any substantial amount of memory, however. In such situations, it is better to scan the tasklist for nodes that are allowed to allocate on current's set of nodes and kill the task with the highest badness() score. This ensures that the most memory-hogging task, or the one configured by the user with /proc/pid/oom_adj, is always selected in such scenarios. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: sacrifice child with highest badness score for parent	David Rientjes	1	-11/+29
	When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: filter tasks not sharing the same cpuset	David Rientjes	1	-8/+2
	Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill. Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed. Killing tasks outside of current's cpuset rarely would free memory for current anyway. To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown. It is better to kill current than another task needlessly. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: avoid sending exiting tasks a SIGKILL	David Rientjes	1	-1/+1
	It's unnecessary to SIGKILL a task that is already PF_EXITING and can actually cause a NULL pointer dereference of the sighand if it has already been detached. Instead, simply set TIF_MEMDIE so it has access to memory reserves and can quickly exit as the comment implies. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: give current access to memory reserves if it has been killed	David Rientjes	1	-0/+10
	It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocators livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: dump_tasks use find_lock_task_mm too fix	David Rientjes	1	-2/+2
	When find_lock_task_mm() returns a thread other than p in dump_tasks(), its name should be displayed instead. This is the thread that will be targeted by the oom killer, not its mm-less parent. This also allows us to safely dereference task->comm without needing get_task_comm(). While we're here, remove the cast on task_cpu(task) as Andrew suggested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: improve commentary in dump_tasks()	David Rientjes	1	-8/+3
	The comments in dump_tasks() should be updated to be more clear about why tasks are filtered and how they are filtered by its argument. An unnecessary comment concerning a check for is_global_init() is removed since it isn't of importance. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: dump_tasks use find_lock_task_mm too	KOSAKI Motohiro	1	-18/+21
	dump_task() should use find_lock_task_mm() too. It is necessary for protecting task-exiting race. dump_tasks() currently filters any task that does not have an attached ->mm since it incorrectly assumes that it must either be in the process of exiting and has detached its memory or that it's a kernel thread; multithreaded tasks may actually have subthreads that have a valid ->mm pointer and thus those threads should actually be displayed. This change finds those threads, if they exist, and emit their information along with the rest of the candidate tasks for kill. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: introduce find_lock_task_mm() to fix !mm false positives	Oleg Nesterov	1	-31/+43
	Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the "if (!p->mm)" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: PF_EXITING check should take mm into account	Oleg Nesterov	1	-1/+1
	select_bad_process() checks PF_EXITING to detect the task which is going to release its memory, but the logic is very wrong. - a single process P with the dead group leader disables select_bad_process() completely, it will always return ERR_PTR() while P can live forever - if the PF_EXITING task has already released its ->mm it doesn't make sense to expect it is goiing to free more memory (except task_struct/etc) Change the code to ignore the PF_EXITING tasks without ->mm. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	oom: check PF_KTHREAD instead of !mm to skip kthreads	Oleg Nesterov	1	-6/+3
	select_bad_process() thinks a kernel thread can't have ->mm != NULL, this is not true due to use_mm(). Change the code to check PF_KTHREAD. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	mm: extend KSM refcounts to the anon_vma root	Rik van Riel	3	-19/+54
	KSM reference counts can cause an anon_vma to exist after the processe it belongs to have already exited. Because the anon_vma lock now lives in the root anon_vma, we need to ensure that the root anon_vma stays around until after all the "child" anon_vmas have been freed. The obvious way to do this is to have a "child" anon_vma take a reference to the root in anon_vma_fork. When the anon_vma is freed at munmap or process exit, we drop the refcount in anon_vma_unlink and possibly free the root anon_vma. The KSM anon_vma reference count function also needs to be modified to deal with the possibility of freeing 2 levels of anon_vma. The easiest way to do this is to break out the KSM magic and make it generic. When compiling without CONFIG_KSM, this code is compiled out. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Tested-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	mm: always lock the root (oldest) anon_vma	Rik van Riel	3	-10/+24
	Always (and only) lock the root (oldest) anon_vma whenever we do something in an anon_vma. The recently introduced anon_vma scalability is due to the rmap code scanning only the VMAs that need to be scanned. Many common operations still took the anon_vma lock on the root anon_vma, so always taking that lock is not expected to introduce any scalability issues. However, always taking the same lock does mean we only need to take one lock, which means rmap_walk on pages from any anon_vma in the vma is excluded from occurring during an munmap, expand_stack or other operation that needs to exclude rmap_walk and similar functions. Also add the proper locking to vma_adjust. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	mm: track the root (oldest) anon_vma	Rik van Riel	1	-2/+16
	Track the root (oldest) anon_vma in each anon_vma tree. Because we only take the lock on the root anon_vma, we cannot use the lock on higher-up anon_vmas to lock anything. This makes it impossible to do an indirect lookup of the root anon_vma, since the data structures could go away from under us. However, a direct pointer is safe because the root anon_vma is always the last one that gets freed on munmap or exit, by virtue of the same_vma list order and unlink_anon_vmas walking the list forward. [akpm@linux-foundation.org: fix typo] Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	mm: change direct call of spin_lock(anon_vma->lock) to inline function	Rik van Riel	4	-21/+21
	Subsitute a direct call of spin_lock(anon_vma->lock) with an inline function doing exactly the same. This makes it easier to do the substitution to the root anon_vma lock in a following patch. We will deal with the handful of special locks (nested, dec_and_lock, etc) separately. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	mm: rename anon_vma_lock to vma_lock_anon_vma	Rik van Riel	1	-7/+7
	Rename anon_vma_lock to vma_lock_anon_vma. This matches the naming style used in page_lock_anon_vma and will come in really handy further down in this patch series. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	hugetlb: call mmu notifiers on hugepage cow	Doug Doan	1	-0/+6
	When a copy-on-write occurs, we take one of two paths in handle_mm_fault: through handle_pte_fault for normal pages, or through hugetlb_fault for huge pages. In the normal page case, we eventually get to do_wp_page and call mmu notifiers via ptep_clear_flush_notify. There is no callout to the mmmu notifiers in the huge page case. This patch fixes that. Signed-off-by: Doug Doan <dougd@cray.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	mm: provide init_mm mm_context initializer	Heiko Carstens	1	-0/+6
	Provide an INIT_MM_CONTEXT intializer macro which can be used to statically initialize mm_struct:mm_context of init_mm. This way we can get rid of code which will do the initialization at run time (on s390). In addition the current code can be found at a place where it is not expected. So let's have a common initializer which architectures can use if needed. This is based on a patch from Suzuki Poulose. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Suzuki Poulose <suzuki@in.ibm.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	mm: use ERR_CAST	Julia Lawall	1	-1/+1
	Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more clear what is the purpose of the operation, which otherwise looks like a no-op. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ type T; T x; identifier f; @@ T f (...) { <+... - ERR_PTR(PTR_ERR(x)) + x ...+> } @@ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	mm: use memdup_user	Julia Lawall	1	-8/+3
	Use memdup_user when user data is immediately copied into the allocated region. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression from,to,size,flag; position p; identifier l1,l2; @@ - to = \(kmalloc@p\\|kzalloc@p\)(size,flag); + to = memdup_user(from,size); if ( - to==NULL + IS_ERR(to) \|\| ...) { <+... when != goto l1; - -ENOMEM + PTR_ERR(to) ...+> } - if (copy_from_user(to, from, size) != 0) { - <+... when != goto l2; - -EFAULT - ...+> - } // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09	switch shmem.c to ->evice_inode()	Al Viro	1	-4/+4
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09	check ATTR_SIZE contraints in inode_change_ok	Christoph Hellwig	2	-12/+31
	Make sure we check the truncate constraints early on in ->setattr by adding those checks to inode_change_ok. Also clean up and document inode_change_ok to make this obvious. As a fallout we don't have to call inode_newsize_ok from simple_setsize and simplify it down to a truncate_setsize which doesn't return an error. This simplifies a lot of setattr implementations and means we use truncate_setsize almost everywhere. Get rid of fat_setsize now that it's trivial and mark ext2_setsize static to make the calling convention obvious. Keep the inode_newsize_ok in vmtruncate for now as all callers need an audit for its removal anyway. Note: setattr code in ecryptfs doesn't call inode_change_ok at all and needs a deeper audit, but that is left for later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09	always call inode_change_ok early in ->setattr	Christoph Hellwig	1	-4/+6
	Make sure we call inode_change_ok before doing any changes in ->setattr, and make sure to call it even if our fs wants to ignore normal UNIX permissions, but use the ATTR_FORCE to skip those. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09	rename generic_setattr	Christoph Hellwig	1	-1/+1
	Despite its name it's now a generic implementation of ->setattr, but rather a helper to copy attributes from a struct iattr to the inode. Rename it to setattr_copy to reflect this fact. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09	memblock: Fix memblock_is_region_reserved() to return a boolean	Benjamin Herrenschmidt	1	-1/+1
	All callers expect a boolean result which is true if the region overlaps a reserved region. However, the implementation actually returns -1 if there is no overlap, and a region index (0 based) if there is. Make it behave as callers (and common sense) expect. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-08-08	kmemleak: Fix typo in the comment	Holger Hans Peter Freyther	1	-1/+1
	Fix typo in comment. Signed-off-by: Holger Hans Peter Freyther <zecke@selfish.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2010-08-07	writeback: fix bad _bh spinlock nesting	Jens Axboe	1	-2/+3
	Fix a bug where a lock is _bh nested within another _bh lock, but forgets to use the _bh variant for unlock. Further more, it's not necessary to test _bh locks, the inner lock can just use spin_lock(). So fix up the bug by making that change. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07	writeback: cleanup bdi_register	Artem Bityutskiy	1	-19/+11
	This patch makes sure we first initialize everything and set the BDI_registered flag, and only after this we add the bdi to 'bdi_list'. Current code adds the bdi to the list too early, and as a result I the WARN(!test_bit(BDI_registered, &bdi->state) in bdi forker is triggered. Also, it is in general good practice to make things visible only when they are fully initialized. Also, this patch does few micro clean-ups: 1. Removes the 'exit' label which does not do anything, just returns. This allows to get rid of few braces and 'ret' variable and make the code smaller. 2. If 'kthread_run()' fails, remove the error code it returns, not hard-coded '-ENOMEM'. Theoretically, some day 'kthread_run()' can return something else. Also, in case of failure it is not necessary to set 'bdi->wb.task' to NULL. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07	writeback: add new tracepoints	Artem Bityutskiy	1	-0/+2
	Add 2 new trace points to the periodic write-back wake up case, just like we do in the 'bdi_queue_work()' function. Namely, introduce: 1. trace_writeback_wake_thread(bdi) 2. trace_writeback_wake_forker_thread(bdi) The first event is triggered every time we wake up a bdi thread to start periodic background write-out. The second event is triggered only when the bdi thread does not exist and should be created by the forker thread. This patch was suggested by Dave Chinner and Christoph Hellwig. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07	writeback: remove unnecessary init_timer call	Artem Bityutskiy	1	-1/+0
	The 'setup_timer()' function also calls 'init_timer()', so the extra 'init_timer()' call is not needed. Indeed, 'setup_timer()' is basically 'init_timer()' plus callback function and data pointers initialization. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07	writeback: optimize periodic bdi thread wakeups	Artem Bityutskiy	1	-16/+57
	Whe the first inode for a bdi is marked dirty, we wake up the bdi thread which should take care of the periodic background write-out. However, the write-out will actually start only 'dirty_writeback_interval' centisecs later, so we can delay the wake-up. This change was requested by Nick Piggin who pointed out that if we delay the wake-up, we weed out 2 unnecessary contex switches, which matters because '__mark_inode_dirty()' is a hot-path function. This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which sets up a timer to wake-up the bdi thread and returns. So the wake-up is delayed. We also delete the timer in bdi threads just before writing-back. And synchronously delete it when unregistering bdi. At the unregister point the bdi does not have any users, so no one can arm it again. Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes this change as well. This patch also moves the 'bdi_wb_init()' function down in the file to avoid forward-declaration of 'bdi_wakeup_thread_delayed()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07	writeback: prevent unnecessary bdi threads wakeups	Artem Bityutskiy	1	-3/+10
	Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very bad for battery-driven devices. There are two types of activities bdi threads do: 1. process bdi works from the 'bdi->work_list' 2. periodic write-back So there are 2 sources of wake-up events for bdi threads: 1. 'bdi_queue_work()' - submits bdi works 2. '__mark_inode_dirty()' - adds dirty I/O to bdi's The former already has bdi wake-up code. The latter does not, and this patch adds it. '__mark_inode_dirty()' is hot-path function, but this patch adds another 'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when the bdi has no dirty inodes. So adding this spinlock should be fine and should not affect performance. This patch makes sure bdi threads and the forker thread do not wake-up if there is nothing to do. The forker thread will nevertheless wake up at least every 5 min. to check whether it has to kill a bdi thread. This can also be optimized, but is not worth it. This patch also tidies up the warning about unregistered bid, and turns it from an ugly crocodile to a simple 'WARN()' statement. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07	writeback: move bdi threads exiting logic to the forker thread	Artem Bityutskiy	1	-11/+58
	Currently, bdi threads can decide to exit if there were no useful activities for 5 minutes. However, this causes nasty races: we can easily oops in the 'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up. And even if we do not oops, but the bdi tread exits immediately after we wake it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs) in the bdi work processing. This patch makes the forker thread to be the central place which not only creates bdi threads, but also kills them if they were inactive long enough. This better design-wise. Another reason why this change was done is to prepare for the further changes which will prevent the bdi threads from waking up every 5 sec and wasting power. Indeed, when the task does not wake up periodically anymore, it won't be able to exit either. This patch also moves the the 'wake_up_bit()' call from the bdi thread to the forker thread as well. So now the forker thread sets the BDI_pending bit, then forks the task or kills it, then clears the bit and wakes up the waiting process. The only process which may wain on the bit is 'bdi_wb_shutdown()'. This function was changed as well - now it first removes the bdi from the 'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is guaranteed that the forker thread won't race with it, because the bdi is not visible. Note, the forker thread sets the 'BDI_pending' bit under the 'bdi->wb_lock' which is essential for proper serialization. And additionally, when we change 'bdi->wb.task', we now take the 'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise would when raced with, say, 'bdi_queue_work()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07	writeback: restructure bdi forker loop a little	Artem Bityutskiy	1	-30/+39
	This patch re-structures the bdi forker a little: 1. Add 'bdi_cap_flush_forker(bdi)' condition check to the bdi loop. The reason for this is that the forker thread can start _before_ the 'BDI_registered' flag is set (see 'bdi_register()'), so the WARN() statement will fire for the default bdi. I observed this warning at boot-up. 2. Introduce an enum 'action' and use "switch" statement in the outer loop. This is a preparation to the further patch which will teach the forker thread killing bdi threads, so we'll have another case in the "switch" statement. This change was suggested by Christoph Hellwig. This patch is just a small step towards the coming change where the forker thread will kill the bdi threads. It should simplify reviewing the following changes, which would otherwise be larger. This patch also amends comments a little. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07	writeback: do not remove bdi from bdi_list	Artem Bityutskiy	1	-21/+10
	The forker thread removes bdis from 'bdi_list' before forking the bdi thread. But this is wrong for at least 2 reasons. Reason #1: if we temporary remove a bdi from the list, we may miss works which would otherwise be given to us. Reason #2: this is racy; indeed, 'bdi_wb_shutdown()' expects that bdis are always in the 'bdi_list' (see 'bdi_remove_from_list()'), and when it races with the forker thread, it can shut down the bdi thread at the same time as the forker creates it. This patch makes sure the forker thread never removes bdis from 'bdi_list' (which was suggested by Christoph Hellwig). In order to make sure that we do not race with 'bdi_wb_shutdown()', we have to hold the 'bdi_lock' while walking the 'bdi_list' and setting the 'BDI_pending' flag. NOTE! The error path is interesting. Currently, when we fail to create a bdi thread, we move the bdi to the tail of 'bdi_list'. But if we never remove the bdi from the list, we cannot move it to the tail either, because then we can mess up the RCU readers which walk the list. And also, we'll have the race described above in "Reason #2". But I not think that adding to the tail is any important so I just do not do that. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07	writeback: simplify bdi code a little	Artem Bityutskiy	1	-64/+18
	This patch simplifies bdi code a little by removing the 'pending_list' which is redundant. Indeed, currently the forker thread ('bdi_forker_thread()') is working like this: 1. In a loop, fetch all bdi's which have works but have no writeback thread and move them to the 'pending_list'. 2. If the list is empty, sleep for 5 sec. 3. Otherwise, take one bdi from the list, fork the writeback thread for this bdi, and repeat the loop. IOW, it first moves everything to the 'pending_list', then process only one element, and so on. This patch simplifies the algorithm, which is now as follows. 1. Find the first bdi which has a work and remove it from the global list of bdi's (bdi_list). 2. If there was not such bdi, sleep 5 sec. 3. Fork the writeback thread for this bdi and repeat the loop. IOW, now we find the first bdi to process, process it, and so on. This is simpler and involves less lists. The bonus now is that we can get rid of a couple of functions, as well as remove complications which involve 'rcu_call()' and 'bdi->rcu_head'. This patch also makes sure we use 'list_add_tail_rcu()', instead of plain 'list_add_tail()', but this piece of code is going to be removed in the next patch anyway. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07	writeback: do not lose wake-ups in the forker thread - 2	Artem Bityutskiy	1	-0/+4
	Currently, if someone submits jobs for the default bdi, we can lose wake-up events. E.g., this can happen if 'bdi_queue_work()' is called when 'bdi_forker_thread()' is executing code after 'wb_do_writeback(me, 0)', but before 'set_current_state(TASK_INTERRUPTIBLE)'. This situation is unlikely, and the result is not very severe - we'll just delay the execution of the work, but this is still not very nice. This patch fixes the issue by checking whether the default bdi has works before the forker thread goes sleep. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07	writeback: do not lose wake-ups in the forker thread - 1	Artem Bityutskiy	1	-2/+1
	Currently the forker thread can lose wake-ups which may lead to unnecessary delays in processing bdi works. E.g., consider the following scenario. 1. 'bdi_forker_thread()' walks the 'bdi_list', finds out there is nothing to do, and is about to finish the loop. 2. A bdi thread decides to exit because it was inactive for long time. 3. 'bdi_queue_work()' adds a work to the bdi which just exited, so it wakes up the forker thread. 4. but 'bdi_forker_thread()' executes 'set_current_state(TASK_INTERRUPTIBLE)' and goes sleep. We lose a wake-up. Losing the wake-up is not fatal, but this means that the bdi work processing will be delayed by up to 5 sec. This race is theoretical, I never hit it, but it is worth fixing. The fix is to execute 'set_current_state(TASK_INTERRUPTIBLE)' _before_ walking 'bdi_list', not after. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>