From 15837294d4ce717f69942f7366e99d4d1d3d9923 Mon Sep 17 00:00:00 2001 From: Xi Wang Date: Thu, 31 May 2012 16:26:04 -0700 Subject: CodingStyle: add kmalloc_array() to memory allocators Add the new kmalloc_array() to the list of general-purpose memory allocators in chapter 14. Signed-off-by: Xi Wang Acked-by: Jesper Juhl Acked-by: Pekka Enberg Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/CodingStyle | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle index c58b236bbe04..cb9258b8fd35 100644 --- a/Documentation/CodingStyle +++ b/Documentation/CodingStyle @@ -671,8 +671,9 @@ ones already enabled by DEBUG. Chapter 14: Allocating memory The kernel provides the following general purpose memory allocators: -kmalloc(), kzalloc(), kcalloc(), vmalloc(), and vzalloc(). Please refer to -the API documentation for further information about them. +kmalloc(), kzalloc(), kmalloc_array(), kcalloc(), vmalloc(), and +vzalloc(). Please refer to the API documentation for further information +about them. The preferred form for passing a size of a struct is the following: @@ -686,6 +687,17 @@ Casting the return value which is a void pointer is redundant. The conversion from void pointer to any other pointer type is guaranteed by the C programming language. +The preferred form for allocating an array is the following: + + p = kmalloc_array(n, sizeof(...), ...); + +The preferred form for allocating a zeroed array is the following: + + p = kcalloc(n, sizeof(...), ...); + +Both forms check for overflow on the allocation size n * sizeof(...), +and return NULL if that occurred. + Chapter 15: The inline disease -- cgit v1.2.3-59-g8ed1b From 052fb0d635df5d49dfc85687d94e1a87bf09378d Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Thu, 31 May 2012 16:26:19 -0700 Subject: proc: report file/anon bit in /proc/pid/pagemap This is an implementation of Andrew's proposal to extend the pagemap file bits to report what is missing about tasks' working set. The problem with the working set detection is multilateral. In the criu (checkpoint/restore) project we dump the tasks' memory into image files and to do it properly we need to detect which pages inside mappings are really in use. The mincore syscall I though could help with this did not. First, it doesn't report swapped pages, thus we cannot find out which parts of anonymous mappings to dump. Next, it does report pages from page cache as present even if they are not mapped, and it doesn't make that has not been cow-ed. Note, that issue with swap pages is critical -- we must dump swap pages to image file. But the issues with file pages are optimization -- we can take all file pages to image, this would be correct, but if we know that a page is not mapped or not cow-ed, we can remove them from dump file. The dump would still be self-consistent, though significantly smaller in size (up to 10 times smaller on real apps). Andrew noticed, that the proc pagemap file solved 2 of 3 above issues -- it reports whether a page is present or swapped and it doesn't report not mapped page cache pages. But, it doesn't distinguish cow-ed file pages from not cow-ed. I would like to make the last unused bit in this file to report whether the page mapped into respective pte is PageAnon or not. [comment stolen from Pavel Emelyanov's v1 patch] Signed-off-by: Konstantin Khlebnikov Cc: Pavel Emelyanov Cc: Matt Mackall Cc: Hugh Dickins Cc: Rik van Riel Acked-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/pagemap.txt | 2 +- fs/proc/task_mmu.c | 48 +++++++++++++++++++++++++++----------------- 2 files changed, 31 insertions(+), 19 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index 4600cbe3d6be..7587493c67f1 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -16,7 +16,7 @@ There are three components to pagemap: * Bits 0-4 swap type if swapped * Bits 5-54 swap offset if swapped * Bits 55-60 page shift (page size = 1<vm_start <= addr) && !is_vm_hugetlb_page(vma)) { pte = pte_offset_map(pmd, addr); - pte_to_pagemap_entry(&pme, *pte); + pte_to_pagemap_entry(&pme, vma, addr, *pte); /* unmap before userspace copy */ pte_unmap(pte); } @@ -869,11 +881,11 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask, * For each page in the address space, this file contains one 64-bit entry * consisting of the following: * - * Bits 0-55 page frame number (PFN) if present + * Bits 0-54 page frame number (PFN) if present * Bits 0-4 swap type if swapped - * Bits 5-55 swap offset if swapped + * Bits 5-54 swap offset if swapped * Bits 55-60 page shift (page size = 1< Date: Thu, 31 May 2012 16:26:33 -0700 Subject: mqueue: separate mqueue default value from maximum value Commit b231cca4381e ("message queues: increase range limits") changed mqueue default value when attr parameter is specified NULL from hard coded value to fs.mqueue.{msg,msgsize}_max sysctl value. This made large side effect. When user need to use two mqueue applications 1) using !NULL attr parameter and it require big message size and 2) using NULL attr parameter and only need small size message, app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large memory size even though it doesn't need. Doug Ledford propsed to switch back it to static hard coded value. However it also has a compatibility problem. Some applications might started depend on the default value is tunable. The solution is to separate default value from maximum value. Signed-off-by: KOSAKI Motohiro Signed-off-by: Doug Ledford Acked-by: Doug Ledford Acked-by: Joe Korty Cc: Amerigo Wang Acked-by: Serge E. Hallyn Cc: Jiri Slaby Cc: Manfred Spraul Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/fs.txt | 7 +++++++ include/linux/ipc_namespace.h | 2 ++ ipc/mq_sysctl.c | 18 ++++++++++++++++++ ipc/mqueue.c | 9 ++++++--- 4 files changed, 33 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt index 88fd7f5c8dcd..13d6166d7a27 100644 --- a/Documentation/sysctl/fs.txt +++ b/Documentation/sysctl/fs.txt @@ -225,6 +225,13 @@ a queue must be less or equal then msg_max. maximum message size value (it is every message queue's attribute set during its creation). +/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the +default number of messages in a queue value if attr parameter of mq_open(2) is +NULL. If it exceed msg_max, the default value is initialized msg_max. + +/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting +the default message size value if attr parameter of mq_open(2) is NULL. If it +exceed msgsize_max, the default value is initialized msgsize_max. 4. /proc/sys/fs/epoll - Configuration options for the epoll interface -------------------------------------------------------- diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h index 2488535a32a3..5499c92a9153 100644 --- a/include/linux/ipc_namespace.h +++ b/include/linux/ipc_namespace.h @@ -62,6 +62,8 @@ struct ipc_namespace { unsigned int mq_queues_max; /* initialized to DFLT_QUEUESMAX */ unsigned int mq_msg_max; /* initialized to DFLT_MSGMAX */ unsigned int mq_msgsize_max; /* initialized to DFLT_MSGSIZEMAX */ + unsigned int mq_msg_default; + unsigned int mq_msgsize_default; /* user_ns which owns the ipc ns */ struct user_namespace *user_ns; diff --git a/ipc/mq_sysctl.c b/ipc/mq_sysctl.c index e22336a09b4a..383d638340b8 100644 --- a/ipc/mq_sysctl.c +++ b/ipc/mq_sysctl.c @@ -73,6 +73,24 @@ static ctl_table mq_sysctls[] = { .extra1 = &msg_maxsize_limit_min, .extra2 = &msg_maxsize_limit_max, }, + { + .procname = "msg_default", + .data = &init_ipc_ns.mq_msg_default, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_mq_dointvec_minmax, + .extra1 = &msg_max_limit_min, + .extra2 = &msg_max_limit_max, + }, + { + .procname = "msgsize_default", + .data = &init_ipc_ns.mq_msgsize_default, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_mq_dointvec_minmax, + .extra1 = &msg_maxsize_limit_min, + .extra2 = &msg_maxsize_limit_max, + }, {} }; diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 6828e2c93cef..609d53f7a915 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -145,9 +145,10 @@ static struct inode *mqueue_get_inode(struct super_block *sb, info->qsize = 0; info->user = NULL; /* set when all is ok */ memset(&info->attr, 0, sizeof(info->attr)); - info->attr.mq_maxmsg = min(ipc_ns->mq_msg_max, DFLT_MSG); - info->attr.mq_msgsize = - min(ipc_ns->mq_msgsize_max, DFLT_MSGSIZE); + info->attr.mq_maxmsg = min(ipc_ns->mq_msg_max, + ipc_ns->mq_msg_default); + info->attr.mq_msgsize = min(ipc_ns->mq_msgsize_max, + ipc_ns->mq_msgsize_default); if (attr) { info->attr.mq_maxmsg = attr->mq_maxmsg; info->attr.mq_msgsize = attr->mq_msgsize; @@ -1261,6 +1262,8 @@ int mq_init_ns(struct ipc_namespace *ns) ns->mq_queues_max = DFLT_QUEUESMAX; ns->mq_msg_max = DFLT_MSGMAX; ns->mq_msgsize_max = DFLT_MSGSIZEMAX; + ns->mq_msg_default = DFLT_MSG; + ns->mq_msgsize_default = DFLT_MSGSIZE; ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns); if (IS_ERR(ns->mq_mnt)) { -- cgit v1.2.3-59-g8ed1b From 818411616baf46ceba0cff6f05af3a9b294734f7 Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Thu, 31 May 2012 16:26:43 -0700 Subject: fs, proc: introduce /proc//task//children entry When we do checkpoint of a task we need to know the list of children the task, has but there is no easy and fast way to generate reverse parent->children chain from arbitrary (while a parent pid is provided in "PPid" field of /proc//status). So instead of walking over all pids in the system (creating one big process tree in memory, just to figure out which children a task has) -- we add explicit /proc//task//children entry, because the kernel already has this kind of information but it is not yet exported. This is a first level children, not the whole process tree. Signed-off-by: Cyrill Gorcunov Reviewed-by: Oleg Nesterov Reviewed-by: Kees Cook Cc: Pavel Emelyanov Cc: Serge Hallyn Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 18 ++++++ fs/proc/array.c | 123 +++++++++++++++++++++++++++++++++++++ fs/proc/base.c | 3 + fs/proc/internal.h | 1 + 4 files changed, 145 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 912af6ce5626..d8d3f9a8e5a3 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -40,6 +40,7 @@ Table of Contents 3.4 /proc//coredump_filter - Core dump filtering settings 3.5 /proc//mountinfo - Information about mounts 3.6 /proc//comm & /proc//task//comm + 3.7 /proc//task//children - Information about task children 4 Configuring procfs 4.1 Mount options @@ -1578,6 +1579,23 @@ then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated comm value. +3.7 /proc//task//children - Information about task children +------------------------------------------------------------------------- +This file provides a fast way to retrieve first level children pids +of a task pointed by / pair. The format is a space separated +stream of pids. + +Note the "first level" here -- if a child has own children they will +not be listed here, one needs to read /proc//task//children +to obtain the descendants. + +Since this interface is intended to be fast and cheap it doesn't +guarantee to provide precise results and some children might be +skipped, especially if they've exited right after we printed their +pids, so one need to either stop or freeze processes being inspected +if precise results are needed. + + ------------------------------------------------------------------------------ Configuring procfs ------------------------------------------------------------------------------ diff --git a/fs/proc/array.c b/fs/proc/array.c index 3a26d23420cc..62887e39a2de 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -565,3 +565,126 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns, return 0; } + +#ifdef CONFIG_CHECKPOINT_RESTORE +static struct pid * +get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos) +{ + struct task_struct *start, *task; + struct pid *pid = NULL; + + read_lock(&tasklist_lock); + + start = pid_task(proc_pid(inode), PIDTYPE_PID); + if (!start) + goto out; + + /* + * Lets try to continue searching first, this gives + * us significant speedup on children-rich processes. + */ + if (pid_prev) { + task = pid_task(pid_prev, PIDTYPE_PID); + if (task && task->real_parent == start && + !(list_empty(&task->sibling))) { + if (list_is_last(&task->sibling, &start->children)) + goto out; + task = list_first_entry(&task->sibling, + struct task_struct, sibling); + pid = get_pid(task_pid(task)); + goto out; + } + } + + /* + * Slow search case. + * + * We might miss some children here if children + * are exited while we were not holding the lock, + * but it was never promised to be accurate that + * much. + * + * "Just suppose that the parent sleeps, but N children + * exit after we printed their tids. Now the slow paths + * skips N extra children, we miss N tasks." (c) + * + * So one need to stop or freeze the leader and all + * its children to get a precise result. + */ + list_for_each_entry(task, &start->children, sibling) { + if (pos-- == 0) { + pid = get_pid(task_pid(task)); + break; + } + } + +out: + read_unlock(&tasklist_lock); + return pid; +} + +static int children_seq_show(struct seq_file *seq, void *v) +{ + struct inode *inode = seq->private; + pid_t pid; + + pid = pid_nr_ns(v, inode->i_sb->s_fs_info); + return seq_printf(seq, "%d ", pid); +} + +static void *children_seq_start(struct seq_file *seq, loff_t *pos) +{ + return get_children_pid(seq->private, NULL, *pos); +} + +static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + struct pid *pid; + + pid = get_children_pid(seq->private, v, *pos + 1); + put_pid(v); + + ++*pos; + return pid; +} + +static void children_seq_stop(struct seq_file *seq, void *v) +{ + put_pid(v); +} + +static const struct seq_operations children_seq_ops = { + .start = children_seq_start, + .next = children_seq_next, + .stop = children_seq_stop, + .show = children_seq_show, +}; + +static int children_seq_open(struct inode *inode, struct file *file) +{ + struct seq_file *m; + int ret; + + ret = seq_open(file, &children_seq_ops); + if (ret) + return ret; + + m = file->private_data; + m->private = inode; + + return ret; +} + +int children_seq_release(struct inode *inode, struct file *file) +{ + seq_release(inode, file); + return 0; +} + +const struct file_operations proc_tid_children_operations = { + .open = children_seq_open, + .read = seq_read, + .llseek = seq_lseek, + .release = children_seq_release, +}; +#endif /* CONFIG_CHECKPOINT_RESTORE */ diff --git a/fs/proc/base.c b/fs/proc/base.c index bd8b4ca6a610..616f41a7cde6 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3400,6 +3400,9 @@ static const struct pid_entry tid_base_stuff[] = { ONE("stat", S_IRUGO, proc_tid_stat), ONE("statm", S_IRUGO, proc_pid_statm), REG("maps", S_IRUGO, proc_tid_maps_operations), +#ifdef CONFIG_CHECKPOINT_RESTORE + REG("children", S_IRUGO, proc_tid_children_operations), +#endif #ifdef CONFIG_NUMA REG("numa_maps", S_IRUGO, proc_tid_numa_maps_operations), #endif diff --git a/fs/proc/internal.h b/fs/proc/internal.h index a30643784db5..eca4aca5b6e2 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -54,6 +54,7 @@ extern int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task); extern loff_t mem_lseek(struct file *file, loff_t offset, int orig); +extern const struct file_operations proc_tid_children_operations; extern const struct file_operations proc_pid_maps_operations; extern const struct file_operations proc_tid_maps_operations; extern const struct file_operations proc_pid_numa_maps_operations; -- cgit v1.2.3-59-g8ed1b From 5b172087f99189416d5f47fd7ab5e6fb762a9ba3 Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Thu, 31 May 2012 16:26:44 -0700 Subject: c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat We would like to have an ability to restore command line arguments and program environment pointers but first we need to obtain them somehow. Thus we put these values into /proc/$pid/stat. The exit_code is needed to restore zombie tasks. Signed-off-by: Cyrill Gorcunov Acked-by: Kees Cook Cc: Pavel Emelyanov Cc: Serge Hallyn Cc: KAMEZAWA Hiroyuki Cc: Alexey Dobriyan Cc: Tejun Heo Cc: Andrew Vagin Cc: Vasiliy Kulikov Cc: Alexey Dobriyan Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 5 +++++ fs/proc/array.c | 20 +++++++++++++++++--- 2 files changed, 22 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index d8d3f9a8e5a3..fb0a6aeb936c 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -311,6 +311,11 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7) start_data address above which program data+bss is placed end_data address below which program data+bss is placed start_brk address above which program heap can be expanded with brk() + arg_start address above which program command line is placed + arg_end address below which program command line is placed + env_start address above which program environment is placed + env_end address below which program environment is placed + exit_code the thread's exit_code in the form reported by the waitpid system call .............................................................................. The /proc/PID/maps file containing the currently mapped memory regions and diff --git a/fs/proc/array.c b/fs/proc/array.c index 62887e39a2de..c1c207c36cae 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -517,9 +517,23 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns, seq_put_decimal_ull(m, ' ', delayacct_blkio_ticks(task)); seq_put_decimal_ull(m, ' ', cputime_to_clock_t(gtime)); seq_put_decimal_ll(m, ' ', cputime_to_clock_t(cgtime)); - seq_put_decimal_ull(m, ' ', (mm && permitted) ? mm->start_data : 0); - seq_put_decimal_ull(m, ' ', (mm && permitted) ? mm->end_data : 0); - seq_put_decimal_ull(m, ' ', (mm && permitted) ? mm->start_brk : 0); + + if (mm && permitted) { + seq_put_decimal_ull(m, ' ', mm->start_data); + seq_put_decimal_ull(m, ' ', mm->end_data); + seq_put_decimal_ull(m, ' ', mm->start_brk); + seq_put_decimal_ull(m, ' ', mm->arg_start); + seq_put_decimal_ull(m, ' ', mm->arg_end); + seq_put_decimal_ull(m, ' ', mm->env_start); + seq_put_decimal_ull(m, ' ', mm->env_end); + } else + seq_printf(m, " 0 0 0 0 0 0 0"); + + if (permitted) + seq_put_decimal_ll(m, ' ', task->exit_code); + else + seq_put_decimal_ll(m, ' ', 0); + seq_putc(m, '\n'); if (mm) mmput(mm); -- cgit v1.2.3-59-g8ed1b